Category Archives: Database

Fast Data

During the 1980s and 1990s, online transaction processing (OLTP) was critical for banks, airlines, and telcos for core business functions. This was a big step up from the batch systems of the early days. We learned the importance of sub-second response time and continuous availability, with the goal of five-nines (99.999%) uptime, which allows only about five minutes of outage per year. During my days at IBM, we had to face the fire from a bank in Japan that had an hour-long outage, resulting in a long queue in front of the ATMs (unlike here, the Japanese stood very patiently until the system came back after what felt like an eternity). They were using IBM’s IMS Fast Path software, and the blame was first put on that software; the cause subsequently turned out to be an unrelated issue.

Advance the clock to today. Everything is real-time, and one cannot talk about real-time without discussing the need for “fast data” – data that has to travel very fast to support real-time decision making. Here are some reasons fast data matters:

  • These days, it is important for businesses to be able to quickly sense and respond to events that are affecting their markets, customers, employees, facilities, or internal operations. Fast data enables decision makers and administrators to monitor, track, and address events as they occur.
  • Leverage the Internet of Things – for example, an engine manufacturer will embed sensors within its products, which then will provide continuous feeds back to the manufacturer to help spot issues and better understand usage patterns.
  • An important advantage that fast data offers is enhanced operational efficiency, since events that could negatively affect processes—such as inventory shortages or production bottlenecks—can not only be detected and reported, but remedial action can be immediately prescribed or even launched. Real-time analytics can be measured against established patterns to predict problems, and systems can respond with appropriate alerts or automated fixes.
  • Assure greater business continuity – Fast data plays a role in bringing systems—and all data still in the pipeline—back up and running quickly, before the business suffers from a catastrophic event.
  • Fast data is critical for supporting Artificial Intelligence and machine learning. As a matter of fact, data is the fuel for machine learning (recommendation engines, fraud detection systems, bidding systems, automatic decision making systems, chatbots, and many more).
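The monitoring-and-alerting idea in the bullets above can be sketched in a few lines. This is a minimal, pure-Python illustration (not any particular vendor's product): it flags any event that deviates sharply from a rolling window of recent values, the kind of check a fast-data pipeline runs continuously.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=5, threshold=3.0):
    """Flag readings that deviate sharply from the recent rolling window."""
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Alert when the new value is far outside the recent distribution
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                alerts.append((i, value))
        recent.append(value)
    return alerts

# A steady sensor feed with one spike at position 5
feed = [10.0, 10.2, 9.9, 10.1, 10.0, 55.0, 10.1]
print(detect_anomalies(feed))  # [(5, 55.0)]
```

A production pipeline would run this logic incrementally over a stream rather than a list, but the shape of the check is the same.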

Now let us look at the constellation of technologies enabling fast data management and analytics. Fast data is data that moves almost instantaneously from source to processing to analysis to action, courtesy of frameworks and pipelines such as Apache Spark, Apache Storm, Apache Kafka, Apache Kudu, Apache Cassandra, and in-memory data grids. Here is a brief outline of each.

Apache Spark – an open source toolset now supported by most major database vendors. It offers streaming and SQL libraries to deliver real-time data processing. Spark Streaming processes data as it is created, enabling analysis for critical areas like real-time analytics and fraud detection. Its Structured Streaming API opens up this capability to enterprises of all sizes.
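To make the windowed-aggregation idea concrete, here is a pure-Python stand-in for what a streaming engine's windowed aggregation does on each micro-batch (an illustration of the concept, not the actual PySpark API): timestamped events are grouped into fixed, non-overlapping windows and counted per key.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Group (timestamp, key) events into fixed, non-overlapping windows
    and count occurrences per key within each window."""
    counts = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Every timestamp maps to the start of its window
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return {w: dict(kv) for w, kv in sorted(counts.items())}

events = [(1, "login"), (3, "click"), (4, "login"), (12, "click"), (15, "click")]
print(tumbling_window_counts(events))
# {0: {'login': 2, 'click': 1}, 10: {'click': 2}}
```

In Structured Streaming the same grouping is expressed declaratively with a window function over event time, and the engine handles late data and incremental updates.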

Apache Storm is an open source distributed real-time computation system designed to enable processing of data streams.

Apache Cassandra is an open source low-latency data replication engine.

Apache Kafka is an open source toolset designed for real-time data streaming – employed for data pipelines and streaming apps. The Kafka Connect API helps connect it to other environments. It originated at LinkedIn.
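Kafka's core abstraction is an append-only log that producers write to and consumer groups read from at their own pace. The toy class below sketches that model in plain Python (a conceptual illustration, not the real Kafka client API):

```python
class MiniLog:
    """Toy append-only log with per-group offsets, mimicking the core of
    Kafka's model: producers append records, consumer groups each track
    their own read position independently."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the appended record

    def consume(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_records]
        self.offsets[group] = start + len(batch)  # commit the new offset
        return batch

log = MiniLog()
for msg in ["order:1", "order:2", "order:3"]:
    log.produce(msg)

print(log.consume("billing", max_records=2))  # ['order:1', 'order:2']
print(log.consume("billing"))                 # ['order:3']
print(log.consume("analytics"))               # independent group reads from 0
```

Because each group keeps its own offset, the same stream can feed a billing service and an analytics service without either interfering with the other, which is exactly why Kafka works so well for fan-out pipelines.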

Apache Kudu is an open source storage engine to support real-time analytics on commodity hardware.

In addition to these powerful open source tools and frameworks, there are in-memory data grids: hardware-accelerated fast data platforms that deliver the blazing speeds needed for IoT management, deployment of AI and machine learning, and responding to events in real time.

Yes, we have come a long way from those OLTP days! Fast data management and analytics is becoming a key area for businesses to survive and grow.


Netflix Technology

I attended a meetup at Netflix last evening titled “Polyglot Persistence at Netflix”. The Cloud Database Engineering (CDE) team presented various aspects of building and maintaining a highly distributed system to meet ever-growing customer needs. There are almost 160 million users, and with the growing popularity of streaming movies and TV shows (many now produced by Netflix itself), the demand on its systems is growing rapidly. Polyglot persistence implies the coexistence of many databases and associated software systems.

The Netflix cloud platform forms a layer of services, tools, frameworks and technologies that run on top of AWS EC2 to implement an efficient, nimble (fast-reacting), highly available, globally distributed, scalable and performant solution. Netflix switched over to the AWS cloud over a seven-year period starting in 2009. It uses Amazon’s RDS and DynamoDB, besides S3 for lower-cost storage. The front end is Node.js, while the back end uses Java, Python, and JavaScript. The team also described how they use SSDs (solid-state drives) in addition to memory cache. The main thrust of the evening’s talk was their use of Cassandra as the distributed database solution.

Apache Cassandra was originally developed at Facebook as a free, open source, highly scalable, high-performance distributed database designed to handle large amounts of data across many servers with no single point of failure. Netflix’s global network of storage servers keeps data close to where it will be viewed. This local placement reduces bandwidth costs, reduces latency, and makes it easier to scale the service over a wide area, in this case globally. Here are the key reasons Netflix is a major user of Cassandra (besides others like eBay, Apple, Comcast, Instagram and Reddit):

  • Very large production deployment – 2,500 nodes, 420 TB, over one trillion user requests per day. Cassandra is a NoSQL, distributed, wide-column database that scales horizontally and dynamically as more servers are added, without the need to re-shard or reboot.
  • Strong write performance with no network performance bottleneck.
  • Its data model is highly flexible. A sparse, two-dimensional “super column family” architecture allows for rich data model representation (and better performance) beyond simple key-value lookup.
  • Its geographic capabilities – a single global cluster can simultaneously replicate data asynchronously and serve applications across multiple locations. The team last evening showed how users can seamlessly switch over to another data center if a failure occurs. Cassandra has been a good choice for cross-data-center and cross-regional deployment, as customizable replication helps determine which cluster nodes to designate as replicas.
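Cassandra decides which nodes hold replicas of a key by hashing the key onto a ring of nodes and walking clockwise until it has collected a replication factor's worth of distinct nodes. The sketch below illustrates that idea in plain Python (the node names and the use of MD5 are illustrative; Cassandra's default partitioner is actually Murmur3-based):

```python
import hashlib
from bisect import bisect_right

def _hash(value):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes):
    """Place each node on the ring, sorted by hash position."""
    return sorted((_hash(n), n) for n in nodes)

def replicas_for(ring, key, rf=3):
    """Walk clockwise from the key's position, collecting rf distinct nodes."""
    positions = [pos for pos, _ in ring]
    i = bisect_right(positions, _hash(key)) % len(ring)
    chosen = []
    while len(chosen) < min(rf, len(ring)):
        node = ring[i][1]
        if node not in chosen:
            chosen.append(node)
        i = (i + 1) % len(ring)
    return chosen

nodes = ["us-east-1", "us-west-2", "eu-west-1", "ap-south-1"]
ring = build_ring(nodes)
print(replicas_for(ring, "user:42", rf=3))
```

Because placement is purely a function of the key and the ring, any node can compute where a replica lives without consulting a coordinator, which is what removes the single point of failure.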

Like YouTube, Netflix has been growing its global reach and customer base by providing streaming content. A key success factor is database technology that enables such scale and performance. Other databases like RDS, DynamoDB and MySQL provide a variety of functions such as analytics and metadata storage. One impressive part of last evening’s presentation was how Netflix repairs any damage to data on the fly, by embedding the repair logic into the database itself.

The New AI Economy

The convergence of technology leaps, social transformation, and genuine economic needs is catapulting AI (Artificial Intelligence) from its academic roots and decades of inertia to the forefront of business and industry. There has been growing buzz over the last couple of years about how AI and its key subsets, Machine Learning and Deep Learning, will affect all walks of life. Another phrase, “Pervasive AI”, is becoming part of our tech lexicon after the popularity of Amazon Echo and Google Home devices.

So what are the key factors pushing this renaissance of AI? We can quickly list them here:

  • The rise of Data Science from the basement to the boardroom. Everyone saw the 3 V’s of Big Data (volume, velocity, and variety). Data is called by many names: oxygen, the new oil, new gold, or the new currency.
  • Open source software such as Hadoop sparked a revolution in analytics on large volumes of unstructured data. The shift from retrospective to more predictive and prescriptive analytics, for actionable business insights, keeps growing. Real-time BI is also taking a front seat.
  • Arrival of practical frameworks for handling big data revived AI (Machine Learning and Deep Learning) which fed happily on big data.
  • Existing CPUs were not powerful enough for the fast processing needs of AI, so GPUs (Graphics Processing Units) offered faster, more parallel chips. NVIDIA has been a positive force in this area; its ability to provide a full range of components (systems, servers, devices, software, and architecture) is making NVIDIA an essential player in the emerging AI economy. IBM’s neuromorphic computing project has shown notable success in the areas of perception, speech and image recognition.

Leading software vendors such as Google have numerous projects on AI ranging from speech and image recognition, language translation, and varieties of pattern matching. Facebook, Amazon, Uber, Netflix, and many others are racing to deploy AI into their products.

Paul Allen, co-founder of Microsoft, is pumping $125M into his research lab, the Allen Institute for AI. The focus is to digitize common sense. Let me quote from today’s New York Times: “Today, machines can recognize nearby objects, identify spoken words, translate one language into another and mimic other human tasks with an accuracy that was not possible just a few years ago. These talents are readily apparent in the new wave of autonomous vehicles, warehouse robotics, smartphones and digital assistants. But these machines struggle with other basic tasks. Though Amazon’s Alexa does a good job of recognizing what you say, it cannot respond to anything more than basic commands and questions. When confronted with heavy traffic or unexpected situations, driverless cars just sit there”. Paul Allen added, “To make real progress in A.I., we have to overcome the big challenges in the area of common sense”.

Welcome to the new AI economy!

Big Data & Analytics – what’s ahead?

I recently read this statement somewhere: as we end 2017 and look ahead to 2018, the topics top of mind for data professionals are the growing range of data management mandates, including the EU’s new General Data Protection Regulation aimed at personal data and privacy; the growing role of artificial intelligence (AI) and machine learning in enterprise applications; the need for better security in light of the onslaught of hacking cases; and the ability to leverage the expanding Internet of Things.

Here are the key areas as we look ahead:

  • Business owners demand outcomes – not just a data lake storing all kinds of data in its native format, with APIs.
  • Data Science must produce results – Play and Explore is not enough. Learn to ask the right questions. Visualization of analytics from search.
  • Everyone wants Real Time – Days and weeks too slow, need immediate actionable outcomes. Analytics & recommendations based on real time data.
  • Everyone wants AI (artificial intelligence) – Tell me what I don’t know.
  • Systems must be secure – no longer a mere platitude.
  • ML (machine learning) and IoT at massive scale – Thousands of ML models. Need model accuracy.
  • Blockchain – businesses need to understand its full potential, since it is not merely a disruptive but a foundational technology shift.

In the area of big data, a combination of new and long-established technologies are being put to work. Hadoop and Spark are expanding their roles within organizations. NoSQL and NewSQL databases bring their own unique attributes to the enterprise, while in-memory capabilities (such as Redis) are increasingly being utilized to deliver insights to decision makers faster. And through it all, tried-and-true relational databases continue to support many of the most critical enterprise data environments.

Cloud is becoming the de facto deployment choice for both users and developers. Serverless technology with FaaS (Function as a Service) is seeing rapid adoption among developers. According to IDC, enterprises are undergoing IT transformation as they rethink their business operations, including how they use information and what technology to deploy. In line with that transformation, nearly 80% of large organizations already have a hybrid cloud strategy in place. The modern application architecture, sometimes referred to as SMAC (social, mobile, analytics, cloud), is becoming standard everywhere.

DBaaS (database as a service) is still not as widespread as other cloud services. Microsoft is arguably making the strongest explicit claim for a converged database system with its Azure Cosmos DB as DBaaS; Cosmos DB claims to support four data models – key-value, column-family, document, and graph. Databases have been slower to migrate to the cloud than other elements of computing infrastructure, mainly for security and performance reasons, but DBaaS adoption is poised to accelerate. Some of these cloud-based DBaaS systems – Cosmos DB, Google’s Spanner, and AWS DynamoDB – now offer significant advantages over their on-premises counterparts.

One thing is for sure: big data and analytics will continue to be vibrant and exciting in 2018.

AWS re:Invent 2017

In a few decades, when the history of computing is written, a major section will be devoted to cloud computing. The headline of the first section will read something like this: how did a dot-com-era book-selling company become the father of cloud computing? While giants like IBM, HP, and Microsoft were sleeping, Amazon started a new business eleven years ago, in 2006, called AWS (Amazon Web Services). I still remember the afternoon back in 2004 when I spent a couple of hours with the CTO of Amazon (not Werner Vogels, but his predecessor, a Dutch gentleman) discussing the importance of SOA (Service Oriented Architecture). When I asked why he was interested, he mentioned that CEO Jeff Bezos had given a marching order to monetize the under-utilized infrastructure in their data centers. Thus AWS arrived in 2006 with S3 for storage and EC2 for computing.

Advance the clock by 11 years. At this week’s AWS re:Invent event in Las Vegas, it was amazing to listen to Andy Jassy, CEO of AWS, who gave a 2.5-hour keynote on how far AWS has come. There were 43,000 people attending this event (in its 6th year), and another 60,000 tuned in via the web. AWS has a revenue run rate of $18B with 42% year-over-year growth. Its margin is over 60%, contributing significantly to Amazon’s bottom line. It has hundreds of thousands of customers, ranging from web startups to Fortune 500 enterprises in all verticals, and the strongest partner ecosystem. Gartner says AWS has a market share of 44.1% (39% last year), larger than all the others combined. Customers like Goldman Sachs, Expedia, and the National Football League were on stage showing how they fully switched to AWS for all their development and production.

Andy covered four major areas – computing, database, analytics, and machine learning – with many new service announcements. AWS already offers over 100 services. Here is a brief overview.

  • Computing – three major areas: EC2 instances, including new GPU instances for AI; containers (services such as Elastic Container Service and the new EKS, Elastic Kubernetes Service); and serverless (Function as a Service via Lambda). The last one, serverless, has gained fast traction in just the last 12 months.
  • Database – AWS is starting to pose a real challenge to incumbents like Oracle, IBM and Microsoft. It has three offerings – AWS Aurora RDBMS for transaction processing, DynamoDB, and Redshift. Andy announced Aurora Multi-Master for replicated reads and writes across data centers and zones; he claims it is the first RDBMS to scale out across multiple data centers, and that it is a lot cheaper than Oracle’s RAC solution. They also announced Aurora Serverless for on-demand, auto-scaling app development. For NoSQL, AWS has DynamoDB (a key-value store), plus Amazon ElastiCache for in-memory needs. Andy announced DynamoDB Global Tables, a fully managed, multi-master, multi-region database for customers with global users (such as Expedia). Another new service, Amazon Neptune, was announced for highly connected data (a fully managed graph database). They also have Redshift for data warehousing and analytics.
  • Analytics – AWS provides a data lake service on S3, which enables API access to any data in its native form, with many services like Athena, Glue, and Kinesis to access the data lake. Two new services were announced – S3 Select (a new API to select and retrieve S3 data from within an object) and Glacier Select (access to less frequently used data in the archives).
  • Machine Learning – Amazon claims it has been using machine learning for 20 years in its e-commerce business to understand users’ preferences. A new service called Amazon SageMaker was announced, which brings together storage, data movement, hosted notebook management, and ten of the most commonly used ML algorithms (e.g., time series forecasting). It also accommodates other popular libraries like TensorFlow, Apache MXNet, and Caffe2. Once you pick an algorithm, training is much easier with SageMaker, and deployment then happens with one click. Dr. Matt Wood of the AWS AI team demonstrated on stage how this is all done. They also announced AWS DeepLens, a video camera for developers with a built-in computer vision model that supports facial and image recognition in apps. Other new services announced were Amazon Kinesis Video Streams (video ingestion), Amazon Transcribe (automatic speech recognition), Amazon Translate (between languages), and Amazon Comprehend (fully managed NLP, or Natural Language Processing).
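As a flavor of the kind of algorithm such a service packages, here is simple exponential smoothing, one of the most basic time-series forecasting methods, in plain Python (an illustration of the technique, not SageMaker's implementation):

```python
def ses_forecast(series, alpha=0.5):
    """Simple exponential smoothing: each new level blends the latest
    observation with the previous level; the final level is the
    one-step-ahead forecast. alpha controls how fast old data is forgotten."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Five periods of demand; forecast the sixth
demand = [100, 110, 105, 115, 120]
print(round(ses_forecast(demand, alpha=0.5), 2))  # 115.0
```

Managed services wrap methods like this (and far more sophisticated ones) behind training and deployment APIs, so the practical work shifts from implementing the algorithm to preparing data and choosing parameters such as alpha.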

It was a very impressive and powerful presentation, and it shows how deeply committed and dedicated the AWS team is. Microsoft’s Azure cloud, Google’s compute cloud, IBM’s cloud and Oracle’s cloud all seem way behind AWS in breadth and depth. It will be to customers’ benefit to have a couple of AWS alternatives as we march along the cloud computing highway. Who wants single-vendor lock-in?

Blockchain 101

There is a lot of noise about blockchain these days. Back in 2015, The Economist wrote a whole special on blockchain that said, “The “blockchain” technology that underpins bitcoin, a sort of peer-to-peer system of running a currency, is presented as a piece of innovation on a par with the introduction of limited liability for corporations, or private property rights, or the internet itself”. It all started after the 2008 financial crisis, when a seminal paper by Satoshi Nakamoto, published on Halloween day (Oct 31, 2008), caught the attention of many (the real identity of the author is still unknown). The paper was titled “Bitcoin: A Peer-to-Peer Electronic Cash System”. Thus began a cash-less, bank-less world of money exchange over the internet using blockchain technology. Bitcoin’s value has exceeded $6,000, and its market cap is over $100B. VCs are rushing to invest in cryptocurrency like never before.

The September 1, 2017 issue of Fortune magazine’s cover page screamed “Blockchain Mania”. The article said, “A blockchain is a kind of ledger, a table that businesses use to track credits and debits. But it’s not just any run-of-the-mill financial database. One of blockchain’s distinguishing features is that it concatenates (or “chains”) cryptographically verified transactions into sequences of lists (or “blocks”). The system uses complex mathematical functions to arrive at a definitive record of who owns what, when. Properly applied, a blockchain can help assure data integrity, maintain auditable records, and turn contracts into programmable software. It’s a ledger, but on the bleeding edge”.

So welcome to the new phase of network computing, where we switch from “transfer of information” to “transfer of value”. Just as TCP/IP became the fundamental protocol for communication and helped create today’s internet with the first killer app, email (SMTP), blockchain will enable the exchange of assets (the first app being Bitcoin for money). So get used to new terms like cryptocurrency, DLT (distributed ledger technology), nonce, Ethereum, smart contracts, pseudo-anonymity, etc. The “information internet” becomes the “value internet”. Patrick Byrne, CEO of Overstock, said, “Over the next decade, what the internet did to communications, blockchain is going to do to about 150 industries”. In a recent article in Harvard Business Review, authors Joi Ito, Neha Narula, and Robleh Ali said, “The blockchain will do to the financial system what the internet did to media”.

The key elements of blockchain are the following:

  • Distributed Database – each party on a blockchain has access to the entire database and its complete history. No single party controls the data or the information, and each party can verify records without an intermediary.
  • Peer-to-Peer Transmission (P2P) – communication directly between peers instead of thru a central node.
  • Transparency with Pseudonymity – each transaction and its associated value are visible to anyone with access to the system. Each node/user has a unique 30-plus-character alphanumeric address, and users can choose to remain anonymous or provide proof of identity. Transactions occur between blockchain addresses.
  • Irreversibility of Records – once a transaction is entered in the database, it cannot be altered, because it is linked to every transaction record that came before it (hence the term ‘chain’).
  • Computational Logic – blockchain transactions can be tied to computational logic and in essence programmed.
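The irreversibility element above follows directly from hash chaining. The toy example below (plain Python, illustrative only; real blockchains add consensus, signatures, and proof-of-work on top) builds a chain where each block stores the hash of its predecessor, so altering any historical block breaks validation:

```python
import hashlib
import json

def block_hash(block):
    """Hash a block's contents (excluding its own stored hash)."""
    payload = json.dumps(
        {k: block[k] for k in ("index", "prev_hash", "data")}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def add_block(chain, data):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64  # genesis case
    block = {"index": len(chain), "prev_hash": prev_hash, "data": data}
    block["hash"] = block_hash(block)
    chain.append(block)

def is_valid(chain):
    """Valid iff every stored hash matches the block's contents and
    every prev_hash matches the preceding block's hash."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
add_block(chain, "Alice pays Bob 5")
add_block(chain, "Bob pays Carol 2")
print(is_valid(chain))                   # True
chain[0]["data"] = "Alice pays Bob 500"  # tamper with history
print(is_valid(chain))                   # False
```

Rewriting any old record would require recomputing every subsequent hash on a majority of nodes simultaneously, which is what makes the ledger practically irreversible.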

The heart of the system is a distributed database that is write-once, read-many, with a copy replicated at each node. It is transaction processing in a highly distributed network with guaranteed data integrity, security, and trust. Blockchain also provides an automated, secure coordination system with remuneration and tracking. Even though it started with “money transfer” via Bitcoin, the underpinnings can be applied to any asset, and the need for a central coordinating agency such as a bank disappears. Assets such as mortgages, bonds, stocks, loans, home titles, auto registries, birth and death certificates, passports, visas, etc. can all be exchanged without intermediaries. The Feb 2017 HBR article said, “Blockchain is a foundational technology (not disruptive). It has the potential to create new foundations for our economic & social systems.”

We did not get into the depth of the technology here, but plenty of literature is available for you to read. Major vendors such as IBM, Microsoft, Oracle, HPE are offering blockchain as an infrastructure service for enterprise asset management.

Splice Machine – What is it?

If you have never heard of Splice Machine, don’t worry: you are in the company of many. So I decided to listen to a webinar last week whose announcement said the following: learn about the benefits of a modern IoT application platform that can capture, process, store, analyze and act on the large streams of data generated by IoT devices. The demonstration will include:

  • High Performance Data Ingestion
  • Analytics and Transformation on Data-In-Motion
  • Relational DBMS, Supporting Hybrid OLTP and OLAP Processing
  • In-Memory and Non-Volatile, Row-based and Columnar Storage mechanisms
  • Machine Learning to support decision making and problem resolution

That was a tall order. Gartner has a new term for this, HTAP (Hybrid Transactional and Analytical Processing); Forrester uses “translytical” to describe a platform where you can do both OLTP and OLAP. I wrote a blog on translytical databases almost two years ago. So I did attend the webinar, and it was quite impressive. The only confusion was the liberal use of IoT in the marketing slogan; by that they want to emphasize “streaming data” (ingest, store, manage).

On Splice Machine’s website, you see four things: Hybrid RDBMS, ANSI SQL, ACID Transactions, and Real-Time Analytics. A white paper advertisement says, “Your IoT applications deserve a better data platform”. Looking at the advisory board members, I recognized three names: Roger Bamford, ex-Oracle and an investor; Ken Rudin, ex-Oracle; and Marie-Anne Neimat, ex-TimesTen. The company is funded by Mohr Davidow Ventures and InterWest Partners, among others.

There is a need to bring together the worlds of OLTP (transactional workloads) and analytics (OLAP workloads) on a common platform. They have been separated for decades, and that is how data warehouses, MDM, OLAP cubes, etc. got started. The movement of data between the OLTP world and OLAP has been handled by ETL vendors such as Informatica. With the popularity of Hadoop, the DW/analytics world is crowded with terms like Data Lake, ELT (first load, then transform), Data Curation, Data Unification, etc. An architecture called Lambda (not to be confused with AWS Lambda for serverless computing) claims to unify the two worlds: periodic batch processing and real-time streaming analytics.
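The Lambda architecture merges two layers at query time: a batch view recomputed periodically over all historical data, and a speed layer holding increments that arrived since the last batch run. Here is a minimal sketch of that merge (the class and key names are illustrative):

```python
class LambdaView:
    """Toy Lambda-architecture query path: a batch view (recomputed
    periodically, e.g. by a nightly Hadoop/Spark job) merged with a
    speed layer holding live counts since that job ran."""

    def __init__(self, batch_view):
        self.batch_view = dict(batch_view)  # output of the last batch job
        self.speed_layer = {}               # live increments since then

    def record_event(self, key):
        """Speed layer: apply a streaming event immediately."""
        self.speed_layer[key] = self.speed_layer.get(key, 0) + 1

    def query(self, key):
        """Serving layer: merge batch and speed results at read time."""
        return self.batch_view.get(key, 0) + self.speed_layer.get(key, 0)

view = LambdaView({"page:home": 1000, "page:about": 40})
view.record_event("page:home")
view.record_event("page:home")
print(view.query("page:home"))   # 1002
print(view.query("page:about"))  # 40
```

When the next batch job finishes, its output replaces `batch_view` and the speed layer is reset; HTAP platforms like Splice Machine aim to make this dual-path bookkeeping unnecessary by serving both workloads from one engine.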

Into this world comes Splice Machine with its scale-out data platform. You can do your standard ACID-compliant OLTP processing, ingest data via Spark Streaming and Kafka topics, run queries via ANSI SQL, and get your analytical workloads without ETL. They even claim support for procedural languages like Oracle’s PL/SQL. With their support of machine learning, they demonstrated predictive analytics. The current focus is on verticals like healthcare, telco, retail, and finance (Wells Fargo).

In the cacophony of Big Data and IoT noise, it is hard to separate fact from fiction. But I do see a role for a “unified” approach like Splice Machine’s. Again, the proof is in the pudding: real-life customer deployment scenarios with performance numbers will test the hypothesis and their claim of 10x faster speed at one-fourth the cost.