Fast Data

During the 1980s and 1990s, online transaction processing (OLTP) was critical to the core business functions of banks, airlines, and telcos. This was a big step up from the batch systems of the early days. We learnt the importance of sub-second response time and continuous availability, with the goal of five nines (99.999% uptime). The yearly tolerance for system outage was about 5 minutes. During my days at IBM, we had to face the fire from a bank in Japan that had an hour-long outage, resulting in a long queue in front of the ATM (unlike here, the Japanese stood very patiently until the system came back after what felt like an eternity). They were using IBM’s IMS Fast Path software and the blame was first put on that software; the cause subsequently turned out to be some other issue.
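
To see where the five-minute figure comes from, here is a small back-of-the-envelope calculation in Python. It is only a sketch: it assumes a 365-day year, and the function name is mine, chosen purely for illustration.

```python
def downtime_budget_minutes(availability_percent, period_days=365.0):
    """Minutes of outage allowed per period for a given availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


# Compare three, four, and five nines over one (assumed 365-day) year.
for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of outage per year")
```

Five nines works out to a little over five minutes a year, which is why an hour-long ATM outage was such a serious event.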

Advance the clock to today. Everything is real-time, and one cannot talk about real-time without discussing the need for “fast data” – data that has to travel very fast for real-time decision making. Here are some reasons for fast data:

  • These days, it is important for businesses to be able to quickly sense and respond to events that are affecting their markets, customers, employees, facilities, or internal operations. Fast data enables decision makers and administrators to monitor, track, and address events as they occur.
  • Leverage the Internet of Things – for example, an engine manufacturer will embed sensors within its products, which then will provide continuous feeds back to the manufacturer to help spot issues and better understand usage patterns.
  • An important advantage that fast data offers is enhanced operational efficiency: events that could negatively affect processes (such as inventory shortages or production bottlenecks) can not only be detected and reported, but remedial action can be immediately prescribed or even launched. Real-time analytical data can be measured against previously determined patterns to predict problems, and systems can respond with appropriate alerts or automated fixes.
  • Assure greater business continuity – fast data plays a role in bringing systems (and all data still in the pipeline) back up and running quickly after a catastrophic event, before the business suffers.
  • Fast data is critical for supporting Artificial Intelligence and machine learning. As a matter of fact, data is the fuel for machine learning (recommendation engines, fraud detection systems, bidding systems, automatic decision making systems, chatbots, and many more).
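
As a rough illustration of the "sense and respond" idea in the list above, here is a minimal Python sketch that checks each incoming reading against the recent pattern and raises an alert when it deviates sharply. Everything in it is an assumption made for illustration: the class name, the window size, the three-sigma threshold, and the sample temperature readings. A production system would do this inside a streaming framework rather than in a plain loop.

```python
from collections import deque
from statistics import mean, pstdev


class StreamMonitor:
    """Flags readings that deviate sharply from a sliding window of recent values."""

    def __init__(self, window_size=20, threshold=3.0, min_history=5):
        self.window = deque(maxlen=window_size)  # most recent readings only
        self.threshold = threshold               # allowed deviation, in standard deviations
        self.min_history = min_history           # readings needed before alerting

    def observe(self, value):
        """Return True if value is an outlier against the window, then record it."""
        alert = False
        if len(self.window) >= self.min_history:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                alert = True
        self.window.append(value)
        return alert


# Hypothetical engine temperature feed with one sudden spike.
readings = [70.0, 70.5, 69.8, 70.2, 70.1, 69.9, 70.3, 95.0, 70.0]
monitor = StreamMonitor()
alerts = [monitor.observe(r) for r in readings]
print("alert raised at positions:", [i for i, a in enumerate(alerts) if a])
```

The point of the sketch is that the decision is made as each event arrives, not in a batch job hours later.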

Now let us look at the constellation of technologies enabling fast data management and analytics. Fast data is data that moves almost instantaneously from source to processing to analysis to action, courtesy of frameworks and pipelines such as Apache Spark, Apache Storm, Apache Kafka, Apache Kudu, Apache Cassandra, and in-memory data grids. Here is a brief outline of each of these.

Apache Spark – open source toolset now supported by most major database vendors. It offers streaming and SQL libraries to deliver real-time data processing. Spark Streaming processes data as it is created, enabling analysis for critical areas like real-time analytics and fraud detection. Its Structured Streaming API opens up this capability to enterprises of all sizes.

Apache Storm is an open source distributed real-time computation system designed to enable processing of data streams.

Apache Cassandra is an open source distributed database that offers low-latency data replication.

Apache Kafka is an open source toolset designed for real-time data streaming – employed for data pipelines and streaming apps. The Kafka Connect API helps connect it to other environments. It originated at LinkedIn.

Apache Kudu is an open source storage engine to support real-time analytics on commodity hardware.

In addition to powerful open source tools and frameworks, there are in-memory data grids, which provide a hardware-enabled route to fast data and deliver blazing speeds to meet today’s needs, such as IoT management, deployment of AI and machine learning, and responding to events in real time.

Yes, we have come a long way from those OLTP days! Fast data management and analytics is becoming a key area for businesses to survive and grow.

When to think of Blockchain?

Blockchain is going through its hype cycle. It’s not magic, nor is it a solution looking for a problem. It is important to know what blockchains can and can’t do. So let us revisit the definition again. Blockchain is a distributed ledger shared by untrusted participants with strong guarantees about accuracy and consistency. Now let us dissect each of the key terms.

  • Ledger: Manual ledgers, in which accountants entered transactions by hand, go back at least to the 19th century. They are lists of transactions: items sold/purchased, price, date, etc. Those transactions are dated (timestamped). Ledgers are strictly append-only: transactions can be added, but old entries can be neither deleted nor modified. Blockchain can have ledger entries that are significantly more complex, but the concept is the same.
  • Shared: Anyone with the appropriate software can put entries into a pool of entries that will eventually be checked for consistency and added to the ledger.
  • Distributed: Blockchains are not centralized. There is no central administration to decide who has access and what rules to follow. Hence there is no single point of control nor single point of failure. Many participants in the blockchain have copies of the entire ledger, which get updated whenever blocks are added. This disintermediation was fundamental when the Bitcoin movement started back in 2008.
  • Untrusted Participants: This is the most radical idea of blockchain. In enterprise applications, requiring a certain amount of trust allows some important optimizations, but the concept of “untrusted participants” is fundamental to a blockchain. Anyone can add entries. The protocol that produces agreement among untrusted partners is called BFT (Byzantine Fault Tolerance) or Byzantine agreement.
  • Accuracy & Consistency: Despite untrusted participants, blockchain makes strong guarantees about the ledger’s accuracy. The replicated copies are not always in agreement, but disagreements are quickly resolved automatically via algorithms and voting.
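
To make the "append-only ledger" idea concrete, here is a minimal Python sketch of a hash-chained ledger, the basic linking trick that Bitcoin-style blockchains use: each block stores the hash of the block before it, so changing an old entry breaks every link after it. This is a sketch under loud assumptions. The class, field names, and sample transactions are all hypothetical, and it deliberately leaves out the hard parts described above: distribution across many copies and agreement among untrusted participants.

```python
import hashlib
import json
import time


class Ledger:
    """Toy append-only ledger: every block commits to the hash of its predecessor."""

    GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block

    def __init__(self):
        self.blocks = []

    @staticmethod
    def _digest(index, timestamp, transactions, prev_hash):
        # Serialize deterministically so the same content always hashes the same.
        payload = json.dumps(
            {"index": index, "timestamp": timestamp,
             "transactions": transactions, "prev_hash": prev_hash},
            sort_keys=True,
        ).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def append(self, transactions):
        """Add a timestamped block; existing blocks are never rewritten."""
        prev_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS_HASH
        index, timestamp = len(self.blocks), time.time()
        block = {
            "index": index,
            "timestamp": timestamp,
            "transactions": transactions,
            "prev_hash": prev_hash,
            "hash": self._digest(index, timestamp, transactions, prev_hash),
        }
        self.blocks.append(block)
        return block

    def is_valid(self):
        """Recompute every hash and link; any edit to an old entry is detected."""
        prev_hash = self.GENESIS_HASH
        for block in self.blocks:
            expected = self._digest(block["index"], block["timestamp"],
                                    block["transactions"], block["prev_hash"])
            if block["prev_hash"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True


ledger = Ledger()
ledger.append([{"item": "widget", "price": 25, "buyer": "alice"}])
ledger.append([{"item": "gadget", "price": 40, "buyer": "bob"}])
print("valid after appends:", ledger.is_valid())

ledger.blocks[0]["transactions"][0]["price"] = 1  # try to rewrite history
print("valid after tampering:", ledger.is_valid())
```

In a real blockchain, many participants hold copies of such a chain, and the consensus protocol decides which new block gets appended next.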

Blockchain is often a shorthand for “how Bitcoin is implemented”, but its scope is much broader than Bitcoin (Bitcoin is the first app on blockchain, much as email was the first app on the Internet). Blockchain introduces the era of “exchange of values/assets” whereas the Internet gave us the era of “exchange of information”.

If you are building applications that span enterprises and that need to keep accurate records in the presence of untrusted partners, you should be thinking about blockchains.

Microsoft buys GitHub for $7.5B

This morning Microsoft announced the acquisition of GitHub for $7.5B. Why does this make immense sense? There are three main angles here: winning developers’ mindset and loyalty, pushing them closer to adopting the Azure cloud at runtime, and a skills hookup with LinkedIn, another Microsoft acquisition from 2016. Let’s look at each point.

Developer love-fest

On the most surface level, the logic of buying GitHub is pretty clear. Developers love GitHub, and Microsoft needs the love of developers. GitHub is an online service that allows developers to host their software projects. From there, anyone can download those projects and submit improvements. That functionality has made GitHub the center of the open-source software development world. Microsoft offers a whole swath of tools for developers, including the increasingly popular Visual Studio Code software and the open-source .NET Core programming framework. The popularity of these kinds of tools provides a gentle, but apparently effective, funnel toward the Microsoft Azure cloud and other Microsoft products and services: if you like one Microsoft product, it’s more likely that you’ll choose other Microsoft products, especially if they integrate cleanly.

GitHub would just add to that strategy: developers already love GitHub. In fact, in 2017, Microsoft killed CodePlex, its own GitHub competitor, saying GitHub’s popularity made its own efforts redundant and unnecessary. By owning GitHub, Microsoft would have a direct line to millions of highly engaged developers. We’ve already seen baby steps in this direction, as GitHub and Microsoft just this month announced integrations between their services.

Push Azure Cloud against AWS

AWS is the leader in cloud deployment with a run-rate of over $20B ($5.6B in first-quarter revenue, and 73% of Amazon’s total operating income). GitHub users get software developed and ready using the available open source tools, but running it is another story. Often they go to AWS as the default run-time platform.

Microsoft is laser-focused on the continued growth of its cloud-computing business. So the opportunity for Microsoft is fairly straightforward. If it can get the Microsoft Azure cloud tightly integrated with GitHub — basically, give developers an easy way to get a GitHub project up and running in the cloud — it can kill two birds with one stone. Developers could love GitHub even more, and it would drive more use of Microsoft Azure. It would be a weapon in Microsoft’s arsenal to close the gap between Azure and Amazon Web Services.

LinkedIn + GitHub

What does this mean? When Microsoft spent over $26 billion on LinkedIn in 2016, CEO Satya Nadella said the company was investing heavily in making sure that current and future workers had the skills they needed to succeed in the modern economy. In Silicon Valley, at least, it’s not uncommon for an employer to ask for a GitHub profile alongside, or instead of, a traditional résumé. If Microsoft is trying to understand the modern skills economy, GitHub could provide a compelling glimpse. So the GitHub push can be about helping developers work together as software becomes key to doing business at almost every company.

I am glad that Microsoft is making aggressive moves to take the fight to AWS. It’s time we saw some tough competition for AWS as it grows bigger and stronger, toward monopolistic proportions.

Blockchain in Healthcare

The application of blockchain technology in the healthcare industry will bring great benefits, the most important being data accuracy and lower cost.

Just as a reminder, blockchain technology provides these key facets:

  • a low-cost decentralized ledger approach to managing information (replicated at each node without any central hub),
  • simultaneous access for all parties to a single body of strongly encrypted data (almost impossible for hackers to get to),
  • an audit trail created each time data is changed, helping to ensure the integrity and authenticity of the information,
  • visibility for patients into their own data, and the ability to authorize other parties (doctors, hospitals, insurers).
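
The last two facets, patient authorization and an audit trail on every change, can be sketched in a few lines of Python. This is only an illustration of the access rule, not a blockchain: the record lives in one process, and the class, field names, and party names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PatientRecord:
    """Toy patient record: only authorized parties may change it, and every action is logged."""

    patient_id: str
    data: dict = field(default_factory=dict)
    authorized: set = field(default_factory=set)
    audit_trail: list = field(default_factory=list)

    def _log(self, actor, action, detail):
        # Append-only trail identifying who did what, and when.
        self.audit_trail.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "detail": detail,
        })

    def grant(self, party):
        """The patient authorizes another party (doctor, hospital, insurer)."""
        self.authorized.add(party)
        self._log(self.patient_id, "grant", party)

    def update(self, actor, key, value):
        """Change a field if the actor is the patient or an authorized party."""
        if actor != self.patient_id and actor not in self.authorized:
            self._log(actor, "denied", key)
            raise PermissionError(f"{actor} is not authorized")
        self.data[key] = value
        self._log(actor, "update", key)


record = PatientRecord("patient-1")
record.grant("dr-lee")
record.update("dr-lee", "allergy", "penicillin")
try:
    record.update("insurer-x", "allergy", "none")
except PermissionError as err:
    print("rejected:", err)
print([entry["action"] for entry in record.audit_trail])
```

On a blockchain, the same trail would be replicated to every party, so no single hospital or insurer could quietly alter it.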

Current problems in the healthcare industry are all about multiple sources of data for each patient, and hence incorrect information, which adds to cost. The various entities (hospital, doctor’s office, and insurer) all maintain their own databases; synchronization becomes a real issue and often causes errors. Blockchain is a real solution to these ills. Several examples of applying blockchain are in the development stage.

  • Change Healthcare, a Nashville-based health network, has introduced a blockchain system for processing insurance claims. While not all providers in the system are using it yet, the shared ledger of encrypted data represents a “single source of truth”. All involved parties can see the same accurate information about a claim in real time (rather than sending data back and forth). This relieves a patient from having to call multiple parties to verify information (a practice we all are familiar with). Each time the data is changed, a record of it is shown on the digital ledger, identifying the responsible party. Any changes also require verification by each party involved, again enforcing the record’s accuracy.
  • Last April, a group of companies including Humana Inc., Multiplan Inc., Quest Diagnostics Inc., and UnitedHealth Group’s Optum announced a pilot project using blockchain to maintain online directories of doctors and healthcare providers. Typically doctor groups, hospitals, insurers and diagnostic companies maintain their own online listings of contacts, practices and biographical details. Not only is this expensive, but they have to continually check and verify the accuracy of these directories. Using blockchain, a substantial saving (almost 75%) is expected. The goal of the pilot program is for providers to update their information themselves in the blockchain, where all parties in the network can view it.
  • The MIT Media Lab is developing a system called MedRec based on blockchain. Patients can manage their own records and give permission to doctors and providers to access and update the records. Success of this system, or any such system, will depend on a large number of providers and doctors opting in to the program.

Most of the early efforts are in the “proof of concept” stage. But the potential of blockchain to help lower healthcare costs and provide timely, accurate information is very promising.

The battle for India’s e-commerce market

Many experts predict that India will become a battleground for e-commerce by the second half of this year. According to a 7ParkData study, the share of monthly mobile e-commerce in March, 2017 looked as follows:

  • Flipkart – 30.7%, Amazon – 30.3%, Snapdeal – 10.8%, Others – 28.2%

Amazon launched its own e-commerce site in India in 2013, and is already fighting for the top spot against local player Flipkart. The Wall Street Journal in an article today said, “Walmart is nearing a deal to acquire a majority stake in India’s leading online retailer, a bold move that would open another front in its escalating war with Amazon. The Indian company, Flipkart, is a start-up that sells everything from clothing to smartphones in the country’s blossoming e-commerce market…Two of the people said that the deal would value Flipkart at about $20 billion and that Walmart was looking to acquire a stake of at least 60 percent. The deal would be an aggressive and, some analysts say, risky foray by Walmart into one of the world’s last great open markets for online retailing”. SoftBank of Japan has invested in both Flipkart and Snapdeal.

India may be the first neutral territory where Amazon and Alibaba go head-to-head, as Alibaba’s Paytm Mall seems to be gaining traction quickly in the country. Paytm is an online payment application, much like Apple Pay. Paytm Mall, which launched earlier this year, reportedly garnered 15-20% of all e-commerce sales during the festive season (September 20 to mid-October of 2017). Currently, its value proposition (product quality, availability, and pricing) is nearly on par with Amazon’s and Flipkart’s, but its brand recognition is still low in the country. However, with Alibaba’s backing, it could easily invest heavily in marketing initiatives. And, given how quickly Paytm has already progressed, it will likely gain enough market share to threaten Amazon and Flipkart this year.

Amazon is not sitting idle and is aggressively planning its growth in the India market. It dominates the US e-commerce market with a 44% share in 2017. In India, it just added 5 more fulfillment centers (for a total of 67 across 13 states, several of them dedicated to groceries or large appliances). It has also introduced a new mobile web browser for Android, simply called “Internet”. It’s said to be “extra small,” taking up a minimal amount of smartphone memory, and it also consumes less data when in use than other browsers. Amazon believes people will find these features enticing, as they offer the equivalent of additional internet access at no additional cost. It is also building an India-first software experience for its Echo smart speakers.

It will be quite interesting to see all the major players fight it out to capture the huge Indian market as e-commerce becomes more mainstream over the next few years.

Netflix Technology

I attended a meetup at Netflix last evening titled “Polyglot Persistence at Netflix”. The Cloud Database Engineering (CDE) team presented various aspects of building and maintaining a highly distributed system to meet its ever-growing customer needs. Netflix has almost 160 million users, and with the growing popularity of its streaming movies and TV shows (many now produced by Netflix itself), the demand on its systems is growing rapidly. Polyglot implies the coexistence of many databases and associated software systems.

The Netflix cloud platform forms a layer of services, tools, frameworks and technologies that run on top of AWS EC2 in order to implement an efficient and nimble (fast reacting), highly available, globally distributed, scalable and performant solution. Netflix switched over to the AWS cloud over a seven-year period starting back in 2009. It uses Amazon’s RDS and DynamoDB, besides S3 for lower cost storage. The front-end is Node.js while the backend uses Java, Python, and JavaScript. The team also described how they are using SSDs (solid-state drives) besides memory cache. The main thrust of the evening talk was their use of Cassandra as the distributed database solution.

Apache Cassandra was originally developed at Facebook and is now a free, open-source, highly scalable, high-performance distributed database that handles large amounts of data across many servers with no single point of failure. This global network of storage servers caches content close to where it will be viewed. This local caching reduces bandwidth costs, reduces latency, and makes it easier to scale the service over a wide area, in this case globally. Here are the key reasons Netflix is a major user of Cassandra (besides others like eBay, Apple, Comcast, Instagram and Reddit):

  • Very large production deployment – 2500 nodes, 420 TB, over 1 trillion user requests per day. Cassandra is a NoSQL, distributed, wide-column database that scales horizontally and dynamically as more servers are added, without the need to re-shard or reboot.
  • Strong write performance with no network performance bottleneck.
  • Its data model is highly flexible. A sparse 2-dimensional “super column family” architecture allows for rich data model representation (and better performance) beyond just key-value lookup.
  • Its geographic capabilities – a single global cluster can simultaneously replicate data asynchronously and service applications across multiple locations. The team last evening showed how users can seamlessly switch over to another data center if a failure occurs. Cassandra has been a good choice for cross-data center and cross-regional deployment, as customisable replication helps determine which cluster nodes to designate as replicas.
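
The "scales horizontally without re-sharding" point in the list above rests on consistent hashing: nodes and keys are placed on a ring of tokens, and each key belongs to the next node clockwise. Here is a simplified Python sketch of that idea. It is not Cassandra’s code. The MD5 hash, the 64 virtual tokens per node, the node and key names, and the "next distinct nodes clockwise" replica rule are all assumptions chosen to keep the example short.

```python
import bisect
import hashlib


class HashRing:
    """Toy token ring; each physical node owns several virtual tokens."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._tokens = []   # sorted token positions on the ring
        self._owners = {}   # token -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _token(key):
        # MD5 is used here only as a convenient, evenly spread hash (an assumption).
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        """Place the node's virtual tokens on the ring; existing tokens stay put."""
        for i in range(self.vnodes):
            token = self._token(f"{node}#{i}")
            self._owners[token] = node
            bisect.insort(self._tokens, token)

    def replicas_for(self, key, copies=1):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        if not self._tokens:
            raise ValueError("ring has no nodes")
        start = bisect.bisect_right(self._tokens, self._token(key))
        found = []
        for step in range(len(self._tokens)):
            token = self._tokens[(start + step) % len(self._tokens)]
            owner = self._owners[token]
            if owner not in found:
                found.append(owner)
                if len(found) == copies:
                    break
        return found

    def node_for(self, key):
        return self.replicas_for(key, copies=1)[0]


keys = [f"user-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one server
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved;",
      "all moved keys went to node-d:", all(after[k] == "node-d" for k in moved))
print("replicas for user-1:", ring.replicas_for("user-1", copies=3))
```

Only the keys that now fall to the new node move, and everything else stays where it was. That is what lets a cluster grow without reshuffling all of its data.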

Like YouTube, Netflix has been growing its global reach and customer base in providing streaming content. The key success factor is the database technology that enables such high scale and performance. Other databases like RDS, DynamoDB and MySQL provide a variety of functions such as analytics and metadata storage. One impressive part of last evening’s presentation was how they repair any damage to data on the fly by embedding the repair process into the database itself.

Crypto Hype vs. Blockchain

There is a lot of crypto hype these days, from cryptocurrencies like Bitcoin to fundraising efforts like ICOs (Initial Coin Offerings), which are similar to IPOs. All this noise has obscured the real benefits of the underlying technology – Blockchain. The Internet brought us the “exchange of information” over the last 3 decades. Blockchain will give us the new era of “exchange of values” or “exchange of assets” without an intermediary, via highly secure transactions in a peer-to-peer network. New ways of transferring real estate titles, managing cargo on shipping vehicles, guaranteeing the safety of the food we eat, and many more mundane activities will be enabled by Blockchain. An article in today’s WSJ by Christopher Mims covers this in more detail.

Briefly, Blockchain is essentially a secure database (or ledger) spread across multiple computers. Everybody has the same record of all transactions, so tampering with one instance of it will be meaningless. “Crypto” describes the cryptography that underlies it, which allows agents to securely interact (e.g. transfer assets) while also guaranteeing that once a transaction has been made, the Blockchain keeps an immutable record of it. This technology is well suited to transactions that require trust and a permanent record for traceability. It also requires the cooperation of many different parties. Here are some examples of actual deployment of Blockchain so far:

  • At Walmart, 1.1 million items are on Blockchain, helping the company to trace each item’s journey from manufacturer to store shelf. Global shipping company Maersk is tracking shipping containers, making it faster and easier to transfer them and get them through customs. Other companies using Blockchain technology for tracking are Kroger, Nestle, Tyson Foods and Unilever. In all these cases, IBM is providing the Blockchain technology.
  • CartaSense, an Israeli company, uses a Blockchain database so its customers can track every stage of the journey of a package, pallet or shipping container.
  • Everledger, a company started in 2014, uses a Blockchain-based registry of every certified diamond in the world (already 2.2 million in its registry). By recording 40 different measures of each stone, it is able to trace the journey of a stone from its source to the final sale to a customer.
  • Dubai has declared its goal to become the first Blockchain-powered government in the world by 2020. It wants to streamline real estate transactions for faster and easier transfer of property titles. Other assets like birth/death certificates, passports, visas, etc. can also be managed at low cost with better efficiency.

It is a bit early to claim that Blockchain will revolutionize every industry, including government, but it has that potential. It poses a tremendous challenge for hackers trying to break in. It can affect everything from how we vote to whom we connect with to what we buy.