The world of Hadoop and beyond

Hortonworks went through an IPO last Friday, December 12, 2014. Its initial price of $16 soared by 60% immediately after. Today the stock price is $24.70, with a market cap of $1.02B – another member of the billion-dollar club. They compete with Cloudera and MapR in packaging the open source Hadoop platform for customers. How do they make money when basic Apache Hadoop is free? By offering added services such as training and consulting. They can also sell auxiliary products (not open source) that customers must pay for. The interesting fact is that Hortonworks's CEO had claimed $100M in revenue this year, but the company looks to be well short of that: $33M in the first nine months. The future is quite uncertain!

In the meantime, several new start-ups have come up in the Hadoop-sphere:

  • Databricks, with venture funding of $47M so far (investors include Andreessen Horowitz; Ben Horowitz is on the board). This is a Berkeley-based company that delivers Spark. Apache Spark is a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010 (a minimal usage sketch appears after this list).
  • Altiscale, with $42M in venture funding (founded by ex-Yahoo CTO Raymie Stata; investors include Sequoia Capital, among others). This company promises to deliver Hadoop in the cloud, relieving customers of the complexities of Hadoop cluster management, which can be highly non-trivial.
  • Splice Machine, with venture funding of $22M (Mohr Davidow Ventures, InterWest Partners). They claim to bring an RDBMS face to Hadoop, so that existing SQL programs may not have to change, clearly targeting incumbents such as Oracle. Interestingly, Oracle does provide similar functionality with its universal Oracle API.
  • Metanautix, with venture funding from Sequoia Capital, the Stanford Endowment fund, and other individuals. The founders are from Google and Facebook, and the product is an implementation of Google’s Dremel idea; the claim is that it will completely supplant Hadoop.
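
To make the Spark item above concrete, here is a minimal PySpark sketch of the kind of in-memory analytics Spark is built for. The HDFS path and record layout are hypothetical and used only to illustrate the API style; nothing here comes from any of the companies listed.

```python
# Minimal PySpark sketch: count events per user from tab-separated logs in HDFS.
# The path and record layout below are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="event-counts")

events = sc.textFile("hdfs:///data/events/2014-12/*.tsv")

counts = (events
          .map(lambda line: line.split("\t"))   # e.g. [user_id, event_type, timestamp]
          .map(lambda fields: (fields[0], 1))   # key each event by user_id
          .reduceByKey(lambda a, b: a + b)      # aggregate across the cluster
          .cache())                             # keep the result in memory for reuse

print(counts.take(10))
sc.stop()
```

The `.cache()` call is where Spark's speed advantage over plain MapReduce shows up: intermediate results stay in memory and can feed further interactive queries without re-reading HDFS.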

So there we go, a line-up of companies besides Cloudera, MapR, and Hortonworks. Out of this chaos many are going to disappear and only a couple will survive, as always. This still addresses the batch-oriented data analytics space, even though Mike Olson of Cloudera thinks Hadoop will replace the OLTP DBMS some day.

This polyglot existence of a variety of data management solutions will be with us for some time.

Huge Valuations – only for a category king!

Only a couple of weeks ago, Uber got $1.2B in funding at a valuation of $41B! Earlier this year, Facebook paid over $19B to acquire WhatsApp. Dropbox is valued at $10B. All three of these companies are just a few years old. What is going on? Is this another tech bubble like the ones seen earlier? No, it’s a start-up wealth gap, according to a Newsweek article by Kevin Maney.

Here is an interesting finding! A new study, which analyzed valuation data on thousands of tech startups, found that winning companies born since 2009 get to super-high valuations three times faster than companies started in the early 2000s. Looked at another way, if a company is going to reach a $1 billion value, it will do so in one-third the time that climb typically took just a decade ago. Of the 80 companies that hit $1 billion, half are what the study calls Category Kings – companies that define and dominate a new category of business. Uber is a good example of a Category King: it helped create a new kind of business, took the lead in defining it, and became the dominant player. A Category King typically takes 70 percent of the total market value of its category; all the rest of the entrants split the remaining 30 percent. Examples of Category Kings include Facebook, Google (with its new page-ranking search algorithm), LinkedIn, Twitter, Airbnb, Snapchat, Cloudera, Dropbox, and Pinterest.

Even worse news for the second-tier companies: The study found that a six-year-old startup that isn’t yet a Category King has almost zero chance of becoming one. Hundreds of companies are left to forever survive on their category’s scraps. That explains why Uber was valued at $41 billion earlier this month, while at about the same time the No. 2 in that space, Lyft, was valued about 40 times lower, at $1.2 billion. The rest of that category is barely noticeable. Investors look at the future value of that category and see one company taking most of it.

What about enterprise companies? Their valuations seem much lower than those of consumer companies. The study bears this out: a typical venture-backed consumer company is growing its market cap at more than $600 million per year, compared to about $100 million per year for a typical venture-backed enterprise company.

The same study concluded that 35 Category Kings dominate the valuation of venture-backed technology companies founded since 2000. These companies are more valuable than all the other companies combined and have taken more than 70 percent of the total available market cap of any category or era since 2000.

This is certainly a new economy for leading consumer companies!

Data Lake and Data Refinery – Gartner controversy!

Much discussion has been going on about the new phrase ‘Data Lake’. Gartner wrote a report on the ‘data lake’ fallacy, cautioning that a ‘data lake’ can turn into a ‘data swamp’. Then Andrew Oliver, writing in InfoWorld, opened with these words: “For $200, Gartner tells you ‘data lakes’ are bad and advises you to try real hard, plan far in advance, and get governance correct”. Wow, what an insight!

During my days at IBM and Oracle, Gartner wanted to get time on my calendar to talk about database futures. Afterwards, I realized that I had paid a significant fee to attend the Gartner conference only to hear back what I had told them. A good business: gathering information and selling it back. Without meaning any disrespect, many analysts like to make controversial statements to stay relevant. Here is such a case with Gartner.

The concept of a ‘data lake’ was coined by James Dixon of Pentaho Corp., who described it this way: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” Think of a data lake as an unstructured data warehouse, a place where you pull all of your different sources into one large “pool” of data. In contrast to a data mart, a data lake won’t “wash” the data or try to structure it or limit the use cases. Sure, you should have some use cases in mind, but the architecture of a data lake is simple: a Hadoop File System (HDFS) with lots of directories and files on it.

The data lake strategy is part of a greater movement toward data liberalization. Given the exponential growth of data (especially with IoT and its myriad sensors), there is a need to store data in its native format for later analysis. Of course you can drown in a data lake! But that is why you build safety nets: security procedures (for example, access allowed only via Knox), documentation (what goes where in which directory, and what roles you need to find it), and governance.
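
To make “lots of directories and files” concrete, here is a minimal sketch of landing raw source files in a lake, untouched, under a simple date-based naming convention. The paths and the use of the `hdfs` command-line client are assumptions made for illustration, not a prescription.

```python
# Minimal sketch: copy a raw extract into the lake as-is, under a date-partitioned
# directory convention. Paths are hypothetical; assumes the `hdfs` CLI is available.
import subprocess
from datetime import date

def land_raw_file(local_path, source_name):
    """Put a raw file into HDFS under /lake/raw/<source>/<yyyy-mm-dd>/, unchanged."""
    target_dir = "/lake/raw/{}/{}".format(source_name, date.today().isoformat())
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, target_dir], check=True)
    return target_dir

# Example: land today's click-stream extract exactly as it was received.
# land_raw_file("clicks-2014-12-15.json", "clickstream")
```

No cleansing and no schema enforcement happens here; the governance is the directory convention itself, plus whatever access control (Knox, HDFS permissions) sits in front of the cluster.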

Without offering any concrete alternative, Gartner seems to say that a new layer (call it a data refinery if you like) is needed to make sense of this ‘raw’ data, thus heading back to the ETL days of data warehousing. Gartner loves to scare clients about new technology (so that they seek help, for a fee) and would have everyone stay with the classic data warehousing business. This is not in line with the Big Data movement, which involves some risk, as any new technology always does.

A Eulogy for my professor Dr. H. K. Kesavan (University of Waterloo, Canada)

I just heard that my professor from graduate school days at the University of Waterloo, Dr. H. K. Kesavan, passed away earlier today (Nov 26, 2014). He was 88 years old. I was his student in the early 1970s. During his illustrious career at the University of Waterloo, numerous students earned their Master’s and Ph.D. degrees under his guidance. He was the founding chairman of a new interdisciplinary department called Systems Design Engineering, where I received my post-graduate education.

I had my undergraduate degree in mechanical engineering, and this new discipline fascinated me at the time. Its main theme was the use of computers to simulate a wide variety of complex systems, and it drew students from all sorts of engineering disciplines. His positive encouragement was the reason I joined such an unusual field back then. There were civil engineers modeling water distribution systems, and electrical engineers modeling power distribution networks. He wrote a book called Analysis of Discrete Physical Systems, based on linear graph theory, and we applied that theory in writing programs for complex networks.

The University of Waterloo was a pioneering school in Canada in the use of early computers, and the WATFOR and WATFIV compilers came from there during the 1960s. The IBM data center was quite a showpiece for visitors, with all the latest System/370 mainframes. Dr. Kesavan got his Ph.D. from Michigan State University back in 1959, where he also taught for a few years. He moved to India during 1964-68 to be the chairman of the electrical engineering department at IIT Kanpur. After returning to the University of Waterloo, he served there from 1968 until his retirement in 1991, and continued as professor emeritus until the end.

During my student days, he would often invite us to his house for dinner; his affection for his students was deep and genuine. Many years later, I ran into him and his wife at the Toronto airport. His joy knew no bounds, and he complimented me on my professional success. His work spanned many areas, from systems theory and linear graph theory to entropy optimization.

As his soul rests in eternal abode, I salute my teacher with reverence.

Data Landscape at Facebook

What does the data landscape look like at Facebook, with its 1.3 billion users across the globe? They classify ‘small data’ as OLTP-like queries that process and retrieve a small amount of data, usually 1-1,000 objects requested by their IDs; indexes limit the amount of data accessed during a single query, regardless of the total volume. ‘Big data’ refers to queries that process large amounts of data, usually for analysis: troubleshooting, identifying trends, and making decisions. The total volume of data is massive for both small and big data, ranging from hundreds of terabytes to hundreds of petabytes on disk.

The heart of Facebook’s core data is TAO (The Associations and Objects), a distributed data store for the social graph. The workload on it is extremely demanding. Every time any one of over a billion active users visits Facebook through a desktop browser or on a mobile device, they are presented with hundreds of pieces of information from the social graph. Users see News Feed stories; comments, likes, and shares for those stories; photos and check-ins from their friends – the list goes on. The high degree of output customization, combined with the high update rate of a typical user’s News Feed, makes it impossible to generate the views presented to users ahead of time. Thus, the data set must be retrieved and rendered on the fly within a few hundred milliseconds. TAO provides access to tens of petabytes of data, but answers most queries by checking a single page on a single machine. The challenge here is how to optimize for mobile devices, which have intermittent connectivity and higher network latencies than most web clients.

Big data stores are the workhorses for data analysis: they grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores are:

  • ODS (Operational Data Store) holds 2 billion time series of counters (used mostly in alerts and dashboards) and processes 40,000 queries per second.
  • Scuba is the fast slice-and-dice data store, with 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second.
  • Hive is the data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
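
For a sense of what “queries over Hive” look like in practice, here is a small illustrative example using the open-source PyHive client. The host, table, and column names are made up; this is a generic daily-rollup query of the sort such engines serve, not anything from Facebook’s actual warehouse.

```python
# Hypothetical analytical query against a Hive warehouse via the PyHive client.
# Host, table, and column names are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hive-gateway.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Count events per day for one week, assuming a date-partitioned table.
cursor.execute("""
    SELECT ds, COUNT(*) AS events
    FROM raw_click_events
    WHERE ds BETWEEN '2014-12-01' AND '2014-12-07'
    GROUP BY ds
    ORDER BY ds
""")

for ds, events in cursor.fetchall():
    print(ds, events)

cursor.close()
conn.close()
```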

What is interesting to note here is that ODS, Scuba, and Hive are not traditional relational databases. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage and retrieval. Instead, the challenge comes from high data insertion rates and massive data volumes.

Facebook represents the new generation of Big Data users, demanding innovative solutions not seen in the traditional database world. A big challenge is how to accommodate the seismic shift to mobile devices, with their peculiar intermittent network connectivity.

No wonder Facebook hosted a data faculty summit last September, with many well-known academic researchers from around the country, to find answers to its future data challenges.

Industry Bifurcation?

Apple reported record sales in the most recent quarter, thanks to its upgraded iPhone line. But it was almost alone among the big technology firms in doing so. Profits reported for the same period have fallen at Google as well as at IBM, SAP, and VMware.

IBM has done more financial engineering than the real kind in recent years, with $100 billion in buybacks of its own shares since 2000. It has shed less profitable assets, but now lacks a big, fast-growing business, and it even started layoffs in India recently. Earnings at VMware dropped because of a recent acquisition. Some firms are having difficulty adjusting to new trends such as cloud computing. SAP and Oracle are seeing more of their business shift to the cloud; that requires big investments in data centers and yields lower margins, at least initially. Workday, a cloud-based HR products company, has been enticing away existing customers of SAP and Oracle thanks to its lower cost and better usability.

Google, for example, is hit by the shift of users to mobile devices, where advertising rates are lower. Apple and even Yahoo are taking advantage with their shift to mobile-advertising sales (Yahoo got 17% of its $1.1 billion in revenue from these sales in the past quarter). More fundamentally, the IT industry is maturing and seeing slow revenue growth of only 3% (according to Boston Consulting Group). The biggest sectors, such as hardware, business software, and IT services, are growing slowly or even shrinking.

This is the “bifurcation” (between new emerging trends and old traditional segments) that will lead to a big restructuring of the industry. HP recently talked about breaking up its business into separate entities. IBM continues to shed unprofitable businesses. Others are buying their way in: SAP acquired Concur, a web-based travel and expense management company, for $8.3 billion! Oracle continues to acquire more and more companies with cloud technology for the growth of its cloud business. During Oracle OpenWorld in late September, the whole theme centered on the cloud and, to some degree, mobility.

The recent disappointing results are another harbinger of an unbundling and rebuilding of the IT industry. It is hard to tell what the new landscape will look like at the end of the process. But Apple continues to prove that bold innovations and winning users’ mindshare can bring big rewards. The older incumbents must learn a few lessons from Apple’s success.

Fast Data vs. Big Data – how to combine?

Today, all the discussion of Big Data centers on “static data” in a data lake (the old data warehouse) accessed by BI tools, SQL-on-Hadoop engines (HAWQ, Impala), or MapReduce jobs for analysis. This is looking at historical data and finding trends. Some new tools are trying to provide predictive analysis based on past trends. This area deals mostly with the volume and variety aspects of Big Data, but not with velocity, or “data in motion”.

The term “Fast Data” is applied to data in motion. This component is getting more and more significant, as there is a constant stream of data coming from edge devices such as sensors, smart phones, and connected devices. As these devices proliferate (10 billion now, heading toward 50 billion in a few years, according to market analysts), there will be a data explosion that is not going to be addressed by current Big Data products and tools. What is needed is capturing this data at the ingestion points, storing and managing it efficiently, and doing real-time analytics for faster decisions. Streaming data has been around for a while, but here we are talking about two-way sensors where constant feedback and aggregation are needed. For example, smart meters in the utilities industry can provide readings from individual homes, but aggregating them at the transformer level is important for predicting seasonality and other trends (a small sketch of such an aggregation follows below). With fast data, things that were not possible before become achievable: instant decisions can be made on real-time data to drive sales, connect with customers, inform business processes, and create value.
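
As a toy illustration of that smart-meter example, here is a minimal sketch of rolling up per-home readings to the transformer level in fixed 15-minute windows. The field names and sample data are hypothetical, and a production system would use a streaming engine rather than an in-process loop.

```python
# Minimal sketch: aggregate per-home smart-meter readings to the transformer level
# in fixed 15-minute windows. Field names and sample values are hypothetical.
from collections import defaultdict
from datetime import datetime, timedelta

def window_start(ts, minutes=15):
    """Floor a timestamp to the start of its fixed-size window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def aggregate_by_transformer(readings, minutes=15):
    """Sum kWh per (window start, transformer id) from per-home readings."""
    totals = defaultdict(float)
    for r in readings:
        key = (window_start(r["timestamp"], minutes), r["transformer_id"])
        totals[key] += r["kwh"]
    return totals

# Illustrative usage with two hypothetical readings on the same transformer:
sample = [
    {"meter_id": "m-1", "transformer_id": "t-7",
     "timestamp": datetime(2014, 12, 15, 10, 3), "kwh": 0.42},
    {"meter_id": "m-2", "transformer_id": "t-7",
     "timestamp": datetime(2014, 12, 15, 10, 9), "kwh": 0.35},
]
for (window, transformer), kwh in sorted(aggregate_by_transformer(sample).items()):
    print(window, transformer, round(kwh, 2))
```

Doing the same rollup continuously, emitting each window as it closes, is exactly the kind of work the stream-processing and in-memory technologies mentioned below are designed for.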

Fast data is the payoff for big data. While much can be accomplished by mining data to derive insights that enable a business to grow and change, looking into the past provides only hints about the future. Simply collecting vast amounts of data for exploration and analysis will not prepare a business to act in real time, as data flows into the organization from millions of endpoints. The IoT (Internet of Things) makes this much more significant.

Enterprises have to figure out a combined architecture for Fast Data as well as Big Data. Streams of data from edge devices will eventually migrate to the data lake, but much real-time analysis has to happen before that. Technologies such as in-memory databases and complex event processing are needed to handle the performance requirements. This space is still new, and much more work is needed in the area of real-time analytics. The older OLTP systems will be inadequate to handle the demands of ingestion and analysis at an affordable cost.

It is time to look at the world of data with a wider lens than just Hadoop-centric Big Data!