Lambda Architecture

I attended a Meetup yesterday in Mountain View, hosted by The Hive group, on the subject of Lambda Architecture. I had never heard the phrase before, so curiosity took me there. The panel discussion featured speakers from Hortonworks, Cloudera, MapR, Teradata, and others.

Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data based on his experience working on distributed data processing systems at Twitter. Some of the key requirements in building this architecture include:

  • Fault-tolerance against hardware failures and human errors
  • Support for a variety of use cases that include low latency querying as well as updates
  • Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
  • Extensibility so that the system is manageable and can accommodate newer features easily

The following picture summarizes the framework.

Overview of the Lambda Architecture

The Lambda Architecture as seen in the picture has three major components.

  1. Batch layer: provides the following functionality
    1. managing the master dataset, an immutable, append-only set of raw data
    2. pre-computing the results of arbitrary query functions; these results are called batch views
  2. Serving layer: indexes the batch views so that they can be queried ad hoc with low latency.
  3. Speed layer: handles all requests that are subject to low-latency requirements. Using fast, incremental algorithms, it deals with recent data only.
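
To make the three layers concrete, here is a minimal single-process sketch in Python. It is only an illustration under my own assumptions, not Nathan Marz's implementation: the `Event` schema, the `LambdaSketch` class, and the page-view counting query are all hypothetical, and a real system would spread these roles across separate distributed components.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable raw fact, e.g. a page view (hypothetical schema)."""
    page: str
    timestamp: int


@dataclass
class LambdaSketch:
    # Batch layer: the master dataset, append-only raw events.
    master: list = field(default_factory=list)
    # Serving layer: the precomputed batch view, indexed by page.
    batch_view: Counter = field(default_factory=Counter)
    # Speed layer: incremental counts for events the batch view has not seen yet.
    speed_view: Counter = field(default_factory=Counter)

    def ingest(self, event: Event) -> None:
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(event)
        self.speed_view[event.page] += 1

    def run_batch(self) -> None:
        """Recompute the batch view from scratch over the whole master dataset.

        Once the batch view has absorbed everything, the speed view
        for that data can be thrown away.
        """
        self.batch_view = Counter(e.page for e in self.master)
        self.speed_view.clear()

    def query(self, page: str) -> int:
        """Answer a query by merging the batch view with the speed view."""
        return self.batch_view[page] + self.speed_view[page]
```

The point of the design is that `run_batch` never updates anything in place: it always recomputes from the raw data, so a bug in the view logic can be fixed and the view rebuilt, while `ingest` keeps answers fresh in between batch runs.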

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach.
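
The "two code bases" problem can be seen even in a toy case. The sketch below is my own illustration (the function and class names are hypothetical): the same metric, an average, has to be written once as a batch recomputation and once as an incremental update, and someone has to keep verifying that the two agree.

```python
import math


def batch_average(values):
    """Batch path: recompute the metric from the full dataset."""
    return sum(values) / len(values) if values else 0.0


class StreamingAverage:
    """Streaming path: the same metric, maintained incrementally."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

    def result(self):
        return self.total / self.count if self.count else 0.0


def paths_agree(values):
    """Check that both implementations give the same answer for one input."""
    stream = StreamingAverage()
    for v in values:
        stream.update(v)
    return math.isclose(stream.result(), batch_average(values))
```

For an average this is trivial, but every change to the business logic now has to be made twice, and a check like `paths_agree` has to be kept alive for every metric.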

The panelists dwelt on details without addressing the real challenge of combining two very different approaches, which risks compromising the benefits of stream processing with the added latency of the batch world. However, there is merit in trying to unify the two disparate worlds in a common framework. Real deployments will be the proof point.

Amazing Apple with its record-breaking earnings!

Yesterday Apple disclosed its financial results for the fourth calendar quarter of 2014 (its fiscal first quarter), and they were simply astounding, sending Wall Street analysts scrambling after their way-off forecasts. In the last quarter of 2014, Apple made a stunning profit of $18B (38% growth from a year ago) on revenue of $74.6B. This is the largest quarterly profit ever made by any company! During the quarter iPhone sales reached 75m units, with hefty growth in the Chinese market, and iPhone revenue reached $51.18B (nearly 70% of total revenue). This is more than the combined revenue of Google and Microsoft for that quarter. As Tim Cook said, they sold 34,000 iPhones every hour, 24 hours a day, every day of the quarter. Apple is now the most valuable company on the planet!
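
The per-hour and revenue-share figures are easy to sanity-check with a few lines of Python. The 91-day quarter length is my assumption (a 13-week quarter), not a number from Apple's release; the other inputs are the rounded figures quoted above.

```python
# Figures quoted in this post (rounded).
IPHONES_SOLD = 75_000_000      # units sold in the quarter
IPHONE_REVENUE_B = 51.18       # iPhone revenue, $B
TOTAL_REVENUE_B = 74.6         # total revenue, $B
PROFIT_B = 18.0                # net profit, $B

# Assumption: a 13-week quarter, i.e. 91 days.
DAYS_IN_QUARTER = 91

hours_in_quarter = DAYS_IN_QUARTER * 24
iphones_per_hour = IPHONES_SOLD / hours_in_quarter
iphone_share = IPHONE_REVENUE_B / TOTAL_REVENUE_B
net_margin = PROFIT_B / TOTAL_REVENUE_B

print(f"iPhones per hour: {iphones_per_hour:,.0f}")
print(f"iPhone share of revenue: {iphone_share:.1%}")
print(f"Net margin: {net_margin:.1%}")
```

With these inputs the per-hour rate comes out a little above 34,000, in line with the Tim Cook quote, and the iPhone share comes out just under 70%.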

The sales were fueled largely by the larger-screen iPhone 6 and 6 Plus, introduced last fall. They were priced higher, and demand exceeded supply, as usual. This growth again cut into tablet sales, which declined, clearly showing customers’ preference for large-screen smartphones with more memory. Interestingly, Mac sales also went up during the quarter.

Apple now sits on $178B in cash, and maintaining this growth will be a challenge. But the new Apple Watch is coming out in April, adding a new revenue stream. Hopefully it will be another successful product in a new category. What one must appreciate is Apple’s relentless excellence in quality and delivery, besides the unique design of its products. Unlike Samsung, Apple figured out how to crack the Chinese market, which has a huge appetite for large-screen smartphones such as the iPhone 6 and 6 Plus.

Hats off to Apple!

Big Data coverage at CES 2015

I saw more discussion of big data at CES 2015 this week than in previous years. Everyone talked about data as the central core of everything. The IoT (Internet of Things), NFC (Near Field Communication) and M2M (Machine to Machine) communication are enabling pieces for many industries – security monitoring, asset and inventory tracking, healthcare, process control, building environment monitoring, vehicle tracking and telemetry, in-store customer engagement and digital signage. Big data is the big deal here.

The Big Data ecosystem includes cloud computing, M2M/IoT, dumb terminal 2.0 (devices getting dumber: more cloud, less local storage, and more reliance on broadband access and high-quality displays), and analysis. The big data opportunity is projected to be a $200B business in 2015. Every company must build the big data ecosystem into its roadmap or risk being left behind. The key here is not the technology, but its business value.

The progression goes like this: Big Data -> Big Info -> Big Knowledge -> Big Insight. For example, Big Data says 60 (not much meaning), then Big Info says “Steve is 60”, adding context. Then Big Knowledge says “Steve can’t hear very well”, followed by a Big Insight like “maybe we give Steve a hearing aid”, an actionable item. So we go from Big Data to Big Insight, which is where the data becomes truly useful. Several industry examples can be given:

  • Retail iBeacon technology – Apple’s technology allows smartphones to be tracked geographically. This will provide vector info about shoppers and hence allow for a predictive service experience in combination with smart mirrors.
  • Insurance companies – by collecting information on drivers’ behavior, premiums can be adjusted for each individual.
  • Medical event tracking – big data has a crucial role here, providing relevant information for each patient.
  • Asset tracking in oil fields can help reduce costs and increase efficiency.
  • Smart cities – as with San Francisco’s parking system SFPARK, every sensor-based parking space can be used efficiently. You can use your smartphone to find available parking quickly.
  • and many more.

Big Data is the heart of it all – efficiently ingest, store, process, and manage unstructured data and provide meaningful analysis. Using an oil industry analogy, in the next 3-5 years we will see Big Data as the crude oil and Analytics as the new refinery.

From CES 2015 – Disruptive Technologies over next five years

This year’s CES expects 160,000 attendees, and tonight’s keynote by Samsung CEO BK Yoon was titled “Unlocking Infinite Possibilities of IoT”. The Internet of Things seems to be the overall theme this year.

Today I listened to an interesting panel on disruptive technologies over the next five years. Here is a brief summary.

  1. 3D Printing: About 300,000 desktop 3D printers are expected in the US this year. Mainstream consumer adoption is doubtful. Someone jokingly said that you can build a statue of yourself and install it in your yard. Another term for 3D printing is additive manufacturing. Most likely it will be adopted by small businesses providing repair services (by building plastic parts for a washing machine, for example). Many such 3D printing devices are on display.
  2. Wearables: This is a diverse market for connecting the unconnected ($2B market). Healthcare seems to lead usage through health and fitness wearables, such as the Apple Watch. There are two kinds of value: the quantified self (with context) and notifications (of relevance). This technology will be quite disruptive over the next five years. Apple showed what a wearable can be in its announcement last year. If it can galvanize the developer community, huge value will be realized. Just as many PC functions migrated to the smartphone, we will see a similar migration of smartphone functions to the wearable (e.g. notifications, alerts, short messages).
  3. Drones: This is similar to 3D printing, with questionable mass adoption. Serious adoption may take place over the next ten years. Immediate applications may be video photography and surveillance. There are many regulatory and policy hurdles to clear before drones can be mainstream.
  4. Self-driving cars: Here engineering is way ahead of the policy curve. While full adoption may not happen in the near future, semi-autonomous systems can help with tasks that can be turned over to the car, such as self-parking and adaptive cruise control. The panel felt that the next five years will be the “preparation phase” and adoption will come in ten years.

Other topics covered included the huge growth in Internet users, from 2B now to over 5B. This will bring new cultural, political and economic ramifications. Smartphones will continue to be disruptive, with newer and newer uses across the world impacting our daily lives. Robotics, especially home robots doing several tasks, will become relevant.

The big question was the ownership of the data created by all these devices. This year’s CES has a bigger presence of automobile companies, and both BMW and Mercedes-Benz executives appeared in keynotes. The connected home and the connected car have a bigger presence.

Musings on 2014

Another year comes to a close. What did we see as significant technology events? In the disruption category, we saw Uber valued at $41B even with all its issues in the news. When you disrupt an entrenched business such as taxi service, it is only natural that resistance will follow. But consumers like me love the value-added service from Uber. This is unstoppable, as is evident from investors’ confidence in providing $1.2B in funding. Also in this category, companies like Snapchat, Instagram, Airbnb, Instacart, and others made good progress. Re-imagination is the catchword here. See my blog on that topic.

Big Data continued to gain momentum in 2014. Hortonworks (Hadoop packaging) had its IPO. Cloudera continued its momentum. NoSQL products like MongoDB and DataStax (Cassandra) moved into mainstream enterprise deployment. The first MongoDB World summit in New York City in June saw 2,000 attendees, not bad for a six-year-old company. VoltDB made many claims about real-time, in-memory stream processing. Phrases like Data Lake and Data Refinery entered our lexicon. Data stored in its native format and extracted for analysis became a hot discussion point. The incumbents like Oracle, IBM, HP and Microsoft were not sitting idle. They all introduced their NoSQL and Hadoop offerings, besides the data warehouse appliances (e.g. IBM Netezza, Oracle Exadata and Exalytics, HP Vertica, EMC Greenplum, etc.). SQL interfaces for Hadoop took center stage with several offerings. The space got more confusing with so many products and vendors. Personally, I spoke at several conferences on how to look at the broad landscape and make some sense of it, so that customers do not equate Big Data with just Hadoop. Analytics is another hot space, where meaningful information can be extracted to impact business decisions. Here we have a long way to go, but this space will certainly grow fast in 2015, with increasing demand for data scientists and data engineers.

Cloud computing inched forward on the maturity curve. Oracle made several announcements at its OpenWorld conference. It continues to acquire new companies (e.g. Datalogix last week) to gain a better foothold in cloud-based solutions. Even its latest quarterly results showed significant growth in cloud product revenue. IBM also pushed cloud in a big way, and so did Microsoft under its new CEO. The Azure cloud solution is starting to gain customer acceptance as a good alternative to Amazon’s AWS. GCE (Google Compute Engine) has yet to make an impact on the enterprise market, but is making headway.

The big news from Apple was the introduction of the Apple Watch. Wearable computing is coming in a big way and Apple’s product will be available in 2015. I am heading off to CES (Consumer Electronics Show) next week in Las Vegas to see firsthand all these new gadgets for connecting homes, cars, etc. – the real Internet of Things (IoT). At the first IoT Expo in San Francisco, I spoke on the topic “Data – the oxygen of IoT”. IoT makes big data even more critical.

Overall, 2014 was another exciting year in technology for consumers. The enterprise space continues to struggle with injecting new technology such as cloud and mobility into its archaic applications and systems. I am hoping this will pick up momentum in 2015.

2014 in review

The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

A New York City subway train holds 1,200 people. This blog was viewed about 5,400 times in 2014. If it were a NYC subway train, it would take about 5 trips to carry that many people.

The world of Hadoop and beyond

Hortonworks went through an IPO last Friday, December 12, 2014. Its initial price of $16 soared by 60% immediately after. Today the stock price is $24.70, with a market cap of $1.02B. Another member of the billion-dollar club. They compete with Cloudera and MapR in packaging the open source Hadoop platform for customers. How do they make money when the basic Apache Hadoop is free? By offering added services like training and consulting. They can also add auxiliary products (not open source) that customers must pay for. The interesting fact is that Hortonworks’s CEO had claimed $100m in revenue for this year, but it looks like the company is well short of that, with $33m during the first nine months. The future is quite uncertain!

In the meantime, several new start-ups have come up in the Hadoop-sphere:

  • Databricks with venture funding of $47m so far (investors include Andreessen Horowitz, with Ben Horowitz on the board). This is a Berkeley-based company that delivers Spark. Apache Spark is a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010.
  • Altiscale with $42m in venture funding (the founder is ex-Yahoo CTO Raymie Stata; Sequoia Capital is among the investors). This company promises to deliver Hadoop in the cloud, relieving customers of the complexities of Hadoop cluster management, which can be highly non-trivial.
  • Splice Machine with venture funding of $22m (Mohr Davidow, InterWest). They claim to bring an RDBMS face to Hadoop, so that SQL programs may not have to change, clearly targeting incumbents such as Oracle. Interestingly, Oracle does provide a similar function with its universal Oracle API.
  • Metanautix with venture funding from Sequoia and Stanford Endowment fund plus other individuals. The founders are from Google and Facebook and it is an implementation of Google’s Dremel idea. This will completely supplant Hadoop.

So there we go, a line-up of companies besides Cloudera, MapR, and Hortonworks. Out of this chaos many are going to disappear and only a couple will survive, as always. This still addresses the batch-oriented data analytics space, even though Mike Olson of Cloudera thinks Hadoop will replace the OLTP DBMS some day.

A polyglot mix of data management solutions will be with us for some time.