Big Data Visualization

Recently I listened to a discussion on Big Data Visualization hosted by Bill McKnight of the McKnight Consulting Group. The panelists agreed that Big Data is shifting from the hype state to an “imperative” state. Start-up companies are running more Big Data projects, whereas true big data is still a small part of enterprise practice. At many companies, Big Data is moving from POC (Proof of Concept) to production. Interest in visualizing data from different sources is certainly increasing. Data-driven decision-making is growing, as evidenced by the increasing use of platforms like YARN, Hive, and Spark. The traditional RDBMS approach cannot scale to meet the needs of the rapidly growing volume and variety of Big Data.

So what is the difference between Data Exploration and Data Visualization? Data exploration is more analytical and is used to test hypotheses, whereas visualization is used to profile data and is more structured. The suggestion is to bring visualization to the beginning of the data cycle (not the end) to do better data exploration. For example, in personalized cancer treatment, white blood cell counts and cancer cell data can be examined upfront using data visualization. In Internet e-commerce, billions of rows of data can be analyzed to understand consumer behavior. One customer uses Hadoop and Tableau’s visualization software to do this. Tableau enables visualization of all kinds of data sources across three scenarios – cold data from a data lake on Hadoop (where source data resides in its native format); warm data from a smaller set of data; or hot data served in-memory for faster processing.

Data format can be a challenge. How do you visualize NoSQL data? For example, JSON data (supported by MongoDB) is nested and schema-less, which makes it hard for BI tools to handle. Understanding the data is crucial, and nested hierarchies will need to be flattened. Nested arrays can be broken out into separate tables linked by foreign keys. Graph data is another special case, where showing the right amount of graph data is critical for a good UX.
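To make the flattening idea concrete, here is a minimal Python sketch of how a nested JSON document might be turned into flat rows that a BI tool can read. It is only an illustration under my own assumptions: the function names `flatten` and `split_arrays`, the dotted column names, the `parent_id` foreign-key column, and the sample order document are all hypothetical and do not come from any particular product.

```python
from typing import Any


def flatten(doc: dict[str, Any], parent_key: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into dotted column names; lists are left untouched."""
    flat: dict[str, Any] = {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def split_arrays(doc: dict[str, Any], id_field: str = "_id"):
    """Return (parent_row, child_tables).

    Each nested array becomes a list of child rows that carry a
    hypothetical 'parent_id' column pointing back at the parent document.
    """
    parent: dict[str, Any] = {}
    children: dict[str, list[dict[str, Any]]] = {}
    for column, value in flatten(doc).items():
        if isinstance(value, list):
            rows = []
            for item in value:
                # Scalars in an array get a single 'value' column.
                row = flatten(item) if isinstance(item, dict) else {"value": item}
                row["parent_id"] = doc[id_field]  # foreign key to the parent row
                rows.append(row)
            children[column] = rows
        else:
            parent[column] = value
    return parent, children


# Hypothetical sample document, shaped like something a document store might hold.
order = {
    "_id": 1,
    "customer": {"name": "Ann", "city": "Oslo"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
parent_row, child_tables = split_arrays(order)
print(parent_row)
print(child_tables["items"])
```

The sketch only splits one level of arrays; arrays nested inside array items would need the same treatment applied recursively.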

Apache Drill is an open source, low latency SQL query engine for Hadoop and NoSQL. Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. Apache Drill is built from the ground up to provide low latency queries natively on such rapidly evolving multi-structured datasets at scale.

Apache Spark is another exciting new approach that speeds up queries by utilizing memory. It consists of Spark SQL (SQL-like queries), Spark Streaming, MLlib, and GraphX. It supports Python, Scala, and Java for processing. It enables users of Hadoop to have more fun with data analysis and visualization.

Big Data Visualization is emerging as a critical component for extracting business value from data.

Lambda Architecture

I attended a Meetup yesterday in Mountain View, hosted by The Hive group on the subject of Lambda Architecture. Since I had never heard about this new phrase, my curiosity took me there. There was a panel discussion and panelists came from Hortonworks, Cloudera, MapR, Teradata, etc.

Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data based on his experience working on distributed data processing systems at Twitter. Some of the key requirements in building this architecture include:

  • Fault-tolerance against hardware failures and human errors
  • Support for a variety of use cases that include low latency querying as well as updates
  • Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
  • Extensibility so that the system is manageable and can accommodate newer features easily

The following picture summarizes the framework.

Overview of the Lambda Architecture

The Lambda Architecture as seen in the picture has three major components.

  1. Batch layer that provides the following functionality
    1. managing the master dataset, an immutable, append-only set of raw data
    2. pre-computing arbitrary query functions, called batch views.
  2. Serving layer: This layer indexes the batch views so that they can be queried ad hoc with low latency.
  3. Speed layer: This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
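
To show how the three layers fit together, here is a toy, single-process Python sketch that counts page views. It is not how any real deployment works; the class name `LambdaSketch`, its methods, and the page-view example are my own assumptions, and in-memory lists and counters stand in for the distributed storage and processing systems a real system would use.

```python
from collections import Counter


class LambdaSketch:
    """Toy, in-memory illustration of the batch, serving, and speed layers."""

    def __init__(self) -> None:
        self.master: list[str] = []      # batch layer: append-only raw events
        self.batch_view = Counter()      # serving layer: precomputed batch view
        self.realtime_view = Counter()   # speed layer: recent events only

    def ingest(self, page: str) -> None:
        self.master.append(page)         # raw data is only ever appended
        self.realtime_view[page] += 1    # fast incremental update

    def run_batch(self) -> None:
        # Recompute the view from scratch over the whole master dataset,
        # then drop the recent data that the batch view now covers.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, page: str) -> int:
        # A query merges the batch view with the real-time view.
        return self.batch_view[page] + self.realtime_view[page]


system = LambdaSketch()
for page in ["home", "cart", "home"]:
    system.ingest(page)
print(system.query("home"))   # answered from the speed layer alone
system.run_batch()
system.ingest("home")
print(system.query("home"))   # batch view plus one recent event
```

Even in this toy version, the counting logic lives in two places (`ingest` and `run_batch`), and both must agree for queries to be correct.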

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach.

The panelists rambled on about details without addressing the real challenges of combining two very different approaches, which compromises the benefits of streaming with the added latency of the batch world. However, there is merit to the idea of unifying the two disparate worlds into a common framework. Real deployments will be the proof point.

Amazing Apple with its record-breaking earnings!

Yesterday Apple disclosed its financial results for the fourth calendar quarter, and they were simply astounding, sending Wall Street analysts scrambling with their way-off forecasts. In the last quarter of 2014, Apple made a stunning profit of $18B (38% growth from a year ago) on a revenue base of $74.6B. This is the largest quarterly profit made by any company ever! During that quarter iPhone sales reached 75m, with hefty growth in the China market, and iPhone revenue reached $51.18B (about 70% of total revenue). This is more than Google and Microsoft revenue combined for that quarter. As Tim Cook said, they sold 34,000 iPhones every hour, 24 hours a day, every day of the quarter. Apple is now the most valuable company on the planet!

The sales were fueled largely by the larger-screen iPhone 6 and 6 Plus, introduced last fall. They were priced higher, and demand exceeded supply, as usual. This growth again hurt tablet sales, which declined, clearly showing customers’ preference for large-screen smartphones with more memory. Interestingly, Mac sales also went up during the quarter.

Apple sits on $187B in cash now, and maintaining this growth will be a challenge. But the new Apple Watch is coming out in April, adding a new revenue stream. Hopefully it will be another product success in a new category. What one must appreciate is the relentless excellence in quality and delivery, besides the unique design of Apple’s products. Unlike Samsung, Apple figured out how to crack the China market, which has a huge appetite for large-screen smartphones such as the iPhone 6 and 6 Plus.

Hats off to Apple!

Big Data coverage at CES 2015

I saw more discussion on big data at CES 2015 this week, compared to previous years. Everyone talked about data as the central core of everything. The IoT (Internet of Things), NFC (Near Field Communication) and M2M (Machine to Machine) communication are enabling pieces for many industries – security monitoring, asset and inventory tracking, healthcare, process control, building environment monitoring, vehicle tracking and telemetry, in-store customer engagement and digital signage. Big data is the big deal here.

The Big Data ecosystem includes – cloud computing, M2M/IoT, dumb terminal 2.0 (devices getting dumber – more cloud, better broadband, less about storage and more about broadband access and high quality display), and analysis. The big data opportunity is slated to be a $200B business in 2015. Every company must insert the big data ecosystem into their future roadmap or get left out. The key here is not the technology, but its business value.

The progression goes like this: Big Data -> Big Info -> Big Knowledge -> Big Insight. For example, Big Data says 60 (not much meaning), then Big Info says “Steve is 60”, adding context. Then Big Knowledge says “Steve can’t hear very well”, followed by Big Insight like “maybe we give Steve a hearing aid”, an actionable item. So we go from Big Data to Big Insight, which is far more useful. Several industry examples can be given:

  • Retail iBeacon technology – Apple’s technology allows smartphones to be tracked geographically. This will provide vector info about shoppers and hence allows for predictive service experience in combination with smart mirrors.
  • Insurance companies – by collecting information on drivers’ behavior, premiums can be adjusted for each individual.
  • Medical event tracking – big data has a crucial role here, providing relevant information for each patient.
  • Asset tracking in oil fields can help reduce costs and increase efficiency.
  • Smart cities – with systems like San Francisco’s SFpark parking system, every sensor-equipped parking space can be used efficiently. You can use your smartphone to find available parking quickly.
  • and many more.

Big Data is the heart of it all – efficiently ingest, store, process, and manage unstructured data and provide meaningful analysis. Using an oil industry analogy, in the next 3-5 years we will see Big Data as the new crude oil and Analytics as the new refinery.

From CES 2015 – Disruptive Technologies over next five years

CES expects to have 160,000 attendees this year, and tonight’s keynote by Samsung CEO BK Yoon was on “unlocking infinite possibilities of IoT”. The Internet of Things seems to be the overall theme this year.

Today I listened to an interesting panel on disruptive technologies over the next five years. Here is a brief summary.

  1. 3D Printing: This year is expected to see 300,000 desktop 3D printers in the US. Mainstream consumer adoption is doubtful. Someone jokingly said that you can build a statue of yourself and install it in your yard. Another term for 3D printing is additive manufacturing. Most likely it will be adopted by small businesses providing repair services (by building plastic parts for a washing machine, for example). Many such 3D printing devices are on display.
  2. Wearables: This is a diverse market of connecting the unconnected ($2B market). Healthcare seems to lead the usage via health and fitness wearables, such as the Apple Watch. There are two values – quantified self (with context) and notification bits (of relevance). This technology will be quite disruptive over the next five years. Apple explained what a wearable can be in their announcement last year. If they can galvanize the developer community, then huge value will be realized. Just as many PC functions migrated to the smartphone, we will see a similar migration of smartphone functions to the wearable (e.g. notifications, alerts, short messages).
  3. Drones: This is similar to 3D printing, with questionable mass adoption. Maybe over the next ten years, serious adoption will take place. Immediate applications may be video photography and surveillance. There are many regulatory and policy hurdles before drones can be mainstream.
  4. Self-driving cars: Here engineering is way ahead of the policy curve. While full adoption may not happen in the near future, semi-autonomous systems can help with tasks that can be turned over to the car, such as self-parking and adaptive cruise control. The panel felt that the next five years will be the “preparation phase” and adoption will come in ten years.

Other topics covered included the huge growth in Internet users, from 2B now to over 5B. This will bring new cultural, political and economic ramifications. Smartphones will continue to be disruptive, with newer and newer uses across the world impacting our daily lives. Robotics, especially home robots doing several tasks, will become relevant.

The big question was about the ownership of data created by all these devices. This year’s CES has a bigger presence of automobile companies, and both BMW and Mercedes-Benz executives appeared in keynotes. The connected home and the connected car have a bigger presence.

Musings on 2014

Another year comes to a close. What did we see as significant technology events? In the disruption category, we saw Uber getting valued at $41B even with all its issues in the news. When you disrupt an entrenched business such as taxi service, it is only natural that resistance will happen. But consumers like me love the value-added service from Uber. This is unstoppable, as is evident from the investors’ confidence in providing $1.2B in funding. Also in the disruption category, companies like Snapchat, Instagram, Airbnb, Instacart, and others made good progress. Re-imagination is the catchword here. See my blog on that topic.

Big Data continued to gain momentum in 2014. We saw Hortonworks (Hadoop packaging) have its IPO. Cloudera continued its momentum. NoSQL products like MongoDB and DataStax (Cassandra) moved into mainstream enterprise deployment. The first MongoDB World summit in New York City in June saw 2000 attendees, not bad for a six-year-old company. VoltDB made lots of claims about real-time, in-memory stream processing. Phrases like Data Lake and Data Refinery entered our lexicon. Data stored in its native format and extracted for analysis became a hot discussion point. Incumbents like Oracle, IBM, HP and Microsoft were not sitting idle. They all introduced their NoSQL and Hadoop offerings, besides the data warehouse appliances (e.g. IBM Netezza, Oracle Exadata and Exalytics, HP Vertica, EMC Greenplum, etc.). SQL interfaces for Hadoop took center stage with several offerings. The space got more confusing with so many products and vendors. Personally, I spoke at several conferences on how to look at the broad landscape and make some sense of it, so that customers do not equate Big Data with just Hadoop. Analytics is another hot space, where meaningful information can be extracted to impact business decisions. Here we have a long way to go, but this space will certainly grow fast in 2015, with increasing demand for data scientists and data engineers.

Cloud computing inched forward on the maturity curve. Oracle made several announcements at their OpenWorld conference. They continue to acquire new companies (e.g. Datalogix last week) to gain a better foothold in cloud-based solutions. Even their last quarterly financial results showed significant growth in cloud product revenue. IBM also pushed cloud in a big way, and so did Microsoft under its new CEO. The Azure cloud solution is starting to gain customer acceptance as a good alternative to Amazon’s AWS. GCE (Google Compute Engine) is yet to impact the enterprise market, but is making headway.

The big news from Apple was the introduction of the Apple Watch. Wearable computing is coming in a big way and Apple’s product will be available in 2015. I am heading off to CES (Consumer Electronics Show) next week in Las Vegas to see firsthand all these new gadgets for connecting homes, cars, etc. – the real Internet of Things (IoT). At the first IoT Expo in San Francisco, I spoke on the topic “Data – the oxygen of IoT”. IoT makes big data even more critical.

Overall, 2014 was another exciting year in technology for consumers. The enterprise space continues to struggle with injecting new technology such as cloud and mobility into its old, archaic applications and systems. I am hoping this will pick up momentum in 2015.

2014 in review

The stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

A New York City subway train holds 1,200 people. This blog was viewed about 5,400 times in 2014. If it were a NYC subway train, it would take about 5 trips to carry that many people.