Category Archives: Hadoop

Splice Machine – What is it?

Those of you who have never heard of Splice Machine, don’t worry. You are in the company of many. So I decided to listen to a webinar last week that said the following in its announcement: learn about benefits of a modern IoT application platform that can capture, process, store, analyze and act on the large streams of data generated by IoT devices. The demonstration will include:

  • High Performance Data Ingestion
  • Analytics and Transformation on Data-In-Motion
  • Relational DBMS, Supporting Hybrid OLTP and OLAP Processing
  • In-Memory and Non-Volatile, Row-based and Columnar Storage mechanisms
  • Machine Learning to support decision making and problem resolution

That was a tall order. Gartner has a new term HTAP – Hybrid Transactional and Analytical Processing. Forrester uses “Translytical” to describe this platform where you could do both OLTP and OLAP. I had written a blog on Translytical database almost two years back. So I did attend the webinar and it was quite impressive. The only confusion was the liberal use of IoT in its marketing slogan. By that they want to emphasize “streaming data” (ingest, store, manage).

In Splice Machine’s website, you see four things: Hybrid RDBMS, ANSI SQL, ACID Transactions, and Real-Time Analytics. A white paper advertisement says, “Your IoT applications deserve a better data platform”. In looking at the advisory board members, I recognized 3 names – Roger Bamford, ex-Oracle and an investor, Ken Rudin, ex-Oracle, and Marie-Anne Niemet, ex-TimeTen. The company is funded by Mohr Davidow Ventures, and Interwest Partners amongst others.

There is a need for bringing together the worlds of OLTP (Transaction workloads) and Analytics or OLAP workloads into a common platform. They have been separated for decades and that’s how the Data Warehouse, MDM, OLAP cubes, etc. got started. The movement of data between the OLTP world and OLAP has been handled by ETL vendors such as Informatica. With the popularity of Hadoop, the DW/Analytics world is crowded with terms like Data Lake, ELT (first load, then transform), Data Curation, Data Unification, etc. A new architecture called Lambda (not to be confused with AWS Lambda for serverless computing) claims to unify the two worlds – OLTP and real-time streaming and analytics.

Into this world, comes Splice Machine with its scale-out data platform. You can do your standard ACID-compliant OLTP processing, data ingestion via Spark streaming and Kafka topics, query processing via ANSI SQL, and get your analytical workload without ETL. They even claim support of procedural language like PL/SQL for Oracle data. With their support of machine learning, they demonstrated predictive analytics. The current focus is on verticals like Healthcare, Telco, Retail, and Finance (Wells fargo), etc.

In the cacophony of Big Data and IoT noise, it is hard to separate facts from fiction. But I do see a role for a “unified” approach like Splice Machine. Again, the proof is always in the pudding – some real-life customer deployment scenarios with performance numbers will prove the hypothesis and their claim of 10x faster speed with one-fourth the cost.

A conference in Bangalore

I was invited to speak at a conference called Solix Empower 2017 held in Bangalore, India on April 28th, 2017. It was an interesting experience. The conference focused on Big Data, Analytics, and Cloud. Over 800 people attended the one-day event with keynotes and parallel tracks on wide-ranging subjects.

I did three things. First, I was part of the inaugural keynote where I spoke on “Data as the new Oxygen” showing the emergence of data as a key platform for the future. I emphasized the new architecture of containers and micro-services on which are machine learning libraries and analytic tool kits to build modern big data applications.

Then I moderated two panels. The first was titled, ” The rise of real-time data architecture for streaming applications” and the second one was called, “Top data governance challenges and opportunities”. In the first panel, the members came from Hortonworks, Tech Mahindra, and ABOF (Aditya Birla Fashion). Each member described the criticality of real-time analytics where trends/anomalies are caught on the fly and action is taken immediately in a matter of seconds/minutes. I learnt that for online e-commerce players like ABOF, a key challenge is identifying customers most likely to refuse goods delivered at their door (many do not have credit cards, hence there is COD or cash on delivery). Such refusal causes major loss to the company. They do some trend analysis to identify specific customers who are likely to behave that way. By using real-time analytics, ABOF has been able to reduce such occurrences by about 4% with significant savings. The panel also discussed technologies for data ingestion, streaming, and building stateful apps. Some comments were made on combining Hadoop/EDW(OLAP) plus streaming(OLTP) into one solution like the Lambda architecture.

The second panel on data governance had members from Wipro, Finisar, Solix and Bharti AXA Insurance. These panelists agreed that data governance is no longer viewed as the “bureaucratic police and hence universally disliked” inside the company and it is taken seriously by the upper management. Hence policies for metadata management, data security, data retirement, and authorization are being put in place. Accuracy of data is a key challenge. While organizational structure for data governance (like a CDO, chief data officer) is still evolving, there remains many hard problems (specially for large companies with diverse groups).

It was interesting to have executives from Indian companies reflect on these issues that seem no different than what we discuss here. Big Data is everywhere and global.

Hadoop, the next ten years

I attended a meetup yesterday evening at the San Jose Convention Center on the subject “Apache Hadoop, the next 10 years” by Doug Cutting, the creator of Hadoop while at Yahoo, who works at Cloudera now. That venue was chosen because of the ongoing Strata+Hadoop conference there.

It’s always fun listening to Doug recounting how Hadoop got created in the first place. Based on early papers from Google on GFS (Google File System) and Map Reduce computing algorithm, a project was launched called Nutch, subsequently renamed Hadoop (after Doug’s son’s toy elephant name). This all made sense as horizontal scaling via commodity hardware was coming to dominate the computing landscape. All the modules in Hadoop were designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. That was all back in 2006. As an open source project, Hadoop gained momentum with community support for the overall ecosystem. Over the next seven years, we saw many new additions/improvements such as YARN, Hbase, Hive, Pig, Zookeeper, etc. Hence, Doug wanted to emphasize that there is a difference between just Hadoop and the Hadoop ecosystem.

The original Hadoop with its Map Reduce computing had its limitations and lately Spark is taking over the computing part. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance. It originated at UC, Berkeley’s AMPlab and is gaining fast momentum with its added features for machine learning, streaming, graph and SQL interfaces. To a question from the audience, Doug replied that such enhancements are expected and more will come as the Apache Hadoop ecosystem grows. Cloudera has created Impala, a speedier version plus the SQL interface to meet customer needs. Another example of a key addition to the ecosystem is Kafka which originated from Linked-In. The Apache Kafka project is a message broker service and  aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. To another question on whether another general-purpose platform will replace Hadoop, Doug suggested that projects like Spark will appear to handle parts of the ecosystem better. There may be many purpose-built software to address specific needs like Kafka. He eloquently praised the “open Source” philosophy of community of developers helping faster progress compared to the speed at older companies like Oracle in enhancing its DBMS software.

From the original Hadoop meant for batch processing of large volumes of data in a distributed cluster, we are moving towards the real-time world of streaming analytics and instant insights. The popularity of Hadoop can be gauged by the growth in attendance of the San Jose Hadoop Summit…from 2700 attendees in 2013, it more than doubled last year.

Doug is a good speaker and his 40 minute talk was informative and entertaining.