A conference in Bangalore

I was invited to speak at a conference called Solix Empower 2017 held in Bangalore, India on April 28th, 2017. It was an interesting experience. The conference focused on Big Data, Analytics, and Cloud. Over 800 people attended the one-day event with keynotes and parallel tracks on wide-ranging subjects.

I did three things. First, I was part of the inaugural keynote, where I spoke on “Data as the new Oxygen” and showed the emergence of data as a key platform for the future. I emphasized the new architecture of containers and microservices, on top of which sit machine learning libraries and analytic toolkits for building modern big data applications.

Then I moderated two panels. The first was titled “The rise of real-time data architecture for streaming applications” and the second was called “Top data governance challenges and opportunities”. In the first panel, the members came from Hortonworks, Tech Mahindra, and ABOF (Aditya Birla Fashion). Each member described the criticality of real-time analytics, where trends and anomalies are caught on the fly and action is taken within seconds or minutes. I learnt that for online e-commerce players like ABOF, a key challenge is identifying customers most likely to refuse goods delivered at their door (many do not have credit cards, hence there is COD, or cash on delivery). Such refusals cause major losses to the company. They do some trend analysis to identify specific customers who are likely to behave that way. By using real-time analytics, ABOF has been able to reduce such occurrences by about 4%, yielding significant savings. The panel also discussed technologies for data ingestion, streaming, and building stateful apps. Some comments were made on combining Hadoop/EDW (OLAP) plus streaming (OLTP) into one solution, like the Lambda architecture.
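To make the Lambda idea mentioned at the end of the panel a little more concrete, here is a minimal sketch in plain Python. It is not anything the panelists showed; the class name `LambdaView`, its methods, and the customer IDs are all hypothetical, and in-memory counters stand in for what would really be a batch store and a streaming engine. It only illustrates the general pattern: a batch view recomputed from the full event log, a speed view covering events since the last batch run, and a query that merges the two.

```python
from collections import Counter


class LambdaView:
    """Toy Lambda-style event counter (hypothetical, for illustration only)."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed periodically from master_log
        self.speed_view = Counter()  # covers events since the last batch run

    def ingest(self, customer_id):
        # Every event goes to the master log and to the real-time view.
        self.master_log.append(customer_id)
        self.speed_view[customer_id] += 1

    def run_batch(self):
        # The batch layer rebuilds its view from the full log, so the
        # speed view can be discarded once the batch run has caught up.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, customer_id):
        # The serving side merges the batch and real-time views.
        return self.batch_view[customer_id] + self.speed_view[customer_id]


view = LambdaView()
for cid in ["c1", "c2", "c1"]:
    view.ingest(cid)
before = view.query("c1")   # answered from the speed view only
view.run_batch()
after = view.query("c1")    # answered from the batch view only
view.ingest("c1")
latest = view.query("c1")   # batch view plus one fresh event
```

The point of the design is that a query gets the same answer whether an event has been absorbed by the batch run yet or not.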

The second panel on data governance had members from Wipro, Finisar, Solix, and Bharti AXA Insurance. These panelists agreed that data governance is no longer viewed inside the company as the “bureaucratic police and hence universally disliked”, and that upper management now takes it seriously. Hence policies for metadata management, data security, data retirement, and authorization are being put in place. Accuracy of data is a key challenge. While the organizational structure for data governance (such as a CDO, or chief data officer) is still evolving, there remain many hard problems (especially for large companies with diverse groups).

It was interesting to have executives from Indian companies reflect on these issues, which seem no different from what we discuss here. Big Data is everywhere and global.

Hadoop, the next ten years

I attended a meetup yesterday evening at the San Jose Convention Center on the subject “Apache Hadoop, the next 10 years”, presented by Doug Cutting, who created Hadoop while at Yahoo and now works at Cloudera. The venue was chosen because of the ongoing Strata+Hadoop conference there.

It’s always fun listening to Doug recount how Hadoop got created in the first place. Based on early papers from Google on GFS (Google File System) and the MapReduce computing model, a project called Nutch was launched, and Hadoop was later spun out of it (the name comes from Doug’s son’s toy elephant). This all made sense, as horizontal scaling via commodity hardware was coming to dominate the computing landscape. All the modules in Hadoop were designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework. That was all back in 2006. As an open source project, Hadoop gained momentum with community support for the overall ecosystem. Over the next seven years, we saw many additions and improvements such as YARN, HBase, Hive, Pig, ZooKeeper, etc. Hence, Doug wanted to emphasize that there is a difference between just Hadoop and the Hadoop ecosystem.
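For readers who have not seen the MapReduce model mentioned above, here is a rough single-machine sketch in plain Python. It is not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration. It only shows the three steps that the framework distributes across a cluster: map each record to key/value pairs, group the pairs by key, then reduce each group to a result.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Run the three steps in sequence on one machine.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework reruns any task whose machine fails, which is where the failure-handling assumption described above comes in.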

The original Hadoop with its MapReduce computing had its limitations, and lately Spark is taking over the computing part. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It originated at UC Berkeley’s AMPLab and is quickly gaining momentum with its added features for machine learning, streaming, graph processing, and SQL interfaces. To a question from the audience, Doug replied that such enhancements are expected and more will come as the Apache Hadoop ecosystem grows. Cloudera has created Impala, a faster SQL query engine, to meet customer needs. Another example of a key addition to the ecosystem is Kafka, which originated at LinkedIn. The Apache Kafka project is a message broker service that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. To another question on whether another general-purpose platform will replace Hadoop, Doug suggested that projects like Spark will appear to handle parts of the ecosystem better. There may also be many purpose-built tools, like Kafka, that address specific needs. He eloquently praised the open source philosophy, in which a community of developers drives faster progress than older companies like Oracle manage in enhancing their DBMS software.

From the original Hadoop meant for batch processing of large volumes of data in a distributed cluster, we are moving towards the real-time world of streaming analytics and instant insights. The popularity of Hadoop can be gauged by the growth in attendance at the San Jose Hadoop Summit: from 2,700 attendees in 2013, it more than doubled last year.

Doug is a good speaker, and his 40-minute talk was informative and entertaining.