I attended a big event yesterday, organized by The Hive (mission: incubate, fund, and launch data-driven businesses). Close to 500 people attended in person, and several hundred more watched the live video stream. It was a panel discussion on the status of SQL APIs on Hadoop.
The panel members were Susheel Kaushik from Pivotal (the spin-off from EMC and VMware with GE as a key investor), Alan Gates (Hortonworks), Tomer Shiran (MapR), Justin Erickson (Cloudera), and Priyank Patel (Teradata Aster). The moderator was Raghu Ramakrishnan from Microsoft. The major question was: why is SQL on Hadoop attracting so much attention, and what is its current status?
Each panelist gave a three-minute talk on their company's initiative to bring SQL to Hadoop. Pivotal has a project called HAWQ that promises to expand the productivity and possibilities of Hadoop using existing SQL skill sets. Hortonworks’s Stinger project aims to improve Hive performance by 100x and to extend Hive SQL with features needed for analytics. MapR claims to have the broadest SQL support with its Apache Drill project. Cloudera’s Impala project offers interactive SQL (4-65x faster than Hive) plus SQL queries via HiveQL. Finally, Teradata SQL-H gives business users a better way to access data stored in Hadoop.
There was an animated debate on SQL standards. The speakers said they were prioritizing the functions users actually need, starting from the SQL-92 base level. It was clear that these efforts are not just about supporting a popular query language, but about opening up Hadoop data analysis via SQL. Several BI tools that use the SQL API can also take advantage of Hadoop data access. SQL is a starting point, not the end point. Existing skill sets are a big motivation for popularizing Hadoop in enterprises. If you are a start-up with no legacy to worry about, then SQL support is not a big deal. But enterprises have been using SQL for over two decades, and switching to something new is considered a big barrier.
The moderator pointed out that a variety of applications run on the data (which he called a digital shoebox store), such as SQL/Hive MR, stream processing, BI, and machine learning. Some of the big data coming from digital exhaust (logs) may require special analytic tools.
Overall, it was a good session and showed the general interest in Big Data and Hadoop. Interestingly, none of the incumbents (IBM, Oracle, HP, SAP) were there. It’s a new world!