Category Archives: Apache

Apache Drill + Arrow = Dremio

A new company just emerged from stealth mode yesterday, called Dremio, backed by Redpoint and Lightspeed in a Series A funding of $10m back in 2015. The founders came from MapR, but were active in Apache projects like Drill and Arrow. The same VC’s backed MapR and had the Dremio founders work out of their facilities during the stealth phase. Now the company has around 50 people in their Mountainview, California office.

Apache Drill acts as a single SQL engine that, in turn, can query and join data from among several other systems. Drill can certainly make use of an in-memory columnar data standard. But while Dremio was still in stealth, it wasn’t immediately obvious what Drill’s strong intersection with Arrow might be. But yesterday the company launched a namesake product that also acts as a single SQL engine that can query and join data from among several other systems, and it accelerates those queries using Apache Arrow. So it is a combo of (Drill + Arrow): schema-free SQL for variety of data sources plus a columnar in-memory analytics execution engine.

Dremio believes that BI today involves too many layers. Source systems, via ETL processes, feed into data warehouses, which may then feed into OLAP cubes. BI tools themselves may add another layer, building their own in-memory models in order to accelerate query performance. Dremio thinks that’s a huge mess and disintermediates things by providing a direct bridge between BI tools and the source system they’re querying. The BI tools connect to Dremio as if it were a primary data source, and query it via SQL. Dremio then delegates the query work to the true back-end systems through push-down queries that it issues. Dremio can connect to relational databases (DB2, Oracle, SQL Server, MySQL, PostgreSQL), NoSQL stores (MongoDB, Amazon Redshift, HBase, MapR-FS), Hadoop, cloud blob stores like S3, and ElasticSearch.

Here’s how it works: all data pulled from the back-end data sources is represented in memory using Arrow. Combined with vectorized (in-CPU parallel processing) querying, this design can yield up to a 5x performance improvement over conventional systems (company claims). But a perhaps even more important optimization is Dremio’s use of what it calls “Reflections,” which are materialized data structures that optimize Dremio’s row and aggregation operations. Reflections are sorted, partitioned, and indexed, stored as files on Parquet disk, and handled in-memory as Arrow-formatted columnar data. This sounds similar to ROLAP aggregation tables).

Andrew Brust from ZDNet said, “While Dremio’s approach to this is novel, and may break a performance barrier that heretofore has not been well-addressed, the company is nonetheless entering a very crowded space. The product will need to work on a fairly plug-and-play basis and live up to its performance promises, not to mention build a real community and ecosystem. These are areas where Apache Drill has had only limited success. Dremio will have to have a bigger hammer, not just an Arrow”.

Serverless, FaaS, AWS Lambda, etc..

If you are part of the cloud development community, you certainly know about “serverless computing”, almost a misnomer. Because it implies there are no servers which is untrue. However the servers are hidden from the developers. This model eliminates operational complexity and increases developer productivity.

We came from monolithic computing to client-server to services to microservices to serverless model. In other words, our systems have slowly “dissolved” from monolithic to function-by-function. Software is developed and deployed as individual functions – a first-class object and cloud runs it for you. These functions are triggered by events which follows certain rules. Functions are written in fixed set of languages, with a fixed set of programming model and cloud-specific syntax and semantics. Cloud-specific services can be invoked to perform complex tasks. So for cloud-native applications, it offers a new option. But the key question is what should you use it for and why.

Amazon’s AWS, as usual, spearheaded this in 2014 with a engine called AWS Lambda. It supports Node, Python, C# and Java. It uses AWS API triggers for many AWS services. IBM offers OpenWhisk as a serverless solution that supports Python, Java, Swift, Node, and Docker. IBM and third parties provide service triggers. The code engine is Apache OpenWhisk. Microsoft provides similar function in its Azure Cloud function. Google cloud function supports Node only and has lots of other limitations.

This model of computing is also called “event-driven” or FaaS (Function as a Service). There is no need to manage provisioning and utilization of resources, nor to worry about availability and fault-tolerance. It relieves the developer (or devops) from managing scale and operations. Therefore, the key marketing slogans are event-driven, continuous scaling, and pay by usage. This is a new form of abstraction that boils down to function as the granular unit.

At the micro-level, serverless seems pretty simple – just develop a procedure and deploy to the cloud. However, there are several implications. It imposes a lot of constraints on developers and brings load of new complexities plus cloud lock-in. You have to pick one of the cloud providers and stay there, not easy to switch. Areas to ponder are cost, complexity, testing, emergent structure, vendor dependence, etc.

Serverless has been getting a lot of attention in last couple of years. We will wait and see the lessons learnt as more developers start deploying it in real-world web applications.

Hadoop, the next ten years

I attended a meetup yesterday evening at the San Jose Convention Center on the subject “Apache Hadoop, the next 10 years” by Doug Cutting, the creator of Hadoop while at Yahoo, who works at Cloudera now. That venue was chosen because of the ongoing Strata+Hadoop conference there.

It’s always fun listening to Doug recounting how Hadoop got created in the first place. Based on early papers from Google on GFS (Google File System) and Map Reduce computing algorithm, a project was launched called Nutch, subsequently renamed Hadoop (after Doug’s son’s toy elephant name). This all made sense as horizontal scaling via commodity hardware was coming to dominate the computing landscape. All the modules in Hadoop were designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. That was all back in 2006. As an open source project, Hadoop gained momentum with community support for the overall ecosystem. Over the next seven years, we saw many new additions/improvements such as YARN, Hbase, Hive, Pig, Zookeeper, etc. Hence, Doug wanted to emphasize that there is a difference between just Hadoop and the Hadoop ecosystem.

The original Hadoop with its Map Reduce computing had its limitations and lately Spark is taking over the computing part. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance. It originated at UC, Berkeley’s AMPlab and is gaining fast momentum with its added features for machine learning, streaming, graph and SQL interfaces. To a question from the audience, Doug replied that such enhancements are expected and more will come as the Apache Hadoop ecosystem grows. Cloudera has created Impala, a speedier version plus the SQL interface to meet customer needs. Another example of a key addition to the ecosystem is Kafka which originated from Linked-In. The Apache Kafka project is a message broker service and  aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. To another question on whether another general-purpose platform will replace Hadoop, Doug suggested that projects like Spark will appear to handle parts of the ecosystem better. There may be many purpose-built software to address specific needs like Kafka. He eloquently praised the “open Source” philosophy of community of developers helping faster progress compared to the speed at older companies like Oracle in enhancing its DBMS software.

From the original Hadoop meant for batch processing of large volumes of data in a distributed cluster, we are moving towards the real-time world of streaming analytics and instant insights. The popularity of Hadoop can be gauged by the growth in attendance of the San Jose Hadoop Summit…from 2700 attendees in 2013, it more than doubled last year.

Doug is a good speaker and his 40 minute talk was informative and entertaining.