Last June IBM made a serious commitment to the future of Apache Spark with a series of initiatives:
- It will offer Apache Spark as a service on Bluemix. (Bluemix is an implementation of IBM’s Open Cloud Architecture based on Cloud Foundry, an open source Platform as a Service (PaaS). Bluemix delivers enterprise-level services that can easily integrate with your cloud applications without you needing to know how to install or configure them.)
- It committed to putting 3,500 researchers to work on Spark-related projects.
- It will donate IBM SystemML (its machine learning language and libraries) to the Apache Spark open source community.
The question is: why did IBM make this move?
First, let us look at what Apache Spark is. Developed at UC Berkeley’s AMPLab, Spark gives us a comprehensive, unified framework for big data processing across data sets that differ both in nature (text data, graph data, etc.) and in source (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline. In other words, Spark is the next generation of Hadoop, which came with a batch pedigree and high latency.
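To make the Map and Reduce operations mentioned above concrete, here is a minimal word-count sketch in plain Python. It is not Spark's own API: the function names `map_phase`, `reduce_phase` and `word_count` and the sample lines are hypothetical, chosen only to illustrate the pattern that Spark distributes across a cluster.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map step: turn one line of text into per-word counts."""
    return Counter(line.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines):
    # Map every line independently, then fold the partial results together.
    return reduce(reduce_phase, map(map_phase, lines), Counter())


# Hypothetical sample input, standing in for a much larger data set.
sample_lines = [
    "spark runs in memory",
    "hadoop runs on disk",
    "spark runs fast",
]
counts = word_count(sample_lines)
```

In Spark, the same two steps would run in parallel over partitions of the data, and intermediate results can be kept in memory between steps instead of being written to disk, which is where the speed-up over classic Hadoop jobs comes from.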
Other solutions for real-time analytics via in-memory processing already exist, such as RethinkDB, an ambitious Redis project, and the commercial in-memory SAP HANA, so IBM needed a competitive offering. Other vendors betting on Spark range from Amazon to Zoomdata. IBM will run its own analytics software on top of Spark, including SystemML for machine learning, SPSS, and IBM Streams.
At this week’s Strata conference, several companies, Uber among them, described how they have deployed Spark end to end for fast real-time analytics.