IBM’s big commitment to Apache Spark

Last June IBM made a serious commitment to the future of Apache Spark with a series of initiatives:  

  • It will offer Apache Spark as a service on Bluemix. (Bluemix is an implementation of IBM’s Open Cloud Architecture based on Cloud Foundry, an open source Platform as a Service (PaaS); it delivers enterprise-level services that integrate easily with your cloud applications without you needing to know how to install or configure them.)
  • It committed 3,500 researchers to work on Spark-related projects.
  • It will donate IBM SystemML (its machine learning language and libraries) to the Apache Spark open source community.

The question is: why this move by IBM?

First, let us look at what Apache Spark is. Developed at UC Berkeley’s AMPLab, Spark gives us a comprehensive, unified framework to manage big data processing requirements across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline. In other words, Spark is the next generation of Hadoop, which came with a batch pedigree and high latency.
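To make that concrete, here is a minimal sketch of Spark’s unified model using the Python API of the Spark 1.x releases current as of this writing; the input file and its field names are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="unified-demo")
sqlContext = SQLContext(sc)

# Batch side: load semi-structured JSON into a DataFrame
# and query it with plain SQL.
events = sqlContext.read.json("events.json")  # hypothetical input file
events.registerTempTable("events")
top_users = sqlContext.sql(
    "SELECT user_id, COUNT(*) AS n FROM events "
    "GROUP BY user_id ORDER BY n DESC")
top_users.show()

# The same DataFrame can feed MLlib models or a streaming job
# without leaving Spark: that is the "unified pipeline" part.
```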

With other solutions for real-time analytics via in-memory processing on the market, such as RethinkDB, an ambitious Redis project, and the commercial in-memory SAP HANA, IBM needed a competitive offering. Other vendors betting on Spark range from Amazon to Zoomdata. IBM will run its own analytics software on top of Spark, including SystemML for machine learning, SPSS, and IBM Streams.

At this week’s Strata conference, several companies, such as Uber, described how they have deployed Spark end to end for speedy real-time analytics.

Translytical Database

This is a new term I learnt this week, thanks to the Forrester analyst Mike Gualtieri. Terms like Translytics or Exalytics (Oracle’s phrase) do not roll off the tongue that easily. Mike defined Translytical as a “single unified database that supports transaction and analytics in real time without sacrificing transactional integrity, performance, and scale.”

[Transactions + Analytics = Translytical]

Those of us who saw the early days of Data Warehousing remember that we deliberately separated the two worlds so that analytics workloads would not interfere with transaction performance. Hence snapshots of operational data were taken for data warehousing, to support offline batch analysis and reporting. Mostly that gave a retrospective view of what happened. In the current scheme of things, where data is coming fast and furious from so many sources, there is a need to look at trends in real time and take action. Some insights are perishable and therefore need to be acted on immediately. Data originates fast, but analytics is usually done much later. Perishable insights can have exponentially more value than after-the-fact traditional historical analysis. Here is a classification of analytics:

  • Past: Learn (Descriptive Analytics)
  • Present: Infer (Predictive Analytics), Detect (Streaming Analytics)
  • Future: Action (Prescriptive Analytics)

Streaming analytics (real time) requires a database that can do in-memory streaming for near-zero latency on complex data and analytical operations. The traditional approach of moving data to analytics has created many silos, such as the CRM stack, BI stack, or Mobile stack. Translytical databases are transactional as well as analytical. Point solutions like Spark Streaming, which does micro-batch processing, are not the answer. Such a unified database must do in-memory processing (using RAM for real time), be multi-modal, and support compression and tiered data as well. Customers are stitching together open source products such as Spark, Kafka, and Cassandra to achieve streaming analytics, but it becomes a non-trivial programming task, as the sketch below suggests.
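To give a flavor of that stitching, here is a hedged sketch in Python of the kind of glue code involved, written against the Spark 1.x streaming API and the DataStax Cassandra driver; the topic, keyspace, and table names are all hypothetical, and the table is assumed to use Cassandra counter columns.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from cassandra.cluster import Cluster

sc = SparkContext(appName="streaming-analytics")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Read (key, message) pairs directly from a Kafka topic.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"})

def save_counts(rdd):
    # Runs once per micro-batch on the driver; a production job would
    # reuse connections and write from the executors instead.
    session = Cluster(["127.0.0.1"]).connect("analytics")
    for key, count in rdd.collect():
        # event_counts is assumed to be a counter table.
        session.execute(
            "UPDATE event_counts SET n = n + %s WHERE key = %s",
            (count, key))
    session.cluster.shutdown()

# Count events per message body in each batch and persist the tallies.
counts = stream.map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
counts.foreachRDD(save_counts)

ssc.start()
ssc.awaitTermination()
```

Even this toy version has to coordinate batch intervals, broker addresses, consistency, and failure handling across three systems, which is exactly the integration burden a translytical database promises to remove.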

The only database claiming to be Translytical is VoltDB, with functions such as in-memory processing, scale-out with a shared-nothing architecture, and ACID compliance for transactional integrity, reliability, and fault tolerance. It also has real-time analytics built in, combined with integration with the Hadoop ecosystem. Such a unified database has to prove its worth in the market.

So we have come full circle: from a single database, to separate databases for transactions and analytics, and now back to a single database doing both.

It makes logical sense, but let us watch and see if it works.

Cassandra Summit 2015

I attended my first Cassandra Summit this week at the Santa Clara convention center. I was quite surprised to see more than 6,000 people attend this event, much bigger than last year (2,000 attendees), with 130 sessions. It was proof of the growing popularity of Cassandra’s NoSQL database platform. DataStax CTO and cofounder Jonathan Ellis (formerly of Rackspace) described the new functions in releases 2.2 and 3.0.

With the major addition of JSON support in 2.2, Cassandra has largely eliminated the difference with MongoDB. Cassandra has its own query language called CQL, an SQL-like construct; now, with JSON support, the developer community will see some big advantages. There are functions like collections, user-defined types, and user-defined functions (UDFs), with deeper nesting. Release 3.0 will see a brand new storage engine, a vast improvement on the previous key-value store engine with better space efficiency. Release 3.0 will also include materialized views. Bragging about Cassandra’s fast performance and efficient distributed database functionality, Jonathan joked about MongoDB as the “snapchat for databases” (a reference to occasional data loss because of weak consistency). He emphasized three key elements: availability (onstage they dramatized the shutting down of many nodes in two data centers with Cassandra still running), scale, and performance (both read and write).
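As an illustration, here is a minimal sketch of the 2.2 JSON syntax driven from the DataStax Python driver; the keyspace, table, and data are hypothetical.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.set_keyspace("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id int PRIMARY KEY, name text, emails set<text>)""")

# INSERT ... JSON maps a JSON document straight onto the row,
# much as a document store would.
session.execute("""
    INSERT INTO users JSON
    '{"id": 1, "name": "Ada", "emails": ["ada@example.com"]}'""")

# SELECT JSON returns each row back as a JSON document.
for row in session.execute("SELECT JSON * FROM users"):
    print(row[0])
```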

However, when it comes to streaming analytics, one Cassandra user explained how he combined Spark, Kafka, and Cassandra to achieve the same – a non-trivial programming feat. DataStax CEO Billy Bosworth emphasized that they solve the transaction workload problem (always-on) and are not geared for analytics. I understood that two key customers are Apple, running its iTunes application on Cassandra, and Netflix. The majority of use cases were web-centric applications where speed and scale are key requirements. Several case studies showed customers replacing MySQL with Cassandra.

In a panel discussion on monetizing open source software, the comment was made that Cassandra is foundational and customers are happy to pay for the fast performance and scale of the enterprise edition. In that sense, the model is different from the Red Hat model.

It was an interesting experience to see a new-generation database product gaining wider acceptance.

HP – Where is it headed?

HP (Hewlett-Packard) is in the news today. The company (HPQ) is splitting itself into two separate companies on Nov. 1: Hewlett Packard Enterprise (HPE) and HP Inc. (HPI). The first will include the server line, consulting, and related software, whereas the second will have the PC and printer business. In the process, 33,000 employees will be laid off (about 10% of its current workforce of 300,000), on top of 55,000 earlier lay-offs under the current CEO, Meg Whitman. The lay-offs will be carried out over the next three years, mostly at HP Enterprise. The restructuring will result in pretax charges of about $2.7 billion at HPE and $300 million at HP Inc., which has been hit hard by a relentless decline in sales of PCs.

HP has had a checkered history of many mis-steps. When the company’s growth stalled in the late 1990s, then-CEO Lew Platt also opted for a split, spinning off HP’s original medical products and test and measurement devices unit (as Agilent). The next decade would bring a string of mega-mergers under a succession of CEOs, starting with now-Presidential-hopeful Carly Fiorina and her $19 billion Compaq acquisition in 2002, then the $13.9 billion purchase of IT services company EDS under CEO Mark Hurd, and finally the $11.1 billion acquisition of Autonomy under Leo Apotheker. The Compaq acquisition saw a lot of drama in the board room, and arguably it did not yield the results HP had hoped for. Fiorina was fired by the board in the end, but walked away with a multimillion-dollar severance package; it seems this will be brought up in the debates by the GOP hopefuls. Mark Hurd resigned unceremoniously and is now co-CEO at Oracle. Leo Apotheker did not last even a year and was fired by the board; the Autonomy acquisition on his watch turned out to be a disaster. Meg Whitman took over as CEO in 2011, and this major split is happening under her watch.

Whitman hopes that this latest move will make HPE a leading player in cloud services. Still, that path is a challenging one: rivals like Amazon, Microsoft, Google, and IBM already dominate the cloud market, and it won’t be easy for HP to break through. Given the poor state of the PC market, HPI will continue to face big challenges in terms of revenue and earnings. “HP software” has long been something of an oxymoron, with hardly any worthwhile products to mention; OpenView, in the systems management arena, was a rare success. Now the company is focusing on big data and analytics with acquisitions like Vertica, but these have yet to make a big impact in the market.

We will wait and see.

Another Big day at Apple

Apple continues to amaze. Excitement build-up. Total secrecy. Rumor mill in full swing. Then the day arrives, like this morning! A perfectly scripted, 2-hour extravaganza packed with new products and features. The Steve Jobs tradition lives on – perfect execution of a marketing event!

Tim Cook started off with the Apple Watch update: the cool, fashionable Apple Watch Hermès with great bands and watch faces, and another sport lineup in gold, rose gold, and anodized aluminum colors, plus a red band to go with them. There were cool demos of new Watch apps, such as AirStrip, with which a pregnant mother can monitor her baby’s heartbeat and transmit it to the doctor instantly. Other apps include Facebook Messenger, GoPro, the calendar, etc.

Next was the big news of the iPad Pro, with a large 12.9″ multi-touch screen and 5.6 million pixels. It comes with a full-size Smart Keyboard that connects magnetically. The chip is the A9X, a 64-bit powerhouse that doubles performance compared to the A8X; interestingly, this iPad is 22 times faster than the original iPad. Apple also introduced the Apple Pencil, a stylus for sketching on the iPad Pro. Thanks to the large screen real estate, multiple applications can run side by side, as shown by the Microsoft Office team (Excel, Word, and PowerPoint on the same screen). Adobe also demoed its Comp, Photoshop Fix, and Photoshop Sketch apps on the iPad Pro. The iPad Pro comes at a hefty price of $799 (32GB) and $949 (128GB), and more for the 256GB one. The Pencil costs $99 and the keyboard a hefty $169.

The other big news was the new Apple TV, a revamped box and a brand new remote with cool features such as a button for Siri and a touchpad for faster scrolling. Tim Cook said the future of TV is apps, such as iTunes, Netflix, Hulu, etc. The operating system, called tvOS, is based on iOS, and developers can build new applications on it. It will be a better user experience for sure. I like that the remote control has a volume adjustment button; that way, you won’t need another remote for the TV. This is great for Apple TV users like me who have abandoned cable forever. This looks like the future of TV.

Finally, there was the iPhone. As predicted, Apple introduced two new versions, the iPhone 6s and iPhone 6s Plus. The big feature is 3D Touch, which lets app content pop up for a quick peek and disappear. It has a better camera, with 12 MP resolution and HD or 4K video. Tim Cook mentioned that iPhone sales have grown 35% overall during the last year, but 75% in China. The new operating system, iOS 9, will be available on Sept. 16. The 6s ranges in price from $199 to $399, and the 6s Plus costs $299 and up – no change from the existing iPhone 6 and 6 Plus prices.

As usual, it was a delightful event, with many cool innovations complemented by the baritone voice of Jony Ive on video for each new product design.

SDR – Streamlined Data Refinery

Yesterday I attended a session in Palo Alto on the subject of the Data Refinery; the speaker was Will Gorman of Pentaho. I did not realize that Pentaho was acquired by Hitachi Data Systems a couple of months ago. The term “data lake” was coined by James Dixon of Pentaho; I wrote a blog on this subject last year. As soon as the term started to appear in the data lexicon, other interesting terms such as “data swamp” appeared.

The term data lake was coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate, and updated data without incurring the development costs (in time and money) typically associated with structured data warehouses. According to IBM, “However appealing this premise, practically speaking, it is our experience, and that of our customers, that ‘raw’ data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable.” I completely agree.

During the early days of Data Warehousing, the term ETL covered all the data preparation stages: extract, transform, and load the curated data for query and reporting. I used to jokingly call this “the answer to 25 years of sin”. In my understanding, Pentaho’s SDR (Streamlined Data Refinery) is a modern form of ETL that deals with both internal structured data and external unstructured data, including machine-generated data. In Pentaho’s own words, “The big data stakes are higher than ever before. No longer just about quantifying ‘virtual’ assets like sentiment and preference, analytics are starting to inform how we manage physical assets like inventory, machines and energy. This means companies must turn their focus to the traditional ETL processes that result in safe, clean and trustworthy data. However, for the types of ROI use cases we’re talking about today, this traditional IT process needs to be made fast, easy, highly scalable, cloud-friendly and accessible to business. And this has been a stumbling block – until now. Streamlined Data Refinery, a market-disrupting innovation that effectively brings the power of governed data delivery to ‘the people’, unlocks big data’s full operational potential.”
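For readers who have never touched ETL, here is a deliberately minimal Python sketch of the three stages; the file, table, and field names are all hypothetical. Products like SDR automate this kind of pipeline at scale, with governance on top.

```python
import csv
import sqlite3

# Extract: read raw operational records.
with open("orders_raw.csv") as f:
    rows = list(csv.DictReader(f))

# Transform: curate the data by dropping incomplete records
# and normalizing the amounts.
curated = [
    {"order_id": r["id"], "amount_usd": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("id") and r.get("amount")
]

# Load: write the curated data where reporting tools can query it.
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount_usd REAL)")
db.executemany("INSERT INTO orders VALUES (:order_id, :amount_usd)", curated)
db.commit()
db.close()
```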

Earlier I wrote about data curation and how new companies such as Tamr are addressing the issue. Pentaho’s SDR is another form of data curation; IBM calls it the data wrangling process.

As usual, we love to confuse matters with a variety of terms describing the same thing!

A new era for Google

Yesterday, August 10, 2015, saw the dawn of a new era at Google, which announced a huge restructuring of the company. An umbrella holding company called Alphabet was created to manage a portfolio of separate companies, each with its own CEO. Founders Larry Page and Sergey Brin elevated themselves to CEO and President of Alphabet, with Eric Schmidt as chairman; the board will remain the same as before. For the biggest company, Google, the new CEO is Sundar Pichai, 43 years old, who has risen fast through the ranks over his 11-year career there. The other portfolio companies will be Nest, Fiber, Google X, Calico, Google Capital, Sidewalk, etc.

Google Inc. will remain the dominant revenue and profit generator, including Search, YouTube, Android, Maps, Apps, and Technical Infrastructure. The new CEO Sundar Pichai joins the ranks of newly anointed Indian-American execs like Satya Nadella (Microsoft CEO), Shantanu Narayen (Adobe CEO), George Kurian (NetApp CEO), and his twin brother Thomas Kurian (Oracle President). Sundar’s reputation as a great product-focused management executive helped him earn this position. We congratulate Sundar, an IIT Kharagpur alumnus with a Master of Science degree from Stanford and a Wharton MBA.

Some analysts are comparing the move to Warren Buffett’s company structure, a mix of independent businesses and investments. In the letter announcing Alphabet, Page said, “In general, our model is to have a strong CEO who runs each business, with Sergey and me in service to them as needed.” One challenge will be how to fund the unprofitable units with money from the one strong unit, Google Inc. It can also be a way to force several “moonshot” projects to start surviving as independent entities with their own P&L.

Google has been an unusual company from the start. Let us hope this new structural experiment yields good results as intended.