Category Archives: BI

Apache Drill + Arrow = Dremio

A new company called Dremio emerged from stealth mode yesterday. It is backed by Redpoint and Lightspeed, which led a $10M Series A back in 2015. The founders came from MapR and were active in Apache projects like Drill and Arrow. The same VCs had backed MapR and had the Dremio founders work out of their facilities during the stealth phase. The company now has around 50 people in its Mountain View, California office.

Apache Drill acts as a single SQL engine that can query and join data from several other systems, and it can certainly make use of an in-memory columnar data standard. But while Dremio was still in stealth, it wasn’t immediately obvious what the strong intersection between Drill and Arrow might be. Yesterday the company launched a namesake product that likewise acts as a single SQL engine for querying and joining data across several other systems, and it accelerates those queries using Apache Arrow. So it is a combination of Drill and Arrow: schema-free SQL over a variety of data sources plus a columnar in-memory analytics execution engine.

Dremio believes that BI today involves too many layers. Source systems, via ETL processes, feed into data warehouses, which may then feed into OLAP cubes. BI tools themselves may add another layer, building their own in-memory models to accelerate query performance. Dremio thinks that’s a huge mess and disintermediates things by providing a direct bridge between BI tools and the source systems they’re querying. The BI tools connect to Dremio as if it were a primary data source and query it via SQL. Dremio then delegates the work to the true back-end systems through push-down queries that it issues. Dremio can connect to relational databases (DB2, Oracle, SQL Server, MySQL, PostgreSQL), NoSQL stores (MongoDB, HBase), Hadoop and MapR-FS, cloud stores like Amazon S3 and Redshift, and Elasticsearch.
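
As a concrete illustration, here is a minimal sketch of querying Dremio from Python over ODBC, the same way a BI tool would. The DSN name, credentials, and source schema/table paths are hypothetical; the actual connection details depend on your deployment.

```python
# Minimal sketch: querying Dremio over ODBC, the way a BI tool would.
# Assumes a locally configured ODBC DSN named "Dremio"; schema and table
# names below are made up for illustration.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)
cursor = conn.cursor()

# One SQL statement that joins a PostgreSQL table with a MongoDB collection.
# The engine plans the query and pushes filters/joins down to each source where it can.
cursor.execute("""
    SELECT c.name, SUM(o.total) AS lifetime_value
    FROM postgres.sales.customers AS c
    JOIN mongo.shop.orders AS o ON o.customer_id = c.id
    WHERE o.order_date >= '2017-01-01'
    GROUP BY c.name
    ORDER BY lifetime_value DESC
    LIMIT 10
""")

for name, lifetime_value in cursor.fetchall():
    print(name, lifetime_value)
```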

Here’s how it works: all data pulled from the back-end data sources is represented in memory using Arrow. Combined with vectorized (in-CPU parallel) query processing, this design can yield up to a 5x performance improvement over conventional systems, according to the company. A perhaps even more important optimization is Dremio’s use of what it calls “Reflections,” which are materialized data structures that optimize its raw and aggregation operations. Reflections are sorted, partitioned, and indexed, stored as Parquet files on disk, and handled in memory as Arrow-formatted columnar data. This sounds similar to ROLAP aggregation tables.
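
The underlying pattern (Parquet on disk, Arrow in memory) can be illustrated with the open-source pyarrow library. This is a minimal sketch of the building blocks, not Dremio’s internal Reflection API.

```python
# Sketch of the Arrow-on-Parquet pattern: columnar data persisted as Parquet
# on disk, materialized in memory as Arrow tables for vectorized processing.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Build a small columnar table in memory (Arrow format).
table = pa.table({
    "region": ["east", "west", "east", "south"],
    "sales":  [120.0, 340.5, 98.2, 410.0],
})

# Persist it as Parquet on disk (how a materialization might be stored)...
pq.write_table(table, "sales_reflection.parquet")

# ...then read it back as an Arrow table and aggregate it with vectorized kernels.
materialized = pq.read_table("sales_reflection.parquet")
print(pc.sum(materialized["sales"]).as_py())   # 968.7
```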

Andrew Brust from ZDNet said, “While Dremio’s approach to this is novel, and may break a performance barrier that heretofore has not been well-addressed, the company is nonetheless entering a very crowded space. The product will need to work on a fairly plug-and-play basis and live up to its performance promises, not to mention build a real community and ecosystem. These are areas where Apache Drill has had only limited success. Dremio will have to have a bigger hammer, not just an Arrow”.


Data Unification at scale

The term Data Unification is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010data, and Tamr. Data unification covers the domain known as ETL (Extraction, Transformation, Loading), which originated during the 1990s when data warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end-target databases: more specifically, an operational data store, a data mart, or a data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g., how many shoppers between the ages of 25 and 35 bought this item between May and July?), which was like driving a car while looking at the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands “predictive analytics” and “streaming analytics”.

During my IBM and Oracle days, the early ETL work was left for outside companies to address. It was unglamorous work, and the major vendors were not that interested in solving it. This gave rise to many new players such as Informatica, DataStage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of constructing a global schema in advance, writing a program for each local data source to understand that source and map it to the global schema, and then writing a script to transform, clean (resolving homonym and synonym issues), and dedup (get rid of duplicates) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
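
A toy sketch of that schema-first flow, with made-up field names and sources, might look like this:

```python
# Classic schema-first ETL: one hand-written mapping per source into a
# global schema, followed by cleansing and dedup. Everything here is illustrative.

GLOBAL_SCHEMA = ["customer_id", "full_name", "email"]

def map_crm_record(rec):
    # One bespoke mapping per source system.
    return {"customer_id": rec["id"], "full_name": rec["name"], "email": rec["mail"]}

def map_billing_record(rec):
    return {"customer_id": rec["acct"], "full_name": rec["acct_holder"], "email": rec["contact"]}

def cleanse(rec):
    rec["email"] = rec["email"].strip().lower()
    rec["full_name"] = rec["full_name"].title()
    return rec

def dedup(records):
    seen, unique = set(), []
    for rec in records:
        key = rec["email"]          # naive dedup key; real pipelines match fuzzily
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

crm_rows = [{"id": 1, "name": "ada lovelace", "mail": "ADA@example.com "}]
billing_rows = [{"acct": 7, "acct_holder": "Ada Lovelace", "contact": "ada@example.com"}]

unified = dedup([cleanse(map_crm_record(r)) for r in crm_rows] +
                [cleanse(map_billing_record(r)) for r in billing_rows])
print(unified)
```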

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources), but gets extremely hard when sources number in the hundreds or even thousands. It gets worse when you want to unify public data from the web with enterprise data.
  • The human labor needed to map each source to a master schema becomes costly and excessive. Here machine learning is required, with domain experts asked to augment the mappings where needed (see the sketch after this list).
  • Real-time unification and analysis of streaming data cannot be handled by these solutions.
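
Here is a rough sketch of what automating the source-to-master mapping could look like. Simple string similarity stands in for the machine-learning models a real system would use, and anything below a confidence threshold is routed to a domain expert.

```python
# Automating source-to-master column mapping instead of hand-writing it.
# String similarity is a crude stand-in for a learned matching model;
# low-confidence matches are flagged for human review.
from difflib import SequenceMatcher

MASTER_SCHEMA = ["customer_id", "full_name", "email", "postal_code"]

def best_match(source_column, threshold=0.6):
    scored = [(SequenceMatcher(None, source_column.lower(), m).ratio(), m)
              for m in MASTER_SCHEMA]
    score, candidate = max(scored)
    return candidate if score >= threshold else None   # None => ask a human

source_columns = ["cust_id", "fullname", "e_mail", "zip"]
for col in source_columns:
    print(col, "->", best_match(col) or "needs expert review")
```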

Another solution, the “data lake,” where you store disparate data in its native format, seems to address only the “ingest” problem. It changes the order of ETL to ELT (load first, then transform), but it does not address the scale issues. The new world needs bottom-up (schema-last) data unification in real time or near real time.

The typical data unification cycle can go like this: start with a few sources, try enriching the data with, say, source X, and see if it works; if it fails, loop back and try again. Use enrichment to improve the result, do as much as possible automatically using machine learning and statistics, and iterate furiously, asking domain experts for help when needed. Otherwise the current approach of ETL or ELT can get very expensive.
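
A hypothetical sketch of that loop, with made-up quality numbers, might look like this:

```python
# Hypothetical iterate-and-enrich cycle: add an enrichment step each round,
# re-measure match quality, and stop (or escalate to a domain expert) once the
# target is reached. The gain estimates are invented for illustration.
def run_pipeline(enrichments, base_quality=0.70):
    # Pretend each enrichment step recovers some of the remaining mismatches.
    quality = base_quality
    for step in enrichments:
        quality += (1.0 - quality) * step["expected_gain"]
    return quality

enrichments, target = [], 0.95
candidate_steps = [
    {"name": "normalize addresses", "expected_gain": 0.4},
    {"name": "fuzzy-match names", "expected_gain": 0.5},
    {"name": "add external reference data", "expected_gain": 0.3},
]

for step in candidate_steps:
    enrichments.append(step)
    quality = run_pipeline(enrichments)
    print(f"after '{step['name']}': estimated match quality {quality:.2f}")
    if quality >= target:
        break
else:
    print("target not reached -- ask a domain expert to review the hard cases")
```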

IBM’s Software Business

IBM has come a long way from my time there: 16 years spent during the 1970s, 1980s, and early 1990s. Hardware was king for most of my years there, and software was merely a means to the end of hardware sales. Even during the early years of the IBM PC, that mistake (of thinking it was a hardware game) helped create a new software giant called Microsoft. Hence the joke that the acronym IBM stood for “I Blame Microsoft.”

Advance two decades and we finally see a big shift of focus from hardware to software. IBM has sold off much of its non-mainframe hardware (x86 servers) and storage business. During the fourth quarter of 2015, IBM’s share of the server market was 14.1%, with impressive year-over-year growth of 8.9%. Contrast this with the growth rates of HPE (-2.1%), Dell (5.3%), and Lenovo (3.7%).

IBM’s software is another story. While it contributed about 28% of total revenue in 2015 ($81.7B), its profit contribution was 60%. If its software were a separate business, it would rank as the fourth-largest software company, as shown below:

  1. Microsoft – $93.6B revenue, 30.1% profit
  2. Oracle – $38.2B revenue, 36.8% profit
  3. SAP – $23.2B revenue, 23.4% profit
  4. IBM (software) – $22.9B revenue, 34.6% profit

IBM’s software is the second most profitable after Oracle’s. The $22.9B in revenue can be split into three components:

  • Middleware at $19.5B (everything above the operating system, such as DB2, CICS, Tivoli, Bluemix, etc.),
  • Operating System at $1.8B,
  • Miscellaneous at $1.6B.

IBM does not break out its cloud software revenue explicitly, so it is hard to compare with AWS, Azure, or GCE.

The only problem is that the software business is not growing; in fact, it showed a decline last year. Given the rise of cloud services, IBM has to step up its competitive offerings in that space. It did acquire SoftLayer a couple of years back at a hefty price, but its cloud infrastructure growth does not match that of AWS (expected to hit $10B this year).

IBM is a company in transition. Resources are being shifted toward high-growth areas like cloud computing and analytics, and legacy businesses with poor growth prospects are in decline. Still, IBM remains a major force in the software market.

Big Data Predictions for 2016

As every year begins, several experts and analyst firms like to make predictions. Let us try to make some observations in an area much talked about lately – Big Data. So here goes:

  • The Big Data quandary will continue as companies try to understand its value to the business. Just dumping all kinds of data into a data lake (read: Hadoop) is not going to solve anything; there has to be clarity on what business insights are needed. So, much as the data warehousing era brought additional tools in the ETL space, there is a need for data curation and transformation for practical use, alongside the analytics piece.
  • Demand for BI and analytics will reach new heights. The next-generation BI and analytics platform should help businesses tap into the power of their data, whether in the cloud or on-premises. This “networked BI” capability creates an interwoven data fabric that delivers business-user self-service while eliminating analytical silos, resulting in faster and more trusted decision-making. Real-time or streaming analytics will become crucial, as decisions must be taken as soon as events occur.
  • Spark will get even hotter. I described IBM’s big endorsement of Spark last year in a blog post. Spark gives us a comprehensive, unified framework for big data processing across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph processing. This also means in-memory processing will continue to thrive.
  • Analytics & big events will drive demand exponentially. This year’s big events like the US presidential election and the Olympics in Brazil will see the harnessing of big data to provide data-driven insights like never before.
  • Protection of the data itself will become paramount. It’s still too easy for hackers to circumvent perimeter defenses, steal valid user credentials, and get access to data records. In 2016, as companies protect themselves from the threat of data loss, new means of data-centric security will become mainstream, consistently controlling user access and credentials where it matters most.
  • Shortage of Data Scientists will drive companies to look for Big data cloud services. To circumvent the need to hire more data scientists and Hadoop admins, organizations will rely on fully managed cloud services with built-in operational support, freeing up existing data science teams to focus their time and effort on analysis instead of wrangling complex Hadoop clusters.
  • Finally, the shift to the cloud is becoming mainstream because of the clear ROI. At least the dev-and-test shift is happening quite fast. AWS seems to dominate production configurations, even though big data as a service is still in its infancy. Microsoft Azure, IBM’s cloud service, and Oracle’s new cloud offerings will make this space quite vibrant.

Big Data Visualization

Recently I listened to a discussion on Big Data visualization hosted by Bill McKnight of the McKnight Consulting Group. The panelists agreed that Big Data is shifting from hype to an “imperative” state. Start-up companies are running more Big Data projects, whereas true big data is still a small part of enterprise practice. At many companies, Big Data is moving from POC (proof of concept) to production. Interest in visualizing data from different sources is certainly increasing, and there is growth in data-driven decision-making, as evidenced by the increasing use of platforms like YARN, Hive, and Spark. The traditional RDBMS platform cannot scale to meet the needs of the rapidly growing volume and variety of Big Data.

So what is the difference between data exploration and data visualization? Data exploration is more analytical and is used to test hypotheses, whereas visualization is used to profile data and is more structured. The suggestion is to bring visualization to the beginning of the data cycle (not the end) to do better data exploration. For example, in personalized cancer treatment, the examination of white blood cell counts and cancer cells can be done up front using data visualization. In Internet e-commerce, billions of rows of data can be analyzed to understand consumer behavior; one customer uses Hadoop and Tableau’s visualization software to do this. Tableau enables visualization across three scenarios: cold data from a data lake on Hadoop (where source data can sit in its native format), warm data from a smaller data set, and hot data served in-memory for faster processing.

Data format can be a challenge. How do you visualize NoSQL data? For example, JSON data (supported by MongoDB) is nested and schema-less and is hard for BI tools to handle. Understanding the data is crucial, and flattening of nested hierarchies will be needed; nested arrays can be broken out into child tables linked by foreign keys. Graph data is another special case, where visualizing the right amount of graph data is critical for a good user experience.
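
For example, a nested MongoDB-style document can be flattened with pandas; the document and field names below are illustrative.

```python
# Flattening nested, schema-less JSON so a BI tool can consume it: nested
# objects become prefixed columns, and nested arrays become a child table
# keyed back to the parent.
import pandas as pd

docs = [
    {"order_id": 1,
     "customer": {"name": "Ada", "city": "Pune"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
]

# Nested objects -> flat columns (customer.name, customer.city).
orders = pd.json_normalize(docs, sep=".")

# Nested arrays -> a separate child table carrying order_id as a foreign key.
items = pd.json_normalize(docs, record_path="items", meta=["order_id"])

print(orders[["order_id", "customer.name", "customer.city"]])
print(items)
```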

Apache Drill is an open source, low latency SQL query engine for Hadoop and NoSQL. Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. Apache Drill is built from the ground up to provide low latency queries natively on such rapidly evolving multi-structured datasets at scale.
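
As a hedged sketch, a Drill SQL query over a raw JSON file can be submitted through the REST endpoint of a local drillbit (assumed here to be the default http://localhost:8047/query.json); the file path and field names are illustrative.

```python
# Submitting a SQL query to Apache Drill over its REST API and printing rows.
# Endpoint, file path, and field names are assumptions for illustration.
import requests

query = """
    SELECT t.geo.country AS country, COUNT(*) AS events
    FROM dfs.`/data/clickstream/2016-01-01.json` AS t
    GROUP BY t.geo.country
    ORDER BY events DESC
"""

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
    timeout=60,
)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```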

Apache Spark is another exciting approach that speeds up queries by utilizing memory. It consists of Spark SQL (SQL-like queries), Spark Streaming, MLlib, and GraphX, and it can be programmed from Python, Scala, and Java. It lets Hadoop users have more fun with data analysis and visualization.
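
A minimal PySpark sketch (using the SparkSession API of Spark 2.x, with illustrative paths and column names) shows the pieces mentioned above: load JSON, query it with Spark SQL, and pull a small aggregate back to pandas for visualization.

```python
# PySpark sketch: read JSON into a DataFrame, run a Spark SQL aggregate,
# and collapse the small result to pandas for plotting in a notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("viz-demo").getOrCreate()

events = spark.read.json("hdfs:///data/events/")      # schema inferred from JSON
events.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")

# Bring the distributed result down to a small pandas DataFrame for visualization.
daily_pd = daily.toPandas()
print(daily_pd.head())

spark.stop()
```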

Big Data visualization is emerging as a critical component for extracting business value from data.

Fast Data vs. Big Data

Back when we were doing DB2 at IBM, there was an important older product called IMS that brought in significant revenue. With another database product coming (based on relational technology), IBM did not want any cannibalization of the existing revenue stream, so we coined the phrase “dual database strategy” to justify the need for both DBMS products. In a similar vein, several vendors are concocting all kinds of terms and strategies to justify newer products under the banner of Big Data.

One such phrase is Fast Data. We all know the 3 Vs associated with the term Big Data: volume, velocity, and variety. It is the middle V, velocity, that says data is not static but changing fast, like stock market data, satellite feeds, or sensor data coming from smart meters or an aircraft engine. The question has always been how to deal with this kind of fast-changing data (as opposed to the static data typical of most enterprise systems of record).

Recently I was listening to a talk by IBM and VoltDB in which VoltDB tried to position “Fast Data” as co-existing with “Big Data,” with the latter narrowed down to the static data warehouse, or “data lake” as IBM calls it. Again, they have chosen to pigeonhole Big Data into the world of HDFS, Netezza, Impala, and batch MapReduce. This lets them justify the phrase Fast Data as representing operational data that is changing fast. They call VoltDB “the fast, operational database,” implying that every other database solution is slow. Yet incumbents like IBM, Oracle, and SAP have introduced in-memory options for speed, and even NoSQL databases can process very fast reads on distributed clusters.

The VoltDB folks also tried to show how the two worlds (Fast Data and their version of Big Data) will coexist: the Fast Data side ingests and interacts with streams of inbound data, does real-time analysis, and exports results to the data warehouse. They bragged about a benchmark of 1M transactions per second on a 3-node cluster, scaling to 2.4M on a 12-node system running in the SoftLayer cloud (owned by IBM), and claimed this solution is much faster than Amazon’s AWS cloud. The comparison is not apples-to-apples, though, as the SoftLayer deployment runs on bare metal while the AWS stack is virtualized.
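
The general pattern they describe can be sketched generically (this is not VoltDB’s actual API): ingest a stream of events, keep operational aggregates in memory for real-time queries, and periodically export batches to the warehouse or data lake.

```python
# Generic "fast data" pipeline sketch: in-memory operational aggregates for
# real-time queries, plus periodic batch export to the warehouse. All names
# and thresholds are illustrative.
from collections import defaultdict

class FastDataPipeline:
    def __init__(self, export_every=1000):
        self.totals = defaultdict(float)   # operational state, queryable in real time
        self.buffer = []                   # raw events awaiting export
        self.export_every = export_every

    def ingest(self, event):
        self.totals[event["meter_id"]] += event["kwh"]
        self.buffer.append(event)
        if len(self.buffer) >= self.export_every:
            self.export_to_warehouse()

    def export_to_warehouse(self):
        # Stand-in for a bulk load into the warehouse or data lake.
        print(f"exporting {len(self.buffer)} events to the warehouse")
        self.buffer.clear()

    def current_total(self, meter_id):
        return self.totals[meter_id]

pipeline = FastDataPipeline(export_every=3)
for i in range(5):
    pipeline.ingest({"meter_id": "m-42", "kwh": 0.5 + i})
print(pipeline.current_total("m-42"))   # real-time answer: 12.5
```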

I wish they would simply call this real-time data analytics, as it is mostly read-type work, and not confuse it with update-heavy workloads. We will wait and see how enterprises adopt this VoltDB-SoftLayer solution alongside their existing OLTP solutions.

The High Tech Indian Election – lessons on Big Data, Social Media and 3D Holography

The recent national election in India, which spanned five weeks and concluded on May 12th, was unique in sheer numbers. The total electorate was 815 million (more than the populations of the USA and the European Union combined), of which 550 million actually voted. Half the electorate was below the age of 25 (the voting age is 18). The size and complexity were mind-boggling: it was the greatest democratic process on display!

The final results came out on May 16th, and the former opposition party, the BJP (Bharatiya Janata Party, or Indian People’s Party), won a landslide victory not seen in the last 30 years. The leader of the BJP is Mr. Narendra Modi, chief minister (like a state governor) of the state of Gujarat for the last 13 years. He comes from a poor family and climbed up the ranks by sheer hard work and a total focus on high-growth development of his state. India has put a lot of hope in Mr. Modi to bring speedy growth back to the economy.

Here I will point out how he used technology during the election.

Mr. Modi found a smart technologist from London specializing in 3D holography. He experimented with the technology during the state elections in 2012, when his 3D holographic image was beamed to many locations simultaneously. The system was debugged and ready for effective use this time, broadcasting his image in full 3D to hundreds of locations, reaching millions of people and conveying his message very effectively. Physically he spoke at over 400 rallies, each attended by a million people or more, but combined with the virtual presence via holography, his outreach was the largest of any candidate. 3D holography had never been used this way in any election around the world.

The second approach was putting together a crack team of data scientists who took a page from President Barack Obama’s campaign playbook. Data about the electorate was gathered and dissected like never before; based on behavior and past preferences, voters were segmented and targeted very carefully. This was a bottom-up approach: thousands of volunteers followed up on the big data analysis and reached out to voters door to door, influencing their choices. Without such analysis and pinpoint targeting, this would have been impossible within such a short time frame of four months. The effort focused primarily on two major states (UP and Bihar) where a BJP electoral victory would be the game-changer (the party had a poor record there in the past). The results proved the effectiveness of this approach: the BJP got 71 out of 80 seats in UP and 31 out of 40 in Bihar, totally beyond any projections and expectations.

The third focus was on fully exploiting social media such as Twitter, Facebook, and blogs. The urban population with Internet access was reached constantly via these new communication channels. Mr. Modi has one of the largest followings on Twitter.

It was the clever use of technology by Mr. Modi’s team that clinched this unprecedented victory. The other contesting parties, including the Congress party that had ruled for the last 10 years, were nowhere close in their use of technology, and they lost badly.

Some lessons for all future elections!