The term "Data Unification" is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010data, and Tamr. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), which arose during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end-target databases: typically an operational data store, data mart, or data warehouse. These are read-only databases used for analytics. Initially the analytics was mostly retrospective (e.g. how many shoppers aged 25-35 bought this item between May and July?). This was like driving a car while looking in the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands "predictive analytics" and "streaming analytics".
During my IBM and Oracle days, ETL in that first phase was left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players, such as Informatica, DataStage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.
The ETL methodology consisted of: constructing a global schema in advance; writing, for each local data source, a program to understand the source and map it to the global schema; and then writing a script to transform, clean (resolving homonym and synonym issues), and dedup (remove duplicates) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
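To make the pipeline concrete, here is a minimal sketch of that schema-first ETL flow in Python. The global schema, source field names, cleaning rules, and the toy SQLite "warehouse" are illustrative assumptions, not any particular vendor's tooling:

```python
import sqlite3

# Global schema agreed on in advance (the MDM-style master representation).
GLOBAL_SCHEMA = ("customer_id", "name", "email")

def extract(rows):
    """Pull raw records from one local source (here, a list of dicts)."""
    return rows

def transform(raw):
    """Map source fields to the global schema, clean, and dedup."""
    seen = set()
    for r in raw:
        email = r.get("email", "").strip().lower()  # normalize (clean)
        if email in seen:                           # drop duplicates (dedup)
            continue
        seen.add(email)
        yield (r["id"], r["full_name"].title(), email)

def load(records, conn):
    """Write the conformed records into the warehouse table."""
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id, name, email)")
source = [{"id": 1, "full_name": "ann lee", "email": "Ann@x.com"},
          {"id": 2, "full_name": "Ann Lee", "email": "ann@x.com "}]
load(transform(extract(source)), conn)
print(conn.execute("SELECT * FROM customers").fetchall())
```

Multiply this hand-written mapping and cleaning logic by hundreds of sources, and the scaling problem described below becomes apparent.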
In the world of Big Data, this approach is woefully inadequate. Why?
- Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, a modest number of data sources), but becomes extremely hard when sources number in the hundreds or even thousands. It gets worse still when you want to unify public data from the web with enterprise data.
- Human labor to map each source onto a master schema becomes costly and excessive. Machine learning is required here, with domain experts asked to augment it where needed.
- Real-time unification and analysis of streaming data cannot be handled by these solutions.
Another solution, the "data lake", where you store disparate data in its native format, seems to address only the "ingest" problem. It reorders ETL into ELT (load first, then transform). However, it does not address the scale issues. The new world needs bottom-up (schema-last) data unification in real time or near real time.
The typical data unification cycle can go like this: start with a few sources; try enriching the data with, say, source X; see if it works; if it fails, loop back and try again. Use enrichment to improve the result, and do as much as possible automatically using machine learning and statistics. But iterate furiously, and ask domain experts for help when needed. Otherwise the current ETL or ELT approach can get very expensive.
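Here is a toy sketch of that cycle, with difflib's string similarity standing in for real machine learning, and the thresholds and records invented for illustration:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Cheap statistical match score between two entity names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def unify(records, threshold):
    """Greedily cluster look-alike records; flag borderline pairs for review."""
    clusters, unsure = [], []
    for rec in records:
        for cluster in clusters:
            score = similarity(rec, cluster[0])
            if score >= threshold:
                cluster.append(rec)               # confident automatic match
                break
            if score >= threshold - 0.15:
                unsure.append((rec, cluster[0]))  # borderline: flag for review
        else:
            clusters.append([rec])                # no match: new cluster
    return clusters, unsure

records = ["IBM Corp", "I.B.M. Corporation", "Oracle Inc", "Oracle"]
threshold = 0.9
while threshold > 0.6:              # iterate furiously
    clusters, unsure = unify(records, threshold)
    if not unsure:                  # no borderline cases left; stop tuning
        break
    threshold -= 0.05               # loop back and try a looser rule
print(clusters)                     # `unsure` pairs go to domain experts
```

The structure matters more than the toy matcher: an automatic pass, a measurement, a loop back, and a human escalation path for the cases the machine cannot decide.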
As every year begins, experts and analyst firms like to make predictions. Let us try to make some observations in an area much talked about lately: Big Data. So here goes:
- The Big Data quandary will continue as companies try to understand its value to the business. Just dumping all kinds of data into a data lake (read: Hadoop) is not going to solve anything; there has to be business value in the insights sought. Therefore, much as the Data Warehousing era brought additional tools in the ETL space, there is a need for data curation and transformation for practical use, beyond the analytics piece.
- Demand for BI and analytics will reach new heights. The next-generation BI and analytics platform should help businesses tap into the power of their data, whether in the cloud or on premises. This "Networked BI" capability creates an interwoven data fabric that delivers business-user self-service while eliminating analytical silos, resulting in faster and more trusted decision-making. Real-time or streaming analytics will become crucial, as decisions must be taken as soon as events occur.
- Spark will get even hotter. I described IBM's big endorsement of Spark in a blog post last year. Spark gives us a comprehensive, unified framework for managing big data processing requirements across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and 10 times faster even when running on disk. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. This also suggests that in-memory processing will continue to thrive.
- Analytics & big events will drive demand exponentially. This year’s big events like the US presidential election and the Olympics in Brazil will see the harnessing of big data to provide data-driven insights like never before.
- Protection of data itself will become paramount. It’s still too easy for hackers to circumvent perimeter defenses, steal valid user credentials, and get access to data records. In 2016, as companies protect themselves from the threat of data loss, new means of data-centric security will become mainstream to consistently control user access and credentials where it matters the most.
- A shortage of data scientists will drive companies to look for Big Data cloud services. To circumvent the need to hire more data scientists and Hadoop admins, organizations will rely on fully managed cloud services with built-in operational support, freeing up existing data science teams to focus their time and effort on analysis instead of wrangling complex Hadoop clusters.
- Finally, the shift to cloud is becoming mainstream because of the clear ROI. At least the dev-and-test shift is happening quite fast. AWS seems to dominate production configurations, even though big data as a service is still in its infancy. Microsoft Azure, IBM's cloud service, and Oracle's new cloud offerings will make this space quite vibrant.
Recently I listened to a discussion on Big Data visualization hosted by Bill McKnight of the McKnight Consulting Group. The panelists agreed that Big Data is shifting from hype to an "imperative" state. Start-up companies run more Big Data projects, whereas true big data is still a small part of enterprise practice. At many companies, Big Data is moving from POC (proof of concept) to production. Interest in visualizing data from different sources is certainly increasing. There is growth in data-driven decision-making, as evidenced by the increasing use of platforms like YARN, Hive, and Spark. The traditional RDBMS platform cannot scale to meet the needs of the rapidly growing volume and variety of Big Data.
So what is the difference between data exploration and data visualization? Data exploration is more analytical and is used to test hypotheses, whereas visualization is used to profile data and is more structured. The suggestion is to bring visualization to the beginning of the data cycle (not the end) to enable better data exploration. For example, in personalized cancer treatment, finding and examining white blood cell counts and cancer cells can be done up front using data visualization. In Internet e-commerce, billions of rows of data can be analyzed to understand consumer behavior; one customer uses Hadoop and Tableau's visualization software to do this. Tableau enables visualization of all kinds of data sources across three scenarios: cold data from a data lake on Hadoop (where source data resides in native format); warm data from a smaller data set; or hot data served in memory for faster processing.
Data format can be a challenge. How do you visualize NoSQL data? For example, JSON data (as supported by MongoDB) is nested and schema-less, which is hard for BI tools. Understanding the data is crucial, and nested hierarchies will need to be flattened; nested arrays can be broken out into separate rows linked by foreign keys. Graph data is another special case, where visualizing the right amount of graph data is critical (good UX).
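To make the flattening idea concrete, here is a small Python sketch using pandas; the sample documents and field names are invented. Each element of the nested array becomes its own row, with the parent's key carried along as the foreign key:

```python
import pandas as pd

# Nested JSON documents, as MongoDB might return them (hypothetical data).
orders = [
    {"order_id": 1, "customer": {"name": "Ann", "city": "Austin"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 2, "customer": {"name": "Raj", "city": "Reno"},
     "items": [{"sku": "A1", "qty": 5}]},
]

# Flatten the nested "items" array: one row per item, with order_id and the
# customer name carried along so rows can be joined back to their parent.
flat = pd.json_normalize(orders, record_path="items",
                         meta=["order_id", ["customer", "name"]])
print(flat)
```

The resulting rectangular table is something a conventional BI tool can chart directly, which the raw nested documents are not.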
Apache Drill is an open-source, low-latency SQL query engine for Hadoop and NoSQL. Modern big data applications such as social, mobile, web, and IoT deal with a larger number of users and larger amounts of data than traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing, and can include complex formats such as JSON and Parquet. Apache Drill is built from the ground up to provide low-latency queries natively on such rapidly evolving multi-structured datasets at scale.
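Here is a minimal sketch of querying a raw JSON file through Drill's REST API from Python. The host, file path, and column names are assumptions for illustration; the point is that no schema is declared before querying:

```python
import requests

# Hypothetical local Drill instance; query.json is Drill's REST query endpoint.
DRILL_URL = "http://localhost:8047/query.json"

# Drill reads the file directly and discovers its (possibly nested) structure.
sql = "SELECT t.user_id, t.address.city FROM dfs.`/data/users.json` t LIMIT 10"

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row)
```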
Apache Spark is another exciting approach, speeding up queries by utilizing memory. It consists of Spark SQL (SQL-like queries), Spark Streaming, MLlib, and GraphX, and can be programmed in Python, Scala, or Java. It enables Hadoop users to have more fun with data analysis and visualization.
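As a small, hedged PySpark sketch of that workflow (the input file and columns are invented for illustration), the same engine reads raw JSON, answers a SQL query, and hands a compact result to a plotting or BI tool:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes PySpark is installed).
spark = SparkSession.builder.appName("viz-demo").getOrCreate()

# Read raw JSON events; Spark infers the schema on the fly.
events = spark.read.json("clicks.json")   # hypothetical input file
events.createOrReplaceTempView("events")

# Spark SQL over the same data: find the ten most visited pages.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM events
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

# Convert the small result to pandas for charting in a visualization tool.
print(top_pages.toPandas())
spark.stop()
```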
Big Data visualization is emerging as a critical component for extracting business value from data.
Back when we were doing DB2 at IBM, there was an important older product called IMS, which brought in significant revenue. With another database product coming (based on relational technology), IBM did not want any cannibalization of the existing revenue stream. Hence we coined the phrase "dual database strategy" to justify the need for both DBMS products. In a similar vein, several vendors are concocting all kinds of terms and strategies to justify newer products under the banner of Big Data.
One such phrase is Fast Data. We all know the 3 V's associated with the term Big Data: volume, velocity, and variety. It is the middle V (velocity) that says data is not static but changing fast, like stock market data, satellite feeds, or even sensor data coming from smart meters or an aircraft engine. The question has always been how to deal with this type of changing data (as opposed to the static data typical of most enterprise systems of record).
Recently I was listening to a talk by IBM and VoltDB, where VoltDB tried to justify a world of "Fast Data" coexisting with "Big Data", the latter narrowed to the static data warehouse, or "data lake" as IBM calls it. Again, they have chosen to pigeonhole Big Data into the world of HDFS, Netezza, Impala, and batch MapReduce. This way, they justify the phrase Fast Data as representing operational data that changes fast. They call VoltDB "the fast, operational database", implying that every other database solution is slow. Yet incumbents like IBM, Oracle, and SAP have introduced in-memory options for speed, and even NoSQL databases can process very fast reads on distributed clusters.
The VoltDB folks also tried to show how the two worlds (Fast Data and their version of Big Data) would coexist: the Fast Data side ingests and interacts with streams of inbound data, does real-time analysis, and exports to the data warehouse. They bragged about a performance benchmark of 1 million tps on a 3-node cluster, scaling to 2.4 million on a 12-node system running in the SoftLayer cloud (owned by IBM). They also said this solution is much faster than Amazon's AWS cloud. The comparison is not apples-to-apples, as the SoftLayer deployment runs on bare metal while the AWS stack is virtualized software.
I wish they would simply call this "real-time data analytics", since it is mostly read-type transactions, and not confuse it with update-heavy workloads. We will wait and see how enterprises adopt this VoltDB-SoftLayer solution alongside their existing OLTP solutions.
What is the "3rd Platform" of IT? It comprises cloud, mobile, social, and big data products. According to IDC, "3rd Platform technologies and solutions will drive 29 percent of 2014 IT spending and 89 percent of all IT spending growth". Much of that growth will come from the "cannibalization" of traditional IT markets. Here are some interesting quotes and statements I read recently.
- Adding terabytes to a Hadoop cluster is much less costly than adding terabytes to an enterprise data warehouse (EDW).
- IDG Enterprise’s 2014 Big Data survey: more than half of the IT leaders polled believe they will have to re-architect the data center network to some extent to accommodate big data services.
- “Big data has the same sort of disruptive potential as the client-server revolution of 30 years ago, which changed the whole way that IT infrastructure evolved. For some people the disruption will be exciting and for others, it will be threatening.” – Marshall Presser, CTO at Pivotal
- The traditional IT infrastructure was designed to help the CFO close the company’s books faster than the manual accounting systems that preceded IT. A surprising number of those original systems are still kicking around, adding to the pile of “legacy spaghetti” that CIOs love to complain about.
- We are seeing a bumpy transition from the old kind of IT that faced mostly inward, to a new kind of IT that mostly faces outward.
- After years of resistance, IT is following the nearly universal business trend of replacing “product-centricity” with “customer-centricity”.
- One key challenge is rapidly scaling systems to meet unexpected levels of demand. “I call it the ‘curse of success’ because if the market suddenly loves your product, you have to scale up very quickly. Those kinds of scaling problems are difficult to solve, and there isn’t a universal toolkit for achieving scalability on the Internet of Things. When Henry Ford needed to scale up production, he could add another assembly line.” (Jordan Husney, Strategy Director, Undercurrent)
- "HDFS is a complete replacement for not just one, but four different layers of the traditional IT stack. The HDFS ecosystem does storage, processing, analytics, and BI/visualization, all without moving the data back and forth from one system to another. It is a complete cannibalization of the existing stack." (Abhishek Mehta, Founder and CEO of Tresata). My view is that this applies only to the analytics side, not to the transaction-processing aspect of the business.
- API-ification of the Enterprise: "Not only do we have to change the infrastructure, we have to fundamentally change the way we build applications. Hundreds of millions of new applications will be built. Some of them will be very small, and very transient. Traditional IT organizations, along with their tooling, approaches, and processes, will have to change. For IT, it's going to be a different world. We're seeing the 'API-ification' of the enterprise." (Rick Bullotta, Cofounder and CTO, ThingWorx)
All these observations are interesting, but they must be taken with the proper scope in mind. There is a tendency to sensationalize and to generalize too quickly. The 3rd Platform is real, and Big Data is certainly changing the IT landscape. The only question is the velocity of change!