Category Archives: Data Warehouse

Data Sharehouse?

This is yet another new term in our lexicon. The San Mateo, California-based startup Snowflake announced a new offering with this name this week, as a free add-on to the data warehouse it built for cloud computing. With the new capability, officially called Snowflake Data Sharing, companies using Snowflake’s technology can share any part of their data warehouses with each other, subject to defined security policies and access controls.

Snowflake’s data sharehouse allows companies to provide direct access to structured and unstructured data without the need to copy the data to a new location. Current approaches include file sharing, electronic data interchange, application programming interfaces, and email, but all of them have issues, ranging from lack of security to cumbersome ways of giving the right people access to the data. Jon Bock, Snowflake’s marketing chief, compared data sharing on Snowflake versus other methods to the difference between streaming music and compact discs. “It looks [to the data recipient] just as if the data resides on their own data warehouse,” he said.
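
To make Bock’s analogy a bit more concrete: from the consumer’s side, a database created from a provider’s share is queried like any local database, with no copying or file transfer. The sketch below is a minimal illustration using Snowflake’s Python connector; the account, credentials, warehouse, and object names are all hypothetical.

    # Minimal sketch of a data consumer querying a provider's shared data in place.
    # Assumes the consumer has already created a database (SUPPLIER_SHARE_DB here)
    # from the provider's share; all names and credentials are hypothetical.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="ANALYST",               # hypothetical consumer-side user
        password="********",
        account="consumer_account",   # hypothetical account identifier
        warehouse="ANALYTICS_WH",     # the consumer pays for this compute
    )
    cur = conn.cursor()
    try:
        # The shared table is read directly; no copy lands in the consumer's account.
        cur.execute(
            "SELECT order_date, SUM(amount) "
            "FROM SUPPLIER_SHARE_DB.PUBLIC.ORDERS "
            "GROUP BY order_date"
        )
        for order_date, total in cur.fetchall():
            print(order_date, total)
    finally:
        cur.close()
        conn.close()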

The catch is that every participant must be a Snowflake customer using its data warehouse in the cloud, so this is also a way for Snowflake to grow its market. We saw this approach in the 1990s, when exchanges for B2B data interchange were introduced by the likes of Oracle. That did not go very far. Cost was certainly a big factor, but agreeing on common formats and security policies for data exchange was another. Snowflake claims to solve this by having one source of truth in the cloud.

Of course, companies such as manufacturers and suppliers, or advertisers and publishers, have been sharing data for a long time, but it has been cumbersome, relying on technologies like EDI (electronic data interchange, whose origins go back decades), email, file sharing, APIs, and more. That kind of sharing takes time and was not designed for the current situation, in which businesses need live data processed in real time to keep a competitive edge.

According to Bob Muglia, Snowflake’s CEO (ex-Microsoft), the data sharehouse changes the game and democratizes the possibilities, because anyone can access the service. Rather than paying a subscription fee, users pay only for the data they actually process: the sharing service is free to data providers, while data consumers pay for the compute resources they use. Beyond that, providers and consumers make their arrangements independently of Snowflake Computing, which acts only as the infrastructure provider.

In an increasingly collaborative world, there is little doubt that sharing data easily and in real time, without sacrificing security, privacy, governance, and compliance, is of great value. Whether it will create entirely new markets remains to be seen, but actionable data-driven insights are likely to be huge differentiators in the digital economy.

It is a clever move, but time will tell if this will enable smooth data exchange or create more chaos.

A conference in Bangalore

I was invited to speak at a conference called Solix Empower 2017 held in Bangalore, India on April 28th, 2017. It was an interesting experience. The conference focused on Big Data, Analytics, and Cloud. Over 800 people attended the one-day event with keynotes and parallel tracks on wide-ranging subjects.

I did three things. First, I was part of the inaugural keynote, where I spoke on “Data as the new Oxygen,” showing the emergence of data as a key platform for the future. I emphasized the new architecture of containers and micro-services, layered with machine learning libraries and analytic tool kits, on which modern big data applications are built.

Then I moderated two panels. The first was titled “The rise of real-time data architecture for streaming applications” and the second “Top data governance challenges and opportunities.” In the first panel, the members came from Hortonworks, Tech Mahindra, and ABOF (Aditya Birla Fashion). Each member described the criticality of real-time analytics, where trends and anomalies are caught on the fly and acted on within seconds or minutes. I learnt that for online e-commerce players like ABOF, a key challenge is identifying the customers most likely to refuse goods delivered at their door (many do not have credit cards, hence COD, or cash on delivery). Such refusals cause major losses to the company, so ABOF does trend analysis to identify the customers likely to behave that way. By using real-time analytics, ABOF has been able to reduce such occurrences by about 4%, with significant savings. The panel also discussed technologies for data ingestion, streaming, and building stateful applications, and touched on combining Hadoop/EDW (OLAP) with streaming (OLTP) in one solution, as in the Lambda architecture.
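
ABOF’s actual models were not described in the panel, but as a rough, purely illustrative sketch of that kind of trend analysis, a simple classifier over historical order outcomes might look like the following; the data file, feature names, and split are all made up.

    # Hypothetical sketch of scoring cash-on-delivery (COD) refusal risk.
    # The data file and feature names are invented for illustration; ABOF's
    # real features and models were not disclosed.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    orders = pd.read_csv("historical_cod_orders.csv")   # hypothetical export
    features = ["past_refusals", "order_value", "items_in_cart", "pincode_refusal_rate"]
    X, y = orders[features], orders["was_refused"]       # 1 if the customer refused delivery

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Score incoming orders; high-risk ones could be routed to prepayment
    # or a confirmation call before dispatch.
    refusal_risk = model.predict_proba(X_test)[:, 1]
    print(refusal_risk[:10])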

The second panel, on data governance, had members from Wipro, Finisar, Solix, and Bharti AXA Insurance. The panelists agreed that data governance is no longer viewed inside the company as the “bureaucratic police,” universally disliked; it is now taken seriously by upper management. Hence policies for metadata management, data security, data retirement, and authorization are being put in place. Accuracy of data remains a key challenge. While the organizational structure for data governance (such as a CDO, or chief data officer) is still evolving, many hard problems remain, especially for large companies with diverse groups.

It was interesting to have executives from Indian companies reflect on these issues, which seem no different from what we discuss here. Big Data is everywhere and global.

Data Unification at scale

The term Data Unification is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010Data, and TamR. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), which emerged during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end-target databases, more specifically an operational data store, data mart, or data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g., how many shoppers between the ages of 25 and 35 bought this item between May and July?), like driving a car while looking in the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands “predictive analytics” and “streaming analytics”.
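
As a deliberately tiny sketch of that extract-transform-load flow, assuming two hypothetical source exports and a SQLite file standing in for the data mart:

    # Minimal ETL sketch: extract from two hypothetical source files, transform
    # them to a common target schema, and load into a SQLite "data mart".
    import csv
    import sqlite3

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(row, source):
        # Map each source's local column names onto the target schema.
        if source == "web_store":
            return (row["customer_id"], row["sku"], float(row["price_usd"]))
        if source == "pos_system":
            return (row["cust"], row["item_code"], float(row["amount"]) / 100.0)  # cents -> dollars
        raise ValueError(f"unknown source {source}")

    def load(rows, db_path="mart.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, sku TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        rows = [transform(r, "web_store") for r in extract("web_orders.csv")]
        rows += [transform(r, "pos_system") for r in extract("pos_export.csv")]
        load(rows)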

During my IBM and Oracle days, ETL in its first phase was left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players, such as Informatica, Datastage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of constructing a global schema in advance; writing, for each local data source, a program to understand the source and map it to the global schema; and then writing scripts to transform, clean (resolving homonym and synonym issues), and dedup (remove duplicates) the data. These programs together formed the ETL pipeline. The process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
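
The clean-and-dedup step is where much of the hand-written logic lives. A minimal sketch, assuming a hypothetical hand-maintained synonym table of the kind such scripts typically carry:

    # Sketch of the clean-and-dedup stage: normalize synonymous values, then
    # drop duplicate records. The synonym table here is hypothetical.
    SYNONYMS = {
        "intl business machines": "IBM",
        "i.b.m.": "IBM",
        "hewlett packard": "HP",
    }

    def clean(record):
        name = record["vendor"].strip().lower()
        record["vendor"] = SYNONYMS.get(name, record["vendor"].strip())
        return record

    def dedup(records, key=("vendor", "invoice_no")):
        seen, unique = set(), []
        for r in records:
            k = tuple(r[f] for f in key)
            if k not in seen:
                seen.add(k)
                unique.append(r)
        return unique

    raw = [
        {"vendor": "I.B.M.", "invoice_no": "42", "amount": 100.0},
        {"vendor": "intl business machines", "invoice_no": "42", "amount": 100.0},
    ]
    print(dedup([clean(r) for r in raw]))  # both rows collapse to a single IBM record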

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources), but it gets extremely hard when the sources number in the hundreds or even thousands. It gets worse still when you want to unify public data from the web with enterprise data.
  • The human labor needed to map each source to a master schema gets costly and excessive. Machine learning is required here, with domain experts asked to augment the mappings where needed (a rough sketch of such assisted mapping follows this list).
  • Real-time unification of streaming data, and analysis on top of it, cannot be handled by these solutions.
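
A rough sketch of what assisted mapping can look like: score candidate matches between source columns and the master schema by name similarity, accept the confident ones automatically, and queue the rest for a domain expert. The similarity measure and the 0.8 threshold here are arbitrary choices for illustration, not any vendor’s actual method.

    # Rough sketch of assisted schema mapping: propose source-column -> master-column
    # matches by string similarity, accept high-confidence ones automatically, and
    # send the rest to a domain expert. The 0.8 threshold is arbitrary.
    from difflib import SequenceMatcher

    MASTER_SCHEMA = ["customer_id", "product_sku", "order_amount", "order_date"]

    def best_match(source_column, candidates=MASTER_SCHEMA):
        scored = [(SequenceMatcher(None, source_column.lower(), c).ratio(), c) for c in candidates]
        return max(scored)

    def map_source(source_columns, threshold=0.8):
        auto, needs_review = {}, []
        for col in source_columns:
            score, target = best_match(col)
            if score >= threshold:
                auto[col] = target
            else:
                needs_review.append(col)  # hand off to a domain expert
        return auto, needs_review

    auto, review = map_source(["cust_id", "prod_sku", "amt", "ship_dt"])
    print(auto)    # confident mappings
    print(review)  # columns a human still has to map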

Another solution, the “data lake,” where disparate data is stored in its native format, seems to address only the “ingest” problem. It changes the order of ETL to ELT (load first, then transform). However, it does not address the scale issues. The new world needs bottom-up, schema-last data unification in real time or near real time.

The typical data unification cycle goes like this: start with a few sources, try enriching the data with, say, X, and see whether it works; if it fails, loop back and try again. Use enrichment to improve the result, and do as much as possible automatically using machine learning and statistics, but iterate furiously, asking domain experts for help when needed. Otherwise the current approach of ETL or ELT can get very expensive.
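
Sketched in code, under the assumption of toy stand-ins for the matching, enrichment, and quality-scoring steps, that iterative cycle might be structured like this:

    # Sketch of the iterative, schema-last unification cycle described above.
    # unify(), enrich(), and quality() below are trivial stand-ins; a real system
    # would plug in actual matching, enrichment, scoring, and expert review.

    def unify(sources):
        # naive bottom-up merge: just concatenate records from every source
        return [record for source in sources for record in source]

    def enrich(dataset, lookup):
        # example enrichment: fill in a missing "country" field from a lookup table
        return [{**r, "country": r.get("country") or lookup.get(r["city"], "unknown")} for r in dataset]

    def quality(dataset):
        # toy metric: fraction of records with a known country
        return sum(r.get("country", "unknown") != "unknown" for r in dataset) / len(dataset)

    def unify_iteratively(sources, lookups, target_quality=0.95, max_rounds=5):
        dataset = unify(sources)
        for _ in range(max_rounds):                  # iterate furiously
            if quality(dataset) >= target_quality:
                break
            for lookup in lookups:                   # keep an enrichment only if it helps
                candidate = enrich(dataset, lookup)
                if quality(candidate) > quality(dataset):
                    dataset = candidate
            # in practice, remaining gaps would be queued here for a domain expert
        return dataset

    sources = [[{"city": "Bangalore"}], [{"city": "San Mateo", "country": "USA"}]]
    lookups = [{"Bangalore": "India"}]
    print(unify_iteratively(sources, lookups))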