Data Curation Systems

There is a whole area in the data world that goes by various names: data integration, data movement, data curation or cleaning, data transformation, and so on. One of the pioneers is Informatica, which came into being when the data warehouse became a hot topic during the 1990s. The term ETL (extraction, transformation, loading) became part of the warehouse lexicon. If we call these the first generation of data integration tools, they did an adequate job for their time. Often the T of ETL was the hardest part, as it required business domain knowledge. Data were assembled from a small number of sources (usually fewer than 20) into the warehouse for offline analysis and reporting. The cost of data curation (mostly data cleaning) required to get heterogeneous data into a proper format for querying and analysis was high. During my years at Oracle in the mid-1990s, such tools were provided by third-party companies. Many warehouse projects ran substantially over budget and finished late.

Then a second generation of ETL systems arrived, in which the major ETL products were extended with data cleaning modules and additional adaptors to ingest other kinds of data. Data curation involved ingesting data sources, cleaning errors, transforming attributes, integrating schemas to connect disparate sources, and consolidating entities to remove duplicates. A professional programmer was still needed to handle all of these steps. With the arrival of the Internet, many new data sources appeared, diversity increased manyfold, and the integration task became much tougher.
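To make those steps concrete, here is a minimal sketch in Python of what such a curation pipeline does. The sources, field names, and matching rule are all invented for illustration; real second-generation tools wrapped these steps in far richer adaptors and rule engines.

```python
# A minimal, hypothetical sketch of the curation steps described above:
# ingest, clean, transform into a common schema, and consolidate entities.
# All field names and data are invented for illustration.

def ingest():
    """Ingest records from two sources with different schemas."""
    crm = [{"name": "ACME Corp.", "phone": "617-555-0100"},
           {"name": "Acme Corporation", "phone": "(617) 555-0100"}]
    billing = [{"customer": "ACME CORP", "tel": "6175550100"}]
    return crm, billing

def clean_phone(raw):
    """Clean errors: keep digits only so phone formats become comparable."""
    return "".join(ch for ch in raw if ch.isdigit())

def to_common_schema(record, name_key, phone_key):
    """Schema integration: map source-specific attributes to common ones."""
    return {"name": record[name_key].strip().upper().rstrip("."),
            "phone": clean_phone(record[phone_key])}

def consolidate(records):
    """Entity consolidation: treat records with the same phone as duplicates."""
    seen = {}
    for r in records:
        seen.setdefault(r["phone"], r)   # keep the first record per phone number
    return list(seen.values())

crm, billing = ingest()
unified = [to_common_schema(r, "name", "phone") for r in crm] + \
          [to_common_schema(r, "customer", "tel") for r in billing]
print(consolidate(unified))   # one consolidated ACME record remains
```

Even in this toy version, the transformation and consolidation logic is where the domain knowledge lives, which is why a programmer was needed for every new source.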

Now there is talk of a third generation of tools, termed "scalable data curation," that can scale to hundreds or even thousands of data sources. Experts note that such tools use statistics and machine learning to make automatic decisions wherever possible, calling for human interaction only when needed.
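A hedged illustration of that idea: the sketch below uses a simple string-similarity score as a stand-in for a learned matching model, merging record pairs automatically when the score is high, discarding them when it is low, and routing only the borderline cases to a human review queue. The thresholds, names, and scoring function are assumptions for illustration, not how any particular product works.

```python
from difflib import SequenceMatcher

# Hypothetical sketch: score candidate record pairs and only ask a human
# when the score is ambiguous. The similarity function and thresholds are
# illustrative stand-ins for a trained matching model.

AUTO_MATCH = 0.90      # above this, merge automatically
AUTO_REJECT = 0.50     # below this, treat as a non-match automatically

def similarity(a, b):
    """A crude stand-in for a learned match probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs):
    matched, review_queue = [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= AUTO_MATCH:
            matched.append((a, b, score))          # automatic decision
        elif score >= AUTO_REJECT:
            review_queue.append((a, b, score))     # needs a human
        # else: automatically treated as a non-match
    return matched, review_queue

pairs = [("International Business Machines", "International Business Machines Corp"),
         ("Informatica", "Informatics Dept."),
         ("Oracle", "Teradata")]
matched, review = triage(pairs)
print("auto-matched:", matched)
print("sent to human review:", review)
```

The point of this design is the economics: the more decisions the statistics can make automatically, the more sources a fixed team of experts can curate, which is what lets these tools claim to scale to hundreds or thousands of sources.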

Start-ups such as Trifacta and Paxata emerged, applying such techniques to data preparation, an approach subsequently embraced by incumbents Informatica, IBM, and Solix. A new startup called TamR (cofounded by Mike Stonebraker of Ingres, Vertica, and VoltDB fame), which received $16M in funding last year from Google Ventures and NEA, claims to deliver true "curation at scale." It has adopted a similar approach but applied it to a different upstream problem: curating data from multiple sources. IBM has publicly stated its intention to develop a "Big Match" capability for Big Data that would complement its MDM (master data management) tools. More vendors are expected to enter this space.

In summary, ETL systems arose to deal with the transformation challenges of early data warehouses. They evolved into second-generation data curation systems with an expanded scope of offerings. Now a new generation of data curation systems is emerging to address the Big Data world, where sources have multiplied and become far more heterogeneous. On the surface, this seems quite the opposite of the "data lake" concept, where data is stored in its native format. However, the so-called "data refinery" is no different from the curation process.
