Yesterday I attended a session in Palo Alto on the subject of Data Refinery; the speaker was Will Gorman of Pentaho. I did not realize that Pentaho was acquired by Hitachi Data Systems a couple of months ago. The term “data lake” was coined by James Dixon of Pentaho, and I wrote a blog on this subject last year. As soon as the term started to appear in the data lexicon, other interesting terms such as “data swamp” appeared.
The term “data lake” conveys the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate, and updated data without incurring the development costs (in time and money) typically associated with structured data warehouses. According to IBM, “However appealing this premise, practically speaking, it is our experience, and that of our customers, that ‘raw’ data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable”. I completely agree.
During the early days of Data Warehousing, the term ETL covered all the data preparation stages – extract, transform, and load the curated data for query and reporting. I used to call it, jokingly, “the answer to 25 years of sin”. In my understanding, Pentaho’s SDR (Streamlined Data Refinery) is a modern form of ETL that deals with both internal structured data and external unstructured data, including machine-generated data. In Pentaho’s own words, “The big data stakes are higher than ever before. No longer just about quantifying ‘virtual’ assets like sentiment and preference, analytics are starting to inform how we manage physical assets like inventory, machines and energy. This means companies must turn their focus to the traditional ETL processes that result in safe, clean and trustworthy data. However, for the types of ROI use cases we’re talking about today, this traditional IT process needs to be made fast, easy, highly scalable, cloud-friendly and accessible to business. And this has been a stumbling block – until now. Streamlined Data Refinery, a market-disrupting innovation that effectively brings the power of governed data delivery to ‘the people’, unlocks big data’s full operational potential”.
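To make the extract–transform–load idea concrete, here is a minimal sketch of the classic pattern: pull raw records from a source, cleanse and normalize them, and load the curated result into a queryable store. The schema, column names, and sample data are all hypothetical, chosen purely for illustration; this is not Pentaho’s SDR, just the generic ETL shape it modernizes.

```python
# Minimal ETL sketch: extract raw records, transform (curate) them,
# load into a reporting store. All names/data here are hypothetical.
import csv
import io
import sqlite3


def extract(raw_csv: str) -> list:
    """Extract: read raw records from a CSV-shaped source."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list) -> list:
    """Transform: cleanse and normalize (trim text, cast types,
    and drop rows that cannot be curated)."""
    curated = []
    for row in rows:
        try:
            curated.append((row["region"].strip().upper(),
                            float(row["sales"])))
        except (KeyError, ValueError):
            continue  # discard malformed rows instead of loading them
    return curated


def load(rows: list, conn: sqlite3.Connection) -> None:
    """Load: write curated rows into the reporting table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


# Hypothetical raw feed: one dirty region name, one unparseable amount.
raw = "region,sales\nwest ,100.5\neast,abc\nnorth,42\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 142.5 -- the 'east' row was dropped during transform
```

The point of the sketch is the division of labor: the transform step is where curation happens, which is exactly the stage that becomes painful as the number and messiness of sources grows.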
Earlier I wrote about Data Curation and how new companies such as Tamr are addressing the issue. Pentaho’s SDR is another form of data curation. IBM calls it the Data Wrangling process.
As usual, we love to create confusion with a variety of terms describing the same thing!