When I was part of the DB2 development team at IBM some twenty-odd years ago, our design point for the largest table size was 64GB. Back in the early 1980s, that was considered a huge size for a single database. Remember, these were the pre-data-warehousing years.
Then came the era of separating warehousing data from operational data to avoid performance interference. During the 1990s, data warehousing became a hot area, with companies like Red Brick and Teradata bringing specialization into that space. Traditional DBMS players like IBM, Oracle, Informix, and Sybase, as well as Microsoft, all jumped into the game. A new space called ETL (extraction, transformation, loading) brought a slew of new companies such as Informatica. Business intelligence tools companies also flourished, such as Business Objects, Hyperion, Cognos, and smaller players like MicroStrategy. Datamarts were talked about as a way to address departmental warehouses and minimize the risk of building a “pie-in-the-sky” data warehouse for the entire corporation. The telco (call records) and retail sectors became avid users of data warehousing solutions.
Walmart became the poster child for exploiting the largest data warehouse on the planet. Stories of their store-replenishment efficiency, driven by warehouse data on the day’s transactions, appeared in many publications. Most of these solutions came at a very high cost.
At today’s Internet companies like Yahoo, eBay, Amazon, and Google, the scale of data is quite stunning. We used to talk in gigabytes and terabytes; now we have moved up to petabytes and exabytes. New “unstructured” data such as email, audio, video, photos, and text fields keeps accumulating, and no one wants to delete anything. Given the exponential growth in the number of users on sites like MySpace, the data explosion is happening beyond our imagination.
All these new-age Internet companies will face a serious problem of scaling, managing, searching, and archiving huge volumes of data. Traditional database solutions like Teradata or Oracle will not address their issues of latency and ultra-high scalability (despite marketing claims and FUD tactics to the contrary). These products were designed in another, pre-Internet era and address traditional Fortune 500 customers.
New start-ups are coming up with the notion of a data warehouse appliance to address such cases of multi-petabyte sizes. Netezza is a case in point. This young company out of Boston has done a good job of delivering better price-performance for data warehousing problems. Better still is a start-up called Greenplum (San Mateo), which goes several steps further than Netezza, processing a terabyte in 60 seconds flat, at very low cost, on dual-core, 64-bit machines. Latency is minimal because the I/O channels are very fast.
The future is bright for new solutions that address ultra-high database sizes and fast processing, with clever hardware clustering and parallelism as the main thrust, going somewhat beyond today’s shared-nothing architectures.