Data Management Renaissance

I was listening to Tim Bray of Sun giving his keynote at QCon this week in San Francisco. Tim is a co-author of the XML specification and a real technology visionary, especially in the area of storage and data management. With the surge in web applications with very high scalability needs (millions of users), a fundamental rethinking of data management is happening. Here are the main points:

– Between the application code and the disk/cache sit the traditional layers: object/relational (O/R) mapping, the SQL engine, and the OS and file system. What passes between them: objects from the application code to the O/R mapping, SQL to the SQL engine, normalized tuples to the OS and file system, and finally bytes to the cache/disk.

– These layers add latency and overhead and become huge bottlenecks for web applications catering to millions of users. Hence, some applications bypass them and manipulate large hash tables of key/value pairs directly. Proponents claim this approach is 100 times faster than an RDBMS.

– Application code can map directly to the RDBMS, bypassing the O/R layer (which is hard). PHP does that, but it's hideous and ugly.

– How about application code going directly to the OS and file system, bypassing the entire database layer? Such applications store their data as XML, JSON, plain text and media files.

– There is a project at Sun called Drizzle that is doing radical minimization of MySQL, calling it a lightweight SQL DB for the cloud and web. It aims to throw away large chunks of code from MySQL.

– Experiments are going on with “column-oriented” databases. They have a huge number of columns and use SQL-like queries for speed.

– Google App Engine goes directly to storage via Google Bigtable (a persistence layer, but not SQL). This is like a column-oriented table.

– Document-oriented databases are appearing. CouchDB and Amazon’s SimpleDB are examples.

– Other interesting approaches include the use of AtomPub, where application code interacts directly with the web space via HTTP. It’s nimble and can deliver faster performance. Google uses it in its online services. Rumor has it that Microsoft’s recently announced cloud OS, Azure, uses it as well.

– REST is a good architectural style, as evidenced by Amazon’s successful deployment.
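
To make the contrast in the points above a bit more concrete, here is a minimal Python sketch of my own (not something shown in the keynote). It stores the same record three ways: through a SQL engine, as a key/value pair in a plain hash table, and in a column-oriented layout. The record, the `users` table, the `user:42` key and the variable names are all made up for illustration, and an in-memory SQLite database stands in for a real RDBMS. It only shows the shape of each path and measures nothing.

```python
import json
import sqlite3

# Hypothetical record, used only for illustration.
user = {"id": 42, "name": "alice", "city": "SF"}

# Layered path: the application object becomes SQL, the SQL engine
# turns it into a tuple, and the tuple goes down to storage.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.execute("INSERT INTO users VALUES (:id, :name, :city)", user)
row = db.execute(
    "SELECT id, name, city FROM users WHERE id = ?", (42,)
).fetchone()
# Rebuild the object from the tuple (the job an O/R mapper would do).
from_sql = dict(zip(("id", "name", "city"), row))

# Direct path: a hash table of key/value pairs, with the value
# serialized as JSON. No SQL parsing or tuple mapping in between.
kv_store = {}
kv_store["user:42"] = json.dumps(user)
from_kv = json.loads(kv_store["user:42"])

# Column-oriented layout: one list per column rather than one tuple
# per row, so reading a single column touches only that list.
columns = {
    "id": [42, 43],
    "name": ["alice", "bob"],
    "city": ["SF", "NYC"],
}
cities = columns["city"]

print(from_sql == from_kv, cities)
```

The key/value path gets its speed by giving up what the SQL layers provide, such as ad hoc queries and joins. That is the trade-off these new designs are making.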

Bottom line – Web application designs are not CPU-limited. They are persistence-limited and I/O-limited. Therefore, new innovations are emerging to enable such applications to handle data efficiently.

