Category Archives: Database

Closer look at one NoSQL database – MongoDB

Among the new crop of NoSQL database products, MongoDB ranks quite high, in my opinion. The company that produces MongoDB is 10gen, a venture-backed start-up since 2008. Its rapid growth over the last four years bears testimony to its technical strength.

MongoDB’s name comes from the middle five letters of the word “humongous”, a nod to big data. It is an open-source, document-oriented data store that is schema-free and supports dynamic queries with full indexing. Its data model is BSON, a binary encoding of JSON (JavaScript Object Notation), which is a lightweight, text-based open standard designed for data interchange. Douglas Crockford, later of Yahoo, specified JSON in the early 2000s, and it was published as RFC 4627 in 2006.
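
To make the schema-free document model concrete, here is a minimal sketch in plain Python using only the standard json module. The collection contents, the field names, and the find helper are illustrative assumptions; this is not MongoDB's API, and real MongoDB stores documents as BSON and is queried through its own drivers.

```python
import json

# A hypothetical "collection": a plain list of JSON-style documents.
# Documents need not share the same fields (schema-free).
collection = [
    {"name": "Alice", "city": "Palo Alto", "tags": ["db", "nosql"]},
    {"name": "Bob", "age": 42},
    {"name": "Carol", "city": "Palo Alto", "age": 35},
]


def find(docs, criteria):
    """Return documents whose fields equal every key/value in criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# A dynamic query: no schema had to be declared in advance.
matches = find(collection, {"city": "Palo Alto"})

# Round-trip one document through JSON text, the format BSON encodes in binary.
encoded = json.dumps(collection[1], sort_keys=True)
decoded = json.loads(encoded)

if __name__ == "__main__":
    print([d["name"] for d in matches])
    print(encoded)
```

The point of the sketch is only that documents with different fields can sit side by side and still be queried by whatever fields they happen to carry.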

The other key tenet of MongoDB is its scale-out architecture: it scales horizontally using automatic “sharding” (key-range partitioning). It also provides master-slave replication and replica sets for high availability, recovery, and performance. One of its customers, Disney’s Interactive Media Group, has 1400 instances of Mongo, for example. It uses sharding for write performance and replication for read performance.
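
Key-range partitioning can be sketched in a few lines of Python. The split points and function names below are hypothetical and chosen only for illustration; MongoDB's own sharding picks and rebalances ranges automatically, which this toy version does not attempt.

```python
from bisect import bisect_right

# Hypothetical split points: keys below "h" go to shard 0,
# "h" up to (not including) "p" go to shard 1, the rest to shard 2.
SPLIT_POINTS = ["h", "p"]


def shard_for(key):
    """Pick the shard whose key range contains the given key."""
    return bisect_right(SPLIT_POINTS, key)


def route(keys):
    """Group keys by the shard that would own them."""
    shards = {i: [] for i in range(len(SPLIT_POINTS) + 1)}
    for key in keys:
        shards[shard_for(key)].append(key)
    return shards


if __name__ == "__main__":
    print(route(["alice", "henry", "zoe", "mike", "paul"]))
```

Because each key maps to exactly one shard, writes spread across machines, which is the write-performance benefit described above.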

MongoDB can be deployed in the cloud via Amazon’s AWS. The company’s revenue model is support services, training, and consulting. Partners include VMware, Amazon, Red Hat, etc., all cloud platform providers offering MongoDB as an option to their clients. Although the database suits document storage best, it can handle other unstructured data such as video and images. The initial thrust, though, seems to be customers looking for high scalability on commodity hardware and superior performance.

MongoDB claims over 400 customers, including many internet companies like FourSquare, Craigslist, etc. Several textbooks have been published on MongoDB and the development community is growing fast. It certainly bridges the gap between traditional RDBMS products (Oracle, MySQL, SQL Server, DB2) at one end and key-value stores (Riak, Cassandra, Voldemort, ...) at the other end.

The Fourth Paradigm in Science

We all remember the late Jim Gray, the great computer scientist and Turing award winner. During the last several years of his research work at Microsoft, he focused on data-intensive computing and called it the Fourth Paradigm in scientific discovery. In a special book dedicated to the memory of Jim, Bill Gates commented, “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.”

So what is the Fourth Paradigm? Here is the explanation.

1. A thousand years ago – Experimental Science
– Description of natural phenomena
2. Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
3. Last few decades – Computational Science
– Simulation of complex phenomena
4. Today – Data-Intensive Science (unify theory, experiment, & simulation)

Scientists are overwhelmed with data sets from many different sources such as data captured by instruments, data generated by simulations, and data generated by sensor networks.

Jim Gray named it “eScience”, where IT (Information Technology) meets Science. It is the set of tools and technologies to support data federation and collaboration for analysis, data mining, data visualization and exploration, and for scholarly communication and dissemination. He laid out the principles, fondly called Gray’s laws of data engineering:

  • Scientific computing is revolving around data
  • Need scale-out solution for analysis
  • Take the analysis to the data!
  • Start with “20 queries”
  • Go from “working to working”

Interestingly, all of these apply to the commercial world of Big Data. The scientific world has simply been grappling with these problems longer. Given the proliferation of devices and incoming data in petabytes, the need for analytics tools is of the highest priority. No wonder 2012’s biggest buzzword is Big Data.

We miss you Jim and your pioneering thoughts on DISC (Data Intensive Scalable Computing)!

Revisiting “Big Data”

Big Data is a top technology trend for 2012 according to Forrester Research. The Economist said that Big Data is a new game-changing asset and The Harvard Business Review termed it a scientific revolution. Scientific revolution? Yes, because it is data-intensive computing that unifies theory, experiment, and simulation at scale.

It is also termed the Fourth Paradigm – “The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.”

Big Data is when the size of the data itself becomes part of the problem. But Big Data is not just “big”. There are the 3V’s of Big Data:

  1. Volume – Terabytes of records, transactions, tables, files. A Boeing jet engine spews out 10TB of operational data for every 30 minutes it runs. Hence a 4-engine jumbo jet can create 640TB on one eight-hour Atlantic crossing (4 engines times 20TB per hour times 8 hours). Multiply that by 25,000 flights flown each day and you get the picture.
  2. Velocity – batch, near-time, real-time, streams. Today’s on-line ad serving requires a decision within 40ms. Financial services need close to 1ms to calculate customer scoring probabilities. Stream data, such as movies, needs to travel at high speed for proper rendering.
  3. Variety – structured, unstructured, semi-structured, and all the above in a mix. WalMart processes 1M customer transactions per hour and feeds information to a database estimated at 2.5PB (petabytes). There are old and new data sources like RFID, sensors, mobile payments, in-vehicle tracking, etc.
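
The jet-engine volume figures above can be checked with a few lines of arithmetic. The 8-hour flight time is an assumption (it is what the 640TB total implies), and the daily total assumes every one of the 25,000 flights looks like that crossing, which is only a rough upper-end illustration.

```python
TB_PER_ENGINE_PER_HALF_HOUR = 10   # figure quoted in the text
ENGINES = 4                        # figure quoted in the text
FLIGHT_HOURS = 8                   # assumption: implied by the 640TB total
FLIGHTS_PER_DAY = 25_000           # figure quoted in the text

tb_per_engine_per_hour = TB_PER_ENGINE_PER_HALF_HOUR * 2
tb_per_flight = tb_per_engine_per_hour * ENGINES * FLIGHT_HOURS

# Rough illustration only: treats every flight as a 4-engine, 8-hour crossing.
tb_per_day = tb_per_flight * FLIGHTS_PER_DAY
eb_per_day = tb_per_day / 1_000_000   # 1 EB = 1,000,000 TB

if __name__ == "__main__":
    print(tb_per_flight, "TB per flight")
    print(eb_per_day, "EB per day")
```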

Because of these characteristics, traditional DBMS solutions are inadequate. Hence we have seen the growth of technologies such as Hadoop (an implementation of the MapReduce model that originated at Google), which mostly processes unstructured data in batch mode. New solutions are needed for real-time processing.

See my blog from last year on this subject.

Nasscom 2012 – Down the memory lane

I just saw the program for the annual conference of Nasscom (National Association of Software and Services Companies), called the India Leadership Forum (ILF) 2012. It is an impressive agenda of great topics and speakers, ranging from actors (Abhishek Bachchan and Shekhar Kapur) to political leaders (Kapil Sibal, P. Chidambaram), industry stalwarts (K.M. Birla, Rajendra Pawar, ...), and technical leaders from the software world as well as the services world.

The website talks about 20 years of leadership, which brings me to this recollection. It was 1992, exactly twenty years back, that I was invited to speak at the Nasscom event in Delhi. I had just left IBM after 16 years of work developing relational technology products like DB2. I joined Oracle that year in April and the Nasscom event was held in December. The late Dewang Mehta, the founder and first president of Nasscom, organized the event, one of the first large forums of those days. Remember, this was one year after the deregulation of the Indian economy and the beginning of an Indian software outsourcing presence on the global scene. Dewang became a close friend and kept expanding Nasscom until his untimely death a few years later.

Mr. N. Vittal, then the secretary, DOT, was the inaugural speaker, and I had a one-on-one meeting with him the day before. The keynote speakers included known names like Narayana Murthy of Infosys and Fakir Chand Kohli of TCS. When I spoke of the emerging trends in software, the interest was tremendous: I was mobbed by people after the talk for the rest of the conference. This was pre-Internet, and the focus was all on enterprise computing and how to leverage the client-server architecture for cost-performance.

I also spoke at Nasscom back in 2003 in Mumbai. The audience was bigger and the topics were wider. I spoke of the need for an India 2.0 where innovation is emphasized alongside services. Looking at this agenda, I do not see many new companies with innovative technologies coming out of India. The best ones are copycats such as Flipkart (Amazon-like book selling via the web to the Indian market), or several travel sites. Here in Silicon Valley, innovation is happening with a vengeance. Many exciting new breakthroughs are reshaping our industry. India seems to live on the “service” focus and is only slowly moving to product building and innovation.

Data Management, circa 2011

The world of Data Management has never been as vibrant as it is now. Only five years back, if you were to start a new database product company, the VCs would have thought you crazy. Why start something in an established market with 3 leaders: Oracle, IBM (DB2), and Microsoft (SQL Server)? Then we started to notice “specialized” appliance products such as Netezza (now IBM) and Greenplum (now EMC) crop up to focus on large-scale data analytics. This trend was soon followed by Oracle (Exadata) and now HP (Vertica).

But what I am talking about is a list of new companies, backed by well-known VCs, addressing the Data Management problems of the Internet era. We can roughly divide the data world into two: operational data management and analytic data management.

Within the operational data camp, there are three categories:

  1. Traditional RDBMS (read Oracle, DB2, SQL Server, Sybase, Ingres, MySQL, etc.) and NewSQL products addressing mostly MySQL scalability and performance issues (e.g. Clustrix, Drizzle, VoltDB, NimbleDB, MySQL Cluster, ...). I advise two companies in this category, ScaleDB and ScalArc.
  2. Traditional non-relational DBMS (Objectivity, Progress, Versant, etc.) and NoSQL, which has seen a lot of new activity. The NoSQL data management products deal with key-value stores, big tables, document data, or graph data. Examples of products include CouchBase, MongoDB, Riak, Voldemort, BerkeleyDB, Hypertable, HBase, Cassandra, GraphDB, etc. They address very large numbers of simple structures and use parallel computing for performance. Google invented the MapReduce model, which was implemented in open source as Hadoop, with HDFS as its file store.
  3. Distributed Data Grid and Cache technologies. Here Memcached came as an open-source caching framework for MySQL and PHP applications. Other solutions include Terracotta, GigaSpaces, Oracle Coherence, etc. SAP is also trying an in-memory solution called HANA.

The Analytic Data Management space has two categories:

  1. Non-relational (like Hadoop, MapR, Piccolo, Dryad, ...)
  2. Relational products like Infobright, Netezza, ParAccel, SAP Sybase IQ, Teradata, EMC Greenplum, HP Vertica, IBM Infosphere, etc. The phrase Big Data is applied here, typically to data exceeding a petabyte. Social networking sites like Facebook and Twitter are dealing with this.

I have seen the acronym SPRAIN (Scalability, Performance, Relaxed Consistency, Agility, Intricacy, and Necessity) to explain why the incumbents are inadequate to address the new challenges of unstructured data as well as Big Data.

These are exciting times for Data Management research and development.

Large Scale Hadoop Data Migration at Facebook

In 2010 Facebook had the largest Hadoop cluster in the world, with over 20PB of storage. Yes, that is 20 Petabytes, or 20,000 Terabytes, or 20 times 10 to the power of 15 bytes! By March 2011, the cluster had grown to 30 PB – that’s 3000 times the Library of Congress. As they ran out of power and space to add more nodes, a migration to a larger data center was needed. In a Facebook post Paul Yang describes how they accomplished this monumental task.

The Facebook infrastructure team chose a replication approach for this migration. Two steps were followed. First, a bulk copy transferred most of the data from the source cluster to the destination. Second, all changes since the start of the bulk copy were captured via a custom Hive plug-in that recorded the changes in an audit log. The replication system continuously polled the audit log and applied modified files to the destination. The plug-in also recorded Hive metadata changes. The Facebook Hive team developed both the plug-in and the replication system.
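
The two-step approach (bulk copy, then replay of logged changes) can be sketched as a toy simulation. Everything here is an assumption made for illustration: clusters are modelled as Python dictionaries, the audit log is a plain list of touched paths, and the function names are invented. It is not Facebook's code and does not use any Hive or Hadoop API.

```python
def bulk_copy(source):
    """Step 1: snapshot every file currently on the source cluster."""
    return dict(source)


def apply_audit_log(source, destination, audit_log, start=0):
    """Step 2: replay logged changes from position start; return new position."""
    for path in audit_log[start:]:
        if path in source:
            destination[path] = source[path]   # file created or modified
        else:
            destination.pop(path, None)        # file deleted on the source
    return len(audit_log)


# Hypothetical clusters modelled as {path: contents} dictionaries.
source = {"/warehouse/a": "v1", "/warehouse/b": "v1"}
audit_log = []   # paths touched on the source since the bulk copy began

destination = bulk_copy(source)

# Writes keep arriving on the live source after the snapshot was taken.
source["/warehouse/a"] = "v2"
audit_log.append("/warehouse/a")
source["/warehouse/c"] = "v1"
audit_log.append("/warehouse/c")
del source["/warehouse/b"]
audit_log.append("/warehouse/b")

# The replication system polls the log and brings the destination up to date.
position = apply_audit_log(source, destination, audit_log)

if __name__ == "__main__":
    print(destination == source, position)
```

Keeping the log position lets the poller be run repeatedly, applying only the changes it has not yet seen.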

Remember, the challenge here was not to interrupt the 750 million Facebook users’ 24/7 access to the data while the migration was happening. It is like changing the engine of an airplane while flying at 600 miles an hour, 36,000 feet above the ground. Quite a remarkable effort.

The big lesson learned was that developing a fast replication system proved invaluable in addressing the issues. For example, corrupt files could easily be remedied with an additional copy without affecting the schedule. The replication system also became a potential disaster-recovery solution for warehouses using Hive. Hadoop HDFS-based warehouses lack the built-in data-recovery functionality usually found in traditional RDBMS systems.

Facebook’s next challenge will be to support a data warehouse distributed across multiple data centers.

Netflix’s management of Data

Netflix is familiar to all of us. It is trying to switch to mostly video streaming, away from the physical DVD business, and the recent price increase is a reflection of that strategy. What we don’t realize is the complexity and challenge of its back-end data management.

Let us look at the load factor. Approximately 15% of US households, roughly 20 million plus, are paying subscribers, yielding about $2B in revenue in 2010. Each subscriber, together with household members, sees 100 movies a month. So 20 million times 100 is 2 billion viewing instances a month. Therefore, its database must manage billions of records.

What is the nature of the data? It’s all video-centric: critic and user reviews, and video metadata like directors, actors, title, year made, etc. Then it has to track each user’s video queue, watch history, video ratings, video playback metadata, etc. They also have to collect information on each client’s streaming device, such as an Xbox or a Blu-ray player. Critical customer information such as name, address, rental package, and credit card info is part of the highly secure data.

Netflix started with a traditional Oracle RDBMS handling its data during its early low-subscriber days. Given the load factor increase, it started to focus on a cloud strategy (to reduce its capex) and picked Amazon’s AWS for capacity planning and scale-out. It has been working on this cloud migration for the last couple of years. The PII (Personally Identifiable Information) and PCI data, which demand greater privacy and security, are kept at its own data center under Oracle. The rest of its data (in Terabytes) goes to the cloud (e.g. movie recommendations, movie metadata, ...). The cloud data store is SimpleDB from Amazon, although they have been looking at Cassandra and DataStax. So billions of records are moved to the cloud.

The nature of their data lends itself naturally to a key-value data store such as SimpleDB for extreme scale and performance. They did face several challenges in translating RDBMS concepts to a KV store. For example, the “null” (value unknown at this time) concept of an RDBMS is not handled very well in SimpleDB. The system just does not return those records containing nulls. Nor does it provide backup/recovery. There are no native data types. Netflix can also relax consistency (to eventual consistency) in favor of high availability and partition tolerance, or distributability (the A and P of CAP). The video data plus the streaming device activity log are all kept in Amazon S3 storage at a very low cost.
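
To illustrate the null problem, here is a toy sketch in plain Python of one common workaround: writing an explicit sentinel value in place of a null so the record still carries the attribute. This is an assumption for illustration only; the sentinel, the function names, and the toy select are invented here, nothing below uses the SimpleDB API, and the post does not say this is what Netflix did. Values are stored as strings to mirror the lack of native data types.

```python
NULL_SENTINEL = "__NULL__"   # hypothetical marker, not a SimpleDB feature


def to_item(row):
    """Convert a relational row to a key-value item, keeping nulls visible."""
    return {k: (NULL_SENTINEL if v is None else str(v)) for k, v in row.items()}


def from_item(item):
    """Convert a stored item back, turning the sentinel into None again."""
    return {k: (None if v == NULL_SENTINEL else v) for k, v in item.items()}


def select(items, attribute, value):
    """Toy query: only items that actually carry the attribute can match."""
    return [i for i in items if attribute in i and i[attribute] == value]


row = {"title": "Up", "rating": None}

# Naive translation: the null attribute is simply dropped from the item.
naive = {k: str(v) for k, v in row.items() if v is not None}

# Sentinel translation: the attribute is present and therefore queryable.
stored = to_item(row)

if __name__ == "__main__":
    print(select([naive], "rating", NULL_SENTINEL))
    print(select([stored], "rating", NULL_SENTINEL))
```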

Netflix is a great example of a hybrid deployment of traditional data center and cloud computing. Technical challenges are there, but they have learnt the art of “pick and adjust the most optimal solution” for high scalability and performance.

Big Data

The phrase “Big Data” is thrown around a lot these days. What exactly does it refer to? When I was part of IBM’s DB2 development team, the largest size limit of a DB2 table was 64 Gigabytes (GB), and I wondered who on earth could use a database of that size. Thirty years later, that number looks so small. Now you can buy a 1 Terabyte external drive for less than $100.

Let us start with a level set on the units of storage. In multiples of 1000, we go from Byte – Kilobyte (KB) – Megabyte (MB) – Gigabyte (GB) – Terabyte (TB) – Petabyte (PB) – Exabyte (EB) – Zettabyte (ZB) – Yottabyte (YB). The last one, YB, is 10 to the power of 24 bytes. A typed page is 2KB. The entire book collection at the US Library of Congress is 15TB. The amount of data processed in one hour at Google is 1PB. The total amount of information in existence is around 1.27ZB. Now you get some context for these numbers.
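
The unit ladder above is easy to get wrong by a few zeros, so here is a small sketch of the conversions in Python. It assumes decimal units (powers of 1000), as in the text; the function names are just for illustration.

```python
# Decimal storage units, each 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in the given decimal unit to bytes (powers of 1000)."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert between two decimal units."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)


if __name__ == "__main__":
    print(to_bytes(1, "YB") == 10 ** 24)   # a yottabyte in bytes
    print(convert(20, "PB", "TB"))         # petabytes expressed in terabytes
```

For example, 20PB works out to 20,000TB, or 20 times 10 to the power of 15 bytes.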

When we say Big Data, we enter the petabyte space (1000 Terabytes). There is talk of a “personal petabyte” to store all your audio, video, and pictures. Its cost has come down from $2M in 2002 to $2K in 2012, real Moore’s law in disk storage technology. This is not the stuff for current commercial database products such as DB2, Oracle, or SQL Server. Such RDBMSs handle a maximum of 10 to 100 Terabytes. Anything bigger would cause serious performance nightmares. These large databases are mostly in decision support and data warehousing applications. Walmart is known to have its main retail transaction data warehouse, at 100 plus terabytes, in a Teradata DBMS system.

Most of the growth in data is in “files”, not in DBMS. Now we see huge volumes of data in social networking sites like Facebook. At the beginning of 2010, Facebook was handling more than 4TB per day (compressed). Now that it has grown to 750M users, that number is at least 50% more. The new Zuck’s (Zuckerberg’s) law is, “Shared contents double every 24 months”. The question is how to deal with such volumes.

Google pioneered the programming model called MapReduce to process massive amounts of data in parallel across hundreds of thousands of commodity servers. A simple Google query you type probably touches 700 to 1000 servers to yield that half-second response time. An open-source implementation of MapReduce was developed under the Apache umbrella and released as Hadoop (created by Doug Cutting, formerly of Xerox PARC and Apple, now at Cloudera). Besides the MapReduce computational process, Hadoop has a file store called HDFS. Hadoop therefore is a “flexible and available architecture for large scale computation and data processing on a network of commodity servers”. What Red Hat is to Linux, Cloudera (a new VC-funded company) is to Hadoop.
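
The shape of a MapReduce computation can be shown with the classic word-count example. This is a single-process sketch in plain Python, with invented function names, and it does not use Hadoop's API; a real cluster would run the map and reduce steps in parallel on many servers and move the intermediate pairs across the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big Data big"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across commodity servers, which is where the scalability comes from.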

While Hadoop is becoming a de facto standard for big data, its pedigree is batch. For near-real-time analytics, better answers are needed. Yahoo, for example, has a real-time analytics project called S4. Several other innovations are happening in this area of real-time or near-real-time analytics. Visualization is another hot area for big data.

Big Data offers many opportunities for innovation in the next few years.

Importance of ILM & Data Archiving

With the noise of cloud computing rising by the day, there are basic operational issues one should not forget, cloud or no cloud. One such issue is the discipline of ILM (Information Life-cycle Management). How do you manage data over its lifetime of many years and decades? Do you keep all data current, which drastically impacts the performance of the applications using it? As everyone knows, the appetite for data is growing by leaps and bounds. Before long, a “personal petabyte” will be quite viable given the need to store audio and video. A petabyte is 1000 terabytes, a terabyte is 1000 gigabytes, and a gigabyte is 1000 megabytes. Now do the math: a petabyte is ten to the power of 15 bytes, and 1000 petabytes is one “exabyte”. Back in 2002, one petabyte would have cost $2M, whereas in 2012 (ten years later) its cost will be $2K. This is real Moore’s law in disk storage!
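
The cost figures above imply a steady halving rate, which a short calculation makes explicit. The sketch assumes a smooth exponential decline between the two quoted prices, which is a simplification.

```python
import math

COST_2002 = 2_000_000   # dollars per petabyte, figure from the text
COST_2012 = 2_000       # dollars per petabyte, figure from the text
YEARS = 10

drop_factor = COST_2002 / COST_2012          # how many times cheaper
halvings = math.log2(drop_factor)            # number of times the price halved
months_per_halving = YEARS * 12 / halvings   # assumes a smooth decline

if __name__ == "__main__":
    print(drop_factor)
    print(round(halvings, 2))
    print(round(months_per_halving, 1))
```

A thousandfold drop is just under ten halvings, so under these figures the price of a petabyte halves roughly once a year.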

Most enterprise business data is resident as structured data managed by a DBMS (e.g. Oracle or DB2). There are production databases of 100 plus terabytes, mostly in places such as Walmart’s data warehouse for retail transactions. Telcos also have huge databases for call records. With growth in size, performance degradation is normal. Hence enterprises must create a multi-tiered archiving policy. For example, current data can be in active databases for 2-3 years, followed by 2-4 years as inactive data, followed by several years as historical data. As data ages further, it can become part of cloud storage. But access is paramount even if data is stored in multiple tiers. For compliance and legal reasons, historical data should be easily accessible at high speed with smart search.
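
A multi-tiered policy like the one described can be expressed as a simple age-based rule. The tier boundaries below (3 years active, then 4 years inactive) are hypothetical values picked from within the ranges mentioned above, and the function name is invented for this sketch.

```python
from datetime import date

# Hypothetical tier boundaries in years, loosely following the example above.
ACTIVE_YEARS = 3
INACTIVE_YEARS = 4   # years spent in the inactive tier after leaving active


def tier_for(record_date, today):
    """Classify a record as active, inactive, or historical by its age."""
    age_years = (today - record_date).days / 365.25
    if age_years < ACTIVE_YEARS:
        return "active"
    if age_years < ACTIVE_YEARS + INACTIVE_YEARS:
        return "inactive"
    return "historical"


if __name__ == "__main__":
    today = date(2012, 1, 1)
    for d in (date(2011, 6, 1), date(2007, 1, 1), date(2000, 1, 1)):
        print(d, tier_for(d, today))
```

An archiving job could run such a rule periodically and move records whose tier has changed to the matching storage level.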

Another aspect of ILM is the management of copies of data. Some companies may need 8-20 copies of active data for test, development, disaster recovery, quality control, etc. A 200 GB database may end up as 1200 GB of data with six copies. Such issues are normally not reflected in planning, and IT shops get shocked when they see such numbers and the associated costs. Another area at many enterprises is the “application retirement” issue. This happens with M&A or as a precursor to a move into the cloud. This area is addressed in a very ad hoc way, resulting in unforeseen delays and cost. Any automation here should be highly welcome.

Gartner Group said this last year, “The return on the investment for implementing a structured data archiving solution is exceptionally high, especially for application retirement or when deployed for a packaged application for which vendor-supplied templates are available to ease implementation and maintenance.”

One company leading in this space is Solix (I am an adviser), which provides tools for all of the above. Their Enterprise Data Management System (EDMS) platform provides a comprehensive set of ILM tools for enterprises. Solix even introduced an appliance to ease the cost and administrative burdens for clients. The rapid adoption of Solix products is a testimony to the growing importance of data archiving, application retirement, data masking, and test data management.

ILM should be a well-thought-out discipline at every IT organization.

HP’s software moves

HP, the giant technology company with revenues of $130B, is moving aggressively into software, judging by its recent actions. First it acquired Vertica, a column-based database company. Then the new CEO announced the wider adoption of WebOS in all its PCs, tablets, and mobile devices. He also outlined a cloud strategy for HP.

Let us look at a bit of history. HP has not really focused on a comprehensive software strategy. Its product OpenView is a good one in the systems management arena, competing with IBM’s Tivoli and CA’s Unicenter products. Then it acquired OpsWare and Mercury Interactive, both in the data center infrastructure management space. It has no products in the DBMS or middleware space, nor in the EDW (Enterprise Data Warehouse) or BI space. Its brave attempt to create a Teradata-killer called Neoview (based on the ancient Tandem NonStop SQL technology) did not go anywhere. The acquisition of Vertica pretty well put an end to Neoview. In the meantime, the “stack” war has started in full swing, with IBM, Oracle, and Microsoft being the key players. SAP is also trying to be a wider player with its acquisition of Sybase and Business Objects.

Add to this the emergence of the Data Warehouse appliance for specialized data-intensive analytics workloads. IBM acquired Netezza and EMC acquired Greenplum. Oracle is combining the Oracle DBMS with Sun servers to come up with the Exadata appliance. In that context, HP’s acquisition of Vertica to become part of an appliance is quite predictable.

HP’s partnership with Oracle is coming to an end, now that Oracle has announced an end to its support for Intel’s Itanium chip (used in HP servers). That leaves HP with two strong partners, Microsoft and SAP. The Microsoft partnership is now under threat from the WebOS announcement. The SAP partnership is also affected by the Vertica solution, which competes with the Sybase IQ product. But it was HP, a few years ago, who coined the phrase “co-opetition” (competing and co-operating at the same time).

HP does not have a clear road map yet in software. What appears so far is a fragmented approach of bits and pieces. But I am sure there is a grand plan being worked out, and let us hope that this time around HP will come up with a comprehensive set of software products to fight IBM and Oracle.