Category Archives: Big Data

Yahoo – the CEO drama continues

Four CEOs in five years! Yahoo was a symbol of innovation and success in its first few years of life. Founded by two Stanford Ph.D. students (Yang and Filo), Yahoo defined the Internet era of communities and sharing. It still has an enviable community using various services like email, finance, news, etc. It has lost much advertising dollars to Google. Terry Semel came from Hollywood and wanted to make it a media company. That did not work. Terry flew in every week on a private jet from LA to San Francisco and was driven in a limo to work every day. His compensation was way higher than many other CEOs at similar valley companies. Jerry Yang returned as CEO for the second time and botched up a lucrative offer from Microsoft. Yang was no Steve Jobs on his second return to the company he founded. He turned down the Microsoft offer to buy Yahoo at $47 per share (current share is $15.50). There was indeed an “identity” crisis at Yahoo.

Then came Carol Bartz and she talked tough and tried to straighten out the confusion, but results did not show any positive impact. She was let go last year, after being fired over phone from the board chairman. Then the board picked Scott Thompson, a well-reputed executive from Paypal (eBay) to head the company just a few months ago. He started reducing redundancy and bring clarity to Yahoo’s core business. He let go 2000 employees recently. Several top skills left the company. As he was settling in, came the news that his resume had information on his degree that is not right. Most likely that error existed for a while, but one disgruntled investor questioned his integrity and the board on not doing due diligence before hiring him as CEO. Yahoo and Scott did a poor job responding to this and the result was his departure yesterday.

With all the business problems at Yahoo and its anaemic growth, a strong leader is needed to refocus the company on what it is best at – innovating new solutions for keeping the community loyal. After all, Yahoo gave many technologies such as Hadoop and HDFS in managing Big data. Without a strong execution-oriented CEO,  it will fade away like many dot-com era companies.

It is hard to believe that Yahoo was once valued at $100B (current valuation $18.9B).

Closer look at one NoSQL database – MongoDB

Among the new crop of NoSQL database products, MongoDB ranks quite high, in my opinion. The company that produces MongoDB is 10Gen, a venture backed new start-up since 2008. But its rapid growth over last 4 years bears testimony to its technical strength.

MongoDB’s name comes from the middle five letters of the word “humongous”, meaning big data. It is an open-source, document-oriented storage which is schema-free and can entertain dynamic queries with full indexing. The programming model is BSON – binary encoding of JSON (Javascript Object Notation), a lightweight text-based open standard designed for data interchange. Douglas Crawford of Yahoo invented JSON in 2006.

The other key tenet of MongoDB is its scalability architecture – it can scale out horizontally using its automatic “sharding” (or keyrange partitioning). It does provide master-slave or peer-to-peer replication for high availability, recovery, and performance. One of its customers Disney’s Interactive Media Group, for example, has 1400 instances of Mongo. It uses sharding for write performance and replication for read performance.

MongoDB can be deployed from the cloud via Amazon’s AWS. Their revenue model is via support services, training, and consulting. Partners include VMWare, Amazon, Redhat, etc. – all cloud platform providers offering MongoDB as an option to their clients. Although the database suits document storage the best, it can handle other unstructured data like video, and images. But initial thrust seems to be those customers looking for high scalability using commodity hardware and superior performance.

MongoDB claims over 400 customers, including many internet companies like FourSquare, Craigslist, etc. Several textbooks have been published on MongoDB and the development community is growing fast. It certainly bridges the gap between traditional RDBMS (Oracle, MySQL, SQL Server, DB2) at one end and Key-Value pair search engines (Riak, Cassandra, Voldemart,..) at the other end.

The Fourth Paradigm in Science

We all remember the late Jim Gray, the great computer scientist and Turing award winner. During the last several years of his research work at Microsoft, he focused on data-intensive computing and called it the Fourth Paradigm in scientific discovery. In a special book dedicated to the memory of Jim, Bill Gates commented, “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.”

So what is the Fourth Paradigm? Here is the explanation.

1. Thousand years ago – Experimental Science
– Description of natural phenomena
2. Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
3. Last few decades – Computational Science
– Simulation of complex phenomena
4. Today – Data-Intensive Science (unify theory, experiment, & simulation)

Scientists are overwhelmed with data sets from many different sources such as data captured by instruments, data generated by simulations, and data generated by sensor networks.

Jim Gray named it “eScience’ where IT (Information Technology) meets Science. It is the set of tools and technologies to support data federation and collaboration for analysis, data mining, data visualization and exploration, and for scholarly communication and dissemination. He laid out the principles, fondly called Gray’s law of data engineering:

  • —Scientific computing is revolving around data
  • —Need scale-out solution for analysis
  • —Take the analysis to the data!
  • —Start with “20 queries”
  • —Go from “working to working”

Interestingly, all these apply to the commercial world of Big Data. Only the scientific world has been grappling with these problems longer. Given the proliferation of devices and incoming data in petabytes, the need for tools to do analytics is of the highest priority. No wonder, 2012′s biggest buzzword is Big Data.

We miss you Jim and your pioneering thoughts on DISC (Data Intensive Scalable Computing)!

Revisiting “Big Data”

Big Data is a top technology trend for 2012 according to Forrester Research. The Economist said that Big Data is a new game changing asset and The Harvard Business Review termed it as a scientific revolution. Scientific Revolution? Because it is data-intensive computing to unify, theorize, experiment, and do simulation at scale.

It is also termed the Fourth Paradigm – “The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.”

Big Data is when the size of the data itself becomes part of the problem. But Big Data is not just “big”. There are the 3V’s of Big Data:

  1. Volume – Terabyte records, transactions, tables, files. A Boeing Jet engine spews out 10TB of operational data for every 30 minutes they run. Hence a 4-engine Jumbo jet can create 640TB on one Atlantic crossing. Multiply that to 25,000 flights flown each day and you get the picture.
  2. Velocity – batch, near-time, real-time, streams. Today’s on-line ad serving requires 40ms to respond with a decision. Financial services need near 1MS to calculate customer scoring probabilities. Stream data, such as movies, need to  travel at high speed for proper rendering.
  3. Variety – structures, unstructured, semi-structured, and all the above in a mix. WalMart processes 1M customer transactions per hour and feeds information to a database estimated at 2.5PB (petabytes). There are old and new data sources like RFID, sensors, mobile payments, in-vehicle tracking, etc.

Because of these characteristics, traditional DBMS solutions are inadequate. Hence we have seen the growth of technologies such as Hadoop (map-reduce algorithm started at Google) mostly processing unstructured data in batch mode. New solutions are needed for realtime processing.

See my blog from last year on this subject.

Data Management, circa 2011

The world of Data Management has never been this vibrant as now. Only five years back, if you were to start a new database product company, the VC’s would have thought you to be real crazy. Why start something in an established market with 3 leaders – Oracle, IBM (DB2), and Microsoft (SQL Server)? Then we started to notice “specialized” appliance products such as Netezza (now IBM) and Greenplum (now EMC) crop up,  to focus on large scale data analytics. This trend was soon followed by Oracle (Exadata) and now HP (Vertica).

But what I am talking about is a list of new companies backed by well-known VC’s addressing the Data Management problems of the Internet era. We can roughly divide the data world into two – operational data management and analytic data management

Within the operational data camp, there are three categories:

  1. Traditional RDBMS (read Oracle, DB2, SQL Server, Sybase, Ingres, MySQL,etc.) and NewSQL products addressing mostly MySQL scalability and performance issues (e.g. Clustrix, Drizzle, VoltDB, NimbleDB, MySQL Cluster,..). I advise two companies in this category, ScaleDB and ScalArc.
  2. Traditional non-relational DBMS (Objectivity, Progress, Versant, etc.) and NoSQL which has seen a lot of new activities. The NoSQL data management products deal with key-value store, or the big table, or a document data, or a graph data. Examples of products include CouchBase, MongoDB, Riak, VoldeMart, BerkeleyDB, Hypertable, HBase, Cassandra, GraphDB, etc. They address very large number of simple structures and use parallel computing for performance. Google invented Map-Reduce algorithm that has become the Hadoop open source with HDFS as its file base.
  3. Distributed Data Grid and Cache technologies. Here Memcached came as an open source caching framework for MySQL and PHP applications. Other solutions include Terracotta, GigaSpaces, Oracle Coherence, etc.  SAP is also trying in-memory solution called Hana.

The Analytic Data Management space has two categories

  1. Non-relational (like Hadoop, Mapr, Piccolo,Dryad, ..)
  2. Relational products like Infobright, Netezza, ParAccel, SAP Sybase IQ, Teradata, EMC Greenplum, HP Vertica, IBM Infosphere, etc. The phrase Big Data is applied here, typically exceeding a petabyte. Social networking sites like Facebook and Tweeter are dealing with this.

I have seen the acronym SPRAIN (Scalability, Performance, Relaxed Consistency, Agility, Intricacy, and Necessity) to explain why the incumbents are inadequate to address the new challenges of unstructured data as well as Big Data.

These are exciting times for Data Management research and development.

Large Scale Hadoop Data Migration at Facebook

In 2010 Facebook had the largest Hadoop cluster in the world, with over 20PB of storage. Yes, that is 20 Petabytes or 20,000 Terabytes or  20 to the power of 15 bytes!  By March 2011, the cluster had grown to 30 PB – that’s 3000 times the Library of Congress. As they ran out of power and space to add more nodes, a migration to a larger data center was needed. In a Facebook post Paul Yang describes how they accomplished this monumental task.

The Facebook infrastructure team chose a replication approach for this migration. Two steps were followed. First, a bulk copy transferred most of the data from the source cluster to the destination. Second all changes since the start of the bulk copy were captured via a cluster Hive plug-in that recorded the changes in an audit log. The replication system continuously polled the audit log and applied modified files to the destination. The plug-in also recorded Hive metadata changes. The Facebook Hive team developed both the plug-in and the replication system.

Remember, the challenge here is not to interrupt the 750 million  Facebook users access to the data 24/7, while the migration was happening. It is like changing the engine of an airplane while flying 600 miles an hour at 36000 feet above ground.  Quite a remarkable effort.

The big lesson learned was to develop a fast replication system that proved invaluable in addressing the issues. For example, corrupt files could easily be remedied with additional copy without affecting the schedule. The replication system became a potential disaster-recovery solution for warehouses using Hive. The Hadoop HDFS-based warehouses lack built-in data-recovery functionality usually found in traditional RDBMS systems.

Facebook’s next challenge will be to support a data warehouse distributed across multiple data centers.