What does the data landscape look like at Facebook with its 1.3 billion users across the globe? They classify small data referring to OLTP-like queries that process and retrieve a small amount of data, usually 1-1000 objects requested by their id. Indexes limit the amount of data accessed during a single query, regardless of the total volume. Big data refers to queries that process large amounts of data, usually for analysis: trouble-shooting, identifying trends, and making decisions. The total volume of data is massive for both small and big data, ranging from hundreds of terabytes to hundreds of petabytes on disk.
The heart of Facebook’s core data is TAO (The Association of Objects) – distributed data store for the social graph. The workload on this is extremely demanding. Every time any one of over a billion active users visits Facebook through a desktop browser or on a mobile device, they are presented with hundreds of pieces of information from the social graph. Users see News Feed stories; comments, likes, and shares for those stories; photos and check-ins from their friends — the list goes on. The high degree of output customization, combined with a high update rate of a typical user’s News Feed, makes it impossible to generate the views presented to users ahead of time. Thus, the data set must be retrieved and rendered on the fly in a few hundred milliseconds. TAO provides access to tens of petabytes of data, but answers most queries by checking a single page in a single machine. The challenge here is how to optimize for mobile devices which have intermittent connectivity and higher network latencies than most web clients.
Big data stores are the workhorses for data analysis and they grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores are: 1) ODS (Operational Data Store) has 2 billion time series of counters (used mostly in alerts and dashboards) and processes 40000 queries per second. 2) Scuba is the fast slice-and-dice data store with 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second scanning 100 billion rows per second. 3) Hive is the data warehouse with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
What is interesting to note here is that ODS, Scuba, and Hive are not traditional relational databases. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage and retrieval. Instead, challenge arises from high data insertion rates and massive data quantities.
Facebook represents the new generation Big Data user demanding innovative solutions not seen in the traditional database world. A big challenge is how to accommodate the seismic shift to mobile devices with their peculiar intermittent network connectivity.
No wonder, Facebook hosted a data faculty summit last September with many known academic researchers from around the country to find answers to its future data challenges.