Netflix’s Management of Data

Netflix is familiar to all of us. It is trying to move mostly to video streaming and away from its physical DVD business, and the recent price increase reflects that strategy. What many of us don’t realize is how complex and challenging its back-end data management is.

Let us look at the load factor. Approximately 15% of US households, roughly 20 million plus, are paying subscribers, yielding about $2B in revenue in 2010. Each subscriber, together with household members, watches 100 movies a month, so 20 million times 100 comes to 2 billion viewing instances a month. Its database must therefore manage billions of records.
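The arithmetic above can be checked with a short back-of-the-envelope sketch. The figures are the ones quoted in this post; the variable names and the yearly projection are only illustrative assumptions.

```python
# Back-of-the-envelope load estimate using the figures quoted in the post.
subscribers = 20_000_000       # roughly 20 million paying households
views_per_subscriber = 100     # movies watched per household per month (the post's figure)

# One record per viewing instance.
monthly_view_records = subscribers * views_per_subscriber

# Assumption for illustration: the monthly rate stays flat for a year.
yearly_view_records = monthly_view_records * 12

print(f"{monthly_view_records:,} viewing records per month")
print(f"{yearly_view_records:,} viewing records per year")
```

Even before counting ratings, queues, and playback metadata, watch history alone reaches the billions within a single month.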

What is the nature of the data? It is all video-centric: critics’ and users’ reviews, and video metadata such as directors, actors, title, and year made. Netflix also has to track each user’s video queue, watch history, video ratings, and video playback metadata. It also collects information about each client’s streaming device, such as an Xbox or a Blu-ray player. Critical customer information such as name, address, rental package, and credit card details is part of the highly secure data.

Netflix started with a traditional Oracle RDBMS to handle its data during its early, low-subscriber days. As the load grew, it began to focus on a cloud strategy (to reduce its capex) and picked Amazon’s AWS for capacity planning and scale-out. It has been working on this cloud migration for the last couple of years. The PII (personally identifiable information) and PCI (payment card) data, which demand greater privacy and security, are kept in its own data center under Oracle. The rest of its data (terabytes of it), such as movie recommendations and movie metadata, goes to the cloud. The cloud data store is Amazon’s SimpleDB, although Netflix has been looking at Cassandra and DataStax. So billions of records are moved to the cloud.
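To make the hybrid split concrete, here is a minimal sketch of how a single record might be divided between the on-premises store and the cloud store. It is not Netflix’s actual code: the field names, the `customer_id` key, and the `split_record` function are all hypothetical, chosen only to illustrate the idea of keeping sensitive fields at home.

```python
from typing import Any, Dict, Tuple

# Hypothetical field names; the post only says that name, address,
# rental package, and credit card details count as highly secure data.
SENSITIVE_FIELDS = {"name", "address", "rental_package", "credit_card"}
KEY_FIELD = "customer_id"  # assumed join key, kept on both sides

def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (on_prem_part, cloud_part) for one customer record."""
    on_prem = {KEY_FIELD: record[KEY_FIELD]}
    cloud = {KEY_FIELD: record[KEY_FIELD]}
    for field, value in record.items():
        if field == KEY_FIELD:
            continue
        # Sensitive fields stay in the data center; everything else goes to the cloud.
        target = on_prem if field in SENSITIVE_FIELDS else cloud
        target[field] = value
    return on_prem, cloud

sample = {
    "customer_id": "c-001",
    "name": "Jane Doe",
    "credit_card": "4111-XXXX-XXXX-1111",
    "queue": ["movie-17", "movie-42"],
    "device": "Xbox",
}
on_prem_part, cloud_part = split_record(sample)
print(on_prem_part)
print(cloud_part)
```

Keeping the key on both sides is what lets the two halves be joined again later, at the cost of a lookup across the data center and the cloud.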

The nature of their data lends itself naturally to a key-value data store such as SimpleDB for extreme scale and performance. They did face several challenges in translating RDBMS concepts to a KV store. For example, the RDBMS concept of “null” (value unknown at this time) is not handled well in SimpleDB: the system simply does not return records containing nulls. SimpleDB also provides no built-in backup and recovery, and it has no native data types. Netflix can also give up strict consistency, settling for eventual consistency, in favor of high availability and partition tolerance (the A and P of the CAP theorem). The video data and the streaming-device activity logs are all kept in Amazon S3 storage at a very low cost.
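One common way to work around missing nulls and missing data types in a string-only key-value store is to encode values before writing them. The sketch below runs against a plain in-memory dictionary rather than SimpleDB itself, and it is not confirmed to be what Netflix did: the `NULL_SENTINEL` placeholder, the `PAD_WIDTH` value, and the helper names are all assumptions made for illustration.

```python
# Hypothetical encoding helpers for a store that keeps every attribute
# as a string and compares values lexicographically.
NULL_SENTINEL = "__NULL__"  # assumed placeholder; any reserved string would do
PAD_WIDTH = 10              # assumed width, wide enough for the largest expected integer

def encode(value):
    """Turn a Python value into a string the store can hold and compare."""
    if value is None:
        return NULL_SENTINEL          # store an explicit marker instead of nothing
    if isinstance(value, int):
        if value < 0:
            raise ValueError("this sketch only handles non-negative integers")
        return str(value).zfill(PAD_WIDTH)  # zero-pad so string order matches numeric order
    return str(value)

def decode_int(text):
    """Reverse encode() for integer attributes; the sentinel becomes None."""
    return None if text == NULL_SENTINEL else int(text)

# Without padding, string comparison puts "10" before "9".
naive_order = sorted(["9", "10"])
padded_order = sorted([encode(9), encode(10)])

# A dictionary stands in for the key-value store.
store = {
    "movie-1": {"year": encode(1999)},
    "movie-2": {"year": encode(None)},  # year unknown at this time
}
unknown_year = [key for key, attrs in store.items() if attrs["year"] == NULL_SENTINEL]

print(naive_order, padded_order, unknown_year)
```

Because the unknown value is stored as an explicit marker, a record with a missing year can still be found by a query on that attribute instead of silently dropping out of the results.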

Netflix is a great example of a hybrid deployment that combines a traditional data center with cloud computing. Technical challenges remain, but the company has learned the art of picking and adjusting the best-fitting solution for high scalability and performance.
