Data Landscape at Facebook

What does the data landscape look like at Facebook, with its 1.3 billion users across the globe? Facebook uses the term “small data” for OLTP-like queries that process and retrieve a small amount of data, usually 1-1000 objects requested by their id. Indexes limit the amount of data accessed during a single query, regardless of the total volume. “Big data” refers to queries that process large amounts of data, usually for analysis: troubleshooting, identifying trends, and making decisions. The total volume of data is massive for both small and big data, ranging from hundreds of terabytes to hundreds of petabytes on disk.
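The small-data versus big-data distinction can be made concrete with a toy sketch. Everything here is hypothetical and is not Facebook's code: the in-memory `objects` dictionary stands in for an indexed store, and `get_by_ids` and `count_by_type` are made-up helpers. The point is only that an id lookup touches a few objects, while an analytical query touches all of them.

```python
# Toy illustration only: a dict stands in for an indexed object store.
objects = {
    i: {"id": i, "type": "photo" if i % 3 == 0 else "post"}
    for i in range(10_000)
}

def get_by_ids(store, ids):
    """Small-data style: fetch a handful of objects by id via the index."""
    return [store[i] for i in ids if i in store]

def count_by_type(store):
    """Big-data style: scan every object to compute an aggregate."""
    counts = {}
    for obj in store.values():
        counts[obj["type"]] = counts.get(obj["type"], 0) + 1
    return counts

few = get_by_ids(objects, [7, 42, 9_999])   # touches 3 objects
totals = count_by_type(objects)             # touches all 10,000 objects
```

The lookup cost stays the same however large the store grows, while the scan grows with the total volume.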

The heart of Facebook’s core data is TAO (The Associations and Objects), a distributed data store for the social graph. The workload on it is extremely demanding. Every time any one of over a billion active users visits Facebook through a desktop browser or on a mobile device, they are presented with hundreds of pieces of information from the social graph. Users see News Feed stories; comments, likes, and shares for those stories; photos and check-ins from their friends; and the list goes on. The high degree of output customization, combined with the high update rate of a typical user’s News Feed, makes it impossible to generate the views presented to users ahead of time. Thus, the data set must be retrieved and rendered on the fly in a few hundred milliseconds. TAO provides access to tens of petabytes of data, but answers most queries by checking a single page on a single machine. The challenge here is how to optimize for mobile devices, which have intermittent connectivity and higher network latencies than most web clients.

Big data stores are the workhorses for data analysis. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores are: 1) ODS (Operational Data Store), which has 2 billion time series of counters (used mostly in alerts and dashboards) and processes 40,000 queries per second. 2) Scuba, the fast slice-and-dice data store with 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second. 3) Hive, the data warehouse with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.

What is interesting to note here is that ODS, Scuba, and Hive are not traditional relational databases. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage and retrieval. Instead, the challenge arises from high data insertion rates and massive data quantities.

Facebook represents the new generation Big Data user demanding innovative solutions not seen in the traditional database world. A big challenge is how to accommodate the seismic shift to mobile devices with their peculiar intermittent network connectivity.

No wonder Facebook hosted a data faculty summit last September with many well-known academic researchers from around the country to find answers to its future data challenges.

Industry Bifurcation?

Apple reported record sales in the most recent quarter, thanks to its upgraded iPhone line. But it was almost alone among the big technology firms in doing so. Profits reported for the same period have fallen at Google as well as at IBM, SAP and VMware.

IBM has done more financial engineering than the real kind in recent years, with its $100 billion buyback of its own shares since 2000. It has shed less profitable assets, but now lacks a big fast-growing business. It even started layoffs in India recently. The earnings at VMware dropped because of a recent acquisition. Some firms are having difficulty shifting to new trends such as cloud computing. SAP and Oracle are seeing more of their business shifting to the cloud. That requires big investments in data centers and yields lower margins, at least initially. Workday, a cloud-based HR products company, has been enticing existing customers of SAP and Oracle with its low cost and better usability.

Google, for example, is hit by the shift of users to mobile devices, where advertising rates are lower. Apple and even Yahoo are taking advantage of the shift through mobile-advertising sales (Yahoo got 17% of its revenue of $1.1 billion from these sales in the past quarter). More fundamentally, the IT industry is maturing and seeing slow revenue growth of only 3% (according to Boston Consulting Group). The biggest sectors such as hardware, business software, and IT services are growing slowly or even shrinking.

This is the “bifurcation” (between new emerging trends and old traditional segments) which will lead to a big restructuring of the industry. HP recently talked about breaking up its business into separate entities. IBM continues to shed unprofitable businesses. Others are buying their way in: SAP is acquiring Concur, a web-based travel and expense management company (for $8.3 billion!). Oracle continues to acquire more and more companies with cloud technology to grow its cloud business. During Oracle OpenWorld in late September, the whole theme centered around cloud and, to some degree, mobility.

The recent disappointing results are another harbinger of an unbundling and rebuilding of the IT industry. It is hard to tell what the new landscape will look like at the end of the process. But Apple continues to prove that bold innovations and winning users’ mindshare can bring big rewards. Older incumbents must learn a few lessons from Apple’s success.

Fast Data vs. Big Data – how to combine?

Today, all the discussion on Big Data centers around “static data” in a data lake (the old Data Warehouse) accessed by BI tools or SQL on Hadoop (HAWQ, Impala) or Map/Reduce algorithms (MapR) for analysis. This is looking at historical data and finding trends. Some new tools are trying to provide predictive analysis based on past trends. This area deals mostly with the volume and variety aspects of Big Data, but not with velocity, that is, “data in motion”.

The term “Fast Data” is applied to data that is in motion. This component is getting more and more significant as there is a constant stream of data coming from edge devices such as sensors, smart phones and connected devices. As the number of these devices explodes (10 billion now, heading toward 50 billion in a few years, according to market analysis), there will be a data explosion that current Big Data products and tools will not address. What is needed is the capture of this data at ingestion points, efficient storage and management, and real-time analytics for faster decisions. Streaming data has been around for a while, but we are talking about two-way sensors where constant feedback and aggregation are needed. For example, smart meters in the utilities industry can provide readings from individual homes, but aggregating them at the transformer level is important to predict seasonality and other trends. With fast data, things that were not possible before become achievable: instant decisions can be made on real-time data to drive sales, connect with customers, inform business processes, and create value.
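The smart meter example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the `METER_TO_TRANSFORMER` mapping, the 60-second tumbling window, and the `aggregate_by_transformer` function are all hypothetical names chosen for the example, and a real deployment would use a stream processor rather than a Python loop.

```python
from collections import defaultdict

# Hypothetical mapping of smart meters to the transformer that feeds them.
METER_TO_TRANSFORMER = {"m1": "t1", "m2": "t1", "m3": "t2"}

WINDOW_SECONDS = 60  # assumed tumbling-window size

def aggregate_by_transformer(readings):
    """Sum kWh per (transformer, window start) from a stream of readings.

    Each reading is a (timestamp_seconds, meter_id, kwh) tuple.
    """
    totals = defaultdict(float)
    for ts, meter_id, kwh in readings:
        transformer = METER_TO_TRANSFORMER.get(meter_id)
        if transformer is None:
            continue  # skip meters that cannot be placed on a transformer
        window_start = ts - (ts % WINDOW_SECONDS)
        totals[(transformer, window_start)] += kwh
    return dict(totals)

stream = [
    (0, "m1", 1.5),
    (10, "m2", 2.0),
    (30, "m3", 0.5),
    (65, "m1", 1.0),
    (70, "mX", 9.0),  # unknown meter, ignored
]
result = aggregate_by_transformer(stream)
```

Rolling individual home readings up to the transformer as they arrive is what lets trends be spotted before the data ever lands in the data lake.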

Fast data is the payoff for big data. While much can be accomplished by mining data to derive insights that enable a business to grow and change, looking into the past provides only hints about the future. Simply collecting vast amounts of data for exploration and analysis will not prepare a business to act in real time, as data flows into the organization from millions of endpoints. The IoT (Internet of Things) makes this much more significant.

Enterprises have to figure out a combined architecture for Fast Data as well as Big Data. Streams of data from edge devices will eventually migrate to the data lake, but much real-time analysis has to happen before that. Technologies such as in-memory databases and complex event processing are needed to handle the performance. This space is still new, and much more work is needed in the area of real-time analytics. The older OLTP systems will be inadequate to handle the demands of ingestion and analysis at an affordable cost.

It is time to look at the world of data with a wider lens than just Hadoop-centric Big Data!

India 2.0 – many startups and big funding rounds

Way back in 2003, when I gave a talk at Nasscom’s annual conference in Mumbai, I remember asking, “When will India 2.0 begin?”, meaning the start of innovative products from India.

India 1.0 mostly started around the Y2K era, when lots of programming work was needed in the US and India stepped up to provide the manpower. This saw an unprecedented growth of the Indian software services industry led by TCS, Infosys, Wipro, HCL, Cognizant, etc. Such success can also be a deterrent to an innovation and product development culture, as service revenue becomes a quick road to faster ROI. At the same time, new innovation in software products is hard and requires a different mindset and skills, something no one was interested in investing in. But that is changing now.

During a trip to India recently, I saw several signs of change. Masayoshi Son of SoftBank, Japan announced a sizable investment of $627M in startups like Snapdeal and Ola (an Uber-like taxi service). Mr. Son even made a statement that the next Jack Ma (founder-CEO of Alibaba, which had its IPO recently) may come from India. Nikesh Arora, vice chairman at SoftBank (formerly from Google), announced that he is investing his own personal money in Snapdeal along with other known names like Ratan Tata.

Then there are the Bansals, who founded Flipkart, an Amazon-like shopping site which saw a $1B investment a few months back. Not to be left behind, Jeff Bezos visited India and announced a $2B investment in Amazon India. For dramatic impact, he rode on an open truck holding a big check for the $2B investment he was committing. He announced several large warehouses in Gurgaon and other Indian cities to meet the demands of India’s Amazon users. Myntra, another Indian start-up, was recently acquired by Flipkart.

Last week, I was having dinner with a friend of mine who works at a venture fund in Bangalore. He confirmed the increased activity in venture funding, which includes the valley VC funds like Matrix Partners, Accel, Norwest Venture Partners, etc. Last week there was an event in Bangalore called Nasscom Conclave where several startups and VCs networked (1700 attended), including many visitors from here. Besides the funding, new graduates from top schools like the IITs are also joining start-ups rather than the service companies they used to be attracted to in the past.

Internet shopping sites like Snapdeal, Flipkart, etc. are popular because of the convenience factor. Consumers do not want to fight the traffic in big cities to go shopping in person, and that includes groceries. Taxi fares in Bangalore are in steady decline due to Uber- and Ola-type services. India is finally catching up on Internet e-commerce in a serious way. Facebook’s second largest market is India. So it is no wonder that Mark Zuckerberg and Jeff Bezos, on recent visits to India last month, met with the prime minister, a big supporter of new technology.

Welcome the beginning of India 2.0!

Intimate Symbiosis – the Apple Watch

A recent Time magazine essay on the Apple Watch said, “The Apple Watch represents a redrawing of the map that locates technology in one place and our bodies in another… The Apple Watch signals the advent of an “always-on Internet”, an Internet that can not be put away. We are used to dabbling just our fingertips in the Internet, but the Apple Watch doesn’t stop there. It tracks your movements. It listens to your heartbeat. It puts your whole body on-line”.

What we saw this week at Apple’s big announcement is the ushering in of a new era via its Apple Watch. Our watches have been dumb; they just show the time. Now they are going to do much more. Apple even made sure of the accuracy of the time, to within milliseconds, something the other smartwatch products never emphasized. Also, notice the absence of the “i” in the Apple Watch name, which signifies a post-Jobs creation: a beautiful product built by a whole team headed by Jony Ive, the chief designer.

From a business point of view, it is sheer genius. Given the flat growth of iPhone sales, it is clever that the Watch requires an iPhone as a prerequisite. Since it supports the iPhone5, there are right away 200 million loyal iPhone users who can embrace the Apple Watch, plus the new buyers of the iPhone6 family. The user interface is full of clever innovations. Notice the “digital crown”, the small sensor-filled nub on the side of the watch used to scroll the screen, zoom in and out, and navigate home. The screen is not grid-like, but shows small circles for the available apps. Siri, the voice-activated digital assistant, will also be handy.

The Apple Watch has a rich feature set. It makes calls like a phone. It handles text messages and emails, though because of the tiny screen, reading is a lot easier than writing. Users can send one another small drawings that animate and then disintegrate after a few seconds. They can also send their heartbeats to each other. Double tapping on the screen sends a gentle nudge to a nearby friend, like a light tap on the wrist. In practice it’s silly, ephemeral and lovely. The Watch also supports the usual iPhone and iPad apps: weather, stocks, Passbook, photos, maps, and calendar (you need a link to a nearby iPhone for GPS and Internet connectivity). Crucially, it supports Apple’s new wireless payment system, a major play in its own right.

But how much personal stuff do we want to broadcast? Yes, it’s intimate, a word used frequently by Tim Cook. But it brings new challenges of exposing our behavior online.

The Time essay concluded, “Once you are O.K. with wearing technology, the only way forward is inward: the next product launch after the Apple Watch would logically be the iMplant. If Apple succeeds in legitimizing wearables as a category, it will have established the founding node in a network that could spread throughout our bodies, with Apple setting the standards. Then we’ll really have to decide how much control we want – and what we’re prepared to give up for it.”

Big Day at Apple – Sept 9, 2014

Today, Apple made significant announcements at the historic Flint Center in Cupertino, where exactly 30 years ago, the brand new Apple Macintosh was introduced by Steve Jobs. It is worthwhile to watch the entire event (almost 1.5 hours, culminating with the band U2 playing on stage and releasing a new album on iTunes for free download).

Three key announcements were made: the new iPhone6 and iPhone6Plus, Apple Pay, and Apple Watch.

The new iPhone6 and 6Plus offer bigger screen sizes, as expected, with a new chip plus new iOS8 software. Screen size went from 4″ (iPhone5) to 4.7″ and 5.5″. Displays are sharper and the camera is much more powerful. Enhanced video, longer battery life, and many additional features make this quite a move forward. The processor is almost 50x faster than the original iPhone’s, whereas the graphics performance is 84x faster. Clearly the gaming business will be attractive for the iPhone6Plus. The phones are thinner, with more rounded edges. Since the iPhone is its largest revenue product, Apple clearly has worked hard at these improvements.

Apple enters the digital payment market for the first time with its new product, Apple Pay. A special secure chip is introduced along with NFC (Near Field Communication), only in the iPhone6 and 6Plus. Credit cards will be stored securely, and future purchases can be made by just bringing the iPhone close to a payment device, which is far easier than swiping the plastic as we do today (a technology that is five decades old and prone to theft and fraud). Many stores have signed up already: McDonald’s, Panera Bread, Whole Foods, Starwood Hotels (even room doors can be unlocked by iPhone), Target, Disney, etc. The much discussed concept of a digital wallet seems viable now using Apple Pay.

The Apple Watch is an incredible product that brings a lot of innovations. Given the small real estate of this watch, Apple has introduced a circular side button for easy movement of contents. The Watch will provide many functions, from messaging to calendar alerts, email, Twitter feeds, Facebook friends, maps, Siri, and more. The most attractive features will be health-related: steps taken, calories burnt, miles walked, heart rate, etc. Health and fitness wearables such as Fitbit and the Nike FuelBand have been far more limited compared to what Apple has packed into the Watch. This will set the bar for others to imitate, much like what the iPhone did initially.

Today’s event certainly proved the incredible innovation power of Apple!

The NoSQLNow conference in San Jose this week

I attended the NoSQLNow conference this week at the San Jose Convention Center. The organizers claimed there were 800 attendees, clearly many more than in the last couple of years. Given the number of sessions, exhibits, speakers and attendees, the interest in newer data management products and solutions (aka Big Data) has been growing fast.

I spoke at a session titled “Are NoSQL databases ready for the enterprise? Examples of MongoDB deployment”, which was well attended. I also participated in a panel on “enterprise adoption of cloud”. My co-panelists were from Oracle and NeoDB. The conference opening session was given by one of the co-hosts, Dan McCreary, who spoke about the state of NoSQL. He mentioned that a total of $2.4B has been invested in NoSQL DB companies over the last couple of years: MongoDB ($231M), Couchbase ($116M), Aerospike ($22M), Basho ($32.5M), DataStax ($83.7M), Clustrix ($59.3M), FoundationDB ($22.3M), etc. Even a big player like Intel has invested in Cloudera.

Here are some new trends in the NoSQL world:

  • Hadoop is starting to move from batch to real time and streaming
  • Real time systems are adding Hadoop integration points
  • Storm (Twitter) and Spark are addressing data streaming
  • Spark/Scala is popular on multiple systems
  • MongoDB is the big leader in NoSQL operational systems based on the document data model, followed by DataStax and Couchbase

The market pressures, according to Dan, point to:

  • Big Data & Predictive analytics
  • Internet of Things (time series data and log files)
  • Security for highly regulated areas like finance/banking, healthcare, and the government
  • Streaming data
  • Keeping the operational cost low (bye bye to license fees)
  • High Availability (move away from master-slave to clusters of peer to peer networks)

There are other trends as well: old-school Map-Reduce programming is being taken over by Spark. JSON data formats are gaining in popularity for agile development, but there is no standard JSON query language. On the other hand, XQuery 3.1 supports both XML and JSON formats. There is a new emphasis on agile transformation, as data storage is no longer the issue. The question is how non-programmers can transform data into various useful formats. The acronym ETL will be replaced by ETTTTTTT… (extract, store in data lake, and transform in many ways).
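The “extract once, transform many ways” idea can be illustrated with a small sketch. The sample JSON events, the `raw_lake` list, and the `total_by` helper are all hypothetical and only stand in for a real data lake and real transformation tools.

```python
import json

# Hypothetical raw events, kept exactly as ingested (the "data lake").
raw_lake = [
    '{"user": "a", "amount": 10, "country": "IN"}',
    '{"user": "b", "amount": 5, "country": "US"}',
    '{"user": "a", "amount": 7, "country": "IN"}',
]

def total_by(field, lake):
    """One of many possible transforms: sum 'amount' grouped by a field."""
    totals = {}
    for line in lake:
        record = json.loads(line)
        key = record[field]
        totals[key] = totals.get(key, 0) + record["amount"]
    return totals

# The same raw data is transformed in different ways, on demand.
by_user = total_by("user", raw_lake)
by_country = total_by("country", raw_lake)
```

Because the raw records are stored untouched, a new question only needs a new transform, not a new extraction.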

Other keynotes included Oracle’s head of database development, Andy Mendelsohn, who showed Oracle’s 3 areas under “big data”: Oracle DBMS & Exadata, Oracle Hadoop, and Oracle NoSQL (built on Berkeley DB), all with one interface called Oracle Big Data SQL. SQL seems to be making a comeback as an interface to several products such as Cloudera Impala.

Amazon presented DynamoDB, built for the cloud with fast and predictable performance. They claim seamless scalability and easy admin. Amazon’s motto has always been, “build services, not software”. uses DynamoDB to minimize opex.

I presented many examples of enterprises deploying MongoDB to build “systems of engagement” on top of “systems of record” (a concept Geoff Moore of Crossing the Chasm fame has been talking about lately). There is great momentum behind MongoDB deployment at enterprises because of agile development (flexible data model and high coding velocity), fast scalability and high availability using shards and replicas, and the open source culture.