Category Archives: NoSQL

Big Data & Analytics – what’s ahead?

Recently I read this statement somewhere: "As we end 2017 and look ahead to 2018, the topics top of mind for data professionals are the growing range of data management mandates, including the EU's new General Data Protection Regulation (GDPR) directed at personal data and privacy; the growing role of artificial intelligence (AI) and machine learning in enterprise applications; the need for better security in light of the onslaught of hacking cases; and the ability to leverage the expanding Internet of Things."

Here are the key areas as we look ahead:

  • Business owners demand outcomes – not just a data lake storing all kinds of data in its native format behind a set of APIs.
  • Data science must produce results – play and explore is not enough. Learn to ask the right questions. Visualize analytics from search.
  • Everyone wants real time – days and weeks are too slow; immediate, actionable outcomes are needed. Analytics and recommendations based on real-time data.
  • Everyone wants AI (artificial intelligence) – tell me what I don't know.
  • Systems must be secure – no longer a mere platitude.
  • ML (machine learning) and IoT at massive scale – thousands of ML models, with model accuracy a must.
  • Blockchain – businesses need to understand its full potential, since it is not merely transformational but a foundational technology shift.

In the area of big data, a combination of new and long-established technologies is being put to work. Hadoop and Spark are expanding their roles within organizations. NoSQL and NewSQL databases bring their own unique attributes to the enterprise, while in-memory capabilities (such as Redis) are increasingly being used to deliver insights to decision makers faster. And through it all, tried-and-true relational databases continue to support many of the most critical enterprise data environments.
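As a small illustration of the in-memory pattern, here is a minimal sketch using the redis-py client to cache a precomputed aggregate so a dashboard can read it without re-running the warehouse query; the key name and metric are hypothetical.

```python
import json
import redis

# Connect to a local Redis instance (assumed running on the default port).
r = redis.Redis(host="localhost", port=6379, db=0)

# Cache a precomputed daily aggregate (hypothetical metric) with a 1-hour TTL,
# so dashboards read from memory instead of hitting the warehouse.
daily_revenue = {"date": "2017-12-31", "revenue_usd": 1250000}
r.setex("metrics:daily_revenue", 3600, json.dumps(daily_revenue))

# A dashboard process can then fetch the insight in sub-millisecond time.
cached = r.get("metrics:daily_revenue")
if cached is not None:
    print(json.loads(cached))
```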

Cloud is becoming the de facto deployment choice for both users and developers. Serverless technology with FaaS (Function as a Service) is seeing rapid adoption among developers. According to IDC, enterprises are undergoing IT transformation as they rethink their business operations, including how they use information and what technology to deploy. In line with that transformation, nearly 80% of large organizations already have a hybrid cloud strategy in place. The modern application architecture, sometimes referred to as SMAC (social, mobile, analytics, cloud), is becoming standard everywhere.

DBaaS (database as a service) is still not as widespread as other cloud services. Microsoft is arguably making the strongest explicit claim for a converged database system with its Azure Cosmos DB as a DBaaS; Cosmos DB claims to support four data models – key-value, column-family, document, and graph. Databases have been slower to migrate to the cloud than other elements of computing infrastructure, mainly for security and performance reasons. But DBaaS adoption is poised to accelerate. Some of these cloud-based DBaaS systems – Cosmos DB, Google's Spanner, and AWS DynamoDB – now offer significant advantages over their on-premises counterparts.

One thing is for sure: big data and analytics will continue to be vibrant and exciting in 2018.


RocksDB from Facebook

I attended a HIVE-sponsored Meetup yesterday evening titled, “Rocking the database world with RocksDB”. Since I had never heard of RocksDB, I was curious to learn how it is rocking the database world.

Facebook originally built this key-value storage layer for MySQL (as an alternative to InnoDB), since MySQL is used heavily at Facebook, though they claim that was not the only motivation. Then, in 2013, they decided to open source RocksDB. Last evening's speaker, in an earlier post from November 2013, had said, "Storing and accessing hundreds of petabytes of data is a huge challenge, and we're constantly improving and overhauling our tools to make this as fast and efficient as possible. Today, we are open-sourcing RocksDB, an embeddable, persistent key-value store for fast storage that we built and use here at Facebook."
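To get a feel for what "embeddable, persistent key-value store" means in practice, here is a minimal sketch using the third-party python-rocksdb bindings (the database path, keys, and values are hypothetical): RocksDB runs inside the process, with no separate server to manage.

```python
import rocksdb

# Open (or create) a local RocksDB database; it lives in a directory on disk
# and runs embedded in this process -- no separate database server needed.
opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("sensor_readings.db", opts)

# Keys and values are raw bytes; storing readings keyed by device ID and
# timestamp keeps each device's data contiguous for ordered scans.
db.put(b"device42:2013-11-21T10:00:00", b'{"temp_c": 21.5}')
db.put(b"device42:2013-11-21T10:01:00", b'{"temp_c": 21.7}')

# Point lookup.
print(db.get(b"device42:2013-11-21T10:00:00"))

# Ordered scan over one device's readings, e.g. for edge-side aggregation.
it = db.iterkeys()
it.seek(b"device42:")
for key in it:
    if not key.startswith(b"device42:"):
        break
    print(key, db.get(key))
```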

RocksDB is also well suited to SSDs (flash storage) and claims fast performance. The team was excited when MongoDB opened up to other storage engines in the summer of 2014, and for a period of time MongoDB plus RocksDB was a fast combination. Then MongoDB decided to acquire WiredTiger (a competitor) in December 2014 to improve the performance, scalability, and hardware efficiency of MongoDB. That left RocksDB out of any official engagement with MongoDB. But the team built something called MongoRocks, which claims to be very fast, and it seems several MongoDB users prefer MongoRocks over the native combination of MongoDB with WiredTiger.

Several users of RocksDB talked about their experience, especially in the IoT world, where sensor data can be processed at the edge (ingestion, aggregation, and some transformation) before being sent to the cloud servers. The only issue I saw is that there is no "real" owner of RocksDB as a deliverable solution. There is no equivalent of a Cloudera (for Hadoop) or Confluent (for Kafka) who can provide value additions and support for the user base. So far it is all open source download and do-your-own stuff, so serious production-level deployment is still a risky affair. For now, it is a developer's play tool.

The saga of the Unicorns

Unicorn is a term in the investment industry, and in particular the venture capital industry, that denotes a start-up company whose valuation has exceeded (the somewhat arbitrary) $1 billion. The term was popularized by Aileen Lee of Cowboy Ventures. Fortune magazine counted over 80 unicorns as of January 2015; now the count is most likely past 100. But their journey lately has been bumpy.

There are signs of cooling in the "lofty valuations" of these unicorns. Fidelity wrote down Dropbox by 20%, Snapchat by 25%, and Zenefits and MongoDB by around 50% each. Zenefits had raised money at a $4.5B valuation in May. The reason for the markdowns is slow growth against targets. Square, which had its IPO earlier in November, was valued at $4 billion, about a third less than in its most recent private round. Several others besides Square have faced "markdowns": Pure Storage, Box, GoPro, New Relic, Hortonworks, etc.

So what is going on? Some of it is due to stock market jitteriness. Some of the unicorns claim to be disruptive and a threat to the incumbents. This has not happened. Google, Facebook, and Amazon have continued to grow impressively. Facebook has messaging apps that compete with Snapchat, and Dropbox has a rival in Amazon, with its fast-growing cloud storage business. MongoDB claimed it would disrupt Oracle's business, but Oracle's stock has been growing lately. Investors clearly see that profitless startups may not match the incumbents' growth prospects. Also, the burn rates of the unicorns are way too high. Lyft suffered a loss of $130m during the first half of this year on less than $50m in revenue. Instacart is losing $10 on each order. Open source software companies like Cloudera, MongoDB, or DataStax (the company behind Cassandra) have a tough time growing their revenue.

Also, there seems to be a competition to pump up the valuations of these unicorns. The velocity of entry into the "unicorn club" is too high. New fundraising rounds get creative to boost the valuation with investors. There are too many companies in similar spaces, each claiming it will be a $20-30 billion company in the future. This is not going to happen. In past downturns, healthy and well-capitalized firms have benefited. Airbnb has $2 billion in cash with a burn rate of around $100m a year. Those firms that hoarded cash during good times will do well in the downturn.

So unicorns, watch out before you become history.

Cassandra Summit 2015

I attended my first Cassandra Summit this week at the Santa Clara Convention Center. I was quite surprised to see more than 6,000 people attend this event with 130 sessions, much bigger than last year (2,000 attendees). It was proof of the growing popularity of the Cassandra NoSQL database platform. DataStax CTO and co-founder Jonathan Ellis (formerly of Rackspace) described the new functions in releases 2.2 and 3.0.

With the major addition of JSON support in 2.2, Cassandra has basically eliminated that difference with MongoDB. Cassandra has its own query language called CQL, an SQL-like construct; now, with JSON support, the developer community will see some big advantages. There are features like collections, user-defined functions and types, and deeper nesting. Release 3.0 will bring a brand new storage engine, a vast improvement on the previous key-value store engine with better space efficiency, and will include materialized views. Bragging about Cassandra's fast performance and efficient distributed database functionality, Jonathan joked about MongoDB as the "snapchat for databases" (a reference to occasional data loss because of weak consistency). He emphasized three key elements: availability (onstage they dramatized the shutting down of many nodes in two data centers with Cassandra still running), scale, and performance (both read and write).
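As a small, hedged illustration of the new JSON support, here is a sketch using the DataStax Python driver against a local node (the keyspace, table, and columns are made up); the INSERT ... JSON and SELECT JSON forms are the CQL additions that arrived in 2.2.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumed running on 127.0.0.1).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        id int PRIMARY KEY,
        name text,
        emails set<text>
    )
""")

# Cassandra 2.2+: insert a row directly from a JSON document.
session.execute("""
    INSERT INTO demo.users JSON
    '{"id": 1, "name": "Alice", "emails": ["alice@example.com"]}'
""")

# And read rows back as JSON (one JSON string per row).
for row in session.execute("SELECT JSON * FROM demo.users"):
    print(row[0])
```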

However, when it comes to streaming analytics, one Cassandra user explained how he combined Spark, Kafka, and Cassandra to achieve it – a non-trivial programming feat; a rough sketch of that plumbing follows below. DataStax CEO Billy Bosworth emphasized that they solve the transaction workload problem (always-on) and are not geared for analytics. I understood that two key customers are Apple, running its iTunes application on Cassandra, and Netflix. The majority of use cases were web-centric applications where speed and scale are key requirements. Several case studies showed customers replacing MySQL with Cassandra.
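Here is a much-simplified sketch of the ingest path, using the kafka-python client and the DataStax Python driver in place of a full Spark Streaming job (the topic, keyspace, and schema are hypothetical): consume events from Kafka and write them into a Cassandra table.

```python
import json

from kafka import KafkaConsumer
from cassandra.cluster import Cluster

# Consume click events from a (hypothetical) Kafka topic.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Write each event into a (hypothetical) Cassandra table.
session = Cluster(["127.0.0.1"]).connect("analytics")
insert = session.prepare(
    "INSERT INTO clicks (user_id, ts, url) VALUES (?, ?, ?)"
)

for msg in consumer:
    event = msg.value
    session.execute(insert, (event["user_id"], event["ts"], event["url"]))
```

A real deployment would use Spark Streaming for windowed aggregation between the consume and write steps; this sketch shows only the data flow.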

In a panel discussion on monetizing open source software, the comment was made that Cassandra is foundational, and customers are happy to pay for the fast performance and scale of the enterprise edition. In that sense, the model is different from the Red Hat model.

It was an interesting experience to see a new-generation database product gaining wider acceptance.

Big Data Visualization

Recently I listened to a discussion on Big Data visualization hosted by Bill McKnight of the McKnight Consulting Group. The panelists agreed that Big Data is shifting from hype to an "imperative" state. Start-up companies are running more Big Data projects, whereas true big data is still a small part of enterprise practice. At many companies, Big Data is moving from POC (proof of concept) to production. Interest in visualizing data from different sources is certainly increasing. There is growth in data-driven decision-making, as evidenced by the increasing use of platforms like YARN, Hive, and Spark. The traditional RDBMS platform cannot scale to meet the needs of the rapidly growing volume and variety of Big Data.

So what is the difference between data exploration and data visualization? Data exploration is more analytical and is used to test hypotheses, whereas visualization is used to profile data and is more structured. The suggestion is to bring visualization to the beginning of the data cycle (not the end) to enable better data exploration. For example, in personalized cancer treatment, output such as white blood cell counts and cancer cell counts can be found and examined upfront using data visualization. In Internet e-commerce, billions of rows of data can be analyzed to understand consumer behavior; one customer uses Hadoop and Tableau's visualization software to do this. Tableau enables visualization of all kinds of data sources across three scenarios – cold data from a data lake on Hadoop (where source data can be kept in its native format), warm data from a smaller data set, or hot data served in-memory for faster processing.

Data format can be a challenge. How do you visualize NoSQL data? For example, JSON data (supported by MongoDB) is nested and schema-less, which is hard for BI tools to consume. Understanding the data is crucial, and flattening of nested hierarchies will be needed; nested arrays can be broken out into child tables linked by foreign keys. Graph data is another special case, where visualizing the right amount of graph data is critical (good UX).
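As a minimal sketch of that flattening step (the document shape is made up), the nested "orders" array below is broken out into a child table whose rows point back to the parent record via a foreign key, giving BI tools two flat tables to work with.

```python
# A nested, MongoDB-style JSON document (hypothetical shape).
customer = {
    "_id": "c1",
    "name": "Alice",
    "orders": [
        {"order_id": "o1", "total": 40.0},
        {"order_id": "o2", "total": 15.5},
    ],
}

def flatten(doc):
    """Split one nested document into a parent row and child rows."""
    parent = {"customer_id": doc["_id"], "name": doc["name"]}
    children = [
        # Each child row carries the parent's key as a foreign key.
        {"customer_id": doc["_id"], **order}
        for order in doc.get("orders", [])
    ]
    return parent, children

parent_row, order_rows = flatten(customer)
print(parent_row)   # -> flat 'customers' table row
print(order_rows)   # -> flat 'orders' rows carrying the customer_id FK
```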

Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. Modern big data applications, such as social, mobile, web, and IoT, deal with a larger number of users and larger amounts of data than traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing, and can include complex types such as JSON and Parquet. Apache Drill is built from the ground up to provide low-latency queries natively on such rapidly evolving, multi-structured datasets at scale.
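To make that concrete, here is a hedged sketch that submits a SQL query to Drill's REST endpoint (assuming a local drillbit on the default port 8047; the JSON file path and its fields are hypothetical), letting Drill query the nested file in place with no schema declared upfront.

```python
import requests

# Drill exposes a REST API on the drillbit (default port 8047).
query = {
    "queryType": "SQL",
    "query": "SELECT t.user.id AS user_id, t.action "
             "FROM dfs.`/data/events.json` AS t LIMIT 10",
}

resp = requests.post("http://localhost:8047/query.json", json=query)
resp.raise_for_status()

# Drill returns the result rows as JSON.
for row in resp.json().get("rows", []):
    print(row)
```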

Apache Spark is another exciting new approach that speeds up queries by utilizing memory. It consists of Spark SQL (SQL-like queries), Spark Streaming, MLlib, and GraphX, and it supports Python, Scala, and Java for processing. It lets Hadoop users have more fun with data analysis and visualization.
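Here is a minimal PySpark sketch (the file name and fields are hypothetical) showing the Spark SQL piece: read a JSON file into a DataFrame and query it with SQL, with Spark caching the working set in memory across queries.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("viz-demo").getOrCreate()

# Read a (hypothetical) JSON file; Spark infers the schema automatically.
events = spark.read.json("events.json")
events.cache()  # keep the working set in memory for repeated queries

# Query it with plain SQL via Spark SQL.
events.createOrReplaceTempView("events")
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM events
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```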

Big Data Visualization is emerging to be a critical component for extracting business value from data.

The NoSQLNow conference in San Jose this week

I attended the NoSQLNow conference this week at the San Jose Convention Center. The organizers claimed there were 800 attendees, clearly much higher than in the last couple of years. Given the number of sessions, exhibits, speakers, and attendees, interest in the newer data management products and solutions (aka Big Data) has been growing fast.

I spoke at a session titled, "Are NoSQL databases ready for the enterprise? Examples of MongoDB deployment", which was well attended. I also participated in a panel on enterprise adoption of cloud; my co-panelists were from Oracle and Neo4j. The conference opening session was given by one of the co-hosts, Dan McCreary, who spoke about the state of NoSQL. He mentioned that a total of $2.4B has been invested in NoSQL DB companies over the last couple of years – MongoDB ($231M), Couchbase ($116M), Aerospike ($22M), Basho ($32.5M), DataStax ($83.7M), Clustrix ($59.3M), FoundationDB ($22.3M), etc. Even a big player like Intel has invested in Cloudera.

Here are some new trends in the NoSQL world:

  • Hadoop is starting to move from batch to real-time and streaming
  • Real-time systems are adding Hadoop integration points
  • Storm (from Twitter) and Spark are addressing data streaming
  • Spark/Scala is popular on multiple systems
  • MongoDB is the big leader in NoSQL operational systems based on the document data model, followed by DataStax and Couchbase

The market pressures, according to Dan, point to:

  • Big Data and predictive analytics
  • Internet of Things (time-series data and log files)
  • Security for highly regulated areas like finance/banking, healthcare, and government
  • Streaming data
  • Keeping operational costs low (bye-bye to license fees)
  • High availability (moving away from master-slave to clusters of peer-to-peer networks)

There are other trends as well. Old-school MapReduce programming is being taken over by Spark. JSON data formats are gaining popularity for agile development, but there is no standardized JSON query language; on the other hand, XQuery 3.1 supports both XML and JSON formats. There is new emphasis on agile transformation, as data storage is no longer the issue – the question is how non-programmers can transform data into various useful formats. The acronym ETL will be replaced by ETTTTTTT… (extract, store in a data lake, and transform in many ways).

Other keynotes included Oracle's head of database development, Andy Mendelsohn, who showed Oracle's three areas under "big data" – Oracle DBMS & Exadata, Oracle Hadoop, and Oracle NoSQL (formerly BerkeleyDB) – all with one interface called Oracle Big Data SQL. SQL seems to be making a comeback as an interface to several products, such as Cloudera Impala.

Amazon presented DynamoDB, built for the cloud with fast and predictable performance. They claim seamless scalability and easy administration. Amazon's motto has always been "build services, not software"; Amazon.com itself uses DynamoDB to minimize opex.
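As a small sketch of that developer experience (using the current boto3 SDK; the table and attribute names are hypothetical), DynamoDB is consumed purely as a service API – there is no server to install or administer.

```python
import boto3

# DynamoDB is consumed as a service; credentials and region come from
# the standard AWS configuration.
dynamodb = boto3.resource("dynamodb", region_name="us-west-2")
table = dynamodb.Table("users")  # hypothetical table keyed on user_id

# Write an item.
table.put_item(Item={"user_id": "u123", "name": "Alice", "visits": 42})

# Read it back with a point lookup.
resp = table.get_item(Key={"user_id": "u123"})
print(resp.get("Item"))
```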

I presented many examples of enterprises deploying MongoDB to build "systems of engagement" on top of "systems of record" (a concept Geoff Moore, of Crossing the Chasm fame, has been talking about lately). There is great momentum behind MongoDB deployment at enterprises because of agile development (a flexible data model and high coding velocity), fast scalability and high availability using shards and replicas, and the open source culture.
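A minimal pymongo sketch of that flexible data model (the database, collection, and fields are made up): documents in the same collection can vary in shape, which is what gives teams the high coding velocity.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.engagement  # hypothetical database

# Documents in one collection need not share a schema -- the "flexible
# data model" that enables fast iteration on systems of engagement.
db.interactions.insert_one(
    {"user": "alice", "channel": "mobile", "taps": 12}
)
db.interactions.insert_one(
    {"user": "bob", "channel": "web", "pages": ["home", "pricing"]}
)

# Query across the varied documents.
for doc in db.interactions.find({"user": "alice"}):
    print(doc)
```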

Fast Data vs. Big Data

Back when we were doing DB2 at IBM, there was an important older product called IMS, which brought in significant revenue. With another database product coming (based on relational technology), IBM did not want any cannibalization of the existing revenue stream, hence we coined the phrase "dual database strategy" to justify the need for both DBMS products. In a similar vein, several vendors are concocting all kinds of terms and strategies to justify newer products under the banner of Big Data.

One such phrase is Fast Data. We all know the 3Vs associated with the term Big Data – volume, velocity, and variety. It is the middle V (velocity) that says data is not static but changing fast – stock market data, satellite feeds, even sensor data coming from smart meters or an aircraft engine. The question has always been how to deal with such fast-changing data (as opposed to the static data typical of most enterprise systems of record).

Recently I was listening to a talk by IBM and VoltDB where VoltDB tried to justify a world of "Fast Data" coexisting with "Big Data," with the latter narrowed to the static data warehouse, or "data lake" as IBM calls it. Again, they have chosen to pigeonhole Big Data into the world of HDFS, Netezza, Impala, and batch MapReduce. This way, they justify the phrase Fast Data as representing operational data that is changing fast. They call VoltDB "the fast, operational database," implying that every other database solution is slow. Yet incumbents like IBM, Oracle, and SAP have introduced in-memory options for speed, and even NoSQL databases can process reads very fast on distributed clusters.

The VoltDB folks also tried to show how the two worlds (Fast Data and their version of Big Data) will coexist. The Fast Data side will ingest and interact with streams of inbound data, do real-time data analysis, and export to the data warehouse. They bragged about a performance benchmark of 1m tps on a 3-node cluster, scaling to 2.4m tps on a 12-node system running in the SoftLayer cloud (owned by IBM). They also said that this solution is much faster than Amazon's AWS cloud, but the comparison is not apples-to-apples, as the SoftLayer deployment runs on bare metal while the AWS deployment runs on a stack of software.

I wish they would simply call this real-time data analytics, since it involves mostly read-type transactions, rather than confuse it with update-heavy workloads. We will wait and see how enterprises adopt this VoltDB-SoftLayer solution alongside their existing OLTP solutions.