Category Archives: NoSQL

NewSQL Meetup last week

I attended a meetup last week in Santa Clara on the topic The Realities of NewSQL. Three companies were represented in a panel discussion – Clustrix (Raj Bains), VoltDB (Scott Jarr), and TransLattice (Michael Lyle). Steve Baunach from Starview was the moderator.

This new category, called NewSQL, represents companies that keep the relational data model and SQL while delivering better scalability, performance, and high availability. NoSQL companies rose first, bringing a schema-less, object-oriented data model with relaxed consistency and scale-out on commodity servers; the NewSQL group claims similar scale-out, but with a relational database and SQL support.

Three claims stood out in their discussion – preserving the SQL skill base and the relational model of data that has dominated the landscape for the last 20-plus years; high scale-out by adding commodity servers (a weakness especially with MySQL); and better availability.

VoltDB deals with transaction processing (dominated by IBM and Oracle products) with very high throughput (due to the proliferation of devices as new data sources) and better performance. Their claim is that they have eliminated many unnecessary overheads from traditional RDBMS products by using in-memory techniques extensively.

Clustrix claims it has eliminated sharding (extra burden to users if they have to manage it) as offered by NoSQL products. Their mantra for success is scale-out on clusters – being able to handle high loads by adding commodity scale servers. They specifically focus on the MySQL user base.
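To make that burden concrete, here is a minimal sketch of application-managed sharding, the kind of routing logic Clustrix says users should not have to write. The shard names and the hash-modulo scheme are assumptions chosen for illustration; they do not describe how any product named here works internally.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate database server.
SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key (stable across runs, unlike built-in hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

def route(keys, shards=SHARDS):
    """Group keys by the shard that owns them."""
    placement = {name: [] for name in shards}
    for key in keys:
        placement[shard_for(key, shards)].append(key)
    return placement

customers = [f"customer-{i}" for i in range(10)]
placement = route(customers)

# Adding a server changes the modulus, so some keys now map to a different shard
# and their rows would have to be migrated by the application owner.
bigger = SHARDS + ["shard-3"]
moved = [k for k in customers if shard_for(k) != shard_for(k, bigger)]
print(f"{len(moved)} of {len(customers)} keys change shards after adding a server")
```

The point of the sketch is the last step: with hand-rolled sharding, every change in cluster size becomes a data-migration project for the user.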

The TransLattice Elastic Database (TED) is a Relational Database Management System that provides ANSI-SQL support, the ACID transactions enterprise applications require, and the ability to scale-out across wide distances using ordinary Internet connections. It uses partitioning to split databases across nodes. This notion is not new and has been deployed by IBM and Oracle for many years.

It was unclear why existing users of IBM or Oracle would adopt one of these products, as the incumbents are marching forward to scale-out models and improving TCO. The MySQL community has been using external products for scalability for a while, and that is understandable. But as part of Oracle Corporation, MySQL will see enhancements in its scalability offerings. Then there is SAP HANA, which claims big performance gains.

There are many companies under this umbrella – Clustrix, GenieDB, Schooner, VoltDB, RethinkDB, ScaleDB, Akiban, CodeFutures, ScaleBase, TransLattice, NimbusDB, etc. With the marketing noise of Big Data and Cloud, new companies are getting funded by the dozens. It is going to be a tough space to differentiate and become a winner.

NewSQL – What is it?

There has been a lot of discussion on NoSQL databases over the past couple of years. These databases do not use the Structured Query Language (SQL), the standard data manipulation language for relational databases such as Oracle, DB2, MySQL, Sybase, and SQL Server. The data model is closer to object-oriented data and hence fits documents or geospatial data well. Being schema-less, they accommodate flexible data structures well, unlike their relational brethren. Examples of NoSQL databases are MongoDB (the most popular), CouchDB, and Cassandra. Programming is easier, but strict consistency is not guaranteed. They also have scale-out models, with replication and sharding (partitioning) for speed. These products support multiple programming languages.

A new category called NewSQL is aiming to provide the scale-out advantages of NoSQL databases, and often their commodity hardware friendliness as well. But NewSQL databases maintain the transactional data consistency guarantees of traditional relational databases, as well as their compatibility with SQL for queries and connectivity (using technologies like ODBC and JDBC). One such product, NuoDB, believes that transactional, analytical and “Web scale,” elastic workloads can be handled by the same database; it’s just a matter of making that the design goal. This is hard to believe until proven!

Another NewSQL product, VoltDB, also claims to bring ACID-compliant transactions with analytics. VoltDB focuses on using in-memory technology to perform in situ analysis on financial, clickstream, gaming, and other high-velocity data as it streams in. In the company’s own words, VoltDB is meant to “narrow the ‘ingestion-to-decision’ gap.” There is a growing need for instant analysis of transactional data (real-time BI).
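For readers who have not worked with transactional guarantees, the sketch below shows what "all or nothing" means in SQL. It uses Python's built-in sqlite3 module purely as a stand-in for a relational engine; the table and account names are made up, and nothing here reflects the API of any NewSQL product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK, so bob's credit is undone
print(ok, failed, balances(conn))
```

The second transfer credits bob before the debit fails, yet the rollback removes that credit. Keeping this behavior while spreading data across many servers is the hard part the NewSQL vendors claim to have solved.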
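As a rough illustration of analyzing data as it is ingested, the sketch below keeps per-account aggregates in memory and makes a decision on every write. The class name, the spike threshold, and the sample events are all hypothetical; this is not VoltDB code, only the general idea of closing the gap between ingestion and decision.

```python
from collections import defaultdict

class IngestTimeStats:
    """Keeps per-account aggregates in memory and updates them on every write."""

    def __init__(self, spike_factor=3.0):
        self.spike_factor = spike_factor  # hypothetical threshold for flagging
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def ingest(self, account, amount):
        """Record one transaction; return True if it looks like a spike."""
        n, total = self.count[account], self.total[account]
        # Compare against the running average of earlier transactions, if any.
        is_spike = n > 0 and amount > self.spike_factor * (total / n)
        self.count[account] = n + 1
        self.total[account] = total + amount
        return is_spike

stats = IngestTimeStats()
events = [("acct-1", 20.0), ("acct-1", 25.0), ("acct-1", 400.0), ("acct-2", 400.0)]
flags = [stats.ingest(account, amount) for account, amount in events]
print(flags)
```

A batch system would find the unusual transaction hours later; here the flag is available the moment the row arrives.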

You squander the value of transactional data unless you analyze it as it is being recorded. SAP said much the same thing recently, as it announced the availability of its Business Suite on its HANA in-memory data platform, and fellow NewSQL player NuoDB uses in-memory and asynchronous technology to facilitate similar real-time analyses. Other NewSQL database products include ScaleDB and Clustrix, addressing the scalability needs of MySQL customers. Most of these products are also offering their services in the cloud.

It seems a grand unification process is under way. Conventional relational databases and NoSQL databases seem to be at opposite ends of a spectrum. NewSQL databases acknowledge the merits in both models and seek to eliminate unreasonable compromise by marrying the approaches. NewSQL products may thus win out, but traditional relational database players may also incorporate NoSQL and NewSQL features to stay competitive. Perhaps that’s why Microsoft announced in November last year that the next major release of its SQL Server relational database will include an in-memory transactional database engine, codenamed “Hekaton.”

Five Questions around Big Data

Data is the new currency of business and we are in the era of data-intensive computing. Much has been written on Big Data throughout 2012, and customers around the world are struggling to figure out its significance to their businesses. Someone said there are 3 I’s to Big Data:

  • Immediate (I must do something right away)
  • Intimidating (what will happen if I don’t take advantage of Big Data)
  • Ill-defined (the term is so broad that I’m not clear what it means).

In this blog post, I would like to pose five key questions that customers must find answers to with regard to Big Data. So here goes.

1. Do I understand my data and do I have a data strategy?

There are many varieties of data – customer transaction data, operational data, documents/emails and other unstructured data, clickstream data, sensor data, audio streams, video streams, etc. Do I have a clear understanding of the 3 V’s of Big Data – Volume, Velocity, and Variety? What is data “in motion” vs. data “at rest”? Data in motion demands split-second decisions; do I have the tools for that? Every data source must be understood, along with its attributes and growth projections.

Customers must have an overall data strategy based on their business importance. For example, business critical data must be highly reliable, secure and of high performance. A data policy must be in place to take care of volume, growth, retention, security and compliance needs.

2. What are my reporting needs to transform my business and give me insights for growth?

Businesses are transforming to stay ahead of the competition. While we asked, “what happened” in the past, now it is “why did it happen and what is going to happen?”. From data collection, we have to move to data analysis. Instead of analyzing existing business, we must create new business. Therefore, the retail industry wants to give “today’s recommendation” on the fly to clients; internal IT needs operational intelligence to make it more efficient; customer service must provide customer insight; and fraud management must look at social profiles to reduce fraud. The list goes on…

Do you have a clear understanding of your reporting needs via data visualization on mobile devices like the iPad with touch interface? You will need a strategy of all the analytic tools for key employees/executives to make quick business-relevant decisions.

3. How do I drastically reduce my TCO of Data Warehousing and BI?

Many large enterprises are spending millions of dollars to move operational data to a data warehouse via ETL tools (Extraction, Transformation, Loading). This can be expensive and time consuming. Sears, for example, has a slogan “ETL must die”. By moving to Hadoop, they reduced the ETL time from 20 hours to 17 minutes. They claim serious cost reductions by moving from traditional ETL to direct loading of raw data to Hadoop servers. Today’s implementations must be studied for price-performance and newer technologies can bring down costs and improve processing time drastically. Would you like to develop reports in days rather than weeks?

4. How does Big Data co-exist with my current OLTP and DW data?

All enterprises have business-critical operational systems (OLTP). These use traditional DBMS products (such as Oracle, DB2, IMS, etc.). Enterprises have also created separate Data Warehousing systems with BI tools for analysis. Now the new world of Internet data, such as chatter from social networks and Web log data (digital exhaust), is adding to the complexity. What is your approach to integrating the legacy data with the new data?

5. What is the right technology for my needs?

I keep hearing so many new terms and vendor names – Hadoop, Cloudera, Hortonworks, Datameer, NoSQL, MongoDB, Map-reduce, Data Appliance, HBase, etc. It surely can be very confusing!

I need to know what the right technology is for my needs. If I have petabyte volumes of data coming from various sources, what technology can I implement to handle that efficiently? Then, how do I get relevant information from that pile to help my business insights? I also need to know what skills I need to do that, and the cost. I need an implementation roadmap for getting value from all the data that my business is producing.

Tech thoughts for 2013

Last year, we saw three trends making lots of noise and a fourth one closely following – Cloud Computing, Big Data, Mobility, and Social networking for the enterprise. Let me comment on each one as we enter 2013.

In cloud computing, the focus shifts to Platform as a Service (PaaS) as SaaS is now accepted into the mainstream. CRM and HR applications dominate the SaaS space, with SalesForce.com and Workday as leaders. Microsoft is evolving its Windows Azure from a PaaS toward Infrastructure as a Service (IaaS). Last year, it added persistent-state virtual machine support to Azure, allowing it to accommodate a wider variety of software, including Linux. Microsoft also introduced Hadoop for Azure and support for MapReduce. Amazon’s AWS stack now blurs the boundary between PaaS and IaaS. SalesForce.com wants to be a PaaS player via its Force.com platform for developing any SaaS offering. Besides CRM and HR cloud apps, we have seen the emergence of financial apps for midsize companies – Adaptive Planning, Anaplan, Host Analytics, and Tidemark are some examples.

In Big Data, the focus will shift more to analytics and data visualization. The other key trend is “data in motion”, where capture and analysis can be done for split-second decisions. The post-Hadoop era has started and we see a host of new players offering near-realtime data reduction and analysis. This trend will accelerate. A set of NewSQL players (not NoSQL) are adding scale and performance to Postgres or MySQL, which can also be offered as a cloud service. Relational databases like IBM’s DB2 and Oracle will dominate the enterprise space, given their long years of proven robustness and reliability. However, extreme scale on the order of petabytes will attract newer solutions.

Mobility is a given, thanks to iPads outselling PCs. Last year iPad sales exceeded Lenovo’s PC sales. Cloud computing assumes user devices such as iPads, Android devices, and smartphones. Apple boasts over 700,000 iOS applications. Microsoft has a lot of catching up to do, given the slow sales of Surface RT. Going forward, every enterprise application must design its UI for the form factors of mobile devices. This will be the price of entry for any vendor. Gone are the drop-down icons on Windows as UI.

Social networking has grown a great deal for consumers, but enterprises are still struggling to figure out the proper usage and business benefits. Social will come into the organization through the back door (much like how PCs entered the business during the 1980s and 1990s). Whether it is a communication director testing out a company page on Facebook, customers complaining about or praising your company on their Twitter profiles, or traditional enterprise applications being updated with social capabilities, there will be social. Hence it may be worthwhile for your company to have some policy around social. I think enterprise applications will integrate more social features. Someone said that Facebook will matter less, but Twitter and Pinterest will be of more significance.

Welcome to 2013.

In-memory Database – Oracle’s Exadata X3 vs. SAP’s HANA

This week at the Oracle OpenWorld conference, Larry Ellison announced the new Exadata X3 system, which has 4TB of DRAM plus 22TB of flash (SSD) memory. He therefore said that you could have 26TB of in-memory data for fast processing at a very fast write speed (1 million writes per second). Clearly this is aimed at SAP’s in-memory database project HANA, which has been shipping for the last 6 months. Larry, in his typical style, derided HANA as one with 0.5TB of DRAM and therefore not worth comparing to the X3.

Subsequent to this announcement, Vishal Sikka, SAP’s CTO and head of HANA development, wrote a blog post refuting Oracle’s claim as false and misleading. He says that the 22TB of SSD does not count as memory and that HANA also has such SSD for persistence. He says, “We are presently shipping, for the last several months, certified 16-node HANA hardware made by 4 vendors: IBM, HP, Fujitsu and Cisco. These systems are available for 16TB of DRAM, so they are already 4 times bigger than Oracle’s machine, and they have been in the market since spring of this year. The machines can take up to 32TB of DRAM, within their current configurations. In IBM’s case, with the Max5 configuration, they can go up to 40TB.”

During SAP’s annual conference Sapphire last May, they demonstrated the largest HANA system built so far – an IBM cluster running 100TB of DRAM and 4000 CPU cores. Already today this system can go up to 250TB of DRAM (and with HANA’s compression, can hold multiple petabytes of data entirely in-memory).

SAP is not in the hardware business, unlike Oracle, which (after its Sun acquisition) is quite motivated to make the hardware business succeed by creating its “engineered” systems. SAP gets its hardware clusters from four vendors, and IBM is the strongest partner (even though it competes in the database space). SAP claims that HANA is not merely an in-memory database system; it provides many additional functions such as real-time analytics.

Oracle will be a formidable competitor, as it has the longest experience in managing data. Now it is shifting to providing platform as a service with database, analytics, application development and social services. The game is not merely about speeds and feeds, but several other dimensions. SAP claims rapid adoption of HANA in the few months since its introduction. It is hard to compare the two, as there are no benchmark performance numbers.

The market will be the best judge of who is better. There are many camps now. The NoSQL camp is denouncing all the traditional database vendors as incapable of handling the large volume of unstructured data (Big Data). The initial target of both SAP and Oracle seems to be their existing customer bases, who will find a logical upgrade path to these in-memory database solutions for speed and scalability.

My Talk on Big data yesterday (Sept 27, 2012)

I gave a talk titled Big Data – Trends and Challenges yesterday in San Jose. This was organized as a meet-up event by Datapipe and Compassites Software. Datapipe provides cloud infrastructure services to clients, whereas Compassites Software (where I am a board director) is a technology services firm out of Bangalore, India focusing on areas like consumerization of IT, cloud computing, and Big Data.

At the talk yesterday, I realized how confused people seem to be about Big Data, as the term is so ill-defined. One thing is for sure: Big Data comes in one size – Big. Besides the size issue (over petabytes), there is the velocity issue (Data in Motion vs. Data at Rest) and the variety issue. I mentioned that as the volume of data keeps rising, the percentage of data used for analysis and insight keeps declining. I mentioned that 80% of the data in the world is unstructured, hence new solutions are being invented. Also, M2M (machine to machine) or sensor data keeps rising. In the volume context, I said that a single engine in a Boeing 747 spills out 10 Terabytes per hour. When you take all four engines on a Boeing 747 flying across the Atlantic, it produces a staggering 640TB. Now every day there are 25,000 flights across the Atlantic, and you can do the math on how much data gets collected per day.

We discussed the business value of big data and how the typical pilot project at enterprises seems to be IT log data analysis. Other areas like fraud detection, social media, and call center feedback are candidates for Big Data applications. On the technology front, much has been happening during the last 5-7 years. All the innovations are coming out of the new web companies like Google, Amazon, Yahoo, Facebook, and Twitter. The Hadoop platform is an offshoot of Google’s early work on GFS (Google File System) and GMR (Google MapReduce). Google is moving beyond Hadoop via its recent work on Dremel, Percolator, and Pregel. Facebook is also putting out many new projects like Puma, mostly for realtime access and analysis. Twitter’s Storm project is also noteworthy. Google has recently offered BigQuery as a cloud service. Then there are dozens of NoSQL products such as Cassandra, Couchbase, MongoDB, Riak, etc.

It is important to remember that the world is not being taken over by Hadoop, as it is a batch system for handling very large data volumes via distributed parallel processing on commodity hardware. It does not touch the space of OLTP, which is critical for the airline and banking industries. Also, if your data volume is under 100 Terabytes and it is structured data, then current Data Warehousing offerings via an RDBMS or appliances (e.g. Oracle Exadata, IBM Netezza) are excellent solutions. The web-centric interactive world has given rise to the need for extreme scale, and the Hadoop-based solutions must learn to co-exist with the existing world. Hence Big Data integration will be a key area.

One thing is for sure: there is a lot of interest in this subject of Big Data, as clarity is the one thing lacking amidst all the marketing hype and noise.
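Doing that math, taking the figures from the talk at face value, looks like this. The calculation assumes every one of the flights produces as much data as the four-engine 747 in the example, so it is an upper-end estimate rather than a measurement.

```python
TB_PER_ENGINE_HOUR = 10   # figure quoted in the talk
ENGINES = 4
TB_PER_CROSSING = 640     # figure quoted in the talk
FLIGHTS_PER_DAY = 25_000  # figure quoted in the talk

# Flight time implied by the two quoted figures taken together.
implied_hours = TB_PER_CROSSING / (TB_PER_ENGINE_HOUR * ENGINES)

# Daily total, assuming every flight produces as much as the 747 example.
tb_per_day = TB_PER_CROSSING * FLIGHTS_PER_DAY
eb_per_day = tb_per_day / 1_000_000  # 1 EB = 1,000,000 TB in decimal units

print(f"implied flight time: {implied_hours:.0f} hours")
print(f"daily volume: {tb_per_day:,} TB = {eb_per_day:.0f} EB")
```

That comes to 16 million TB, or 16 exabytes, per day. Note also that 640TB at 10TB per engine-hour implies a 16-hour flight, which is long for an Atlantic crossing, so the per-hour figure is probably conservative.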

Big Data – business value

With every passing day, Big Data assumes new strength as a significant force in our industry. Someone even said that Big Data is transforming business the same way IT did a few decades ago.

The overall revenue (including hardware, software, and services) for Big Data is said to be around $5.1B in 2011 and includes players such as IBM, Intel, HP, Fujitsu, etc. This is hard to fathom! But pure-play revenue coming from players such as Vertica, Aster, Cloudera, Greenplum, 1010Data, etc. is valued at $468M in 2011. Then someone said the projected revenue from Big Data will reach an astounding $53B by 2017 (source: Wikibon), growing to $10B in 2013, $32B in 2015 and $48B in 2016. We can argue about these numbers, but let us agree that this will be quite big. Why is that?

We all know about data growth. Facebook, with 900M users in April 2012, did analytics on 25PB (petabytes) of compressed data (125PB of raw data). Twitter handled 400M tweets a day during June 2012. Overall corporate data is supposed to grow by 94% year over year. Facebook made several shifts – from “what data to store” to “what can we do with more data”. They simplified data analytics for end users by adopting more than one infrastructure to solve all Big Data problems.

I read that someone identified the 3 I’s of Big Data besides the 3 V’s (volume, velocity, and variety). They are Immediate (do something now), Intimidating (what if I don’t?), and Ill-defined (what is it anyway? many definitions). The middle one, “Intimidating”, refers to leveraging Big Data applications like online advertising and marketing optimization, or applications to predict crime data, or machine-generated data for analytics.
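To put those projections in perspective, the annual growth rates they imply can be worked out directly. This is a small sketch using only the dollar figures quoted above; the compound annual growth rate formula is standard, and the interpretation is mine, not Wikibon's.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

# Revenue in billions of dollars, as quoted above.
revenue_b = {2011: 5.1, 2013: 10, 2015: 32, 2016: 48, 2017: 53}

overall = cagr(revenue_b[2011], revenue_b[2017], 2017 - 2011)
print(f"2011-2017 implied growth: {overall:.1%} per year")

# Growth rate between each pair of consecutive data points.
years = sorted(revenue_b)
for a, b in zip(years, years[1:]):
    rate = cagr(revenue_b[a], revenue_b[b], b - a)
    print(f"{a}-{b}: {rate:.1%} per year")
```

Over the whole period the projection implies growth of a little under 50% a year, with most of it packed into the middle years and a sharp slowdown at the end.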

Many vertical industries are trying to exploit Big Data and analytics – retail (today’s recommendation), sales leads (campaign recommendation), IT (from manual log files to operational intelligence), customer service (customer insight), billing (intelligent coding), fraud management (social profiles), and automatic operations management.

Key challenges remain, especially in the areas of visual analytic tools and trend analysis across multiple data sources. But several new start-ups are addressing these white spaces. Big Data is not just Hadoop or NoSQL database systems. It encompasses current RDBMS data and new unstructured data, plus all the analytics and special applications that give businesses the insight for better growth.

Closer look at one NoSQL database – MongoDB

Among the new crop of NoSQL database products, MongoDB ranks quite high, in my opinion. The company that produces MongoDB is 10gen, a venture-backed start-up in business since 2008. Its rapid growth over the last 4 years bears testimony to its technical strength.

MongoDB’s name comes from the middle five letters of the word “humongous”, a nod to big data. It is an open-source, document-oriented store that is schema-free and supports dynamic queries with full indexing. Documents are stored as BSON – a binary encoding of JSON (JavaScript Object Notation), a lightweight text-based open standard designed for data interchange. Douglas Crockford, who later worked at Yahoo, specified JSON in the early 2000s; it was published as RFC 4627 in 2006.

The other key tenet of MongoDB is its scalability architecture – it can scale out horizontally using its automatic “sharding” (key-range partitioning). It also provides replication (master-slave or replica sets) for high availability, recovery, and performance. One of its customers, Disney’s Interactive Media Group, for example, has 1400 instances of Mongo. It uses sharding for write performance and replication for read performance.

MongoDB can be deployed in the cloud via Amazon’s AWS. The company’s revenue model is based on support services, training, and consulting. Partners include VMware, Amazon, Red Hat, etc. – all cloud platform providers offering MongoDB as an option to their clients. Although the database suits document storage best, it can handle other unstructured data like video and images. But the initial thrust seems to be toward customers looking for high scalability on commodity hardware and superior performance.

MongoDB claims over 400 customers, including many internet companies like Foursquare, Craigslist, etc. Several books have been published on MongoDB, and the development community is growing fast. It certainly bridges the gap between traditional RDBMS products (Oracle, MySQL, SQL Server, DB2) at one end and key-value stores (Riak, Cassandra, Voldemort, ...) at the other end.
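To show what “schema-free” and “dynamic queries” mean in practice, here is a small sketch using plain JSON and Python's standard json module. It does not use MongoDB or BSON; the collection, field names, and values are invented for illustration.

```python
import json

# Two documents in the same hypothetical "profiles" collection. Their fields
# differ because no schema forces them to match.
profiles = [
    {"name": "Ada", "tags": ["sql", "newsql"], "address": {"city": "San Jose"}},
    {"name": "Grace", "devices": 3},
]

encoded = [json.dumps(doc) for doc in profiles]   # text form used for interchange
decoded = [json.loads(text) for text in encoded]  # back to native dicts and lists

# A "dynamic query": filter on a field that only some documents have.
tagged = [d["name"] for d in decoded if "newsql" in d.get("tags", [])]
print(tagged)
```

In a relational table, the second document would need either NULL columns or a schema change; here it simply carries different fields, including a nested one.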
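Key-range partitioning can be sketched in a few lines. The split points below are hypothetical, and the sketch ignores what a real system does automatically (choosing split points, rebalancing, and routing through a query router), so it should be read as the idea only, not as MongoDB's implementation.

```python
from bisect import bisect_right

# Hypothetical split points: keys below "h" go to shard 0,
# keys from "h" up to (not including) "p" go to shard 1, the rest to shard 2.
SPLIT_POINTS = ["h", "p"]

def shard_index(key):
    """Return the index of the shard whose key range contains `key`."""
    return bisect_right(SPLIT_POINTS, key)

def shards_for_range(low, high):
    """Shards that must be consulted for a range query low <= key <= high."""
    return range(shard_index(low), shard_index(high) + 1)

for name in ["alice", "henry", "zoe"]:
    print(name, "->", shard_index(name))
print(list(shards_for_range("bob", "carol")))
```

Because neighboring keys live on the same shard, a range query such as "bob" to "carol" touches one shard only, which is the main advantage of range partitioning over hashing.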