Category Archives: Database

DB2 is 30 years old next month!

Daryl Taft’s article in eWeek reminded me that next month, on June 6th, IBM’s DB2 RDBMS product will celebrate its 30th anniversary. This has a personal significance for me. I was part of the DB2 planning team then, and on June 6th, 1983, I was in Lyon, France at the European user group meeting, ready to announce IBM’s new RDBMS on MVS called DB2. Interestingly, I had prepared two presentation decks: one for DB2, and the other for IBM’s database directions. The second one was on hand in case the announcement could not clear the IBM approval process in time. Luckily, I was cleared to go with the announcement of the new production-ready RDBMS product called DB2, running on the mainframe MVS platform. I still recall the excitement of doing that in front of 2000 people in Lyon, the gastronomic capital of France. Later that evening, the attendees were taken by bus to a Beaujolais winery for dinner.

Why was this significant? IBM Research had worked on a prototype called System R, which was commercialized on the VM platform under the name SQL/DS. Even though it supported the relational model and SQL, it lacked DBMS robustness in areas such as scalability, performance, and reliability. In the meantime, Oracle got started in 1977, and its first product, based on System R principles and SQL, was introduced in 1979 on DEC/VAX. That left a gap of four years in which IBM did not have a commercial RDBMS on its flagship platform, MVS. The only DBMS on MVS was IMS, based on the hierarchical data model and the proprietary DL/1 language. One of the internal debates was about how to position the new RDBMS when IMS was such a significant revenue generator. I recall the “dual database strategy” presentation we used to give (which one to use when). One good thing about DB2 was that the bottom layer of the engine (buffering, locking, latching, backup-recovery, write-ahead log, etc.) drew many lessons from user experience with IMS. Hence DB2 had stronger industrial-strength features than its research cousin SQL/DS, as well as Oracle.

The next year, in 1984, I went to IBM’s Austin Lab for two years to lay the foundation for DB2 on the IBM PC (OS/2). Subsequently, development was shifted to the IBM Toronto lab. I personally headed a team doing the early work of porting DB2 to Unix in 1990-91.

All this was done before the Internet was invented, when memory and disks were expensive commodities. Now the scene has changed a great deal, and we see many new types of database engines coming to market to address the needs of extreme scale and huge volumes of data. IBM continues to be a leading player in the data management and analytics business.

It feels good to be part of that history. Happy birthday DB2.

NewSQL Meetup last week

I attended a meetup last week in Santa Clara on the topic “The Realities of NewSQL.” Three companies were represented in a panel discussion – Clustrix (Raj Bains), VoltDB (Scott Jarr), and TransLattice (Michael Lyle). Steve Baunach from Starview was the moderator.

This new category, called NewSQL, represents companies using the relational data model and SQL while delivering better scalability, performance, and high availability. Following the rise of the NoSQL community of companies, which brought schema-less, object-oriented data models with relaxed consistency and scale-out on commodity servers, the NewSQL group claims similar scale-out, but with relational database and SQL support.

Three claims stood out in their discussion – preserving the SQL skill base and the relational model of data that has dominated the landscape for the last 20-plus years; high scale-out by adding commodity servers (a weakness especially with MySQL); and better availability.

VoltDB deals with transaction processing (dominated by IBM and Oracle products) with very high throughput (due to the proliferation of devices as new data sources) and better performance. Their claim is that they have eliminated many unnecessary overheads from traditional RDBMS products by using in-memory techniques extensively.

Clustrix claims it has eliminated sharding (an extra burden on users if they have to manage it themselves) as offered by NoSQL products. Their mantra for success is scale-out on clusters – being able to handle high loads by adding commodity servers. They specifically focus on the MySQL user base.

The TransLattice Elastic Database (TED) is a Relational Database Management System that provides ANSI-SQL support, the ACID transactions enterprise applications require, and the ability to scale-out across wide distances using ordinary Internet connections. It uses partitioning to split databases across nodes. This notion is not new and has been deployed by IBM and Oracle for many years.

It was unclear why existing users of IBM or Oracle would adopt one of these products, as the incumbents are marching toward scale-out models and improving TCO. The MySQL community has been using external products for scalability for a while, and that is understandable. But as part of Oracle Corporation, MySQL will see enhancements in its scalability offerings. Then there is SAP HANA, which claims big performance gains.

There are many companies under this umbrella – Clustrix, GenieDB, Schooner, VoltDB, RethinkDB, ScaleDB, Akiban, CodeFutures, ScaleBase, TransLattice, NimbusDB, etc. With the marketing noise of Big Data and Cloud, new companies are getting funded by the dozens. It is going to be a tough space in which to differentiate and become a winner.

NewSQL – What is it?

There has been a lot of discussion on NoSQL databases over the past couple of years. These databases do not use the Structured Query Language (SQL), the standard data manipulation language for relational databases such as Oracle, DB2, MySQL, Sybase, and SQL Server. The data model is closer to object-oriented data and hence fits documents or geospatial data well. Being schema-less, they accommodate flexible data structures well, unlike their relational brethren. Examples of NoSQL databases are MongoDB (most popular), CouchDB, and Cassandra. Programming is easier, but strict consistency is not guaranteed. They also have scale-out models with replication and sharding (partitioning) for speed. These products support multiple programming languages.
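To make the sharding idea concrete, here is a minimal sketch in Python. It is not how MongoDB, CouchDB, or Cassandra implement partitioning. The names `shard_for` and `ShardedStore` are hypothetical, and the "nodes" are just in-process dictionaries. It only shows the basic idea of routing each key to one partition with a stable hash.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to keep the mapping the same across runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-value store split across several in-memory 'nodes' (assumption)."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # Each write touches only the shard that owns the key.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Each read is routed to the same shard the write went to.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for user in ["alice", "bob", "carol", "dave"]:
    store.put(user, {"name": user})
```

Because every key lands on exactly one shard, adding servers spreads both data and load. The price is that changing the number of shards moves keys around, which is part of the management burden the NewSQL vendors say they remove.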

A new category called NewSQL aims to provide the scale-out advantages of NoSQL databases, and often their commodity-hardware friendliness as well. But NewSQL databases maintain the transactional data consistency guarantees of traditional relational databases, as well as their compatibility with SQL for queries and connectivity (using technologies like ODBC and JDBC). One such product, NuoDB, holds that transactional, analytical, and “Web scale” elastic workloads can be handled by the same database; it’s just a matter of making that the design goal. This is hard to believe until proven!

Another NewSQL product, VoltDB, also claims to combine ACID-compliant transactions with analytics. VoltDB focuses on using in-memory technology to perform in situ analysis on financial, clickstream, gaming, and other high-velocity data as it streams in. In the company’s own words, VoltDB is meant to “narrow the ‘ingestion-to-decision’ gap.” There is a growing need for instant analysis of transactional data (real-time BI).

You squander the value of transactional data unless you analyze it as it is being recorded. SAP said much the same thing recently, as it announced the availability of its Business Suite on its HANA in-memory data platform, and fellow NewSQL player NuoDB uses in-memory and asynchronous technology to facilitate similar real-time analyses. Other NewSQL database products include ScaleDB and Clustrix, addressing the scalability needs of MySQL customers. Most of these products are also offering their services in the cloud.

It seems a grand unification process is on its way. Conventional relational databases and NoSQL databases seem to be at opposite ends of a spectrum. NewSQL databases acknowledge the merits in both models and seek to eliminate unreasonable compromise by marrying the approaches. NewSQL products may thus win out, but traditional relational database players may also incorporate NoSQL and NewSQL features to stay competitive. Perhaps that’s why Microsoft announced in November last year that the next major release of its SQL Server relational database will include an in-memory transactional database engine, codenamed “Hekaton.”

Big Data – Status

According to a Wall Street Journal article today by Rachael King and Steven Rosenbush, the market for new databases serving Big Data reached $1.22B last year and is expected to more than double by 2014 (according to research firm Wikibon). That is quite impressive.

Since relational databases using SQL are inefficient at handling data from social chatter, smartphones, and clicks (because of volume and variety), new databases have been popping up over the last 3-4 years. In the past two years, 119 database software companies have received $1.17B in VC funding (according to VentureSource, a Dow Jones company). This is remarkable, as not too long ago the space was declared taken by three incumbents – IBM, Oracle, and Microsoft. However, the scene has changed dramatically now.

Thanks must go to Google for pioneering the new wave of innovation with Bigtable, GFS (Google File System), and MapReduce algorithms for massively parallel processing on commodity hardware clusters. These ideas were re-implemented as open source under the Apache Software Foundation, and the result is Hadoop, HDFS, and several associated tools for the new ecosystem. Amazon, Yahoo, and Facebook have also contributed good work here.

The article mentions a client, AutoZone, using one of the new DBMSs, NuoDB, to better manage store inventory according to the needs of local shoppers. NuoDB, like many others, offers a cloud service with an annual subscription, cutting capex for customers.

Another client, Trulia (online real estate), was using MySQL but has added Cassandra to better manage home foreclosure and apartment listings among its 100 million homes in the US.

Shutterstock, a photo agency, stores 24 million images, with 10,000 added each day. It uses HDFS (Hadoop) to study user behavior (how long users hover over an image before purchasing).

The article suggests that large financial clients will stick to existing vendors such as Oracle for various reasons, but the threat from these newcomers is there. This is much like the way cloud software is shaking up Microsoft’s desktop software model.

We are in the data-intensive computing era now and the race will be fierce for leadership and market share.

IBM’s focus on Big Data and Analytics

Yesterday at IBM’s investor day meeting in San Jose, CEO Ginni Rometty specifically talked about the company’s focus on the Big Data and Analytics business. This is what she said:

IBM expects to continue its big bets on technologies like Big Data and analytics. “Data will be the basis of competitive advantage for every company, for every industry in the coming decade.”

To that end, she said that IBM now expects business analytics to account for as much as $20 billion in annual revenue by fiscal 2015. The prior target was $16 billion. If Big Blue hits that goal, it would amount to a doubling of analytics revenue from 2010.

That is quite a commitment, the likes of which have not been seen from other key players such as Oracle, HP, SAP, or Microsoft. IBM has a full division devoted to Big Data, and its coverage of the subject is quite impressive.

From my 16 years at IBM during the development of the DB2 family of products, I know firsthand the talent and experience IBM has in the data business. When they set their mind on an area, good things happen. Hence this commitment by the CEO is serious, and competitors had better take notice!

Five Questions around Big Data

Data is the new currency of business and we are in the era of data-intensive computing. Much has been written on Big Data throughout 2012, and customers around the world are struggling to figure out its significance to their businesses. Someone said there are 3 I’s to Big Data:

  • Immediate (I must do something right away)
  • Intimidating (what will happen if I don’t take advantage of Big Data)
  • Ill-defined (the term is so broad that I’m not clear what it means).

In this blog post, I would like to pose five key questions that customers must answer with regard to Big Data. So here goes.

1. Do I understand my data and do I have a data strategy?

There are many varieties of data – customer transaction data, operational data, documents/emails and other unstructured data, clickstream data, sensor data, audio streams, video streams, etc. Do I have a clear understanding of the 3 V’s of Big Data – Volume, Velocity, and Variety? What is data “in motion” vs. data “at rest”? Data in motion demands split-second decisions; do I have the tools for that? Every data source must be understood, along with its attributes and growth projections.

Customers must have an overall data strategy based on the business importance of each kind of data. For example, business-critical data must be highly reliable, secure, and fast to access. A data policy must be in place to take care of volume, growth, retention, security, and compliance needs.

2. What are my reporting needs to transform my business and give me insights for growth?

Businesses are transforming to stay ahead of the competition. While we asked, “what happened” in the past, now it is “why did it happen and what is going to happen?”. From data collection, we have to move to data analysis. Instead of analyzing existing business, we must create new business. Therefore, the retail industry wants to give “today’s recommendation” on the fly to clients; internal IT needs operational intelligence to make it more efficient; customer service must provide customer insight; and fraud management must look at social profiles to reduce fraud. The list goes on…

Do you have a clear understanding of your reporting needs via data visualization on mobile devices like the iPad with touch interface? You will need a strategy of all the analytic tools for key employees/executives to make quick business-relevant decisions.

3. How do I drastically reduce my TCO of Data Warehousing and BI?

Many large enterprises are spending millions of dollars to move operational data to a data warehouse via ETL (Extraction, Transformation, Loading) tools. This can be expensive and time consuming. Sears, for example, has a slogan: “ETL must die”. By moving to Hadoop, they cut one step of a 20-hour ETL process from 10 hours to 17 minutes. They claim serious cost reductions by moving from traditional ETL to direct loading of raw data onto Hadoop servers. Today’s implementations must be studied for price-performance, as newer technologies can bring down costs and improve processing time drastically. Would you like to develop reports in days rather than weeks?

4. How does Big Data co-exist with my current OLTP and DW data?

All enterprises have business-critical operational systems (OLTP). These use traditional DBMS products (such as Oracle, DB2, IMS, etc.). Enterprises have also created separate data warehousing systems with BI tools for analysis. Now the new world of Internet data, such as chatter from social networks and Web log data (digital exhaust), is adding to the complexity. What is your approach to integrating legacy and new data?

5. What is the right technology for my needs?

I keep hearing so many new terms and vendor names – Hadoop, Cloudera, Hortonworks, Datameer, NoSQL, MongoDB, Map-reduce, Data Appliance, HBase, etc. It surely can be very confusing!

I need to know what the right technology is for my needs. If I have petabyte volumes of data coming from various sources, what technology can I implement to handle that efficiently? Then, how do I get relevant information from that pile to give me business insights? I also need to know what skills I need to do that, and the cost. I need an implementation roadmap for getting value from all the data that my business is generating.

Tech thoughts for 2013

Last year, we saw three trends making lots of noise and a fourth one closely following – Cloud Computing, Big Data, Mobility, and Social networking for the enterprise. Let me comment on each one as we enter 2013.

In cloud computing, the focus shifts to Platform as a Service (PaaS), as SaaS is now accepted into the mainstream. CRM and HR applications dominate the space, with Salesforce.com and Workday as leaders. Microsoft, for example, is evolving Windows Azure from a PaaS toward Infrastructure as a Service (IaaS). Last year, it added persistent-state virtual machine support to Azure, allowing it to accommodate a wider variety of software, including Linux. Microsoft also introduced Hadoop for Azure and support for MapReduce. Amazon’s AWS stack now blurs the boundary between PaaS and IaaS. Salesforce.com wants to be a PaaS player via its Force.com platform for developing any SaaS offering. Besides CRM/HR cloud apps, we have seen the emergence of financial apps for midsize companies – Adaptive Planning, Anaplan, Host Analytics, and Tidemark are some examples.

In Big Data, the focus will shift more to analytics and data visualization. The other key trend is “data in motion”, where capture and analysis can be done for split-second decisions. The post-Hadoop era has started, and we see a host of new players offering near-real-time data reduction and analysis. This trend will accelerate. A set of NewSQL players (not NoSQL) are adding scale and performance to Postgres or MySQL, which can also be offered as a cloud service. Relational databases like IBM’s DB2 and Oracle will dominate the enterprise space, given their long years of proven robustness and reliability. However, extreme scale on the order of petabytes will attract newer solutions.

Mobility is a given, thanks to iPads outselling PCs. Last year iPad sales exceeded Lenovo’s PC sales. Cloud computing assumes that users work on devices like iPads, Android tablets, and smartphones. Apple boasts over 700,000 iOS applications. Microsoft has a lot of catching up to do, given the slow sales of Surface RT. Going forward, every enterprise application must design its UI for the form factors of mobile devices. This will be the price of entry for any vendor. Gone are the days of drop-down menus on Windows as the UI.

Social networking has grown a great deal for consumers, but enterprises are still struggling to figure out the proper usage and business benefits. Social will come into the organization through the back door (much like how PCs entered the business during the 1980s and 1990s). Whether it is a communications director testing out a company page on Facebook, customers complaining about or praising your company on their Twitter profiles, or traditional enterprise applications being updated with social capabilities, there will be social. Hence it may be worthwhile for your company to have a policy around social. I think enterprise applications will integrate more social features. Someone said that Facebook will matter less, but Twitter and Pinterest will be of more significance.

Welcome to 2013.

Big Data at Sears

Sears and Kmart belong to Sears Holdings, whose goal is to get closer to its customers and increase customer loyalty. That requires serious data analytics capabilities for understanding consumer behavior. While revenue at Sears has declined from $50B in 2008 to $42B in 2011, rivals like Wal-Mart, Target, and Amazon have grown steadily with better profits. Amazon’s business has grown from $19B in revenue in 2008 to $48B in 2011, passing Sears for the first time.

Sears used IMS (IBM’s first-generation database product) on the mainframe, plus Teradata. Its ETL process, using IBM DataStage software on a cluster of distributed servers, took 20 hours to run. Since their adoption of Hadoop back in 2010, one of the steps (accounting for 10 of the 20 hours) now runs in 17 minutes. Their slogan is “ETL must die”, as they would like to load raw data directly into Hadoop. The old systems consisted of EMC Greenplum, Microsoft SQL Server, and Oracle Exadata (four boxes) for the analytical workload. That is all being replaced by Hadoop, Datameer, MySQL, and InfoBright. Teradata is staying.

Sears’ process for analyzing marketing campaigns for loyalty club members used to take six weeks on mainframe, Teradata, and SAS servers. The new process running on Hadoop can be completed weekly. For certain online and mobile commerce scenarios, Sears can now perform daily analyses. The Hadoop systems at 200 terabytes cost about one-third as much as 200-TB relational platforms. Mainframe costs have been reduced by more than $500K per year while delivering 50-100 times better performance on batch jobs. The volume of data on Hadoop is currently at 2 petabytes. As the CTO says, Hadoop is no longer a science project at Sears – critical reports run on the platform, including financial analyses; SEC reporting; logistics planning; and analysis of supply chains, products, and customer data. Sears uses Datameer, a spreadsheet-style tool that supports data exploration and visualization directly on Hadoop. Sears claims it can develop interactive reports in 3 days that used to take 6 to 12 weeks.

Sears has actually spun off a new subsidiary called MetaScale to offer Hadoop-based cloud services to other retailers. They are leveraging their three years of acquired expertise in Hadoop to make money in analytic services. There are many open questions on whether Hadoop will be the platform that brings big success to Sears in the future.

Percolator, Dremel and Pregel – Google’s new data-crunching ecosystem

Hadoop traces its origins to Google, where two early projects, GFS (Google File System) and GMR (Google MapReduce), were written alongside Bigtable to manage large volumes of data. These systems are great at crunching large volumes of data in batch mode in a distributed computing environment (with commodity servers). Any change to the data requires streaming over the entire data set, and thus incurs high latency. So this approach is good for “data at rest”, or static data.
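The batch nature of this model is easy to see in a toy word count. The sketch below is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop or Google code. The function names are my own, and a real system would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the whole batch job over the complete input."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


docs = ["big data is big", "data in motion"]
counts = word_count(docs)
```

Note that if one document changes, the only way to refresh `counts` in this model is to call `word_count` again over all the documents. That full rescan is the latency problem described above.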

Now Google finds itself limited by its own invention of GFS/GMR/BigTable. Hence they have been working on the post-Hadoop set of data crunching tools – Percolator, Dremel, and Pregel. Here is a brief narration of each of these tools.

Percolator is a system for incrementally processing updates to a large data set. By replacing a batch-based indexing system with one based on incremental processing, you significantly speed up the process and reduce analysis time. Percolator’s architecture provides horizontal scalability and resilience. The best candidates for this approach are large indexes, where the performance improvement can be a factor of 100. The big advantage of Percolator is that the indexing time is now proportional to the size of the page, not to the size of the index.
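The “proportional to the size of the page” point can be illustrated with a small inverted index that is updated in place. This is only a single-process sketch of the incremental idea, not Percolator’s actual design. The class `IncrementalIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is patched per page instead of rebuilt."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of pages containing it
        self.page_words = {}              # page id -> words currently indexed

    def update_page(self, page_id, text):
        """Apply one page change; work done depends only on this page's words."""
        new_words = set(text.lower().split())
        old_words = self.page_words.get(page_id, set())
        for word in old_words - new_words:   # words removed from the page
            self.postings[word].discard(page_id)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:   # words added to the page
            self.postings[word].add(page_id)
        self.page_words[page_id] = new_words

    def search(self, word):
        """Return the sorted ids of pages that contain the word."""
        return sorted(self.postings.get(word.lower(), set()))


index = IncrementalIndex()
index.update_page("p1", "big data tools")
index.update_page("p2", "graph data")
index.update_page("p1", "big graph tools")  # p1 is edited; p2 is never touched
```

The third call only looks at the words of page `p1`, however many other pages the index holds. A batch rebuild would rescan every page.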

Dremel is for ad-hoc analytics. It is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. Dremel claims to be about 100 times faster than MapReduce. Its architecture is similar to Pig and Hive, but instead of MapReduce, its engine is based on aggregator trees.
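The two ingredients named here, a columnar layout and a tree of aggregators, can be sketched in a few lines of Python. This is a toy under stated assumptions, not Dremel: the records are flat rather than nested, every row has the same fields, and `to_columns`, `sum_by`, and `merge` are hypothetical helper names.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list per column (columnar layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, group_col, value_col):
    """Leaf-level aggregation: reads only the two columns the query needs."""
    totals = {}
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] = totals.get(group, 0) + value
    return totals


def merge(partials):
    """Parent node of the tree: combine partial sums from its children."""
    merged = {}
    for partial in partials:
        for group, value in partial.items():
            merged[group] = merged.get(group, 0) + value
    return merged


rows = [
    {"country": "US", "clicks": 3, "url": "a.com"},
    {"country": "FR", "clicks": 5, "url": "b.com"},
    {"country": "US", "clicks": 7, "url": "c.com"},
]
partitions = [rows[:2], rows[2:]]  # pretend each partition lives on its own server
leaves = [sum_by(to_columns(p), "country", "clicks") for p in partitions]
result = merge(leaves)
```

The `url` column is never read by the query. Each leaf returns only a small partial result to its parent, which is what lets the tree fan out over many servers.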

Pregel is a system for large-scale graph processing and graph data analysis. It is designed to execute graph algorithms faster, and its API is easy to use. As expected, Pregel is architected for efficient, scalable, and fault-tolerant execution on clusters of thousands of commodity computers. Graphs are everywhere – social networks, computer network topologies, games among soccer teams, citations among scientific papers, and the most pervasive graph of all, the web itself. Pregel is a scalable infrastructure for mining a wide range of graphs, and programs are expressed as a sequence of iterations. Google has been using Pregel internally for some time now.
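The “sequence of iterations” style can be shown with a small example in which every vertex learns the largest value in the graph by exchanging messages with its neighbours. This is a single-process sketch of the vertex-centric idea, not Google’s API. The function `pregel_max` is a hypothetical name, and it assumes every neighbour also appears as a key in the graph.

```python
def pregel_max(graph, values):
    """Propagate the maximum value through a graph in message-passing rounds.

    graph:  vertex -> list of neighbours it sends messages to
    values: vertex -> initial number held by that vertex
    """
    values = dict(values)

    # First iteration: every vertex sends its value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    # Later iterations: a vertex acts only if a message improves its value.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
            # Otherwise the vertex stays idle until a new message arrives.
        inbox = outbox

    return values


# A chain a - b - c - d, with edges listed in both directions.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final = pregel_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
```

The loop ends when no vertex has anything left to send. In a distributed setting, each iteration would run in parallel across the machines that hold the vertices.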

Besides Google, Facebook and Twitter are also working on new innovations. Recently Twitter released its Storm project as open source. One key trend is “data in motion”, or how to deal with data that is moving. This is the velocity aspect of Big Data.

In-memory Database – Oracle’s Exadata X3 vs. SAP’s HANA

This week at the Oracle OpenWorld conference, Larry Ellison announced the new Exadata X3 system, which has 4TB of DRAM plus 22TB of Flash (SSD) memory. He said that you could therefore have 26TB of in-memory data for fast processing at very high write speed (1 million writes per second). Clearly this is aimed at SAP’s in-memory database project HANA, which has been shipping for the last 6 months. Larry, in his typical style, derided HANA as having only 0.5TB of DRAM and therefore not worth comparing to the X3.

Subsequent to this announcement, Vishal Sikka, SAP’s CTO and head of HANA development, wrote a blog post calling Oracle’s claim false and misleading. He says that the 22TB of SSD does not count as memory, and that HANA has such SSD for persistence. He says, “We are presently shipping, for the last several months, certified 16-node HANA hardware made by 4 vendors: IBM, HP, Fujitsu and Cisco. These systems are available for 16TB of DRAM, so they are already 4 times bigger than Oracle’s machine, and they have been in the market since spring of this year. The machines can take up to 32TB of DRAM, within their current configurations. In IBM’s case, with the Max5 configuration, they can go up to 40TB.”

During SAP’s annual conference Sapphire last May, they demonstrated the largest HANA system built so far – an IBM cluster running 100TB of DRAM and 4000 CPU cores. Already today this system can go up to 250TB of DRAM (and with HANA’s compression, can hold multiple petabytes of data entirely in memory).

SAP is not in the hardware business, unlike Oracle, which (with its Sun acquisition) is quite motivated to make hardware succeed by creating its “engineered” systems. SAP gets its hardware clusters from four vendors, and IBM is the strongest partner (even though it competes in the database space). SAP claims that HANA is not merely an in-memory database system; it provides many additional functions such as real-time analytics.

Oracle will be a formidable competitor, as it has the longest experience in managing data. Now it is shifting to providing platform as a service with database, analytics, application development, and social services. The game is not merely about speeds and feeds, but several other dimensions. SAP claims rapid adoption of HANA in the few months since its introduction. It is hard to compare the two, as there are no benchmark performance numbers.

The market will be the best judge of who is better. There are many camps now. The NoSQL camp is denouncing all the traditional database vendors as incapable of handling large volumes of unstructured data (Big Data). The initial target of both SAP and Oracle seems to be their existing customer bases, which will find a logical upgrade path in these in-memory database solutions for speed and scalability.