Category Archives: BI

NewSQL – What is it?

There has been a lot of discussion about NoSQL databases over the past couple of years. These databases do not use the Structured Query Language (SQL), the standard data manipulation language for relational databases such as Oracle, DB2, MySQL, Sybase, and SQL Server. Their data model is closer to object-oriented data and hence fits documents or geospatial data well. Being schema-less, they accommodate flexible data structures, unlike their relational brethren. Examples of NoSQL databases are MongoDB (the most popular), CouchDB, and Cassandra. Programming is easier, but strict consistency is not guaranteed. They also scale out through replication and sharding (partitioning) for speed, and they support multiple programming languages.
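To make the sharding idea concrete, here is a minimal sketch of hash-based partitioning. The function names (`shard_for`, `partition`) and the sample user documents are made up for illustration; real NoSQL products use their own partitioning schemes, so treat this as the general idea rather than how any particular database does it.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard number using a stable hash.

    Hypothetical helper: MD5 is used only as a stable, evenly spread hash,
    not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(records: dict, num_shards: int) -> list:
    """Split a key -> document mapping across num_shards buckets."""
    shards = [dict() for _ in range(num_shards)]
    for key, doc in records.items():
        shards[shard_for(key, num_shards)][key] = doc
    return shards


# Schema-less documents: each value is just a dictionary.
users = {f"user{i}": {"name": f"User {i}"} for i in range(1000)}
shards = partition(users, 4)

# A lookup only has to visit the one shard that owns the key.
owner = shard_for("user42", 4)
doc = shards[owner]["user42"]
```

Each shard could live on a separate commodity server, which is where the scale-out comes from. Replication would then keep extra copies of each shard on other machines.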

A new category, NewSQL databases, aims to provide the scale-out advantages of NoSQL databases, and often their commodity-hardware friendliness as well. But NewSQL databases maintain the transactional data consistency guarantees of traditional relational databases, as well as their compatibility with SQL for queries and connectivity (using technologies like ODBC and JDBC). One such vendor, NuoDB, believes that transactional, analytical and “Web scale,” elastic workloads can be handled by the same database; it’s just a matter of making that the design goal. This is hard to believe until proven!
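The transactional guarantee that NewSQL products promise to keep can be illustrated with plain SQL. The sketch below uses SQLite from the Python standard library only because it is easy to run; SQLite is not a NewSQL database, and the `accounts` table and `transfer` function are invented for this example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically; undo everything on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

accepted = transfer(conn, "alice", "bob", 30)   # both updates are applied
rejected = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
```

After the rejected transfer, neither account shows a half-applied change. That all-or-nothing behavior is what many NoSQL systems relax, and what NewSQL vendors say they can keep while still scaling out.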

Another NewSQL product, VoltDB, also claims to combine ACID-compliant transactions with analytics. VoltDB focuses on using in-memory technology to perform in situ analysis on financial, clickstream, gaming, and other high-velocity data as it streams in. In the company’s own words, VoltDB is meant to “narrow the ‘ingestion-to-decision’ gap.” There is a growing need for instant analysis of transactional data (real-time BI).

You squander the value of transactional data unless you analyze it as it is being recorded. SAP said much the same thing recently, as it announced the availability of its Business Suite on its HANA in-memory data platform, and fellow NewSQL player NuoDB uses in-memory and asynchronous technology to facilitate similar real-time analyses. Other NewSQL database products include ScaleDB and Clustrix, addressing the scalability needs of MySQL customers. Most of these products are also offering their services in the cloud.

It seems a grand unification process is on its way. Conventional relational databases and NoSQL databases seem to be at opposite ends of a spectrum. NewSQL databases acknowledge the merits in both models and seek to eliminate unreasonable compromise by marrying the approaches. NewSQL products may thus win out, but traditional relational database players may also incorporate NoSQL and NewSQL features to stay competitive. Perhaps that’s why Microsoft announced in November last year that the next major release of its SQL Server relational database will include an in-memory transactional database engine, codenamed “Hekaton.”

Big Data – Status

According to a Wall Street Journal article today by Rachael King and Steven Rosenbush, the market for new databases serving Big Data reached $1.22B last year and is expected to more than double by 2014 (according to research firm Wikibon). That is quite impressive.

Since relational databases using SQL are inefficient at handling data from social chatter, smartphones, and clicks (because of its volume and variety), new databases have been popping up over the last 3-4 years. In the past two years, 119 database software companies have been funded by VCs to the tune of $1.17B (according to Venture Source, a Dow Jones company). This is remarkable, as not too long ago the space was declared taken by three incumbents – IBM, Oracle, and Microsoft. The scene has changed dramatically since then.

Thanks must go to Google for its pioneering work on Bigtable, GFS (Google File System), and the MapReduce model for massively parallel processing on commodity hardware clusters. Open source implementations of these ideas found a home at the Apache Software Foundation, and the result is Hadoop, HDFS, and several associated tools for the new ecosystem. Amazon, Yahoo and Facebook have also contributed good work here.
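For readers who have not seen the MapReduce model, here is a toy, single-machine sketch of its three steps (map, shuffle, reduce) applied to the classic word-count problem. The function names are mine, and nothing here uses Hadoop itself. A real cluster runs the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


docs = ["big data is big", "data in motion"]
counts = map_reduce(docs)
```

Because each document is mapped independently and each key is reduced independently, the work spreads naturally across a cluster of cheap machines.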

The article mentions a client, AutoZone, using one of the new DBMSs, NuoDB, to better manage store inventory according to local shoppers. NuoDB, like many others, offers a cloud service with an annual subscription, cutting capex for customers.

Another client, Trulia (online real estate), was using MySQL but has added Cassandra to better manage the foreclosure and apartment listings among its 100 million homes in the US.

Shutterstock, a photo agency, stores 24 million images, with 10,000 added each day. It uses HDFS (Hadoop) to study user behavior (such as how long users hover over an image before purchasing).

The article suggests that large financial clients will stick to existing vendors such as Oracle for various reasons, but the threat of these newcomers is there. This is much like the way cloud software is shaking up Microsoft’s desktop software model.

We are in the data-intensive computing era now and the race will be fierce for leadership and market share.

IBM’s focus on Big Data and Analytics

Yesterday at IBM’s investor day meeting in San Jose, CEO Ginni Rometty spoke specifically about the company’s focus on its Big Data and analytics business. This is what she said:

IBM expects to continue its big bets on technologies like Big Data and analytics. “Data will be the basis of competitive advantage for every company, for every industry in the coming decade.”

To that end, she said that IBM now expects business analytics to account for as much as $20 billion in annual revenue by fiscal 2015. The prior target was $16 billion. If Big Blue hits that goal, it would amount to a doubling of analytics revenue from 2010.

That is quite a commitment, the likes of which have not been seen from other key players such as Oracle, HP, SAP, or Microsoft. IBM has a full division devoted to Big Data, and its coverage of the subject is quite impressive.

From my 16 years at IBM during the development of the DB2 family of products, I know firsthand the talent and experience IBM has in the data business. When they set their mind on an area, good things happen. Hence this commitment by the CEO is serious, and competitors had better take notice!

Five Questions around Big Data

Data is the new currency of business and we are in the era of data-intensive computing. Much has been written on Big Data throughout 2012, and customers around the world are struggling to figure out its significance to their businesses. Someone said there are 3 I’s to Big Data:

  • Immediate (I must do something right away)
  • Intimidating (what will happen if I don’t take advantage of Big Data)
  • Ill-defined (the term is so broad that I’m not clear what it means).

In this blog post, I would like to pose five key questions that customers must find answers to with regard to Big Data. So here goes.

1. Do I understand my data and do I have a data strategy?

There are many varieties of data – customer transaction data, operational data, documents/emails and other unstructured data, clickstream data, sensor data, audio streams, video streams, etc. Do I have a clear understanding of the 3V’s of Big Data – Volume, Velocity, and Variety? What is data “in motion” vs. data “at rest”? Data in motion demands split-second decisions; do I have the tools for that? Every data source must be understood, along with its attributes and growth projections.
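The difference between data at rest and data in motion is easier to see in code. In this small sketch (the class name, window size, and sample readings are all made up), the batch function waits until the whole data set is available, while the streaming class gives an updated answer as each reading arrives.

```python
from collections import deque


def batch_average(readings):
    """Data at rest: the full data set exists before analysis starts."""
    return sum(readings) / len(readings)


class SlidingWindowAverage:
    """Data in motion: refresh the answer every time a reading arrives."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # keeps only the newest readings
        self.total = 0.0

    def add(self, value):
        # If the window is full, the oldest reading is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


readings = [10, 20, 30, 40, 50]

at_rest = batch_average(readings)

stream = SlidingWindowAverage(size=3)
in_motion = [stream.add(r) for r in readings]
```

The streaming version never stores more than the last three readings, so it can keep up with a feed that never ends. That is the kind of tooling the question above is asking about.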

Customers must have an overall data strategy based on the business importance of each kind of data. For example, business-critical data must be highly reliable, secure and of high performance. A data policy must be in place to take care of volume, growth, retention, security and compliance needs.

2. What are my reporting needs to transform my business and give me insights for growth?

Businesses are transforming to stay ahead of the competition. While we asked, “what happened” in the past, now it is “why did it happen and what is going to happen?”. From data collection, we have to move to data analysis. Instead of analyzing existing business, we must create new business. Therefore, the retail industry wants to give “today’s recommendation” on the fly to clients; internal IT needs operational intelligence to make it more efficient; customer service must provide customer insight; and fraud management must look at social profiles to reduce fraud. The list goes on…

Do you have a clear understanding of your reporting needs via data visualization on mobile devices like the iPad with touch interface? You will need a strategy of all the analytic tools for key employees/executives to make quick business-relevant decisions.

3. How do I drastically reduce my TCO of Data Warehousing and BI?

Many large enterprises are spending millions of dollars to move operational data to a data warehouse via ETL tools (Extraction, Transformation, Loading). This can be expensive and time consuming. Sears, for example, has a slogan “ETL must die”. By moving to Hadoop, they reduced the ETL time from 20 hours to 17 minutes. They claim serious cost reductions by moving from traditional ETL to direct loading of raw data to Hadoop servers. Today’s implementations must be studied for price-performance and newer technologies can bring down costs and improve processing time drastically. Would you like to develop reports in days rather than weeks?

4. How does Big Data co-exist with my current OLTP and DW data?

All enterprises have business-critical operational systems (OLTP). These use traditional DBMS systems (such as Oracle, DB2, IMS, etc.). Enterprises have also created separate Data Warehousing systems with BI tools for analysis. Now the new world of Internet data, such as chatter from social networks and Web log data (digital exhaust), is adding to the complexity. What is your approach to integrating the legacy data with the new data?

5. What is the right technology for my needs?

I keep hearing so many new terms and vendor names – Hadoop, Cloudera, Hortonworks, Datameer, NoSQL, MongoDB, Map-reduce, Data Appliance, HBase, etc. It surely can be very confusing!

I need to know what the right technology is for my needs. If I have petabyte volumes of data coming from various sources, what technology can I implement to handle that efficiently? Then, how do I get relevant information from that pile to help my business insights? I also need to know what skills I need to do that, and the cost. I need an implementation roadmap for getting value from all the data that my business is generating.

Tech thoughts for 2013

Last year, we saw three trends making lots of noise and a fourth one closely following – Cloud Computing, Big Data, Mobility, and Social networking for the enterprise. Let me comment on each one as we enter 2013.

In cloud computing, the focus shifts to Platform as a Service (PaaS), as SaaS is now accepted into the mainstream. CRM and HR applications dominate the SaaS space, with Salesforce.com and Workday as leaders. Microsoft, for example, is evolving Windows Azure from a PaaS toward Infrastructure as a Service (IaaS). Last year, it added persistent-state virtual machine support to Azure, allowing it to accommodate a wider variety of software, including Linux. Microsoft also introduced Hadoop for Azure and support for MapReduce. Amazon’s AWS stack now blurs the boundary between PaaS and IaaS. Salesforce.com wants to be a PaaS player via its Force.com platform for developing any SaaS offering. Besides CRM/HR cloud apps, we have seen the emergence of financial apps for midsize companies; Adaptive Planning, Anaplan, Host Analytics, and Tidemark are some examples.

In Big Data, the focus will shift more to analytics and data visualization. The other key trend is “data in motion”, where capture and analysis can be done for split-second decisions. The post-Hadoop era has started, and we see a host of new players offering near-realtime data reduction and analysis. This trend will accelerate. A set of NewSQL players (not NoSQL) is adding scale and performance to Postgres or MySQL, which can also be offered as a cloud service. Relational databases like IBM’s DB2 and Oracle will dominate the enterprise space, given their long years of proven robustness and reliability. However, extreme scale on the order of petabytes will attract newer solutions.

Mobility is a given, thanks to iPads outselling PCs. Last year iPad sales exceeded Lenovo’s PC sales. Cloud computing assumes that users work on devices like iPads, Android devices, and smartphones. Apple boasts over 700,000 iOS applications. Microsoft has a lot of catching up to do, given the slow sales of Surface RT. Going forward, every enterprise application must design its UI to the form factors of mobile devices. This will be the price of entry for any vendor. Gone are the drop-down menus and icons of Windows as the UI.

Social networking has grown a great deal for consumers, but enterprises are still struggling to figure out the proper usage and business benefits. Social will come into the organization through the back door (much like how PCs entered the business during the 1980s and 1990s). Whether it is a communications director testing out a company page on Facebook, customers complaining about or praising your company on their Twitter profiles, or traditional enterprise applications being updated with social capabilities, there will be social. Hence it is worthwhile for your company to have some policy around social. I think enterprise applications will integrate more social features. Someone said that Facebook will matter less, while Twitter and Pinterest will be of more significance.

Welcome to 2013.

My Talk on Big data yesterday (Sept 27, 2012)

I gave a talk titled Big Data – Trends and Challenges yesterday in San Jose. This was organized as a meet-up event by Datapipe and Compassites Software. Datapipe provides cloud infrastructure services to clients, whereas Compassites Software (where I am a board director) is a technology services firm out of Bangalore, India, focusing on areas like consumerization of IT, cloud computing, and Big Data.

At the talk yesterday, I realized how confused people seem to be about Big Data, as the term is so ill-defined. One thing is for sure: Big Data comes in one size – Big. Besides the size issue (petabytes and up), there is the velocity issue (data in motion vs. data at rest) and the variety issue. I mentioned that as the volume of data keeps rising, the percentage of data used for analysis and insight keeps declining. I mentioned that 80% of the data in the world is unstructured, hence new solutions are being invented. Also, M2M (machine to machine) or sensor data keeps rising. In the volume context, I said that a single engine on a Boeing 747 spills out 10 terabytes every 30 minutes. Taken together, the four engines on a Boeing 747 flying across the Atlantic produce a staggering 640TB. Every day there are 25,000 flights across the Atlantic, and you can do the math on how much data gets collected per day.
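For those who would rather not do the math by hand, here is the back-of-the-envelope calculation. It takes the 640TB-per-crossing and 25,000-flights figures quoted in the talk at face value. It also assumes, generously, that every flight produces as much data as a four-engine 747, and it uses decimal units (1PB = 1,000TB).

```python
# Figures quoted in the talk.
TB_PER_CROSSING = 640
FLIGHTS_PER_DAY = 25_000

# Decimal unit conversions (an assumption; binary units would differ slightly).
TB_PER_PB = 1_000
PB_PER_EB = 1_000

daily_tb = TB_PER_CROSSING * FLIGHTS_PER_DAY
daily_pb = daily_tb / TB_PER_PB
daily_eb = daily_pb / PB_PER_EB

print(f"{daily_tb:,} TB = {daily_pb:,.0f} PB = {daily_eb:,.0f} EB per day")
```

Under those assumptions the total comes to 16 million terabytes, or about 16 exabytes, per day. It is an upper-bound illustration rather than a measurement, but it makes the point about volume.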

We discussed the business value of Big Data and how the typical pilot project at enterprises seems to be IT log data analysis. Other areas like fraud detection, social media, and call center feedback are candidates for Big Data applications. On the technology front, much has been happening during the last 5-7 years. All the innovations are coming out of the new web companies like Google, Amazon, Yahoo, Facebook, and Twitter. The Hadoop platform is an offshoot of Google’s early work on GFS (Google File System) and GMR (Google MapReduce). Google is moving beyond Hadoop via its recent work on Dremel, Percolator, and Pregel. Facebook is also putting out many new projects like Puma, mostly for realtime access and analysis. Twitter’s Storm project is also noteworthy. Google recently began offering BigQuery as a cloud service. Then there are dozens of NoSQL products such as Cassandra, Couchbase, MongoDB, Riak, etc.

It is important to remember that the world is not being taken over by Hadoop, as it is a batch system for handling very large data volumes via distributed parallel processing on commodity hardware. It does not touch the space of OLTP, which is critical for the airline and banking industries. Also, if your data volume is under 100 terabytes and it is structured data, then current offerings of Data Warehousing via an RDBMS or appliances (e.g. Oracle Exadata, IBM Netezza) are excellent solutions. The web-centric interactive world has given rise to the need for extreme scale, and the Hadoop-based solutions must learn to co-exist with the existing world. Hence Big Data integration will be a key area.

One thing is for sure: there is a lot of interest in the subject of Big Data, and clarity is the one thing lacking amidst all the marketing hype and noise.

Big Data – business value

With every passing day, Big Data assumes new strength as a significant force in our industry. Someone even said that Big Data is transforming business the same way IT did a few decades ago.

The overall revenue (including hardware, software, and services) for Big Data is said to be around $5.1B in 2011 and includes players such as IBM, Intel, HP, Fujitsu, etc. This is hard to fathom! But pure-play revenue coming from players such as Vertica, Aster, Cloudera, Greenplum, 1010Data, etc. is valued at $468M in 2011. Then someone said the projected revenue from Big Data will reach an astounding $53B by 2017 (source: Wikibon), passing $10B in 2013, $32B in 2015 and $48B in 2016 along the way. We can argue about these numbers, but let us agree that this will be quite big. Why is that?
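To put that projection in perspective, here is the implied compound annual growth rate, using only the two endpoints quoted above ($5.1B in 2011 and $53B in 2017). The `cagr` helper is my own; the formula is the standard one, (end / start) ** (1 / years) - 1.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values, as a fraction."""
    return (end / start) ** (1 / years) - 1


# Endpoints quoted from the Wikibon figures, in billions of dollars.
revenue_2011 = 5.1
revenue_2017 = 53.0

rate = cagr(revenue_2011, revenue_2017, 2017 - 2011)
print(f"Implied growth: {rate:.1%} per year")
```

That works out to roughly 48% growth per year, sustained for six years, which is why the forecast sounds so astounding.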

We all know about data growth. Facebook, with 900M users in April 2012, did analytics on 25PB (petabytes) of compressed data (125PB of raw data). Twitter handled 400M tweets a day during June 2012. Overall corporate data is supposed to grow by 94% year over year. Facebook made several shifts, such as moving from “what data to store” to “what can we do with more data”. They simplified data analytics for end users by adopting more than one infrastructure to solve all Big Data problems.

I read that someone identified the 3 I’s of Big Data besides the 3 V’s (volume, velocity, and variety). They are Immediate (do something now), Intimidating (what if I don’t?), and Ill-defined (what is it anyway? There are many definitions). The middle one, “Intimidating”, refers to leveraging Big Data applications like online advertising and marketing optimization, or applications to predict crime data, or machine-generated data for analytics.

Many vertical industries are trying to exploit Big Data and analytics – retail (today’s recommendation), sales leads (campaign recommendation), IT (from manual log files to operational intelligence), customer service (customer insight), billing (intelligent coding), fraud management (social profiles), and automatic operations management.

Key challenges remain, especially in the areas of visual analytic tools and trend analysis across multiple data sources. But several new start-ups are addressing these white spaces. Big Data is not just Hadoop or NoSQL database systems. It encompasses current RDBMS data and new unstructured data, plus all the analytics and special applications that give businesses the insight for better growth.

Big Data & Analytics – What is new?

A friend of mine from my IBM days (an expert in Data Warehousing, BI, etc.) told me about the Hadoop conference he attended in San Jose a few weeks back. When he attended the same conference two years ago in New York, there were hardly 200 attendees, whereas this time the number exceeded 2,000 and it was a sold-out event. This just shows how fast interest in Hadoop has grown. He said that one theme ran through the whole event, the need for Hadoop skills, as almost every presentation had a slide saying “we are hiring”.

Hadoop offers a massively scalable data management and analysis environment that can handle many different data types without the complicated transformation and schema changes required to load diverse data into a conventional RDBMS. Remember the days of ETL (Extraction, Transformation, Loading), when data massaging and cleansing preceded the creation of the Data Warehouse for analytics purposes? Given the growth in data volume, velocity and variety, the era of “Big Data” has started, and new tools such as Hadoop are the need of the hour for search and analytics.
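The “no up-front transformation” idea is often called schema-on-read: raw records are stored as they arrive, and structure is applied only when a question is asked. Here is a tiny sketch of that idea in plain Python. The sample records and function names are invented, and it does not use Hadoop itself.

```python
import json

# Raw lines stored exactly as they arrived: mixed record types, optional
# fields, and even a corrupt line. No schema was enforced at load time.
RAW_LINES = [
    '{"type": "click", "user": "u1", "page": "/home"}',
    '{"type": "tweet", "user": "u2", "text": "loving the new store"}',
    '{"type": "click", "user": "u1", "page": "/cart", "referrer": "ad-17"}',
    "not valid json",
]


def read_records(lines):
    """Parse at read time, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def clicks_per_user(lines):
    """One 'query': count click events per user, ignoring other records."""
    counts = {}
    for record in read_records(lines):
        if record.get("type") == "click" and "user" in record:
            counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts


result = clicks_per_user(RAW_LINES)
```

A conventional warehouse load would have needed a table design, and probably a cleansing step, before the first row went in. Here the cost is paid at query time instead, which is the trade-off to keep in mind.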

Three vendors are worth mentioning here in the Hadoop solution space.

- Cloudera is the market share leader. It offers the open source Apache Hadoop software in its fourth-generation distribution (CDH4), along with its proprietary system management software. The new version of CDH offers high availability, improved security and hot failover for the NameNode (metadata server) of HDFS (the file system). This node has been known as a single point of failure (not good for enterprise needs).

- Hortonworks, which spun out of Yahoo last year, has released its first product, the Hortonworks Data Platform. It uses the Hadoop 1.0 code base (more stable), reassuring enterprise users. It addresses high availability and failover needs with VMware virtualization and uses open source software for the management console and also for ETL (Talend software).

- The third player is MapR, which pitches its Hadoop distribution as a high-performance alternative, replacing HDFS with a derivative of the Unix-based Network File System that is highly scalable and has high availability features. MapR is also part of Amazon’s Elastic MapReduce service.

Hadoop scales in linear fashion to solve the data-volume challenge and runs on commodity hardware (less expensive). It has challenges in terms of skill shortage and batch-related delays. Many IT shops want their old-school BI systems integrated with Hadoop, to analyze data inside a cluster or result sets moved out of Hadoop. New analytics vendors are popping up. Two start-ups are worth mentioning – Datameer and Karmasphere.

Datameer’s analytics platform provides modules for data integration with sources ranging from mainframes to Twitter. It provides a spreadsheet-driven data analysis environment meant for business analysts without IT skills. Karmasphere also provides reporting, analysis, and data visualization on Hadoop. It uses a graphical interface and collaborative workflow that works with Hive, the data warehousing component of Hadoop.

Hadoop integration with current BI environment will be a critical need, as years of investment in BI and analytics will not be thrown away to accommodate the new analytic tools.

The Fourth Paradigm in Science

We all remember the late Jim Gray, the great computer scientist and Turing Award winner. During the last several years of his research work at Microsoft, he focused on data-intensive computing and called it the Fourth Paradigm in scientific discovery. In a special book dedicated to the memory of Jim, Bill Gates commented, “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.”

So what is the Fourth Paradigm? Here is the explanation.

1. A thousand years ago – Experimental Science
– Description of natural phenomena
2. Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
3. Last few decades – Computational Science
– Simulation of complex phenomena
4. Today – Data-Intensive Science (unify theory, experiment, & simulation)

Scientists are overwhelmed with data sets from many different sources such as data captured by instruments, data generated by simulations, and data generated by sensor networks.

Jim Gray named it “eScience”, where IT (Information Technology) meets science. It is the set of tools and technologies to support data federation and collaboration for analysis, data mining, data visualization and exploration, and for scholarly communication and dissemination. He laid out the principles, fondly called Gray’s laws of data engineering:

  • Scientific computing is revolving around data
  • Need scale-out solution for analysis
  • Take the analysis to the data!
  • Start with “20 queries”
  • Go from “working to working”
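The third principle, taking the analysis to the data, can be shown with a small contrast. The sketch below uses SQLite and an invented `readings` table purely for illustration. One function drags every row out of the database and aggregates on the client, while the other asks the database to do the work and return only the answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)


def average_by_pulling(conn):
    """Ship every row to the client, then aggregate there."""
    sums = {}
    for sensor, value in conn.execute("SELECT sensor, value FROM readings"):
        total, count = sums.get(sensor, (0.0, 0))
        sums[sensor] = (total + value, count + 1)
    return {sensor: total / count for sensor, (total, count) in sums.items()}


def average_in_database(conn):
    """Aggregate where the data lives; only one row per sensor travels."""
    query = "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
    return dict(conn.execute(query))


pulled = average_by_pulling(conn)
pushed = average_in_database(conn)
```

Both give the same averages. With five rows the difference is invisible, but with petabytes the first approach means moving the whole data set across the network, and that is what the principle warns against.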

Interestingly, all of these apply to the commercial world of Big Data. The scientific world has simply been grappling with these problems longer. Given the proliferation of devices and incoming data in petabytes, the need for tools to do analytics is of the highest priority. No wonder 2012’s biggest buzzword is Big Data.

We miss you Jim and your pioneering thoughts on DISC (Data Intensive Scalable Computing)!

Revisiting “Big Data”

Big Data is a top technology trend for 2012 according to Forrester Research. The Economist said that Big Data is a new game-changing asset, and The Harvard Business Review termed it a scientific revolution. Scientific revolution? Because it is data-intensive computing that unifies theory, experiment, and simulation at scale.

It is also termed the Fourth Paradigm – “The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.”

Big Data is when the size of the data itself becomes part of the problem. But Big Data is not just “big”. There are the 3V’s of Big Data:

  1. Volume – Terabyte records, transactions, tables, files. A Boeing jet engine spews out 10TB of operational data for every 30 minutes it runs. Hence a 4-engine jumbo jet can create 640TB on one Atlantic crossing. Multiply that by the 25,000 flights flown each day and you get the picture.
  2. Velocity – batch, near-time, real-time, streams. Today’s on-line ad serving requires a decision within 40ms. Financial services need close to 1ms to calculate customer scoring probabilities. Stream data, such as movies, needs to travel at high speed for proper rendering.
  3. Variety – structured, unstructured, semi-structured, and all of the above in a mix. WalMart processes 1M customer transactions per hour and feeds information to a database estimated at 2.5PB (petabytes). There are old and new data sources like RFID, sensors, mobile payments, in-vehicle tracking, etc.
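A quick sanity check on two of the figures above. The engine rate, the 640TB total, and the WalMart transaction count come from the list; the crossing duration is not stated anywhere, so the code backs it out rather than assuming it.

```python
# Volume: how long a crossing does the 640TB figure imply?
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4
TB_PER_CROSSING = 640

tb_per_hour = TB_PER_ENGINE_PER_HALF_HOUR * 2 * ENGINES
implied_hours = TB_PER_CROSSING / tb_per_hour

# Variety: WalMart's hourly transaction count expressed per second.
WALMART_TX_PER_HOUR = 1_000_000
tx_per_second = WALMART_TX_PER_HOUR / 3600

print(f"{tb_per_hour} TB/hour, implied crossing of {implied_hours:.0f} hours")
print(f"About {tx_per_second:.0f} transactions per second")
```

The 640TB figure corresponds to an eight-hour flight at 80TB per hour, and a million transactions an hour is a little under 280 per second, around the clock.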

Because of these characteristics, traditional DBMS solutions are inadequate. Hence we have seen the growth of technologies such as Hadoop (map-reduce algorithm started at Google) mostly processing unstructured data in batch mode. New solutions are needed for realtime processing.

See my blog from last year on this subject.