Data Curation Systems

There is a whole area in the data world called by various names – data integration, data movement, data curation or cleaning, data transformation, etc. One of the pioneers is Informatica, which came into being when data warehousing became a hot topic during the 1990s. The term ETL (extraction, transformation, loading) became part of the warehouse lexicon. If we call these the first generation of data integration tools, they did an adequate job for their time. Often the T of ETL was the hardest part, as it required business domain knowledge. Data were assembled from a small number of sources (usually fewer than 20) into the warehouse for offline analysis and reporting. The cost of data curation (mostly data cleaning) required to get heterogeneous data into proper shape for querying and analysis was high. During my years at Oracle in the mid-1990s, such tools were provided by third-party companies. Many warehouse projects ran substantially over budget and late.

Then a second generation of ETL systems arrived, in which the major ETL products were extended with data cleaning modules, additional adaptors to ingest other kinds of data, and data cleaning tools. Data curation involved ingesting data sources, cleaning errors, transforming attributes, integrating schemas to connect disparate data sources, and performing entity consolidation to remove duplicates. But a professional programmer was still needed to handle all of this. With the arrival of the Internet, many new data sources appeared, diversity increased manyfold, and the integration task became much tougher.
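
To make these steps concrete, here is a minimal sketch in Python (pandas) of a second-generation-style curation pipeline. The file names, column names, and mappings are hypothetical; it only illustrates the ingest / clean / transform / integrate / consolidate flow described above.

    # Minimal curation sketch: ingest, schema integration, cleaning, deduplication.
    # File and column names below are hypothetical.
    import pandas as pd

    # 1. Ingest two heterogeneous sources
    crm = pd.read_csv("crm_customers.csv")         # columns: cust_name, email, phone
    billing = pd.read_csv("billing_accounts.csv")  # columns: name, email_addr, tel

    # 2. Schema integration: map each source onto a common target schema
    crm = crm.rename(columns={"cust_name": "name", "phone": "tel"})
    billing = billing.rename(columns={"email_addr": "email"})
    combined = pd.concat([crm[["name", "email", "tel"]],
                          billing[["name", "email", "tel"]]], ignore_index=True)

    # 3. Clean errors and normalize attribute values
    combined["email"] = combined["email"].str.strip().str.lower()
    combined["tel"] = combined["tel"].astype(str).str.replace(r"[^0-9+]", "", regex=True)
    combined = combined.dropna(subset=["email"])

    # 4. Entity consolidation: drop exact duplicates on the cleaned key
    deduped = combined.drop_duplicates(subset=["email"])
    print(deduped.head())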

Now there is talk of a third generation of tools, termed “scalable data curation,” which can scale to hundreds or even thousands of data sources. Experts note that such tools use statistics and machine learning to make automatic decisions wherever possible, asking for human interaction only when it is truly needed.
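
As an illustration of that idea (not any particular vendor’s actual method), the sketch below trains a simple scikit-learn classifier on string-similarity features of candidate duplicate pairs and routes only the uncertain pairs to a human reviewer. All records and thresholds are made up.

    # Illustrative human-in-the-loop deduplication: automate confident decisions,
    # ask a person only about uncertain pairs. Data and thresholds are hypothetical.
    from difflib import SequenceMatcher
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def features(a, b):
        # String-similarity features for a pair of (name, email) records
        return [SequenceMatcher(None, a[0], b[0]).ratio(),
                SequenceMatcher(None, a[1], b[1]).ratio()]

    # Tiny hand-labeled training set of (record_a, record_b, is_duplicate)
    training = [
        (("Acme Corp", "info@acme.com"), ("ACME Corporation", "info@acme.com"), 1),
        (("Acme Corp", "info@acme.com"), ("Zenith Ltd", "sales@zenith.io"), 0),
        (("Bolt Inc", "hi@bolt.io"), ("Bolt Incorporated", "hi@bolt.io"), 1),
        (("Bolt Inc", "hi@bolt.io"), ("Acme Corp", "info@acme.com"), 0),
    ]
    X = np.array([features(a, b) for a, b, _ in training])
    y = np.array([label for _, _, label in training])
    model = LogisticRegression().fit(X, y)

    # Score a new candidate pair; involve a human only when the model is unsure
    pair = (("ACME Co.", "info@acme.com"), ("Acme Corp", "info@acme.com"))
    p = model.predict_proba(np.array([features(*pair)]))[0, 1]
    if 0.2 < p < 0.8:
        print("Uncertain (p=%.2f): route to a human reviewer" % p)
    else:
        print("Automatic decision (p=%.2f): %s" % (p, "duplicate" if p >= 0.8 else "distinct"))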

Start-ups such as Trifacta and Paxata emerged, applying such techniques to data preparation, an approach subsequently embraced by incumbents Informatica, IBM, and Solix. A new startup called Tamr (cofounded by Mike Stonebraker of Ingres, Vertica, and VoltDB fame), funded last year by Google Ventures and NEA with $16M, claims to deliver true “curation at scale.” It has adopted a similar approach but applied it to a different upstream problem: curating data from multiple sources. IBM has publicly stated its intention to develop a “Big Match” capability for Big Data that would complement its MDM (master data management) tools. More players are expected to join this effort.

In summary, ETL systems arose to deal with the transformation challenges of early data warehouses. They evolved into second-generation data curation systems with an expanded scope of offerings. Now a new generation of data curation systems is emerging to address the Big Data world, where sources have multiplied and become far more heterogeneous. On the surface this seems quite opposite to the concept of the “data lake,” where data are stored in native formats. However, the so-called “data refinery” is no different from the curation process.

The Software Paradox

I just read a new booklet from O’Reilly called The Software Paradox by Stephen O’Grady. You can access it here.

Here is a direct quote:

This is the Software Paradox: the most powerful disruptor we have ever seen and the creator of multibillion-dollar net new markets is being commercially devalued, daily. Just as the technology industry was firmly convinced in 1981 that the money was in hardware, not software, the industry today is largely built on the assumption that the real revenue is in software. The evidence, however, suggests that software is less valuable—in the commercial sense—than many are aware, and becoming less so by the day. And that trend is, in all likelihood, not reversible. The question facing an entire industry, then, is what next?

IBM completely missed the role of software when it introduced the IBM PC back in 1981. The focus was on hardware, and software was a means to that end. I lived through those years at IBM and saw this first hand. When I presented the concept of data warehousing to a high-level executive in 1991, his only question was how much hardware it could sell.

Microsoft saw the value of software and, under the terms of the contract it signed with IBM to deliver PC-DOS, was free to license MS-DOS to others. That made Microsoft enormously rich over the next twenty years. During IBM’s difficult years in 1992-93, someone jokingly said IBM stood for “I Blame Microsoft”.

If the years 1950-1986 mark the first generation of software, dominated by IBM (which started charging separately for software in 1968), then the second generation (1986-1998) was dominated by Microsoft (and others like Oracle), when the monetization of software occurred big time. When I joined Oracle in 1992, its revenue was barely $1B; that grew rapidly to $10B over the next eight years.

Then something interesting happened. Call it the third generation (1998-2004), when a new class of technology providers came into the picture, like Google and Amazon. They engaged the user directly via a browser. Google showed the economics of scaling to a worldwide user base on its own proprietary software, and it published early papers on systems like GFS, Pregel, Dremel, and Spanner with enough detail for the community to implement its own versions. That is how Hadoop was created by Doug Cutting at Yahoo, from the Google papers on GFS and MapReduce. Amazon, founded in 1994, is a huge user of open source in building its AWS stack. Here there were no direct software licensing charges; it was a shift to cloud-based services. Even Cloudera (one of the custodians of Hadoop) finds it hard to monetize, as the core piece is free. The fourth generation (2004-present) added new players such as Facebook, Twitter, LinkedIn, and GitHub, which build their stacks on open source software. Except for Google, these companies have given much of their internally developed code back to open source.

It seems software has come full circle: it started as an enabler, then became a large licensing revenue stream (commercial software players like Microsoft, IBM, Oracle, SAP, ...), then shifted to an alternate model with no upfront pricing but a subscription instead (e.g. Salesforce, Workday, ...), and finally to no-charge software, back to being an enabler for cloud-based services.

This conundrum was described by Mike Olson of Cloudera: “you can no longer win with a closed-source platform and you can’t build a successful standalone company purely on open source.” So the question is: what is the right model moving forward? Several creative approaches are being tried.

The software industry is already seeing tectonic shifts: the move to cloud services, open source, and cheaper solutions than before. This is hurting the big commercial players, whose new license sales are starting to decline. Oracle, for example, is moving rapidly to cloud-delivered models at the cost of a short-term revenue hit. So are SAP, Microsoft, and IBM.

Software, once again, is a means to an end, as opposed to an end in and of itself. Welcome back to the future!

Entrepreneurs Should Strike A Balance Between Old And New

I am posting an article by Somesh Dash, General Partner at IVP, published today in TechCrunch.

_______________________________________________________

In 1989, the creators of “Back to the Future 2” imagined a 2015 very similar to the one we live in today – accurately predicting technologies like flat-screen TVs, Google Glass and drones being commercially available and in popular demand.

If someone acted on building those technologies back in 1989, we’d likely be more accustomed to seeing the remote-controlled skateboards and handlebar-less Segways that zoom down Market Street just north of Silicon Valley these days. And while we may not have flying cars yet, Audi did build a car that drove itself to CES this year, and the show floor was filled with voice-activated devices similar to those that brought Marty McFly’s smart home to life.

After 26 years, the predictions of life in 2015 aren’t that off base. That’s because the film’s creators didn’t try to invent entirely new technology, but rather imagined how we would improve and innovate the technology that was already widely in use in 1989.

As a VC always on the lookout for the next big thing capable of massive adoption and global scale, watching the film now reminds me of some of the startups that turned into empires based solely on the entrepreneurial concept of striking a balance between old realities and new ideas.

For instance, office workers used to be tied to their desks or carry a USB drive in their pockets until Dropbox and Box gave hundreds of millions of users a simple way to access files from any computing device. Booking a vacation was once an arduous, time-consuming task until online travel services like Orbitz and Kayak took travel and airline booking agents out of the equation and gave travelers more choice. Spotify and SoundCloud have helped revolutionize the digital music industry by offering listeners a new way to discover music.

All of these venture-backed companies have transformed the status quo in billion-dollar markets. While each investment entails a certain amount of risk, these companies filled a gap with a much-needed new solution. From the investor perspective, it’s sometimes easier to back the radical new invention rather than recognize advancements that address or innovate more traditional or foundational technology. Although harder to spot, these ideas can present just as much opportunity as the brand new ones.

Probably the most recent, visible example of innovating on an existing and widely used technology is the seemingly overnight success of Beats by Dre. Frustrated by the low quality of audio provided by Apple’s earbuds, Dr. Dre partnered with noted record producer Jimmy Iovine to build a headphone that allowed listeners to hear music the way the artist intended. By redesigning the headphone to reduce distortion and equalize the frequencies, consumers now have a superior audio delivery option, and Dr. Dre can now claim his title of the world’s first billionaire rapper.

While headphones may not be the “sexy” tech that comes to mind when thinking of Silicon Valley, this idea is one example of reimagining the possibilities of what an antiquated product could be. To a similar degree, I recently joined the board of a startup, Pindrop Security, an Atlanta-based company that focuses on phone fraud prevention.

While phone fraud may not be as headline-grabbing as cyber crime, security on the phone channel has remained static for nearly 40 years. Pindrop’s technology enables companies to secure voice just as the likes of Siri and Amazon Echo create a shift toward voice as the preferred interface – much like Marty McFly and his family portrayed all those years ago in their voice-commanded home.

This type of forward thinking about aged technology led me to look more closely at the entrepreneurs I meet every day and what they have in common. The competitive startup culture can lead entrepreneurs to fall into the trap of walking the well-beaten path of the Silicon Valley success stories they aspire to emulate.

However, every company looking for venture backing is claiming to “disrupt” an industry with something entirely new – and the promise of new technology begins to fall on deaf ears. It is the companies that take the road less traveled that I am more excited to speak with.

For companies in their very early days, there are a few key things to consider in order to differentiate from the other startups chasing VC cash:

  • You don’t have to have the shiny new thing. Consider what technology exists now that can be improved on to disrupt the industry.
  • You don’t have to be in the Valley. Technology hubs have been developing globally and some of the most promising startups are located outside of the U.S.
  • Solidify your company’s vision. If your idea truly is “disruptive,” competition will arise. How will your company continue to lead the market without losing sight of its core mission?
  • Does your idea stand the test of time? While your idea may be groundbreaking today, how will it stay innovative 5, 10 or 20 years from now?

When it comes to investing, there will always be a shiny new thing lobbying to become the next Silicon Valley darling. But we must not overlook those innovators who look back to the future, reimagining how an old technology can address new problems for consumers or enterprises. It is finding the right balance between supporting foundational technology and the “new new” that sets leading innovators apart from the noise of Silicon Valley.

The new world of Mobile Messaging

Over the last couple of years, something dramatic has happened in how the new generation communicates. Gone are the days of email. Now it is mobile messaging – it is global and becoming the de facto means of communication. Why?

  • It is asynchronous yet instant (less cumbersome than writing an email)
  • Expressive yet fast
  • Engaging yet user controlled
  • Simple yet 24 by 7 (anytime, around the clock)
  • Instant yet secure
  • Casual yet professional
  • Personal yet mainstream
  • Mobile yet distributed
  • Easy yet productive
  • Real-time yet replay-able
  • Current yet evergreen

The global messaging leaders are WhatsApp, Facebook Messenger, and Snapchat in the US, and WeChat (China), LINE (Japan), and KakaoTalk (Korea) in Asia.

WhatsApp was launched in 2009 and has about 800 million monthly active users sending a staggering 30 billion messages per day across the globe. Its user base grew over 60% last year. When Facebook paid over $19 billion to acquire it a couple of years back, everyone raised their eyebrows at the price tag. It was a very strategic move given WhatsApp’s growing relevance.

Facebook Messenger, launched in 2011, is a messaging platform. In the words of Mark Zuckerberg, “Facebook Messenger provides the ability for sales representatives to engage customers where they are — online and via mobile — to talk through shipping options, which is an important factor for nearly 70% of American shoppers. Now, shoppers will know exactly how long a package will take to make it to their doorstep, and at what expense.” It currently boasts 600 million monthly users, growth of 200% over the past year.

Snapchat, launched in 2011, is the king of ephemeral messages, pictures, and videos, which disappear after playing for a few seconds or minutes. Kids love it because nothing is left on record for parental questioning. It has about 100 million daily active users who generate 2 billion story views per day. In a recent investment round, it was valued at over $15 billion, with Alibaba as a key investor.

In Asia, the messaging apps are local to each country. WeChat in China was launched in 2011 as a messaging platform; it has 550 million daily users and is growing at an average of 40%. LINE was launched in Japan in 2011 and has 205 million monthly users sending 13 billion messages per day; its revenue is just over $900 million (a growth rate of 70%). In Korea, the leading messaging platform is KakaoTalk, launched in 2010, with a current monthly user base of 48 million sending 5.2 billion messages per day; its revenue is about $850 million (a growth rate of 19%).

The interesting observation is that these messaging applications are becoming major platforms and operate across multiple operating systems (Android, iOS). The global reach of WhatsApp and Viber (an Israeli company), with voice and video, is making them very popular. They are extending beyond text, video, pictures, and voice to instant transactions as well. Users prefer to have more than one app, so one size does not fit all.

Welcome to the exciting world of mobile messaging!

The rise of private equity in technology

Last week, the public company Informatica was acquired by two private equity investors, the Permira funds and the Canada Pension Plan Investment Board (CPPIB), for $5.3B. This is the biggest leveraged buyout so far this year.

I am happy for my friend Sohaib Abbasi (we were colleagues at Oracle during the 1990s), who became CEO of Informatica after serving as a board member for a couple of years. During Sohaib’s tenure, the company expanded into data archiving and lifecycle management, and made progress in offering cloud-based services.

Gaurav Dhillon (now founder of SnapLogic) founded the company back in 1992 and was its CEO for 12 years, growing it into a $300M company after a successful IPO. It was created during the rise of data warehousing, when one needed a component called ETL (Extraction, Transformation, and Loading): the process of cleansing data from operational systems and getting it ready for analytics. I used to call this the “twenty-five years of sin” that needed to be corrected!

Informatica helps companies integrate and analyze data from various sources. It counts Western Union Co, Citrix Systems Inc, American Airlines Group Inc, and Bank of New York Mellon Corp among its customers. It competes with Tibco, which was taken private for $4.3 billion in December 2014 by the private equity firm Vista Equity Partners. Dhillon thinks his new company SnapLogic is better off seeing two of its competitors (Informatica and Tibco) shunted to the land of private equity, which will squeeze those companies for profit. This is financial engineering at its best, and it will hurt customers and long-term employees while rewarding top management.

Many people believe that the private equity players will eventually sell Informatica to a big technology player, much as Crystal Decisions was acquired by Business Objects (now part of SAP) for $1.2B in 2003. The model seems to be: take a struggling public company private, improve its margins and value, then sell it to a sugar daddy and make a hefty profit. We saw that happen with Skype as well (from eBay to private equity to Microsoft).

The timing might be really good for this, because the areas Informatica specializes in are key touch points within the enterprise: data quality, data security, and data integration in support of big data projects. That explains the high price of $5.3B.

Other companies that have taken private equity funding in recent times include Cloudera and MongoDB. Private equity firms provide an alternative source of funding to the traditional VCs.

Congratulations, Michael Stonebraker for winning the 2014 ACM Turing Award

This week, the 2014 ACM Turing Award was given to Michael Stonebraker, professor of computer science and engineering at MIT. Mike spent 29 years at the University of California, Berkeley, joining as an assistant professor after completing his Ph.D. at the University of Michigan in 1971; his undergraduate degree was from Princeton University. Since 2000, he has been at MIT. He is a remarkable researcher who has pioneered many frontiers in database management. Personally, I interacted with him several times during my days at IBM and Oracle, and we even spoke on the same panel at a couple of public forums during the 1990s.

The award citation reads, “Michael Stonebraker is being recognized for fundamental contributions to the concepts and practices underlying modern database systems. Stonebraker is the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems. His work on INGRES introduced the notion of query modification, used for integrity constraints and views. His later work on Postgres introduced the object-relational model, effectively merging databases with abstract data types while keeping the database separate from the programming language.”

The ACM Turing Award, often described as the “Nobel Prize of computer science,” is named after the British mathematician Alan Turing. The first award was given in 1966; in recent years it carried a citation and $250,000 in cash, and since last year Google has sponsored the award and lifted it to $1 million. Many stalwarts like Charles Bachman (1973, for inventing the concept of a shared database), Edgar Codd (1981, for pioneering the relational database), and Jim Gray (1998, for seminal work on databases and transaction processing) have been honored with the Turing Award. Mike Stonebraker joins this illustrious group.

Mike’s specialty is that his research has culminated in many product companies, as the following (partial) list shows:

  • Ingres – early relational database based on Dr. Codd’s (IBM) relational data model.
  • Postgres – object-relational database, base for products like Aster Data (part of Teradata), and Greenplum (part of EMC).
  • Illustra – Object database sold to Informix (now IBM) during the 1990s
  • Vertica – columnar data store, sold to HP in 2011
  • StreamBase – stream-oriented data store
  • Goby – data integration platform
  • VoltDB – in-memory database with high-speed transaction processing
  • SciDB – scientific data management
  • Tamr – data curation at scale, unifying data from a large variety of sources

He has publicly derided the NoSQL movement, mainly for its relaxed integrity (ACID) approach, which he calls a fundamental flaw. He also said in a recent interview, “IBM’s DB2, Oracle, and Microsoft‘s SQL Server are all obsolete, facing a couple of major challenges. One is that, at the time, they were designed for ‘business data processing.’ But now there is also scientific data and social media, and web logs, and you name it! The number of people with database problems is now of a much broader scope. Second, we were writing Ingres and System R for machines with a small main memory, so they were disk-based — they were what we call ‘row stores.’ You stored data on disk record by record by record. All major database systems of the last 30 years looked like that — Postgres, Ingres, DB2, Oracle DB, SQL Server — they’re all disk row stores.” He says in-memory processing is now quite economical and is the trend for the future, though he is a bit self-serving here, as his company VoltDB is based on that principle.
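
To make the row-store versus column-store distinction in that quote concrete, here is a toy Python illustration with made-up data; real systems of course add storage formats, compression, indexing, and buffering on top of this idea.

    # Toy contrast of row-oriented vs. column-oriented layouts (hypothetical data).

    # Row store: each record is kept together, as in the Ingres/System R-era designs
    row_store = [
        {"id": 1, "region": "west", "sales": 100},
        {"id": 2, "region": "east", "sales": 250},
        {"id": 3, "region": "west", "sales": 175},
    ]

    # Column store: each attribute is kept together, which favors analytic scans
    column_store = {
        "id":     [1, 2, 3],
        "region": ["west", "east", "west"],
        "sales":  [100, 250, 175],
    }

    # An analytic query like "total sales" touches every whole record in a row store...
    total_from_rows = sum(rec["sales"] for rec in row_store)

    # ...but only the single needed column in a column store
    total_from_columns = sum(column_store["sales"])

    assert total_from_rows == total_from_columns == 525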

Mike thinks Facebook has the biggest database challenge with its “social graph” model, which is growing in size at an alarming speed. The underlying data store is MySQL, which cannot handle such a load on its own, so Facebook has to come up with highly scalable, innovative solutions, mostly home-grown, as no commercial product can handle that kind of load.

Mike Stonebraker is a legend in database research and the Turing award is well-deserved for such a pioneer. Congratulations!

Big Data Visualization

Recently I listened to a discussion on Big Data visualization hosted by Bill McKnight of the McKnight Consulting Group. The panelists agreed that Big Data is shifting from hype to an “imperative” state. Start-up companies are running more Big Data projects, whereas true big data is still a small part of enterprise practice, though at many companies Big Data is moving from POC (proof of concept) to production. Interest in visualizing data from different sources is certainly increasing, and there is growth in data-driven decision-making, as evidenced by the increasing use of platforms like YARN, Hive, and Spark. The traditional RDBMS platform cannot scale to meet the rapidly growing volume and variety of Big Data.

So what is the difference between data exploration and data visualization? Data exploration is more analytical and is used to test hypotheses, whereas visualization is used to profile data and is more structured. The suggestion is to bring visualization to the beginning of the data cycle (not the end) to enable better data exploration. For example, in personalized cancer treatment, finding and examining white blood cell counts and cancer cells can be done up front using data visualization. In Internet e-commerce, billions of rows of data can be analyzed to understand consumer behavior; one customer uses Hadoop and Tableau’s visualization software to do this. Tableau enables visualization of all kinds of data sources across three scenarios: cold data from a data lake on Hadoop (where source data in native format can be located), warm data from a smaller data set, or hot data served in-memory for faster processing.

Data format can be a challenge. How do you visualize NoSQL data? For example, JSON data (supported by MongoDB) is nested and schema-less, which is hard for BI tools. Understanding the data is crucial, and nested hierarchies will need to be flattened; nested arrays can be broken out into separate tables linked by foreign keys. Graph data is another special case, where visualizing the right amount of graph data is critical (good UX).
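
As a small illustration of that flattening step, the sketch below uses pandas to turn a nested JSON document (hypothetical order data) into a flat, one-row-per-item table that a BI or visualization tool can consume, carrying the parent order_id along as a foreign key.

    # Flatten nested JSON for BI/visualization; the document structure is hypothetical.
    import pandas as pd

    orders = [
        {"order_id": 1, "customer": {"name": "Alice", "city": "Austin"},
         "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]},
        {"order_id": 2, "customer": {"name": "Bob", "city": "Boston"},
         "items": [{"sku": "C-3", "qty": 5}]},
    ]

    # One row per nested item; order_id and customer.name are carried along,
    # so order_id acts like a foreign key back to the parent order
    items = pd.json_normalize(orders, record_path="items",
                              meta=["order_id", ["customer", "name"]])
    print(items)   # columns: sku, qty, order_id, customer.name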

Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. Modern big data applications such as social, mobile, web, and IoT deal with a larger number of users and a larger amount of data than traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing, and can arrive in formats such as JSON and Parquet. Apache Drill is built from the ground up to provide low-latency queries natively on such rapidly evolving, multi-structured datasets at scale.
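
As a hedged sketch of what querying such data might look like, the snippet below submits a SQL query to Drill’s REST API from Python; it assumes a local Drill instance on the default port (8047) and a hypothetical JSON file reachable through the dfs storage plugin.

    # Query a (hypothetical) JSON file through Apache Drill's REST API.
    import requests

    query = """
        SELECT t.customer.name AS name, t.order_total
        FROM dfs.`/data/orders.json` t
        WHERE t.order_total > 100
    """

    resp = requests.post(
        "http://localhost:8047/query.json",
        json={"queryType": "SQL", "query": query},
    )
    resp.raise_for_status()
    for row in resp.json().get("rows", []):
        print(row)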

Apache Spark is another exciting new approach that speeds up queries by utilizing memory. It consists of Spark SQL (SQL-like queries), Spark Streaming, MLlib, and GraphX, with APIs in Python, Scala, and Java. It lets Hadoop users have more fun with data analysis and visualization.
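
Here is a small PySpark sketch of that workflow; the input path and column names are hypothetical. It reads a self-describing JSON dataset, runs a SQL aggregation in memory, and hands a compact summary to pandas for plotting in a visualization tool.

    # PySpark: read self-describing JSON, aggregate with Spark SQL, and produce a
    # small summary for a visualization tool. Input path and columns are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("visualization-prep").getOrCreate()

    events = spark.read.json("/data/clickstream.json")   # schema is inferred
    events.createOrReplaceTempView("events")

    summary = spark.sql("""
        SELECT country, COUNT(*) AS views
        FROM events
        GROUP BY country
        ORDER BY views DESC
    """)

    summary_pd = summary.toPandas()   # small result set, safe to collect
    print(summary_pd.head())

    spark.stop()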

Big Data Visualization is emerging to be a critical component for extracting business value from data.