Tag Archives: Platform technologies

Amazing Amazon

I remember back in 2003 when I had a meeting with the then CTO of Amazon for a couple of hours. He was narrating his vision of SOA (Service Oriented Architecture), where individual business or programming functions (called services) could be stacked up in libraries and invoked as and when required. This notion of re-usable services was not new (remember subroutines from the mainframe era or stored procedures from the client-server days?).

Subsequently we called them “web services” because they were loosely coupled applications that could be exposed as services and easily consumed by other applications using Internet standard technologies. Terms such as XML (eXtensible Markup Language), UDDI (Universal Description, Discovery, and Integration), WSDL (Web Services Description Language), and SOAP (Simple Object Access Protocol) were new to the lexicon then. These were URL-addressable resources that could exchange information and execute processes automatically without human intervention. Oh yes, we talked about how the equivalent of the phone dial-tone was evolving first into a personal digital dial-tone (the Internet) and then into an application digital dial-tone (web services).

I wondered then why a book-selling company like Amazon was speaking this language, so far from its core business. The CTO explained to me that Jeff Bezos wanted to monetize the excess capacity from his massive data center investments, which sat idle roughly 70% of the time. Hence the starting point was S3 (Simple Storage Service), where costs on the order of one-hundredth of conventional storage were claimed. No one believed such claims initially, but Amazon continued its journey into “cloud computing” by offering computing power as a utility with EC2 (Elastic Compute Cloud). I left that meeting quite amazed, to say the least, though with skepticism in my mind.

Fast forward 12 years. Amazon’s AWS (Amazon Web Services) is the de facto leader in the cloud infrastructure provisioning business, at both the IaaS (Infrastructure as a Service) and PaaS (Platform as a Service) levels. Amazon became the harbinger of “cloud computing,” taking that laurel away from leaders such as IBM and HP. In 2015, the AWS business brought in $8B in revenue. Others such as Microsoft’s Azure, IBM’s Bluemix, and Oracle’s cloud offering are all playing catch-up to AWS. Google’s cloud is yet to be a serious contender for enterprise computing; no wonder they have hired Diane Greene (VMware co-founder) as their new cloud czar with a huge financial package.

Some predict that AWS will become the largest business unit at Amazon over time. I also just read that they are after a $400B business, transportation, owning their own delivery services to replace FedEx and UPS (recently Amazon has been acquiring its own freight liners, trucking fleets, etc.). Amazon is secretive about this new business, just as it was back in 2002-2003, when it was way ahead in its thinking on cloud computing. Bezos is now part of the $50B+ club (top five richest people).

Dell + (EMC+VMW) = A $67B Gamble!

Yesterday, Dell announced the largest technology M&A deal in history with a proposed $67B buyout of EMC and VMware (via EMC’s 80% ownership of VMW). The combined company will have over $80B in revenue, employ tens of thousands of people around the world, and sell everything from PCs, servers, and storage to security software and virtualization software. Not to be overlooked is the fact that Dell and EMC will be private companies, free from the scrutiny of activist investors.

Dell has to borrow a ton of money to make this deal, roughly $40B of debt. The annual interest payment will be $2.5B! The deal has three backers: Michael Dell’s own investment, Silver Lake Partners, and Singapore-based Temasek. On paper the two companies bring complementary value: Dell sells to small and medium-sized companies while EMC addresses larger enterprise needs. The big attraction for Dell is the VMware piece, which revolutionized the virtualization market. Currently VMware accounts for about 25% of EMC’s revenue but roughly 50% of its valuation.

The concern is that as more corporations adopt cloud storage and cloud computing for their IT needs, there is less reason to spend money on the costly software and hardware upgrades typically offered by established IT companies like EMC. But by consolidating, they can better compete against the lower-cost cloud service companies – AWS (Amazon Web Services), IBM, Alphabet (Google), and Microsoft Azure.

This is going to be a big gamble. The HP CEO circulated an internal memo suggesting that this will be a great opportunity for HP, as the combined company will create a lot of chaos and confusion. At the same time, being private, the new entity can execute a radical restructuring. But it will be a herculean task to make the combined company a winner in the highly competitive “IT infrastructure” market.

IBM’s big commitment to Apache Spark

Last June IBM made a serious commitment to the future of Apache Spark with a series of initiatives:  

  • It will offer Apache Spark as a service on Bluemix. (Bluemix is an implementation of IBM’s Open Cloud Architecture based on Cloud Foundry, an open source Platform as a Service (PaaS); it delivers enterprise-level services that can easily integrate with your cloud applications without you needing to know how to install or configure them.)
  • It committed 3,500 researchers to work on Spark-related projects.
  • It will donate IBM SystemML (its machine learning language and libraries) to the Apache Spark open source community.

The question is: why this move by IBM?

First, let us look at what Apache Spark is. Developed at UC Berkeley’s AMPLab, Spark gives us a comprehensive, unified framework to manage big data processing requirements across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline. In other words, Spark is the next generation of Hadoop (which came with a batch pedigree and high latency).
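To make the “single data pipeline” idea concrete, here is a minimal PySpark sketch (using today’s DataFrame API, not the exact code IBM or anyone else runs) that loads a hypothetical events.json file, aggregates it with SQL, and feeds the result straight into a machine learning step:

```python
# Minimal PySpark sketch: SQL and machine learning in one pipeline.
# The input file "events.json" and its fields are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-pipeline-demo").getOrCreate()

# Load semi-structured data and query it with SQL
events = spark.read.json("events.json")
events.createOrReplaceTempView("events")
daily = spark.sql(
    "SELECT user_id, COUNT(*) AS clicks, SUM(amount) AS spend "
    "FROM events GROUP BY user_id")

# Machine learning on the same DataFrame, no separate system required
features = VectorAssembler(
    inputCols=["clicks", "spend"], outputCol="features").transform(daily)
model = KMeans(k=3, featuresCol="features").fit(features)
model.transform(features).show()

spark.stop()
```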

With other in-memory, real-time analytics solutions on the market, such as RethinkDB, an ambitious Redis project, and the commercial in-memory SAP HANA, IBM needed a competitive offering. Other vendors betting on Spark range from Amazon to Zoomdata. IBM will run its own analytics software on top of Spark, including SystemML for machine learning, SPSS, and IBM Streams.

At this week’s Strata conference, several companies, including Uber, described how they have deployed Spark end to end for speedy real-time analytics.

Translytical Database

This is a new term I learnt this week, thanks to the Forrester analyst Mike Gualtieri. Terms like Translytics or Exalytics (Oracle’s phrase) do not roll off the tongue that easily. Mike defined Translytical as a “single unified database that supports transactions and analytics in real time without sacrificing transactional integrity, performance, and scale.”

[Transactions + Analytics = Translytical]

Those of us who saw the early days of data warehousing deliberately separated the two worlds, so that analytics workloads would not interfere with transaction performance. Hence snapshots of operational data were taken into a data warehouse for offline batch analysis and reporting. Mostly that gave a retrospective view of what had happened. In the current scheme of things, where data is coming fast and furious from so many sources, there is a need to look at trends in real time and take action. Some insights are perishable and therefore need to be acted on immediately. Data originates fast, but analytics is usually done much later. Perishable insights can have exponentially more value than after-the-fact traditional historical analysis. Here is a classification of analytics:

  • Past: Learn (Descriptive Analytics)
  • Present: Infer (Predictive Analytics), Detect (Streaming Analytics)
  • Future: Action (Prescriptive Analytics)

Streaming (real-time) analytics requires a database that can do in-memory streaming with near-zero latency for complex data and analytical operations. The traditional approach of moving data to analytics has created many silos, such as the CRM stack, the BI stack, or the mobile stack. Translytical databases are transactional as well as analytical. Point solutions like Spark data streaming, which does micro-batch processing, are not the answer. Such a unified database must do in-memory processing (using RAM for real time), be multi-modal, and support compression and tiered data as well. Customers are stitching together open source products such as Spark, Kafka, and Cassandra to achieve streaming analytics, but it becomes a non-trivial programming task.
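To illustrate what such stitching looks like in practice, here is a hedged sketch using Spark Structured Streaming over Kafka (the broker address, topic name, and message schema are all assumptions, and the Kafka connector package must be on the Spark classpath):

```python
# Sketch of a "stitched together" streaming analytics job: Spark reads a
# Kafka topic and maintains one-minute totals per account in memory.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-analytics").getOrCreate()

schema = (StructType()
          .add("account", StringType())
          .add("amount", DoubleType())
          .add("ts", TimestampType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
       .option("subscribe", "transactions")                  # assumed topic
       .load())

txns = (raw.select(from_json(col("value").cast("string"), schema).alias("t"))
           .select("t.*"))

# Windowed aggregation, updated continuously as events arrive
totals = (txns.groupBy(window(col("ts"), "1 minute"), col("account"))
              .sum("amount"))

query = (totals.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```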

The only database currently claiming to be translytical is VoltDB, with features such as in-memory processing, shared-nothing scale-out, ACID compliance for transactional integrity, and reliability and fault tolerance. It also has real-time analytics built in, combined with integration with the Hadoop ecosystem. Such a unified database has to prove its worth in the market.

So we have come full circle: from a single database, to separate databases for transactions and analytics, and now back to a single database doing both.

It makes logical sense, but let us watch and see if that works.

The Software Paradox

I just read a new booklet from O’Reilly called The Software Paradox by Stephen O’Grady. You can access it here.

Here is a direct quote:

“This is the Software Paradox: the most powerful disruptor we have ever seen and the creator of multibillion-dollar net new markets is being commercially devalued, daily. Just as the technology industry was firmly convinced in 1981 that the money was in hardware, not software, the industry today is largely built on the assumption that the real revenue is in software. The evidence, however, suggests that software is less valuable—in the commercial sense—than many are aware, and becoming less so by the day. And that trend is, in all likelihood, not reversible. The question facing an entire industry, then, is what next?”

IBM completely missed the role of software when it introduced the IBM PC back in 1981. The focus was on hardware, and software was a means to that end. I lived through those years at IBM and saw this first hand. When I presented the concept of data warehousing to a high-level executive in 1991, his only question was how much hardware it could sell.

Microsoft saw the value of software and, under the contract it signed with IBM to deliver PC-DOS, retained the right to license MS-DOS to others. That made Microsoft enormously rich over the next twenty years. During IBM’s difficult years in 1992-93, someone jokingly said IBM stood for “I Blame Microsoft”.

If 1950-1986 marks the first generation of software, dominated by IBM (which started charging separately for software in 1968), the second generation (1986-1998) was dominated by Microsoft (and others like Oracle), when monetization of software happened in a big way. When I joined Oracle in 1992, its revenue was barely $1B; that grew rapidly to $10B over the next eight years.

Then something interesting happened. Call it the third generation (1998-2004), when a new class of technology providers came into the picture, like Google and Amazon, engaging the user directly via a browser. Google showed the economics of scaling to a worldwide user base with its own proprietary software, whose ideas subsequently made their way into open source: Google published early papers on systems like GFS, Pregel, Dremel, and Spanner with enough detail for the community to implement its own versions. That is how Hadoop was created by Doug Cutting at Yahoo from the Google papers on GFS and MapReduce. Amazon, founded in 1994, is a huge user of open source in building its AWS stack. Here there were no direct software licensing charges; it was a shift to cloud-based services. Even Cloudera (one of the custodians of Hadoop) finds it hard to monetize, as the core piece is free. Now the fourth generation (2004-now) has added new players such as Facebook, Twitter, LinkedIn, and GitHub, which build their stacks on open source software. Except for Google, the rest have given much of their internally developed code to open source.

It seems software has come full circle: it started as an enabler, then became a large licensing revenue stream (commercial software players like Microsoft, IBM, Oracle, SAP, ...), then shifted to an alternate model with no upfront pricing but a subscription instead (e.g., Salesforce, Workday, ...), and finally to no-charge software, back to being an enabler for cloud-based services.

This conundrum was described by Mike Olson of Cloudera – “you can no longer win with a closed-source platform and you can’t build a successful standalone company purely on open source.” So the question is – what is the right model moving forward? Several creative approaches are being tried.

The software industry is already seeing tectonic moves: the shift to cloud services, open source, and economically cheaper solutions than before. This is impacting the big commercial players as their new license sales are starting to decline. Oracle, for example, is moving rapidly to cloud-delivered models at the cost of a short-term revenue hit. So are SAP, Microsoft, and IBM.

Software, once again, is a means to an end, as opposed to an end in and of itself. Welcome back to the future!

The rise of private equity in technology

Last week, the public company Informatica was acquired by two private equity investors, the Permira funds and the Canada Pension Plan Investment Board (CPPIB), for $5.3B. This is the biggest leveraged buyout so far this year.

I am happy for my friend Sohaib Abbasi (we were colleagues at Oracle during the 1990s), who became CEO of Informatica after being a board member for a couple of years. During Sohaib’s time, the company took on a bigger role in data archiving and lifecycle management. It also made progress in offering cloud-based services.

Gaurav Dhillon (now founder and CEO of SnapLogic) co-founded Informatica back in 1992 and was its CEO for 12 years, growing it into a $300M company after a successful IPO. It was created during the rise of data warehousing, when one needed a component called ETL (Extraction, Transformation, and Loading), the process of cleansing data from operational systems and getting it ready for analytics. I used to call this “twenty-five years of sin” that needs to be corrected!
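For readers who never lived through that era, here is a toy illustration of the ETL step being described (pandas, invented data, not Informatica’s product): extract raw operational records, cleanse and transform them, and load an analytics-ready table.

```python
# Minimal ETL sketch with invented data: Extract, Transform, Load.
import pandas as pd

# Extract: raw operational data, messy as it usually arrives
raw = pd.DataFrame({
    "cust": [" Ada ", "Lin", None],
    "amount": ["120.0", "75.5", "42"],
})

# Transform: drop incomplete rows, trim names, fix types
clean = (raw.dropna(subset=["cust"])
            .assign(cust=lambda d: d["cust"].str.strip(),
                    amount=lambda d: d["amount"].astype(float)))

# Load: write the cleansed table to a warehouse staging area (a CSV here)
clean.to_csv("warehouse_staging.csv", index=False)
```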

Informatica helps companies integrate and analyze data from various sources. It counts Western Union, Citrix Systems, American Airlines Group, and Bank of New York Mellon among its customers. It competes with Tibco, which was taken private for $4.3 billion in December 2014 by the private equity firm Vista Equity Partners. Dhillon thinks his new company, SnapLogic, is better off seeing two of its competitors (Informatica and Tibco) shunted to the land of private equity, which will squeeze these companies for profit. This is financial engineering at its best and will impact customers and long-term employees negatively while rewarding top management.

Many people believe that the private equity players will eventually sell this to a big technology player, much like Crystal Decisions was acquired by Business Objects (now part of SAP) for $1.2B in 2003. The model seems to be: take a struggling public company private, work on improving its margins and value, then sell it to a sugar daddy and make a hefty profit. We saw that happen to Skype as well (from eBay to private equity to Microsoft).

The timing might be really good for this, because the areas Informatica specializes in are key touch points within the enterprise: data quality, data security, and data integration in support of big data projects. That explains the high $5.3B valuation.

Other start-ups that have taken private equity funding in recent times include Cloudera and MongoDB. Private equity provides an alternative funding source to the traditional VCs.

Congratulations, Michael Stonebraker for winning the 2014 ACM Turing Award

This week, the 2014 ACM Turing Award was given to Michael Stonebraker, professor of computer science and engineering at MIT. Mike spent 29 years at the University of California, Berkeley, joining as an assistant professor after completing his Ph.D. at the University of Michigan in 1971. His undergraduate degree was from Princeton University. Since 2000, he has been at MIT. He is a remarkable researcher, pioneering many frontiers in database management. Personally, I interacted with him several times during my days at IBM and Oracle. We even spoke on the same panel at a couple of public forums during the 1990s.

The award citation reads, “Michael Stonebraker is being recognized for fundamental contributions to the concepts and practices underlying modern database systems. Stonebraker is the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems. His work on INGRES introduced the notion of query modification, used for integrity constraints and views. His later work on Postgres introduced the object-relational model, effectively merging databases with abstract data types while keeping the database separate from the programming language.”

The ACM Turing Award is considered the “Nobel Prize of computer science” and is named after the British mathematician Alan Turing. The first award was given in 1966; in recent years it has carried a citation and a $250,000 cash prize, and since last year Google has sponsored the award and raised the prize to $1 million. Many stalwarts such as Charles Bachman (1973, for inventing the concept of a shared database), Edgar Codd (1981, for pioneering the relational database), and Jim Gray (1998, for seminal work on databases and transaction processing) have been honored with the Turing Award. Mike Stonebraker joins this illustrious group.

What is special about Mike is that his research has culminated in many product companies, as the following (partial) list shows:

  • Ingres – early relational database based on Dr. Codd’s (IBM) relational data model.
  • Postgres – object-relational database, the base for products like Aster Data (part of Teradata) and Greenplum (part of EMC).
  • Illustra – object-relational database sold to Informix (now part of IBM) during the 1990s
  • Vertica – columnar data store, sold to HP in 2011
  • StreamBase – stream-oriented data store
  • Goby – data integration platform
  • VoltDB – in-memory database with high-speed transaction processing
  • SciDB – scientific data management
  • Tamr – to unify and curate data from a variety of sources

He has publicly derided the NoSQL movement, mainly due to its relaxed approach to integrity (ACID), which he calls a fundamental flaw. He also said in a recent interview, “IBM’s DB2, Oracle, and Microsoft’s SQL Server are all obsolete, facing a couple of major challenges. One is that at the time, they were designed for business data processing. But now there is also scientific data, and social media, and web logs, and you name it! The number of people with database problems is now of a much broader scope. Second, we were writing Ingres and System R for machines with a small main memory, so they were disk-based — they were what we call ‘row stores’. You stored data on disk record by record by record. All major database systems of the last 30 years looked like that – Postgres, Ingres, DB2, Oracle DB, SQL Server — they’re all disk-row stores.” He says in-memory processing is now quite economical and is the trend for the future. He is a bit self-serving here, as his company VoltDB is based on that principle.
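To see the row-store vs. column-store distinction he is drawing, here is a toy illustration in plain Python (not any particular database, and with made-up data):

```python
# Illustrative sketch of why analytical scans favor a column layout.
rows = [  # "row store": each record kept together, as in Ingres/System R
    {"id": 1, "region": "west", "amount": 120.0},
    {"id": 2, "region": "east", "amount": 75.5},
    {"id": 3, "region": "west", "amount": 42.0},
]

columns = {  # "column store": each attribute kept together
    "id": [1, 2, 3],
    "region": ["west", "east", "west"],
    "amount": [120.0, 75.5, 42.0],
}

# Aggregating one attribute in the row layout touches every field of every
# record; in the column layout it reads a single contiguous list.
total_row_store = sum(r["amount"] for r in rows)
total_column_store = sum(columns["amount"])
assert total_row_store == total_column_store
```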

Mike thinks Facebook has the biggest database challenge with its “social graph” model, which is growing in size at an alarming speed. The underlying data store is MySQL, which cannot handle such a load. Hence they have to come up with highly scalable, innovative solutions, which will be mostly home-grown, as no commercial product can handle that kind of load.

Mike Stonebraker is a legend in database research and the Turing award is well-deserved for such a pioneer. Congratulations!

Big Data Visualization

Recently I listened to a discussion on Big Data visualization hosted by Bill McKnight of the McKnight Consulting Group. The panelists agreed that Big Data is shifting from hype to an “imperative.” At start-up companies there are more Big Data projects, whereas true big data is still a small part of enterprise practice. At many companies, Big Data is moving from POC (Proof of Concept) to production. Interest in visualization of data from different sources is certainly increasing. There is growth in data-driven decision-making, as evidenced by the increasing use of platforms like YARN, Hive, and Spark. The traditional RDBMS platform cannot scale to meet the rapidly growing volume and variety of Big Data.

So what is the difference between data exploration and data visualization? Data exploration is more analytical and is used to test hypotheses, whereas visualization is used to profile data and is more structured. The suggestion is to bring visualization to the beginning of the data cycle (not the end) to do better data exploration. For example, in personalized cancer treatment, white blood cell counts and cancer cell counts can be examined upfront using data visualization. In Internet e-commerce, billions of rows of data can be analyzed to understand consumer behavior. One customer uses Hadoop and Tableau’s visualization software to do this. Tableau enables visualization of all kinds of data sources in three scenarios: cold data from a data lake on Hadoop (where source data in its native format can be located); warm data from a smaller data set; or hot data served in memory for faster processing.

Data format can be a challenge. How do you do visualization of NoSQL data? For example, JSON data (supported by MongoDB) is nested and schema-less, which is hard for BI tools. Understanding the data is crucial, and flattening of nested hierarchies will be needed; nested arrays can be broken out into separate tables linked by foreign keys. Graph data is another special case, where visualizing the right amount of graph data is critical for a good user experience.
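As a small, hedged illustration of that flattening (pandas, with an invented document shape), nested objects can be normalized into columns and nested arrays broken out into a child table keyed back to the parent:

```python
# Flatten nested JSON documents for BI-style tools; document shape is made up.
import pandas as pd

docs = [
    {"customer": {"id": 1, "name": "Ada"},
     "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"customer": {"id": 2, "name": "Lin"},
     "orders": [{"sku": "A1", "qty": 5}]},
]

# Nested objects become dotted columns: customer.id, customer.name
customers = pd.json_normalize(docs)[["customer.id", "customer.name"]]

# Nested arrays become their own flat table, with customer.id as the key
orders = pd.json_normalize(docs, record_path="orders", meta=[["customer", "id"]])

print(customers)
print(orders)
```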

Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. Modern big data applications such as social, mobile, web, and IoT deal with a larger number of users and a larger amount of data than traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing, and can include complex types such as JSON and Parquet. Apache Drill is built from the ground up to provide low-latency queries natively on such rapidly evolving multi-structured datasets at scale.
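As a rough sketch of how that looks in practice, a SQL query against a raw JSON file can be submitted to a local Drill instance over its REST API (default port 8047; the file path, field names, and response handling here are assumptions):

```python
# Hedged sketch: query a JSON file in place through Apache Drill's REST API.
import requests

payload = {
    "queryType": "SQL",
    "query": ("SELECT t.customer.name AS name, COUNT(*) AS orders "
              "FROM dfs.`/data/orders.json` t GROUP BY t.customer.name"),
}
resp = requests.post("http://localhost:8047/query.json", json=payload)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```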

Apache Spark is another exciting new approach that speeds up queries by utilizing memory. It consists of Spark SQL (SQL-like queries), Spark Streaming, MLlib, and GraphX, and it lets developers work in Python, Scala, or Java. It enables users of Hadoop to have more fun with data analysis and visualization.

Big Data Visualization is emerging to be a critical component for extracting business value from data.

Lambda Architecture

I attended a Meetup yesterday in Mountain View, hosted by The Hive group, on the subject of Lambda Architecture. Since I had never heard this phrase before, my curiosity took me there. There was a panel discussion, and the panelists came from Hortonworks, Cloudera, MapR, Teradata, etc.

Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data based on his experience working on distributed data processing systems at Twitter. Some of the key requirements in building this architecture include:

  • Fault-tolerance against hardware failures and human errors
  • Support for a variety of use cases that include low latency querying as well as updates
  • Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
  • Extensibility so that the system is manageable and can accommodate newer features easily

The following picture summarizes the framework.

Overview of the Lambda Architecture

The Lambda Architecture, as seen in the picture, has three major components; a minimal code sketch follows the list.

  1. Batch layer, which provides the following functionality:
    1. managing the master dataset, an immutable, append-only set of raw data
    2. pre-computing arbitrary query functions, called batch views.
  2. Serving layer—This layer indexes the batch views so that they can be queried ad hoc with low latency.
  3. Speed layer—This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
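Here is a toy sketch in plain Python of how those three layers fit together (all data invented; a real system would use something like Hadoop or Spark for the batch layer and Storm or Spark Streaming for the speed layer):

```python
# Toy Lambda Architecture: batch view + speed view, merged at query time.
from collections import Counter

# Batch layer: immutable, append-only master dataset
master_dataset = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

# Batch view precomputed from the master dataset (recomputed on each batch run)
batch_view = Counter()
for page, n in master_dataset:
    batch_view[page] += n

# Speed layer: incremental counts covering only events since the last batch run
speed_view = Counter()

def on_recent_event(page):
    speed_view[page] += 1

on_recent_event("page_a")
on_recent_event("page_c")

# Serving layer: answer queries by merging the batch and speed views
def query(page):
    return batch_view[page] + speed_view[page]

print(query("page_a"))  # 2 from batch + 1 from speed = 3
```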

Criticism of lambda architecture has focused on its inherent complexity and its limiting influence. The batch and streaming sides each require a different code base that must be maintained and kept in sync so that processed data produces the same result from both paths. Yet attempting to abstract the code bases into a single framework puts many of the specialized tools in the batch and real-time ecosystems out of reach.

The panelists rambled on about details without addressing the real challenges of combining two very different approaches, which compromises the benefits of streaming with the added latency of the batch world. However, there is merit to the thought process of unifying the two disparate worlds into a common framework. Real deployments will be the proof point.

Big Data coverage at CES 2015

I saw more discussion of big data at CES 2015 this week than in previous years. Everyone talked about data as the central core of everything. IoT (Internet of Things), NFC (Near Field Communication), and M2M (Machine to Machine) communication are enabling pieces for many industries: security monitoring, asset and inventory tracking, healthcare, process control, building environment monitoring, vehicle tracking and telemetry, in-store customer engagement, and digital signage. Big data is the big deal here.

The Big Data ecosystem includes cloud computing, M2M/IoT, dumb terminal 2.0 (devices getting dumber: more cloud, better broadband, less about storage and more about broadband access and a high-quality display), and analysis. The big data opportunity is slated to be a $200B business in 2015. Every company must insert the big data ecosystem into its future roadmap or get left out. The key here is not the technology, but its business value.

The progression goes like this: Big Data -> Big Info -> Big Knowledge -> Big Insight. For example, Big Data says “60” (not much meaning); then Big Info says “Steve is 60,” adding context. Then Big Knowledge says “Steve can’t hear very well,” followed by Big Insight such as “maybe we should give Steve a hearing aid,” an actionable item. So we go from Big Data to Big Insight, which becomes very useful. Several industry examples can be given:

  • Retail iBeacon technology – Apple’s technology allows smartphones to be tracked geographically. This provides vector information about shoppers and hence allows for a predictive service experience, in combination with smart mirrors.
  • Insurance companies – by collecting information on drivers’ behavior, premiums can be adjusted for each individual.
  • Medical event tracking – big data has a crucial role here, providing relevant information per patient.
  • Asset tracking in oil fields can help reduce costs and increase efficiency.
  • Smart cities – as with San Francisco’s SFpark parking system, every sensor-equipped parking space can be used efficiently. You can use your smartphone to find available parking quickly.
  • and many more.

Big Data is the heart of it all: efficiently ingesting, storing, processing, and managing unstructured data and providing meaningful analysis. Using an oil industry analogy, over the next 3-5 years we will see Big Data as the crude oil and analytics as the new refinery.