Category Archives: Database

Data Unification at scale

The term Data Unification is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010Data, and TamR. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), which emerged during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end-target databases – more specifically, an operational data store, data mart, or data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g. how many shoppers between the ages of 25 and 35 bought this item between May and July?). This was like driving a car while looking at the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands “predictive analytics” and “streaming analytics”.

During my IBM and Oracle days, the first phase of ETL was left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players such as Informatica, DataStage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of: constructing a global schema in advance; for each local data source, writing a program to understand the source and map it to the global schema; then writing scripts to transform, clean (resolving homonym and synonym issues), and dedup (get rid of duplicates) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
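To make the schema-first recipe concrete, here is a minimal sketch in Python. The source names, column mappings, and sample rows are all hypothetical, and a real pipeline would involve far more cleaning logic than the trivial normalization shown here.

```python
# Minimal sketch of a schema-first ETL pipeline (all names are hypothetical):
# a global schema fixed in advance, one hand-written mapping per source,
# then transform, dedup, and load into a read-only analytics target.
import sqlite3

GLOBAL_SCHEMA = ("customer_id", "name", "city")

# One hand-written mapping per source: global column -> local column.
SOURCE_MAPPINGS = {
    "crm_export": {"customer_id": "cust_no", "name": "full_name", "city": "town"},
    "web_orders": {"customer_id": "uid", "name": "user_name", "city": "city"},
}

def transform(source, rows):
    """Map one source's rows onto the global schema, normalizing values."""
    mapping = SOURCE_MAPPINGS[source]
    for row in rows:
        yield tuple(str(row[mapping[col]]).strip().lower() for col in GLOBAL_SCHEMA)

def dedup(rows):
    """Drop exact duplicates while preserving order."""
    seen = set()
    return [r for r in rows if not (r in seen or seen.add(r))]

# Extract: pretend these rows came from two separately built applications.
extracted = {
    "crm_export": [{"cust_no": "17", "full_name": " Ada Lovelace", "town": "London"}],
    "web_orders": [{"uid": "17", "user_name": "ada lovelace", "city": "london"}],
}

unified = dedup([r for src, rows in extracted.items() for r in transform(src, rows)])
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id TEXT, name TEXT, city TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)", unified)
```

Note how every piece of intelligence lives in hand-written mappings and scripts; that is exactly the part that stops scaling when sources multiply.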

In the world of Big Data, this approach is inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources), but gets extremely hard when the sources number in the hundreds or even thousands. This gets worse when you want to unify public data from the web with enterprise data.
  • Human labor to map each source to a master schema becomes costly and excessive. Here machine learning is required, with domain experts asked to augment the mappings where needed (see the sketch after this list).
  • Real-time unification and analysis of streaming data cannot be handled by these solutions.
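As a toy illustration of machine-assisted schema matching, the sketch below scores each incoming column name against the global schema and escalates low-confidence matches to a domain expert. Real systems also learn from data values and expert feedback; simple string similarity, the column names, and the 0.6 threshold here are stand-ins I invented.

```python
# Toy schema matching: auto-map confident column matches, escalate the rest.
from difflib import SequenceMatcher

GLOBAL_COLUMNS = ("customer_id", "name", "city")

def best_match(local_col):
    """Return (similarity, global_column) for the closest global column."""
    return max((SequenceMatcher(None, local_col.lower(), g).ratio(), g)
               for g in GLOBAL_COLUMNS)

for local in ("cust_id", "full_name", "town", "acct_balance"):
    score, match = best_match(local)
    if score >= 0.6:
        print(f"auto-map {local!r} -> {match!r} (similarity {score:.2f})")
    else:
        print(f"ask a domain expert about {local!r} (best guess {match!r}, {score:.2f})")
```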

Another solution, called the “data lake”, where you store disparate data in its native format, seems to address only the “ingest” problem. It changes the order of ETL to ELT (first load, then transform). However, it does not address the scale issues. The new world needs bottom-up (schema-last) data unification in real time or near real time.

The typical data unification cycle goes like this: start with a few sources, try enriching the data with, say, X, and see if it works; if it fails, loop back and try again. Use enrichment to improve the results, and do everything automatically using machine learning and statistics. But iterate furiously, and ask domain experts for help when needed. Otherwise the current approach of ETL or ELT can get very expensive.
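Here is a hedged sketch of that iterate-furiously loop; match_rate() and the enrichment callables are placeholders I invented, standing in for real ML-driven enrichment and quality measurement.

```python
# Sketch of the iteration loop: apply a candidate enrichment, keep it only if
# a quality metric improves, otherwise roll back and try the next candidate.
# match_rate() and the enrichment functions are hypothetical placeholders.
def match_rate(records):
    """Fraction of records linked to a master entity (placeholder metric)."""
    return sum(1 for r in records if r.get("entity_id")) / max(len(records), 1)

def unify(records, enrichments, min_gain=0.01):
    score = match_rate(records)
    for enrich in enrichments:
        candidate = enrich(records)        # e.g. add geocodes, canonical names
        new_score = match_rate(candidate)
        if new_score - score >= min_gain:  # keep enrichments that help...
            records, score = candidate, new_score
        # ...otherwise loop back and try the next candidate
    return records, score
```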

Oracle’s push into cloud solutions

I watched Larry Ellison’s keynotes at this week’s Oracle OpenWorld conference in San Francisco. They are definitely serious about pushing their cloud offerings, even though they came in late; Oracle claims to have been working on them for almost ten years. The big push is at all 3 levels – SaaS, PaaS, and IaaS. The infrastructure-as-a-service offering claims faster and cheaper resources (computing, storage, and networking) to beat Amazon’s AWS. They make a good point on better security for enterprises, given that security breaches have been happening with greater frequency lately. One comment I have is that AWS is beyond just IaaS; they are into PaaS as well (e.g. Docker services for devops). Oracle’s big advantage is in offering SaaS for all their application suites – ERP, HCM, and CRM (they call it CX, for customer experience). This is not something AWS offers for the enterprise market, although apps like Salesforce and Workday are available. Microsoft has Dynamics as an ERP on their cloud.

I do agree that Oracle has the upper hand when it comes to database as a service. Larry showed performance numbers for AWS Redshift, Aurora, and DynamoDB compared to Oracle’s database (claimed to be much faster). They do have a chance to beat AWS when it comes to serious enterprise-scale implementations, given their stronghold in that market. Most of these enterprises still run much of their systems on-premise. Oracle offers them an alternative: switch to the cloud version within their own firewall. They also suggest the co-existence of on-prem and cloud solutions. The total switch-over to cloud will take ten years or more, as confidence and comfort levels grow over time.

AWS has a ten-year lead here and has grown in scale and size. The current run rate for AWS is over $10B in revenue with hefty margins (over 50%). However, many clients complain about the high cost as they use more AWS services. Microsoft Azure and Google’s cloud services are marching fast to catch up. Most of the new-age web companies use AWS. Oracle is better off focusing on the enterprise market, its stronghold. Not to discount IBM here, which is pushing its SoftLayer cloud solutions to enterprise customers. Mark Hurd of Oracle showed several examples of cloud deployment at large and medium-size companies as well. One interesting presence at OpenWorld yesterday was the chief minister (like a state governor) of the Indian state of Maharashtra (Mumbai being its big city). He signed a deal with Oracle to help implement cloud solutions to make many cities into “smart” cities and also to connect 29,000 villages digitally. This is a big win for Oracle and will set the stage for many other government outfits to follow suit.

I think more competition for AWS is welcome, as no one wants single-vendor lock-in. Mark Hurd said that by 2020, cloud solutions will dominate the enterprise landscape. Analysts are skeptical of Oracle’s claims against AWS, but a focused Oracle in the cloud is not to be taken lightly.

Jnan Dash

Stack Fallacy? What is it?

Back in January, TechCrunch published an article on this subject called Stack Fallacy, written by Anshu Sharma of Storm Ventures. Then today I read a Business Insider article arguing that the reason Dropbox is failing is the Stack Fallacy. Sharma describes the Stack Fallacy as “the mistaken belief that it is trivial to build the layer above yours.”

Many companies trivialize the task of building the layers above their core-competency layer, and that leads to failure. Oracle is a good example: they thought it was no big deal to build applications (watching the success of SAP in the ERP layer, initially built on the Oracle database). I remember a meeting with Hasso Plattner, founder of SAP, back in the early 1990s when I was at Oracle. He said SAP was one of the biggest customers of Oracle at the time, and now Oracle competed with them. For lack of any good answer, we said that we were friends in the morning and foes in the afternoon and welcomed him to the world of “co-opetition”. Subsequently SAP started moving off the Oracle DB and was enticed by IBM to use DB2. Finally SAP built its own database (it bought Sybase and built the in-memory database HANA). Oracle’s applications were initially disasters, as they were hard to use and did not quite meet the needs of customers. Finally Oracle had to win the space by acquiring PeopleSoft and Siebel.

Today’s Business Insider article says, “…a lot of companies often overvalue their level of knowledge in their core business stack, and underestimate what it takes to build the technology that sits one stack above them. For example, IBM saw Microsoft take over the more profitable software space that sits on top of its PCs. Oracle likes to think of Salesforce as an app that just sits on top of its database, but hasn’t been able to overtake the cloud-software space they compete in. Google, despite all the search data it owns, hasn’t been successful in the social-network space, failing to move up the stack in the consumer-web world. Ironically, the opposite is true when you move down the stack. Google has built a solid cloud-computing business, which is a stack below its search technology, and Apple’s now building its own iPhone chips, one of the many lower stacks below its smartphone device”.

With reference to Dropbox, the article says that it underestimated what it takes to build apps a layer above (Mailbox, Carousel) and failed to understand its customers’ needs, while investing in unimportant areas like the migration away from AWS. Dropbox is at a phase where it needs to think more about users’ needs and about competing with the likes of Google and Box, rather than spending on “optimizing for costs or minor technical advantages”.

I am not sure I agree with that assessment. Providing efficient and cost-effective cloud storage is Dropbox’s core competency, and they are staying pretty close to that. The move away from AWS is clearly aimed at cost savings, as AWS can be a huge operational-cost burden, and it has its limitations on effective scaling. In some ways, Dropbox is expanding its lower layers for future hosting. Its focus on enterprise-scale cloud storage is the right approach, as opposed to Box or Google, where the focus is on consumers.

But the Stack Fallacy applies more to Apple doing its own iPhone chips, or Dell misguidedly going after big data. At Oracle the dictum used to be, “everything is a database problem – if you have a hammer, then everything looks like a nail”.

RocksDB from Facebook

I attended a HIVE-sponsored Meetup yesterday evening titled, “Rocking the database world with RocksDB”. Since I had never heard of RocksDB, I was curious to learn how it is rocking the database world.

Facebook originally built this key-value storage layer to use with MySQL (in place of InnoDB), as MySQL is used heavily at Facebook, though they claim that was not the only motivation. Then in 2013, they decided to open-source RocksDB. Last evening’s speaker, in an earlier post from November 2013, had written: “Storing and accessing hundreds of petabytes of data is a huge challenge, and we’re constantly improving and overhauling our tools to make this as fast and efficient as possible. Today, we are open-sourcing RocksDB, an embeddable, persistent key-value store for fast storage that we built and use here at Facebook.”
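To make “embeddable, persistent key-value store” concrete, here is a tiny sketch using the third-party python-rocksdb binding. The library runs in-process (no server), persisting to a local directory; the path, key, and value below are made up.

```python
# Minimal RocksDB usage via the third-party python-rocksdb binding.
# The store is opened in-process and persisted to a local directory.
import rocksdb

opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("example.db", opts)      # hypothetical local path

db.put(b"user:42", b'{"name": "ada"}')   # write a key-value pair
print(db.get(b"user:42"))                # read it back: b'{"name": "ada"}'
```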

RocksDB is also ideal for SSDs (flash storage) and claims fast performance. The team was excited when MongoDB opened up to other storage engines in the summer of 2014. For a period of time, MongoDB plus RocksDB was a fast combination. Then MongoDB decided to acquire WiredTiger (a competitor) in December 2014 to contribute to the performance, scalability, and hardware efficiency of MongoDB. That left RocksDB out of any official engagement with MongoDB. But the team built something called MongoRocks, which claims to be very fast, and it seems several MongoDB users prefer MongoRocks over the native combination of MongoDB with WiredTiger.

Several users of RocksDB talked about their experience, especially in the IoT world, where sensor data can be processed at the edge (ingestion, aggregation, and some transformation) before being sent to cloud servers. The only issue I saw is that there is no “real” owner of RocksDB as a deliverable solution. There is no equivalent of a Cloudera (for Hadoop) or a Confluent (for Kafka) to provide value additions and support for the user base. So far it is all open-source download-and-do-your-own-stuff, which makes serious production-level deployment a risky affair. For now, it’s a developer’s play tool.

In Memoriam – Ed Lassettre

I was out of the country when my old colleague from IBM days, Ed Lassettre, passed away last November. I only found out about his passing earlier this month from a mutual friend at IBM Almaden Research. Ed was one of the best computer software professionals I knew and respected.

He was at IBM’s Santa Teresa Lab (now called the Silicon Valley Lab) when I started there back in 1981 after my five-year stint at IBM Canada. That year he was promoted to Senior Technical Staff Member (STSM), the very first at the lab to receive that honor. Subsequently he became an IBM Fellow, the highest technical honor. His reputation as one of the key software engineers for IBM’s MVS operating system preceded him. Ed had spent a few years at IBM’s Poughkeepsie Lab in upstate New York. He did his undergraduate and post-graduate studies in Math at Ohio State University. He had deep insights into the intricacies of high-performance computing systems. When we were building DB2 at the IBM lab, Ed provided guidance on its interface with the operating system.

Subsequently I went to IBM’s Austin Lab for two years in the mid-1980s to lay the foundation of the DB2 product for the PC (which at the time lacked the processing power and memory of the large mainframes), so our design had to accommodate those limitations. The IBM executives wanted someone to audit our design before giving the green signal for development. I could not think of a better person than Ed Lassettre to do that. At my request, Ed spent some time on it and gave a very positive report on our design. He had great credibility in the technical community. Many times I sought his views on technical matters, and he provided timely advice. His wisdom was complemented by tremendous humility, a rare trait in our industry.

I had left IBM for Oracle back in 1992 and lost touch with Ed. Later on I found that he had retired from IBM and joined Microsoft Research. He was a good friend of the late Jim Gray, who was also at Microsoft Research at the time. Ed retired from Microsoft in 2013 at the age of 79! He was quite well-known in the HPTC (High Performance Technical Computing) world.

RIP, Ed Lassettre, a great computer scientist and friend! You will be missed.

Strategic Technologies as per Gartner

I have known Gartner for decades, going back to my IBM and Oracle days. Even though I have observed how they invent new terms for stuff we already know (a bit annoying, but I guess that’s their business), they do a decent job of capturing key strategic trends.

In a recent article, I saw ten strategic technology trends, grouped as follows: the first 3 address the merging of the physical and virtual worlds and the emergence of the digital mesh (their new phrase); the next 3 cover the algorithmic business, where much happens in the background without people being directly involved; the final 4 address the new architecture and platform trends needed to support the digital and algorithmic business.

The first 3 trends:

  • The Device Mesh – In the post-mobile world, the focus shifts to the mobile user, who is surrounded by a mesh of devices, each with an IP address, always communicating.
  • Ambient User Experience – A seamless flow of experience across a shifting set of devices. Think of shifting among IoT devices, automobiles, smartphones, etc.
  • 3D Printing Materials – Advances in printable materials will require assembly-line and supply-chain processes to change in order to exploit 3D printing.

The next 3 trends:

  • Information of Everything – This information goes beyond text, audio, and video to include sensory and contextual data. How do you bring meaning to a chaotic deluge of information? Much work is needed here.
  • Advanced Machine Learning – Deep Neural Networks (DNNs) go beyond classic computing and information management to create systems that can autonomously learn to perceive the world on their own. DNNs (an advanced form of machine learning applicable to large complex datasets) will make smart machines “intelligent”.
  • Autonomous Agents & Things – Like robots, autonomous vehicles, virtual personal assistants and smart advisors.

The final 4 trends:

  • Adaptive Security Architecture – How do we combat the hacker industry beyond perimeter defense and rule-based security?
  • Advanced Systems Architecture – this is what Gartner said, “Fueled by field-programmable gate arrays (FPGAs) as an underlying technology for neuromorphic architectures, there are significant gains such as being able to run at speeds of greater than a teraflop with high-energy efficiency”.
  • Mesh App and Service Architecture – Monolithic, linear application designs like the 3-tier architecture are giving way to a loosely coupled, integrative approach. Containers (e.g. Docker) are emerging as a critical technology for enabling agile development and microservice architectures. What is needed is back-end cloud scalability and a front-end device-mesh experience.
  • Internet of Things Platforms – Management, security, and integration, plus standards, are needed for IoT platforms to succeed.

These are all known areas, but I liked the way Gartner grouped them in a logical sequence.


Cassandra Summit 2015

I attended my first Cassandra Summit this week at the Santa Clara Convention Center. I was quite surprised to see more than 6000 people attend this event with 130 sessions, much bigger than last year (2000 attendees). It was proof of the growing popularity of the Cassandra NoSQL database platform. DataStax CTO and cofounder Jonathan Ellis (formerly of Rackspace) described the new release 2.2 and 3.0 functions.

With the major addition of JSON support in 2.2, Cassandra has basically eliminated that difference with MongoDB. Cassandra has its own query language called CQL, an SQL-like construct; now with JSON support, the developer community will see some big advantages. There are features like collections, user-defined types, and deeper nesting. Release 3.0 will bring a brand-new storage engine, a vast improvement over the previous key-value store engine, with better space efficiency. Release 3.0 will also include materialized views. Bragging about Cassandra’s fast performance and efficient distributed database functionality, Jonathan joked about MongoDB as the “Snapchat for databases” (a reference to occasional data loss because of weak consistency). He emphasized three key elements: availability (onstage they dramatized shutting down many nodes across two data centers with Cassandra still running), scale, and performance (both read and write).
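As a quick, hedged sketch of what the 2.2 JSON support looks like through CQL, here is an example using the DataStax Python driver; the contact point, keyspace, and table are made up, and it assumes a local Cassandra node is running.

```python
# A small sketch of Cassandra 2.2's JSON support through CQL, using the
# DataStax Python driver (cassandra-driver). Names here are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

# INSERT ... JSON (new in 2.2) maps a JSON document directly onto a row.
session.execute("""INSERT INTO demo.users JSON '{"id": 1, "name": "ada"}'""")

# SELECT JSON returns each row back as a JSON document.
for row in session.execute("SELECT JSON * FROM demo.users"):
    print(row[0])   # {"id": 1, "name": "ada"}
```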

However, when it comes to streaming analytics, one Cassandra user explained how he combined Spark, Kafka, and Cassandra to achieve it – a non-trivial programming feat. DataStax CEO Billy Bosworth emphasized that they solve the transaction workload problem (always-on) and are not geared for analytics. I understood that two key customers are Apple, running its iTunes application on Cassandra, and Netflix. The majority of use cases were web-centric applications where speed and scale are key requirements. Several case studies involved customers replacing MySQL with Cassandra.

In a panel discussion on monetizing open-source software, the comment was made that Cassandra is foundational and customers are happy to pay for the fast performance and scale of the enterprise edition. In that sense, the model is different from the Red Hat model.

It was an interesting experience to see a new-generation database product gaining wider acceptance.