Category Archives: Database

Splice Machine – What is it?

Those of you who have never heard of Splice Machine, don't worry; you are in the company of many. I decided to listen to a webinar last week whose announcement promised the following: learn about the benefits of a modern IoT application platform that can capture, process, store, analyze, and act on the large streams of data generated by IoT devices. The demonstration would include:

  • High Performance Data Ingestion
  • Analytics and Transformation on Data-In-Motion
  • Relational DBMS, Supporting Hybrid OLTP and OLAP Processing
  • In-Memory and Non-Volatile, Row-based and Columnar Storage mechanisms
  • Machine Learning to support decision making and problem resolution

That was a tall order. Gartner has a new term for this, HTAP (Hybrid Transactional and Analytical Processing); Forrester uses "Translytical" to describe a platform where you can do both OLTP and OLAP. I wrote a blog post on translytical databases almost two years ago. So I did attend the webinar, and it was quite impressive. The only confusion was the liberal use of IoT in the marketing slogan; by that they mean to emphasize "streaming data" (ingest, store, manage).

On Splice Machine's website, you see four things: Hybrid RDBMS, ANSI SQL, ACID Transactions, and Real-Time Analytics. A white-paper advertisement says, "Your IoT applications deserve a better data platform." Looking at the advisory board members, I recognized three names: Roger Bamford (ex-Oracle and an investor), Ken Rudin (ex-Oracle), and Marie-Anne Neimat (ex-TimesTen). The company is funded by Mohr Davidow Ventures and InterWest Partners, among others.

There is a need to bring together the worlds of OLTP (transactional workloads) and analytics (OLAP workloads) on a common platform. They have been separated for decades, and that is how the Data Warehouse, MDM, OLAP cubes, etc. got started. The movement of data between the OLTP and OLAP worlds has been handled by ETL vendors such as Informatica. With the popularity of Hadoop, the DW/analytics world is crowded with terms like Data Lake, ELT (first load, then transform), Data Curation, Data Unification, etc. A newer architecture called Lambda (not to be confused with AWS Lambda for serverless computing) claims to unify the two worlds by combining batch processing with real-time stream processing for analytics.

Into this world comes Splice Machine with its scale-out data platform. You can do your standard ACID-compliant OLTP processing, ingest data via Spark Streaming and Kafka topics, query it via ANSI SQL, and run your analytical workload without ETL. They even claim support for procedural languages like PL/SQL for Oracle workloads. With their support for machine learning, they demonstrated predictive analytics. The current focus is on verticals like healthcare, telco, retail, and finance (Wells Fargo).
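
To make the ingestion pipeline concrete, here is a minimal sketch (not from the webinar) of the kind of flow described: Spark Structured Streaming reads a Kafka topic and appends the rows to a SQL table over JDBC, where ANSI SQL queries can then run against them. The topic name, JDBC URL, schema, and table are all hypothetical, and a real deployment would use Splice Machine's own Spark adapter rather than the generic JDBC writer shown here.

```python
# Hypothetical sketch: stream sensor readings from Kafka into a SQL table via JDBC.
# Topic, URL, and table names are made up; this is generic Spark, not Splice Machine's connector.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("reading", DoubleType())
          .add("ts", TimestampType()))

# Read the Kafka topic as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor-readings")
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Append each micro-batch to the RDBMS, where OLTP and OLAP SQL can both run on it.
def write_batch(batch_df, batch_id):
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:splice://host:1527/splicedb")   # hypothetical URL
     .option("dbtable", "IOT.SENSOR_READINGS")
     .mode("append")
     .save())

query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```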

In the cacophony of Big Data and IoT noise, it is hard to separate facts from fiction. But I do see a role for a "unified" approach like Splice Machine's. Again, the proof is in the pudding: some real-life customer deployment scenarios with performance numbers will test the hypothesis and their claim of 10x faster performance at one-fourth the cost.

Data Unification at scale

The term Data Unification is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010data, and Tamr. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), which got started during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end-target databases, more specifically an operational data store, data mart, or data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g. how many shoppers between the ages of 25 and 35 bought this item between May and July?). This was like driving a car while looking in the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands "predictive analytics" and "streaming analytics".

During my IBM and Oracle days, ETL was initially left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players such as Informatica, DataStage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of constructing a global schema in advance; writing, for each local data source, a program to understand the source and map it to the global schema; and then writing scripts to transform, clean (resolving homonym and synonym issues), and dedup (get rid of duplicates) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
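
As a toy illustration of that methodology (my own sketch, not from the post; the file names, column mappings, and cleaning rules are made up), the per-source "map to the global schema, clean, and dedup" step often looks something like this:

```python
# Toy ETL step: map two hypothetical source extracts to one global schema,
# normalize values, and remove duplicates before loading into the warehouse.
import pandas as pd

GLOBAL_COLUMNS = ["customer_id", "full_name", "email", "purchase_amount"]

# Each source needs its own hand-written mapping to the global schema.
SOURCE_MAPPINGS = {
    "crm_export.csv": {"cust_no": "customer_id", "name": "full_name",
                       "mail": "email", "amount": "purchase_amount"},
    "web_orders.csv": {"user_id": "customer_id", "customer_name": "full_name",
                       "email_addr": "email", "total": "purchase_amount"},
}

def extract_and_transform(path, mapping):
    df = pd.read_csv(path)
    df = df.rename(columns=mapping)[GLOBAL_COLUMNS]
    # Cleaning rules (synonym/format issues): trim names, lowercase emails.
    df["full_name"] = df["full_name"].str.strip().str.title()
    df["email"] = df["email"].str.strip().str.lower()
    return df

frames = [extract_and_transform(path, mapping) for path, mapping in SOURCE_MAPPINGS.items()]
unified = pd.concat(frames, ignore_index=True)

# Dedup: the same customer may appear in several sources.
unified = unified.drop_duplicates(subset=["customer_id", "email"])

# Load into the read-only analytics target (a CSV stands in for the warehouse here).
unified.to_csv("warehouse_customers.csv", index=False)
```

With a handful of sources this is manageable; the point of the rest of the post is what happens when the number of sources explodes.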

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources, etc.), but gets extremely hard when the sources number in the hundreds or even thousands. It gets worse when you want to unify public data from the web with enterprise data.
  • Human labor to map each source to a master schema becomes costly and excessive. Here machine learning is required, with domain experts asked to augment it where needed.
  • Real-time unification and analysis of streaming data cannot be handled by these solutions.

Another solution, the "data lake," where you store disparate data in its native format, seems to address only the "ingest" problem. It changes the order of ETL to ELT (first load, then transform). However, it does not address the scale issues. The new world needs bottom-up data unification (schema-last) in real time or near real time.

The typical data unification cycle can go like this: start with a few sources, try enriching the data with, say, source X, and see if it works; if it fails, loop back and try again. Use enrichment to improve the result, and do as much as possible automatically using machine learning and statistics. But iterate furiously, and ask domain experts for help when needed. Otherwise the current approach of ETL or ELT can get very expensive.
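
A rough sketch of what machine-assisted mapping can look like (my own illustration; the attribute names and the 0.6 threshold are made up, and a real system would use trained models and richer statistics rather than simple string similarity): score each source attribute against the master schema, auto-map the high-confidence matches, and route the rest to a domain expert.

```python
# Toy sketch of machine-assisted schema matching: score each source attribute
# against the master schema and only send low-confidence matches to a human expert.
from difflib import SequenceMatcher

MASTER_SCHEMA = ["customer_id", "full_name", "email", "purchase_amount"]

def best_match(source_attr, candidates):
    # Return (similarity, master attribute) for the closest master-schema column.
    scored = [(SequenceMatcher(None, source_attr.lower(), c).ratio(), c) for c in candidates]
    return max(scored)

source_attributes = ["cust_no", "customer_name", "email_addr", "total"]

for attr in source_attributes:
    score, target = best_match(attr, MASTER_SCHEMA)
    if score >= 0.6:
        print(f"auto-map   {attr:15s} -> {target} (confidence {score:.2f})")
    else:
        print(f"ask expert {attr:15s} -> best guess {target} (confidence {score:.2f})")
```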

Oracle’s push into cloud solutions

I watched Larry Ellison's keynotes at this week's Oracle OpenWorld conference in San Francisco. They are definitely serious about pushing their cloud offerings, even though they came in late; Oracle claims it has been working on them for almost ten years. The big push is at all 3 levels: SaaS, PaaS, and IaaS. The infrastructure-as-a-service offering claims faster and cheaper resources (compute, storage, and networking) to beat Amazon's AWS. They make a good point on better security for enterprises, given that security breaches have been happening with greater frequency lately. One comment I have is that AWS is beyond just IaaS; they are into PaaS as well (e.g. Docker-based services for DevOps). Oracle's big advantage is in offering SaaS for all their application suites: ERP, HCM, and CRM (which they call CX, for customer experience). This is not something AWS offers for the enterprise market, although apps like Salesforce and Workday are available. Microsoft has Dynamics as an ERP on its cloud.

I do agree that Oracle has the upper hand when it comes to database as a service. Larry showed performance numbers for AWS Redshift, Aurora, and DynamoDB compared to Oracle's database (which was much faster). They do have a chance to beat AWS when it comes to serious enterprise-scale implementations, given their stronghold in that market. Most of these enterprises still run much of their systems on premise. Oracle offers them an alternative: switch to the cloud version within their own firewall. They also suggest the coexistence of on-prem and cloud solutions. The total switch-over to the cloud will take ten years or more, as confidence and comfort levels grow over time.

AWS has a ten-year lead here and has grown in scale and size. The current run rate for AWS is over $10B in revenue with a hefty profit margin (over 50%). However, many clients complain about the high cost as they use more AWS services. Microsoft Azure and Google's cloud services are marching fast to catch up. Most of the new-age web companies use AWS. Oracle is better off focusing on the enterprise market, its stronghold. Not to discount IBM here, which is pushing its SoftLayer cloud solutions to enterprise customers. Mark Hurd of Oracle showed several examples of cloud deployments at large and medium-sized companies as well. One interesting presence at OpenWorld yesterday was the chief minister (akin to a state governor) of the Indian state of Maharashtra (whose biggest city is Mumbai). He signed a deal with Oracle to help implement cloud solutions to turn many cities into "smart" cities and to connect 29,000 villages digitally. This is a big win for Oracle and will set the stage for many other government bodies to follow suit.

I think more competition for AWS is welcome, as no one wants single-vendor lock-in. Mark Hurd said that by 2020, cloud solutions will dominate the enterprise landscape. Analysts are skeptical of Oracle's claims against AWS, but a focused Oracle in the cloud is not to be taken lightly.

Jnan Dash

Stack Fallacy? What is it?

Back in January, TechCrunch published an article on this subject called "Stack Fallacy," written by Anshu Sharma of Storm Ventures. Then today I read a Business Insider article arguing that the reason Dropbox is failing is the Stack Fallacy. Sharma describes the Stack Fallacy as "the mistaken belief that it is trivial to build the layer above yours."

Many companies trivialize the task of building layers above their core-competency layer, and that leads to failure. Oracle is a good example: they thought it was no big deal to build applications (watching the success of SAP in the ERP layer, initially built on the Oracle database). I remember a meeting with Hasso Plattner, co-founder of SAP, back in the early 1990s when I was at Oracle. He pointed out that SAP was one of Oracle's biggest customers at the time, and yet Oracle now competed with them. For lack of any good answer, we said that we were friends in the morning and foes in the afternoon, and welcomed him to the world of "co-opetition". Subsequently SAP started moving off the Oracle database and was enticed by IBM to use DB2. Finally SAP built its own database (they bought Sybase and built the in-memory database HANA). Oracle's applications were initially disasters, as they were hard to use and did not quite meet the needs of customers. Finally they had to win the space by acquiring PeopleSoft and Siebel.

Today's Business Insider article says, "…a lot of companies often overvalue their level of knowledge in their core business stack, and underestimate what it takes to build the technology that sits one stack above them. For example, IBM saw Microsoft take over the more profitable software space that sits on top of its PCs. Oracle likes to think of Salesforce as an app that just sits on top of its database, but hasn't been able to overtake the cloud-software space they compete in. Google, despite all the search data it owns, hasn't been successful in the social-network space, failing to move up the stack in the consumer-web world. Ironically, the opposite is true when you move down the stack. Google has built a solid cloud-computing business, which is a stack below its search technology, and Apple's now building its own iPhone chips, one of the many lower stacks below its smartphone device".

With reference to Dropbox, the article says it underestimated what it takes to build apps a layer above (Mailbox, Carousel) and failed to understand its customers' needs, while investing in unimportant areas like the migration away from AWS. Dropbox is at a phase where it needs to think more about users' needs and about competing with the likes of Google and Box, rather than spending on "optimizing for costs or minor technical advantages".

I am not sure I agree with that assessment. Providing efficient and cost-effective cloud storage is Dropbox's core competency, and they are staying pretty close to it. The move away from AWS is clearly aimed at cost savings, as AWS can be a huge burden on operational cost, and it has its limitations on effective scaling. In some ways, Dropbox is expanding its lower layers for future hosting. Its focus on enterprise-scale cloud storage is the right approach, as opposed to Box or Google, where the focus is on consumers.

But the Stack Fallacy applies more to Apple doing its own iPhone chips, or Dell wrongfully going after big data. At Oracle the dictum used to be, “everything is a database problem – if you have a hammer, then everything looks like a nail”.

RocksDB from Facebook

I attended a HIVE-sponsored Meetup yesterday evening titled "Rocking the database world with RocksDB". Since I had never heard of RocksDB, I was curious to learn how it is rocking the database world.

Facebook originally built this key-value storage engine to use with MySQL (in place of InnoDB), as MySQL is used heavily at Facebook. They claim that was not the only motivation. Then in 2013, they decided to open-source RocksDB. Last evening's speaker, in an earlier post from November 2013, had said, "Storing and accessing hundreds of petabytes of data is a huge challenge, and we're constantly improving and overhauling our tools to make this as fast and efficient as possible. Today, we are open-sourcing RocksDB, an embeddable, persistent key-value store for fast storage that we built and use here at Facebook."
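
To give a feel for what "embeddable, persistent key-value store" means in practice, here is a minimal put/get sketch. It assumes the third-party python-rocksdb bindings are installed; the canonical APIs are C++ and Java, and the database is just a local directory opened inside your own process, with no server involved.

```python
# Minimal embedded key-value usage, assuming the python-rocksdb bindings are available.
import rocksdb

opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("example.db", opts)           # a local directory, no server process

db.put(b"user:42:last_login", b"2016-07-01T10:22:03Z")
print(db.get(b"user:42:last_login"))          # b'2016-07-01T10:22:03Z'

# Batched writes are applied atomically, which helps with high-rate ingestion.
batch = rocksdb.WriteBatch()
batch.put(b"sensor:1", b"72.5")
batch.put(b"sensor:2", b"68.1")
db.write(batch)
```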

RocksDB is also well suited to SSDs (flash storage) and claims fast performance. The team was excited when MongoDB opened up to other storage engines in the summer of 2014. For a period of time, MongoDB plus RocksDB was a fast combination. Then MongoDB decided to acquire WiredTiger (a competitor) in December 2014 to contribute to the performance, scalability, and hardware efficiency of MongoDB. That left RocksDB out of the official engagement with MongoDB. But the team built something called MongoRocks, which claims to be very fast. It seems several MongoDB users prefer MongoRocks over the native combination of MongoDB with WiredTiger.

Several users of RocksDB talked about their experience, especially in the IoT world, where sensor data can be processed at the edge (ingestion, aggregation, and some transformation) before being sent to the cloud servers. The only issue I saw is that there is no "real" owner of RocksDB as a deliverable solution. There is no equivalent of a Cloudera (for Hadoop) or a Confluent (for Kafka) that can provide value additions and support for the user base. It is all open-source download-and-do-your-own-stuff for now, so serious production-level deployment is still a risky affair. For now, it is a developer's play tool.

In Memoriam – Ed Lassettre

I was out of the country when my old colleague from my IBM days, Ed Lassettre, passed away last November. I only found out about his passing earlier this month, from a mutual friend at IBM Almaden Research. Ed was one of the best computer software professionals I knew and respected.

He was at IBM's Santa Teresa Lab (now called the Silicon Valley Lab) when I started there back in 1981 after my five-year stint at IBM Canada. That year he was promoted to Senior Technical Staff Member (STSM), the very first at the lab to receive that honor. Subsequently he became an IBM Fellow, the highest technical honor. His reputation as one of the key software engineers for IBM's MVS operating system preceded him. Ed had spent a few years at IBM's Poughkeepsie Lab in upstate New York. He did his undergraduate and graduate studies in math at Ohio State University. He had deep insights into the intricacies of high-performance computing systems. When we were building DB2 at the IBM lab, Ed provided guidance on its interface with the operating system.

Subsequently I went to IBM's Austin Lab for two years in the mid-1980s to lay the foundation of the DB2 product for the PC (which at the time lacked the processing power and memory of the large mainframes). Hence our design had to accommodate those limitations. The IBM executives wanted someone to audit our design before giving the green light for development. I could not think of a better person than Ed Lassettre to do that. At my request Ed spent some time on it and gave a very positive report on our design. He had great credibility in the technical community. Many times I sought his views on technical matters, and he provided timely advice. His wisdom was complemented by tremendous humility, a rare trait in our industry.

I left IBM for Oracle back in 1992 and lost touch with Ed. Later I found out that he had retired from IBM and joined Microsoft Research. He was a good friend of the late Jim Gray, who was also at Microsoft Research at the time. Ed retired from Microsoft in 2013 at the age of 79! He was quite well known in the HPTC (High Performance Technical Computing) world.

RIP, Ed Lassettre, a great computer scientist and friend! You will be missed.

Strategic Technologies as per Gartner

I have known Gartner for decades, going back to my IBM and Oracle days. Even though I have observed how they invent new terms for stuff we already know (a bit annoying, but I guess that's their business), they do a decent job of capturing key strategic trends.

In a recent article, I saw ten strategic technology trends, grouped as follows: the first 3 address merging the physical and virtual worlds and the emergence of the digital mesh (their new phrase); the next 3 cover the algorithmic business, where much happens in the background without people being directly involved; the final 4 address the new architecture and platform trends needed to support the digital and algorithmic business.

The first 3 trends:

  • The Device Mesh – In the post-mobile world, the focus shifts to the mobile user, who is surrounded by a mesh of devices, each with an IP address and always communicating.
  • Ambient User Experience – A seamless flow of experience across a shifting set of devices. Think of the experience moving across IoT devices, automobiles, smartphones, etc.
  • 3D Printing Materials – This will require assembly-line and supply-chain processes to be rethought to exploit 3D printing.

The next 3 trends:

  • Information of Everything – This information goes beyond text, audio, and video to include sensory and contextual data. How do you bring meaning to a chaotic deluge of information? Much work is needed here.
  • Advanced Machine Learning – Deep Neural Networks (DNNs) go beyond classic computing and information management to create systems that can learn to perceive the world on their own. DNNs (an advanced form of machine learning applicable to large, complex datasets) will make smart machines "intelligent".
  • Autonomous Agents & Things – Examples include robots, autonomous vehicles, virtual personal assistants, and smart advisors.

The final 4 trends:

  • Adaptive Security Architecture – How to combat the hacker industry beyond perimeter defense and rule-based security?
  • Advanced Systems Architecture – This is what Gartner said: "Fueled by field-programmable gate arrays (FPGAs) as an underlying technology for neuromorphic architectures, there are significant gains such as being able to run at speeds of greater than a teraflop with high-energy efficiency".
  • Mesh App and Service Architecture – Monolithic, linear application designs like the 3-tier architecture are giving way to a loosely coupled, integrative approach. Containers (e.g. Docker) are emerging as a critical technology for enabling agile development and microservice architectures. What is needed is back-end cloud scalability and a front-end device-mesh experience.
  • Internet of Things Platforms – Management, security, and integration, plus standards, are needed for IoT platforms to succeed.

These are all known areas, but I liked the way Gartner grouped them in a logical sequence.