
Big Data & Analytics – what’s ahead?

Recently I read a statement to this effect: as we end 2017 and look ahead to 2018, the topics top of mind for data professionals are the growing range of data-management mandates, including the EU’s new General Data Protection Regulation (GDPR) aimed at personal data and privacy; the growing role of artificial intelligence (AI) and machine learning in enterprise applications; the need for better security in light of the onslaught of hacking cases; and the ability to leverage the expanding Internet of Things.

Here are the key areas as we look ahead:

  • Business owners demand outcomes – not just a data lake that stores all kinds of data in its native format behind a set of APIs.
  • Data science must produce results – “play and explore” is not enough. Learn to ask the right questions, and make analytics visible through search and visualization.
  • Everyone wants real time – days and weeks are too slow; businesses need immediate, actionable outcomes, with analytics and recommendations based on real-time data.
  • Everyone wants AI (artificial intelligence) – “tell me what I don’t know.”
  • Systems must be secure – no longer a mere platitude.
  • ML (machine learning) and IoT at massive scale – thousands of ML models, with model accuracy that can be trusted.
  • Blockchain – businesses need to understand its full potential, since it is not merely a transformational but a foundational technology shift.

In the area of big data, a combination of new and long-established technologies is being put to work. Hadoop and Spark are expanding their roles within organizations. NoSQL and NewSQL databases bring their own unique attributes to the enterprise, while in-memory technologies (such as Redis) are increasingly being used to deliver insights to decision makers faster. And through it all, tried-and-true relational databases continue to support many of the most critical enterprise data environments.
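
To make the in-memory point concrete, here is a minimal sketch (not tied to any particular vendor) of the caching pattern that a store like Redis enables, using the redis-py client; the Redis host, the key name, and the expensive_aggregation() helper are illustrative assumptions:

```python
# Minimal sketch of serving insights faster from an in-memory store (Redis).
# Assumes a local Redis instance and the redis-py client; the key name and
# expensive_aggregation() are illustrative, not from the post.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def expensive_aggregation():
    # Placeholder for a slow warehouse or relational query.
    return {"daily_active_users": 12345}

def get_dashboard_metrics():
    cached = r.get("dashboard:metrics")
    if cached is not None:
        return json.loads(cached)                            # served from memory
    result = expensive_aggregation()                          # slow path
    r.setex("dashboard:metrics", 300, json.dumps(result))     # cache for 5 minutes
    return result

print(get_dashboard_metrics())
```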

The cloud is becoming the de facto deployment choice for both users and developers. Serverless technology, in the form of FaaS (Function as a Service), is seeing rapid adoption among developers. According to IDC, enterprises are undergoing IT transformation as they rethink their business operations, including how they use information and what technology to deploy. In line with that transformation, nearly 80% of large organizations already have a hybrid cloud strategy in place. The modern application architecture, sometimes referred to as SMAC (social, mobile, analytics, cloud), is becoming standard everywhere.
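
For readers new to FaaS, here is a minimal sketch of what a serverless function looks like, modeled loosely on the AWS Lambda Python handler convention; the event fields and the logic are illustrative assumptions, not a production template:

```python
# Minimal sketch of a FaaS-style function, following the AWS Lambda Python
# handler convention handler(event, context). The event shape and the
# business logic here are illustrative assumptions.
import json

def handler(event, context):
    # The platform invokes this on demand (e.g., an HTTP request or a queue
    # message); the developer provisions and manages no servers.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Local smoke test; in production the cloud runtime supplies event/context.
if __name__ == "__main__":
    print(handler({"name": "developer"}, None))
```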

DBaaS (database as a service) is still not as widespread as other cloud services. Microsoft is arguably making the strongest explicit claim for a converged database system with its Azure Cosmos DB offering, a DBaaS that supports four data models – key-value, column-family, document, and graph. Databases have been slower to migrate to the cloud than other elements of computing infrastructure, mainly for security and performance reasons, but DBaaS adoption is poised to accelerate. Some of these cloud-based systems – Cosmos DB, Google’s Spanner, and AWS DynamoDB – now offer significant advantages over their on-premises counterparts.

One thing is for sure: big data and analytics will continue to be vibrant and exciting in 2018.


AWS re:Invent 2017

In a few decades, when the history of computing is written, a major section will be devoted to cloud computing. The headline of the first chapter will read something like this: how did a dot-com-era book-selling company become the father of cloud computing? While giants like IBM, HP, and Microsoft were sleeping, Amazon started a new business in 2006 called AWS (Amazon Web Services). I still remember spending a couple of hours one afternoon back in 2004 with the CTO of Amazon (not Werner Vogels, but his predecessor, a Dutch gentleman) discussing the importance of SOA (Service-Oriented Architecture). When I asked why he was interested, he mentioned that CEO Jeff Bezos had given marching orders to monetize the under-utilized infrastructure in Amazon’s data centers. Thus AWS arrived in 2006 with S3 for storage and EC2 for computing.

Advance the clock by 11 years. At this week’s AWS re:Invent event in Las Vegas, it was amazing to listen to Andy Jassy, CEO of AWS, who gave a 2.5-hour keynote on how far AWS has come. There were 43,000 people attending the event (in its 6th year) and another 60,000 tuned in via the web. AWS has a revenue run rate of $18B, growing 42% year over year, and its profit margin is over 60%, contributing significantly to Amazon’s bottom line. It has hundreds of thousands of customers, from young web startups to Fortune 500 enterprises in all verticals, and the strongest partner ecosystem. Gartner says AWS has a market share of 44.1% (39% last year), larger than all the others combined. Customers like Goldman Sachs, Expedia, and the National Football League were on stage showing how they have fully switched to AWS for all their development and production.

Andy covered four major areas – computing, database, analytics, and machine learning – with many new service announcements. AWS already offers over 100 services. Here is a brief overview.

  • Computing – three major areas: EC2 instances (including new GPU instances for AI), containers (existing services such as Elastic Container Service and new ones like EKS – Elastic Kubernetes Service), and serverless (Function as a Service via Lambda). The last one, serverless, has gained traction fast in just the last 12 months.
  • Database – AWS is starting to pose a real challenge to incumbents like Oracle, IBM, and Microsoft. Its main offerings are Aurora (RDBMS) for transaction processing, DynamoDB, and Redshift. Andy announced Aurora Multi-Master for replicated reads and writes across data centers and availability zones; he claims it is the first RDBMS to scale out across multiple data centers and is a lot cheaper than Oracle’s RAC solution. They also announced Aurora Serverless for on-demand, auto-scaling application development. For NoSQL, AWS has DynamoDB (a key-value store) and Amazon ElastiCache for in-memory caching. Andy announced DynamoDB Global Tables, a fully managed, multi-master, multi-region database for customers with global users (such as Expedia); basic DynamoDB access is illustrated in the sketch after this list. Another new service, Amazon Neptune, was announced for highly connected data (a fully managed graph database). Redshift remains the offering for data warehousing and analytics.
  • Analytics – AWS provides a data lake service on S3 that enables API access to any data in its native form, with services like Athena, Glue, and Kinesis to work with the data lake. Two new services were announced: S3 Select (a new API to select and retrieve a subset of data from within an S3 object) and Glacier Select (to query less frequently used data in the archives).
  • Machine Learning – Amazon claims it has been using machine learning for 20 years in its e-commerce business to understand user preferences. A new service called Amazon SageMaker was announced, which brings together storage, data movement, hosted notebook management, and roughly ten of the most commonly used ML algorithms (e.g., time-series forecasting). It also accommodates popular libraries like TensorFlow, Apache MXNet, and Caffe2. Once you pick an algorithm, training is much easier with SageMaker, and deployment then happens with one click. Their chief AI fellow, Dr. Matt Wood, demonstrated on stage how this is all done. They also announced AWS DeepLens, a video camera for developers with a built-in computer vision model, enabling facial and image recognition in apps. Other new services announced include Amazon Kinesis Video Streams (video ingestion), Amazon Transcribe (automatic speech recognition), Amazon Translate (between languages), and Amazon Comprehend (fully managed NLP – natural language processing).
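
As promised above, here is a small sketch of plain DynamoDB key-value access using the boto3 SDK; the table name, region, and attribute names are illustrative assumptions, and the table is assumed to already exist:

```python
# Sketch of basic DynamoDB key-value access with boto3, the access pattern
# behind features such as Global Tables. Table name, region, and attributes
# are illustrative assumptions; the table must already exist.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-west-2")
table = dynamodb.Table("user_sessions")

# Write an item keyed by user_id.
table.put_item(Item={"user_id": "u-100", "last_page": "/checkout", "ttl": 1700000000})

# Read it back with a strongly consistent read within this region.
response = table.get_item(Key={"user_id": "u-100"}, ConsistentRead=True)
print(response.get("Item"))
```

With Global Tables, the same put_item/get_item calls run against a table that DynamoDB replicates across regions behind the scenes; the application code does not change.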

It was a very impressive and powerful presentation, and it shows how deeply committed and dedicated the AWS team is. Microsoft’s Azure cloud, Google’s cloud, IBM’s cloud, and Oracle’s cloud all seem way behind AWS in breadth and depth. It will be to customers’ benefit to have a couple of AWS alternatives as we march along the cloud computing highway. Who wants single-vendor lock-in?

Blockchain 101

There is a lot of noise about blockchain these days. Back in 2015, The Economist ran a whole special on blockchain, saying, “The ‘blockchain’ technology that underpins bitcoin, a sort of peer-to-peer system of running a currency, is presented as a piece of innovation on a par with the introduction of limited liability for corporations, or private property rights, or the internet itself.” It all started after the 2008 financial crisis, when a seminal paper written by Satoshi Nakamoto on Halloween day (October 31, 2008) caught the attention of many (the real identity of the author is still unknown). The paper was titled “Bitcoin: A Peer-to-Peer Electronic Cash System.” Thus began a cash-less, bank-less world of money exchange over the internet using blockchain technology. Bitcoin’s value has exceeded $6,000, its market cap is over $100B, and VCs are rushing to invest in cryptocurrencies like never before.

The September 1, 2017 issue of Fortune magazine screamed “Blockchain Mania” on its cover. The article said, “A blockchain is a kind of ledger, a table that businesses use to track credits and debits. But it’s not just any run-of-the-mill financial database. One of blockchain’s distinguishing features is that it concatenates (or ‘chains’) cryptographically verified transactions into sequences of lists (or ‘blocks’). The system uses complex mathematical functions to arrive at a definitive record of who owns what, when. Properly applied, a blockchain can help assure data integrity, maintain auditable records, and render contracts into programmable software. It’s a ledger, but on the bleeding edge.”

So welcome to the new phase of network computing, where we switch from the transfer of information to the transfer of value. Just as TCP/IP became the fundamental protocol for communication and helped create today’s internet, with email (SMTP) as the first killer app, blockchain will enable the exchange of assets, with Bitcoin for money as the first app. So get used to new terms like cryptocurrency, DLT (distributed ledger technology), nonce, Ethereum, smart contracts, and pseudonymity. The “information internet” becomes the “value internet.” Patrick Byrne, CEO of Overstock, said, “Over the next decade, what the internet did to communications, blockchain is going to do to about 150 industries.” And in a recent Harvard Business Review article, authors Joi Ito, Neha Narula, and Robleh Ali wrote, “The blockchain will do to the financial system what the internet did to media.”

The key elements of blockchain are the following:

  • Distributed Database – each party on a blockchain has access to the entire database and its complete history. No single party controls the data or the information, and each party can verify records without an intermediary.
  • Peer-to-Peer Transmission (P2P) – communication happens directly between peers instead of through a central node.
  • Transparency with Pseudonymity – every transaction and its associated value are visible to anyone with access to the system. Each node/user has a unique 30-plus-character alphanumeric address, and users can choose to remain anonymous or provide proof of identity. Transactions occur between blockchain addresses.
  • Irreversibility of Records – once transactions are entered in the database, they cannot be altered, because they are linked to every transaction record that came before them (hence the term “chain”).
  • Computational Logic – blockchain transactions can be tied to computational logic and, in essence, programmed.

The heart of the system is a distributed database that is write-once, read-many, with a copy replicated at each node. It is transaction processing in a highly distributed network with guaranteed data integrity, security, and trust. Blockchain also provides an automated, secure coordination system with remuneration and tracking. Even though it started with money transfer via Bitcoin, the underpinnings can be applied to any asset. The need for a central coordinating agency such as a bank goes away. Assets such as mortgages, bonds, stocks, loans, home titles, auto registrations, birth and death certificates, passports, visas, and so on can all be exchanged without intermediaries. The February 2017 HBR article said, “Blockchain is a foundational technology (not disruptive). It has the potential to create new foundations for our economic & social systems.”
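
To make the “chain” idea concrete, here is a toy sketch in Python of a hash-linked ledger; it leaves out consensus, proof-of-work, and networking entirely, and the transactions are made up:

```python
# Toy sketch of the hash-chain idea behind a blockchain ledger: each block
# stores the hash of the previous block, so altering any past record breaks
# every later link. Consensus, proof-of-work, and networking are omitted.
import hashlib
import json
import time

def block_hash(block):
    # Deterministic hash of the block's contents.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(transactions, prev_hash):
    return {"timestamp": time.time(), "transactions": transactions, "prev_hash": prev_hash}

# Build a small chain.
chain = [make_block([], prev_hash="0" * 64)]  # genesis block
chain.append(make_block([{"from": "alice", "to": "bob", "amount": 5}], block_hash(chain[-1])))
chain.append(make_block([{"from": "bob", "to": "carol", "amount": 2}], block_hash(chain[-1])))

def verify(chain):
    # Every block must reference the hash of the block before it.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))

print(verify(chain))                          # True
chain[1]["transactions"][0]["amount"] = 500   # tamper with history
print(verify(chain))                          # False: the chain is broken
```

Real blockchains add a consensus mechanism (such as proof of work) on top of this linking, so that no single party can quietly rewrite the chain; that is what removes the need for a central intermediary.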

We did not get into the depths of the technology here, but plenty of literature is available for you to read. Major vendors such as IBM, Microsoft, Oracle, and HPE are offering blockchain as an infrastructure service for enterprise asset management.

Splice Machine – What is it?

If you have never heard of Splice Machine, don’t worry; you are in the company of many. So I decided to listen to a webinar last week whose announcement promised the following: learn about the benefits of a modern IoT application platform that can capture, process, store, analyze, and act on the large streams of data generated by IoT devices. The demonstration would include:

  • High Performance Data Ingestion
  • Analytics and Transformation on Data-In-Motion
  • Relational DBMS, Supporting Hybrid OLTP and OLAP Processing
  • In-Memory and Non-Volatile, Row-based and Columnar Storage mechanisms
  • Machine Learning to support decision making and problem resolution

That was a tall order. Gartner has a new term for this, HTAP (Hybrid Transactional and Analytical Processing); Forrester uses “translytical” to describe a platform where you can do both OLTP and OLAP. I wrote a blog post on translytical databases almost two years ago. I did attend the webinar, and it was quite impressive. The only confusion was the liberal use of IoT in the marketing slogan; by that they mean to emphasize streaming data (ingest, store, manage).

On Splice Machine’s website, you see four things: hybrid RDBMS, ANSI SQL, ACID transactions, and real-time analytics. A white paper advertisement says, “Your IoT applications deserve a better data platform.” Looking at the advisory board, I recognized three names – Roger Bamford (ex-Oracle and an investor), Ken Rudin (ex-Oracle), and Marie-Anne Neimat (ex-TimesTen). The company is funded by Mohr Davidow Ventures and InterWest Partners, among others.

There is a real need to bring the worlds of OLTP (transactional workloads) and analytics (OLAP workloads) together into a common platform. They have been separate for decades, which is how data warehouses, MDM, OLAP cubes, and the like got started. The movement of data between the OLTP and OLAP worlds has been handled by ETL vendors such as Informatica. With the popularity of Hadoop, the DW/analytics world is now crowded with terms like data lake, ELT (load first, then transform), data curation, and data unification. A new architecture called Lambda (not to be confused with AWS Lambda for serverless computing) claims to unify the two worlds – OLTP and real-time streaming and analytics.

Into this world comes Splice Machine with its scale-out data platform. You can do standard ACID-compliant OLTP processing, ingest data via Spark Streaming and Kafka topics, query via ANSI SQL, and run analytical workloads without ETL. They even claim support for procedural languages like PL/SQL to accommodate Oracle workloads. With their support for machine learning, they demonstrated predictive analytics. The current focus is on verticals like healthcare, telco, retail, and finance (Wells Fargo).
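
To illustrate the HTAP idea (and only the idea), here is a generic Python sketch of ingesting events from Kafka into a SQL table and then running an analytical query on that same table, with no ETL step in between. This is not Splice Machine’s actual client API; the ODBC data source, topic, table, and columns are all assumptions:

```python
# Generic sketch of the HTAP pattern: ingest events from a Kafka topic into a
# SQL table, then run an analytical query on the same table without a separate
# ETL step. NOT Splice Machine's actual client API; the DSN, topic, table, and
# columns are illustrative assumptions.
import json

import pyodbc                      # generic ODBC database client
from kafka import KafkaConsumer    # kafka-python

conn = pyodbc.connect("DSN=htap_db")   # hypothetical ODBC data source
cur = conn.cursor()

consumer = KafkaConsumer("sensor_readings",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode()))

for i, msg in enumerate(consumer):
    event = msg.value
    # OLTP side: insert each event as it arrives.
    cur.execute("INSERT INTO readings (device_id, ts, temperature) VALUES (?, ?, ?)",
                event["device_id"], event["ts"], event["temperature"])
    conn.commit()
    if i >= 999:   # stop after a small batch for this sketch
        break

# OLAP side: analytical query over the freshly ingested rows, no ETL.
cur.execute("SELECT device_id, AVG(temperature) FROM readings GROUP BY device_id")
for row in cur.fetchall():
    print(row)
```

In a real HTAP deployment the ingest would typically flow through Spark Streaming or a native Kafka connector rather than row-by-row inserts; the point here is only that the transactional writes and the analytical query hit the same store.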

In the cacophony of big data and IoT noise, it is hard to separate fact from fiction, but I do see a role for a “unified” approach like Splice Machine’s. As always, the proof is in the pudding: real-life customer deployments with performance numbers will be needed to prove the hypothesis and their claim of 10x the speed at one-fourth the cost.

Data Unification at scale

The term data unification is new to the big data lexicon, pushed by a variety of companies such as Talend, 1010data, and Tamr. Data unification deals with the domain known as ETL (extraction, transformation, loading), which originated in the 1990s when data warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications, typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into target databases – more specifically, an operational data store, data mart, or data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g., how many shoppers between ages 25 and 35 bought this item between May and July?), which is like driving a car while looking in the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands “predictive analytics” and “streaming analytics.”

During my IBM and Oracle days, ETL was left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players such as Informatica, DataStage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies today.

The ETL methodology consisted of constructing a global schema in advance; writing a program for each local data source to understand it and map it to the global schema; and then writing scripts to transform, clean (resolving homonym and synonym issues), and dedup (remove duplicates from) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
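
Here is a toy sketch of those classic ETL steps in Python with pandas: map one local source to a global schema, normalize synonym values, dedup, and load into a target. The column names, mappings, and SQLite target are illustrative assumptions:

```python
# Toy sketch of classic ETL: map a local source to a global schema, normalize
# synonyms, dedup, and load into a target. Column names, the mapping, and the
# SQLite "warehouse" are illustrative assumptions.
import sqlite3

import pandas as pd

# Extract: one local source with its own column names.
source = pd.DataFrame({
    "cust_nm": ["Acme Corp", "ACME Corp.", "Globex"],
    "st": ["CA", "California", "NY"],
})

# Transform: rename to the global schema and normalize synonym values.
column_map = {"cust_nm": "customer_name", "st": "state"}
state_synonyms = {"California": "CA", "New York": "NY"}
df = source.rename(columns=column_map)
df["state"] = df["state"].replace(state_synonyms)
df["customer_name"] = df["customer_name"].str.upper().str.rstrip(".")

# Dedup on the cleaned keys.
df = df.drop_duplicates(subset=["customer_name", "state"])

# Load into the target (here, a local SQLite "warehouse").
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("customers", conn, if_exists="append", index=False)
print(df)
```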

In the world of big data, this approach falls short. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources), but it gets extremely hard when the number of sources runs into the hundreds or even thousands. It gets worse still when you want to unify public data from the web with enterprise data.
  • The human labor needed to map each source to a master schema becomes costly and excessive. Machine learning is required here, with domain experts asked to augment the results where needed.
  • Real-time unification and analysis of streaming data cannot be handled by these solutions.

Another solution, the “data lake,” where you store disparate data in its native format, addresses only the “ingest” problem. It changes the order of ETL to ELT (load first, then transform), but it does not address the scale issues. The new world needs bottom-up, schema-last data unification in real time or near real time.

A typical data unification cycle goes like this: start with a few sources, try enriching the data with, say, X, and see if it works; if it fails, loop back and try again. Use enrichment to improve the results, do as much as possible automatically using machine learning and statistics, and iterate furiously, asking domain experts for help when needed. Otherwise the current approach of ETL or ELT gets very expensive.
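
Here is a minimal sketch of what such machine-assisted, schema-last matching might look like: propose matches by simple string similarity, auto-accept the confident ones, and route uncertain pairs to a domain expert. The thresholds and sample records are assumptions; a real system would use trained models rather than plain edit-distance-style similarity:

```python
# Minimal sketch of machine-assisted, schema-last record matching: propose
# matches by similarity, auto-accept confident ones, and route uncertain
# pairs to a domain expert. Thresholds and sample records are assumptions.
from difflib import SequenceMatcher
from itertools import combinations

records = ["International Business Machines", "IBM Corp", "I.B.M.",
           "Oracle Corporation", "Oracle Corp"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

AUTO_ACCEPT, ASK_EXPERT = 0.85, 0.5

for a, b in combinations(records, 2):
    score = similarity(a, b)
    if score >= AUTO_ACCEPT:
        print(f"MATCH      {a!r} ~ {b!r}  ({score:.2f})")
    elif score >= ASK_EXPERT:
        print(f"ASK EXPERT {a!r} ~ {b!r}  ({score:.2f})")
    # below the lower threshold: treat as distinct records and move on
```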

Oracle’s push into cloud solutions

I watched Larry Ellison’s keynotes at this week’s Oracle OpenWorld conference in San Francisco. Oracle is definitely serious about pushing its cloud offerings, even though it came in late; Oracle claims it has been working on them for almost ten years. The big push is at all three levels – SaaS, PaaS, and IaaS. The infrastructure-as-a-service layer claims faster and cheaper resources (computing, storage, and networking) to beat Amazon’s AWS, and Oracle makes a good point about better security for enterprises, given that security breaches have been happening with greater frequency lately. One comment I have is that AWS is beyond just IaaS; it is into PaaS as well (e.g., Docker services for DevOps). Oracle’s big advantage is in offering SaaS for all its application suites – ERP, HCM, and CRM (which it calls CX, for customer experience). This is not something AWS offers for the enterprise market, although apps like Salesforce and Workday are available on it. Microsoft has Dynamics as an ERP on its cloud.

I do agree that Oracle has the upper hand when it comes to database as a service. Larry showed performance numbers for AWS Redshift, Aurora, and DynamoDB compared to Oracle’s database (which, he claimed, is much faster). Oracle has a chance to beat AWS for serious enterprise-scale implementations, given its stronghold in that market. Most of these enterprises still run much of their software on-premises, and Oracle offers them an alternative: switch to the cloud version within their own firewall. Oracle also suggests the co-existence of on-prem and cloud solutions. The total switch-over to the cloud will take ten years or more, as confidence and comfort levels grow over time.

AWS has a ten-year lead here and has grown in scale and size; its current run rate is over $10B in revenue with hefty margins (over 50%). However, many clients complain about the high cost as they use more AWS services. Microsoft Azure and Google’s cloud services are marching fast to catch up, and most of the new-age web companies use AWS, so Oracle is better off focusing on the enterprise market, its stronghold. Not to discount IBM here, which is pushing its SoftLayer cloud solutions to enterprise customers. Mark Hurd of Oracle showed several examples of cloud deployments at large and mid-size companies as well. One interesting presence at OpenWorld yesterday was the chief minister (akin to a state governor) of the Indian state of Maharashtra (home to Mumbai). He signed a deal with Oracle to help implement cloud solutions that will turn many cities into “smart” cities and connect 29,000 villages digitally. This is a big win for Oracle and will set the stage for many other government bodies to follow suit.

I think more competition for AWS is welcome, as no one wants single-vendor lock-in. Mark Hurd said that by 2020 cloud solutions will dominate the enterprise landscape. Analysts are skeptical of Oracle’s claims against AWS, but an Oracle focused on the cloud is not to be taken lightly.

Jnan Dash

Stack Fallacy? What is it?

Back in January, TechCrunch published an article on this subject called “Stack Fallacy,” written by Anshu Sharma of Storm Ventures. Then today I read a Business Insider article on why Dropbox is failing, and the reason it gives is the Stack Fallacy. Sharma describes the Stack Fallacy as “the mistaken belief that it is trivial to build the layer above yours.”

Many companies trivialize the task of building the layers above their core-competency layer, and that leads to failure. Oracle is a good example: it thought it was no big deal to build applications after watching the success of SAP in the ERP layer, which was initially built on the Oracle database. I remember a meeting with Hasso Plattner, founder of SAP, back in the early 1990s when I was at Oracle. He pointed out that SAP was one of Oracle’s biggest customers at the time, and yet Oracle now competed with it. For lack of a better answer, we said that we were friends in the morning and foes in the afternoon, and welcomed him to the world of “co-opetition.” Subsequently SAP started moving off the Oracle database and was enticed by IBM to use DB2; eventually SAP built its own database (it bought Sybase and built the in-memory database HANA). Oracle’s applications were initially disasters; they were hard to use and did not quite meet customers’ needs. Oracle finally had to win the space by acquiring PeopleSoft and Siebel.

Today’s Business Insider article says, “…a lot of companies often overvalue their level of knowledge in their core business stack, and underestimate what it takes to build the technology that sits one stack above them.  For example, IBM saw Microsoft take over the more profitable software space that sits on top of its PCs. Oracle likes to think of Salesforce as an app that just sits on top of its database, but hasn’t been able to overtake the cloud-software space they compete in. Google, despite all the search data it owns, hasn’t been successful in the social-network space, failing to move up the stack in the consumer-web world. Ironically, the opposite is true when you move down the stack. Google has built a solid cloud-computing business, which is a stack below its search technology, and Apple’s now building its own iPhone chips, one of the many lower stacks below its smartphone device”.

With reference to Dropbox, the article says it underestimated what it takes to build apps a layer above (Mailbox, Carousel) and failed to understand its customers’ needs, while investing in less important areas such as the migration away from AWS. Dropbox, it argues, is at a phase where it needs to think more about users’ needs and about competing with the likes of Google and Box, rather than spending on “optimizing for costs or minor technical advantages.”

I am not sure I agree with that assessment. Providing efficient and cost-effective cloud storage is Dropbox’s core competency, and it is staying pretty close to that. The move away from AWS is clearly aimed at cost savings, as AWS can become a huge operational expense at scale and has its limitations on effective scaling. In some ways, Dropbox is expanding its lower layers for future hosting. Its focus on enterprise-scale cloud storage is the right approach, as opposed to Box or Google, whose focus is on consumers.

But the Stack Fallacy applies more to cases like Apple doing its own iPhone chips, or Dell wrongly going after big data. At Oracle the dictum used to be, “everything is a database problem” – if you have a hammer, then everything looks like a nail.