Category Archives: Big Data

Data Sharehouse?

This is yet another new term in our lexicon. The San Mateo, California-based startup Snowflake announced this week a new offering with this name, as a free add-on to the data warehouse it built for cloud computing. Companies using the technology, officially called Snowflake Data Sharing, can now share any part of their data warehouses with each other, subject to defined security policies and access controls.

Snowflake’s data sharehouse allows companies to provide direct access to structured and unstructured data without the need to copy the data to a new location. Current approaches include file sharing, electronic data interchange, application programming interfaces and email, but all of them have issues ranging from lack of security to cumbersome methods of providing data access to the right people. Jon Bock, Snowflake’s marketing chief, compared the difference between data sharing on Snowflake and other methods to the difference between streaming music and compact discs. “It looks [to the data recipient] just as if the data resides on their own data warehouse,” he said.

The catch is that every participant must be a Snowflake customer using Snowflake’s data warehouse in the cloud, so this is another way for the company to grow its market. We saw this approach in the 1990s, when exchanges were introduced by the likes of Oracle for B2B data interchange. That did not go very far. Cost was a big factor, of course, but agreeing on common formats and security policies for data exchange was another issue. Snowflake claims to solve this by having one source of truth in the cloud.

Of course, companies such as manufacturers and suppliers, or advertisers and publishers, have been sharing data for a long time, but doing so has been cumbersome with technologies like EDI (electronic data interchange, developed in the 1940s), email, file sharing, APIs and more. That kind of sharing takes time and was not designed for the current situation, in which businesses need live data processed in real time to keep a competitive edge.

According to Bob Muglia, Snowflake’s CEO (ex-Microsoft), the data sharehouse changes the game and democratizes the possibilities, because anyone can access the service. Rather than being charged a subscription fee, users pay only according to the amount of data they have processed. Snowflake’s data sharing service is free to data providers; data consumers pay for the compute resources they use. Not only that, but data providers and consumers make their arrangements independently of Snowflake Computing, which is only the infrastructure provider.

In an increasingly collaborative world there is little doubt that sharing data easily, and in real time, without sacrificing security, privacy, governance and compliance is of great value. Whether it will create entirely new markets has yet to be seen, but actionable data-driven insights are likely to be huge differentiators in the digital economy.

It is a clever move, but time will tell if this will enable smooth data exchange or create more chaos.

A conference in Bangalore

I was invited to speak at a conference called Solix Empower 2017 held in Bangalore, India on April 28th, 2017. It was an interesting experience. The conference focused on Big Data, Analytics, and Cloud. Over 800 people attended the one-day event with keynotes and parallel tracks on wide-ranging subjects.

I did three things. First, I was part of the inaugural keynote, where I spoke on “Data as the new Oxygen”, showing the emergence of data as a key platform for the future. I emphasized the new architecture of containers and microservices, on top of which sit machine learning libraries and analytic tool kits for building modern big data applications.

Then I moderated two panels. The first was titled “The rise of real-time data architecture for streaming applications” and the second was called “Top data governance challenges and opportunities”. In the first panel, the members came from Hortonworks, Tech Mahindra, and ABOF (Aditya Birla Fashion). Each member described the criticality of real-time analytics, where trends and anomalies are caught on the fly and action is taken within seconds or minutes. I learnt that for online e-commerce players like ABOF, a key challenge is identifying customers most likely to refuse goods delivered at their door (many do not have credit cards, hence there is COD, or cash on delivery). Such refusals cause major losses to the company, so they do trend analysis to identify specific customers who are likely to behave that way. By using real-time analytics, ABOF has been able to reduce such occurrences by about 4%, with significant savings. The panel also discussed technologies for data ingestion, streaming, and building stateful apps. Some comments were made on combining Hadoop/EDW (OLAP) plus streaming (OLTP) into one solution like the Lambda architecture.
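To make the Lambda architecture remark concrete, here is a minimal sketch of its core idea: a batch view recomputed over the full history, a real-time view covering only recent events, and a query that merges the two. The event fields, the customer IDs and the `refusals` helper are invented for illustration; nothing here reflects how ABOF or any panelist actually built their system.

```python
from collections import Counter

def count_refusals(events):
    """Count refused cash-on-delivery orders per customer."""
    return Counter(e["customer"] for e in events if e["refused"])

# Batch layer: recomputed periodically over the full history (say, nightly).
# The records below are made-up sample data.
history = [
    {"customer": "c1", "refused": True},
    {"customer": "c1", "refused": False},
    {"customer": "c2", "refused": True},
]
batch_view = count_refusals(history)

# Speed layer: covers only events that arrived after the last batch run.
recent = [{"customer": "c1", "refused": True}]
realtime_view = count_refusals(recent)

def refusals(customer):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view[customer] + realtime_view[customer]

print(refusals("c1"), refusals("c2"))
```

In a real deployment the batch view would typically come from something like Hadoop and the real-time view from a stream processor; the merge at query time is the part that stays the same.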

The second panel on data governance had members from Wipro, Finisar, Solix and Bharti AXA Insurance. These panelists agreed that data governance is no longer viewed as the “bureaucratic police and hence universally disliked” inside the company, and that it is taken seriously by upper management. Hence policies for metadata management, data security, data retirement, and authorization are being put in place. Accuracy of data is a key challenge. While the organizational structure for data governance (like a CDO, chief data officer) is still evolving, there remain many hard problems (especially for large companies with diverse groups).

It was interesting to have executives from Indian companies reflect on these issues that seem no different than what we discuss here. Big Data is everywhere and global.

Data Unification at scale

The term Data Unification is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010data, and Tamr. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), which emerged during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end target databases: more specifically, an operational data store, data mart, or data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g. how many shoppers between age 25-35 bought this item between May and July?). This was like driving a car while looking in the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands “predictive analytics” and “streaming analytics”.

During my IBM and Oracle days, ETL in the first phase was left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players such as Informatica, DataStage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of the following: construct a global schema in advance; for each local data source, write a program to understand the source and map it to the global schema; then write a script to transform, clean (homonym and synonym issues) and dedup (get rid of duplicates) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is used today for data unification problems. The term MDM (Master Data Management) points to a master representation of all enterprise objects, to which everybody agrees to conform.
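The schema-first steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the global schema, the two sources (`crm_rows`, `web_rows`), their field names and the synonym table are all hypothetical, and a real pipeline would read from databases rather than in-memory lists.

```python
from typing import Callable

# Hypothetical global schema, agreed on in advance (schema-first).
GLOBAL_SCHEMA = ("customer_name", "city", "amount")

# One hand-written mapper per source: local field names -> global schema.
def map_crm(row: dict) -> dict:
    return {"customer_name": row["name"], "city": row["town"], "amount": row["total"]}

def map_web(row: dict) -> dict:
    return {"customer_name": row["customer"], "city": row["city"], "amount": row["amt"]}

# Illustrative synonym table used during cleaning.
CITY_SYNONYMS = {"nyc": "new york", "sf": "san francisco"}

def clean(record: dict) -> dict:
    """Normalize case and whitespace, resolve synonyms, coerce types."""
    name = " ".join(record["customer_name"].lower().split())
    city = record["city"].strip().lower()
    city = CITY_SYNONYMS.get(city, city)
    return {"customer_name": name, "city": city, "amount": float(record["amount"])}

def dedup(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each record, comparing all schema fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in GLOBAL_SCHEMA)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def run_etl(sources: list[tuple[list[dict], Callable[[dict], dict]]]) -> list[dict]:
    """Extract and map every source, then clean and dedup the result."""
    extracted = [mapper(row) for rows, mapper in sources for row in rows]
    return dedup([clean(r) for r in extracted])

crm_rows = [{"name": "Ann  Lee", "town": "NYC", "total": "10.5"}]
web_rows = [{"customer": "ann lee", "city": "New York", "amt": 10.5},
            {"customer": "Bob Ray", "city": "SF", "amt": 3}]
warehouse = run_etl([(crm_rows, map_crm), (web_rows, map_web)])
print(warehouse)
```

Note that every new source needs its own hand-written mapper, which is where the cost of this approach grows.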

In the world of Big Data, this approach is very inadequate. Why?

  • data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources,..), but gets extremely hard with sources that can be hundreds or even thousands. This gets worse when you want to unify public data from the web with enterprise data.
  • human labor to map each source to a master schema gets to be costly and excessive. Here machine learning is required and domain experts should be asked to augment where needed.
  • real-time data unification of streaming data and analysis cannot be handled by these solutions.

Another solution, called a “data lake”, where you store disparate data in its native format, seems to address the “ingest” problem only. It tries to change the order of ETL to ELT (first load, then transform). However, it does not address the scale issues. The new world needs bottom-up data unification (schema-last) in real time or near real time.

The typical data unification cycle can go like this – start with a few sources, try enriching the data with say X, see if it works, if you fail then loop back and try again. Use enrichment to improve and do everything automatically using machine learning and statistics. But iterate furiously. Ask for help when needed from domain experts. Otherwise the current approach of ETL or ELT can get very expensive.
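As a rough sketch of that cycle, the snippet below proposes column mappings automatically and only asks a domain expert about the ones it is unsure of. Plain string similarity from Python's `difflib` stands in for the machine learning and statistics a real product would use; the column names, the 0.6 threshold and the `ask_expert` callback are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity between two column names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def propose_mapping(source_cols, known_cols, threshold=0.6):
    """Map each source column to its closest known column, if close enough."""
    mapping, needs_review = {}, []
    for col in source_cols:
        best = max(known_cols, key=lambda k: similarity(col, k))
        if similarity(col, best) >= threshold:
            mapping[col] = best
        else:
            needs_review.append(col)  # too uncertain: leave it for a human
    return mapping, needs_review

def unify(source_cols, known_cols, ask_expert, threshold=0.6):
    """One iteration of the cycle: automate what we can, ask for the rest."""
    mapping, needs_review = propose_mapping(source_cols, known_cols, threshold)
    for col in needs_review:
        mapping[col] = ask_expert(col)
    return mapping

known = ["customer_name", "city", "amount"]
auto, review = propose_mapping(["cust_name", "amt", "town"], known)

# Hypothetical expert: here just a lookup table standing in for a person.
expert_answers = {"town": "city"}
final = unify(["cust_name", "amt", "town"], known, expert_answers.get)
print(auto, review, final)
```

Each expert answer could be fed back as a new known name or training example, so later iterations need less human help.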

The end of Cloud Computing?

A provocative title for sure when everyone thinks we just started the era of cloud computing. I recently listened to a talk by Peter Levine, general partner at Andreessen Horowitz on this topic which makes a ton of sense. The proliferation of intelligent devices and the rise of IoT (Internet of Things) lead us to a new world beyond what we see today in cloud computing (in terms of scale).

I have said many times that the onset of cloud computing was like going back to the future of centralized computing. We had IBM mainframes dominating the centralized computing era during the 1960s and 1970s. The introduction of PCs created the world of client-server computing (remember the Wintel duopoly?) from the 1980s till 2000. Then the popularity of mobile devices started the cloud era in 2005, thus taking us back to centralized computing again. The text message I send you does not go from my device to your device directly, but gets to a server somewhere in the cloud first and then to your phone. The trillions of smart devices forecast to appear as sensors in automobiles, home appliances, airplanes, drones, engines, and almost anything you can imagine (like in your shoe) will drastically change the computing paradigm again. Each of these “edge intelligent devices” cannot go back and forth to the cloud for every interaction. Rather, they would want to process data at the edge to cut down latency. This brings us back to a new form of the “distributed computing” model, a vastly expanded version of the “PC era”.

Peter emphasized that the cloud will continue to exist, but its role will change from being the central hub to a “learning center” where curated data from the edge (only relevant data) resides in the cloud. The learning gets pushed back to the edge so that the edge gets better at its job. The edge of the cloud does three things: sense, infer, and act. The sense level handles massive amounts of data, as in a self-driving car (10GB per mile), making it like a “data center on wheels”. The sheer volume of data is too much to push back to the cloud. The infer piece is all machine learning and deep learning to detect patterns and improve accuracy and automation. Finally, the act phase is all about taking actions in real time. Once again, the cloud plays the central role as a “learning center” and the custodian of important data for the enterprise.
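A toy sketch of the sense, infer, act loop may help. Everything in it is an assumption made for illustration: the “model” is reduced to a single threshold, the readings and action names are invented, and `cloud_learn` is a placeholder for whatever training the cloud would really do on the curated data.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeDevice:
    threshold: float                             # the "model" pushed down from the cloud
    outbox: list = field(default_factory=list)   # curated data destined for the cloud

    def step(self, reading: float) -> str:
        """Handle one sensed reading entirely at the edge."""
        anomalous = reading > self.threshold     # infer
        if anomalous:
            self.outbox.append(reading)          # only relevant data goes upstream
            return "slow_down"                   # act immediately, no cloud round trip
        return "continue"

def cloud_learn(curated):
    """Placeholder for cloud-side learning: here, just the mean of curated readings."""
    return sum(curated) / len(curated)

device = EdgeDevice(threshold=0.8)
actions = [device.step(r) for r in [0.2, 0.9, 0.3, 0.95]]   # sense

# The cloud learns from the curated data and pushes the result back to the edge.
device.threshold = cloud_learn(device.outbox)
print(actions, device.outbox, device.threshold)
```

The point of the design is that four readings were sensed but only two ever left the device.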

Given the sheer volume of data created, peer-to-peer networks will be utilized to lessen load on core network and share data locally. The challenge is huge in terms of network management and security. Programming becomes more data-centric, meaning less code and more math. As the processing power of the edge devices increases, the cost will come down drastically. I like his last statement that the entire world becomes the domain of IT meaning we will have consumer-oriented applications with enterprise-scale manageability.

This is exciting and scary. But whoever could have imagined the internet in the 1980s or the smartphone during the 1990s, let alone self-driving cars?

IoT Analytics – A panel discussion

I was invited to participate in a panel called “IoT Analytics” last Thursday, March 23rd. This was organized for the IoT Global Council by Erick Schonfeld of Traction Technology Partner (New York). Besides me there were two other speakers: Brandon Cannaday, cofounder and chief product officer of Losant, and Patrick Stuart, head of products at SkyCatch. For those of you not familiar with IoT, it stands for Internet of Things. There is another term, IIoT, for Industrial Internet of Things. IoT has been in the lexicon for the last few years, signifying the era of “pervasive computing” where devices with an IP address can be everywhere: the fridge, microwave, thermostats, door knobs, cars, airplanes, electric motors, various sensors, and so on, constantly sending data. The phrases “connected home” and “connected car” are an upshot of the IoT phenomenon. However, Gartner showed IoT to be at the peak of the “hype cycle” a couple of years back.

I emphasized the “pieces of the puzzle”, or the components of IoT Analytics: data ingestion at scale, handling the streaming data pipeline, data curation and unification, and storing the results in a highly scalable NoSQL data store, as the steps before analytics can happen. Just dumping everything into a Hadoop data lake only addresses 5% of the problem (data ingestion). Transforming the data and curating it to make sense is a non-trivial step. Then I spoke about analytics, which has several components: descriptive (what happened and why?), predictive (what is probably going to happen?), and prescriptive (what should I do about it?). Streaming analytics must filter, aggregate, enrich, and analyze a high throughput of data from disparate sources to identify patterns, detect urgent situations (like a temperature spike in an engine), and automate immediate action in real time.
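The engine temperature example can be sketched as a small sliding-window check over a stream. This is a simplification under stated assumptions: the window size, the 10-degree jump and the sample readings are invented, and a production system would run this kind of logic inside a stream processor rather than a Python loop.

```python
from collections import deque
from statistics import mean

def detect_spikes(readings, window=3, jump=10.0):
    """Flag readings that exceed the recent moving average by more than `jump`."""
    recent = deque(maxlen=window)   # aggregate: sliding window of the last samples
    alerts = []
    for t, value in enumerate(readings):
        if value is None:           # filter: drop missing samples
            continue
        if len(recent) == window and value - mean(recent) > jump:
            alerts.append((t, value))   # urgent situation: hand off for action
        recent.append(value)
    return alerts

# Made-up engine temperatures, with one dropped sample and one spike.
temps = [70, 71, None, 69, 70, 95, 72]
alerts = detect_spikes(temps)
print(alerts)
```

Enrichment (for example, attaching the engine's ID or location to each alert) would slot in where the alert is created.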

Patrick of SkyCatch showed how they are serving the construction industry in taking images (via drones) and accurately creating “earth maps” for self-driving bulldozers, thus saving human labor cost. Another example was taking images of actual progress in large construction sites and contrasting it against plan, to show offsets, thus detecting delays and taking corrective actions in time.

Brandon of Losant showed the example of a large utility company in Australia that supplies high-powered (expensive) pumps with sensors. By collecting data from the sensors and monitoring it centrally, they can identify problems and notify the maintenance teams to take corrective action. Previously they had to fly people around for maintenance, and the new IoT analytics has saved the company a lot of cost. Both are startup companies in the IoT Analytics space and are tackling immediate issues in real time.

It was a good panel and I learnt a lot from my co-panelists.

The resurgence of AI/ML/DL

We have been seeing a sudden rise in the deployment of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). It looks like the long “AI winter” is finally over.

  • According to IDC, AI-related hardware, software and services business will jump from $8B this year to $47B by 2020.
  • I have also read comments like, “AI is like the Internet in the mid 1990s and it will be pervasive this time”.
  • According to Andrew Ng, chief scientist at Baidu, “AI is the new electricity. Just as 100 years ago electricity transformed industry after industry, AI will now do the same.”
  • Peter Lee, co-head at Microsoft Research said,  “Sales teams are using neural nets to recommend which prospects to contact next or what kind of products to recommend.”
  • IBM Watson used AI in 2011, not DL. Now all 30 components are augmented by DL (investment from $500M – $6B in 2020).
  • Google had 2 DL projects in 2012, now it is more than 1000 (Search, Android, Gmail, Translation, Maps, YouTube, Self-driving cars,..).

It is interesting to note that AI was discussed by Alan Turing in a paper he wrote back in 1950, suggesting that it is possible to build machines with true intelligence. Then in 1956, John McCarthy organized a conference at Dartmouth and coined the phrase Artificial Intelligence. Much of the next three decades saw little activity, hence the phrase “AI Winter”. In 1997, IBM’s Deep Blue won its chess match against Kasparov. During the last few years, we saw deployments such as Apple’s Siri, Microsoft’s Cortana, and IBM’s Watson (beating Jeopardy game show champions in 2011). In 2014, the DeepMind team used a deep learning algorithm to create a program that wins Atari games.

During the last 2 years, use of this technology has accelerated greatly. The key players pushing AI/ML/DL are Nvidia, Baidu, Google, IBM, Apple, Microsoft, Facebook, Twitter, Amazon, Yahoo, etc. Many new players have appeared: DeepMind, Numenta, Nervana, MetaMind, AlchemyAPI, Sentient, OpenAI, SkyMind, Cortica, etc. These companies are all targets of acquisition by the big ones. Sundar Pichai of Google says, “Machine learning is a core transformative way in which we are rethinking everything we are doing”. Google’s products deploying these technologies include Visual Translation, RankBrain, Speech Recognition, Voicemail Transcription, Photo Search, Spam Filter, etc.

AI is the broadest term, applying to any technique that enables computers to mimic human intelligence, using logic, if-then rules, decision trees, and machine learning. The subset of AI that includes abstruse statistical techniques that enable machines to improve at tasks with experience is machine learning. A subset of machine learning called deep learning is composed of algorithms that permit software to train itself to perform tasks, like speech and image recognition, by exposing multi-layered neural networks to vast amounts of data.
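The difference between a hand-written rule and a model that improves with experience can be shown with a toy example. The temperatures, labels and the fixed rule of 100 are invented, and the “learning” here is deliberately the simplest thing possible (a threshold placed midway between the two class means), not a neural network or any particular production technique.

```python
from statistics import mean

def rule_based(temp: float) -> bool:
    """Classic AI style: a fixed if-then rule written by a person."""
    return temp > 100

def learn_threshold(examples):
    """Machine learning style: derive the cut-off from labeled examples."""
    normal = [t for t, overheated in examples if not overheated]
    hot = [t for t, overheated in examples if overheated]
    return (mean(normal) + mean(hot)) / 2

# Made-up labeled experience: (temperature, was the engine overheating?)
examples = [(60, False), (70, False), (90, True), (100, True)]
threshold = learn_threshold(examples)

def learned(temp: float) -> bool:
    return temp > threshold

# More experience shifts the learned threshold; the hand-written rule never changes.
more_examples = examples + [(80, False), (82, False)]
new_threshold = learn_threshold(more_examples)
print(threshold, new_threshold, rule_based(95), learned(95))
```

Deep learning follows the same pattern of fitting to examples, but with multi-layered networks and far more data than a sketch like this can show.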

I think the resurgence is a result of the confluence of several factors, like advanced chip technology such as the Nvidia Pascal GPU architecture or IBM TrueNorth (a brain-inspired computer chip), software architectures like microservice containers, ML libraries, and data analytics tool kits. Well-known academics are being heavily recruited by companies: Geoffrey Hinton of University of Toronto (Google), Yann LeCun of New York University (Facebook), Andrew Ng of Stanford (Baidu), Yoshua Bengio of University of Montreal, etc.

The outlook of AI/ML/DL is very bright and we will see some real benefits in every business sector.

Data-driven enterprise

I moderated a panel of 3 CIOs last Sunday at the Solix Empower conference on the subject of the data-driven enterprise. The three CIOs came from different industries. Marc Parmet of the TechPar group spent many years at Avery Dennison after stints at Apple and IBM. Sachin Mathur leads IT innovation at Terex Corp., a large company supplying cranes and other heavy equipment. PK Agarwal, currently a dean at Northeastern University, used to be the CIO for the Government of California. Here are some of the points covered:

  • I reminded the audience that we are at the fourth paradigm in science (as per the late Jim Gray). A thousand years ago, science was experimental; a few hundred years back, science became theoretical (Newton’s laws, Maxwell’s laws..); fifty years ago, science became computational (simulation via a computer). Now the fourth paradigm is data-driven science, where experiment, theory, and computation must be combined into one holistic discipline. Actually, science hit the “big data” problem long before the commercial world.
  • Top level management is starting to understand that data is the oxygen, but they are yet to fully make their organizations data-driven. Just having a data warehouse with analytics and reporting does not make it data-driven, but they do see the value of predictive analytics and deep learning for competitive advantage.
  • While business-critical applications continue to run on-premise, newer, less critical apps such as collaboration and email (e.g. Lotus Notes) are moving to the public cloud. One said that they are evaluating migrating current Oracle ERP to a cloud version. Data security and reliability are critical needs. One panelist talked about not just private, public or hybrid cloud, but “scattered” cloud which will be highly distributed.
  • Out of the 3V’s of big data (volume, variety, and velocity), variety seems to be the greater need: images, pictures, and videos combined with sensors deployed in manufacturing and factory automation. For industries such as retail and telcos, volume dominates. Velocity will become more and more critical, as streaming this data in real time will need fast ingestion and analysis on the fly for timely decision making. This is the emerging world of IoT, where devices with an IP address will be everywhere: individuals, connected homes, autonomous cars, connected factories. They will produce huge volumes of data. Cluster computing with Hadoop/Spark will be the most economical technology to deal with this load. Much work lies ahead.
  • There will be a serious shortage of “big data” or “data science” skills, of the order of 4-5 million in the next few years. Hence universities such as Northeastern are setting up new curricula on data science. Today’s data scientist must have knowledge of the business, algorithms, computer science, and statistical modeling, plus he/she must be a good storyteller. Unlike the past, it’s not just answering questions, but figuring out what questions to ask. Such skills will be at a premium as enterprises become more data-driven.

We discussed many other points. It was a fun panel.