A conference in Bangalore

I was invited to speak at a conference called Solix Empower 2017 held in Bangalore, India on April 28th, 2017. It was an interesting experience. The conference focused on Big Data, Analytics, and Cloud. Over 800 people attended the one-day event with keynotes and parallel tracks on wide-ranging subjects.

I did three things. First, I was part of the inaugural keynote, where I spoke on “Data as the new Oxygen,” showing the emergence of data as a key platform for the future. I emphasized the new architecture of containers and microservices, layered with machine learning libraries and analytics toolkits, for building modern big data applications.

Then I moderated two panels. The first was titled “The rise of real-time data architecture for streaming applications” and the second “Top data governance challenges and opportunities”. In the first panel, the members came from Hortonworks, Tech Mahindra, and ABOF (Aditya Birla Fashion). Each member described the criticality of real-time analytics, where trends and anomalies are caught on the fly and action is taken within seconds or minutes. I learnt that for online e-commerce players like ABOF, a key challenge is identifying customers most likely to refuse goods delivered at their door (many do not have credit cards, hence COD, or cash on delivery). Such refusals cause major losses to the company. They do trend analysis to identify specific customers who are likely to behave that way. By using real-time analytics, ABOF has been able to reduce such occurrences by about 4%, with significant savings. The panel also discussed technologies for data ingestion, streaming, and building stateful applications. Some comments were made on combining Hadoop/EDW (OLAP) with streaming (OLTP) into one solution, such as the Lambda architecture.
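As a rough illustration of the kind of real-time scoring ABOF described (the features, weights, and threshold below are my own assumptions for the sketch, not ABOF’s actual model), a streaming consumer might score each incoming COD order and flag risky ones for verification:

```python
# Hypothetical sketch: scoring incoming orders for COD-refusal risk in near real time.
# Feature names, weights, and the 0.7 threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class OrderEvent:
    customer_id: str
    past_orders: int
    past_refusals: int
    is_cod: bool
    order_value: float

def refusal_risk(event: OrderEvent) -> float:
    """Crude risk score in [0, 1] based on the customer's refusal history."""
    if not event.is_cod or event.past_orders == 0:
        return 0.0
    history_rate = event.past_refusals / event.past_orders
    value_factor = min(event.order_value / 5000.0, 1.0)  # pricier orders weigh more
    return min(1.0, 0.7 * history_rate + 0.3 * value_factor)

def handle(event: OrderEvent) -> None:
    # In a real pipeline this would consume from a stream (Kafka, Kinesis, etc.)
    # and trigger an action such as requiring prepayment, instead of printing.
    if refusal_risk(event) > 0.7:
        print(f"Flag {event.customer_id}: ask for prepayment or verification")

handle(OrderEvent("c42", past_orders=10, past_refusals=8, is_cod=True, order_value=4000))
```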

The second panel, on data governance, had members from Wipro, Finisar, Solix, and Bharti AXA Insurance. The panelists agreed that data governance is no longer viewed as the “bureaucratic police, and hence universally disliked,” inside the company; it is now taken seriously by upper management. Hence policies for metadata management, data security, data retirement, and authorization are being put in place. Accuracy of data is a key challenge. While the organizational structure for data governance (such as a CDO, or chief data officer) is still evolving, there remain many hard problems (especially for large companies with diverse groups).

It was interesting to have executives from Indian companies reflect on these issues, which seem no different from what we discuss here. Big Data is everywhere and global.

Data Unification at scale

The term Data Unification is new in the Big Data lexicon, pushed by a variety of companies such as Talend, 1010Data, and TamR. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), which emerged during the 1990s when data warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transforming it to fit operational needs (based on business rules), and loading it into end-target databases, more specifically an operational data store, data mart, or data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retrospective (e.g. how many shoppers between the ages of 25 and 35 bought this item between May and July?). This was like driving a car looking at the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands “predictive analytics” and “streaming analytics”.
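To make the pipeline concrete, here is a minimal ETL sketch (the source systems, fields, and schema are invented for illustration) that extracts rows from two hypothetical sources, maps them to one target schema, loads them into a small warehouse table, and then answers the retrospective question above:

```python
# Minimal ETL sketch (illustrative only): extract from two hypothetical source
# systems, transform to a common schema, load into a warehouse table, then run
# a retrospective query of the kind described above.

import sqlite3

def extract():
    # Stand-ins for "multiple applications from different vendors"
    web_orders = [{"shopper_age": 29, "item": "shoes", "month": 6}]
    store_orders = [{"age": 41, "product": "shoes", "sale_month": 5}]
    return web_orders, store_orders

def transform(web_orders, store_orders):
    # Map both source schemas onto one target schema (the "business rules" step)
    rows = [(o["shopper_age"], o["item"], o["month"]) for o in web_orders]
    rows += [(o["age"], o["product"], o["sale_month"]) for o in store_orders]
    return rows

def load(rows, conn):
    conn.execute("CREATE TABLE sales (age INTEGER, item TEXT, month INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
# Retrospective question: how many shoppers aged 25-35 bought shoes in May-July?
count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE age BETWEEN 25 AND 35 "
    "AND item = 'shoes' AND month BETWEEN 5 AND 7"
).fetchone()[0]
print(count)
```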

During my IBM and Oracle days, ETL in that first phase was left for outside companies to address. It was unglamorous work, and the key vendors were not that interested in solving it. This gave rise to many new players such as Informatica, Datastage, and Talend, and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of constructing a global schema in advance, writing a program for each local data source to understand it and map it to the global schema, and then writing scripts to transform, clean (resolving homonym and synonym issues), and dedup (get rid of duplicates) the data. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is still used today for data unification problems. The term MDM (Master Data Management) refers to a master representation of all enterprise objects, to which everybody agrees to conform.
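For a feel of the transform step, here is a toy clean-and-dedup pass; the field names, synonym table, and dedup key are made up for illustration:

```python
# Toy clean/dedup sketch (illustrative assumptions throughout): normalize synonym
# spellings of a field, then drop duplicate records by a chosen key.

SYNONYMS = {"nyc": "New York", "new york city": "New York"}  # hypothetical mapping

def clean(record):
    city = record["city"].strip().lower()
    record["city"] = SYNONYMS.get(city, record["city"].strip().title())
    return record

def dedup(records, key=("customer_id",)):
    seen, unique = set(), []
    for r in map(clean, records):
        k = tuple(r[f] for f in key)
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique

records = [
    {"customer_id": "1", "city": "NYC"},
    {"customer_id": "1", "city": "new york city"},  # duplicate of the first
    {"customer_id": "2", "city": "boston"},
]
print(dedup(records))
```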

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources, etc.), but gets extremely hard when the sources number in the hundreds or even thousands. It gets worse still when you want to unify public data from the web with enterprise data.
  • Human labor to map each source to a master schema becomes costly and excessive. Here machine learning is required, with domain experts asked to augment it where needed (see the sketch after this list).
  • Real-time unification of streaming data and its analysis cannot be handled by these solutions.
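As a rough illustration of machine-assisted schema mapping (the column names and the 0.6 confidence cutoff are assumptions for this sketch, not any vendor’s algorithm), one could start with simple name similarity and hand the uncertain matches to a domain expert:

```python
# Sketch of machine-assisted schema mapping: propose matches between a source
# schema and the master schema by name similarity; anything below a confidence
# threshold goes to a domain expert. Column names and the 0.6 cutoff are
# illustrative assumptions.

from difflib import SequenceMatcher

MASTER = ["customer_id", "full_name", "email_address", "postal_code"]

def best_match(source_col):
    scored = [(SequenceMatcher(None, source_col.lower(), m).ratio(), m) for m in MASTER]
    return max(scored)  # (score, master_column)

source_schema = ["cust_id", "name", "e_mail", "zip"]
for col in source_schema:
    score, master_col = best_match(col)
    if score >= 0.6:
        print(f"auto-map {col} -> {master_col} (confidence {score:.2f})")
    else:
        print(f"ask a domain expert about {col} (best guess {master_col}, {score:.2f})")
```

In practice the matcher would be a trained model using column statistics and sample values rather than names alone, but the division of labor is the same: the machine proposes, the expert confirms the hard cases.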

Another solution, the “data lake,” where you store disparate data in its native format, seems to address only the “ingest” problem. It changes the order of ETL to ELT (first load, then transform). However, it does not address the scale issues. The new world needs bottom-up data unification (schema-last) in real time or near real time.

The typical data unification cycle can go like this: start with a few sources, try enriching the data with, say, X, see if it works, and if it fails, loop back and try again. Use enrichment to improve, and do as much as possible automatically using machine learning and statistics. But iterate furiously, and ask domain experts for help when needed. Otherwise the current approach of ETL or ELT can get very expensive.
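Schematically, that loop might look like the following; the unify, enrich, and quality functions and the 0.9 target are placeholders, not a real implementation:

```python
# Schematic of the iterate-and-enrich cycle described above. The sources,
# enrichment step, quality metric, and 0.9 target are illustrative placeholders.

def unify(records):
    # Placeholder for schema-last matching/merging of raw records
    return [dict(r, unified=True) for r in records]

def enrich(records):
    # Placeholder for an enrichment step (e.g. adding a derived field)
    return [dict(r, enriched=True) for r in records]

def quality(records):
    # Placeholder metric: fraction of records carrying the enrichment
    return sum(1 for r in records if r.get("enriched")) / max(len(records), 1)

def unification_cycle(sources, target=0.9, max_iters=5):
    data = [r for s in sources for r in s]      # start with a few sources
    for _ in range(max_iters):
        data = enrich(unify(data))              # try an enrichment, see if it works
        if quality(data) >= target:
            return data                         # good enough, stop iterating
        # otherwise loop back and try again; in practice, ask a domain expert here
    return data

result = unification_cycle([[{"id": 1}], [{"id": 2}]])
print(len(result), "records unified")
```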

The end of Cloud Computing?

A provocative title for sure, when everyone thinks we have just started the era of cloud computing. I recently listened to a talk on this topic by Peter Levine, general partner at Andreessen Horowitz, which makes a ton of sense. The proliferation of intelligent devices and the rise of IoT (Internet of Things) lead us to a new world beyond what we see today in cloud computing (in terms of scale).

I have said many times that the onset of cloud computing was like going back to the future of centralized computing. We had IBM mainframes dominating the centralized computing era during the 1960s and 1970s. The introduction of PCs created the world of client-server computing (remember the Wintel duopoly?) from the 1980s till 2000. Then the popularity of mobile devices started the cloud era in 2005, taking us back to centralized computing again. The text message I send you does not go from my device to your device directly; it gets to a server somewhere in the cloud first and then to your phone. The trillions of smart devices forecast to appear as sensors in automobiles, home appliances, airplanes, drones, engines, and almost anything you can imagine (even your shoe) will drastically change the computing paradigm again. Each of these “edge intelligent devices” cannot go back and forth to the cloud for every interaction. Rather, they will want to process data at the edge to cut down latency. This brings us to a new form of the “distributed computing” model, kind of a vastly expanded version of the “PC era”.

Peter emphasized that the cloud will continue to exist, but its role will change from being the central hub to a “learning center,” where curated data from the edge (only the relevant data) resides. The learning gets pushed back to the edge so it gets better at its job. The edge does three things: sense, infer, and act. The sense level handles massive amounts of data, as in a self-driving car (10GB per mile), making it a “data center on wheels.” The sheer volume of data is too much to push back to the cloud. The infer piece is all machine learning and deep learning to detect patterns and improve accuracy and automation. Finally, the act phase is about taking actions in real time. Once again, the cloud plays the central role as a “learning center” and the custodian of important data for the enterprise.
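A highly simplified sense-infer-act loop for an edge device might look like the sketch below; the sensor values, the anomaly test standing in for a learned model, and the “curated summary” sent to the cloud are all illustrative assumptions:

```python
# Simplified sense/infer/act loop for an edge device. Sensor values, the anomaly
# threshold, and the curated-summary step are illustrative assumptions.

import random
import statistics

def sense():
    # Pretend sensor reading (e.g. engine temperature in Celsius)
    return 90 + random.gauss(0, 5)

def infer(reading, history, threshold=3.0):
    # Flag readings far from the recent mean (a stand-in for a learned model)
    if len(history) < 10:
        return False
    mean, stdev = statistics.mean(history), statistics.pstdev(history) or 1.0
    return abs(reading - mean) > threshold * stdev

def act(reading):
    print(f"local action: throttle down, reading={reading:.1f}")

history, summary = [], []
for _ in range(1000):
    r = sense()
    if infer(r, history):
        act(r)                  # act locally, with no round trip to the cloud
        summary.append(r)       # only curated, relevant data goes back to the cloud
    history = (history + [r])[-100:]   # keep a sliding window of recent readings

print(f"{len(summary)} curated readings would be pushed to the cloud learning center")
```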

Given the sheer volume of data created, peer-to-peer networks will be utilized to lessen the load on the core network and share data locally. The challenge is huge in terms of network management and security. Programming becomes more data-centric, meaning less code and more math. As the processing power of edge devices increases, the cost will come down drastically. I like his last statement that the entire world becomes the domain of IT, meaning we will have consumer-oriented applications with enterprise-scale manageability.

This is exciting and scary. But who could have imagined the internet in the 1980s or the smartphone in the 1990s, let alone self-driving cars?

IoT Analytics – A panel discussion

I was invited to participate in a panel called “IoT Analytics” last Thursday, March 23rd. It was organized for the IoT Global Council by Erick Schonfeld of Traction Technology Partner (New York). Besides me there were two other speakers: Brandon Cannaday, cofounder and chief product officer of Losant, and Patrick Stuart, head of products at SkyCatch. For those of you not familiar with IoT, it stands for Internet of Things. There is another term, IIoT, for the Industrial Internet of Things. IoT has been in the lexicon for the last few years, signifying the era of “pervasive computing,” where devices with an IP address can be everywhere (the fridge, microwave, thermostats, door knobs, cars, airplanes, electric motors, various sensors, and so on) constantly sending data. The phrases “connected home” and “connected car” are an upshot of the IoT phenomenon. However, the Gartner Group showed IoT at the peak of its “hype cycle” a couple of years back.

I emphasized the “pieces of the puzzle,” or the components of IoT Analytics: data ingestion at scale, handling the streaming data pipeline, data curation and unification, and storing the results in a highly scalable NoSQL data store, as the steps before analytics can happen. Just dumping everything into a Hadoop data lake addresses only 5% of the problem (data ingestion). Transforming the data and curating it so it makes sense is a non-trivial step. Then I spoke about analytics, which has several components: descriptive (what happened and why?), predictive (what is probably going to happen?), and prescriptive (what should I do about it?). Streaming analytics must filter, aggregate, enrich, and analyze a high throughput of data from disparate sources to identify patterns, detect urgent situations (like a temperature spike in an engine), and automate immediate action in real time.
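To make the streaming-analytics step concrete, here is a small sketch that filters, aggregates, and flags a temperature spike; the event fields, window size, and 15-degree jump threshold are assumptions for illustration, not any particular product’s API:

```python
# Sketch of a streaming-analytics step: filter, aggregate, and flag an urgent
# condition (a temperature spike in an engine). The event fields, window size,
# and 15-degree jump threshold are illustrative assumptions.

from collections import defaultdict, deque

windows = defaultdict(lambda: deque(maxlen=10))    # last 10 readings per engine

def handle_event(event):
    if event.get("sensor") != "temperature":       # filter: keep only what we need
        return
    engine, temp = event["engine_id"], float(event["celsius"])
    window = windows[engine]
    if window and temp - (sum(window) / len(window)) > 15:   # detect a spike
        print(f"ALERT engine {engine}: spike to {temp:.1f}C, trigger shutdown check")
    window.append(temp)                            # aggregate into the rolling window

# In production this loop would consume from a message bus (Kafka, Kinesis, etc.)
# and emit an action instead of printing; here we replay a small list of events.
stream = [
    {"sensor": "temperature", "engine_id": "e1", "celsius": 85},
    {"sensor": "vibration",   "engine_id": "e1", "hz": 30},
    {"sensor": "temperature", "engine_id": "e1", "celsius": 86},
    {"sensor": "temperature", "engine_id": "e1", "celsius": 110},  # spike
]
for evt in stream:
    handle_event(evt)
```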

Patrick of SkyCatch showed how they serve the construction industry by taking images (via drones) and accurately creating “earth maps” for self-driving bulldozers, thus saving human labor costs. Another example was taking images of actual progress at large construction sites and contrasting them against the plan to show offsets, thus detecting delays and enabling corrective action in time.

Brandon of Losant showed the example of a large utility company in Australia that supplies high-powered (expensive) pumps fitted with sensors. By collecting data from the sensors and monitoring it centrally, they can identify problems and notify the maintenance teams to take corrective action. Previously they had to fly people around for maintenance, so this new IoT Analytics deployment has saved the company a lot of money. Both are startup companies in the IoT Analytics space tackling immediate issues in real time.

It was a good panel and I learnt a lot from my co-panelists.

Technology for a cashless society

I happened to be in India last November when Prime Minister Modi announced the demonetization program, in which 86% of the currency, in the form of two paper bills (Rs. 500 and Rs. 1000 denominations), was made defunct. People were given time to deposit their existing currency in the bank. Those who held an unusually high volume of such currency were supposed to declare its legal source or face stiff penalties, such as a 60-75% tax. The goal was to catch the money hoarders and black marketers who avoid paying taxes on such undeclared money.

Four months later, last February, I happened to visit India again. Everyone suggested I download an app called Paytm. I could transfer money from a bank account instantly. What was convenient with Paytm was that I could use it at gas stations, small stores, and even at roadside vendor stalls. Every merchant seems to have installed a Paytm station: you point your smartphone at the Paytm barcode and the transaction happens instantly. You can check your balance any time. I noticed people paying for phone, utilities, and other conveniences without having to carry loads of cash. To incentivize more usage, many vendors dole out discounts.

Paytm, based in Delhi, has raised $738 million from investors inside and outside India (Alibaba, SAIF Partners, Goldman Sachs, Singapore’s Temasek, Taiwan’s MediaTek, etc.) at a valuation over $5B. Paytm wallet users exceed 200 million. Clearly the demonetization has come as a boon, dramatically increasing its usage. The company has also started its international operations by making the digital wallet available in Canada earlier this year.

Why have digital wallets not taken off in a big way in the US? We have had Apple Pay and Google Wallet for a while, but their usage has not been spectacular. One of the reasons may be the wide use of credit and debit cards that consumers are accustomed to. But in a developing country like India, where credit/debit card usage is quite low, a digital wallet like Paytm scores big. The company expects to reach profitability next year and may soon join the ranks of unicorns valued above $10B.

Secret of Sundar Pichai’s success

I watched Sundar Pichai’s recent interaction with students at I.I.T. (Indian Institute of Technology) Kharagpur, India, where he graduated back in 1993. Beyond our common country of birth, I had never heard of Sundar until his rapid rise at Google a few years back. I have never met him or listened to him at conferences. So this was the first time I had a chance to hear his remarks and his answers to the many questions from an audience of 3500 students at his alma mater earlier this week.

Growing up not far from I.I.T. Kharagpur, I was very aware of the institution. It was the first I.I.T. in India, established during the 1950s. Other I.I.T.s, at Kanpur, Delhi, Mumbai, and Chennai, came later. These were the original five Indian Institutes of Technology. Lately many new ones have been added.

Sundar did his undergraduate studies in Metallurgy (the study of metals). So how did he switch from that to software? That was one of the questions from a student. He said that he loved the Fortran language during his student days, and that love for programming continued. The message he was giving was for everyone to pursue their own interests and passions. He mentioned that, unlike in India, students at US universities sometimes do not decide their majors until well into their 3rd or 4th year of studies. Sundar’s passion was to build products that would impact a very large number of global users. During his interviews at Google, he was asked what he thought of Gmail, which he had never seen or used. The fourth interviewer actually showed it to him. Subsequently, he gave his opinion to the remaining three interviewers on what he thought was wrong with Gmail and how to improve it. He emphasized time and again the need to step out of one’s comfort zone and get a well-rounded experience. Today’s students should not be afraid to take some risks and should be willing to fail.

Besides technical leadership, Sundar possesses an amazing quality: egolessness, so rare in the Silicon Valley executive community. He said that he truly believes in empowering his team and letting them execute with full trust. This is easier said than done, based on my experience at IBM and Oracle. Large organizations suffer from ego-driven leadership, causing a great amount of friction and anguish. Sundar’s rise at Google was due to his amazing ability to get teams to work very effectively. From Search, he went on to manage Chrome, and then he was given Android. His ability to work through the complexities of products, fiefdoms, and internal rivalries was so evident that he was elevated to the CEO position quickly. Humility is his hallmark, combined with clarity of vision and efficient execution.

He made an interesting comment about the vision at Google. Larry Page said that the moonshot projects are worthwhile because the bar is so high (no competition). Even if you fail, you are still ahead with your knowledge and experience.

It was fun listening to Sundar’s simple and honest answers and remarks.

The new Microsoft

Clearly Satya Nadella has made a huge difference at Microsoft since taking over in 2014. In 2016 the stock hit an all-time high, surpassing its 1999 peak. So investors are happy. Here are the key changes he has made since taking on the role of CEO:

  • Skipped Windows 9 and went straight from Windows 8 to Windows 10, a great release. However, revenue from Windows is declining along with PC sales.
  • Released Microsoft Office for iPad, and also released Outlook for iPhone and Android.
  • Embraced Linux by joining the Linux Foundation, previously anathema to Microsoft’s Windows-centric culture.
  • Spent $2.5B to buy Mojang, the studio behind the hit game Minecraft.
  • Introduced Microsoft’s first laptop, the Surface Book.
  • Revealed Microsoft HoloLens, the super-futuristic holographic goggles.
  • Created a new partner program to provide Microsoft products on non-Windows platforms, and hired ex-Qualcomm executive Peggy Johnson to head business development.
  • Enhanced company morale and employee excitement.
  • Made the biggest gamble of all: the purchase of LinkedIn last June for a whopping $26.2B.

It’s important to understand the significance of the LinkedIn purchase. Adam Rifkin (a smart guy I worked with twelve years back at KnowNow) recently wrote an article on this topic. I like his comment that in a world of machine learning, uniquely valuable data is the new network effect. The right kind of data is now the force multiplier that can catapult organizations past any competitors who lack equivalent data. So data is the new barrier to entry. Adam also notes that the most valuable data is perishable, not static. Software is eating the world and AI is eating software, meaning AI is eating data and popping out software.

Now let’s map this onto Microsoft’s LinkedIn purchase: Microsoft sees the network effects in LinkedIn’s data. What Google gets from search, Facebook from likes, and Amazon from shopping carts, Microsoft will get from LinkedIn’s data for its CRM services. Adam makes the point that the global CRM market in 2015 was worth $26.3B, almost exactly what Microsoft paid. It is the fastest-growing area of enterprise software. Hence Marc Benioff of Salesforce was not very happy with this acquisition.

The new Microsoft is ready to fight the enterprise software battle with incumbents like Salesforce, Oracle, SAP, and Workday.