Apache Drill + Arrow = Dremio

A new company called Dremio emerged from stealth mode yesterday, backed by Redpoint and Lightspeed in a $10m Series A round back in 2015. The founders came from MapR and were active in Apache projects like Drill and Arrow. The same VCs backed MapR and had the Dremio founders work out of their facilities during the stealth phase. The company now has around 50 people in its Mountain View, California office.

Apache Drill acts as a single SQL engine that, in turn, can query and join data from several other systems. Drill can certainly make use of an in-memory columnar data standard, but while Dremio was still in stealth, it wasn’t obvious what Drill’s strong intersection with Arrow might be. Yesterday the company launched a namesake product that also acts as a single SQL engine that can query and join data from several other systems, and it accelerates those queries using Apache Arrow. So it is a combination of Drill and Arrow: schema-free SQL for a variety of data sources plus a columnar in-memory analytics execution engine.

Dremio believes that BI today involves too many layers. Source systems, via ETL processes, feed into data warehouses, which may then feed into OLAP cubes. BI tools themselves may add another layer, building their own in-memory models in order to accelerate query performance. Dremio thinks that’s a huge mess and disintermediates things by providing a direct bridge between BI tools and the source systems they’re querying. The BI tools connect to Dremio as if it were a primary data source, and query it via SQL. Dremio then delegates the query work to the true back-end systems through push-down queries that it issues. Dremio can connect to relational databases (DB2, Oracle, SQL Server, MySQL, PostgreSQL, Amazon Redshift), NoSQL stores (MongoDB, HBase), file systems like MapR-FS, Hadoop, cloud blob stores like S3, and Elasticsearch.
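To make the push-down idea concrete, here is a minimal sketch in Python. It is not Dremio’s API or implementation: two in-memory SQLite databases stand in for separate back-end systems, and the table names, data, and the `federated_total_by_customer` function are all made up for illustration. The point is only that each filter runs inside the source that owns the data, and the middle layer joins the already-reduced results.

```python
import sqlite3


def make_source(rows, ddl, insert_sql):
    """Create an in-memory SQLite database standing in for one back-end system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two independent "source systems" with hypothetical data.
orders_db = make_source(
    [(1, 101, 250.0), (2, 102, 40.0), (3, 101, 75.0), (4, 103, 500.0)],
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?, ?)",
)
customers_db = make_source(
    [(101, "Asha", "IN"), (102, "Ben", "US"), (103, "Chen", "IN")],
    "CREATE TABLE customers (customer_id INTEGER, name TEXT, country TEXT)",
    "INSERT INTO customers VALUES (?, ?, ?)",
)


def federated_total_by_customer(country, min_amount):
    """Join two sources, pushing each filter down to the system that owns the data."""
    # Push-down 1: only customers in the requested country leave the customer source.
    customers = dict(
        customers_db.execute(
            "SELECT customer_id, name FROM customers WHERE country = ?",
            (country,),
        ).fetchall()
    )
    # Push-down 2: the order source filters and pre-aggregates before returning rows.
    totals = orders_db.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "WHERE amount >= ? GROUP BY customer_id",
        (min_amount,),
    ).fetchall()
    # The join happens in the middle layer, on the already-reduced results.
    return sorted(
        (customers[cid], total) for cid, total in totals if cid in customers
    )


result = federated_total_by_customer("IN", 50.0)
print(result)
```

A BI tool would see only the single function call (or, in the real product, a single SQL endpoint) and never talk to the two sources directly.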

Here’s how it works: all data pulled from the back-end data sources is represented in memory using Arrow. Combined with vectorized (in-CPU parallel processing) querying, this design can yield up to a 5x performance improvement over conventional systems (company claims). But a perhaps even more important optimization is Dremio’s use of what it calls “Reflections,” which are materialized data structures that optimize Dremio’s row and aggregation operations. Reflections are sorted, partitioned, and indexed, stored as Parquet files on disk, and handled in-memory as Arrow-formatted columnar data. This sounds similar to ROLAP aggregation tables.
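A toy illustration of the two ideas, in plain Python. This is not Arrow, Parquet, or Dremio code; the data and the helper `build_aggregation_reflection` are hypothetical, and a dictionary of lists only mimics the column-at-a-time layout that Arrow standardizes. It shows why an aggregate reads a single column, and how a precomputed, sorted aggregate (in the spirit of a Reflection or a ROLAP aggregation table) can answer a query without rescanning the raw data.

```python
from collections import defaultdict

# Row-oriented: one record per sale (hypothetical data).
rows = [
    {"region": "west", "product": "a", "amount": 10.0},
    {"region": "east", "product": "a", "amount": 20.0},
    {"region": "west", "product": "b", "amount": 5.0},
    {"region": "east", "product": "b", "amount": 15.0},
    {"region": "west", "product": "a", "amount": 30.0},
]

# Column-oriented: one list per field instead of one dict per record.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_from_columns(cols):
    """An aggregate touches only the one column it needs."""
    return sum(cols["amount"])


def build_aggregation_reflection(cols, key, measure):
    """Materialize SUM(measure) GROUP BY key once, sorted by key."""
    sums = defaultdict(float)
    for k, v in zip(cols[key], cols[measure]):
        sums[k] += v
    return dict(sorted(sums.items()))


# Built ahead of time; later queries read this small structure, not the raw rows.
sales_by_region = build_aggregation_reflection(columns, "region", "amount")


def query_region_total(region):
    """Answer a GROUP BY-style lookup from the precomputed structure."""
    return sales_by_region[region]


print(total_from_columns(columns))
print(sales_by_region)
print(query_region_total("west"))
```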

Andrew Brust from ZDNet said, “While Dremio’s approach to this is novel, and may break a performance barrier that heretofore has not been well-addressed, the company is nonetheless entering a very crowded space. The product will need to work on a fairly plug-and-play basis and live up to its performance promises, not to mention build a real community and ecosystem. These are areas where Apache Drill has had only limited success. Dremio will have to have a bigger hammer, not just an Arrow”.

A conference in Bangalore

I was invited to speak at a conference called Solix Empower 2017 held in Bangalore, India on April 28th, 2017. It was an interesting experience. The conference focused on Big Data, Analytics, and Cloud. Over 800 people attended the one-day event with keynotes and parallel tracks on wide-ranging subjects.

I did three things. First, I was part of the inaugural keynote, where I spoke on “Data as the new Oxygen,” showing the emergence of data as a key platform for the future. I emphasized the new architecture of containers and micro-services, on top of which sit machine learning libraries and analytic toolkits for building modern big data applications.

Then I moderated two panels. The first was titled “The rise of real-time data architecture for streaming applications” and the second was called “Top data governance challenges and opportunities”. In the first panel, the members came from Hortonworks, Tech Mahindra, and ABOF (Aditya Birla Fashion). Each member described the criticality of real-time analytics, where trends and anomalies are caught on the fly and action is taken within seconds or minutes. I learnt that for online e-commerce players like ABOF, a key challenge is identifying customers most likely to refuse goods delivered at their door (many do not have credit cards, hence there is COD, or cash on delivery). Such refusals cause a major loss to the company. They do some trend analysis to identify specific customers who are likely to behave that way. By using real-time analytics, ABOF has been able to reduce such occurrences by about 4%, with significant savings. The panel also discussed technologies for data ingestion, streaming, and building stateful apps. Some comments were made on combining Hadoop/EDW (OLAP) plus streaming (OLTP) into one solution like the Lambda architecture.

The second panel on data governance had members from Wipro, Finisar, Solix and Bharti AXA Insurance. These panelists agreed that data governance is no longer viewed as the “bureaucratic police and hence universally disliked” inside the company, and that it is taken seriously by upper management. Hence policies for metadata management, data security, data retirement, and authorization are being put in place. Accuracy of data is a key challenge. While the organizational structure for data governance (like a CDO, chief data officer) is still evolving, there remain many hard problems (especially for large companies with diverse groups).

It was interesting to have executives from Indian companies reflect on these issues that seem no different than what we discuss here. Big Data is everywhere and global.

The end of Cloud Computing?

A provocative title for sure, when everyone thinks we just started the era of cloud computing. I recently listened to a talk on this topic by Peter Levine, general partner at Andreessen Horowitz, and it makes a ton of sense. The proliferation of intelligent devices and the rise of IoT (Internet of Things) lead us to a new world beyond what we see today in cloud computing (in terms of scale).

I have said many times that the onset of cloud computing was like going back to the future of centralized computing. We had IBM mainframes dominating the centralized computing era during the 1960s and 1970s. The introduction of PCs created the world of client-server computing (remember the Wintel duopoly?) from the 1980s until 2000. Then the popularity of mobile devices started the cloud era in 2005, thus taking us back to centralized computing again. The text message I send you does not go from my device to your device directly, but gets to a server somewhere in the cloud first and then to your phone. The trillions of smart devices forecasted to appear as sensors in automobiles, home appliances, airplanes, drones, engines, and almost anything you can imagine (like in your shoe) will drastically change the computing paradigm again. Each of these “edge intelligent devices” cannot go back and forth to the cloud for every interaction. Rather, they would want to process data at the edge to cut down latency. This brings us back to a new form of “distributed computing” model – kind of back to a vastly expanded version of the “PC era”.

Peter emphasized that the cloud will continue to exist, but its role will change from being the central hub to a “learning center” where curated data from the edge (only relevant data) resides in the cloud. The learning gets pushed back to the edge so that it gets better at its job. The edge of the cloud does three things – sense, infer, and act. The sense level handles massive amounts of data, as in a self-driving car (10GB per mile), thus making it like a “data center on wheels”. The sheer volume of data is too much to push back to the cloud. The infer piece is all machine learning and deep learning to detect patterns and improve accuracy and automation. Finally, the act phase is all about taking actions in real-time. Once again, the cloud plays the central role as a “learning center” and the custodian of important data for the enterprise.
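The sense, infer, act loop can be sketched in a few lines of Python. Everything here is an assumption made for illustration, not something from the talk: the class names, the single-threshold “model”, and the rule the cloud uses to relearn that threshold. The sketch only shows the division of labor: the device handles every reading locally, uploads just the curated (relevant) values, and receives an updated model back from the cloud.

```python
from statistics import mean


class EdgeDevice:
    """Toy edge node: senses readings, infers locally, acts immediately."""

    def __init__(self, threshold):
        self.threshold = threshold  # stand-in for a model learned in the cloud
        self.actions = []           # actions taken locally, with no round trip
        self.outbox = []            # curated data worth sending to the cloud

    def sense(self, reading):
        return float(reading)

    def infer(self, value):
        return value > self.threshold

    def act(self, value):
        self.actions.append(f"respond to {value}")

    def process(self, readings):
        for reading in readings:
            value = self.sense(reading)
            if self.infer(value):
                self.act(value)
                self.outbox.append(value)  # only relevant data leaves the edge


class CloudLearningCenter:
    """Toy cloud: stores curated data and pushes learning back to the edge."""

    def __init__(self):
        self.curated = []

    def ingest(self, values):
        self.curated.extend(values)

    def learn_threshold(self, default):
        # Hypothetical rule: 90% of the mean of the values the edge flagged.
        if not self.curated:
            return default
        return 0.9 * mean(self.curated)


device = EdgeDevice(threshold=50.0)
cloud = CloudLearningCenter()

device.process([10, 20, 80, 30, 100, 40])   # all six readings handled at the edge
cloud.ingest(device.outbox)                 # only two of them are uploaded
new_threshold = cloud.learn_threshold(default=device.threshold)
device.threshold = new_threshold            # learning pushed back to the edge

print(device.outbox, new_threshold)
```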

Given the sheer volume of data created, peer-to-peer networks will be utilized to lessen the load on the core network and share data locally. The challenge is huge in terms of network management and security. Programming becomes more data-centric, meaning less code and more math. As the processing power of edge devices increases, the cost will come down drastically. I like his last statement that the entire world becomes the domain of IT, meaning we will have consumer-oriented applications with enterprise-scale manageability.

This is exciting and scary. But whoever could have imagined the internet in the 1980s or the smartphone during the 1990s, let alone self-driving cars?

The new Microsoft

Clearly Satya Nadella has made a huge difference at Microsoft since taking office in 2014. In 2016 the stock hit its highest level since 1999. So investors are happy. Here are the key changes he has made since taking the role of CEO:

  • Skipped Windows 9 and went straight from Windows 8 to Windows 10, a great release. However, revenue from Windows is declining with the reduction in PC sales.
  • Released Microsoft Office for iPad. Also released Outlook for iPhone and Android.
  • Embraced Linux by joining the Linux Foundation, previously anathema to Microsoft’s Windows-centric culture.
  • Spent $2.5B to buy Mojang, the studio behind hit game Minecraft.
  • Introduced Microsoft’s first laptop, the Surface Book.
  • Revealed Microsoft HoloLens, the super-futuristic holographic goggles.
  • Created the new partner program to provide Microsoft products on non-Windows platforms. Hired ex-Qualcomm exec Peggy Johnson to head the business development group.
  • Enhanced company morale and employee excitement.
  • The biggest gamble was the purchase of LinkedIn last June for a whopping $26.2B.

It’s important to understand the significance of the LinkedIn purchase. Adam Rifkin (I worked with him twelve years back at KnowNow, a smart guy) recently wrote an article on this topic. I like his comment that in a world of machine learning, uniquely valuable data is the new network effect. The right kind of data is now the force multiplier that can catapult organizations past any competitors who lack equivalent data. So data is the new barrier to entry. Adam also makes the point that the most valuable data is perishable, not static. Software is eating the world and AI is eating software, meaning AI is eating data and popping out software.

Now let’s map what this means to the LinkedIn purchase by Microsoft, which sees the network effects of LinkedIn’s data. What Google gets from search, Facebook gets from likes, and Amazon gets from shopping carts, Microsoft will get from LinkedIn’s data for its CRM services. Adam makes a point that the global CRM market in 2015 was worth $26.3B – almost exactly what Microsoft paid. It is the fastest growing area of enterprise software. Hence Marc Benioff of Salesforce was not very happy with this acquisition.

The new Microsoft is ready to fight the enterprise software battle with incumbents like Salesforce, Oracle, SAP and Workday.

The top five most-valued companies are Tech. – almost

On this first day of August 2016, I saw that the four most-valued companies are tech. companies, and a fifth tech. company is almost there. Here is the list.

  1. Apple ($aapl): $566B
  2. Alphabet ($goog): $562B
  3. Microsoft ($msft): $433B
  4. Amazon ($amzn): $365B
  5. Exxon Mobil ($xom): $356B
  6. Facebook ($fb): $353B

The big move is Amazon beating Exxon Mobil (which used to be number 1 for many years) to the fourth spot. The switch came after Amazon posted its fifth straight quarter of profits last week as the oil giant’s profits tumbled 59 percent during the same rough period. If Exxon continues its drop, then Facebook will beat it in days.

This is quite remarkable! Other than Microsoft and Apple, the other 3 companies are much younger, Facebook being the youngest one. Their rapid rise is due to the growth of the Internet with its associated areas of search, e-commerce, and social networking. Interestingly, Amazon survived the dot-com bust of 2000-2001, unlike Yahoo, AOL, etc. Contrast this with the $4.8B valuation of Yahoo’s core business acquired by Verizon last week! Also, the fastest growing and most profitable of Amazon’s 3 businesses (books, any commercial items, and AWS) is the cloud infrastructure piece called AWS (Amazon Web Services), with a run-rate of $10B this year. This is way ahead of Microsoft’s Azure cloud or Google’s cloud solutions.

The importance of cloud is obvious, as Oracle just paid $9.3B last week to acquire NetSuite, a company that was funded by Larry Ellison. With a 40% ownership of NetSuite, he gets a hefty $3.5B from this deal. Paradoxically, Amazon led the way to cloud computing – not IBM, not HP, not EMC/VMware, and not Microsoft or Google. So no wonder Amazon is reaping the benefits!

Yahoo going to Verizon is so unexciting!

So finally it was Verizon paying $4.8B to acquire Yahoo’s core business. Business Insider said, “Yahoo, which was founded in 1994, was one of the world’s leading internet businesses but has gone through tough times in the past several years. Yahoo’s peak value was $125 billion in 2000, and even in 2008, Microsoft wanted to pay $45 billion for the company, so a $4.8 billion sale price pales in comparison.

This deal is also more or less the logical extension of Verizon’s $4 billion deal last year to acquire AOL, which is still run by Tim Armstrong, whom Yahoo CEO Marissa Mayer worked with at Google back in the day. Yahoo and AOL, after all, are fairly similar old-school content-and-advertising internet businesses. Here is the reaction from a competitor: Sprint CEO Marcelo Claure said Monday that Verizon’s purchase of Yahoo is just the latest in a long history of deals by telecom firms trying to get into the content business, none of which have panned out.

Although this deal sounds like a sad end to Yahoo, an icon among the early Internet players, Marissa Mayer tried to paint it as a success. Why not? She will walk out with almost $50m if fired from her job. It is a big let-down for her, especially after the high expectations when she was hired in 2012. She was supposed to turn this company around with big revenue growth. None of that happened. Rather, she spent a ton of money for very little return. Take the case of Tumblr, which was mostly a waste (after paying $1.1B). As they say, you ruin the company and then walk out with a huge amount of money. Sad but true.

As of today, the business that will stay behind post-acquisition by Verizon includes Yahoo’s cash, its shares in Alibaba and Yahoo Japan, Yahoo’s convertible notes, certain minority investments, and Yahoo’s non-core patents (called the Excalibur portfolio). These remaining businesses will be rebranded after the completion of the acquisition in early 2017.

It will be interesting to see how Verizon brings some synergy across its 3 similar and overlapping offerings – AOL, Yahoo and its own go90.

Musings on a June morning!

Here in Silicon Valley, every day brings some new tech. news that gets your attention. A sample of current news:

  • Uber starts helicopter service in São Paulo, Brazil. That city of 20 million people gets horrific gridlock that can stretch for hours. Hence people are taking advantage of a quick helicopter ride, say to the airport to catch a flight on time. Apparently there are plenty of helicopters sitting idle, and their operators are taking advantage of such a service. Uber is really disrupting the transportation industry.
  • Elon Musk wants to combine two of his companies – Tesla and SolarCity. Actually Tesla is acquiring SolarCity for $3 billion. Musk argues that the combo would be good, as Tesla is looking for future solar-powered batteries for its cars.
  • There are still questions on the Microsoft-LinkedIn deal from last week – a $26.2 billion price tag. Big mergers in the past decades have not shown great results. Remember Compaq+DEC ($9.6B in 1998), HP+Compaq ($19B in 2002), HP+EDS ($13.9B in 2008), Oracle+PeopleSoft ($10.3B in 2004), AOL+Time Warner ($181B in 2000), and Symantec+Veritas ($13.5B in 2005)? Then there are the big write-offs such as Microsoft+Nokia or HP+Autonomy. Only a few were winners. The rest resulted in depressed share prices, corporate confusion, and layoffs. So we will have to see how both Dell+EMC ($67B last fall) and Microsoft+LinkedIn perform in years to come.
  • Nikesh Arora quits SoftBank after two years, because he is not going to be the CEO as expected. Masayoshi Son, the founder/CEO, said he is going to stay for another 10 years as CEO. The truth seems to be that board members questioned Arora’s investments and a possible conflict of interest, as he advises Silver Lake, a competitor. He invested heavily in Indian startups like Ola (Uber competitor) and Snapdeal.
  • Video messaging is becoming a hot technology area. Snapchat, Facebook, and Twitter are all jumping into that field. But the leader seems to be a Berlin company called Dubsmash, which has gained over 150 million users over the last couple of years. They seem to lead in the content and delivery game and now have a new release emphasizing the video messaging platform.
  • Digital advertising is gaining ground big time with media companies. The shift to video ads on platforms like YouTube and Facebook is growing fast. I am advising a company called Strike Social, which is leading in the TrueView ad campaign buying and management business. I do see how fast it is growing. Using technology as a differentiator to provide cost optimization is the key here.