As I have discussed earlier, one of the important disruptions currently happening in the Hadoop ecosystem is the push to make Hadoop more real time. Cloudera has already developed Impala, which lets users run real-time queries. MapR, with the Apache Drill project, provides a declarative, SQL-like language to query the underlying Hadoop data in real time. Google already has its own proprietary analytics engine and offers a service called BigQuery to query petabytes of data. Facebook, on the other hand, is putting a lot of effort into improving the performance of Hive and its ecosystem.
There are other start-ups, like Platfora and QuBole, taking a different route to solving this big puzzle while still using Hadoop. According to Ben Werther, founder of Platfora, businesses cannot wait days and weeks to get answers. The answers have to be instantaneous. The true innovation lies in the data agility and exploration made possible by Hadoop.
Hadoop is a great platform for solving big data problems, and it has one of the largest ecosystems in big data. But the real issue with Hadoop is that it is not meant for everyone. The platform is still restricted to a handful of researchers and calls for a substantial amount of learning. This is the gap that most of these companies want to fill, each in a different way.
Platfora recently raised $20 million in Series B funding and has been working for more than a year to make Hadoop available to business users. Other first-generation Hadoop start-ups have built technologies around Hadoop, but most of them run queries in a batch-oriented way. This is where Platfora and QuBole differ in their implementation. Platfora has built an analytics platform from the ground up. Its key components are Vizboards for analytics, a scale-out in-memory data processing engine, and a Hadoop data refinery. Platfora wants to change the way traditional BI works.
One more start-up, QuBole, is heading along the same lines. It has released the QuBole data platform as a service, which runs on Amazon EC2 and Hadoop. QuBole founders Ashish Thusoo and Joydeep Sen Sarma came from the Facebook data infrastructure team, where they used and implemented the Hive platform, so they know the bottlenecks of Hadoop and Hive. One of the main design goals of QuBole is to make Hadoop more real time and to offer a simple platform on which business users can run their queries.
To make queries run interactively, they use HQL and Hive, but the entire platform is designed to run queries in real time. The platform scales seamlessly up and down. Users do not need to provision the number of systems upfront. In a typical scenario, it is unlikely that a user knows how many machines a query requires. It is the underlying platform’s job to allocate and free machines in the cluster on demand, based on the workload and past statistics.
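The idea of sizing a cluster from the current workload and past statistics can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how QuBole actually implements it: the class name `ClusterAutoscaler`, the `tasks_per_node` capacity figure, and the rule of taking the larger of the current load and a short moving average are all hypothetical.

```python
import math
from collections import deque


class ClusterAutoscaler:
    """Hypothetical sketch: pick a cluster size from current and recent load."""

    def __init__(self, tasks_per_node, min_nodes=1, max_nodes=100, history=5):
        self.tasks_per_node = tasks_per_node  # assumed capacity of one machine
        self.min_nodes = min_nodes
        self.max_nodes = max_nodes
        # "Past statistics": the last few observed workloads.
        self.recent_loads = deque(maxlen=history)

    def target_nodes(self, pending_tasks):
        """Return how many machines the cluster should have right now."""
        self.recent_loads.append(pending_tasks)
        average_load = sum(self.recent_loads) / len(self.recent_loads)
        # Scale up at once on a spike, scale down gradually as the average falls.
        demand = max(pending_tasks, average_load)
        wanted = math.ceil(demand / self.tasks_per_node)
        # Never go below the floor or above the ceiling set by the operator.
        return max(self.min_nodes, min(self.max_nodes, wanted))
```

The user only submits queries; a loop in the platform would call `target_nodes` with the current queue length and add or release machines to match the returned number.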
It is good to see non-Hadoop technologies, such as the in-memory database SAP HANA and the recently announced Amazon Redshift, also trying to solve big data in a way that looks more like a traditional transactional database.
We need to see how far Hadoop can leapfrog in solving this. As of now, there is a lot of innovation going on to make Hadoop more real time and simpler for many different use cases. Recently, Microsoft has also provided support for Hadoop on its Azure platform. This implementation is pure vanilla Hadoop. Will we see Microsoft bring something innovative to the Hadoop table, or will it simply follow the crowd?