Thursday, November 15, 2012

Projects Apache Drill and Impala want to SQLize Hadoop for real-time data access

There are many ongoing efforts to enable real-time data access in the Hadoop ecosystem. It is evident that Hadoop is becoming synonymous with the de facto Big Data architecture within enterprises. The ecosystem is sprawling and growing rapidly, as it gives many different companies and communities a fundamental way to make sense of petabytes of unstructured data. Organizations in aerospace, weather, genetics, carbon footprint analysis, retail, and finance use Hadoop to solve their Big Data problems.

Hadoop is used primarily as a batch-processing engine to crunch petabytes of data; it was not meant for real-time processing. This has triggered companies to think differently and let Hadoop serve data more like a relational database, with real-time queries.

There are two recent initiatives, from MapR and Cloudera, to make Hadoop real time using SQL syntax, along the lines of Hive and HQL. The concept is not entirely new: analytic database vendors such as Greenplum and Aster Data already provide SQL-like interfaces over MapReduce for large-scale data analytics. However, their design principles are different.

The Apache Drill project is inspired by Google Dremel. Dremel is a scalable interactive query system used daily by thousands of Google engineers to query their large-scale data sets; it takes advantage of Google's GFS and BigTable and is built on top of them, and Google's BigQuery is based on Dremel and exposed as a service. There are other open source projects aimed at real-time processing, such as Storm and S4. The real difference is that Storm and S4 are streaming engines and are not meant for ad-hoc queries, while Dremel is architected for querying very large data sets interactively, in real time.
Apache Drill is trying to repeat Dremel's success at Google within the Hadoop ecosystem. The design goal of Drill is to scale to as many as 10,000 servers and to query petabytes of data, with trillions of records, interactively within seconds. The project is backed by MapR, one of the most visible vendors in the Hadoop world.

The Apache Drill architecture is designed to interact and scale well with the existing Hadoop ecosystem, taking advantage of existing technologies rather than reinventing them as an entirely different product. It has four main components:

    Nested query language : A purpose-built nested query language that parses the query and builds an execution plan. In Drill the language is called DrQL, a declarative language much like SQL and HQL. Drill also supports the MongoDB query language as an add-on.

    Distributed execution engine : Takes care of the physical plan, the underlying columnar storage, and failover. The engine is along the lines of Dryad, and Drill uses columnar storage as Dremel does.

    Nested data formats : This layer is built as a pluggable model so that Drill can work with multiple data formats, both free-form and schema-based: schema-less JSON and BSON, and schema-based Protocol Buffers, Avro, JSON, and CSV.

    Scalable data sources : This layer supports various data sources and is designed with Hadoop and NoSQL in mind.
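The columnar storage and nested data formats mentioned above can be illustrated with a small sketch. The Python below is a toy illustration with made-up sample records, not Drill's actual code or on-disk format; it shows why a column-oriented layout helps interactive queries: an aggregation over a single field of nested JSON records only has to touch that field's column, instead of deserializing every whole record.

```python
import json

# A few nested JSON records of the kind Drill is designed to query
# (hypothetical sample data, not Drill's actual format).
records = [
    {"user": {"name": "alice", "country": "US"}, "clicks": 3},
    {"user": {"name": "bob",   "country": "IN"}, "clicks": 7},
    {"user": {"name": "carol", "country": "US"}, "clicks": 5},
]

# Row-oriented storage keeps whole records together; a query that
# touches one field must still deserialize every record.
row_store = [json.dumps(r) for r in records]

def total_clicks_row(rows):
    return sum(json.loads(r)["clicks"] for r in rows)

# Columnar storage keeps each (possibly nested) field in its own
# array, so the same query reads only the "clicks" column -- the
# idea behind Dremel's and Drill's columnar layout.
column_store = {
    "user.name":    [r["user"]["name"] for r in records],
    "user.country": [r["user"]["country"] for r in records],
    "clicks":       [r["clicks"] for r in records],
}

def total_clicks_columnar(columns):
    return sum(columns["clicks"])

print(total_clicks_row(row_store))          # 15
print(total_clicks_columnar(column_store))  # 15
```

Both paths give the same answer, but the columnar version never touches the `user` fields, which is what makes scans over a few columns of very wide, nested data sets fast.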

The other vendor looking to take advantage of this space is Cloudera, also a well-known brand in the Hadoop ecosystem. Very recently, Cloudera announced a project called Impala, based on the Google Dremel architecture just like Drill. It is already in beta, and Cloudera promises a production release by the first quarter of 2013.

Unlike Hive, Impala accesses the data directly through its purpose-built query engine to provide more real-time access. Impala queries are not converted to MapReduce jobs at runtime, as Hive queries are.
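This difference can be sketched in a few lines of Python. The sketch is a toy illustration, not code from either engine: the same GROUP BY aggregation is computed once through explicit map and reduce phases, as Hive would schedule it on MapReduce, and once as a single direct scan, as Impala's engine does.

```python
from itertools import groupby

# Toy data: (country, clicks) pairs, standing in for rows in a table.
rows = [("US", 3), ("IN", 7), ("US", 5), ("IN", 2)]

# Hive-style: SELECT country, SUM(clicks) ... GROUP BY country is
# compiled into map and reduce phases (greatly simplified here).
def map_phase(row):
    country, clicks = row
    return (country, clicks)          # emit key/value pairs

def reduce_phase(pairs):
    out = {}
    # the shuffle sorts pairs by key before reducers sum each group
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        out[key] = sum(v for _, v in group)
    return out

mapped = [map_phase(r) for r in rows]
print(reduce_phase(mapped))           # {'IN': 9, 'US': 8}

# Impala-style: the query engine scans the data directly and
# aggregates in one pass, with no MapReduce job scheduling in between.
def direct_scan(rows):
    out = {}
    for country, clicks in rows:
        out[country] = out.get(country, 0) + clicks
    return out

print(direct_scan(rows))              # {'US': 8, 'IN': 9}
```

Both produce the same aggregates; the point is that the direct scan avoids the job-launch and shuffle overhead of the MapReduce path, which is where the latency saving for interactive queries comes from.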

Impala allows users to query data on both HDFS and HBase, with built-in support for joins and aggregation functions. The query syntax is very similar to SQL and HQL, since Impala uses the same metadata store as Hive. Like Drill, Impala supports both sequence files and non-sequence files: CSV files, compressed formats such as Snappy, GZIP, and BZIP, and additional formats such as Avro, RCFile, and LZO text files. According to the Cloudera blog, Impala also plans to support Trevni, a new columnar format developed by Doug Cutting.
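As a small illustration of the compressed text formats listed above, the Python sketch below (not Impala code) writes and then queries a GZIP-compressed CSV file held in memory. An engine such as Impala does this decompress-then-parse step transparently when scanning compressed files.

```python
import csv
import gzip
import io

# Write a small GZIP-compressed CSV file in memory -- one of the
# compressed text formats the article says Impala can query.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"country,clicks\nUS,3\nIN,7\nUS,5\n")

# Reading it back requires decompressing before parsing; a query
# engine does the equivalent of this for every compressed file split.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    reader = csv.DictReader(io.TextIOWrapper(gz, encoding="utf-8"))
    total = sum(int(row["clicks"]) for row in reader)

print(total)  # 15
```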

Cloudera is betting big on Impala, as it can co-exist with the existing Hadoop ecosystem and provides a better SQL-like interface for querying petabytes of data in real time. Users will still use Pig, Hive, and MapReduce for more complex batch analysis in cases where a declarative language is not an exact fit. All the ecosystem components can co-exist and provide a rich platform for Big Data crunching and analysis. Projects like Drill and Impala can fill the void, strengthening the Hadoop ecosystem and increasing its adoption across enterprises.
