Monday, December 10, 2012

Trevni: A columnar file format for Cloudera Impala

Trevni is a columnar file format developed by Doug Cutting. It is intended to be the core storage engine of the Cloudera Impala project, which delivers real-time queries on data stored in the Hadoop file system.

Trevni features:
  1. Inspired by the CIF/COF column-oriented storage architecture. A CIF-based columnar layout works well with MapReduce.
  2. Stores data by column, which compresses well because the values in a single column are all of the same kind. Retrieval is also fast: a query that needs only a few columns scans far less data than it would in a row store.
  3. To achieve scalable, distributed query evaluation, data sets are partitioned into row groups, each containing a distinct collection of rows. Within each row group, data is stored vertically, as in a column store. Figure 1 below illustrates this.
  4. Maximizes the size of row groups in order to reduce disk seek latency. Each row group can be larger than 100 MB, so most reads are sequential and fewer seeks are needed.
  5. Each row group is written as a separate file. All values of a column are written contiguously to optimize I/O performance.
  6. Reducing the number of row groups reduces the number of HDFS files created, and hence the load on the NameNode. It is therefore better to have few files per data set, which means fewer row groups.
  7. Allows dynamic access to data within a row group. It also supports co-location of columns within a row group, as in the CIF storage architecture.
  8. Supports nested column structures for semi-structured data, in the form of arrays and dictionaries.
  9. Application-specific metadata can be maintained at every level: file, column, and block. Checksums are used at the block level to provide data integrity.
  10. Supports many data types: int, long, float, double, string, and a byte type for complex aggregated data. It also supports NULLs. A NULL occupies zero bytes, which saves disk space and is one of the key differences between column storage and row storage.
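
To make the block-level checksum idea in item 9 concrete, here is a minimal Python sketch. It is not Trevni's actual block layout: the function names `write_block` and `read_block` are hypothetical, and CRC-32 with a 4-byte trailer is an assumption chosen only for illustration.

```python
import struct
import zlib


def write_block(payload: bytes) -> bytes:
    """Return the payload followed by a 4-byte big-endian CRC-32 (assumed layout)."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", checksum)


def read_block(block: bytes) -> bytes:
    """Split off the trailing checksum, verify it, and return the payload."""
    if len(block) < 4:
        raise ValueError("block too short to contain a checksum")
    payload = block[:-4]
    (stored,) = struct.unpack(">I", block[-4:])
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("block checksum mismatch")
    return payload
```

A reader that calls `read_block` on every block notices corruption as soon as it touches the damaged block, rather than silently returning bad column values.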

Figure 1: Illustrates the row group concept in a columnar store.
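
The row group concept can also be sketched in a few lines of Python. This is only an in-memory illustration, not Trevni's file format: the function name `to_row_groups` and the `group_size` parameter are hypothetical, and every row is assumed to have the same set of column names.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
RowGroup = Dict[str, List[Any]]


def to_row_groups(rows: List[Row], group_size: int) -> List[RowGroup]:
    """Partition rows into groups, then lay out each group column by column."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups: List[RowGroup] = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within one group, all values of a column are kept together.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        groups.append(columns)
    return groups


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
groups = to_row_groups(rows, group_size=2)
# The first group holds rows 1 and 2 as two column lists; the second holds row 3.
```

Reading only the `id` column then means touching one list per row group instead of every field of every row, which is the access pattern the features above describe.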
