Wednesday, January 2, 2013

Big Data Analysis with Project Spark and Shark


SPARK:
  • Developed by the AMPLab, UC Berkeley.
  • Key developers include Michael Franklin and Matei Zaharia.
  • An alternative to the MapReduce parallel processing engine.
  • In-memory storage enables very fast iterative queries by removing the temporary writes of intermediate data that MapReduce jobs perform.
  • In Hadoop, data is written to local disk after each map and shuffle, which increases execution time. Spark removes this bottleneck by keeping the results in memory.
  • Spark writes data to RDDs (Resilient Distributed Datasets), which can live in memory; this is what gives Spark its execution improvements.
  • Up to 100x faster than Hadoop.
  • Compatible with the existing Hadoop ecosystem and works well with existing HDFS deployments.
  • Spark can co-exist with an existing Hadoop cluster using the Mesos cluster manager.
  • Better suited for iterative algorithms such as Logistic Regression and Matrix Factorization than for plain data processing workloads.
  • Developed in Scala, with clean APIs in Java and Scala. Python APIs will be added soon.
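To make the in-memory iterative pattern above concrete, here is a minimal sketch in Scala against the (circa-2013) Spark API. The input path, the parsing logic, and the gradient update step are hypothetical placeholders, and running it requires a Spark installation; it is a sketch of the caching idea, not a reference implementation.

```scala
// Sketch only: assumes a Spark installation on the classpath.
// The input path and the update logic below are hypothetical.
import spark.SparkContext
import spark.SparkContext._

object IterativeSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "IterativeSketch")

    // Load the dataset once and cache the RDD in memory.
    val points = sc.textFile("hdfs://namenode/data/points.txt")
                   .map(line => line.split(" ").map(_.toDouble))
                   .cache()   // later iterations read from memory, not HDFS

    var w = 0.0
    for (i <- 1 to 10) {
      // Each pass scans the in-memory RDD -- no intermediate
      // writes to local disk as in a chain of MapReduce jobs.
      val grad = points.map(p => p(0) * (p(1) - w)).reduce(_ + _)
      w += 0.1 * grad   // hypothetical update step
    }
    println("final w: " + w)
  }
}
```

The key call is `cache()`: without it, every iteration of the loop would re-read the file from HDFS, which is exactly the MapReduce-style bottleneck described above.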
SHARK:
  • Meant as a Hive replacement with a high degree of speed improvement.
  • Built on top of the SPARK data-parallel execution engine.
  • Uses a SQL-like declarative language and runs on the SPARK infrastructure.
  • Can execute complex queries using JOINs and GROUP BY.
  • Uses a column-oriented store to improve performance; the columnar compression also reduces storage.
  • All the queries run in memory to improve the performance.
  • Shark provides decent integration with machine learning through RDDs (Resilient Distributed Datasets); users can call these functions using SQL-like syntax, which minimizes the complexity involved in using machine learning.
  • The entire software stack (SHARK + SPARK + ecosystem) is called BDAS (Berkeley Data Analysis Stack).
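As an illustration of the SQL-like interface described above, the queries look like ordinary HiveQL; in Shark, naming a table with a `_cached` suffix keeps it in cluster memory. The table and column names below are hypothetical, and running this assumes a Shark deployment backed by a Hive metastore.

```sql
-- Hypothetical tables; assumes Shark is running against a Hive metastore.
-- The "_cached" suffix tells Shark to keep the table in cluster memory.
CREATE TABLE logs_cached AS SELECT * FROM logs;

-- A complex query with a JOIN and a GROUP BY, executed on Spark:
SELECT u.name, COUNT(*) AS hits
FROM logs_cached l JOIN users u ON (l.user_id = u.id)
GROUP BY u.name;
```

Because `logs_cached` lives in memory, repeated analytical queries against it avoid re-reading the underlying files, which is where the speed improvement over Hive comes from.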

