Tuesday, October 9, 2012

An alternative framework to Mahout: Crab

Every time you purchase an item from an online store such as Amazon or BestBuy, you may have noticed that other items are displayed as recommendations for you. For example, when I buy a book from Amazon, it recommends a list of books purchased by other shoppers with similar interests. This is possible because the online store processes millions of data points and finds the items purchased by similar users who share a common buying behaviour. This helps the online stores sell more to their users based on user preferences and online behaviour. Most of the time, users do not know exactly what items they are looking for on the net; the recommendation system helps them discover similar items based on their interests.

In today’s overcrowded world of millions of items, it is very difficult to search and narrow down what we need. In that context, the online stores filter the data and present it in the most pleasant way. At times we discover items we might never have heard of.

The same is true beyond online retail. On most social sites, we discover friends and people with similar interests. This is done by processing all the social interests expressed over the net and finding similarities between them. On LinkedIn you will find jobs, professionals and groups matching your interests, all facilitated by an underlying recommendation infrastructure. Building a recommendation system can become a fairly complex process as the number of variables increases; the most important variable, of course, is the amount of data to be processed.

Mahout plays a very important role in solving this problem, and it provides a comprehensive set of tools for machine learning. But it is not trivial to build applications with Mahout. This is where Crab fits the bill: its main objective is to provide a very simple way to build a recommendation engine.

Crab is a flexible, fast recommender engine for Python that integrates classic information-filtering recommendation algorithms with the world of scientific Python packages (NumPy, SciPy, Matplotlib).

The project was started in 2010 by Muricoca Inc. as an alternative to Mahout. It is developed in Python, so it is much easier for an average programmer to work with than Mahout, which is built in Java. It implements user-based, item-based and slope-one collaborative filtering algorithms.
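To give a feel for what user-based collaborative filtering does under the hood, here is a minimal pure-Python sketch (this is not Crab's actual API, and the users, items and ratings are made up for illustration):

```python
from math import sqrt

# Toy preference data: user -> {item: rating} (made-up values)
ratings = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob":   {"book_a": 4.0, "book_b": 3.5, "book_d": 5.0},
    "carol": {"book_b": 2.0, "book_c": 5.0, "book_d": 4.0},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Score items the user has not seen by the similarity-weighted
    ratings of the other users, best first."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine_similarity(user, other)
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # → ['book_d']
```

A library like Crab wraps exactly this kind of model/similarity/recommender pipeline behind a few classes, so you never write the loops yourself.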

A demo example can be found at:

Wednesday, October 3, 2012

Simplified Hadoop frameworks

There are many frameworks available for reducing the complexity involved in writing MapReduce programs for Hadoop.

In this article I discuss a few of them; most are actively developed and have many production implementations. These frameworks will increase your productivity, as they provide high-level features that wrap the low-level complexity. A few of them even allow you to write plain Java code without thinking in MapReduce at all.

Before using them, make sure you evaluate and understand the frameworks well so that you do not end up selecting the wrong one. Also think of your long-term needs. In my experience, any framework covers the basic tenets, but as you explore and progress you may have trouble doing things the right way, and it gets even worse if the features you need are not actively supported by the framework. It is no different from selecting a framework for any other job; all the usual rules apply.

1. Cascading
2. Pangool
3. Pydoop
4. MRJob
5. Happy
6. Dumbo

Cascading :

  • It is an abstraction layer that works on top of Apache Hadoop.
  • Can create and execute complex workflows.
  • Works with any JVM-based language (Java, Ruby, Clojure).
  • Primarily, workflows are executed using pipes.
  • It follows the “source-pipe-sink” paradigm, where data is captured from different sources and flows through reusable ‘pipes’ that perform the data analysis.
  • Developers can write JVM-based code without really thinking in MapReduce.
  • Supported commercially by Concurrent Inc.
  • URL : http://www.cascading.org

Pangool :

  • Works on most Hadoop distributions.
  • Easier MapReduce development.
  • Support for tuples instead of just key/value pairs.
  • Efficient and easy-to-use secondary sorting.
  • Efficient, easy-to-use reduce-side joins.
  • Performance and flexibility like Hadoop without really worrying about Hadoop's complexity.
  • First-class multiple inputs and outputs.
  • Built-in serialization support with Thrift and Protostuff.
  • Commercial support from Datasalt.
  • URL : http://pangool.net/

Pydoop :

  • Provides a simple Python API for MapReduce.
  • It is a CPython package, and being a CPython module it provides access to an extensible set of Python libs such as NumPy and SciPy.
  • A more interactive, high-level Hadoop API is available for executing complex jobs.
  • Provides a high-level HDFS API.
  • Developed by CRS4 (http://www.crs4.it/).
  • URL : http://pydoop.sourceforge.net

MRJob :

  • Simplified MapReduce scripts.
  • Developed by Yelp and actively used in many production environments.
  • Built for Hadoop, using Python.
  • Simple to use compared to direct Python streaming.
  • Can run Hadoop jobs on Amazon Elastic MapReduce (EMR).
  • Can be used for running complex machine learning algorithms and log processing on a Hadoop cluster.
  • URL: https://github.com/Yelp/mrjob

Happy :

  • Simplified Hadoop framework built on Jython.
  • Map-reduce jobs in Happy are defined by subclassing happy.HappyJob and implementing map(records, task) and reduce(key, values, task) functions.
  • Using run() you can execute the job.
  • It can also be used for complex data processing and has been implemented in production environments.
  • URL : http://code.google.com/p/happy/

Dumbo :

  • Dumbo is also built using Python.
  • Python API for writing MapReduce jobs.
  • All the low-level features are nicely wrapped and exposed as Unix pipes.
  • It has many nice built-in functions/APIs for writing high-level MapReduce scripts.
  • URL : https://github.com/klbostee/dumbo

Monday, October 1, 2012

MapReduce code in Python for getting the highest score

My Data set : ( scores.dat )


mapper.py ( /user/hduser/mapper.py )

#!/usr/bin/env python

import sys

# emit "name<TAB>score" for every input line of the form "name,value,score"
for line in sys.stdin:
   (val1, val2, val3) = line.strip().split(",")
   print "%s\t%s" % (val1, val3)

reducer.py ( /user/hduser/reducer.py )

#!/usr/bin/env python

import sys

# track the current name and the highest score seen for it so far
(last_name, max_val) = (None, -sys.maxint)

# input comes from STDIN, already sorted by name (the shuffle/sort phase)
for line in sys.stdin:
    # remove leading and trailing whitespace, then split key and value
    (name, val) = line.strip().split("\t")

    if last_name and last_name != name:
        # a new name begins: emit the maximum for the previous name
        print "%s\t%s" % (last_name, max_val)
        (last_name, max_val) = (name, int(val))
    else:
        (last_name, max_val) = (name, max(max_val, int(val)))

# emit the maximum for the final name
if last_name:
    print "%s\t%s" % (last_name, max_val)

Test it first in the Unix shell:
cat scores.dat | python mapper.py | sort -k1,1 | python reducer.py

akash	122
nikhil	89
sachin	145
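The shell pipeline above can be simulated in plain Python to see why the `sort -k1,1` step matters: the reducer only compares adjacent records, so it relies on all lines for a given name arriving together. (The sample records below are made up to match the output shown above; the middle field is unused.)

```python
# Simulate mapper | sort | reducer for the highest-score job (Python 3 sketch)
lines = ["sachin,45,145", "akash,30,122", "sachin,50,99", "nikhil,20,89"]

# Map phase: emit (name, score) for each record
mapped = []
for line in lines:
    name, _, score = line.split(",")
    mapped.append((name, int(score)))

# Sort phase: like `sort -k1,1`, groups all records of a name together
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: compare adjacent records only, as the streaming reducer does
results = []
last_name, max_val = None, None
for name, score in mapped:
    if last_name is not None and last_name != name:
        # a new name begins: emit the maximum for the previous name
        results.append((last_name, max_val))
        max_val = score
    else:
        max_val = score if max_val is None else max(max_val, score)
    last_name = name
if last_name is not None:
    results.append((last_name, max_val))

print(results)  # → [('akash', 122), ('nikhil', 89), ('sachin', 145)]
```

Without the sort, the two "sachin" records would be separated by other names and the reducer would emit "sachin" twice, each time with a partial maximum.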

MapReduce command for running the job:
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py -input /user/hduser/scores -output /user/hduser/scores-out

Attending PyCon India 2012

PyCon India 2012:

It was indeed a great experience participating in PyCon India 2012 and presenting a talk on Big Data, Hadoop and Python.


I am sharing my presentation.

Please keep watching this blog. I will also share all the code presented during the talk.