The Hadoop ecosystem has evolved and been adopted by many enterprises to solve their Big Data needs. However, with the current set of development tools, getting Hadoop running and doing what the user wants is not a trivial task. On one hand, many start-ups are making Hadoop real-time and more suitable for real-time query processing, while others are making the entire ecosystem simpler to use. Hadoop is not a platform only for querying data; it also helps solve a diverse set of use cases, from log processing to genome analysis. The Hadoop ecosystem is fairly complex and is maturing to handle a wide variety of problems. So beyond real-time queries, Hadoop can be applied to many different Big Data needs, and all of them require a fairly simple development environment to get started. MortarData is one such start-up trying to ease Hadoop development by many folds.
MortarData CEO K Yung and his team have been working on this technology for a while, and their simple USP is "getting ready with Hadoop in one hour." Mortar launched its Hadoop platform as a service on Amazon. Amazon also offers Amazon Elastic MapReduce, which is a more general Hadoop platform compared to what Mortar is trying to achieve. Mortar, on the other hand, built a Hadoop infrastructure that can be driven with simple Python or Pig scripts. Mortar also lets users share public datasets and analysis code so that everyone can get started easily: anyone interested in sharing a public dataset, or code for analyzing large-scale data sets, can do so via GitHub. Besides HDFS, the platform supports other storage back ends such as Amazon S3 and MongoDB; data can be pulled from these external stores into HDFS to run MapReduce jobs as required. The platform also allows users to install Python-based analytical tools like NumPy, SciPy and NLTK. According to Yung, more tools will be added to the platform over time.
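To give a flavor of the kind of Python logic such a platform executes, here is a minimal word-count sketch in the classic map/shuffle/reduce style. This is plain Python simulating the Hadoop pipeline locally, not Mortar's actual API; the function names are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all the counts collected for one word.
    yield word, sum(counts)

def run_job(lines):
    # Simulate the map -> shuffle/sort -> reduce pipeline in-process.
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))  # the "shuffle/sort" step groups keys
    result = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for key, value in reducer(word, (count for _, count in group)):
            result[key] = value
    return result

if __name__ == "__main__":
    print(run_job(["hadoop is big", "big data is big"]))
    # → {'big': 3, 'data': 1, 'hadoop': 1, 'is': 2}
```

On a real cluster the map and reduce functions run in parallel across machines and the framework handles the shuffle, but the programming model a user writes against is essentially this simple.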
I think more and more people will use these kinds of platforms, as they remove the whole Hadoop installation process and the burden of managing a Hadoop cluster, which is by itself complex. However, a simple development environment is not a big differentiator. These companies need to focus on auto-scaling and other ways to minimize the cost of running Hadoop clusters based on past workloads. Other areas could be simpler diagnostic and management tools that make debugging fairly straightforward, and pre-configured ecosystem libraries instead of manual installation. These are the couple of core areas where I think most of the work will be done in the future.