Data Platform 2.0 at InMobi

Srikanth Sundarrajan on April 14, 2014

Over the last four years, InMobi has been a heavy consumer of open source big data technologies and a leading contributor to several open source projects in this area. Our data systems enable our entire business team to seamlessly access hundreds of TBs of data stored in our Hadoop warehouse. Based on our learnings and experience, we are making further improvements to our data systems.

To truly unlock the value of the data hidden in our petabyte-scale Hadoop data store, we are making it simple to build big data applications over Hadoop. We are leading the effort to develop a pipeline designer, along with the community, in Apache Falcon. Through this effort, we hope to cut down the development and validation time of big data applications and enable our researchers and engineers to run experiments more effectively with reduced cycle times. The platform will enable users to seamlessly run applications over Hadoop MR, Apache Storm, Apache Tez or Apache Spark on YARN/HDFS, as appropriate. Hear more on this at Hadoop Summit 2014, San Jose, where we discuss the details in our talk "Hadoop First ETL Authoring using Apache Falcon".
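To give a flavour of how pipelines are declared in Falcon today, here is a minimal process entity in Falcon's XML format. All of the specific names below (the cluster, the feeds, the workflow path and the dates) are illustrative assumptions, not an actual InMobi pipeline:

```xml
<!-- A sketch of a Falcon process entity: an hourly ETL step that
     consumes one feed, produces another, and delegates execution
     to an Oozie workflow. Names and paths are hypothetical. -->
<process name="sample-etl-process" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary-cluster">
      <validity start="2014-01-01T00:00Z" end="2014-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="raw-events" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="aggregated-events" instance="today(0,0)"/>
  </outputs>
  <workflow engine="oozie" path="/apps/etl/workflow.xml"/>
</process>
```

The pipeline designer effort aims to raise the level of abstraction above entity definitions like this one, so that the same logical pipeline can be targeted at MR, Storm, Tez or Spark without the author hand-writing engine-specific workflows.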


Analytics and Business Intelligence systems are an important focus area for us. Hadoop-based warehouses operate well over large datasets in the order of TBs and PBs and scale fairly linearly. Despite recent improvements in storage structures for Hadoop warehouses, such as ORC, queries over Hadoop still typically adopt a full-scan approach, which tends to be costly in terms of query turnaround times. On the other end of the spectrum are columnar SQL databases, which lend themselves well to interactive SQL queries over reasonably small datasets in the order of 10-100s of GB. Choosing between these data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.

We are building a system at InMobi, Grill, to solve precisely this problem. Grill has a simple metadata layer that provides an abstract view over tiered data stores. The system automatically picks the right data store based on the cost of the query and its latency goals. Grill also has query life-cycle management functions to schedule, execute and service user queries. Hear more on this topic in our talk "HQL over Tiered Data Warehouse" at Hadoop Summit 2014, San Jose.
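The routing idea at the heart of this can be sketched in a few lines. This is not Grill's actual implementation or API — just a minimal illustration, with made-up tier names and thresholds, of picking a store from estimated scan size and a latency goal:

```python
from dataclasses import dataclass

@dataclass
class Storage:
    name: str
    max_scan_gb: float   # practical data-volume ceiling for this tier
    latency_secs: float  # typical query turnaround on this tier

# Hypothetical tiers: a columnar SQL store for small interactive
# queries, and the Hadoop warehouse for arbitrarily large scans.
TIERS = [
    Storage("columnar-db", max_scan_gb=500, latency_secs=5),
    Storage("hadoop-hive", max_scan_gb=float("inf"), latency_secs=600),
]

def pick_storage(scan_gb: float, latency_goal_secs: float) -> str:
    """Return the first (cheapest) tier that can hold the scanned data
    and meet the latency goal; fall back to the most capable tier."""
    for tier in TIERS:
        if scan_gb <= tier.max_scan_gb and tier.latency_secs <= latency_goal_secs:
            return tier.name
    return TIERS[-1].name
```

Under these assumed numbers, a 50 GB interactive query routes to the columnar store, while a 10 TB scan falls through to the Hadoop warehouse. The real system layers an abstract metadata view on top, so users write one HQL query and never name a tier explicitly.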