Hadoop++ / HAIL

Hadoop++ / HAIL

This page describes my contributions to Hadoop++/HAIL. See the detailed webpage at Saarland University here.

Hadoop MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. Hadoop MapReduce, however, suffers from poor performance due to inefficient scan-oriented data processing using a hard-coded query processing pipeline. This is undesirable for several analytical tasks in which users want fast access to their data.

The goal of Hadoop++ is to significantly improve the performance of Hadoop MapReduce by integrating database techniques such as data layouts, indexes, and join processing into Hadoop's storage and query processing engines. The challenge is to scale these techniques to very large data sets, preserve the fault-tolerance, and keep the MapReduce interface intact. In Hadoop++, we represent the Hadoop's query processing pipeline as a database-style physical query execution plan and introduce two additional data processing techniques to improve this plan: (i) indexed data access for quickly accessing the relevant portions of a dataset, and (ii) co-partitioned join processing for efficiently combining two or more datasets. As a result, the performance of Hadoop++ is 20x faster than Hadoop and even matches the performance of a well-configured parallel DBMS. paper

The Hadoop Aggressive Indexing Library (HAIL), a follow-up of Hadoop++, takes indexing in Hadoop even further. HAIL utilizes idle CPU cycles in the Hadoop MapReduce pipeline to aggressively create as many indexes as possible, each for a different data block replica when uploading data. This has several consequences. First, several indexes are available at query time for incoming MapReduce jobs. As a result, there is a higher likelihood of being able to do an index scan, which is much faster than a full scan. Second, there is no preparation time to create the indexes, as in Hadoop++. Instead, indexes are available as soon as the data is uploaded. Finally, since HAIL integrates index creation tightly with the Hadoop upload pipeline, the overhead of creating the index is negligible in the overall data upload costs. As a result, indexes are created almost for free, in contrast to Hadoop++, which has very high index creation costs. HAIL runs up to 68x faster than Hadoop while having comparable or better (due to binary representation) data upload time. paper

Related Publications

Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich
WWHow! Freeing Data Storage from Cages
CIDR 2013, Asilomar, USA.

Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad
Only Aggressive Elephants are Fast Elephants
VLDB 2012/PVLDB, Istanbul, Turkey.

Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich
Trojan Data Layouts: Right Shoes for a Running Elephant
ACM SOCC 2011, Cascais, Portugal.

Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jörg Schad.
Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)
VLDB 2010, Singapore.

Talks and Posters

Seven Things to Know When Buying Shoes for an Elephant!
Presenter: Alekh Jindal
ACM SOCC 2011, Cascais, Portugal.

Trojan Data Layouts (Poster)
Presenters: Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz
ACM SOCC 2011, Cascais, Portugal.

Large-Scale Data Analysis: Bridging the Gap
Presenters: Alekh Jindal, Yagiz Kargin, Sarath Kumar, Vinay Setty
Information Systems Lab 2009, Saarbruecken, Germany.

Map-Reduce: Distributed Computing Made Easy!
Presenter: Alekh Jindal
MPI Retreat 2009, Braunshausen, Germany.

Patent

Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Joerg Schad
Replicated Data Storage System and Methods.
WO Patent WO2013139379