MIT Database Group


Hadoop++ / HAIL


This page describes my contributions to Hadoop++/HAIL. See the detailed webpage at Saarland University here.

Hadoop MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. Hadoop MapReduce, however, suffers from poor performance due to inefficient scan-oriented data processing using a hard-coded query processing pipeline. This is undesirable for several analytical tasks in which users want fast access to their data.



Hadoop++

The goal of Hadoop++ is to significantly improve the performance of Hadoop MapReduce by integrating database techniques such as data layouts, indexes, and join processing into Hadoop's storage and query processing engines. The challenge is to scale these techniques to very large data sets, preserve the fault-tolerance, and keep the MapReduce interface intact. In Hadoop++, we represent the Hadoop's query processing pipeline as a database-style physical query execution plan and introduce two additional data processing techniques to improve this plan: (i) indexed data access for quickly accessing the relevant portions of a dataset, and (ii) co-partitioned join processing for efficiently combining two or more datasets. As a result, the performance of Hadoop++ is 20x faster than Hadoop and even matches the performance of a well-configured parallel DBMS. paper



HAIL

The Hadoop Aggressive Indexing Library (HAIL), a follow-up of Hadoop++, takes indexing in Hadoop even further. HAIL utilizes idle CPU cycles in the Hadoop MapReduce pipeline to aggressively create as many indexes as possible, each for a different data block replica when uploading data. This has several consequences. First, several indexes are available at query time for incoming MapReduce jobs. As a result, there is a higher likelihood of being able to do an index scan, which is much faster than a full scan. Second, there is no preparation time to create the indexes, as in Hadoop++. Instead, indexes are available as soon as the data is uploaded. Finally, since HAIL integrates index creation tightly with the Hadoop upload pipeline, the overhead of creating the index is negligible in the overall data upload costs. As a result, indexes are created almost for free, in contrast to Hadoop++, which has very high index creation costs. HAIL runs up to 68x faster than Hadoop while having comparable or better (due to binary representation) data upload time. paper



Related Publications
Talks and Posters
Patent
  • Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Joerg Schad
    Replicated Data Storage System and Methods.
    WO Patent WO2013139379