Current Research Projects

Below is a list of some of my current research projects:

The goal of this project is to explore the frontier of which components of a data management system can be replaced by learned components. In a recent paper called The Case for Learned Index Structures we showed that core index/data structures can be considered as models: a B-Tree-Index is a model to map a key to the position of a record within a sorted array, a Hash-Index is a model to map a key to the position of a record within an unsorted array, and a BitMap-Index is a model to indicate whether a data record exists or not. As such, all existing index structures can be replaced with other types of models, including deep-learning models, which we termed learned indexes. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets.

This result was featured in O'Reilly's Data Show podcast and received a lot of attention on social media, where it drew comments like "Wow! This could have huge benefits >> The Case for Learned Index Structures" by Kirk Borne, Principal Data Scientist at Booz Allen, or "This paper blew my mind. I spent 100% of college and grad school on indexes. Geez, didn't think that would be a waste of time. ;-) #ML meets 1960's data structures and crushes them" by Steven Sinofsky, Board Partner at A16Z and former President of Microsoft's Windows Division.

Yet, while our initial results are very promising, many open research challenges remain to make the idea viable for real-world systems. However, if it is possible to overcome the current limitations of learned indexes and broaden the scope to other components (e.g., query optimization), the idea of replacing core components of a data management system with learned models has far-reaching implications for future system designs. Most importantly, it might be possible to change the complexity class of some algorithms (e.g., from the O(log n) insert performance of B-Trees to O(1)). Similarly, learned components might make it possible to take better advantage of GPUs/TPUs for systems.
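To make the core idea concrete, here is a minimal sketch in Python. It is not the recursive model index from the paper (which uses a staged hierarchy of models); a single linear model predicts a key's position in a sorted array, and a bounded local search corrects the prediction:

```python
import bisect
import numpy as np

class LearnedIndex:
    """Minimal learned index sketch: a linear model predicts a key's
    position in a sorted array; a local search within the model's known
    maximum error finds the exact slot."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys))
        positions = np.arange(len(self.keys))
        # Fit a simple linear model key -> position (a stand-in for the
        # staged neural-net models used in the paper).
        self.slope, self.intercept = np.polyfit(self.keys, positions, 1)
        predictions = self.slope * self.keys + self.intercept
        self.max_err = int(np.ceil(np.max(np.abs(predictions - positions))))

    def lookup(self, key):
        guess = int(self.slope * key + self.intercept)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Local binary search inside the guaranteed error window.
        i = lo + bisect.bisect_left(self.keys[lo:hi], key)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LearnedIndex(np.random.randint(0, 10**9, size=100_000))
pos = idx.lookup(int(idx.keys[1234]))   # 1234 (or the first duplicate's slot)
```

Because the model's maximum error is known at build time, a lookup is one model evaluation plus a search over a small window, which is where the potential speed and memory savings over a B-Tree come from.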

For more information: http://dsg.csail.mit.edu/mlforsystems/

Unfolding the potential of Big Data for a broader range of users requires a paradigm shift in the algorithms and tools used to analyze data. Exploring complex data sets needs more than a simple question-and-response interface. Ideally, the user and the system engage in a "conversation", each party contributing what it does best. The user can contribute judgement and direction, while the machine can contribute its ability to process massive amounts of data, and perhaps even predict what the user might require next. Even with sophisticated visualizations, digesting and interpreting large, complex data sets often exceeds human capabilities. ML and statistical techniques can help in these situations by providing tools that clean, filter and identify relevant data subsets. Unfortunately, support for ML is often added as an afterthought: the techniques are buried in black boxes and executed in an all-or-nothing manner. Results can often take hours to compute, which is not acceptable for interactive data exploration. Moreover, users want to see the result as it evolves. They want to interrupt, change the parameters, features or even the whole pipeline. In other words, data scientists today still use the same text-style batch interfaces from the 80s, whereas we should explore data more like it is envisioned in movies from James Bond to Minority Report, or as outlined in the Microsoft User Interface Vision. While there exists some work on creating such novel interfaces, like the one from Minority Report, it often ignores the systems and ML aspects and is not really usable in practice, whereas the systems community, on the other hand, tends to ignore the user interface aspect.

As part of the Northstar project we envision a completely new approach to conducting exploratory analytics. We speculate that soon many conference rooms will be equipped with an interactive whiteboard, like the Microsoft Surface Hub, and that we can use such whiteboards to avoid the week-long back-and-forth interactions between data scientists and domain experts. Instead, we believe that the two can work together during a single meeting, using such an interactive whiteboard to visualize, transform and analyze even the most complex data on the spot. This setting will undoubtedly help users to quickly arrive at an initial solution, which can be further refined offline. Our hypothesis is that we can make data exploration much easier for layman users while automatically protecting them from many common errors. Furthermore, we hypothesize that we can develop an interactive data exploration system which provides meaningful results in sub-second time even for complex ML pipelines over very large data sets. These techniques will not only make ML more accessible to a broader range of users, but will ultimately also enable more discoveries compared to any batch-driven approach.

Northstar consists of several sub-projects, including Vizdom, IDEA, Alpine Meadow, TuPAQ, and QUDE.

For more information: http://northstar.mit.edu/

The Network-Attached-Memory (NAM) project grew out of our experience with Tupleware on modern enterprise clusters. We found that the next generation of high-performance RDMA-capable networks requires a fundamental rethinking of not only analytical frameworks but also more traditional distributed database management systems (DDBMSs). DDBMSs are commonly designed under the assumption that the network is the bottleneck and thus must be avoided as much as possible. This assumption no longer holds. With InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel, and the bandwidth increases even further with the most recent EDR standard. Moreover, with increasing advances in RDMA, transfer latencies improve at a similar pace. Through a theoretical analysis and experimental evaluation, we found that the "old" distributed database design is not capable of taking full advantage of fast networks, and we therefore suggested a new design based on a new abstraction called Network-Attached-Memory (NAM). More surprisingly, we found that the common wisdom that distributed transactions do not scale also no longer holds true with the next generation of networks and a corresponding redesign of the system. Our initial results show that we can achieve a stunning 2 million distributed transactions per second over eight dual-socket E5 v2 machines and InfiniBand FDR 4x, versus 32,000 transactions per second using the more traditional approach.
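The following is a purely conceptual Python sketch of the NAM abstraction, with plain method calls standing in for one-sided RDMA verbs; the class and method names are illustrative, not the system's actual interfaces. Memory servers expose a passive pool of memory, while compute servers keep no authoritative state and run the transaction logic against that pool:

```python
class MemoryServer:
    """Stand-in for a NAM memory server: a passive memory region exposed
    for remote reads, writes and atomics (one-sided RDMA in the real
    system, ordinary method calls here)."""
    def __init__(self, size):
        self.buf = bytearray(size)

    def read(self, offset, length):
        return bytes(self.buf[offset:offset + length])

    def write(self, offset, data):
        self.buf[offset:offset + len(data)] = data

    def compare_and_swap(self, offset, expected, new):
        # 8-byte atomic, of the kind offered by RDMA NICs.
        current = int.from_bytes(self.buf[offset:offset + 8], "little")
        if current == expected:
            self.buf[offset:offset + 8] = new.to_bytes(8, "little")
            return True
        return False


class ComputeServer:
    """Compute servers run transaction logic directly against the remote
    memory pool instead of shipping requests to a server that owns the data."""
    def __init__(self, memory_servers):
        self.mem = memory_servers

    def locate(self, record_id, record_size=8):
        # Hypothetical placement: hash-partition record slots over servers.
        server = self.mem[record_id % len(self.mem)]
        return server, (record_id // len(self.mem)) * record_size
```

A compute server would then implement, for example, record locking by issuing compare_and_swap on a remote lock word, without involving any CPU on the memory side.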

As part of the Quantifying the Uncertainty in Data Exploration (QUDE) project, we started to develop techniques to automatically quantify the different types of errors within data exploration pipelines. Of course, there are the obvious types of uncertainty, which are usually visualized using error bars and computed through sampling techniques. However, many other types of error exist, which are often ignored. For example, it is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. However, even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? (2) What is the impact of any unknown (i.e., unobserved) data on query results? As a first step within the QUDE project, we are developing and analyzing techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Surprisingly, our statistical techniques are parameter-free and do not assume prior knowledge about the distribution. Through theoretical analysis and a series of experiments, we show that estimating the impact of unknown unknowns is invaluable for assessing the results of aggregate queries over integrated data sources. For this project, we work together with the Center for Evidence-Based Medicine (EBM) at Brown University.
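To convey why overlap between sources is informative, here is a minimal capture-recapture (Lincoln-Petersen) sketch in Python; the actual QUDE estimators build on more robust species-estimation techniques, so this only illustrates the intuition, and the example data is made up:

```python
def estimate_total_population(source_a, source_b):
    """Capture-recapture intuition behind 'unknown unknowns': if two
    independently collected sources overlap heavily, little is missing;
    if they barely overlap, the unseen part is probably large."""
    a, b = set(source_a), set(source_b)
    overlap = len(a & b)
    if overlap == 0:
        return None  # no overlap: this simple estimator is undefined
    return len(a) * len(b) / overlap

# Example: two business directories that should both list every company.
dir_a = {"acme", "globex", "initech", "umbrella"}
dir_b = {"acme", "globex", "hooli"}
est_total = estimate_total_population(dir_a, dir_b)    # 4 * 3 / 2 = 6 companies
est_unknown = est_total - len(dir_a | dir_b)           # ~1 company never observed
```

The estimated number of never-observed items can then be folded into the result of an aggregate query (e.g., total revenue) to report how much the unknown data might change the answer.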

Past Research Projects

MLbase: Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, and (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge.
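As a toy illustration of points (1) and (2), the Python sketch below lets a user state a classification task while a tiny "optimizer" picks among candidate learners by validation score. The function name, candidate set, and use of scikit-learn are illustrative only and are not MLbase's actual API or optimizer:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def do_classify(X, y, candidates=None):
    """Toy declarative classify() task: the caller only states the goal;
    the 'optimizer' chooses among candidate learners by cross-validation.
    MLbase's optimizer additionally adapts sampling, scale, and the
    physical execution plan, which this sketch omits."""
    candidates = candidates or [LogisticRegression(max_iter=1000),
                                DecisionTreeClassifier(max_depth=5)]
    scored = [(cross_val_score(m, X, y, cv=3).mean(), m) for m in candidates]
    best_score, best_model = max(scored, key=lambda t: t[0])
    return best_model.fit(X, y), best_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model, score = do_classify(X, y)
```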

Tupleware: Current analytics frameworks (e.g., Hadoop, Spark and Flink) are designed to process massive data sets distributed across huge clusters. These assumptions represent the problems faced by giant Internet companies but neglect the needs of the typical user. Non-tech companies like banks and retailers, or even the typical data scientist, seldom operate deployments of thousands of machines, instead preferring smaller clusters with more reliable hardware. In fact, recent industry surveys reported that the median Hadoop cluster has fewer than ten nodes, and over 65% of users operate clusters smaller than 50 nodes. Furthermore, while the raw data might be extremely large, after the first cleaning and preprocessing steps, the data ML models are trained on rarely exceeds 1-2 TB in size and fits easily into the main memory of a modern cluster. Targeting complex analytics workloads on smaller clusters fundamentally changes the way we should design analytics tools.

With Tupleware, we built a first analytical framework specifically for "normal" users, not the Googles and Facebooks. Tupleware's architecture brings together ideas from the database, compiler and network communities to create a powerful end-to-end solution for data analysis that compiles workflows of user-defined functions (UDFs) into distributed programs. One key contribution of Tupleware is its new hybrid optimization rules, which combine the low-level optimizations normally done by compilers with the high-level query optimizations that databases use, to automatically optimize analytical workflows to better take advantage of modern hardware. Our initial results show that for common ML tasks, Tupleware can achieve up to three orders-of-magnitude performance improvement over alternative systems, like Spark or Hadoop.
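The sketch below illustrates the flavor of one such optimization, operator fusion, in Python: instead of materializing an intermediate result after every UDF, the fused variant pushes each tuple through the whole UDF pipeline in a single pass. Tupleware itself compiles such pipelines to native code and applies further low-level optimizations, so this only shows the principle:

```python
def pipeline_materialized(data, udfs):
    """Interpreter-style execution: each UDF produces a full intermediate
    result, as a classic operator-at-a-time engine would."""
    for f in udfs:
        data = [f(x) for x in data]   # one pass and one intermediate per UDF
    return data

def pipeline_fused(data, udfs):
    """Fused execution in the spirit of compiled plans: one pass over the
    data, no intermediates, better cache locality."""
    out = []
    for x in data:
        for f in udfs:
            x = f(x)
        out.append(x)
    return out

udfs = [lambda x: x * 2, lambda x: x + 1]
assert pipeline_materialized(range(5), udfs) == pipeline_fused(range(5), udfs)
```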

Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this project, we explore techniques to fuse the two computational paradigms in a single system called S-Store to simultaneously accommodate OLTP and streaming applications.
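Here is a minimal Python sketch of the execution contract only, using SQLite purely as a stand-in for shared OLTP state: each window of streaming events is applied as one atomic transaction, so concurrent transactional reads never observe a half-applied window. S-Store realizes this inside a main-memory transaction engine; the code below just illustrates the idea:

```python
import sqlite3

def process_stream_transactionally(events, window_size=3):
    """Toy fusion of streaming and OLTP: each full window of events is
    applied to shared state as a single atomic transaction."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE totals (sensor TEXT PRIMARY KEY, value REAL)")
    window = []
    for event in events:                      # event = (sensor_id, reading)
        window.append(event)
        if len(window) == window_size:
            with db:                          # BEGIN ... COMMIT (atomic)
                for sensor, reading in window:
                    db.execute(
                        "INSERT INTO totals VALUES (?, ?) "
                        "ON CONFLICT(sensor) DO UPDATE SET value = value + ?",
                        (sensor, reading, reading))
            window.clear()
    return db

db = process_stream_transactionally(
    [("s1", 1.0), ("s2", 2.0), ("s1", 3.0), ("s2", 1.0)])
print(db.execute("SELECT * FROM totals ORDER BY sensor").fetchall())
# -> [('s1', 4.0), ('s2', 2.0)]  (the trailing partial window is not yet applied)
```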

Together with many researchers from the Intel Science & Technology Center for Big Data and SAP, we are exploring a new type of federated database to address the growing need for managing information that spans multiple data models. Currently, our focus is especially on graph data and processing models and their connection to the relational model and transaction processing.

MDCC: The recent Amazon East Coast data-center failure once more demonstrated the need for multi-data center deployments and recovery. However, multi-data center deployments challenge many of the design decisions behind existing cloud services. For example, the latency between data centers is so high and unreliable that traditional strong consistency protocols are not applicable. At the same time, data-center failures render a huge fraction of the nodes inside a system unavailable, making replication even more important. The goal of this project is to investigate how alternative architectures, programming and consistency models, as well as recovery techniques, can help to build multi-data center applications. More info at: http://mdcc.cs.berkeley.edu/

CrowdDB

Some queries cannot be answered by machines only. Processing such queries requires human input, e.g., for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. With CrowdDB we explore a new database design that uses human input via crowdsourcing to process queries that neither database systems nor search engines can answer adequately. CrowdDB uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many concepts from traditional database systems, there are also important differences. From a conceptual perspective, the traditional closed-world assumption for query processing no longer holds for human input, which is essentially unbounded. From an implementation perspective, CrowdDB uses an operator-based query engine, but this engine is extended with special operators that generate user interfaces in order to solicit, integrate and cleanse human input.
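To illustrate the operator extension, here is a hypothetical "crowd probe" operator sketched in Python: it scans tuples as usual, but when a value is missing it issues a task (a callback standing in for a generated crowdsourcing form) and integrates the worker's answer. The names and callback interface are illustrative, not CrowdDB's actual operators:

```python
def crowd_probe(rows, column, ask_crowd):
    """Sketch of a crowd-sourcing operator: behaves like a normal scan, but
    whenever the requested column is missing it generates a task for human
    workers and fills in the returned value before passing the tuple on."""
    for row in rows:
        if row.get(column) is None:
            row[column] = ask_crowd(
                f"Please provide the '{column}' for: {row}")  # generated UI task
        yield row

# Usage: the callback would normally post a task to a crowdsourcing platform;
# here a stub answers inline so the example runs on its own.
companies = [{"name": "IBM", "phone": None}, {"name": "MIT", "phone": "617-0000"}]
answered = list(crowd_probe(companies, "phone", ask_crowd=lambda task: "unknown"))
```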

Performance Insightful Query Language (PIQL)

Newly released web applications often succumb to a "Success Disaster", where overloaded database machines and resulting high response times destroy a previously good user experience. Unfortunately, the data independence provided by a traditional relational database, while useful for rapid development, only exacerbates the problem by hiding potentially expensive queries under simple declarative expressions. As a result, developers of these applications are increasingly abandoning relational databases in favor of imperative code written against distributed key/value stores, losing the many benefits of data independence in the process. Instead, we propose PIQL, a declarative language that also provides scale independence by calculating an upper bound on the number of key/value store operations that will be performed for any query. Coupled with a service level objective (SLO) compliance prediction model and PIQL's scalable database architecture, these bounds make it easy for developers to write success-tolerant applications that support an arbitrarily large number of users while still providing acceptable performance.
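The Python sketch below illustrates the kind of bound involved, using an intentionally simplified plan model: every step of a query plan must declare a maximum fan-out (e.g., via a LIMIT or a schema cardinality constraint), and the planner can then derive a worst-case number of key/value operations or reject the query as not scale-independent. This is an illustrative model, not PIQL's actual analysis:

```python
def scale_independence_bound(plan):
    """Walk a query plan and compute a worst-case number of key/value
    operations. Each step declares its maximum fan-out; a step with an
    unbounded fan-out (None) makes the query unbounded and rejected."""
    total, width = 0, 1
    for op, fanout in plan:
        if fanout is None:
            return None                # not scale-independent
        total += width * fanout        # operations issued at this step
        width *= fanout                # tuples flowing to the next step
    return total

# e.g., fetch a user (1 get), their 10 most recent posts, 5 comments per post
plan = [("get user", 1), ("recent posts LIMIT 10", 10), ("comments LIMIT 5", 5)]
print(scale_independence_bound(plan))   # 1 + 10 + 50 = 61 key/value operations
```

Because the bound is independent of the total data size, it holds no matter how many users the application gains, which is the essence of scale independence.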

Failure-Aware Multi-Tenancy Placement

Together with SAP and the Hasso-Plattner-Institute, we are currently exploring new techniques and scheduling algorithms for placing multi-tenant applications in the cloud. SAP became a software-as-a-service provider with its on-demand platform Business ByDesign. These new SAP services host multi-tenant applications with very strong availability and response-time guarantees. At the moment, those guarantees are mainly achieved by overprovisioning the hardware for the service. The goal of this project is to develop new techniques and scheduling algorithms which are more cost-effective, while still guaranteeing the service-level agreements even in the presence of major failures.
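A greedy Python sketch of the placement idea, with illustrative data structures: replicas of a tenant are spread across distinct failure domains, and capacity is checked so that guarantees can still be met after a domain failure. The actual algorithms additionally model response-time guarantees and recovery load, which this omits:

```python
def place_tenant(tenant, servers, replicas=2):
    """Greedy failure-aware placement: put each replica on the least-loaded
    feasible server, never co-locating two replicas of the same tenant in
    the same failure domain (rack / data center)."""
    used_domains, placement = set(), []
    for _ in range(replicas):
        candidates = [s for s in servers
                      if s["domain"] not in used_domains
                      and s["load"] + tenant["load"] <= s["capacity"]]
        if not candidates:
            raise RuntimeError("no feasible placement")
        target = min(candidates, key=lambda s: s["load"])
        target["load"] += tenant["load"]
        used_domains.add(target["domain"])
        placement.append(target["name"])
    return placement

servers = [{"name": "s1", "domain": "rack1", "load": 0.2, "capacity": 1.0},
           {"name": "s2", "domain": "rack1", "load": 0.1, "capacity": 1.0},
           {"name": "s3", "domain": "rack2", "load": 0.5, "capacity": 1.0}]
print(place_tenant({"name": "t42", "load": 0.3}, servers))   # -> ['s2', 's3']
```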

Building a Database on Cloud Infrastructure

With this project, we explored the opportunities and limitations of using cloud computing as an infrastructure for general-purpose web-based database applications. Part of this work was to analyze alternative client-server and indexing architectures as well as alternative consistency protocols. Furthermore, we proposed a new transaction paradigm, Consistency Rationing, which not only defines the consistency guarantees on the data instead of at the transaction level, but also allows for switching consistency guarantees automatically at runtime. Thus, the system can adapt and balance consistency on the fly against the possible risk of inconsistencies. The outcome of this work was published at SIGMOD 2008 and VLDB 2009 and is partly commercialized by 28msec Inc.
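A small Python sketch of the rationing decision for the adaptive category: the system compares the expected cost of an inconsistency against the extra cost of running with strong consistency. The A/B/C categories follow the paper's classification, but the cost model and numbers here are purely illustrative:

```python
def choose_consistency(record, conflict_probability, penalty_per_violation,
                       cost_strong, cost_weak):
    """Consistency is attached to the data: category A always gets strong
    guarantees, category C always weak ones, and the adaptive category B is
    decided at runtime by comparing expected penalty against extra cost."""
    if record["category"] == "A":       # e.g., account balances
        return "strong"
    if record["category"] == "C":       # e.g., logging, preferences
        return "weak"
    expected_penalty = conflict_probability * penalty_per_violation
    return "strong" if expected_penalty > (cost_strong - cost_weak) else "weak"

print(choose_consistency({"category": "B"}, conflict_probability=0.02,
                         penalty_per_violation=10.0,
                         cost_strong=0.5, cost_weak=0.1))   # -> 'weak'
```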

Cloudy/Smoky - a distributed storage and streaming service in the cloud

Cloud computing has changed the view on data management by focusing primarily on cost, flexibility and availability instead of consistency and performance at any price, as traditional DBMSs do. As a result, cloud data stores run on commodity hardware and are designed to be scalable, easy to maintain and highly fault-tolerant, often providing only relaxed consistency guarantees. The success of key-value stores like Amazon's S3 and the variety of open-source systems reflects this shift. Existing solutions, however, still lack substantial functionality provided by a traditional DBMS (e.g., support for transactions and a declarative query language) and are tailored to specific scenarios, creating a jungle of services. That is, users have to decide on a specific service and are later locked into it, which prevents the application from evolving and leads to misuse of services and expensive migrations to other services. During my time at ETH we started to build our own highly scalable database, which provides a completely modularized architecture and is not tailored to a specific use case. For example, Cloudy supports stream processing, as well as SQL and simple key-value requests. This project is continued at ETH, and I am still partially involved in its development.

Zorba - a general purpose XQuery processor implemented in C++

Zorba is a general-purpose XQuery processor, implemented in C++, that supports the W3C family of specifications. The query processor has been designed to be embeddable in a variety of environments such as other programming languages extended with XML processing capabilities, browsers, database servers, XML message dispatchers, or smartphones. Its architecture employs a modular design, which allows customizing the Zorba query processor to the environment's needs. In particular, the architecture of the query processor allows for a pluggable XML store (e.g., main memory, DOM stores, persistent disk-based large stores, S3 stores). Zorba can be accessed through APIs from the following languages: C, C++, Ruby, Python, Java, and PHP. Zorba runs on most platforms and is available under the Apache License v2. My part in this project was the design and implementation of the main run-time components (e.g., iterator contract, variable binding, FLWOR ...). Zorba is still actively developed by a team of six full-time developers and several volunteer contributors.
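To give a flavor of the iterator contract, here is a conceptual sketch in Python of a pull-based open/next/close interface with a FLWOR-style "where" operator; Zorba's actual C++ runtime interfaces are considerably richer, so treat this only as an illustration of the contract:

```python
class RuntimeIterator:
    """Pull-based runtime iterator contract: each operator lazily pulls
    items from its children, so large sequences stream without being
    materialized. None signals end-of-sequence."""
    def open(self): pass
    def next(self): return None
    def close(self): pass

class SequenceIterator(RuntimeIterator):
    """Leaf operator producing a fixed sequence of items."""
    def __init__(self, items): self.items = items
    def open(self): self.pos = 0
    def next(self):
        if self.pos >= len(self.items):
            return None
        item, self.pos = self.items[self.pos], self.pos + 1
        return item

class WhereIterator(RuntimeIterator):
    """'where' clause of a FLWOR: pulls from its child and forwards matches."""
    def __init__(self, child, predicate): self.child, self.pred = child, predicate
    def open(self): self.child.open()
    def next(self):
        item = self.child.next()
        while item is not None and not self.pred(item):
            item = self.child.next()
        return item
    def close(self): self.child.close()

it = WhereIterator(SequenceIterator([1, 2, 3, 4]), lambda x: x % 2 == 0)
it.open(); print(it.next(), it.next(), it.next())   # 2 4 None
```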

MXQuery & Windowing for XQuery

MXQuery is a low-footprint, extensible open-source XQuery engine implemented in Java. Besides a high level of compliance with XQuery 1.0, it provides wide coverage of upcoming W3C standards proposals (Update, Full Text, Scripting) and support for a wide range of Java platforms (including mobile/embedded devices). We used MXQuery as our research platform for data stream processing/CEP with XQuery. The windowing extension we developed during this project was accepted by the W3C for the upcoming XQuery 1.1 standard.
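As a minimal illustration of windowing over a stream, here is a tumbling-window sketch in Python; the XQuery window clause we proposed is far more general, supporting sliding windows and arbitrary start/end conditions, so this only shows the simplest case with made-up data:

```python
def tumbling_windows(stream, size):
    """Toy tumbling-window operator: groups a (possibly unbounded) stream
    into consecutive, non-overlapping windows of `size` items each."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield list(window)
            window.clear()
    if window:                       # emit the trailing partial window
        yield list(window)

readings = [3, 5, 2, 8, 1, 7, 4]
print([sum(w) for w in tumbling_windows(readings, 3)])   # [10, 16, 4]
```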