Awards

  • 2018 VLDB Early Career Research Contribution Award
  • 2017 VMware Systems Research Award
  • 2017 Early Career Research Achievement Award from Brown University
  • 2017 Alfred P. Sloan Research Fellow in Computer Science
  • 2017 VMware Early Career Faculty Grant
  • 2016 ACM TODS - Best of SIGMOD invitation
  • 2015 Google Research Award
  • 2015 NSF CAREER Award
  • 2015 AFOSR Young Investigator Research Award
  • 2015 Honorable Mention for TCDE Early Career Award
  • 2015 VLDB Best Demo Award
  • 2015 Robotics: Science and Systems (AAAI-RSS) Blue Sky Award
  • 2013 ICDE Best Paper Award
  • 2011 VLDB Best Demo Award
  • 2010 Prospective Researcher Fellowship, Swiss National Science Foundation
  • 2008 ACM TODS - Best of SIGMOD invitation
  • 2006 DAAD Short-Term Scholarship, Germany
  • 2005 School of Information Technology Scholarship for outstanding achievements, University of Sydney, Australia
  • 2005 Siemens Prize for Solving an Industry Problem in Research Project Work for the master's thesis, University of Sydney

Grants

Google Faculty Research Award (PI) and Focused Research Award for DSAIL

Research Award for DSAIL

Research Award for DSAIL

Together with SAP, we are exploring new data models for Polystores. (Co-PI)

Together with Oracle and Mellanox, we are investigating the implications of RDMA on OLTP and OLAP data management systems. (PI)

III: Medium: Learning-based Synthesis of Data Processing Engines (PI): Modern data-processing systems are designed to be general-purpose, in that they can handle a wide variety of applications and data. Unfortunately, this general-purpose nature causes these systems to achieve suboptimal performance for every application and user: technical compromises have to be made to support a wide range of use cases, often leading to orders-of-magnitude worse performance than what a highly customized system could achieve. At the same time, developing a database system from scratch for each individual application and user is neither economical nor practical. The goal of this project is to explore how machine learning can be used to automatically customize a database system for a specific application or user to achieve so-called 'instance-optimality'. If successful, this project will transform the way the modern database systems that underpin the Internet and many enterprise computing systems are built, resulting in systems with much better performance, or systems that are able to process large datasets using much less hardware than current systems.
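
To make 'instance-optimality' concrete, the sketch below shows a toy learned index in the spirit of this line of work: a simple model fitted to the key distribution replaces a general-purpose search structure, with a bounded local search correcting any model error. The class name, model choice, and data are illustrative assumptions, not the project's actual design.

    # Illustrative sketch only: a toy "learned index" that fits a linear
    # model to the sorted keys' positions, then corrects its prediction
    # with a binary search restricted to the model's worst-case error.
    import bisect
    import math

    class ToyLearnedIndex:
        def __init__(self, keys):
            self.keys = sorted(keys)
            n = len(self.keys)
            # Least-squares fit of position ~ slope * key + intercept.
            mean_k = sum(self.keys) / n
            mean_p = (n - 1) / 2
            cov = sum((k - mean_k) * (i - mean_p)
                      for i, k in enumerate(self.keys))
            var = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
            self.slope = cov / var
            self.intercept = mean_p - self.slope * mean_k
            # Worst-case prediction error bounds the correction window.
            self.max_err = math.ceil(max(abs(self._predict(k) - i)
                                         for i, k in enumerate(self.keys)))

        def _predict(self, key):
            return self.slope * key + self.intercept

        def lookup(self, key):
            pos = round(self._predict(key))
            lo = max(0, pos - self.max_err)
            hi = min(len(self.keys), pos + self.max_err + 1)
            # Search only within the error bound, not the whole array.
            i = bisect.bisect_left(self.keys, key, lo, hi)
            if i < len(self.keys) and self.keys[i] == key:
                return i
            return None

    idx = ToyLearnedIndex(range(0, 1000, 3))
    assert idx.lookup(999) == 333  # nearly-linear data needs a tiny window

The more regular the data distribution, the smaller the correction window, which is exactly why a model customized to one application's data can beat a general-purpose structure on that data.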

Data-Driven Discovery of Models (PI): Understanding the complex and increasingly data-intensive world around us relies on the construction of robust empirical models, i.e., representations of real, complex systems that enable decision makers to predict behaviors and answer “what-if” questions. Today, construction of complex empirical models is largely a manual process requiring a team of subject matter experts and data scientists. With ever more data becoming available via improved sensing and open sources, the opportunity exists to build models to speed scientific discovery, enhance Department of Defense/Intelligence Community intelligence, and improve United States Government logistics and workforce management, but capitalizing on this opportunity is fundamentally limited by the availability of data scientists.

The Data-Driven Discovery of Models (D3M) program aims to develop automated model discovery systems that enable users with subject matter expertise but no data science background to create empirical models of real, complex processes. This capability will enable subject matter experts to create empirical models without the need for data scientists, and will increase the productivity of expert data scientists via automation. The D3M automated model discovery process will be enabled by three key technologies developed in the course of the program.

Fast AI: Datacenter and Edge Computing (PI): A major challenge highlighted by our Air Force stakeholders (PEO Digital, PEO Fighters and Bombers, 412th Test Wing, Air Force Materiel Command) and others is ingesting large amounts of diverse data and processing it in a timely manner to achieve a more global view, detect complex interconnected anomalies, and predict future problems. Unfortunately, without a paradigm shift in how we build systems to store, integrate, and continuously analyze the increasing amount of sensor, image, and video data, the Air Force will be severely limited in its ability to extract value from collected data as Moore's law ends and the human resources available for data integration and analysis remain limited.

In this project, we take a radically different approach to (1) cloud storage, (2) integration, and (3) analysis of normal and abnormal behavior. We propose three components, one for each sub-problem but integrated into a single system, which deeply embed machine learning within the system itself to allow it to self-optimize for a set of applications. Previous results with this approach have already shown great promise, matching and often outperforming alternative solutions by orders of magnitude in performance.

III: Medium: Quantifying the Unknown Unknowns for Data Integration (PI): As the amount and variety of data available online explodes, it is common practice for data scientists to acquire and integrate disparate data sources to achieve higher-quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete, and (2) what is the impact of any unknown (i.e., unobserved) data on query results? This project will develop and analyze techniques to estimate the impact of this unknown data (a.k.a. unknown unknowns) on the results of analytical queries. This will help to better understand answers in the presence of incomplete information across fields ranging from business and the military to medical applications.

This project will develop and exploit the following paradoxical statistical phenomenon: the ability to see certain data items more than once (across multiple data sets) enables one to estimate parameters of data items that have never been seen at all. The project will therefore develop new statistical techniques that take advantage of overlapping datasets, and software backed by both theory and experiments. This will enable users with overlapping incomplete data sets to actively "see the unseen," and in many cases perform as though they had access to missing information not represented in any of their data sources. The project will also focus on data validation, and on how to use multiple unreliable data sources to correct each other. Further, as the proposed analysis is nuanced and novel, the project will also explore how best to convey valuable insights to the user via interactive visualizations of the predictions.
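
As a concrete illustration of this phenomenon, the sketch below applies the classic Chao1 species estimator to overlapping toy data sets; the estimator choice and the data are illustrative assumptions, not necessarily the project's actual techniques.

    # Illustrative sketch only: Chao1 estimates how many distinct items
    # were never observed from the counts of items seen once (f1) and
    # twice (f2) across overlapping sources.
    from collections import Counter

    def chao1_estimate(observations):
        """Estimate the total number of distinct items (seen + unseen)."""
        freq = Counter(observations)                  # item -> times observed
        observed = len(freq)                          # distinct items seen
        f1 = sum(1 for c in freq.values() if c == 1)  # singletons
        f2 = sum(1 for c in freq.values() if c == 2)  # doubletons
        if f2 == 0:
            # Bias-corrected form when no item was seen exactly twice.
            return observed + f1 * (f1 - 1) / 2
        return observed + f1 * f1 / (2 * f2)

    # Three overlapping, individually incomplete sources (hypothetical data):
    sources = [["acme", "globex", "initech"],
               ["globex", "umbrella"],
               ["initech", "globex", "stark"]]
    total = chao1_estimate([x for src in sources for x in src])
    print(total)  # exceeds the 5 observed items: the overlap pattern
                  # implies distinct items that were never seen at all

Intuitively, many singletons relative to doubletons means the sources barely overlap, so many more items likely remain unseen; heavy overlap means the sources have nearly exhausted the population.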

Data Management for Small High-Performance Clusters (PI): Two current hardware trends will fundamentally change the design of modern parallel analytical systems: (1) high-performance RDMA-capable networks such as InfiniBand FDR/EDR and (2) high-end many-core machines with considerable amounts of main memory. Existing parallel analytical systems, such as Spark and Hadoop, are built in fundamentally the wrong way to leverage the benefits of these trends, since they target the wrong hardware: huge cloud deployments with cheap but low-end machines connected via high-latency, low-bandwidth networks. This is not the infrastructure that most businesses or defense agencies operate. Instead, given the increasing need for advanced statistical machine learning techniques and agile analytics, we see the future in Small High-Performance Computing (SHPC) clusters. Already today, SHPC clusters equipped with fast networks and terabytes of main memory are reasonably affordable. Together with the Air Force, we are exploring how data management systems, especially analytical systems, have to change for SHPC clusters.

CAREER: Query Compilation Techniques for Complex Analytics on Enterprise Clusters (PI)

BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing (PI): Sharing of data sets can provide tremendous mutual benefits for industry, researchers, and nonprofit organizations. For example, companies can profit when university researchers explore their data sets and make discoveries that help the company improve its business. At the same time, researchers are always in search of real-world data sets to show that their newly developed techniques work in practice. Unfortunately, many attempts to share relevant data sets between different stakeholders in industry and academia fail or require a large investment to make data sharing possible. A major obstacle is that data often comes with prohibitive restrictions on how it can be used (requiring, e.g., the enforcement of legal terms or other policies, handling of data privacy issues, etc.). To enforce these requirements today, lawyers are usually involved in negotiating the terms of each contract. It is not atypical for this process of creating an individual data-sharing contract to end in protracted negotiations, which are both disconnected from what the actual stakeholders aim to do and fraught, as both sides struggle with the implications and possibilities of modern security, privacy, and data sharing techniques. Worse, fear of missing a loophole in how the data might be (mis)used often prevents many data sharing efforts from even getting off the ground. To address these challenges, our new data sharing spoke will enable data providers to easily share data while enforcing constraints on its use. This effort has two key components: (1) creating a licensing model for data that facilitates sharing data that is not necessarily open or free between different organizations, and (2) developing a prototype data sharing software platform, ShareDB, which enforces the terms and restrictions of the developed licenses. We believe these efforts will have a transformative impact on how data sharing takes place. By moving data out of the silos of individuals and single organizations and into the hands of broader society, we can tackle many societally significant problems.

III: Medium: 20/20: A System for Human-in-the-Loop Data Exploration (Co-PI): Exploratory data analysis plays a key role in data-driven discovery in a wide range of domains, including science, engineering, and business. For data analysis to become a commodity at a time when its user base is continually expanding and diversifying, human productivity and ease of use must become first-class design considerations for any database system. Unfortunately, data tools that are user-friendly and designed to improve human productivity are still sorely lacking. This project will enable users at different skill levels to interact with and explore their large datasets far more easily and quickly than they do today. Rather than spending precious time building complex analytics tasks, users will get a more agile, responsive, and user-friendly system based on direct manipulation of visual representations (e.g., charts, graphs, maps) of the data sets and analysis results. The system can also be used as a learning tool: e.g., a teacher could walk students through a complex dataset to verify a specific hypothesis. This project will make large-scale data exploration accessible to more users. Overall, it will accelerate discovery and breakthroughs in many domains, such as e-commerce, finance, and science. This research will be incorporated into undergraduate and graduate coursework. The outreach activities include special research- and education-focused programs geared towards undergraduates and high school girls.