Data Science Machine

The Data Science Machine is an end-to-end software system that automatically develops predictive models from relational data. It was created by Max Kanter and Kalyan Veeramachaneni at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The system automates two of the most human-intensive components of a data science endeavor: feature engineering, and the selection and tuning of the machine learning methods that build predictive models from those features. First, an algorithm called Deep Feature Synthesis automatically engineers features. Next, through an approach called Deep Mining, the Machine composes a generalized machine learning pipeline that includes dimensionality reduction methods, feature selection methods, clustering, and classifier design. Finally, it tunes the pipeline's parameters through a Gaussian Copula Process.
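To make the idea of engineering features from relational data concrete, here is a minimal sketch. It is not the Deep Feature Synthesis implementation: it only illustrates one plausible building block, aggregating the rows of a child table onto each row of its parent table. The tables, the column names, and the function `aggregate_child_features` are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy relational data: one parent table and one child table.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 30.0},
    {"order_id": 12, "customer_id": 2, "amount": 5.0},
]

# Aggregation functions applied across the parent-child relationship.
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "min": min, "count": len}


def aggregate_child_features(parents, children, key, child_name, columns):
    """Return a copy of each parent row with one new feature per
    (aggregation, column) pair, computed over its child rows."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row)

    feature_rows = []
    for parent in parents:
        rows = grouped.get(parent[key], [])
        out = dict(parent)
        for col in columns:
            values = [r[col] for r in rows]
            for name, fn in AGGREGATIONS.items():
                # Parents with no child rows get None rather than a made-up value.
                out[f"{name}({child_name}.{col})"] = fn(values) if values else None
        feature_rows.append(out)
    return feature_rows


feature_matrix = aggregate_child_features(
    customers, orders, key="customer_id", child_name="orders", columns=["amount"]
)
for row in feature_matrix:
    print(row)
```

Each customer row gains features such as `sum(orders.amount)` and `mean(orders.amount)` without anyone writing them by hand. Repeating this kind of step across several linked tables is one way features could be stacked, though the details of how the Machine does so are in the paper, not in this sketch.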

Paper: Deep Feature Synthesis: Towards Automating Data Science Endeavors (PDF)

Putting it to the test against real data scientists

Starting in May 2015, after developing the initial system, we put the Data Science Machine to the test by pitting it against human data scientists in three different publicly held competitions on Kaggle. We chose competitions where the data is relational, that is, competitions where competitors are not handed ready-made features and left simply to find the best machine learning approach. Here are the details of the three competitions:

KDD Cup 2014 - Project Excitement: Using past projects’ histories on DonorsChoose.org, predict if a crowd-funded project is "exciting".

IJCAI - Repeat Buyer Prediction: Using past merchant and customer shopping data, predict if a customer making a purchase during a promotional period will turn into a repeat buyer.

KDD Cup 2015 - Student Dropout: Using student interaction with resources on an online course, predict if the student will drop out in the next 10 days.

KDD: Knowledge Discovery and Data Mining

IJCAI: International Joint Conferences on Artificial Intelligence

How did it do in these competitions?

We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beat 615 of those teams. In 2 of the 3 competitions we beat a majority of competitors, and in the third we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submission’s score.

Competition Name	Number of Teams	Dates	% of Top Submission’s Score	% of Teams Worse
KDD Cup 2014	473	7/8/14 - 7/15/14	86.5%	69.3%
IJCAI 2015	156	4/15/15 - 6/20/15	93.7%	32.7%
KDD Cup 2015	277	5/1/15 - 7/12/15	95.7%	85.6%
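The summary figures can be cross-checked against the table with a few lines of arithmetic. The sketch below copies the team counts and the "% of Teams Worse" column from the table. Truncating each per-competition product to a whole number of teams is an assumption made here because the published percentages are rounded, so the result is an approximate reconstruction rather than the original calculation.

```python
# (number of teams, fraction of teams scoring worse), copied from the table.
results = {
    "KDD Cup 2014": (473, 0.693),
    "IJCAI 2015": (156, 0.327),
    "KDD Cup 2015": (277, 0.856),
}

total_teams = sum(teams for teams, _ in results.values())

# Assumption: truncate to whole teams per competition, since the
# percentages in the table are rounded to one decimal place.
teams_beaten = sum(int(teams * frac) for teams, frac in results.values())

# Competitions in which more than half of the teams scored worse.
majority_beaten = [name for name, (_, frac) in results.items() if frac > 0.5]

print(total_teams, teams_beaten, majority_beaten)
```

Under that assumption, the totals come out to 906 teams overall and 615 teams beaten, with a majority beaten in KDD Cup 2014 and KDD Cup 2015.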
