The Data Science Machine

Max Kanter, Kalyan Veeramachaneni
CSAIL, MIT

The Data Science Machine is an end-to-end software system that automatically develops predictive models from relational data. It was created by Max Kanter and Kalyan Veeramachaneni at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The system automates two of the most human-intensive components of a data science endeavor: feature engineering, and the selection and tuning of the machine learning methods that build predictive models from those features. First, an algorithm called Deep Feature Synthesis automatically engineers features. Next, through an approach called Deep Mining, the Machine composes a generalized machine learning pipeline that includes dimensionality reduction methods, feature selection methods, clustering, and classifier design. Finally, it tunes the pipeline's parameters through a Gaussian Copula Process.
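
As a rough illustration of the kind of feature that can be engineered from relational data, the sketch below aggregates the rows of a child table into features for each row of a parent table. It is a minimal sketch of the general idea only, not the Deep Feature Synthesis algorithm itself; the toy tables, the column names, and the `aggregate_child_features` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: a parent "customers" table and a child "orders" table.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]

# Aggregation functions applied across the one-to-many relationship.
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "count": len}


def aggregate_child_features(parents, children, key, column):
    """Return one feature dict per parent row, built from its child rows."""
    # Group the child values by the key that links them to the parent.
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[column])

    features = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        feats = {key: parent[key]}
        for name, fn in AGGREGATIONS.items():
            # The count is always defined; the other aggregates are set to
            # None when a parent has no child rows.
            if values or name == "count":
                feats[f"{name}({column})"] = fn(values)
            else:
                feats[f"{name}({column})"] = None
        features.append(feats)
    return features


customer_features = aggregate_child_features(
    customers, orders, "customer_id", "amount"
)
for row in customer_features:
    print(row)
```

Each resulting row is a flat feature vector for one customer, which is the form a downstream machine learning pipeline would expect.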

Paper: Deep Feature Synthesis: Towards Automating Data Science Endeavors (PDF)


Putting it to the test against real data scientists

Starting in May 2015, after developing the initial system, we put the Data Science Machine to the test against human data scientists in three publicly held competitions on Kaggle. We chose competitions where the data is relational; that is, we avoided competitions where the features are already given and competitors simply search for the best machine learning approach. Here are the details of the three competitions:

KDD Cup 2014 - Project Excitement: Using past projects’ histories on DonorsChoose.org, predict if a crowd-funded project is "exciting".

IJCAI - Repeat Buyer Prediction: Using past merchant and customer shopping data, predict if a customer making a purchase during a promotional period will turn into a repeat buyer.

KDD Cup 2015 - Student Dropout: Using student interaction with resources on an online course, predict if the student will drop out in the next 10 days.

KDD: Knowledge Discovery and Data Mining
IJCAI: International Joint Conferences on Artificial Intelligence

How did it do in these competitions?

We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beat 615 of these teams. In 2 of the 3 competitions we beat a majority of competitors, and in the third we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submission’s score.

Competition Name   Number of Teams   Dates               % of Top Submission’s Score   % of Teams Worse
KDD Cup 2014       473               7/8/14 - 7/15/14    86.5%                         69.3%
IJCAI 2015         156               4/15/15 - 6/20/15   93.7%                         32.7%
KDD Cup 2015       277               5/1/15 - 7/12/15    95.7%                         85.6%
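
The summary figures can be cross-checked against the table with a few lines of arithmetic. The sketch below copies the team counts and the "% of Teams Worse" values from the table; because those percentages are rounded, the number of teams beaten is only an estimate (about 616) and agrees with the reported 615 to within rounding. The variable names are ours, for illustration.

```python
# (number of teams, fraction of teams worse) copied from the results table.
results = {
    "KDD Cup 2014": (473, 0.693),
    "IJCAI 2015": (156, 0.327),
    "KDD Cup 2015": (277, 0.856),
}

# Total number of competing teams across the three competitions.
total_teams = sum(teams for teams, _ in results.values())

# Estimated from rounded percentages, so it can differ from the exact
# count by about one team.
teams_beaten_estimate = sum(teams * frac for teams, frac in results.values())

# Number of competitions in which a majority of teams scored worse.
majority_count = sum(1 for _, frac in results.values() if frac > 0.5)

print(total_teams)
print(round(teams_beaten_estimate, 1))
print(majority_count)
```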


Blog

Why did we create the Data Science Machine?

In recent years, more and more data has been collected and is starting to come online (with cloud infrastructure). As data scientists who regularly work with this data, we noticed a few important aspects: Read more.


Source Code and Data

The code for Deep Feature Synthesis will be available soon via GitHub.

The authors are currently seeking permission from the competition organizers to share the data.


Related Publications

Paper: Deep Feature Synthesis: Towards Automating Data Science Endeavors (PDF)


Talks

Data Science Robot
Kalyan Veeramachaneni, TEDx Springfield Massachusetts (video coming soon)

Data Science Machine
Kalyan Veeramachaneni, GE Annual Whitney Symposium (video coming soon)

Deep Feature Synthesis: Towards Automating Data Science Endeavors
Max Kanter, MIT Annual Big Data Symposium


Press

Click here for full press mentions


Acknowledgements

We acknowledge the algorithm developed as part of the Deep Mining project, which became the second part of the Data Science Machine.