June 10, 2013
I participated in the dunnhumby Product Launch Competition last month. The competition asked us to predict the sales of products in week 26, given their sales figures in weeks 1 through 13. As a training set, we had week 1-26 sales figures for about 5000 products; we were then given a test set of about 1000 products with only week 1-13 sales. The full details can be found here. The event took place at hack/reduce, a "big data" hackerspace at Third Street and Binney Street in Cambridge.
The entire competition was run on Kaggle, a data mining competition platform where individuals and companies post datasets and let participants around the world submit solutions. Each competition has a time limit, a leaderboard, and prizes.
George Tucker, Alex Levin (also CSAIL PhD students), and I formed "Team SidPac" (after the graduate dorm we've all lived in), and we won! More precisely, we were the top team in Boston and placed 6th overall (out of 100+ teams) on Kaggle. You can find our code in our GitHub repository.
Subsequently, on June 4, we gave a presentation at the Boston Data Mining Meetup group (also at hack/reduce) with a detailed overview of our methods. About 100 people attended and asked great questions. Our presentation slides are here.
Most of the details of what we did can be found in our code or in the presentation slides, but I did want to reiterate and share a few key points:
Rapid Workflow: In a ten-hour competition, getting the data into a usable form and building models quickly are essential. I was pretty amazed at how quickly George was able to parse the data and make relevant descriptive graphs in Matlab --- at the time, I was still writing parsing scripts in Python, and I thought I was pretty fast at these types of things. We never moved beyond ridge regression, so Matlab was sufficient for our needs, but it ultimately limited the models we could use. If we had been able to build a first model just as quickly in Python with numpy and scikit-learn, we probably would have stuck with Python for the rest of the day; a rough sketch of that route follows below.
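For the curious, here is a minimal sketch of what that Python route might have looked like. It is illustrative rather than a transcript of our actual pipeline (which was in Matlab and lives in the repository): the file names, column layout, and log transform are all assumptions for the example.

```python
# A minimal sketch of a scikit-learn ridge pipeline for this task, not our
# actual code. Assumes a hypothetical CSV with one row per product:
# columns week_1 ... week_13 as features and week_26 as the target.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # hypothetical file name
weeks = ["week_%d" % i for i in range(1, 14)]

# Log-transform the counts (an illustrative choice for skewed sales data).
X = np.log1p(train[weeks].values)
y = np.log1p(train["week_26"].values)

# Cross-validate a ridge model to sanity-check the baseline.
model = Ridge(alpha=1.0)
mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("CV RMSE (log scale): %.3f" % np.sqrt(mse).mean())

# Fit on all training data and predict week-26 sales for the test products.
model.fit(X, y)
test = pd.read_csv("test.csv")  # hypothetical file name
predictions = np.expm1(model.predict(np.log1p(test[weeks].values)))
```

The appeal of a baseline like this is that it produces a submission within the first hour or two, leaving the rest of the day for feature engineering and model tweaks.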
Dropbox for Code Sharing: This might sound a little scary, but it worked very well for us in this situation. Each of us worked on different files, and Dropbox synced them automatically; it just worked. Granted, we were three people in the same room working on a ten-hour project, and I certainly wouldn't recommend Dropbox for large codebases, but it's seriously worth considering if you've ever forgotten to check in your commits or gotten confused resolving merge conflicts. I worked on another, larger project earlier this year that used Dropbox for code sharing with a distributed team, and because we kept in close communication via email, chat, phone, and the occasional physical meeting, it worked there too.
Communicating to Diverse Audiences: I enjoyed the Meetup just as much as the competition itself --- it was great to talk to people with many different professional backgrounds, a wider audience than we are used to in academic seminars. Machine learning and data mining can be intimidating to outsiders, with their own jargon and folk knowledge. I speak from some experience --- I certainly felt this way much of the time in my first years of graduate school, and I still feel I know only a fraction of what I should.
Paradoxically, though, people of many backgrounds, not just those with sophisticated modeling skills, can make really valuable contributions to any machine learning project. On the code and software engineering side, cleaning the data, engineering features, coding the algorithms and evaluation metrics, and putting all of these components into a modular, efficient pipeline are all critical. More broadly, people with domain knowledge can contribute insights into how to look at the data, some of which might turn out to be powerful features. I've started to think more about how we might democratize data mining; meetups like this, in my view, are a good start.