January 21, 2014
The MIT Big Data Challenge ended yesterday. The challenge consisted of two parallel contests using the same dataset: 1) a prediction challenge to forecast the number of taxi rides at key downtown Boston locations and 2) a visualization challenge to gain insights into taxi ridership patterns.
I competed in the prediction challenge, and it was a fun experience. Perhaps more importantly, I think the challenge only scratches the surface of what's possible in civic analytics. All of the code I wrote is now publicly available on GitHub in a repository called supervised-learning, which gives an idea of my data processing, feature engineering, and modeling approaches.
My work was done primarily in Python using the IPython notebook. The datasets themselves aren't in my repository (you can ask the contest organizers for them directly), but I think the code could serve as a useful framework for other supervised machine learning projects (Kaggle competitions, research projects, or rapid prototyping for larger efforts). In general, the out-of-the-box performance of the regression models in scikit-learn was very good, and combined with the power and versatility of Python's libraries, I would recommend this workflow to anyone working on a project of similar scale.
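To make the workflow concrete, here is a minimal sketch of the kind of scikit-learn regression pipeline described above. The file name and feature columns are hypothetical placeholders, not the actual contest data or the code in the repository; it just illustrates fitting an out-of-the-box regressor on engineered features and checking its error.

```python
# Hypothetical sketch: fit a default scikit-learn regressor on engineered
# features and evaluate it on a held-out split. Column names and the CSV
# path are placeholders, not the contest dataset.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Example engineered features: time of day, day of week, weather, etc.
df = pd.read_csv("features.csv")
X = df[["hour", "day_of_week", "is_holiday", "precipitation"]]
y = df["ride_count"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default hyperparameters often do reasonably well out of the box.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
rmse = mean_squared_error(y_test, predictions) ** 0.5
print("Test RMSE:", rmse)
```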