The predictions are here.

Another explanation and discussion is at pollster.com.

Summary

Many sites that I enjoy attempt to predict the outcome of elections by averaging the last several polls. But this isn't quite right. Suppose you wanted to average the last three polls: one from today, one from yesterday, and one from a month ago. You should assign more importance to the more recent polls, but how much more? I use a technique from statistics called the Kalman filter, which combines all available polling data to yield the best possible (*) predictions of the outcome of each senatorial election for 2006.

The dark lines in the figures represent the estimated level of support for each candidate; the circles are individual poll results; the shaded areas show 95% confidence intervals. Notice how the confidence intervals expand when no new polls are taken. In the absence of new information, our previous guesses become increasingly unreliable, since the underlying political situation could be changing. Another advantage of the Kalman filter is that it permits tighter confidence intervals than the margin of error of any individual poll: if several different polls with consistent results are taken at around the same time, the overall confidence may be quite high.
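To make the behavior of the confidence intervals concrete, here is a minimal sketch of a one-dimensional Kalman filter. It is not the model used for the figures: it assumes support follows a simple random walk, and the function name kalman_1d, the drift variance, the starting guess, and the poll numbers are all made up for illustration.

```python
import math

def kalman_1d(polls, n_days, process_var=0.25, init_mean=50.0, init_var=100.0):
    """Scalar random-walk Kalman filter (illustrative assumptions only).

    polls: dict mapping day index -> list of (result_pct, poll_variance).
    Returns a list of (mean, variance) pairs, one per day.
    """
    mean, var = init_mean, init_var
    history = []
    for day in range(n_days):
        # Predict: support may have drifted overnight, so uncertainty grows.
        var += process_var
        # Update: fold in every poll released on this day.
        for result, poll_var in polls.get(day, []):
            gain = var / (var + poll_var)   # how much to trust the new poll
            mean += gain * (result - mean)  # move the estimate toward the poll
            var *= (1.0 - gain)             # each poll shrinks the uncertainty
        history.append((mean, var))
    return history

# Hypothetical data: three consistent polls on days 0-2, then a week of silence.
poll_var = 9.0  # standard error of 3 points, roughly a +/-6 point 95% margin
polls = {0: [(52.0, poll_var)], 1: [(51.0, poll_var)], 2: [(53.0, poll_var)]}
history = kalman_1d(polls, n_days=10)

for day, (mean, var) in enumerate(history):
    half_width = 1.96 * math.sqrt(var)  # half-width of the 95% interval
    print(f"day {day}: {mean:.1f} +/- {half_width:.1f}")
```

In this toy run the interval after the third poll is narrower than the margin of error of any single poll, and it widens again on every day without new data, which is the pattern visible in the shaded areas.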

To estimate the likelihood of a Democratic takeover of the Senate, I ran 100,000 simulations based on the Democrats' odds of winning each individual seat. The odds of a takeover are simply the proportion of simulations in which the Democrats won at least 51 seats.
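The simulation step can be sketched roughly as follows. This is an illustration rather than the code behind the predictions: it assumes the races are independent of one another, and the function name takeover_probability, the count of seats treated as safe, and the per-race win probabilities are all hypothetical.

```python
import random

def takeover_probability(win_probs, safe_seats, seats_needed=51,
                         n_sims=100_000, seed=0):
    """Estimate the chance of reaching seats_needed by simulation.

    win_probs: win probability for each contested race (hypothetical values).
    safe_seats: seats assumed to be held without a contest.
    Assumes races are independent, which real elections may not be.
    """
    rng = random.Random(seed)
    takeovers = 0
    for _ in range(n_sims):
        # Flip one biased coin per contested race and count the wins.
        seats = safe_seats + sum(rng.random() < p for p in win_probs)
        if seats >= seats_needed:
            takeovers += 1
    return takeovers / n_sims

# Made-up example: 45 safe seats and eight contested races.
example_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.3]
print(takeover_probability(example_probs, safe_seats=45))
```

If the races tend to swing together on election night, treating them as independent will understate the chance of an unusually good or bad night for either party.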

Assumptions

The predictions shown here are only "optimal" (i.e., the best possible predictions) under several assumptions. Some of these assumptions are probably wrong, but the Kalman filter is generally fairly robust to violations of its assumptions -- it's used in a lot of real-world engineering applications, such as autopilots, satellite navigation, and guided missiles. This method does not assume that polls are unbiased -- a poll may reliably favor a specific candidate for reasons that are a by-product of its methodology. For example, suppose Democrats come home for dinner later than Republicans; if so, then a poll that is based on phone calls at 7PM may be biased towards the Republican candidate. In my model, bias is modeled explicitly, so if a poll consistently rates a given candidate higher than average, this bias will be learned and subtracted out. In the future, I hope to present data about the estimated bias of each polling operation.

Details

When I have a little more time, I'll discuss the specifics of how I parameterized the Kalman filter here. Statistically minded readers may appreciate that my model is sometimes called an "augmented Kalman filter," since the bias terms are included as part of the system state. The system state also includes a "velocity" term that models the change in support from day to day; this permits the system to model political "momentum." The observation vector y consists of all poll results for all candidates on a given day, with zeros for all polls that do not report a result on that day.
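Until the actual parameterization is written up, here is one way such an augmented state could be laid out for a single candidate. This is a guess at the general shape, not the model itself: the state ordering, the matrix names F and H, the helper names build_model and mat_vec, and all the numbers are assumptions made for illustration.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def build_model(n_pollsters):
    """Build transition and observation matrices for one candidate.

    Assumed state layout: [support, velocity, bias_1, ..., bias_n].
    """
    size = 2 + n_pollsters
    # Transition F: start from the identity so velocity and biases persist...
    F = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    # ...then let support advance by the velocity each day ("momentum").
    F[0][1] = 1.0
    # Observation H: pollster k reports true support plus its own bias.
    H = []
    for k in range(n_pollsters):
        row = [0.0] * size
        row[0] = 1.0       # every poll sees the underlying support
        row[2 + k] = 1.0   # plus the bias of the pollster that ran it
        H.append(row)
    return F, H

F, H = build_model(n_pollsters=2)

# Hypothetical state: 50% support, gaining 0.5 points per day,
# pollster 1 reads 2 points high, pollster 2 reads 1 point low.
state = [50.0, 0.5, 2.0, -1.0]
next_state = mat_vec(F, state)           # [50.5, 0.5, 2.0, -1.0]
expected_polls = mat_vec(H, next_state)  # [52.5, 49.5]
print(next_state, expected_polls)
```

A full filter would also carry a covariance matrix and noise terms, and would need some convention (for instance, measuring each bias relative to the average poll) so that a shift in every bias cannot be confused with a shift in support.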

Acknowledgements

The data is from the PollMaster at electoral-vote.com -- he kindly posts a daily spreadsheet of all of the poll results. I used Kevin Murphy's Bayes Net Toolbox for its implementation of the Kalman Filter. I'll try to post my own code shortly. Thanks of course to MIT, for letting me use their computers for a project that is only tangentially related to my research.

Contact

I am Jacob Eisenstein (jacobe), and I'm a PhD student in Computer Science at MIT.