The predictions are here.
Summary
Many sites that I enjoy attempt to
predict the outcome of elections by averaging the last several polls. But
this isn't quite right -- suppose you wanted to average the last three polls:
one from today, one from yesterday, and one from a month ago. You should
assign more importance to the more recent polls, but how much more? I use a
technique from statistics called the Kalman Filter to
combine all available polling data into the best possible (*) prediction of the outcome of each
2006 Senate election.
The dark lines in the figures represent the estimated level of support for
each candidate; the circles are individual poll results; the shaded areas
show 95%
confidence intervals. Notice how the confidence intervals expand when no
new polls are taken. In the absence of new information, our previous guesses
become increasingly unreliable, since the underlying political situation
could be changing. Another advantage of the Kalman Filter is that it permits
tighter confidence intervals than the margin of error of any individual poll; if
several different polls with consistent results are taken at around the same
time, the overall confidence may be quite high.
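Both effects can be seen in a much simpler filter than the one used for the predictions. The following is a minimal sketch, assuming a single candidate whose support follows a random walk; the function name and all variances are hypothetical values chosen for illustration, not fitted parameters.

```python
def kalman_1d(polls, n_days, process_var=0.25, poll_var=4.0,
              init_mean=50.0, init_var=100.0):
    """Scalar random-walk Kalman filter (illustrative only).

    polls maps a day index to the list of poll results (in percent)
    published on that day. All variances are made-up values.
    """
    mean, var = init_mean, init_var
    history = []
    for day in range(n_days):
        # Predict: true support may have drifted, so uncertainty grows.
        var += process_var
        # Update: fold in each poll taken on this day, one at a time.
        for result in polls.get(day, []):
            gain = var / (var + poll_var)
            mean += gain * (result - mean)
            var *= 1.0 - gain
        history.append((mean, var))
    return history


# Three consistent polls on day 0, one on day 1, then silence.
polls = {0: [48.0, 50.0, 49.0], 1: [49.5]}
history = kalman_1d(polls, n_days=10)
for day, (mean, var) in enumerate(history):
    print(f"day {day}: estimate {mean:.2f}, variance {var:.2f}")
```

After the three day-0 polls, the estimate's variance is already below the variance of a single poll, and from day 2 onward it grows steadily because no new polls arrive.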
To estimate the likelihood of a Democratic takeover of the Senate, I ran 100,000
simulations based on the Democrats' odds of winning each individual seat. The odds of
a takeover are simply the proportion of simulations in which the Democrats won at least 51 seats.
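The simulation step can be sketched as follows. This is an illustration under stated assumptions, not the code behind the predictions: it treats races as independent, and the per-seat probabilities and the count of seats not in play are invented numbers.

```python
import random


def takeover_probability(win_probs, safe_seats, needed=51,
                         n_sims=100_000, seed=0):
    """Estimate the chance of holding at least `needed` seats.

    win_probs:  probability of winning each contested seat (hypothetical).
    safe_seats: seats assumed already held and not in play (hypothetical).
    Races are simulated independently, which is an assumption.
    """
    rng = random.Random(seed)
    takeovers = 0
    for _ in range(n_sims):
        won = sum(rng.random() < p for p in win_probs)
        if safe_seats + won >= needed:
            takeovers += 1
    return takeovers / n_sims


# Made-up example: six contested seats, 47 seats assumed safe.
example_probs = [0.9, 0.8, 0.6, 0.5, 0.4, 0.3]
print(takeover_probability(example_probs, safe_seats=47))
```

The answer is only as good as the per-seat probabilities fed in; if errors in different races are correlated, the independent draws here will understate the spread of outcomes.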
Assumptions
The predictions shown here are only "optimal" (i.e., the best possible
predictions) under several assumptions. Some of these assumptions are
probably wrong, but the Kalman filter is generally fairly robust to that
kind of thing -- it's used in a lot of real-world engineering
applications, such as auto-pilot, satellite navigation, and guided missiles.
- Polling errors are mutually
independent. If poll A makes a certain error, say, 5 points in favor of
the Republican, that doesn't tell us anything about what kind of error poll B
might make.
- Polling errors follow a normal
distribution.
- The true level of support for a candidate is based only on the level of
support on the previous day, plus some random change. The candidate's
popularity from a month ago doesn't matter. In statistics, a model
that has this property is referred
to as a Markov chain.
- There is no "bandwagon" effect -- high poll results today
won't cause more people to support a candidate tomorrow.
- All polls have the same margin-of-error. This is obviously
not true, but unfortunately my data source does not list margin-of-error
at the moment. I assume a margin-of-error of 4%.
If you have another idea about where I can automatically download polling data,
please get in touch.
- Polls are instantaneous events. Real polls are often conducted
over several days. I assume that the poll is a snapshot of the support for
a candidate on the middle day of the polling period.
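The last two assumptions translate into simple preprocessing. The sketch below assumes that the 4% margin of error is a 95% interval under the normal-error assumption (the usual convention, but not something the data source confirms); the helper names are hypothetical.

```python
from datetime import date, timedelta

Z_95 = 1.96  # two-sided 95% quantile of the standard normal


def moe_to_variance(moe_points, z=Z_95):
    """Convert a margin of error (in points) to an error variance.

    Assumes the margin of error is a 95% interval for a normal error.
    """
    sigma = moe_points / z
    return sigma ** 2


def poll_midpoint(start, end):
    """Treat a multi-day poll as a snapshot taken on its middle day."""
    return start + timedelta(days=(end - start).days // 2)


print(moe_to_variance(4.0))
print(poll_midpoint(date(2006, 10, 10), date(2006, 10, 12)))
```

For a polling period with an even number of days there is no single middle day; this sketch rounds down to the earlier of the two.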
This method does not assume that polls are unbiased -- a poll may
reliably favor a specific candidate for reasons that are a by-product of the
methodology. For example, suppose Democrats come home for dinner later than
Republicans; if so, then a poll that is based on phone calls at 7PM may be
biased towards the Republican candidate. In my model, bias is modeled
explicitly, so if a poll consistently rates a given candidate higher than
average, this bias will be learned and subtracted out. In the future, I
hope to present data about the estimated bias of each polling operation.
Details
When I have a little more time, I'll discuss the specifics
of how I parameterized the Kalman filter here. Statistically-minded readers
may appreciate that my model is sometimes called an "augmented Kalman
Filter," since the bias terms are included as part of the system state.
The system state also includes a "velocity" term that models the change in
support from day to day; this permits the system to model political "momentum."
The observation vector y consists of all poll results for all candidates
on a given day, with zeros for every poll that does not report a result on
that day.
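Until the real parameterization is written up, here is one way the description above could be read, sketched with NumPy for a single candidate. The state layout, the noise variances, and the choice to zero the matching row of the observation matrix for a missing poll are all assumptions made for illustration; this is not the Bayes Net Toolbox code.

```python
import numpy as np


def make_model(n_pollsters, level_var=0.1, velocity_var=0.01,
               bias_var=1e-4, poll_var=4.0):
    """State = [level, velocity, bias_1, ..., bias_K] (hypothetical layout)."""
    n = 2 + n_pollsters
    F = np.eye(n)
    F[0, 1] = 1.0  # level_t = level_{t-1} + velocity_{t-1}
    Q = np.diag([level_var, velocity_var] + [bias_var] * n_pollsters)
    R = poll_var * np.eye(n_pollsters)
    return F, Q, R


def observation_matrix(reported):
    """Row k reads level + bias_k if pollster k reported, else all zeros."""
    n_pollsters = len(reported)
    H = np.zeros((n_pollsters, 2 + n_pollsters))
    for k, has_result in enumerate(reported):
        if has_result:
            H[k, 0] = 1.0
            H[k, 2 + k] = 1.0
    return H


def kalman_step(x, P, y, reported, F, Q, R):
    """One day of filtering: predict, then update with that day's polls."""
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update. Missing polls get y = 0 and a zero row in H, so their
    # innovation and their column of the gain are both zero.
    H = observation_matrix(reported)
    y = np.where(reported, y, 0.0)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (y - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P


# Two hypothetical pollsters; only the first reports today.
F, Q, R = make_model(n_pollsters=2)
x0 = np.array([50.0, 0.0, 0.0, 0.0])
P0 = np.diag([100.0, 1.0, 4.0, 4.0])
x1, P1 = kalman_step(x0, P0, np.array([52.0, 0.0]), [True, False], F, Q, R)
print(x1)
```

Because each poll observes the sum of the level and that pollster's bias, the two can only be separated through the prior on the biases and through comparison with other pollsters; the prior variances in `P0` are placeholders.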
Acknowledgements
The data is from the PollMaster at electoral-vote.com -- he kindly
posts a daily spreadsheet of all of the poll results. I used Kevin Murphy's
Bayes Net Toolbox for its implementation of
the Kalman Filter. I'll try to post my own code shortly. Thanks, of course, to MIT for letting me use their
computers for a project that is only tangentially related to my research.
Contact
I am Jacob Eisenstein (jacobe), and I'm a PhD student in Computer Science at MIT.