Nate Silver has built one of the best models out there. It's accurate, consistent, and totally scientific.
One advantage of being totally scientific is that his model can be replicated.
Silver, over the past five years, has explained in depth how his model actually works.
Looking through that, we now have a step-by-step process for how to be Nate Silver.
Here's what you'll need:
- A copy of Microsoft Excel
- A definitive source of campaign finance data and election data
- A vast polling database with 15 years of impeccably accurate information.
The first of these supplies can be picked up at your local computer store or from several online resources.
The last one may be difficult to find (as early as 2010 Silver was working from a database of "4,670 distinct polls from 264 distinct pollsters covering 869 distinct electoral contests"), but once you have it ready to go you're all set to start your FiveThirtyEight model.
Step One: Collect all the recent polls that you can find from within a state. We'll label each of these polls by a variable like P1, P2, P3, or just generally speaking, Pi.
Step Two: Figure out the poll weights. Now we need to find the weighting for each of the polls (w1, w2, w3, or more generally wi), and that's not exactly an easy task. Each wi is made up of three different values: the "recency" of the poll (we'll call this Ri), the sample size of the poll (Ni for each Pi), and the Pollster Rating for the company that did the poll (Qi for each Pi).
- Ri is expressed as an exponential decay function. The older the poll gets, the lower this number gets. If the same polling company releases a newer poll, Ri also goes down. We know that in 2008, Ri was expressed in terms of the "half-life" of the poll.
- Ni has to do with the sample size. The more people that the poll Pi sampled, the higher Ni gets. There are diminishing returns, though, so the rate that Ni grows slows as the sample size gets larger and larger.
- Qi, the Pollster Rating, requires the use of that massive database of polling we mentioned earlier. It measures the accuracy of the pollster's polls in retrospect. Better pollsters have bigger values of Qi.
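To make the first two ingredients concrete, here is a minimal Python sketch. Silver's actual functional forms are not given above, so the 30-day half-life, the square-root curve for diminishing returns, and the 600-respondent baseline are all assumptions chosen for illustration, and the function names are hypothetical. The penalty for a pollster releasing a newer poll is not modeled.

```python
import math

def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Ri as exponential decay: the weight halves every `half_life_days`.

    The 30-day default is an assumed value, not one Silver has published here.
    """
    return 0.5 ** (age_days / half_life_days)

def sample_size_weight(n_respondents: int, baseline: int = 600) -> float:
    """Ni with diminishing returns.

    A square root is one common way to get diminishing returns; both the
    curve and the 600-person baseline are assumptions.
    """
    return math.sqrt(n_respondents / baseline)

# A fresh poll versus one that is exactly one half-life old.
print(recency_weight(0), recency_weight(30))
# Quadrupling the sample only doubles the weight.
print(sample_size_weight(600), sample_size_weight(2400))
```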
Step Three: Get Qi. The Pollster rating, Qi, has to be determined for each pollster before we can move forward. The way Silver does this is rather complex, but the pollster ratings are the signature element of his model, so this is important.
The way Silver goes about developing these ratings is outlined in this blog post from 2010, but here's the gist:
The point is to see how a pollster does compared to the mean. It's much easier to call a presidential contest (avg. error 2.8 points) than a gubernatorial primary (avg. error 7.8 points).
First, run a regression for races where each pollster has a dummy variable. Other variables should include recency for each poll, sample size, and dummy variables about the race. The subsequent weight that is found for each pollster dummy variable is the measured skill and is called the "raw score."
Next, we need to figure out a value called the Reversion parameter. If "n" is the number of previous polls from the pollster in the sample:
reversionparameter = 1 - (0.06 * sqrt(n))
Next, we regress these raw score ratings toward the mean to account for luck, variance, and noise. Pollsters in the National Council on Public Polls or the AAPOR Transparency Initiative are considered better than non-members.
Here are the values of a variable called "groupmean" that we'll need for the final calculation:
- NCPP or the AAPOR Transparency Initiative members: -0.50.
- Polls by telephone: +0.26
- Polling by means of the Internet: +1.04.
adjscore = (rawscore * (1 - reversionparameter)) + (groupmean * reversionparameter)
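Negative numbers are good pollsters, positive ones are bad. Finally, Silver calculates Pollster-Induced Error: PIE = 2 + adjscore. The minimum PIE is 0, and for our purposes Qi is derived from PIE. Because a lower PIE marks a better pollster, while Step Two wants better pollsters to have bigger values of Qi, Qi has to shrink as PIE grows.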
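The arithmetic in this step can be sketched in a few lines of Python. The formulas and group means are the ones quoted above. The function names and dictionary keys are hypothetical, and the floor of 0 on the reversion parameter is an assumption, since the text does not say what happens once a pollster has more than about 277 polls and the formula turns negative.

```python
import math

# "groupmean" values quoted in the text (negative is better).
GROUP_MEAN = {"ncpp_or_aapor": -0.50, "telephone": 0.26, "internet": 1.04}

def reversion_parameter(n_polls: int) -> float:
    """reversionparameter = 1 - (0.06 * sqrt(n))."""
    # Assumption: floor at 0 so the formula never goes negative for
    # very prolific pollsters; the text does not cover this case.
    return max(0.0, 1 - 0.06 * math.sqrt(n_polls))

def adjusted_score(raw_score: float, n_polls: int, group: str) -> float:
    """Pull the raw score toward its group mean."""
    r = reversion_parameter(n_polls)
    return raw_score * (1 - r) + GROUP_MEAN[group] * r

def pollster_induced_error(adj_score: float) -> float:
    """PIE = 2 + adjscore, with a minimum of 0."""
    return max(0.0, 2 + adj_score)

# A made-up pollster: raw score of -1.0 across 100 polls, NCPP/AAPOR member.
adj = adjusted_score(-1.0, 100, "ncpp_or_aapor")
print(round(adj, 3), round(pollster_induced_error(adj), 3))
```

With 100 polls the reversion parameter is 0.4, so 60 percent of the adjusted score comes from the pollster's own record and 40 percent from the group mean. A pollster with only a handful of polls is pulled almost entirely to the group mean.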
Step Four: Get the weighted polling average. Silver doesn't say how he combines these values Ri, Ni, and Qi to form Wi, and by now that's most of what remains of his "secret sauce." We'll just multiply them.
Now, we have a weight Wi for each poll Pi. Take the average for each state of all the weighted polls. Each pollster has a "House Effect," or a demonstrable partisan skew as compared to the running poll average. For many, this isn't a major factor, but the House effect for each poll (Hi) has to be factored in.
When n is the number of polls in the average, the Weighted Polling Average is equal to ∑[(Wi•Pi) + Hi] / n.
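Here is a short Python sketch of that average, following the formula as written and the "just multiply them" guess for the weights. The Poll class and its field names are hypothetical, and Qi is assumed to be supplied already oriented so that bigger means better.

```python
from dataclasses import dataclass

@dataclass
class Poll:
    margin: float        # Pi, e.g. the Democrat's lead in points
    recency: float       # Ri
    sample: float        # Ni
    quality: float       # Qi (assumed: bigger is better)
    house_effect: float  # Hi, correction in points

    @property
    def weight(self) -> float:
        # The article's guess: multiply the three components to get Wi.
        return self.recency * self.sample * self.quality

def weighted_polling_average(polls: list[Poll]) -> float:
    """Sum of (Wi * Pi + Hi) over all polls, divided by the number of polls."""
    if not polls:
        raise ValueError("need at least one poll")
    return sum(p.weight * p.margin + p.house_effect for p in polls) / len(polls)

# Two made-up polls with unit weights and no house effect.
polls = [Poll(4.0, 1.0, 1.0, 1.0, 0.0), Poll(2.0, 1.0, 1.0, 1.0, 0.0)]
print(weighted_polling_average(polls))
```

Because the formula divides by n rather than by the sum of the weights, it only behaves like an average when the weights hover around 1. Dividing by the sum of the Wi instead would be the more conventional weighted mean, but that is a departure from the formula given here.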
Next, do a national trend line adjustment to the weighted polling average. Nationally, if the generic Democrat has lost two points, factor that into the Weighted Polling Average.
Finally, do a likely voter adjustment to the trend-adjusted weighted polling average. Nate's version is simple: just add a certain percentage to the party that is usually expected to get more votes.
This, incidentally, is the FiveThirtyEight Weighted Polling Average you'll see in each Senate race and state presidential race.
Step Five: The FiveThirtyEight Regression. Silver creates one additional "poll" that gets averaged into the mix.
This "poll" is actually the result of a regression that brings in the major ground effects of the race — candidate experience, partisan tilt of the state, incumbency stats, et cetera.
Silver has already figured out the coefficients based on long-time study of prior races, like the marginal quantitative advantage that a Governor has over a House Representative in a Senate election.
You'll have to figure that out on your own by running regressions of historical data, just like Silver presumably did. Once you do figure out those coefficients (A, B, C...), push them into a regression that looks something like this:
LR = Ax1 + Bx2 + Cx3 + Dx4 + Ex5 + Fx6
LR is the result of this poll, and the xi values correspond to these variables:
- x1, the partisan voting index of the state. This is probably taken from Cook Political Report, and is a number describing the average margin of victory for a generic Republican candidate over a generic Democrat. In Virginia, for example, a generic Republican would beat a generic Democrat by 1 point, so the PVI would be 1.
- x2, the Party Identification in the state. This value compares the number of registered Dems to the number of registered GOP members.
- x3, a number (or numbers) describing fundraising data for each candidate
- x4, a binary variable for incumbency.
- x5, a number describing the approval rating of an incumbent, if there is one.
- x6, a series of dummy variables describing "stature." For senators and governors this is 3, for Representatives, attorneys general and big-city mayors this is 2, for state-level offices this is 1, and for no prior experience this is 0.
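Once the coefficients exist, the regression "poll" is just a weighted sum, as in this Python sketch. The coefficient values below are placeholders invented for illustration, not Silver's numbers, and the feature names and candidate values are hypothetical. Treating fundraising as a single number is also a simplification.

```python
# Order matches x1..x6 in the list above.
FEATURES = ["pvi", "party_id", "fundraising", "incumbent", "approval", "stature"]

def regression_poll(coefficients: dict[str, float], values: dict[str, float]) -> float:
    """LR = A*x1 + B*x2 + C*x3 + D*x4 + E*x5 + F*x6."""
    return sum(coefficients[name] * values[name] for name in FEATURES)

# Placeholder coefficients: you would estimate these from historical races.
placeholder_coefficients = {
    "pvi": 0.8, "party_id": 0.3, "fundraising": 0.5,
    "incumbent": 2.0, "approval": 0.1, "stature": 1.0,
}

# A made-up candidate: lean-Republican state, incumbent senator.
example_values = {
    "pvi": 1.0, "party_id": -2.0, "fundraising": 1.5,
    "incumbent": 1.0, "approval": 48.0, "stature": 3.0,
}

print(regression_poll(placeholder_coefficients, example_values))
```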
Step Six: Error analysis. The error for the snapshot projection is determined from variables based on prior FiveThirtyEight projections. Here are the ones that Silver specifically refers to as important:
- The error is higher in races with fewer polls
- The error is higher in races where the polls disagree with one another.
- The error is higher when there are a larger number of undecided voters.
- The error is higher when the margin between the two candidates is lopsided.
- The error is higher the further one is from Election Day.
After finishing this, we have two statistics: The projected vote split and the standard error.
Step Seven: Prepare for simulation.
So now, we want to run around 100,000 simulations to figure out what happens next.
We have projected vote split and standard error for each of the 50 states for our model.
We need to split up the error into two types: National Error and Local Error. We know total error from Step Six, when we calculated the standard error.
National error, which we'll call NE, is calculated from a historical analysis of poll changes between the date of the analysis and Election Day, as well as general changes.
Here's the relationship between Local, National, and Total error:
Local Error = √[(Total Error)² - (National Error)²]
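In Python, that relationship is a one-liner plus a sanity check. The function name is hypothetical, and raising an error when the national error exceeds the total is an assumption about how to handle a case the text does not discuss.

```python
import math

def local_error(total_error: float, national_error: float) -> float:
    """Local error from total and national error, which add in quadrature."""
    remainder = total_error ** 2 - national_error ** 2
    if remainder < 0:
        # Assumption: treat a national error larger than the total as bad input.
        raise ValueError("national error cannot exceed total error")
    return math.sqrt(remainder)

# Made-up numbers: 5 points of total error, 3 of them national.
print(local_error(5.0, 3.0))
```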
Step Eight: Simulate once. Do you have Excel ready? Great. We now calculate values for Local and National error based on a normal distribution, using Microsoft Excel's NORMINV(RAND(), mu, sigma) function.
For national error, the value of mu is equal to the observed NE and sigma is equal to the standard deviation observed when calculating NE.
In each simulation, NE is the same for each state. But in each state, we calculate local error.
- To calculate Total Error, run NORMINV with mu equal to the standard error developed in Step Six for each state and the related sigma.
- National Error comes from NE.
Then, plug these values into the Local Error formula to get local error.
So now, you have a row in Excel with (a) National error and (b) 50 local errors, one for each state. Take the Local Error for each state and combine it with the related projected vote split that you got in Step Six. This is this simulation's result, state by state.
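If you would rather not build this in a spreadsheet, one simulation can be sketched in Python. This is one reading of the step, not a literal port: it draws a single national shift shared by every state and a separate local shift per state, both centered on zero, and adds both to the projected margin. Python's `random.Random.gauss(mu, sigma)` stands in for NORMINV(RAND(), mu, sigma). The state names and numbers are made up.

```python
import math
import random

def simulate_once(projections, total_sd, national_sd, rng):
    """Return one simulated margin per state.

    projections: state -> projected margin (positive means party A leads)
    total_sd:    state -> standard error from Step Six
    national_sd: spread of the shared national error (assumed zero-centered)
    """
    # One national draw, reused for every state in this simulation.
    national_shift = rng.gauss(0.0, national_sd)
    result = {}
    for state, margin in projections.items():
        # Local spread is whatever is left after removing the national part.
        local_sd = math.sqrt(max(total_sd[state] ** 2 - national_sd ** 2, 0.0))
        result[state] = margin + national_shift + rng.gauss(0.0, local_sd)
    return result

rng = random.Random(538)  # seeded so the run is repeatable
projections = {"State A": 3.0, "State B": -1.5}
total_sd = {"State A": 4.0, "State B": 5.0}
print(simulate_once(projections, total_sd, national_sd=2.0, rng=rng))
```

Sharing the national draw across states is what makes the simulated states move together, so a bad night for one party shows up everywhere at once.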
Step Nine: Repeat step eight 100,000 times, and aggregate.
That's the first simulation. Now, do the same thing 100,000 times. In each simulation, you'll find out which party won which states.
You can then use this to figure out the electoral vote count in each simulation. Once you have 100,000 simulated electoral vote counts, average them. This is your FiveThirtyEight projection.
For each state, you can figure out the average margin of victory. This will show you the odds for each state. You can also program a row to figure out which state was the first to push someone over 270, which will allow you to calculate the Tipping Point states.
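The repeat-and-aggregate loop can be sketched in Python as well. As before, this assumes one shared, zero-centered national shift per simulation plus a local shift per state, which is an interpretation of the steps above. It reports each state's odds as the share of simulations won, and it leaves out the tipping-point calculation. The states, electoral votes, and error sizes are invented, and the default of 10,000 runs is only there to keep the example quick.

```python
import math
import random

def run_simulations(projections, total_sd, national_sd, electoral_votes,
                    n_sims=10_000, seed=538):
    """Return (mean electoral votes for party A, state -> win probability)."""
    rng = random.Random(seed)
    wins = {state: 0 for state in projections}
    ev_total = 0
    for _ in range(n_sims):
        national_shift = rng.gauss(0.0, national_sd)  # shared by all states
        for state, margin in projections.items():
            local_sd = math.sqrt(max(total_sd[state] ** 2 - national_sd ** 2, 0.0))
            simulated = margin + national_shift + rng.gauss(0.0, local_sd)
            if simulated > 0:  # party A carries the state in this simulation
                wins[state] += 1
                ev_total += electoral_votes[state]
    mean_ev = ev_total / n_sims
    win_prob = {state: count / n_sims for state, count in wins.items()}
    return mean_ev, win_prob

# Three made-up states.
projections = {"State A": 3.0, "State B": -1.5, "State C": 0.5}
total_sd = {"State A": 4.0, "State B": 5.0, "State C": 4.5}
electoral_votes = {"State A": 10, "State B": 20, "State C": 15}

mean_ev, win_prob = run_simulations(projections, total_sd, 2.0, electoral_votes)
print(mean_ev, win_prob)
```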
Now, the key thing to becoming Nate Silver is this: Make sure to do this every day for years on end and add insightful daily commentary responding to different changes in the model that you notice after coding for hours on end. Once you manage that, you can contact Mr. Sulzberger and ask when to pick up your paycheck.