Debunking Two Nate Silver Myths

I followed our election pretty closely. My favorite source of information was Nate Silver. He’s a smart guy, and I love the analysis that he does: he’s using solid math in a sensible way to produce excellent results. But in the aftermath of the election, I’ve seen a lot of bad information going around about him, his methods, and his results.

First: I keep seeing proclamations that “Nate Silver proves that big data works”.

Rubbish.

There is nothing big data about Nate’s methods. He’s using straightforward Bayesian methods to combine data, and the number of data points is remarkably small.

Big data is one of those popular jargon keywords that people use to sound smart, but it does actually mean something. Big data means using massive quantities of information to find patterns: a million data points isn’t really big data. Big data means terabytes of information and billions of data points.

When I was at Google, I did log analysis. We ran thousands of machines every day on billions of log records (I can’t say the exact number, but it was in excess of 10 billion records per day) to extract information. It took a data center with 10,000 CPUs running full-blast for 12 hours a day to process a single day’s data. Using that data, we could extract some obvious things – like how many queries per day came in for each of the languages that Google supports. We could also extract some very non-obvious things that weren’t explicitly in the data, but that were inferable from it – like probable network topologies of the global internet, based on communication latencies. That’s big data.

For another example, look at this image produced by some of my coworkers. At foursquare, we get about five million points of checkin data every day, and we’ve got a total of more than 2.5 billion data points. By looking at average checkin densities, and then comparing them to checkin densities after the hurricane, we can map out precisely where in the city there was electricity and where there wasn’t. We couldn’t do that by watching one person, or a hundred people. But by looking at the patterns in millions and millions of records, we can. That is big data.
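
To sketch the idea in code (this isn’t foursquare’s actual pipeline, just an illustration of the technique): bucket checkins into grid cells, compute a baseline count per cell, and then flag the cells whose post-storm counts collapsed.

    from collections import Counter

    def cell(lat, lon, size=0.01):
        """Bucket a coordinate into a grid cell roughly a kilometer across."""
        return (round(lat / size), round(lon / size))

    def density(checkins):
        """checkins: iterable of (lat, lon) pairs -> checkin count per grid cell."""
        return Counter(cell(lat, lon) for lat, lon in checkins)

    def dark_cells(baseline, after, drop=0.8):
        """Cells whose checkin count fell by at least `drop` (80% by default)."""
        return [c for c, n in baseline.items() if after.get(c, 0) < (1 - drop) * n]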

This doesn’t take away from Nate’s accomplishment in any way. He used data in an impressive and elegant way. The fact is, he didn’t need big data to do this. Elections are determined by aggregate behavior, and you just don’t need big data to predict them. The data that Nate used was small enough that a person could do the analysis of it with paper and pencil. It would be a huge amount of work to do by hand, but it’s just nowhere close to the scale of what we call big data. And trying to do big data would have made it vastly more complicated without improving the result.

Second: there are a bunch of critiques like this one:

The point that many people seem to be missing is that Silver was not simply predicting who would win in each state. He was publishing the odds that one or the other candidate would win in each statewide race. That’s an important difference. It’s precisely this data, which Silver presented so clearly and blogged about so eloquently, that makes it easy to check on how well he actually did. Unfortunately, these very numbers also suggest that his model most likely blew it by paradoxically underestimating the odds of President Obama’s reelection while at the same time correctly predicting the outcomes of 82 of 83 contests (50 state presidential tallies and 32 of 33 Senate races).

Look at it this way: if a meteorologist says there’s a 90% chance of rain where you live and it doesn’t rain, the forecast wasn’t necessarily wrong, because 10% of the time it shouldn’t rain – otherwise the odds would be something other than a 90% chance of rain. One way a meteorologist could be wrong, however, is by using a predictive model that consistently gives the incorrect probabilities of rain. Only by looking at the odds the meteorologist gave and comparing them to actual data could you tell in hindsight if there was something fishy with the prediction.

Bzzt. Sorry, wrong.

There are two main ways of interpreting probability data: frequentist, and Bayesian.

In a frequentist interpretation, when you say that an outcome of an event has an X% probability of occurring, you’re saying that if you were to run an infinite series of repetitions of the event, then on average, the outcome would occur in X out of every 100 events.

The Bayesian interpretation doesn’t talk about repetition or observation. What it says is: for any specific event, it will have one outcome. There is no repetition. But given the current state of information available to me, I can have a certain amount of certainty about whether or not the event will occur. Saying that I assign probability P% to an event doesn’t mean that I expect my prediction to fail (100-P)% of the time. It just means that given the current state of my knowledge, I expect a particular outcome, and the information I know gives me that degree of certainty.

Bayesian statistics and probability are all about the state of knowledge. The fundamental, defining theorem of Bayesian statistics is Bayes’ theorem, which tells you, given your current state of knowledge and a new piece of information, how to update your knowledge based on what the new information tells you. Getting more information doesn’t change anything about whether or not the event will occur: it will occur, and it will have either one outcome or the other. But new information can allow you to improve your prediction and your certainty of that prediction’s correctness.
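
To make that concrete: Bayes’ theorem says P(H|E) = P(E|H) × P(H) / P(E) – your updated certainty in a hypothesis H after seeing evidence E. Here’s a minimal sketch of a single update step, with numbers made up purely for illustration:

    def bayes_update(prior, p_e_given_h, p_e_given_not_h):
        """Return P(H|E) from a prior P(H) and the likelihoods of the evidence."""
        p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
        return p_e_given_h * prior / p_e

    # Start 60% certain; see evidence twice as likely if H is true as if it isn't.
    print(bayes_update(0.60, 0.8, 0.4))  # -> 0.75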

The author that I quoted above is being a frequentist. In another section of his article, he’s more specific:

…The result is P = 0.199, which means there’s a 19.9% chance that it rained every day that week. In other words, there’s an 80.1% chance it didn’t rain on at least one day of the week. If it did in fact rain every day, you could say it was the result of a little bit of luck. After all, 19.9% isn’t that small a chance of something happening.

That’s a frequentist interpretation of the probability – which makes sense, since as a physicist, the author is mainly working with repeated experiments, which are a great place for frequentist interpretation. But looking at the same data, a Bayesian would say: “I have a 19.9% certainty that it will rain today.” Then they’d go look outside, see the clouds, and say “OK, so it looks like rain – that means that I need to update my prediction. Now I’m 32% certain that it will rain.” Note that nothing about the weather has changed: it’s not that it wouldn’t rain 80.1 percent of the time before looking at the clouds, and that looking changed that. The actual fact of whether or not it will rain on that specific day didn’t change.
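
Mechanically, the update works like this: in odds form, Bayes’ theorem says the posterior odds are the prior odds times the likelihood ratio of the new evidence. The 32% above is illustrative, and so is the likelihood ratio below – it’s just a value (clouds being about 1.9 times as likely on rainy days as on dry ones) that happens to move 19.9% to roughly 32%:

    def update_odds(prior, likelihood_ratio):
        """Bayes in odds form: posterior odds = prior odds * likelihood ratio."""
        odds = prior / (1 - prior) * likelihood_ratio
        return odds / (1 + odds)

    print(update_odds(0.199, 1.9))  # -> about 0.32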

Another way of looking at this is to say that a frequentist believes that a given outcome has an intrinsic probability of occurring, and that our attempts to analyze it just bring us closer to the true probability; whereas a Bayesian says that there is no such thing as an intrinsic probability, because every event is different. All that changes is our ability to make predictions with confidence.

One last metaphor, and I’ll stop. Think about playing craps, where you’re rolling two six-sided dice. For a particular die, a frequentist would say “A fair die has a 1 in 6 chance of coming up with a 1.” A Bayesian would say “If I don’t know anything else, then my best guess is that I can be about 16.7% certain that a 1 will result from a roll.” The result is the same – but the reasoning is different. And because of the difference in reasoning, you can produce different predictions.
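
And the frequentist side of that same coin is easy to watch in a simulation: roll a fair die a lot of times, and the observed frequency of a 1 settles toward 1/6.

    import random

    rolls = [random.randint(1, 6) for _ in range(100_000)]
    print(sum(r == 1 for r in rolls) / len(rolls))  # close to 1/6, about 0.167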

Nate Silver’s predictions of the election are a beautiful example of Bayesian reasoning. He watched daily polls, and each time a new poll came out, he took the information from that poll, weighted it according to the historical reliability of that poll in that situation, and then used that to update his certainty. So based on his data, Nate was 90% certain that his prediction was correct.
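
For a sense of what a single step of that process might look like – this is a toy sketch, not Silver’s actual model, and the polls, weights, and spread below are all made up – you can combine polls into a weighted margin and translate it into a win probability:

    from statistics import NormalDist

    polls = [  # (Obama-minus-Romney margin in points, hypothetical reliability weight)
        (2.0, 1.0),
        (3.0, 0.8),
        (-1.0, 0.5),
    ]

    margin = sum(m * w for m, w in polls) / sum(w for _, w in polls)
    spread = 2.5  # assumed uncertainty on the combined margin, in points
    win_probability = 1 - NormalDist(margin, spread).cdf(0)  # P(margin > 0)
    print(margin, win_probability)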

34 thoughts on “Debunking Two Nate Silver Myths”

  1. Jason Dick

    I think the criticism of Nate Silver’s errors is pretty accurate. In Bayesian terms, he underestimated his knowledge of the outcomes because he overestimated the potential bias in the state-level polls. After all, if he properly understood the bias, then the many races that were predicted would absolutely have reduced to the frequentist case. The only way that his analysis would survive in that situation would be if the errors in the various states were very highly-correlated, but this is difficult to defend when the mean result is much better-measured than the per-state result.

    Now, one could argue that it could have potentially been the case this time around that the state-level polls had a systematic bias, but it seems to me that a better way of doing this is to simply try to measure the systematic bias by comparing polling firms against one another. Nate actually did this, but it doesn’t seem that he factored his understanding of the systematic bias into the numbers, only taking previous experience with polling accuracy.

    1. Manuel M Moe Garcia

      Nate Silver is in need of criticism? Nate Silver’s methods are not successful? I think Jason Dick could benefit from reading the gritty details of Silver’s methods – Silver never assumes correlation between states unless the data points are so very few that he has to make an appeal to similarity – such are the trade-offs that one must make with a very small number of imperfect polling results. But Silver compensates for this with great success, using all manner of population and demographic data available – and his track record speaks for itself. I had to check the date and time of Jason Dick’s comment just to verify that this was not a message in a bottle from a few months ago.

      The biases that Jason Dick speaks of are but a very few among the very, very many that Silver has spoken about, in excruciating detail, explaining his strategies for overcoming them. Silver has a long laundry list of biases that he works to counteract – Silver knows a great deal about polling.

      1. Jason Dick

        Let me make this clear: Nate Silver clearly overestimated uncertainty in the biases. Otherwise he would have had tighter error bars. The fact that the data points fell right in the middle of the error bars is indicative of an incorrect error analysis.

        1. John Armstrong

          Umm… Silver did have error bars tight enough to miss the precise vote counts in three of the 51 races. Getting it right 48 out of 51 times is about 94.1% accuracy. For 95% error bars, that’s pretty damn correct.

          1. Jason Dick

            I only see two where the error bars were outside the estimates (Hawaii and West Virginia). And that’s really not surprising there, because those states were very poorly-polled. My main beef with Nate Silver is that he underestimated the errors in the swing states where the polling was much more dense.

            But regardless, the main point I was trying to make is that his probability of victory for the swing states was simply too low. Leaving out Florida (since it was basically a tossup), we have the following probabilities of the victor who eventually carried the state:
            North Carolina: 74%
            Virginia: 79%
            Colorado: 80%
            Iowa: 84%
            New Hampshire: 85%
            Nebraska 2: 87%
            Ohio: 91%
            Wisconsin: 97%
            Nevada: 93%
            Montana: 98%
            Arizona: 98%
            Maine 2: 95%

            So leaving out the races with very high probabilities of going one way or the other (99%+), the chance of getting each one of these correct given uncorrelated errors would be about 22%.
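
            (That 22% is just the product of the probabilities above, assuming uncorrelated errors; in Python:)

                probs = [0.74, 0.79, 0.80, 0.84, 0.85, 0.87, 0.91, 0.97, 0.93, 0.98, 0.98, 0.95]
                p_all = 1.0
                for p in probs:
                    p_all *= p
                print(p_all)  # about 0.22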

            So granted, maybe it was just a fluke that Silver called all of the swing states. But I don’t think so.

          2. David Olsen

            His error bars were too small in the poorly polled states and too big in the swing states. Saying he correctly missed 5% is just a fluke: he was wrong enough in the low-sample states to hide the fact that he was too cautious (too-wide error bars) in the high-sample states. There were 30 polls in Ohio in the last week of the election. None of them had Obama losing. Only Rasmussen (with a historical right bias that was being adjusted for) had a tie. And you’re going to say 91% is a correct confidence level? Seriously, what’s that? 9% confidence that all the polls are dropping acid?

        2. David Olsen

          The fact that in areas where there was poor polling, like Hawaii and West Virginia, the error bars didn’t account for the horrible polling is, in itself, a valid criticism. Where there was a lot more polling, his actual errors were much tighter than his stated error bars, and the overall picture looked much worse as a result.

          His bars were too tight in places with few polls and his error bars were too broad in places with a lot of polls. These are two sides of the same coin.

    2. David Olsen

      Sam Wang, who similarly did a Bayesian analysis of the race, had Obama at 99.5% before the race, while Silver still had only 80% certainty of Obama winning overall. If they were doing the same thing, they should have had similar confidence levels too. I really think his error bars are too large as well. When you get more data, the potential error drops rather significantly. Perhaps his studies of prior polls and biases gave him better judgement in gauging the typical error in polls, but they would have needed to be so systematically wrong at that point that a confidence level of 80% seems way too low.

  2. Dan

    I’m surprised you didn’t take the second author to task for assuming independence of the events. For simultaneous political races, this is a ridiculously bad assumption to make. Heck, it’s probably wrong for weekly weather patterns as well.

  3. Steve Ruble

    How does Silver’s use of simulations for some parts of his predictions affect the status of those predictions as Bayesian as opposed to frequentist?

  4. Manuel M Moe Garcia

    (I am puzzled by your writing, but I am sure it is from my own stupidity. Please indulge me and correct me where I am wrong. Thanks.)

    Surely a frequentist interpretation of probability does not preclude the use of Bayes’ Theorem? The theorem is still correct if dealing with frequencies; the equation doesn’t depend on the interpretation.

    My understanding of Silver’s methods is that he is using ad-hoc methods derivable from Bayes’ Theorem.

    Bayes’ Theorem lets you take a finding or an assumption

    [ B ]

    and speak of the probability before having the finding/assumption

    [ P(A) ]

    and speak of an “update” to the probability after having the finding/assumption

    [ P(B|A)/P(B) ]

    to give you the probability taking in account the finding/assumption

    [ P(A|B) which is P(A) times P(B|A)/P(B) ]

    (I am not typing this out because I don’t think you know this already. I am typing this out so my own ignorance has a better chance of jumping out at me.)

    And it does not care whether you are speaking of percentages of frequencies or of moving toward higher confidence in a prediction.

    So, using Bayes’ Theorem, Silver can strategically, intelligently compensate for deficiencies in the data, whether it be missing results or results contaminated by bias. Some of these methods force one to deal with combinatorics that lead to dealing with many different cases, so automation is a boon. Also, some methods are intractable without use of Monte-Carlo techniques – specifically Silver sets up simulations and sees how the electoral votes distribute – the many many runs of the simulations are strongly sympathetic to the frequentist interpretation of probability.
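
    To make the simulation idea concrete, here is a minimal sketch – the state win probabilities and electoral vote counts below are placeholders, not Silver’s actual numbers, and a real model would simulate correlated national swings rather than treating states as independent:

        import random

        # Placeholder per-state (win probability, electoral votes) pairs.
        states = {
            "A": (0.91, 18),
            "B": (0.79, 13),
            "C": (0.50, 29),
        }
        needed = 30  # placeholder for the number of electoral votes needed to win

        def simulate_once():
            """One simulated election: each state falls according to its probability."""
            return sum(ev for p, ev in states.values() if random.random() < p)

        runs = [simulate_once() for _ in range(10_000)]
        print(sum(ev >= needed for ev in runs) / len(runs))  # simulated win probability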

    I could be wrong in my explanation above, but my readings have been consistent with this.

    So the interpretation of probability doesn’t enter into it, with regards to the math. Only with the communication after. With effort, it could all be communicated using frequentist language, kind of like how some mathematicians in the 1700s and 1800s used to take great pains to use only natural numbers but still got complete results, while others used negative numbers & zero with the same facility and familiarity as natural numbers, with no apologies, getting complete results with much greater ease. The choice of the frequentist interpretation of probability is much less of a burden than eschewing negative numbers.

  7. Peter

    What can it mean that Nate Silver was “90% certain” except that 90% of the time that his model is similarly certain, it’s correct?

    Is there any way to compare, say, NS claiming 90% certainty and SW claiming 99% certainty and DC claiming 80% certainty that the other two are wrong?

    Also, Nate didn’t just claim a 90% chance of Obama winning, he gave probabilities by state for the president and Congress. And most importantly for this discussion, he gave an expectation value of the number of electoral votes for president: 313. Actual outcome is apparently 332, because every state he predicted likely to go to Obama went to Obama, and he expected that several wouldn’t.

    Nate Silver underestimated the uncertainty on his model. He could have done a better job. Maybe he didn’t know he could have done better, but that’s because he didn’t do as good a job collecting and/or analyzing his data as he could have. Sam Wang made similar predictions with tighter error bars.

    1. David Olsen

      Wang used a straighter average rather than a poll-adjusted average with weights for historical bias. That’s the *only* reason why Wang and others missed Florida. Silver had a straight poll average in Romney’s favor, but the adjusted average was 0.1% towards Obama, so it went blue. — Generally I think Wang did a better job, even without a good adjusted weighting for the polls.

      1. Jason Dick

        There’s no way for anybody to have accurately predicted Florida. The result was far too close for polls to measure. As Wang correctly pointed out, the prediction of Florida effectively amounted to a coin toss.

  9. Oliver Rivers (@maxrothbarth)

    “In a frequentist interpretation, when you say that an outcome of an event has an X% probability of occurring, you’re saying that if you were to run an infinite series of repetitions of the event, then on average, the outcome would occur in X out of every 100 events.”

    I am puzzled by this. Silver runs simulations; his probabilities are derived from counting the number of times an event occurs in (I think) 1,000 simulations. How is that not frequentist?

      1. Peter

        So why would one expect the outcome of a simulation to tell you anything about the outcome of a real election? Why would 1000 simulations tell him any more than the outcome of 1 or 0 simulations on the outcome of “actual elections”?

  10. John Armstrong

    There are two questions here: “why does 1 simulation tell more than 0 simulations?” and “why do n+1 simulations tell more than n simulations?”

    For the first one, it’s complicated, and comes down to what Silver the statistician knows about his data sources and how well they’ve correlated to actual outcomes in the past. Then you can establish good estimates for conditional probabilities like

    P(W=R | M=O)

    i.e.: the winner will be Romney given that the model’s winner is Obama.

    From these you can establish a recurrence that takes your prior probabilities for O and R wins, runs them through Bayes’ theorem, and comes out with new probabilities.

    For the second question, there are two answers. First, you can take the above recurrence and repeat it many, many times with your simulation as input on each one. Each run gives slightly more information and, in principle, they should eventually reach steady state.

    On the other, slightly counterintuitive, hand we can take the simulation as an infinitely repeatable experiment. This is reasonable to apply the frequentist approach to. In principle, we could even run our simulation with all possible PRNG seeds and count all the outcomes. Then we can take these probabilities and again use Bayes’ theorem to improve our estimates of P(O) and P(R).

    The upshot in either case is that the coupling is in the conditional probabilities, which we can determine because the outcome distribution of the model is a statistic, and we can look historically at how its component statistics have compared with the outcomes they’ve tried to predict in the past.
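
    To make that concrete, here is the recurrence with some toy numbers – the conditional probabilities are pure assumptions, just to show the update mechanics:

        # Toy Bayes recurrence: each model run's winner is treated as evidence
        # about the real winner, with assumed conditional probabilities.
        p_obama = 0.5                 # prior P(real winner is Obama)
        p_model_o_given_o = 0.8       # assumed P(model says Obama | Obama wins)
        p_model_o_given_r = 0.3       # assumed P(model says Obama | Romney wins)

        model_runs = ["O", "O", "R", "O"]   # toy simulation outputs
        for run in model_runs:
            if run == "O":
                like_o, like_r = p_model_o_given_o, p_model_o_given_r
            else:
                like_o, like_r = 1 - p_model_o_given_o, 1 - p_model_o_given_r
            p_obama = like_o * p_obama / (like_o * p_obama + like_r * (1 - p_obama))
        print(p_obama)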

    1. Peter

      So you agree with Oliver, and “Because those are simulations, not actual elections.” was not a real answer?

    2. Peter

      I should probably say that I reject this notion that there are two fundamentally different “interpretations” of probability. I understand (er, I have been told) that there are people who naively apply “frequentist” theories when their data doesn’t warrant it, and that’s gotten people into trouble. And they were doing something wrong.

      But the idea that the “Bayesian” approach is fundamentally different is wrong because: Bayes’s Theorem is consistent with a frequentist interpretation of probability, and can be derived from that interpretation; and Bayesian and frequentist approaches converge, they aren’t mutually exclusive perspectives. They’re at most different starting points for different types of data.

      See, for example, Mark CC’s example above where the Bayesian starts from a 1/6 probability of the die landing on n, just like the frequentist, because ultimately they’re both trying to do the same thing.

  11. John Armstrong

    You’re misunderstanding three things. The first is the idea that Bayesians claim some exclusive domain over Bayes’ theorem. Stop with the straw man.

    The second is that I really do mean that you can take the recurrence that adds the information from one model run to your current assumed probabilities and run it many, many times and see what happens. As in sit down with some toy numbers and actually try it. It doesn’t take a doctorate in either mathematics or statistics to perform a little arithmetic.

    Finally, you don’t seem to understand what is meant by “interpretation” here. It’s like interpretations of quantum physics: both viewpoints give the same predictions, but there’s a difference in what they say is “really going on”, which leads them to apply different toolsets.

    1. Peter

      I don’t think I’m misunderstanding those things:

      I’m not claiming that Bayesians do that; MarkCC, for instance, *is* doing that if he’s saying that Nate Silver started from Bayes and so “frequentist” expectations shouldn’t apply to his results (which he is claiming!).

      I agree that you’re using Bayes’ Theorem correctly in your previous post. But only if you assume repetitions of the model are sufficiently like repetitions of the actual election. And if you do, then why isn’t Nate’s approach frequentist?

      And last: the original point is that MarkCC was claiming Nate is entitled to *different results* because he’s using a *different interpretation*. As you say, the different interpretations are equivalent in the sense that they all produce identical predictions. And actually, the different interpretations of QM don’t use different toolsets. Well, I think there are studies of, I’m not exactly sure, how decoherence can produce wave function collapse in the Copenhagen interpretation or something, but I don’t think physicists *use* those types of models unless they’re specifically studying those phenomena.

      1. ScentOfViolets

        Speaking as someone who teaches statistics for a living, you’ve got it exactly right Peter. And with my stat hat still on, I’ve got to say I don’t understand what John A. is on about.

        And still with my stat hat on: Bayesian sorts are kind of weird prickly about things they shouldn’t be. I suspect it comes down to Frequentist envy, unfortunately; that and a lot of them also seem to be libertarians, make of that what you will.

  12. John Miller

    So then, when the Army Corps of Engineers says that my levees will fail in the event of a “100 year storm,” is this a frequentist analysis?

  13. Phil Koop

    Peter: John Armstrong is being unnecessarily coy here. Yes, Monte Carlo simulation is an inherently frequentist operation, the premise being that one random variable (the simulation) will converge to another (that which is being modeled) in distribution.

    But Monte Carlo simulation is just a particular technique for numerical integration, and it is just as valid to apply it to a Bayesian model as to any other. Markov chain Monte Carlo, for example, has proven to be a particularly useful way to solve otherwise intractable Bayesian models.

    1. ScentOfViolets

      Exactly so. One of the reasons that Frequentists (if you insist on attaching an allegiance) predominate is that until relatively recently the Bayesian approach was much harder to model in terms of raw computing power.

  14. Dave W.

    One problem with this critique is that Nate himself has explicitly adopted a quasi-frequentist interpretation of his model: he’s said roughly that if, over a sufficiently large set of elections where he forecasts one side as a 75% favorite, the underdog doesn’t win close to 25% of the time, that would indicate that his model was too conservative. (I don’t have the link, but I think it was a fairly recent column.) Note that you can’t get this out of his state probabilities for this one election by assuming independence, since he does believe that the different state results are positively correlated with one another, in response to national trends. (Contra Jason, I don’t think this is at all difficult to defend, given the historical evidence of elections like 1980, when there was a substantial shift in voter sentiment against Carter in the final weekend before the election, after the final public polls had taken their surveys. This did show up in at least the Carter campaign’s private polls.)

    I think this is an issue with a pure Bayesian interpretation in general, since the choice of initial priors may be left unanchored as arbitrary assumptions, unless the subsequent observations are sufficient to get a wide variety of initial priors to converge to very similar results. When a weather forecaster reports a 30% chance of rain, I want that forecast to mean something stronger than just “rain is more likely than when this forecaster reports a 20% chance, but less likely than when this forecaster reports a 40% chance.” It should mean something pretty close to “over all situations where this forecaster predicts a 30% chance of rain, a randomly situated observer within the forecast area should observe rain close to three times in ten.” If not, I think that becomes evidence that either the model or the priors need adjusting.
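
    That kind of calibration check is straightforward to sketch: bucket forecasts by their stated probability and compare each stated probability with the observed frequency (the forecast/outcome pairs below are placeholders):

        from collections import defaultdict

        # Placeholder (stated probability, observed outcome) pairs.
        forecasts = [(0.3, False), (0.3, True), (0.3, False), (0.7, True), (0.7, True)]

        buckets = defaultdict(list)
        for stated, rained in forecasts:
            buckets[stated].append(rained)

        for stated, outcomes in sorted(buckets.items()):
            observed = sum(outcomes) / len(outcomes)
            print(stated, observed)  # a calibrated forecaster has these roughly equal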

    1. Dave W.

      I found the column I referred to in my previous comment, from October 22nd: http://fivethirtyeight.blogs.nytimes.com/2012/10/22/oct-21-uncertainty-clouds-polling-but-obama-remains-electoral-college-favorite/

      The money quote:

      We calculate Mr. Obama’s odds as being about two chances out of three.

      Not only will the underdog — Mr. Romney — win some of the time, but he should win some of the time if we have estimated the odds correctly. If the set of candidates you have listed as 67 percent favorites in fact win 95 percent of the time, or 100 percent of the time, you’ve done something wrong. Over the long run, such candidates should win two out of three times — no less and no more.

      Of course, it takes a very long time to realize the long run in presidential elections, since there is only one of them every four years. To the extent that one is evaluating the accuracy of political forecast models — whether they calibrate the odds correctly — it is probably better to look at something like races for the Senate. In that case, there are roughly 35 races held every other year, as opposed to just one every four years. Although these races are not completely independent from one another (there have been years in which Democrats or Republicans overperformed their Senate polls across the board), they are substantially more informative on balance for measuring the effectiveness of a series of forecasts.
