Tag Archives: bayesian

Debunking Two Nate Silver Myths

I followed our election pretty closely. My favorite source of information was Nate Silver. He’s a smart guy, and I love the analysis that he does. He’s using solid math in a good way to produce excellent results. But in the aftermath of the election, I’ve seen a lot of bad information going around about him, his methods, and his result.

First: I keep seeing proclamations that “Nate Silver proves that big data works”.


There is nothing big data about Nate’s methods. He’s using straightforward Bayesian methods to combine data, and the number of data points is remarkably small.

Big data is one of the popular jargon keywords that people use to appear smart. But it does actually mean something. Big data is using massive quantities of information to find patterns: using a million data points isn’t really big data. Big data means terabytes of information, and billions of datapoints.

When I was at Google, I did log analysis. We ran thousands of machines every day on billions of log records (I can’t say the exact number, but it was in excess of 10 billion records per day) to extract information. It took a data center with 10,000 CPUs running full-blast for 12 hours a day to process a single days data. Using that data, we could extract some obvious things – like how many queries per day for each of the languages that Google supports. We could also extract some very non-obvious things that weren’t explicitly in the data, but that were inferrable from the data – like probable network topologies of the global internet, based on communication latencies. That’s big data.

For another example, look at this image produced by some of my coworkers. At foursquare, we about five million points of checkin data every day, and we’ve got a total of more than 2 1/2 billion data points. By looking at average checkin densities, and then comparing that to checkin densities after the hurricane, we can map out precisely where in the city there was electricity, and where there wasn’t. We couldn’t do that by watching one person, or a hundred people. But by looking at the patterns in millions and millions of records, we can. That is big data.

This doesn’t take away from Nate’s accomplishment in any way. He used data in an impressive and elegant way. The fact is, he didn’t need big data to do this. Elections are determined by aggregate behavior, and you just don’t need big data to predict them. The data that Nate used was small enough that a person could do the analysis of it with paper and pencil. It would be a huge amount of work to do by hand, but it’s just nowhere close to the scale of what we call big data. And trying to do big data would have made it vastly more complicated without improving the result.

Second: there are a bunch of things like this.

The point that many people seem to be missing is that Silver was not simply predicting who would win in each state. He was publishing the odds that one or the other candidate would win in each statewide race. That’s an important difference. It’s precisely this data, which Silver presented so clearly and blogged about so eloquently, that makes it easy to check on how well he actually did. Unfortunately, these very numbers also suggest that his model most likely blew it by paradoxically underestimating the odds of President Obama’s reelection while at the same time correctly predicting the outcomes of 82 of 83 contests (50 state presidential tallies and 32 of 33 Senate races).

Look at it this way, if a meteorologist says there a 90% chance of rain where you live and it doesn’t rain, the forecast wasn’t necessarily wrong, because 10% of the time it shouldn’t rain – otherwise the odds would be something other than a 90% chance of rain. One way a meteorologist could be wrong, however, is by using a predictive model that consistently gives the incorrect probabilities of rain. Only by looking a the odds the meteorologist gave and comparing them to actual data could you tell in hindsight if there was something fishy with the prediction.

Bzzt. Sorry, wrong.

There are two main ways of interpreting probability data: frequentist, and Bayesian.

In a frequentist interpretation, saying that an outcome of an event has a probability X% of occuring, you’re saying that if you were to run an infinite series of repetitions of the event, then on average,
the outcome would occur in X out of every 100 events.

The Bayesian interpretation doesn’t talk about repetition or observation. What it says is: for any specific event, it will have one outcome. There is no repetition. But given the current state of information available to me, I can have a certain amount of certainty about whether or not the event will occur. Saying that I assign probability P% to an event doesn’t mean that I expect my prediction to fail (100-P)% of the time. It just means that given the current state of my knowledge, I expect a particular outcome, and the information I know gives me that degree of certainty.

Bayesian statistics and probability is all about state of knowledge. The fundamental, defining theorem of Bayesian statistics is Bayes theorem, which tells you, given your current state of knowledge and a new piece of information, how to update your knowledge based on what the new information tells you. Getting more information doesn’t change anything about whether or not the event will occur: it will occur, and it will have either one outcome or the other. But new information can allow you to improve your prediction and your certainty of that prediction’s correctness.

The author that I quoted above is being a frequentist. In another section of his articple, he’s more specific:

…The result is P= 0.199, which means there’s a 19.9% chance that it rained every day that week. In other words, there’s an 80.1% chance it didn’t rain on at least one day of the week. If it did in fact rain everyday, you could say it was the result of a little bit of luck. After all, 19.9% isn’t that small a chance of something happening.

That’s frequentist intepretation of the probability – which makes sense, since as a physicist, the author is mainly working with repeated experiments – which is a great place for frequentist interpretation. But looking at the same data, a Bayesian would say: “I have an 19.9% certainty that it will rain today”. Then they’d go look outside, see the clouds, and say “Ok, so it looks like rain – that means that I need to update my prediction. Now I’m 32% certain that it will rain”. Note that nothing about the weather has changed: it’s not true that before looking at the clouds, 80.1 percent of the time it wouldn’t rain, and after looking, that changed. The actual fact of whether or not it will rain on that specific day didn’t

Another way of looking at this is to say that a frequentist believes that a given outcome has an intrinstic probability of occurring, and that our attempts to analyze it just bring us closer to the true probability; whereas a Bayesian says that there is no such thing as an intrinsic probability, because every event is different. All that changes is our ability to make predictions with confidence.

One last metaphor, and I’ll stop. Think about playing craps, where you’re rolling two six sided dice.
For a particular die, a frequentist would say “A fair die has a 1 in 6 chance of coming up with a 1”. A
Bayesian would say “If I don’t know anything else, then my best guess is that I can be 16% certain that a 1
will result from a roll.” The result is the same – but the reasoning is different. And because of the difference in reasoning, you can produce different predictions.

Nate Silver’s predictions of the election are a beautiful example of Bayesian reasoning. He watched daily polls, and each time a new poll came out, he took the information from that poll, weighted it according to the historical reliability of that poll in that situation, and then used that to update his certainty. So based on his data, Nate was 90% certain that his prediction was correct.