My biggest pet peeve is press coverage of statistics. As someone who is mathematically literate, I’m constantly infuriated by it. Basic statistics isn’t that hard, but people can’t be bothered to actually learn a tiny bit in order to understand the meaning of the things they’re covering.

My Twitter feed has been exploding with a particularly egregious example of this. After Monday night’s presidential debate, there’s been a ton of polling about who “won” the debate. One conservative radio host named Bill Mitchell has been on a rampage about those polls. Here’s a sample of his tweets:

There is no bias in Internet polling. Voters self-select. And with millions of votes, the MOE is non-existent.

— Bill Mitchell (@mitchellvii) September 28, 2016

I fail to understand why I should accept the opinion of a liberal pollster with a 500 sample over a self-selected poll of 1 million.

— Bill Mitchell (@mitchellvii) September 28, 2016

Each one of these polls has an MOE below 5, and yet the entire sample of polls has a variance of 30!!! How can THAT be scientific?! https://t.co/fw9WtGY9Q9

— Bill Mitchell (@mitchellvii) September 28, 2016

Let’s start with a quick refresher about statistics, why we use them, and how they work.

Statistical analysis has a very simple point. We’re interested in understanding the properties of a large population of things. For whatever reason, we *can’t* measure the properties of every object in that population.

The exact reason can vary. In political polling, we can’t ask every single person in the country who they’re going to vote for. (Even if we could, we simply don’t know who’s actually going to show up and vote!) For a very different example, my first exposure to statistics was through my father, who worked in semiconductor manufacturing. They’d produce a run of 10,000 chips for use in satellites. They needed to know when, on average, a chip would fail from exposure to radiation. If they measured that in every chip, they’d end up with nothing to sell.

Anyway: you can’t measure every element of the population, but you still want to take measurements. So what you do is *randomly* select a collection of *representative* elements from the population, and you measure those. Then you can say that with a certain probability, the result of analyzing that representative subset will match the result that you’d get if you measured the entire population.

How close can you get? If you’ve really selected a random sample of the population, then the answer depends on the size of the sample. We measure that using something called the “margin of error”. “Margin of error” is actually a terrible name for it, and that’s the root cause of one of the most common problems in reporting about statistics. The margin of error is a probability measurement that says “there is an N% probability that the value for the full population lies within the margin of error of the value measured for the sample.”

Right away, there’s a huge problem with that. What is that variable doing in there? The margin of error measures the probability that the full population value is within a confidence interval around the measured sample value. If you don’t say what the confidence interval is, the margin of error is worthless. Most of the time – but not *all* of the time – we’re talking about a 95% confidence interval.
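To make that concrete, here’s a small sketch (the formula is the standard one for a sampled proportion; the specific numbers are just illustrations) showing how the margin of error falls out of the sample size and the chosen confidence level:

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a sampled proportion.

    n is the sample size, p the measured proportion (0.5 is the
    worst case), and z the critical value for the chosen confidence
    interval: 1.96 for 95% confidence, 2.576 for 99%.
    """
    return z * sqrt(p * (1 - p) / n)

# The same 500-person sample has a different margin of error
# depending on the confidence interval -- which is exactly why an
# MOE quoted without its confidence interval is worthless.
print(round(margin_of_error(500, z=1.96), 3))   # at 95% confidence
print(round(margin_of_error(500, z=2.576), 3))  # at 99% confidence
```

Note that the sample size appears under a square root: to cut the margin of error in half, you need *four times* as many respondents.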

But there are several subtler issues with the margin of error, most of them rooted in that misleading name.

- The “true” value for the full population is *not* guaranteed to be within the margin of error of the sampled value. It’s just a probability. *There is no hard bound on the size of the error: just a high probability of it being small.*
- The margin of error *only* includes errors due to sample size. It does *not* incorporate any other factor – and there are many! – that may have affected the result.
- The margin of error is deeply dependent on the way that the underlying sample was taken. It’s only meaningful for a *random* sample. That randomness is critically important: all of sampled statistics is built around the idea that you’ve got a randomly selected subset of your target population.

Let’s get back to our friend the radio host, and his first tweet, because he’s doing a great job of illustrating some of these errors.

The quality of a sampled statistic is entirely dependent on how well the sample matches the population. The sample is critical. It doesn’t matter how big the sample size is if it’s not *random*. A non-random sample *cannot* be treated as a representative sample.

So: an internet poll, where a group of people has to deliberately choose to exert the effort to participate, *cannot* be a valid sample for statistical purposes. It’s not random.

It’s true that the set of people who show up to vote isn’t a random sample. But that’s fine: the purpose of an election *isn’t* to try to divine what the full population thinks. It’s to count *what the people who chose to vote* think. It’s deliberately measuring a full population: the population of people who chose to vote.

But if you’re trying to statistically measure something about the population of people who will go and vote, you need to take a randomly selected sample of people who will go to vote. The set of voters is the full population; you need to select a representative sample of that population.

Internet polls do not do that. At best, they measure a different population of people. (At worst, with ballot stuffing, they measure absolutely nothing, but we’ll give them this much benefit of the doubt.) So you can’t take much of anything about the sample population and use it to reason about the full population.

And you can’t say anything about the margin of error, either. Because the margin of error is only meaningful for a representative sample. You *cannot* compute a meaningful margin of error for a non-representative sample, because there is *no way* of knowing how that sampled population compares to the true full target population.

And that brings us to the second tweet. A properly sampled random population of 500 people can produce a high quality result with a roughly 4.5% margin of error at a 95% confidence interval. (I’m doing a back-of-the-envelope calculation here, so that’s not precise.) That means that *if* the population were randomly sampled, we could say that in 19 out of 20 polls of that size, the full population value would be within +/- 4.5% of the value measured by the poll. For a non-randomly selected sample of 1 million people, the margin of error cannot be measured, because it’s meaningless. The random sample of 500 people gives us a reasonable estimate based on data; the non-random sample of 1 million people tells us *nothing*.
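You can check that “19 out of 20” claim with a quick simulation (the 52% support figure here is invented purely for illustration): repeatedly draw a random sample of 500 from a population, and count how often the measured value lands within the margin of error of the true value.

```python
import random

random.seed(1)

TRUE_SUPPORT = 0.52   # hypothetical full-population value
SAMPLE_SIZE = 500
MOE = 0.045           # back-of-envelope 95% margin of error for n=500

def run_poll():
    """Poll SAMPLE_SIZE randomly chosen voters; return measured support."""
    hits = sum(random.random() < TRUE_SUPPORT for _ in range(SAMPLE_SIZE))
    return hits / SAMPLE_SIZE

# Run 1000 independent polls and see what fraction land within the MOE.
polls = [run_poll() for _ in range(1000)]
within = sum(abs(p - TRUE_SUPPORT) <= MOE for p in polls) / len(polls)
print(within)  # roughly 0.95 -- about 19 out of 20 polls
```

The individual polls scatter around the true value, but the margin of error correctly bounds *most* of them – and only because each simulated sample was genuinely random.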

And with that, on to the third tweet!

In a poll like this, the margin of error only tells us one thing: what’s the probability that the sampled population will respond to *the poll* in the same way that the full population would?

There are many, many things that can affect a poll beyond the sample size. Even with a truly random and representative sample, there are many things that can affect the outcome. For a couple of examples:

How, exactly, is the question phrased? For example, if you ask people “Should police shoot first and ask questions later?”, you’ll get a very different answer from “Should police shoot dangerous criminal suspects if they feel threatened?” – but both of those questions are trying to measure very similar things. But the phrasing of the questions dramatically affects the outcome.

What context is the question asked in? Is this the only question asked? Or is it asked after some other set of questions? The preceding questions can bias the answers. If you ask a bunch of questions about how each candidate did with respect to particular issues before you ask who won, those preceding questions will bias the answers.

When you’re looking at a collection of polls that asked different questions in different ways, you *expect* a significant variation between them. That doesn’t mean that there’s anything wrong with any of them. They can *all* be correct even though their results vary by much more than their margins of error, because the margin of error has nothing to do with how you compare their results: they used different samples, and measured different things.

The problems with the reporting are the same ones I mentioned above. The press treats the margin of error as an absolute bound on the error in the computed sample statistics (which it isn’t); and the press pretends that all of the polls are measuring exactly the same thing, when they’re actually measuring different (but similar) things. They don’t tell us what the polls are really measuring; they don’t tell us what the sampling methodology was; and they don’t tell us the confidence interval.

Which leads to exactly the kind of errors that Mr. Mitchell made.

And one bonus. Mr. Mitchell repeatedly rants about how many polls show a “bias” by “over-sampling” democratic party supporters. This is a classic mistake by people who don’t understand statistics. As I keep repeating, for a sample to be meaningful, it must be random. You can report on all sorts of measurements of the sample, but you *cannot* change it.

If you’re randomly selecting phone numbers and polling the respondents, you *cannot* screen the responders based on their self-reported party affiliation. If you do, you are biasing your sample. Mr. Mitchell may not like the results, but that doesn’t make them invalid. People report what they report.

In the last presidential election, we saw exactly this notion in the idea of “unskewing” polls, where a group of conservative folks decided that the polls were all biased in favor of the democrats for *exactly* the reasons cited by Mr. Mitchell. They recomputed the poll results based on shifting the samples to represent what they believed to be the “correct” breakdown of party affiliation in the voting population. The results? The actual election results closely tracked the supposedly “skewed” polls, and the unskewers came off looking like idiots.

We also saw exactly this phenomenon going on in the Republican primaries this year. Randomly sampled polls *consistently* showed Donald Trump crushing his opponents. But the political press could not believe that Donald Trump would actually win – and so they kept finding ways to claim that the poll samples were off: for example, that the polls relied on land-lines and therefore oversampled older people, and that if you corrected for that sampling error, Trump wasn’t actually winning. Nope: the randomly sampled polls were correct, and Donald Trump is the Republican nominee.

If you want to use statistics, you *must* work with random samples. If you don’t, you’re going to screw up the results, and make yourself look stupid.

ITYM “There is no hard bound on the size of the error: just a high probability of it being **SMALL**.”

Thanks for catching that.

I was writing the post during runs of my test suite, and my test was failing due to a test size timeout issue, so I had the word “large” on the brain!

Ohh c’mon, correcting biases in sampling is both a common practice and perfectly mathematically valid.

For instance, if women complete phone polls at twice the rate that men do (but conditioned on gender completing phone polls is independent of all other factors of interest) it is totally appropriate for a poll to synthetically ensure that the sample contains men and women in the usual population proportions. Indeed, it would be inappropriate not to make such a fix. These kinds of fixes are standard practice in polling.
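A minimal sketch of that kind of reweighting (all of the numbers here are invented for illustration): each response is weighted by the ratio of its group’s population share to its share of the raw sample.

```python
# Hypothetical raw phone-poll responses as (gender, supports_candidate)
# pairs: women completed the poll at roughly twice the rate of men,
# so men are underrepresented in the raw sample.
responses = [("F", 1)] * 400 + [("F", 0)] * 260 + \
            [("M", 1)] * 150 + [("M", 0)] * 190

# Known (assumed) population proportions by gender.
population_share = {"F": 0.5, "M": 0.5}

# Each group's share of the raw sample.
sample_share = {g: sum(1 for s, _ in responses if s == g) / len(responses)
                for g in population_share}

# Weight each response so the weighted sample matches the population.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

raw = sum(v for _, v in responses) / len(responses)
weighted = (sum(weights[g] * v for g, v in responses)
            / sum(weights[g] for g, _ in responses))
print(raw, weighted)  # weighted support ~0.524, vs 0.55 in the raw sample
```

The weighted estimate is just the population-share-weighted average of each group’s support rate, which is exactly the correction the paragraph above describes.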

Indeed, the whole point of a poll isn’t to randomly sample people who answer their phone but to approximate a random sample of likely voters. Given that the only methods you have to poll people don’t randomly sample from likely voters it is entirely appropriate to estimate the biases introduced by your polling methods and correct to better approximate a random sample of the desired population.

As such your criticism of attempts to unskew the polls because they over-sampled democrats is misguided. If it really was true that there was some reason that democrats were, other things being equal, more likely to complete polls, this would be a perfectly appropriate thing to correct for (if it was instead that certain demographic groups that correlate with democratic voting were more likely to complete them, then it may be better to unskew at that level). It’s just that in this case the people who claimed further unskewing was needed were wrong. Not that there was anything fundamentally mathematically illiterate about their methodology.

The problem with using stratified random sampling (also known as unskewing the polls) is that your assumptions about how the strata are distributed might be off.

The problem with using unstratified random sampling is that Dewey wasn’t in fact elected president.

http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html is a good discussion of choices made in stratified random sampling. It says, among other things, that:

“Jill Darling, the survey director at the U.S.C. Center for Economic and Social Research, noted that they had decided not to “trim” the weights (that’s when a poll prevents one person from being weighted up by more than some amount, like five or 10) because the sample would otherwise underrepresent African-American and young voters.

This makes sense. Gallup got itself into trouble for this reason in 2012: It trimmed its weights, and nonwhite voters were underrepresented.”

I assume that you’re referring to https://en.wikipedia.org/wiki/Dewey_Defeats_Truman when you say “Dewey wasn’t in fact elected president”?

This reference to the 1948 election totally distorts what actually happened. In fact, the Gallup poll was probably accurate at the time it was taken. The problem was that Gallup stopped polling 3 weeks before the election and hence failed to pick up a finishing surge for Truman. Neither Gallup nor the other polling organizations have made the same mistake since. There are several different stories: http://www.math.uah.edu/stat/data/1948Election.html (and a couple of others I found) blames samples that weren’t adequately random (quota sampling), leading to a Republican bias; https://www.csudh.edu/dearhabermas/sampling01.htm blames telephone polls for being biased; http://home.isr.umich.edu/sampler/isr-and-the-truman-dewey-upset/ blames the early polls and quota sampling, as well as not taking into account undecided voters.