# Deceptive Statistics and Elections

I don’t mean to turn my blog into political nonsense central, but I just can’t pass on some of these insane arguments.

This morning, the state of Texas sued four other states to overturn the results of our presidential election. As part of their suit, they included an "expert analysis" that claims that the odds of the election results in the state of Georgia being legitimate are worse than 1 in one quadrillion. So naturally, I had to take a look.

Here’s the meat of the argument that they’re claiming "proves" that the election results were fraudulent.

I tested the hypothesis that the performance of the two Democrat candidates were statistically similar by comparing Clinton to Biden. I use a Z-statistic or score, which measures the number of standard deviations the observation is above the mean value of the comparison being made. I compare the total votes of each candidate, in two elections and test the hypothesis that other things being the same they would have an equal number of votes.
I estimate the variance by multiplying the mean times the probability of the candidate not getting a vote. The hypothesis is tested using a Z-score which is the difference between the two candidates’ mean values divided by the square root of the sum of their respective variances. I use the calculated Z-score to determine the p-value, which is the probability of finding a test result at least as extreme as the actual results observed. First, I determine the Z-score comparing the number of votes Clinton received in 2016 to the number of votes Biden received in 2020. The Z-score is 396.3. This value corresponds to a confidence that I can reject the hypothesis many times more than one in a quadrillion times that the two outcomes were similar.

This is, to put it mildly, a truly trashy argument. I’d be incredibly ashamed if a high school student taking statistics turned this in.
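Before digging into why, it helps to see how mechanically trivial the computation actually is. Here’s a minimal Python sketch of the method the affidavit describes; the Georgia vote totals below are approximate rounded figures that I’ve filled in purely for illustration, and it essentially reproduces the affidavit’s headline number:

```python
import math

# A sketch of the affidavit's method, as described in the quote above.
# The Georgia vote totals are approximate rounded figures, filled in
# for illustration only.
clinton_2016 = 1_877_963   # Clinton votes, Georgia 2016 (approx.)
total_2016   = 4_114_732   # total ballots cast, Georgia 2016 (approx.)
biden_2020   = 2_473_633   # Biden votes, Georgia 2020 (approx.)
total_2020   = 4_999_960   # total ballots cast, Georgia 2020 (approx.)

# "Variance" per the affidavit: the mean times the probability of the
# candidate NOT getting a vote.
var_2016 = clinton_2016 * (1 - clinton_2016 / total_2016)
var_2020 = biden_2020 * (1 - biden_2020 / total_2020)

# Z-score: the difference between the two counts, divided by the
# square root of the sum of their variances.
z = (biden_2020 - clinton_2016) / math.sqrt(var_2016 + var_2020)
print(f"Z = {z:.1f}")   # ~395: essentially the affidavit's 396.3
```

The arithmetic is trivial, and the Z-score really is enormous. The problem isn’t the calculation – it’s the assumptions behind it.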

What’s going on here?

Well, to start, this is deliberately bad writing. It’s using a whole lot of repetitive words in confusing ways in order to make it sound complicated and scientific.

I can simplify the writing, and that will make it very clear what’s going on.

I tested the hypothesis that the performance of the two Democrat candidates were statistically similar by comparing Clinton to Biden. I started by assuming that the population of eligible voters, the rates at which they cast votes, and their voting preferences, were identical for the two elections. I further assumed that the counted votes were a valid random sampling of the total population of voters. Then I computed the probability that in two identical populations, a random sampling could produce results as different as the observed results of the two elections.

As you can see from the rewrite, the "analysis" assumes that the voting population is unchanged, and the preferences of the voters are unchanged. He assumes that the only thing that changed is the specific sampling of voters from the population of eligible voters – and in both elections, he assumes that the set of people who actually vote is a valid random sample of that population.

In other words, if you assume that:

1. No one ever changes their mind and votes for different parties’ candidates in two sequential elections;
2. The population and its preferences never changes – people don’t move in and out of the state, and new people don’t register to vote;
3. The specific set of people who vote in an election is completely random.

Then you can say that this election result is impossible and clearly indicates fraud.

The problem is, none of those assumptions are anywhere close to correct or reasonable. We know that people’s voting preferences change. We know that the voting population changes. We know that who turns out to vote changes. None of these things are fixed constants – and any analysis that assumes any of these things is nothing but garbage.

But I’m going to zoom in a bit on one of those: the one about the set of voters being a random sample.

When it comes to statistics, the selection of a sample is one of the most important, fundamental concerns. If your sample isn’t random, then it isn’t representative of the population. You can’t compare results for two samples as if they’re equivalent if they aren’t equivalent.

Elections aren’t random statistical samples of the population. They’re not even intended to be random statistical samples. They’re deliberately performed as counts of motivated individuals who choose to come out and cast their votes. In statistical terms, they’re a self-selected, motivated sample. Self-selected samples are neither random nor representative in a statistical sense. There’s nothing wrong with that: an election isn’t intended to be a random sample. But it does mean that when you do statistical analysis, you cannot treat the set of voters as a random sampling of the population of eligible voters; and you cannot make any assumptions about uniformity when you’re comparing the results of two different elections.

If you could – if the set of voters was a valid random statistical sample of an unchanging population of eligible voters, then there’d be no reason to even have elections on an ongoing basis. Just have one election, take its results as the eternal truth, and just assume that every election in the future would be exactly the same!

But that’s not how it works. And the people behind this lawsuit, and particularly the "expert" who wrote this so-called statistical analysis, know that. This analysis is pure garbage, put together to deceive. They’re hoping to fool someone into believing that they’ve actually proven something that they couldn’t prove.

And that’s despicable.

# Election Fraud? Nope, just bad math

My old ScienceBlogs friend Mike Dunford has been tweeting his way through the latest lawsuit that’s attempting to overturn the results of our presidential election. The lawsuit is an amazingly shoddy piece of work. But one bit of it stuck out to me, because it falls into my area. Part of their argument tries to make the case that, based on "mathematical analysis", the reported vote counts couldn’t possibly make any sense.

The attached affidavit of Eric Quinell, Ph.D. ("Dr. Quinell Report") analyzes the extraordinary increase in turnout from 2016 to 2020 in a relatively small subset of townships and precincts outside of Detroit in Wayne County and Oakland County, and more importantly how nearly 100% or more of all "new" voters from 2016 to 2020 voted for Biden. See Exh. 102. Using publicly available information from Wayne County and Oakland County, Dr. Quinell found that for the votes received up to the 2016 turnout levels, the 2020 vote Democrat vs Republican two-ways distributions (i.e. excluding third parties) tracked the 2016 Democrat vs. Republican distribution very closely…

This is very bad statistical analysis. It does something that is absolutely never correct – something guaranteed to produce an odd-looking result – and then pretends that the oddity it deliberately manufactured is evidence that something weird is going on.

Let’s just make up a scenario with some numbers to demonstrate. Let’s imagine a voting district in Cosine city. Cosine city has 1 million residents who are registered to vote.

In the 2016 election, let’s say that the election was dominated by two parties: the Radians and the Degrees. The Radians won 52% of the vote, and the Degrees won 48%. The voter turnout was low – just 45%.

Now, 2020 comes, and it’s a rematch of the Radians and the Degrees. But this time, the turnout was 50% of registered voters. The Degrees won, with 51% of the vote.

So let’s break that down into numbers for the two elections:

• In 2016:
• A total of 450,000 voters actually cast ballots.
• The Radians got 234,000 votes.
• The Degrees got 216,000 votes.
• In 2020:
• A total of 500,000 voters actually cast ballots.
• The Radians got 245,000 votes.
• The Degrees got 255,000 votes.

Let’s do what Dr. Quinell did. Let’s look at the 2020 election numbers, and take out 450,000 votes which match the distribution from 2016. What we’re left with is:

• 11,000 new votes for the Radians.
• 39,000 new votes for the Degrees.

There was a 3 point shift in the vote, combined with an increase in voter turnout. Neither of those is unusual or radically surprising. But when you extract things in this statistically invalid way, you end up with a result where, in a voting district where the vote for the two parties usually varies by no more than 4%, the "new votes" in this election went nearly 4:1 for one party.

If we reduce the increase in voter turnout, the ratio becomes significantly worse. If the election turnout was 46%, then the numbers would be 460,000 total votes: 225,400 for the Radians and 234,600 for the Degrees. With Dr. Quinell’s analysis, that would give us -8,600 "new" votes for the Radians, and +18,600 for the Degrees. Or, since negative votes don’t make sense, we can just stop at 225,400, and say that all of the remaining votes – every single new vote beyond what the Radians won last time – were taken by the Degrees. Clearly impossible, it must be fraud!
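If you want to check the arithmetic yourself, here’s the whole "analysis" as a few lines of Python (all of the numbers are the made-up Cosine city figures from above):

```python
def new_votes(total_old, share_old, total_new, share_new):
    """Quinell-style subtraction: remove votes matching the previous
    election's distribution; call whatever remains the 'new' votes."""
    return round(total_new * share_new) - round(total_old * share_old)

# 2016: 450,000 ballots at 52%/48%. 2020: 500,000 ballots at 49%/51%.
print(new_votes(450_000, 0.52, 500_000, 0.49))   # Radians: 11,000
print(new_votes(450_000, 0.48, 500_000, 0.51))   # Degrees: 39,000 -- nearly 4:1

# Shrink the turnout increase to 46%, and the Radians' "new votes"
# go negative:
print(new_votes(450_000, 0.52, 460_000, 0.49))   # Radians: -8,600
print(new_votes(450_000, 0.48, 460_000, 0.51))   # Degrees: +18,600
```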

So what’s the problem here? What caused this reasonable result to suddenly look incredibly unlikely?

The votes are one big pool of numbers. You don’t know which data points came from which voters. You don’t know which voters are new versus old. What happened here is that the bozo doing the analysis baked in an invalid assumption. He assumed that all of the voters who voted in 2016 voted the same way in 2020.

"For the votes received up to the turnout level" isn’t something that’s actually measurable in the data. It’s an assertion of something without evidence. You can’t break out subgroups within a population, unless the subgroups were actually deliberately and carefully measured when the data was gathered. And in the case of an election, the data that he’s purportedly analyzing doesn’t actually contain the information needed to separate out that group.

You can’t do that. Or rather you can, but the results are, at best, meaningless.

# Abusing Linear Regression to Make a Point

A bunch of people have been sending me links to a particularly sloppy article that (mis)uses linear regression to draw an incorrect conclusion from some data. So I guess I’ve got to go back to good old linear regression, and talk about it a bit.

If you have a collection of data – typically data with one independent variable, and one dependent variable (that is, the first variable can vary any way it wants; changing it will change the second variable), then you’re probably interested in how the dependent variable relates to the independent. If you have reason to believe that they should have a linear relationship, then you’d like to know just what that linear relationship is.

If your data were perfect, then you’d just need to plot all of the data points on a graph, with the independent variable on the X axis, and the dependent on the Y, and then your graph would be a line, and you could get its slope and Y intercept, and thus completely capture the relationship.

But data is never perfect. There’s a lot of reasons for that, but no real set of collected data is ever perfect. No matter how perfect the real underlying linear relationship is, real measured data will always show some scatter. And that means that you can draw a lot of possible lines through the collected data. Which one of them represents the best fit?

Since that’s pretty abstract, I’m going to talk a bit about an example – the very example that was used to ignite my interest in math!

Back in 1974 or so, when I was a little kid in second grade, my father was working for RCA, as a physicist involved in manufacturing electronics for satellite systems. One of the important requirements for the products they were manufacturing was that they be radiation hard – meaning that they could be exposed to quite a bit of radiation before they would be damaged enough to stop working.

Their customers – NASA, JPL, and various groups from the U.S. military – had very strong requirements. They had to show, for a manufacturing setup of a particular component, what the failure profile was.

The primary failure mode of these chips they were making was circuit trace failure. If a sufficiently energetic gamma ray hit one of the circuit traces, it was possible that the trace would burn out – breaking the circuit, and causing the chip to fail.

The test setup that they used had a gamma ray emitter. So they’d make a manufacturing run to produce a batch of chips from the setup. Then they’d take those, and they’d expose them to increasing doses of radiation from the gamma emitter, and detect when they failed.

For trace failure, the probability of failure is linear in the size of the radiation dose that the chip is exposed to. So to satisfy the customer, they had to show them what the slope of the failure curve was. “Radiation hard” was defined as being able to sustain exposure to some dose of radiation with a specified probability of failure.

So, my dad had done a batch of tests, and he had a ton of little paper slips that described the test results, and he needed to compute the slope of that line – which would give the probability of failure as a multiple of the radiation dose.

I walked into the dining room, where he was set up doing this, and asked what he was doing. So he explained it to me. A lot like I just explained above – except that my dad was a much better teacher than me. I couldn’t explain this to a second or third grader the way that he did!

Anyway… The method that we use to compute the best line is called least squares. The intuition behind it is that you’re trying to find the line where the average distance of all of the datapoints from that line is the smallest. But a simple average doesn’t work well – because some of the data points are above the line, and some are below. Just because one point is, say, above a possible fit by 100, and another is below by 100 doesn’t mean that the two should cancel. So you take the distance between the data points and the line, and you square them – making them all positive. Then you find the line where that total is the smallest – and that’s the best fit.

So let’s look at a real-ish example.

Here’s a graph of data points that I generated semi-randomly. The distribution of the points isn’t really what you’d get from real observations, but it’s good enough for demonstration.

The way that we compute the best-fit line is: first we compute the means of $x$ and $y$, which we’ll call $\overline{x}$ and $\overline{y}$. Then using those, we compute the slope as:

$m = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^{n} (x_i-\overline{x})^2}$

Then for the y intercept: $b = \overline{y} - m\overline{x}$.

In the case of this data: I set up the script so that the slope would be about 2.2 +/- 0.5. The slope in the figure is 2.54, and the y-intercept is 18.4.

Now, we want to check how good the linear relationship is. There are several different ways of doing that. The simplest is called the correlation coefficient, or $r$.

$r = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i - \overline{y})}{\sqrt{ \left(\sum_{i=1}^n (x_i-\overline{x})^2\right) \left(\sum_{i=1}^n (y_i - \overline{y})^2\right) }}$

If you look at this, it’s really a check of how well the variation in the measured values tracks the variation predicted by the regression. On the top, you’ve got a sum of products; on the bottom, you’ve got the square root of a product of the same kinds of sums, squared – essentially the same magnitudes with the signs stripped away. The end result is that if the correlation is perfect – that is, if the dependent variable increases linearly with the independent – then the correlation will be 1. If the dependent variable decreases linearly as the independent variable increases, then the correlation will be -1. If there’s no relationship, then the correlation will be 0.

For this particular set of data, I generated it with a linear equation with a little bit of random noise. The correlation coefficient is slightly greater than 0.95, which is exactly what you’d expect.
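Here’s a small numpy sketch of the whole computation. The data is synthetic, standing in for the points in the figure; the slope, intercept, and noise level are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for the scatter plot: a true slope of
# about 2.2, plus gaussian noise (all parameters are illustrative).
x = np.linspace(0, 100, 50)
y = 2.2 * x + 18 + rng.normal(scale=15, size=x.size)

xbar, ybar = x.mean(), y.mean()

# Least-squares slope and intercept, per the formulas above.
m = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b = ybar - m * xbar

# Correlation coefficient r: close to 1 for strongly linear data.
r = np.sum((x - xbar) * (y - ybar)) / np.sqrt(
    np.sum((x - xbar) ** 2) * np.sum((y - ybar) ** 2)
)
print(f"m = {m:.2f}, b = {b:.1f}, r = {r:.3f}")
```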

Ok, so that’s the basics of linear regression. Let’s get back to the bozo-brained article that started this.

They featured this graph:

You can see the scatter-plot of the points, and you can see the line that was fit to the points by linear regression. How does that fit look to you? I don’t have access to the original dataset, so I can’t check it, but I’m guessing that the correlation there is somewhere around 0.1 or 0.2 – also known as “no correlation”.

You see, the author fell into one of the classic traps of linear regression. Look back at the top of this article, where I started explaining it. I said that if you had reason to believe in a linear relationship, then you could try to find it. That’s the huge catch to linear regression: no matter what data you put in, you’ll always get a “best match” line out. If the dependent and independent variables don’t have a linear relation – or don’t have any actual relation at all – then the “best match” fit that you get back as a result is garbage.

That’s what the graph above shows: you’ve got a collection of data points that to all appearances has no linear relationship – and probably no direct relationship at all. The author is interpreting the fact that linear regression gave him an answer with a positive slope as if that positive slope is meaningful. But it’s only meaningful if there’s actually a relationship present.

But when you look at the data, you don’t see a linear relationship. You see what looks like a pretty random scatterplot. Without knowing the correlation coefficient, we don’t know for sure, but that line doesn’t look to me like a particularly good fit. And since the author doesn’t give us any evidence beyond the existence of that line to believe in the relationship that they’re arguing for, we really have no reason to believe them. All they’ve done is demonstrate that they don’t understand the math that they’re using.

# The Math of Vaccinations, Infection Rates, and Herd Immunity

Here in the US, we are, horribly, in the middle of a measles outbreak. And, as usual, anti-vaccine people are arguing that:

• Measles isn’t really that serious;
• Unvaccinated children have nothing to do with the outbreak; and
• More vaccinated people are being infected than unvaccinated, which shows that vaccines don’t help.

A few years back, I wrote a post about the math of vaccines; it seems like this is a good time to update it.

When it comes to vaccines, there are two things that a lot of people don’t understand. One is herd immunity; the other is probability of infection.

Herd immunity is the fundamental concept behind vaccines.

In an ideal world, a person who’s been vaccinated against a disease would have no chance of catching it. But the real world isn’t ideal, and vaccines aren’t perfect. What a vaccine does is prime the recipient’s immune system in a way that reduces the probability that they’ll be infected.

But even if a vaccine for an illness were perfect, and everyone was vaccinated, that wouldn’t mean that it was impossible for anyone to catch the illness. There are many people whose immune systems are compromised – people with diseases like AIDS, or people with cancer receiving chemotherapy. (Or people who’ve had the measles within the previous two years!) And that’s not considering the fact that there are people who, for legitimate medical reasons, cannot be vaccinated!

So individual immunity, provided by vaccines, isn’t enough to completely eliminate the spread of a contagious illness. To prevent outbreaks, we rely on an emergent property of a vaccinated population. If enough people are immune to the disease, then even if one person gets infected with it, the disease won’t be able to spread enough to produce a significant outbreak.

We can demonstrate this with some relatively simple math.

Let’s imagine a case of an infectious disease. For illustration purposes, we’ll simplify things in a way that makes the outbreak more likely to spread than in reality. (This makes herd immunity harder to attain than it is in reality.)

• There’s a vaccine that’s 95% effective: out of every 100 people vaccinated against the disease, 95 are perfectly immune; the remaining 5 have no immunity at all.
• The disease is highly contagious: out of every 100 non-immune people who are exposed to the disease, 95 will be infected.

If everyone is immunized, but one person becomes ill with the disease, how many people do they need to expose to the disease for the disease to spread?

Keeping things simple: an outbreak, by definition, is a situation where the number of infected people is steadily increasing. That can only happen if every sick person, on average, infects more than 1 other person with the illness. If that happens, then the rate of infection can grow exponentially, turning into an outbreak.

In our scheme here, only one out of 20 people is infectable – so, on average, if our infected person has enough contact with 20 people to pass an infection, there’s a 95% chance that they’ll pass the infection on to one other person (19 of the 20 are immune; the one remaining person has a 95% chance of getting infected). To get to an outbreak level – that is, a level where each sick person will probably infect more than one other person – they’d need to expose somewhere around 25 people (which would mean that each infected person, on average, could infect roughly 1.2 people). If they’re exposed to 20 other people on average, then each infected person will infect roughly 0.95 other people – so the number of infected will decrease without turning into a significant outbreak.

But what will happen if just 5% of the population doesn’t get vaccinated? Then we’ve got 95% of the population getting vaccinated, with a 95% immunity rate – so roughly 90% of the population has vaccine immunity. Our pool of non-immune people has doubled. In our example scenario, if each person is exposed to 20 other people during their illness, then they will, on average, cause 1.8 people to get sick. And so we have a major outbreak on our hands!
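The toy model is simple enough to fit in a few lines of Python (using this post’s made-up numbers, not real measles parameters):

```python
# A sketch of the toy model above. All of the rates are the made-up
# numbers from this post, not real measles figures.
def infections_per_case(vacc_rate, effectiveness=0.95,
                        infection_prob=0.95, contacts=20):
    """Average number of people each sick person goes on to infect."""
    immune_fraction = vacc_rate * effectiveness
    return contacts * (1 - immune_fraction) * infection_prob

print(infections_per_case(1.00))  # ~0.95: infections die out
print(infections_per_case(0.95))  # ~1.85: exponential growth -- outbreak
```

Anything over 1.0 grows exponentially into an outbreak; anything under 1.0 fizzles out.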

This illustrates the basic idea behind herd immunity. If you can successfully make a large enough portion of the population non-infectable by a disease, then the disease can’t spread through the population, even though the population contains a large number of infectable people. When the population’s immunity rate (either through vaccine, or through prior infection) gets to be high enough that an infection can no longer spread, the population is said to have herd immunity: even individuals who can’t be immunized no longer need to worry about catching it, because the population doesn’t have the capacity to spread it around in a major outbreak.

(In reality, the effectiveness of the measles vaccine really is in the 95 percent range – actually slightly higher than that; various sources estimate it somewhere between 95 and 97 percent effective! And the success rate of the vaccine isn’t binary: 95% of people will be fully immune; the remaining 5% will have a varying degree of immunity. And the infectivity of most diseases is lower than in the example above. Measles (which is a highly, highly contagious disease, far more contagious than most!) is estimated to infect between 80 and 90 percent of exposed non-immune people. So if enough people are immunized, herd immunity will take hold even if every sick person exposes more than 20 people.)

Moving past herd immunity to my second point: there’s a paradox that some antivaccine people (including, recently, Sharyl Attkisson) use in their arguments. If you look at an outbreak of an illness that we vaccinate for, you’ll frequently find that more vaccinated people become ill than unvaccinated. And that, the antivaccine people say, shows that the vaccines don’t work, and the outbreak can’t be the fault of the unvaccinated folks.

Let’s look at the math to see the problem with that.

Let’s use the same numbers as above: 95% vaccine effectiveness, 95% contagion. In addition, let’s say that 2% of people choose to go unvaccinated.

That means that 98% of the population has been immunized, and 95% of them are immune. So now roughly 93% of the population has immunity.

If each infected person has contact with 20 other people, then we can expect about 7% of those 20 to be infectable – or about 1.4 people; and of those, 95% will become ill – or about 1.33. So on average, each sick person will infect about 1 1/3 other people. That’s enough to cause a significant outbreak. Without the non-immunized people, the infection rate is less than 1 – not enough to cause an outbreak.

The non-immunized population reduced the herd immunity enough to cause an outbreak.

Within the population, how many immunized versus non-immunized people will get sick?

Out of every 100 people, there are 5 who got vaccinated, but aren’t immune. Out of that same 100 people, there are 2 (2% of 100) who didn’t get vaccinated. If every non-immune person is equally likely to become ill, then we’d expect that out of 100 cases of the disease, about 70 would be vaccinated people, and 30 unvaccinated.

The vaccinated population is much, much larger – 50 times larger! – than the unvaccinated.
Since that population is so much larger, we’d expect more vaccinated people to become ill, even though it’s the smaller unvaccinated group that broke the herd immunity!

The easiest way to see that is to take those numbers, and normalize them into probabilities – that is, figure out, within the pool of all vaccinated people, what their likelihood of getting ill after exposure is, and compare that to the likelihood of a non-vaccinated person becoming ill after exposure.

So, let’s start with the vaccinated people. Let’s say that we’re looking at a population of 10,000 people total. 98% were vaccinated; 2% were not.

• The total pool of vaccinated people is 9800, and the total pool of unvaccinated is 200.
• Of the 9800 who were vaccinated, 95% of them are immune, leaving 5% who are not – so 490 infectable people.
• Of the 200 people who weren’t vaccinated, all of them are infectable.
• If everyone is exposed to the illness, then we would expect about 466 of the vaccinated, and 190 of the unvaccinated to become ill.

So more than twice the number of vaccinated people became ill. But:

• The odds of a vaccinated person becoming ill are 466/9800, or about 1 out of every 21 people.
• The odds of an unvaccinated person becoming ill are 190/200 or 19 out of every 20 people! (Note: there was originally a typo in this line, which was corrected after it was pointed out in the comments.)
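Here’s that whole breakdown as a quick Python sanity check, using the same numbers:

```python
# The 10,000-person breakdown above, as a few lines of Python.
population, vacc_rate = 10_000, 0.98
effectiveness, infection_prob = 0.95, 0.95

vaccinated = population * vacc_rate        # 9,800 people
unvaccinated = population - vaccinated     # 200 people

vacc_ill = vaccinated * (1 - effectiveness) * infection_prob   # ~466
unvacc_ill = unvaccinated * infection_prob                     # 190

print(vacc_ill, unvacc_ill)         # more sick vaccinated people...
print(vacc_ill / vaccinated)        # ...but ~1 in 21 odds if vaccinated
print(unvacc_ill / unvaccinated)    # versus 19 in 20 if not
```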

The numbers can, if you look at them without considering the context, appear to be deceiving. The population of vaccinated people is so much larger than the population of unvaccinated that the total number of infected can give the wrong impression. But the facts are very clear: vaccination drastically reduces an individual’s chance of getting ill; and vaccinating the entire population dramatically reduces the chances of an outbreak.

The reality of vaccines is pretty simple.

• Vaccines are highly effective.
• The diseases that vaccines prevent are not benign.
• Vaccines are really, really safe. None of the horror stories told by anti-vaccine people have any basis in fact. Vaccines don’t damage your immune system, they don’t cause autism, and they don’t cause cancer.
• Not vaccinating your children (or yourself!) doesn’t just put you at risk for illness; it dramatically increases the chances of other people becoming ill. Even when more vaccinated people than unvaccinated become ill, that’s largely caused by the unvaccinated population.

In short: everyone who is healthy enough to be vaccinated should get vaccinated. If you don’t, you’re a despicable free-riding asshole who’s deliberately choosing to put not just yourself but other people at risk.

# Polls and Sampling Errors in the Presidential Debate Results

My biggest pet peeve is press coverage of statistics. As someone who is mathematically literate, I’m constantly infuriated by it. Basic statistics isn’t that hard, but people can’t be bothered to actually learn a tiny bit in order to understand the meaning of the things they’re covering.

My twitter feed has been exploding with a particularly egregious example of this. After Monday night’s presidential debate, there’s been a ton of polling about who “won” the debate. One conservative radio host named Bill Mitchell has been on a rampage about those polls. Here’s a sample of his tweets:

Statistical analysis has a very simple point. We’re interested in understanding the properties of a large population of things. For whatever reason, we can’t measure the properties of every object in that population.

The exact reason can vary. In political polling, we can’t ask every single person in the country who they’re going to vote for. (Even if we could, we simply don’t know who’s actually going to show up and vote!) For a very different example, my first exposure to statistics was through my father, who worked in semiconductor manufacturing. They’d produce a run of 10,000 chips for use in satellites. They needed to know when, on average, a chip would fail from exposure to radiation. If they measured that in every chip, they’d end up with nothing to sell.

Anyway: you can’t measure every element of the population, but you still want to take measurements. So what you do is randomly select a collection of representative elements from the population, and you measure those. Then you can say that with a certain probability, the result of analyzing that representative subset will match the result that you’d get if you measured the entire population.

How close can you get? If you’ve really selected a random sample of the population, then the answer depends on the size of the sample. We measure that using something called the “margin of error”. “Margin of error” is actually a terrible name for it, and that’s the root cause of one of the most common problems in reporting about statistics. The margin of error is a probability measurement that says “there is an $N$% probability that the value for the full population lies within the margin of error of the measured value of the sample.”

Right away, there’s a huge problem with that. What is that $N$ doing in there? The margin of error measures the probability that the full population value is within a confidence interval around the measured sample value. If you don’t say what the confidence level is, the margin of error is worthless. Most of the time – but not all of the time – we’re talking about a 95% confidence interval.

But there are several subtler issues with the margin of error, mostly due to the name.

1. The “true” value for the full population is not guaranteed to be within the margin of error of the sampled value. It’s just a probability. There is no hard bound on the size of the error: just a high probability of it being within the margin.
2. The margin of error only includes errors due to sample size. It does not incorporate any other factor – and there are many! – that may have affected the result.
3. The margin of error is deeply dependent on the way that the underlying sample was taken. It’s only meaningful for a random sample. That randomness is critically important: all of sampled statistics is built around the idea that you’ve got a randomly selected subset of your target population.

Let’s get back to our friend the radio host, and his first tweet, because he’s doing a great job of illustrating some of these errors.

The quality of a sampled statistic is entirely dependent on how well the sample matches the population. The sample is critical. It doesn’t matter how big the sample size is if it’s not random. A non-random sample cannot be treated as a representative sample.

So: an internet poll, where a group of people has to deliberately choose to exert the effort to participate cannot be a valid sample for statistical purposes. It’s not random.

It’s true that the set of people who show up to vote isn’t a random sample. But that’s fine: the purpose of an election isn’t to try to divine what the full population thinks. It’s to count what the people who chose to vote think. It’s deliberately measuring a full population: the population of people who chose to vote.

But if you’re trying to statistically measure something about the population of people who will go and vote, you need to take a randomly selected sample of people who will go to vote. The set of voters is the full population; you need to select a representative sample of that population.

Internet polls do not do that. At best, they measure a different population of people. (At worst, with ballot stuffing, they measure absolutely nothing, but we’ll give them this much benefit of the doubt.) So you can’t take much of anything about the sample population and use it to reason about the full population.

And you can’t say anything about the margin of error, either. Because the margin of error is only meaningful for a representative sample. You cannot compute a meaningful margin of error for a non-representative sample, because there is no way of knowing how that sampled population compares to the true full target population.

And that brings us to the second tweet. A properly sampled random population of 500 people can produce a high quality result with a roughly 4.5% margin of error at a 95% confidence interval. (I’m doing a back-of-the-envelope calculation here, so that’s not precise.) That means that if the population were randomly sampled, we could say that in 19 out of 20 polls of that size, the full population value would be within +/- 4.5% of the value measured by the poll. For a non-randomly selected sample of 10 million people, the margin of error cannot be measured, because it’s meaningless. The random sample of 500 people gives us a reasonable estimate based on data; the non-random sample of 10 million people tells us nothing.
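For the back-of-the-envelope calculation, I’m using the standard formula for the margin of error of a sampled proportion, $z\sqrt{p(1-p)/n}$. A rough sketch:

```python
import math

# Back-of-the-envelope margin of error for a proportion estimated from
# a truly random sample. p = 0.5 is the worst case; z = 1.96 gives a
# 95% confidence interval.
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(margin_of_error(500))         # ~0.044: about +/- 4.5%
print(margin_of_error(10_000_000))  # ~0.0003 -- but only if the sample
                                    # is random, which internet polls aren't
```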

And with that, on to the third tweet!

In a poll like this, the margin of error only tells us one thing: what’s the probability that the sampled population will respond to the poll in the same way that the full population would?

There are many, many things that can affect a poll beyond the sample size. Even with a truly random and representative sample, there are many things that can affect the outcome. For a couple of examples:

How, exactly, is the question phrased? For example, if you ask people “Should police shoot first and ask questions later?”, you’ll get a very different answer from “Should police shoot dangerous criminal suspects if they feel threatened?” – but both of those questions are trying to measure very similar things. But the phrasing of the questions dramatically affects the outcome.

What context is the question asked in? Is this the only question asked? Or is it asked after some other set of questions? The preceding questions can bias the answers. If you ask a bunch of questions about how each candidate did with respect to particular issues before you ask who won, those preceding questions will bias the answers.

When you’re looking at a collection of polls that asked different questions in different ways, you expect a significant variation between them. That doesn’t mean that there’s anything wrong with any of them. They can all be correct even though their results vary by much more than their margins of error, because the margin of error has nothing to do with how you compare their results: they used different samples, and measured different things.

The problem with the reporting is the same things I mentioned up above. The press treats the margin of error as an absolute bound on the error in the computed sample statistics (which it isn’t); and the press pretends that all of the polls are measuring exactly the same thing, when they’re actually measuring different (but similar) things. They don’t tell us what the polls are really measuring; they don’t tell us what the sampling methodology was; and they don’t tell us the confidence interval.

Which leads to exactly the kind of errors that Mr. Mitchell made.

And one bonus. Mr. Mitchell repeatedly rants about how many polls show a “bias” by “over-sampling” democratic party supporters. This is a classic mistake by people who don’t understand statistics. As I keep repeating, for a sample to be meaningful, it must be random. You can report on all sorts of measurements of the sample, but you cannot change it.

If you’re randomly selecting phone numbers and polling the respondents, you cannot screen the responders based on their self-reported party affiliation. If you do, you are biasing your sample. Mr. Mitchell may not like the results, but that doesn’t make them invalid. People report what they report.

In the last presidential election, we saw exactly this notion in the idea of “unskewing” polls, where a group of conservative folks decided that the polls were all biased in favor of the democrats for exactly the reasons cited by Mr. Mitchell. They recomputed the poll results based on shifting the samples to represent what they believed to be the “correct” breakdown of party affiliation in the voting population. The results? The actual election results closely tracked the supposedly “skewed” polls, and the unskewers came off looking like idiots.

We also saw exactly this phenomenon going on in the Republican primaries this year. Randomly sampled polls consistently showed Donald Trump crushing his opponents. But the political press could not believe that Donald Trump would actually win – and so they kept finding ways to claim that the poll samples were off: things like they were off because they used land-lines which oversampled older people, and if you corrected for that sampling error, Trump wasn’t actually winning. Nope: the randomly sampled polls were correct, and Donald Trump is the republican nominee.

If you want to use statistics, you must work with random samples. If you don’t, you’re going to screw up the results, and make yourself look stupid.

# Back to an old topic: Bad Vaccine Math

The very first Good Math/Bad Math post ever was about an idiotic bit of antivaccine rubbish. I haven’t dealt with antivaccine stuff much since then, because the bulk of the antivaccine idiocy has nothing to do with math. But the other day, a reader sent me a really interesting link from what my friend Orac calls a “wretched hive of scum and quackery”, naturalnews.com, in which they try to argue that the whooping cough vaccine is an epic failure:

(NaturalNews) The utter failure of the whooping cough (pertussis) vaccine to provide any real protection against disease is once again on display for the world to see, as yet another major outbreak of the condition has spread primarily throughout the vaccinated community. As it turns out, 90 percent of those affected by an ongoing whooping cough epidemic that was officially declared in the state of Vermont on December 13, 2012, were vaccinated against the condition — and some of these were vaccinated two or more times in accordance with official government recommendations.

As reported by the Burlington Free Press, at least 522 cases of whooping cough were confirmed by Vermont authorities last month, which was about 10 times the normal amount from previous years. Since that time, nearly 100 more cases have been confirmed, bringing the official total as of January 15, 2013, to 612 cases. The majority of those affected, according to Vermont state epidemiologist Patsy Kelso, are in the 10-14-year-old age group, and 90 percent of those confirmed have already been vaccinated one or more times for pertussis.

Even so, Kelso and others are still urging both adults and children to get a free pertussis shot at one of the free clinics set up throughout the state, insisting that both the vaccine and the Tdap booster for adults “are 80 to 90 percent effective.” Clearly this is not the case, as evidenced by the fact that those most affected in the outbreak have already been vaccinated, but officials are apparently hoping that the public is too naive or disengaged to notice this glaring disparity between what is being said and what is actually occurring.

It continues in that vein. The gist of the argument is:

1. We say everyone needs to be vaccinated, which will protect them from getting the whooping cough.
2. The whooping cough vaccine is, allegedly, 80 to 90% effective.
3. 90% of the people who caught whooping cough were properly vaccinated.
4. Therefore the vaccine can’t possibly work.

What they want you to do is look at that 80 to 90 percent effectiveness rate, and see that only 10-20% of vaccinated people should be susceptible to the whooping cough, and compare that 10-20% to the 90% of actual infected people who were vaccinated. 20% (the upper bound of the susceptible portion of vaccinated people according to the quoted statistic) is clearly much smaller than 90% – therefore it’s obvious that the vaccine doesn’t work.

Of course, this is rubbish. It’s a classic apple to orange-grove comparison. You’re comparing percentages, when those percentages are measuring different groups – groups with wildly different sizes.

Take a pool of 1000 people, and suppose that 95% are properly vaccinated (the current DTAP vaccination rate in the US is around 95%). That gives you 950 vaccinated people and 50 unvaccinated people.

In the vaccinated pool, let’s assume that the vaccine was fully effective on 90% of them (that’s the highest estimate of effectiveness, which will result in the lowest number of susceptible vaccinated people – aka the best possible scenario for the anti-vaxers). That gives us 95 vaccinated people who are susceptible to the whooping cough.

There’s the root of the problem. Using numbers that are ridiculously friendly to the anti-vaxers, we’ve still got nearly twice as many susceptible vaccinated people as unvaccinated. So we’d expect, right out of the box, that nearly 2/3rds of the cases of whooping cough would be among the vaccinated people.

In reality, the numbers are much worse for the antivax case. The percentage of people who were ever vaccinated is around 95%, because you need the vaccination to go to school. But that’s just the childhood dose. DTAP is a vaccination that needs to be periodically boosted, or the immunity wanes. And the percentage of people who’ve had boosters is extremely low. Among adolescents, according to the CDC, only a bit more than half have had DTAP boosters; among adults, less than 10% have had a booster within the last 5 years.

What’s your susceptibility if you’ve gone more than 5 years without vaccination? Somewhere around 40% of people who haven’t had a booster in the last five years are susceptible.

So let’s just play with those numbers a bit. Assume, for simplicity, that 50% of the people are adults and 50% are children, and assume that all of the children are fully up-to-date on the vaccine. Then the susceptible pool among the vaccinated is 10% of the children (10% of 475), plus 10% of the up-to-date adults (10% of 10% of 475), plus 40% of the adults who aren’t up-to-date (40% of 90% of 475). That works out to about 223 susceptible people among the vaccinated, versus 50 unvaccinated – about 82% of the susceptible pool. So you’d expect about 82% of the actual cases of whooping cough to be among people who’d been vaccinated. Suddenly, the antivaxers’ case doesn’t look so good, does it?
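Here’s that arithmetic as a short Python sketch (all of the rates are the rough figures quoted above):

```python
# Reproducing the susceptible-pool arithmetic above. All of the rates
# are the rough figures quoted in this post.
population     = 1000
vacc_rate      = 0.95   # ever vaccinated
effectiveness  = 0.90   # best case for the antivax argument
booster_rate   = 0.10   # adults with a recent booster
lapsed_suscept = 0.40   # susceptibility once immunity wanes

vaccinated   = population * vacc_rate      # 950
unvaccinated = population - vaccinated     # 50

children = adults = vaccinated / 2         # 475 each
suscept_children = children * (1 - effectiveness)
suscept_boosted  = adults * booster_rate * (1 - effectiveness)
suscept_lapsed   = adults * (1 - booster_rate) * lapsed_suscept

suscept_vacc = suscept_children + suscept_boosted + suscept_lapsed
share = suscept_vacc / (suscept_vacc + unvaccinated)
print(suscept_vacc, share)   # ~223 susceptible vaccinated, ~82% of the pool
```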

Consider, for a moment, what you’d expect among a non-vaccinated population. Pertussis is highly contagious. If someone in your household has pertussis, and you’re susceptible, you’ve got a better than 90% chance of catching it. It’s that contagious. Routine exposure – not sharing a household, but going to work, to the store, etc., with people who are infected – still gives you about a 50% chance of infection if you’re susceptible.

In the state of Vermont, where NaturalNews is claiming that the evidence shows that the vaccine doesn’t work, how many cases of Pertussis have they seen? Around 600, out of a state population of 600,000 – an infection rate of one tenth of one percent. 0.1 percent, from a virulently contagious disease.

That’s the highest level of Pertussis that we’ve seen in the US in a long time. But at the same time, it’s really a very low number for something so contagious. To compare for a moment: there’s been a huge outbreak of Norovirus in the UK this year. Overall, more than one million people have caught it so far this winter, out of a total population of 62 million, for a rate of about 1.6% or sixteen times the rate of infection of pertussis.

Why is the rate of infection with this virulently contagious disease so different from the rate of infection with that other virulently contagious disease? Vaccines are a big part of it.

# Willful Ignorance about Statistics in Government

Quick but important one here.

I’ve repeatedly ranted here about ignorant twits. Ignorance is a plague on society, and it’s at its worst when it’s willful ignorance – that is, when you have a person who knows nothing about a subject, and who refuses to be bothered with something as trivial and useless as learning about it before they open their stupid mouths.

We’ve got an amazing, truly amazing, example of this in the US congress right now. There’s a “debate” going on about something called the American Community Survey, or the ACS for short. The ACS is a regular survey performed by the Census Bureau, which measures a wide range of demographic and economic statistics.

A group of Republicans are trying to eliminate the ACS. Why? Well, let’s put that question aside. And let’s also leave aside, for the moment, whether the survey is important or not. You can, honestly, put together an argument that the ACS isn’t worth doing, that it doesn’t measure the right things, that the value of the information gathered doesn’t measure up to the cost, that it’s intrusive, that it violates the privacy of the survey targets. But let’s not even bother with any of that.

Members of congress are arguing that the survey should be eliminated, and they’re claiming that the reason why is because the survey is unscientific. According to Daniel Webster, a representative from the state of Florida:

We’re spending \$70 per person to fill this out. That’s just not cost effective, especially since in the end this is not a scientific survey. It’s a *random* survey.

Note well the emphasized point there. That’s the important bit.

According to Representative Webster, the survey isn’t cost effective and the data gathered isn’t genuinely useful, because it’s not a scientific survey. Why isn’t it a scientific survey? Because it’s random.

This is what I mean by willful ignorance. Mr. Webster doesn’t understand what a survey is, or how a survey works, or what it takes to make a valid survey. He’s talking out his ass, trying to kill a statistical analysis for his own political reasons without making any attempt to actually understand what it is or how it works.

Surveys are, fundamentally, about statistical sampling. Given a large population, you can create estimates about the properties of the population by looking at a representative sample of the population. For example, if you’re looking at the entire population of America, you’re talking about hundreds of millions of people. You can’t measure, say, the employment rate of the entire population every year – there are just too many people. It’s too much information – it’s pretty much impossible to gather it.

But: if you can select a group of, say, 10,000 people, whose distribution matches the distribution of the wider population, then the data you gather about them will closely resemble the data about the wider population.

That’s the point of a survey: find a representative sample, and take measurements of that sample. Then, with a certain probability of correctness, you can infer the properties of the entire population from the properties of the sample.

Of course, there’s a catch. The key to a survey is the sample. The sample must be representative – meaning that the sample must have the same properties as the wider population of which it’s a part. But the point of the survey is to discover those properties! If you choose your sample to match what you believe the distribution to be, then you’ll bias your data towards matching that distribution. Your sample will only be representative if your beliefs about the data are correct. But that defeats the whole purpose of doing the survey.

So the scientific method of doing a survey is to be random. You don’t start with any preconceived idea of what the population is like. You just randomly select people in a way that makes sure that every member of the population is equally likely to be selected. If your selection is truly random, then there’s a high probability (a measurably high probability, based on the size of the sample and the size of the sampled population) that the sample will be representative.
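Here’s a toy illustration of why the randomness matters – all of the numbers are made up:

```python
import random

# Toy population (made-up numbers): 40% of 1,000,000 people have some
# property, and the list happens to be ordered, haves first.
population = [1] * 400_000 + [0] * 600_000

# A non-random "sample" -- just the first 10,000 people -- reflects how
# the list was built, not the population:
print(sum(population[:10_000]) / 10_000)   # 1.0 -- wildly wrong

# A random sample of the same size gets it about right:
sample = random.sample(population, 10_000)
print(sum(sample) / len(sample))           # ~0.40
```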

Scientific sampling is always random.

So Mr. Webster’s statement could be rephrased more correctly as the following contradiction: “This is not a scientific survey, because this is a scientific survey”. But Mr. Webster doesn’t know that what he said is a stupid contradiction. Because he doesn’t care.

# Stupid Politician Tricks; aka Averages Unfairly Biased against Moronic Conclusions

In the news lately, there’ve been a few particularly egregious examples of bad math. One that really ticked me off came from Alan Simpson. Simpson is one of the two co-chairs of a presidential commission that was asked to come up with a proposal for how to handle the federal budget deficit.

The proposal that his commission produced claimed that social security was one of the big problems in the budget. It really isn’t – it requires extremely creative accounting combined with several blatant lies to make it into part of the budget problem. (At the moment, social security is operating in surplus: it receives more money in taxes each year than it pays out.)

Simpson has claimed that social security must be cut if we’re going to fix the budget deficit. As part of his attempt to defend his proposed cuts, he said the following about social security:

It was never intended as a retirement program. It was set up in ‘37 and ‘38 to take care of people who were in distress — ditch diggers, wage earners — it was to give them 43 percent of the replacement rate of their wages. The life expectancy was 63. That’s why they set retirement age at 65

When I first heard that he’d said that, my immediate reaction was “that miserable fucking liar”. Because there are only two possible interpretations of that statement. Either the guy is a malicious liar, or he’s cosmically stupid and ill-informed. I was willing to accept that he’s a moron, but given that he spent a couple of years on the deficit commission, I couldn’t believe that he didn’t understand anything about how social security works.

I was wrong.

In an interview after that astonishing quote, a reporter pointed out that while the overall life expectancy was 63, people who lived to be 65 had a life expectancy of 79 years. You see, the life expectancy figures are pushed down by people who die young. Especially when you realize that social security started at a time when the people collecting it had grown up without antibiotics, there were a whole lot of people who died very young – which biases the average downwards. Simpson’s response to this?

If you’re telling me that a guy who got to be 65 in 1940 — that all of them lived to be 77 — that is just not correct. Just because a guy gets to be 65, he’s gonna live to be 77? Hell, that’s my genre. That’s not true.

So yeah… He’s really stupid. Usually, when it comes to politicians, my bias is to assume malice before ignorance. They spend so much of their time repeating lies – lying is pretty much their entire job. But Simpson is an extremely proud, arrogant man. If he had any clue of how unbelievably stupid he sounded, he wouldn’t have said that. He’d have made up some other lie that made him look less stupid. He’s got too much ego to deliberately look like a credulous drooling cretin.

So my conclusion is: he really doesn’t understand that if the overall average life expectancy for a set of people is 63, then the life expectancy of the subset of people who live to be 63 is going to be significantly higher than 63.

Just to hammer in how stupid it is, let’s look at a trivial example. Let’s look at a group of five people, with an average life expectancy of 62 years.

One died when he was 12. What does the average age at death of the other four need to be to make the overall average life expectancy 62 years?

$\frac{4x + 12}{5} = 62 \implies 4x = 298 \implies x = 74.5$

So in this particular group of people with a life expectancy of 62 years, the pool of people who live to be 20 has a life expectancy of 74.5 years.
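The same point in a couple of lines of Python, using made-up ages at death whose average is 62:

```python
import statistics

# Made-up ages at death averaging 62, matching the example above:
# one early death drags the overall average way down.
ages = [12, 70, 73, 76, 79]

print(statistics.mean(ages))        # 62: the "life expectancy"
survivors = [a for a in ages if a >= 20]
print(statistics.mean(survivors))   # 74.5: expectancy if you survive youth
```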

It doesn’t take much math at all to see how much of a moron Simpson is. It should be completely obvious: some people die young, and the fact that they die young affects the average.

Another way of saying it, which makes it pretty obvious how stupid Simpson is: if you live to be 65, you can be pretty sure that you’ll live to be at least 65, and you’ve got a darn good chance of living to be 66.

It’s incredibly depressing to realize that the report co-signed by this ignorant, moronic jackass is widely accepted by politicians and influential journalists as a credible, honest, informed analysis of the deficit problem and how to solve it. The people who wrote the report are incapable of comprehending the kind of simple arithmetic that’s needed to see how stupid Simpson’s statement was.

# Electoral Rubbish

I live in New York. ’round here, we’ve got a somewhat peculiar feature of how we run our elections. A single candidate can run for office on behalf of multiple parties. If they do, they appear on the ballot in multiple places – one ballot line for each party that they represent. When votes are tallied, if the candidate names for two different ballot lines match exactly, then the votes for those two lines are combined.

The theory behind this is that it allows people to say a bit more with their votes. If you want to vote for the democratic candidate, but you also want to express your preference for policies more liberal than those of the democratic party platform, you can vote for the democrat, but do it on the liberal party line instead of the democratic party line.

In practice, what this means is that we’ve got lots of patronage parties – that is, lots of small parties which were set up by a small group of people as a way of making money by, essentially, selling their ballot line.

One thing we hear, election after election, is how terribly important these phony parties are. This year, we keep on hearing, over and over, how no Republican has won a statewide election since 1975 without the backing of the Conservative party! Therefore, winning the backing of the Conservative party is so very, very important!

This is, alas, a classic example of the old problem: correlation does not imply causation. The Republicans don’t lose elections because they don’t have the backing of the Conservative party: the Conservative party always backs the republican candidate unless it’s completely clear that they’re going to lose.

# Iterative Hockey Stick Analysis? Gimme a break!

This past weekend, my friend Orac sent me a link to an interesting piece of bad math. One of Orac’s big interests is vaccination and anti-vaccinationists. The piece is a newsletter by a group calling itself the “Sound Choice Pharmaceutical Institute” (SCPI), which purports to show a link between vaccinations and autism. But instead of the usual anti-vac rubbish about thimerosal, they claim that “residual human DNA contaminants from aborted human fetal cells” cause autism.

Among others, Orac has already covered the nonsense of that claim from a biological/medical perspective. What he didn’t cover – and the reason he forwarded this newsletter to me – is the math: the basis of their argument is that they discovered key change points in the autism rate that correlate perfectly with the introduction of various vaccines.

In fact, they claim to have discovered three different inflection points:

1. 1979, the year that the MMR 2 vaccine was approved in the US;
2. 1988, the year that a 2nd dose of the MMR 2 was added to the recommended vaccination schedule; and
3. 1995, the year that the chickenpox vaccine was approved in the US.

They claim to have discovered these inflection points using “iterative hockey stick analysis”.