# Deceptive Statistics and Elections

I don’t mean to turn my blog into political nonsense central, but I just can’t pass on some of these insane arguments.

This morning, the state of Texas sued four other states to overturn the results of our presidential election. As part of their suit, they included an "expert analysis" that claims that the odds of the election results in the state of Georgia being legitimate are worse than 1 in one quadrillion. So naturally, I had to take a look.

Here’s the meat of the argument that they’re claiming "proves" that the election results were fraudulent.

> I tested the hypothesis that the performance of the two Democrat candidates were statistically similar by comparing Clinton to Biden. I use a Z-statistic or score, which measures the number of standard deviations the observation is above the mean value of the comparison being made. I compare the total votes of each candidate, in two elections and test the hypothesis that other things being the same they would have an equal number of votes.
>
> I estimate the variance by multiplying the mean times the probability of the candidate not getting a vote. The hypothesis is tested using a Z-score which is the difference between the two candidates’ mean values divided by the square root of the sum of their respective variances. I use the calculated Z-score to determine the p-value, which is the probability of finding a test result at least as extreme as the actual results observed. First, I determine the Z-score comparing the number of votes Clinton received in 2016 to the number of votes Biden received in 2020. The Z-score is 396.3. This value corresponds to a confidence that I can reject the hypothesis many times more than one in a quadrillion times that the two outcomes were similar.

This is, to put it mildly, a truly trashy argument. I’d be incredibly ashamed if a high school student taking statistics turned this in.

What’s going on here?

Well, to start, this is deliberately bad writing. It’s using a whole lot of repetitive words in confusing ways in order to make it sound complicated and scientific.

I can simplify the writing, and that will make it very clear what’s going on.

> I tested the hypothesis that the performance of the two Democrat candidates were statistically similar by comparing Clinton to Biden. I started by assuming that the population of eligible voters, the rates at which they cast votes, and their voting preferences, were identical for the two elections. I further assumed that the counted votes were a valid random sampling of the total population of voters. Then I computed the probability that in two identical populations, a random sampling could produce results as different as the observed results of the two elections.

As you can see from the rewrite, the "analysis" assumes that the voting population is unchanged, and the preferences of the voters are unchanged. He assumes that the only thing that changed is the specific sampling of voters from the population of eligible voters – and in both elections, he assumes that the set of people who actually vote is a valid random sample of that population.

In other words, if you assume that:

1. No one ever changes their mind and votes for different parties’ candidates in two sequential elections;
2. The population and its preferences never change – people don’t move in and out of the state, and new people don’t register to vote;
3. The set of people who vote in an election is a completely random sample of the eligible voters.

Then you can say that this election result is impossible and clearly indicates fraud.

The problem is, none of those assumptions are anywhere close to correct or reasonable. We know that people’s voting preferences change. We know that the voting population changes. We know that who turns out to vote changes. None of these things are fixed constants – and any analysis that assumes any of these things is nothing but garbage.
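To make that concrete, here’s a minimal sketch of the Z-statistic as I read the quoted description, run on vote counts that I made up for illustration (they are not real election data). The function name `z_score` and its parameters are mine, not the report’s. In the second call, both elections have exactly the same 50% preference for the candidate, and the only thing that changes is that turnout grows by 5%.

```python
import math


def z_score(votes_a, total_a, votes_b, total_b):
    """Z-statistic following my reading of the quoted method (an assumption)."""
    # Variance of each count: the mean (the vote count itself) times the
    # probability of the candidate NOT getting a given vote.
    var_a = votes_a * (1 - votes_a / total_a)
    var_b = votes_b * (1 - votes_b / total_b)
    # Difference of the two counts over the root of the summed variances.
    return (votes_b - votes_a) / math.sqrt(var_a + var_b)


# Hypothetical election 1 and election 2: identical in every way.
z_same = z_score(1_000_000, 2_000_000, 1_000_000, 2_000_000)

# Hypothetical: same 50% preference, but 5% more people turn out to vote.
z_growth = z_score(1_000_000, 2_000_000, 1_050_000, 2_100_000)

print(f"identical elections: z = {z_same:.1f}")
print(f"5% more turnout:     z = {z_growth:.1f}")
```

With these made-up numbers, nobody changed their mind and nothing fraudulent happened, yet the second Z-score comes out at roughly 49. By this test’s logic, ordinary turnout growth is already "impossible".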

But I’m going to zoom in a bit on one of those: the one about the set of voters being a random sample.

When it comes to statistics, the selection of a sample is one of the most important, fundamental concerns. If your sample isn’t random, then you can’t analyze it as if it were. And you can’t compare results for two samples as if they’re equivalent if they aren’t equivalent.

Elections aren’t random statistical samples of the population. They’re not even intended to be random statistical samples. They’re deliberately performed as counts of motivated individuals who choose to come out and cast their votes. In statistical terms, they’re a self-selected, motivated sample. Self-selected samples are neither random nor representative in a statistical sense. There’s nothing wrong with that: an election isn’t intended to be a random sample. But it does mean that when you do statistical analysis, you cannot treat the set of voters as a random sampling of the population of eligible voters; and you cannot make any assumptions about uniformity when you’re comparing the results of two different elections.

If you could – if the set of voters was a valid random statistical sample of an unchanging population of eligible voters, then there’d be no reason to even have elections on an ongoing basis. Just have one election, take its results as the eternal truth, and just assume that every election in the future would be exactly the same!

But that’s not how it works. And the people behind this lawsuit, and particularly the "expert" who wrote this so-called statistical analysis, know that. This analysis is pure garbage, put together to deceive. They’re hoping to fool someone into believing that they actually prove something that they couldn’t prove.

And that’s despicable.

# Herd Immunity

With COVID running rampant throughout the US, I’ve seen a bunch of discussions about herd immunity, and questions about what it means. There’s a simple mathematical concept behind it, so I decided to spend a bit of time explaining.

The basic concept is pretty simple. Let’s put together a simple model of an infectious disease. This will be an extremely simple model – we won’t consider things like variable infectivity, population age distributions, population density – we’re just building a simple model to illustrate the point.

To start, we need to model the infectivity of the disease. This is typically done using a value called $R_0$. $R_0$ is the average number of susceptible people that will be infected by each person with the disease. $R_0$ is the purest measure of infectivity – it’s the infectivity of the disease in ideal circumstances. In practice, we look for a value $R$, which is the actual infectivity. $R$ includes the effects of social behaviors, population density, etc.

The state of an infectious disease is based on the expected number of new infections that will be produced by each infected individual. We compute that by multiplying $R$ by a number $S$, which is the proportion of the population that is susceptible to the disease.

• If $R * S < 1$, then the disease dies out without spreading throughout the population. More people can get sick, but each wave of infection will be smaller than the last.
• If $R * S = 1$, then the disease is said to be endemic. It continues as a steady state in the population. It never spreads dramatically, but it never dies out, either.
• If $R * S > 1$, then the disease is epidemic. Each wave of infection spreads the disease to a larger subsequent wave. The higher the value of $R$ in an epidemic, the faster the disease will spread, and the more people will end up sick.
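Here’s a minimal sketch of those three cases in Python. The function names `classify` and `wave_sizes` are made up for this illustration, and the model holds $S$ fixed from wave to wave, which is a simplification: in a real outbreak, $S$ shrinks as people get infected and recover.

```python
def classify(r, s):
    """Classify spread from R (actual infectivity) and S (susceptible fraction)."""
    growth = r * s  # expected new infections per infected person
    if growth < 1:
        return "dies out"
    if growth == 1:  # exact equality is a knife edge; fine for a toy model
        return "steady"
    return "grows"


def wave_sizes(initial, r, s, waves):
    """Expected size of each successive wave, holding S fixed (a simplification)."""
    sizes = [float(initial)]
    for _ in range(waves):
        # Each infected person infects r * s others on average.
        sizes.append(sizes[-1] * r * s)
    return sizes


print(classify(2.0, 0.4))  # R * S = 0.8
print(classify(2.0, 0.5))  # R * S = 1.0
print(classify(2.0, 0.6))  # R * S = 1.2
print(wave_sizes(100, 2.0, 0.4, 3))  # each wave is 80% of the one before
```

The values of $R$ and $S$ in the calls are arbitrary, picked only to land in each of the three cases.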

There are two keys to managing the spread of an infectious disease:

1. Reduce the effective value of $R$. The value of $R$ can be affected by various attributes of the population, including behavioral ones. In the case of COVID-19, an infected person wearing a mask will spread the disease to fewer others; and if other people are also wearing masks, then it will spread even less.
2. Reduce the value of $S$. If there are fewer susceptible people in the population, then even with a high value of $R$, the disease can’t spread as quickly.

The latter is the key concept behind herd immunity. If you can get the value of $S$ to be small enough, then you can get $R * S$ to the sub-endemic level – you can prevent the disease from spreading. You’re effectively denying the disease access to enough susceptible people to be able to spread.

Let’s look at a somewhat concrete example. The $R_0$ for measles is somewhere around 15, which is insanely infectious. If 50% of the population is susceptible, and no one is doing anything to avoid the infection, then each person infected with measles will infect 7 or 8 other people – and they’ll each infect 7 or 8 others – and so on, which means you’ll have epidemic spread.

Now, let’s say that we get 95% of the population vaccinated, and they’re immune to measles. Then $R * S = 15 * 0.05 = 0.75$. The disease isn’t able to spread. If you had an initial outbreak of 5 infected, then they’ll infect around 4 people, who’ll infect around 3 people, who’ll infect around 2 people, and so on, with each wave smaller than the last, until there are no more infections.
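Written out as arithmetic under the same simple model (and treating $S$ as fixed from wave to wave, which is my simplifying assumption), the two measles scenarios look like this:

$$
\begin{aligned}
\text{50\% susceptible:}\quad & R * S = 15 * 0.5 = 7.5 \\
\text{5\% susceptible:}\quad & R * S = 15 * 0.05 = 0.75 \\
\text{waves from 5 cases:}\quad & 5 \to 3.75 \to 2.81 \to 2.11 \to 1.58 \to \dots \\
\text{break-even point:}\quad & R * S < 1 \iff S < \tfrac{1}{15} \approx 0.067
\end{aligned}
$$

So in this model, anything above roughly 93% immunity pushes $R * S$ below 1 for measles, and 95% vaccination clears that bar.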

In this case, we say that the population has herd immunity to the measles. There aren’t enough susceptible people in the population to sustain the spread of the disease – so if the disease is introduced to the population, it will rapidly die out. Even if there are individuals who are still susceptible, they probably won’t get infected, because there aren’t enough other susceptible people to carry it to them.

There are very few diseases that are as infectious as measles. But even with a disease that is that infectious, you can get to herd immunity relatively easily with vaccination.

Without vaccination, it’s still possible to develop herd immunity. It’s just extremely painful. If you’re dealing with a disease that can kill, getting to herd immunity means letting the disease spread until enough people have gotten sick and recovered that the disease can’t spread any more. What that means is letting a huge number of people get sick and suffer – and letting some portion of those people die.

Getting back to COVID-19: it’s got an $R_0$ that’s much lower. It’s somewhere between 1.4 and 2.5. Of those who get sick, even with good medical care, somewhere between 1% and 2% of the infected end up dying. Based on that $R_0$, herd immunity for COVID-19 (the value of $S$ required to make $R * S < 1$) is somewhere around 50% of the population. Without a vaccine, that means that we’d need to have 150 million people in the US get sick, and of those, around 2 million would die.

(UPDATE: Ok, so I blew it here. The papers that I found in a quick search appear to have a really bad estimate. The current CDC estimate of $R_0$ is around 5.7 – so the fraction of the population that has to be immune for herd immunity is significantly higher – upward of 80%, and so would the number of deaths.)
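As a rough sketch of how sensitive that threshold is to $R_0$: in this simple model, $R_0 * S < 1$ means $S < 1/R_0$, so the fraction of the population that has to be immune is $1 - 1/R_0$. The function name `immune_fraction_needed` is mine, and the calculation ignores everything the simple model ignores (behavior, density, waning immunity).

```python
def immune_fraction_needed(r0):
    """Fraction that must be immune so that r0 * S < 1 in the simple model."""
    if r0 <= 1:
        return 0.0  # the disease already dies out on its own
    return 1 - 1 / r0


# R_0 values mentioned in the text: measles and the COVID-19 estimates.
estimates = [
    ("measles", 15),
    ("COVID-19, low estimate", 1.4),
    ("COVID-19, high estimate", 2.5),
    ("COVID-19, CDC estimate", 5.7),
]

for label, r0 in estimates:
    needed = immune_fraction_needed(r0)
    print(f"{label}: R0 = {r0}, need about {needed:.0%} immune")
```

For the 1.4 to 2.5 range this gives somewhere between about 29% and 60%, and for 5.7 it gives about 82%.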

A strategy for dealing with an infectious disease that accepts the needless death of 2 million people is not exactly a good strategy.