Deceptive Statistics and Elections

I don’t mean to turn my blog into political nonsense central, but I just can’t pass on some of these insane arguments.

This morning, the state of Texas sued four other states to overturn the results of our presidential election. As part of their suit, they included an "expert analysis" that claims that the odds of the election results in the state of Georgia being legitimate are worse than one in a quadrillion. So naturally, I had to take a look.

Here’s the meat of the argument that they’re claiming "proves" that the election results were fraudulent.

I tested the hypothesis that the performance of the two Democrat candidates were statistically similar by comparing Clinton to Biden. I use a Z-statistic or score, which measures the number of standard deviations the observation is above the mean value of the comparison being made. I compare the total votes of each candidate, in two elections and test the hypothesis that other things being the same they would have an equal number of votes.
I estimate the variance by multiplying the mean times the probability of the candidate not getting a vote. The hypothesis is tested using a Z-score which is the difference between the two candidates’ mean values divided by the square root of the sum of their respective variances. I use the calculated Z-score to determine the p-value, which is the probability of finding a test result at least as extreme as the actual results observed. First, I determine the Z-score comparing the number of votes Clinton received in 2016 to the number of votes Biden received in 2020. The Z-score is 396.3. This value corresponds to a confidence that I can reject the hypothesis many times more than one in a quadrillion times that the two outcomes were similar.

This is, to put it mildly, a truly trashy argument. I’d be incredibly ashamed if a high school student taking statistics turned this in.

What’s going on here?

Well, to start, this is deliberately bad writing. It’s using a whole lot of repetitive words in confusing ways in order to make it sound complicated and scientific.

I can simplify the writing, and that will make it very clear what’s going on.

I tested the hypothesis that the performance of the two Democrat candidates were statistically similar by comparing Clinton to Biden. I started by assuming that the population of eligible voters, the rates at which they cast votes, and their voting preferences, were identical for the two elections. I further assumed that the counted votes were a valid random sampling of the total population of voters. Then I computed the probability that in two identical populations, a random sampling could produce results as different as the observed results of the two elections.

As you can see from the rewrite, the "analysis" assumes that the voting population is unchanged, and the preferences of the voters are unchanged. He assumes that the only thing that changed is the specific sampling of voters from the population of eligible voters – and in both elections, he assumes that the set of people who actually vote is a valid random sample of that population.

In other words, if you assume that:

  1. No one ever changes their mind and votes for different parties’ candidates in two sequential elections;
  2. The population and its preferences never change – people don’t move in and out of the state, and new people don’t register to vote;
  3. The set of specific people who vote in an election is completely random.

Then you can say that this election result is impossible and clearly indicates fraud.

The problem is, none of those assumptions are anywhere close to correct or reasonable. We know that people’s voting preferences change. We know that the voting population changes. We know that who turns out to vote changes. None of these things are fixed constants – and any analysis that assumes any of these things is nothing but garbage.
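
To get a feel for how little this test can tell you, here’s a minimal sketch of the calculation as the quoted passage describes it. It reads "the mean" as the candidate’s vote count and "the probability of the candidate not getting a vote" as one minus their vote share; that reading is an assumption, since the passage never spells it out. The function names and the vote totals are hypothetical round numbers chosen for illustration, not real election data.

```python
import math

def z_score(votes_a, total_a, votes_b, total_b):
    """Z-score for the difference between two vote counts.

    Assumes each count is a binomial draw, so its variance is
    n*p*(1-p): the count times the share of votes NOT won.
    """
    var_a = votes_a * (1 - votes_a / total_a)
    var_b = votes_b * (1 - votes_b / total_b)
    return (votes_b - votes_a) / math.sqrt(var_a + var_b)

def two_sided_p(z):
    """Two-sided p-value for z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical inputs (NOT real data): 4,000,000 ballots in both
# elections, and the candidate's total grows by about 5%.
z = z_score(1_900_000, 4_000_000, 2_000_000, 4_000_000)
print(f"z = {z:.1f}")
print(f"p = {two_sided_p(z):.3g}")
print(f"p at z = 8: {two_sided_p(8):.3g}")
```

With these made-up inputs, a gain of about 5% already produces a Z-score of roughly 70, and the p-value underflows to zero in double precision. A Z-score of only about 8 is enough to get to one in a quadrillion. Under the test’s assumptions, just about any pair of real elections would look "impossible", which says something about the assumptions, not about the elections.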

But I’m going to zoom in a bit on one of those: the one about the set of voters being a random sample.

When it comes to statistics, the selection of a sample is one of the most important, fundamental concerns. If your sample isn’t random, then you can’t analyze it as though it were. You can’t compare results for two samples as if they’re equivalent if they aren’t equivalent.

Elections aren’t random statistical samples of the population. They’re not even intended to be random statistical samples. They’re deliberately performed as counts of motivated individuals who choose to come out and cast their votes. In statistical terms, they’re a self-selected, motivated sample. Self-selected samples are neither random nor representative in a statistical sense. There’s nothing wrong with that: an election isn’t intended to be a random sample. But it does mean that when you do statistical analysis, you cannot treat the set of voters as a random sampling of the population of eligible voters; and you cannot make any assumptions about uniformity when you’re comparing the results of two different elections.
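
A small sketch makes the self-selection point concrete. The electorate below is entirely hypothetical: two equally sized groups whose preferences are identical in both elections, with only the turnout of the first group changing. The `Group` class, the `count_votes` function, and every number are assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Group:
    size: int          # eligible voters in the group
    support_a: float   # fraction of the group preferring candidate A
    turnout: float     # fraction of the group that actually votes

def count_votes(groups):
    """Return (votes for A, total votes cast) for one election."""
    votes_a = sum(g.size * g.support_a * g.turnout for g in groups)
    total = sum(g.size * g.turnout for g in groups)
    return round(votes_a), round(total)

# Hypothetical electorate: preferences are the same in both elections;
# only the first group's turnout rises, from 50% to 65%.
election_1 = [Group(1_000_000, 0.60, 0.50), Group(1_000_000, 0.40, 0.60)]
election_2 = [Group(1_000_000, 0.60, 0.65), Group(1_000_000, 0.40, 0.60)]

for name, groups in (("election 1", election_1), ("election 2", election_2)):
    votes_a, total = count_votes(groups)
    print(f"{name}: A gets {votes_a:,} of {total:,} votes ({votes_a / total:.1%})")
```

In this toy setup, nobody changes their mind and nobody moves, yet candidate A goes from 540,000 votes (a loss) to 630,000 votes (a win) purely because a different set of people chose to show up. A test that treats both tallies as random samples of the same population has no way to account for that.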

If you could – if the set of voters was a valid random statistical sample of an unchanging population of eligible voters, then there’d be no reason to even have elections on an ongoing basis. Just have one election, take its results as the eternal truth, and just assume that every election in the future would be exactly the same!

But that’s not how it works. And the people behind this lawsuit, and particularly the "expert" who wrote this so-called statistical analysis, know that. This analysis is pure garbage, put together to deceive. They’re hoping to fool someone into believing that they actually prove something that they couldn’t prove.

And that’s despicable.

11 thoughts on “Deceptive Statistics and Elections”

  1. spencer

    > I estimate the variance by multiplying the mean times the probability of the candidate not getting a vote.

    This is gibberish to me. The mean of what? They seem to be only using a single data point (2016 results) so how can the variance be estimated? How can even a mean be estimated?

    1. markcc Post author

      I was debating whether to make the post longer by talking about that.

      As far as I can tell, he just makes that number up. He pretends that he’s looking at a sample, and he makes up a probability, and then uses that to compute a variance. It’s just a made-up number to give the result that he wants; and he buries it in a flurry of obfuscatory text to try to cover up for the fact that it’s made-up.

  2. fredsbend

    Thank you for your statistical analysis. I did however find that dismissing their assumptions may have come too quickly.

    *“We know that people’s voting preferences change.”* My experience is that people rarely change their political voting habits, and I believe I’ve seen data to back that up. Makes me wonder if taking the other dismissed assumptions (*“We know that the voting population changes. We know that who turns out to vote changes.”*) at face value (which I do) is valid. Do populations change significantly in only 1 election cycle? Do consistent voters not significantly outnumber inconsistent voters?

    Perhaps it’s beyond this site’s purview to discuss it, but dismissing those assumptions without evidence may be incorrect. There might be validity in them.

    1. markcc Post author

      You don’t get to say “my experience is X, therefore there must be fraud”.

      In 2004, a Republican candidate won re-election by a decent margin.

      In 2008, a Democratic candidate won a national election by a significant margin. Was that fraud? By the argument I criticized, it must have been, because voters never change.

      In 2012, a Democratic candidate was re-elected by a significant margin, but with a lower electoral vote total than in 2008. Was that fraud? By the argument I criticized, it must be: voters don’t change, right?

      In 2016, the Democratic candidate lost. Was that a fraud? By the reasoning of this fraud argument, it must have been. Because, according to the argument, voters don’t change.

      Similar arguments apply to things like voter turnout. In 2008, about 131 million people voted. In 2012, about 129 million people voted. In 2016, about 137 million people voted. In 2020, about 158 million people voted. Right there, we can see that there’s a significant variation in the set of people who voted.

      Can we assume that every election where turnout increases is a fraud?

  3. GW

    First thing I’d like to know about the author is whether he’s a Biden or Trump supporter. You “cleaned” up someone else’s writing, subjectively. It’s impossible to do it objectively – you didn’t write it and aren’t in the mind of the original author.

    I barely remember much from statistics and probability. I agree that the calculations are based on subjective data points. However, those cities in question don’t vary that much percentage-wise in recent Presidential elections. Just because you say voters’ preferences change (true) and people move in and out (true) also doesn’t mean it will reflect a meaningful change from one election to another.

    I have noticed the people disputing this quadrillion number statistic only try to refute it by trying to pick certain areas apart on it that attempt to discredit the number. I also have noticed you didn’t tell your readers that ANY statistic anywhere from 1 billion to the 1:1 quadrillion number is so IMPRECISE it’s laughable no matter who calculated it. The numbers are so massive that too many variances exist for there being any level of precision. They just sound good. Second, those of you challenging the 1:1 quadrillion statistic and trying to project yourselves onto your readers as way smarter but with humility… but if you’re so damn smart and believe you’ve actually proven this guy’s calculation and data points to be inaccurate… well, smarty pants… prove your superior intelligence by giving us the real statistic based on accurate data points and sign your name to it like the other guy did. I can pick apart many, many things at my level of intelligence and education and do it pretty accurately. What I cannot do in most cases is prove I can do it any better or more accurately. Publish your statistic for all to see and earn your readers’ respect.
    Thank you for an informative and well written article, too. You did a nice job on keeping it where the average + person can understand it. That’s not always easy to do. God Bless you and yours and Happy Holidays.

    1. markcc Post author

      You can’t compute a statistic like this, because there’s no meaningful way to do it. Or rather, you can, but it’s not interesting: the odds of it happening are 100% – because it did happen.

      The original writer put together a ridiculous, nonsensical “analysis” based on statistically invalid methods, made-up numbers, and ridiculous arguments. And at the end, he came up with a number that was meaningless, because it was based on nonsense.

  4. Mark G Townes

    OK and fair enough – and thank you for the simplified language (I believe his explanation was just nerd talk and not an obfuscation effort… I was guilty of this early in my engineering career). However, making simplified assumptions as a starting point is not uncommon in any kind of analysis. So, maybe instead of stopping short and calling it “trash”, use your skills to extend the analysis.

    For example, take the first “false” assumption – no one ever changes their mind and votes for different parties’ candidates in two sequential elections – do we not have pretty good data on the probability of this occurrence? Assuming so, then refine the analysis to include additional data. Do this, also, with data associated with the other two questionable assumptions – one can even model “level” of randomness as a variable.

    Plug all the (prior and additional) data into a multivariate statistical software package that iterates on different levels of these three assumptions – with, of course, reasonable boundaries so the calculations don’t spin to infinity.

    Then look at results at the extremes. The Texas lawsuit argues, admittedly, one extreme with respect to the suspect assumptions listed. But what is at the other extreme?

    So forget the quadrillion talk – maybe the “probability” of a Biden win calculates out to be 1 in 2 … even steven. Or even 1.1 to 2 – a narrow victory would then make sense.

    But, what if the probability at the opposite extreme is shown to be 1 in 10, 1 in 100, 1 in 1000? What do we do now?

    1. markcc Post author

      The problem is that the *entire analysis* is invalid. It’s not a case where you can pick one little aspect that’s wrong, correct that, and then continue with a reasonable analysis. The entire thing is fabricated from a pile of nonsense.

      The assumption of unchanged voting patterns is nonsense. But even worse is that the core of his “analysis” is the assumption that election tallies are some kind of random sampling of *something*. (He never actually specifies what they’re a random sampling *of*; and he comes up with a variance for that random sampling calculated from a number that he pulled out of nowhere.)

      You can’t fix that analysis. The entire thing is statistical garbage.

      Yes, you *can* do a reasonable analysis of things like this. Look at sites like 538, which combine data from previous elections with polling data to make reasonable estimates. But those methods are completely different from the one cited in the Texas suit.

      The method cited in the Texas suit is nothing but garbage. From top to bottom, there is *nothing* about it that is correct. It relies on unfounded assumptions, invalid methods, and invented numbers. It is a thoroughly dishonest, disgraceful piece of work.

      In all seriousness, I would expect any high school student who took a basic statistics course to be able to take one look at the method that the “analyst” described, and realize what utter garbage it is.

      I’ll repeat myself, because this is important. He didn’t make “simplified” assumptions. He made *obviously* incorrect assumptions. He used methods that *obviously* made absolutely no sense. (Again, the analysis is based on the idea that the election results are a *random sampling* of *something* which he never specifies. Election results aren’t a random sampling, and no statistical analysis ever talks about a random sample without identifying the population that was sampled and the methods used to ensure randomness.) He pulled numbers out of thin air in order to get the result that he wanted. (Once again, how did he compute that variance? By inventing a probability of a vote being tallied for a particular candidate. That invented number determines *everything* else about his analysis. And he invented it out of whole cloth.)

      What was the probability of a Biden win? That depends greatly on exactly how you choose to crunch the numbers. There’s a reason that we’re really terrible at predicting the results of an election: our models of who will turn out to vote really don’t work very well. But in my opinion, the best analysis I saw before the election came from 538, which estimated the probability of a Biden win to be roughly 89%. But even there, looking at the results after the fact, I think that 538’s assumptions were overly generous to Biden – the election was much closer than their predictions suggested.

      I’m not going to go further into the analysis than that. I’m *not* a statistician. I’m no expert on this stuff. But the point is, you don’t need to be an expert to see what’s wrong with the Texas analysis. It’s *obvious*.

  5. A.M.

    The Texas analyst has now appended a reply defending his methods.

    https://www.supremecourt.gov/DocketPDF/22/22O155/163493/20201211095822921_TX-v-State-LeaveReply-2020-12-11.pdf

    You might like to write a follow-up post addressing this argument simply because it is curious. I find his reply very strange. While the original “expert analysis” is so ridiculous that it easily reads as bad faith, this new reply makes it sound like he legitimately doesn’t understand how statistical analysis works.
