Category Archives: statistics

Least Square Linear Regression

There’s one topic I’ve been asked about multiple times, but which I’ve never gotten around to writing about. It happens to be one of the first math things that my dad taught me about: linear regression.

Here’s the problem: you’re doing an experiment. You’re measuring one quantity as you vary another. You’ve got a good reason to believe that there’s a linear relationship between the two quantities. But your measurements are full of noise, so when you plot the data on a graph, you get a scattershot. How can you figure out what line is the best match to your data, and how can you measure how good the match is?

When my dad taught me this, he was working for RCA, manufacturing semiconductor chips for military and satellite applications. The focus of his work was building chips that would survive in the high-radiation environment of space – in the jargon, he was building radiation-hard components. They’d put together a set of masks for an assembly line, and do a test run. Then they’d take the chips from that run, and expose them to gamma radiation until they failed. That gave them a good way of estimating the actual radiation hardness of the run, and whether it was good enough for their customers. Based on a combination of theory and experience, they knew that the relationship they cared about was nearly linear: the number of specific circuitry failures was roughly proportional to the amount of gamma exposure.

For example, here’s a graph of data points that I generated semi-randomly. The distribution of the points isn’t really what you’d get from real observations, but it’s good enough for demonstration.

The way that we’d usually approach this is called least squares linear regression. The idea is that what we want to do is find the line that minimizes the sum of the squares of the vertical distances between the line and the measured data points.

For the purposes of this, we’ll say that one quantity is the independent variable, and we’ll call that x, and the other quantity is the dependent variable, and we’ll call that y. In theory, the dependent variable, as its name suggests, depends on the independent variable. In fact, we don’t always really know which value depends on the other, so we do our best to make an intelligent guess.

So what we want to do is find a linear equation, y = mx + b where the mean-square distance is minimal. All we need to do is find values for m (the slope of the line) and b (the point where the line crosses the y axis, also called the y intercept). And, in fact, b is relatively easy to compute once we know the slope of the line. So the real trick is to find the slope of the line.

The way that we do that is: first we compute the means of x and y, which we’ll call \overline{x} and \overline{y}. Then using those, we compute the slope as:

m = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^{n} (x_i-\overline{x})^2}

Then for the y intercept: b = \overline{y} - m\overline{x}.

In the case of this data: I set up the script so that the slope would be about 2.2 +/- 0.5. The slope of the fitted line in the figure is 2.54, and the y-intercept is 18.4.
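
Here’s a minimal sketch of those two formulas in Python. The data below is made up for illustration – it isn’t the data from the figure:

```python
import numpy as np

# Made-up (x, y) measurements, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([20.5, 23.1, 26.0, 27.2, 31.1, 32.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: b = y_bar - m * x_bar
b = y_bar - m * x_bar

print(f"slope = {m:.3f}, intercept = {b:.3f}")
```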

Now, we want to check how good the linear relationship is. There’s several different ways of doing that. The simplest is called the correlation coefficient, or r.

r = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i - \overline{y})}{\sqrt{ \left(\sum_{i=1}^n (x_i-\overline{x})^2\right) \left(\sum_{i=1}^n (y_i - \overline{y})^2\right) }}

If you look at this, it’s really a check of how well the variations of the two quantities around their means move together. On the top, you’ve got a sum of products of the deviations; on the bottom, you’ve got a normalizing factor built from those same deviations, which strips away the scale so that the result always lands between -1 and 1. The end result is that if the correlation is perfect – that is, if the dependent variable increases linearly with the independent – then the correlation will be 1. If the dependent variable decreases linearly as the independent one increases, then the correlation will be -1. If there’s no relationship, then the correlation will be 0.

I generated this particular set of data from a linear equation with a little bit of random noise added. The correlation coefficient is slightly greater than 0.95, which is exactly what you’d expect.
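
The correlation coefficient is just as easy to compute directly from that formula. Again, a minimal sketch with made-up data:

```python
import numpy as np

# Made-up data with a roughly linear trend, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([20.5, 23.1, 26.0, 27.2, 31.1, 32.8])

# Deviations from the means.
dx, dy = x - x.mean(), y - y.mean()

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(f"r = {r:.3f}")  # close to 1 for a strong positive linear trend
```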

When you see people use linear regression, there are a few common errors that you’ll see all the time.

  • No matter what your data set looks like, linear regression will find a line. It won’t tell you “Oops, I couldn’t find a match”. So the fact that you fit a line means absolutely nothing by itself. If you’re doing it right, you start off with a hypothesis based on prior plausibility for a linear relation, and you’re using regression as part of a process to test that hypothesis.
  • You don’t get to look at the graph before you do the analysis. What I mean by that is: if you look at the data, you’ll naturally notice some patterns. Humans are pattern seekers – we’re really good at noticing them. And almost any data set that you look at carefully enough will contain some patterns purely by chance. If you look at the data, and there’s a particular pattern that you want to see, you’ll probably find a way to look at the data that produces that pattern. For example, in the first post on this blog, I was looking at a shoddy analysis by some anti-vaxxers, who claimed that they’d found an inflection point in the rate of autism diagnoses, and used linear regression to fit two lines – one before the inflection, one after. But that wasn’t supported by the data. It was random – the data was very noisy. You could fit different lines to different sections by being selective. If you picked one time, you’d get a steeper slope before that time, and a shallower one after. But by picking different points, you could get a steeper slope after. The point is, when you’re testing the data, you need to design the tests before you’ve seen the data, in order to keep your bias out!
  • A strong correlation doesn’t imply a linear relationship. If you fit a line to a bunch of data that’s not really linear, you can still get a strong positive (or negative) correlation. Correlation is really testing whether the data increases the way you’d expect it to, not whether it’s truly linear. Random data will have a near-zero correlation. Data where the dependent variable doesn’t vary consistently with the independent will have near-zero correlation. But there are plenty of ways of getting data where the dependent and independent variables increase together that produce a strong correlation. You need to do other things to judge the strength of the fit. (I might do some more posts on this kind of thing to cover some of that.)

Debunking Two Nate Silver Myths

I followed our election pretty closely. My favorite source of information was Nate Silver. He’s a smart guy, and I love the analysis that he does. He’s using solid math in a good way to produce excellent results. But in the aftermath of the election, I’ve seen a lot of bad information going around about him, his methods, and his results.

First: I keep seeing proclamations that “Nate Silver proves that big data works”.

Rubbish.

There is nothing big data about Nate’s methods. He’s using straightforward Bayesian methods to combine data, and the number of data points is remarkably small.

Big data is one of the popular jargon keywords that people use to appear smart. But it does actually mean something. Big data is using massive quantities of information to find patterns: using a million data points isn’t really big data. Big data means terabytes of information, and billions of datapoints.

When I was at Google, I did log analysis. We ran thousands of machines every day on billions of log records (I can’t say the exact number, but it was in excess of 10 billion records per day) to extract information. It took a data center with 10,000 CPUs running full-blast for 12 hours a day to process a single day’s data. Using that data, we could extract some obvious things – like how many queries per day there were for each of the languages that Google supports. We could also extract some very non-obvious things that weren’t explicitly in the data, but that were inferrable from it – like probable network topologies of the global internet, based on communication latencies. That’s big data.

For another example, look at this image produced by some of my coworkers. At foursquare, we get about five million points of checkin data every day, and we’ve got a total of more than 2 1/2 billion data points. By looking at average checkin densities, and then comparing that to checkin densities after the hurricane, we can map out precisely where in the city there was electricity, and where there wasn’t. We couldn’t do that by watching one person, or a hundred people. But by looking at the patterns in millions and millions of records, we can. That is big data.

This doesn’t take away from Nate’s accomplishment in any way. He used data in an impressive and elegant way. The fact is, he didn’t need big data to do this. Elections are determined by aggregate behavior, and you just don’t need big data to predict them. The data that Nate used was small enough that a person could do the analysis of it with paper and pencil. It would be a huge amount of work to do by hand, but it’s just nowhere close to the scale of what we call big data. And trying to do big data would have made it vastly more complicated without improving the result.

Second: there are a bunch of things like this.

The point that many people seem to be missing is that Silver was not simply predicting who would win in each state. He was publishing the odds that one or the other candidate would win in each statewide race. That’s an important difference. It’s precisely this data, which Silver presented so clearly and blogged about so eloquently, that makes it easy to check on how well he actually did. Unfortunately, these very numbers also suggest that his model most likely blew it by paradoxically underestimating the odds of President Obama’s reelection while at the same time correctly predicting the outcomes of 82 of 83 contests (50 state presidential tallies and 32 of 33 Senate races).

Look at it this way: if a meteorologist says there’s a 90% chance of rain where you live and it doesn’t rain, the forecast wasn’t necessarily wrong, because 10% of the time it shouldn’t rain – otherwise the odds would be something other than a 90% chance of rain. One way a meteorologist could be wrong, however, is by using a predictive model that consistently gives the incorrect probabilities of rain. Only by looking at the odds the meteorologist gave and comparing them to actual data could you tell in hindsight if there was something fishy with the prediction.

Bzzt. Sorry, wrong.

There are two main ways of interpreting probability data: frequentist, and Bayesian.

In a frequentist interpretation, saying that an outcome of an event has an X% probability of occurring means that if you were to run an infinite series of repetitions of the event, then on average, the outcome would occur in X out of every 100 repetitions.

The Bayesian interpretation doesn’t talk about repetition or observation. What it says is: for any specific event, it will have one outcome. There is no repetition. But given the current state of information available to me, I can have a certain amount of certainty about whether or not the event will occur. Saying that I assign probability P% to an event doesn’t mean that I expect my prediction to fail (100-P)% of the time. It just means that given the current state of my knowledge, I expect a particular outcome, and the information I know gives me that degree of certainty.

Bayesian statistics and probability is all about state of knowledge. The fundamental, defining theorem of Bayesian statistics is Bayes theorem, which tells you, given your current state of knowledge and a new piece of information, how to update your knowledge based on what the new information tells you. Getting more information doesn’t change anything about whether or not the event will occur: it will occur, and it will have either one outcome or the other. But new information can allow you to improve your prediction and your certainty of that prediction’s correctness.
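
To make that concrete (this is my own restatement, not a quote from anyone): if H is the outcome you care about and E is the new piece of information, Bayes’ theorem says

P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}

In words: your updated certainty in H is your prior certainty, rescaled by how much more likely the new evidence is when H holds than it is overall.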

The author that I quoted above is being a frequentist. In another section of his article, he’s more specific:

…The result is P = 0.199, which means there’s a 19.9% chance that it rained every day that week. In other words, there’s an 80.1% chance it didn’t rain on at least one day of the week. If it did in fact rain every day, you could say it was the result of a little bit of luck. After all, 19.9% isn’t that small a chance of something happening.

That’s a frequentist interpretation of the probability – which makes sense, since as a physicist, the author is mainly working with repeated experiments, which is a great place for the frequentist interpretation. But looking at the same data, a Bayesian would say: “I have a 19.9% certainty that it will rain today”. Then they’d go look outside, see the clouds, and say “Ok, so it looks like rain – that means that I need to update my prediction. Now I’m 32% certain that it will rain”. Note that nothing about the weather has changed: it’s not true that before looking at the clouds, 80.1 percent of the time it wouldn’t rain, and after looking, that changed. The actual fact of whether or not it will rain on that specific day didn’t change.
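
Here’s a minimal sketch of that update as a Bayes’ theorem calculation. The likelihoods of seeing clouds on rainy versus dry days are made up purely for illustration, chosen so that the numbers come out close to the ones in the example above:

```python
# A minimal sketch of the Bayesian update in the rain example.
# The likelihoods below are assumptions, purely for illustration.
prior = 0.199                 # initial certainty that it will rain

p_clouds_given_rain = 0.9     # assumed: chance of clouds if it rains
p_clouds_given_dry = 0.475    # assumed: chance of clouds if it doesn't

# Bayes' theorem: P(rain | clouds) = P(clouds | rain) P(rain) / P(clouds)
p_clouds = p_clouds_given_rain * prior + p_clouds_given_dry * (1 - prior)
posterior = p_clouds_given_rain * prior / p_clouds

print(f"updated certainty of rain: {posterior:.2f}")   # roughly 0.32
```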

Another way of looking at this is to say that a frequentist believes that a given outcome has an intrinsic probability of occurring, and that our attempts to analyze it just bring us closer to the true probability; whereas a Bayesian says that there is no such thing as an intrinsic probability, because every event is different. All that changes is our ability to make predictions with confidence.

One last metaphor, and I’ll stop. Think about playing craps, where you’re rolling two six-sided dice. For a particular die, a frequentist would say “A fair die has a 1 in 6 chance of coming up with a 1”. A Bayesian would say “If I don’t know anything else, then my best guess is that I can be 16% certain that a 1 will result from a roll.” The result is the same – but the reasoning is different. And because of the difference in reasoning, you can produce different predictions.

Nate Silver’s predictions of the election are a beautiful example of Bayesian reasoning. He watched daily polls, and each time a new poll came out, he took the information from that poll, weighted it according to the historical reliability of that poll in that situation, and then used that to update his certainty. So based on his data, Nate was 90% certain that his prediction was correct.
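
To give a flavor of what that kind of evidence-weighting looks like, here’s a toy sketch. This is emphatically not Silver’s actual model – the polls, weights, and error estimate below are all made up – it just illustrates the general idea of combining polls by reliability and turning the combined margin into a probability:

```python
import math

# Hypothetical polls: (candidate's margin in points, reliability weight).
polls = [
    (2.0, 0.9),
    (3.5, 0.6),
    (-1.0, 0.3),
]

# Reliability-weighted average of the polled margins.
total_weight = sum(w for _, w in polls)
avg_margin = sum(m * w for m, w in polls) / total_weight

# Assume the combined margin has roughly normally distributed error.
stderr = 2.5  # assumed standard error, in points
z = avg_margin / stderr

# P(true margin > 0) under that normal assumption.
win_prob = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"weighted margin: {avg_margin:.2f}, win probability: {win_prob:.2f}")
```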

Book Review: The Manga Guide to Statistics

I recently got an offer from someone at No-Starch Press to review the
newly translated book, The Manga Guide to Statistics. I received the book a couple of weeks ago, but haven’t had time to sit down and read it until now.

If you haven’t heard of the “Manga Guides”, they’re an interesting idea. In Japan, comic books (“Manga”) are much more common and socially accepted than they typically are in the US. It’s not at all unusual to see Japanese adults sitting in the subway reading Manga. Manga has a very distinctive artistic style, with its own
set of common artistic conventions. The Manga Guides are textbooks written as
Manga-style comics. In this case, it’s an introductory text on statistics.

The short version of the review: terrific book; engaging, thorough, and fun. Highly recommended. Details beneath the fold.

Continue reading

Margin of Error and Election Polls

Before I get to the meat of the post, I want to remind you that our
DonorsChoose drive is ending in just a couple of days! A small number of readers have made extremely generous contributions, which
is very gratifying. (One person has even taken me up on my offer
of letting donors choose topics.) But the number of contributions has been very small. Please, follow the link in my sidebar, go to DonorsChoose, and make a donation. Even a few dollars can make a
big difference. And remember – if you donate one hundred dollars or more, email me a math topic that you’d like me to write about, and I’ll
write you a blog article on that topic.

This post repeats a bunch of stuff that I mentioned in one of my basics posts last year on the margin of error. But given some of the awful rubbish I’ve heard in coverage of the coming election, I thought it was worth discussing a bit.

As the election nears, it seems like every other minute, we
hear predictions of the outcome of the election, based on polling. The
thing is, pretty much every one of those reports is
utter rubbish.

Continue reading

Probability Distributions

I’d like to start with a quick apology. Sorry that both the abstract algebra and the new game theory posts have been moving so slowly. I’ve been a bit overwhelmed lately with things that need doing right away, and
by the time I’m done with all that, I’m too tired to write anything that
requires a lot of care. I’ve known the probability theory stuff for so long,
and the parts I’m writing about so far are so simple that it really doesn’t take nearly as much effort.

With that out of the way, today, I’m going to write about probability distributions.

Continue reading

Random Variables

The first key concept in probability is called a random variable.
Random variables are a central concept – but since they come from the
frequentist school, they are, alas, one of the things that tends to bring out
the Bayesian wars. But the idea of the random variable, and its key position
in understanding probability and statistics, predates the divide between
frequentist and Bayesian thought. So please, folks, be a little bit patient,
and don’t bring the Bayesian flamewars into this post, OK? If you want to
rant about how stupid frequentist explanations are, please take it to the earlier post on the schools of thought in probability. I’m trying to
explain basic ideas, and you really can’t talk about probability and
statistics without talking about random variables.

Continue reading

Schools of thought in Probability Theory

To understand a lot of statistical ideas, you need to know about
probability. The two fields are inextricably entwined: sampled statistics
works because of probabilistic properties of populations.

I approach writing about probability with no small amount of trepidation.

For some reason that I’ve never quite understood, discussions of probability
theory bring out an intensity of emotion that is more extreme than anything else
I’ve seen in mathematics. It’s an almost religious topic, like programming
languages in CS. This post is intended really as a flame attractor: that is, I’d request that if you want to argue about Bayesian probability versus frequentist probability, please do it here, and don’t clutter up every comment thread that
discusses probability!

There are two main schools of thought in probability:
frequentism and Bayesianism, and the Bayesians have an intense contempt for the
frequentists. As I said, I really don’t get it: the intensity seems to be mostly
one way – I can’t count the number of times that I’ve read Bayesian screeds about
the intense stupidity of frequentists, but not the other direction. And while I
sit out the dispute – I’m undecided; sometimes I lean frequentist, and sometimes I
lean Bayesian – every time I write about probability, I get emails and comments
from tons of Bayesians tearing me to ribbons for not being sufficiently
Bayesian.

It’s hard to even define probability without getting into trouble, because the
two schools of thought end up defining it quite differently.

The frequentist approach to probability basically defines probability in terms
of experiment. If you repeated an experiment an infinite number of times, and
found that out of every 1,000 trials a given outcome occurred 350 times, then
a frequentist would say that the probability of that outcome was 35%. Based on
that, a frequentist says that for a given event, there is a true
probability associated with it: the probability that you’d get from repeated
trials. The frequentist approach is thus based on studying the “real” probability
of things – trying to determine how close a given measurement from a set of
experiments is to the real probability. So a frequentist would define probability
as the mathematics of predicting the actual likelihood of certain events occurring
based on observed patterns.
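
A quick sketch of that picture in code – the die-rolling event here is my own example, not something from the post:

```python
import random

# Estimate a probability the frequentist way: by relative frequency
# over many repeated trials. The event is rolling a 3 on a fair die.
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 3)

# As the number of trials grows, the relative frequency approaches
# the "true" probability (1/6, about 0.1667).
print(f"estimated probability: {hits / trials:.4f}")
```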

The Bayesian approach is based on incomplete knowledge. It says that you only
associate a probability with an event because there is uncertainty about it –
because you don’t know all the facts. In reality, a given event either will happen
(probability=100%) or it won’t happen (probability=0%). Anything else is an
approximation based on your incomplete knowledge. The Bayesian approach is
therefore based on the idea of refining predictions in the face of new knowledge.
A Bayesian would define probability as a mathematical system of measuring the
completeness of knowledge used to make predictions. So to a Bayesian, strictly speaking, it’s incorrect to say “I predict that there’s a 30% chance of P”, but rather “Based on the current state of my knowledge, I am 30% certain that P will occur.”

Like I said, I tend to sit in the middle. On the one hand, I think that the
Bayesian approach makes some things clearer. For example, a lot of people
frequently misunderstand how to apply statistics: they’ll take a study showing
that, say, 10 out of 100 smokers will develop cancer, and assume that it means
that for a specific smoker, there’s a 10% chance that they’ll develop cancer.
That’s not true. The study showing that 10 out of 100 people who smoke will develop cancer can be taken as a good starting point for making a prediction – but a Bayesian will be very clear on the fact that it’s incomplete knowledge, and that it therefore isn’t very meaningful unless you can add more information to increase the certainty.

On the other hand, Bayesian reasoning is often used by cranks.
A Bayesian
argues that you can do a probabilistic analysis of almost anything, by lining
up the set of factors that influence it, and combining your knowledge of those factors in the correct way. That’s been used incredibly frequently by cranks for
arguing for the existence of God, for the “fact” that aliens have visited the
earth, for the “fact” that artists have been planting secret messages in
paintings, for the “fact” that there are magic codes embedded in various holy texts, etc. I’ve dealt with these sorts of arguments numerous times on this blog; the link above is a typical example.

Frequentism doesn’t fall victim to that problem; a frequentist only
believes probabilities make sense in the setting of a repeatable experiment. You
can’t properly formulate something like a probabilistic proof of God under the
frequentist approach, because the existence of a creator of the universe isn’t a
problem amenable to repeated experimental trials. But frequentism suffers
from the idea that there is an absolute probability for things – which is often ridiculous.

I’d argue that they’re both right, and both wrong, each in their own settings. There are definitely settings in which the idea of a fixed probability based on a model of repeatable, controlled experiment is, quite simply, silly. And there
are settings in which the idea of a probability only measuring a state of knowledge is equally silly.

Introduction to Linear Regression

Suppose you’ve got a bunch of data. You believe that there’s a linear
relationship between two of the values in that data, and you want to
find out whether that relationship really exists, and if so, what the properties
of that relationship are.

Continue reading

Basic Statistics: Mean and Standard Deviation

Several people have asked me to write a few basic posts on statistics. I’ve
written a few basic posts on the subject – like, for example, this post on mean, median and mode. But I’ve never really started from the beginning, for people
who really don’t understand statistics at all.

To begin with: statistics is the mathematical analysis of aggregates. That is, it’s a set of tools for looking at a large quantity of data about a population, and finding ways to measure, analyze, describe, and understand the information about that population.

There are two main kinds of statistics: sampled statistics, and
full-population statistics. Full-population statistics are
generated from information about all members of a population; sampled statistics
are generated by drawing a representative sample – a subset of the population that should have the same pattern of properties as the full population.
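
To make the distinction concrete, here’s a small sketch. The numbers are made up, and the n-1 correction in the sample standard deviation is one conventional way the difference shows up in practice:

```python
import statistics

# A made-up "population" of measurements, and a crude subset as the sample.
population = [4, 8, 6, 5, 3, 7, 9, 5, 6, 4, 8, 7]
sample = population[::3]

# Full-population statistics use every member of the population.
pop_mean = statistics.mean(population)
pop_stdev = statistics.pstdev(population)   # population standard deviation

# Sampled statistics estimate the same quantities from the sample alone.
sample_mean = statistics.mean(sample)
sample_stdev = statistics.stdev(sample)     # sample standard deviation (n-1)

print(pop_mean, pop_stdev, sample_mean, sample_stdev)
```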

My first exposure to statistics was full-population statistics, and that’s
what I’m going to talk about in the first couple of posts. After that, we’ll move on to sampled statistics.

Continue reading

Sex Crimes and Illegal Immigrants: Misuse of Statistics for Politics

Yet another reader sent me a great bad math link. (Keep ’em coming, guys!) This one is an astonishingly nasty sleight of hand, and a great example of how people misuse statistics to support a political agenda. It’s by someone
named “Dr. Deborah Schurman-Kauflin”, and it’s an attempt to paint illegal
immigrants as a bunch of filthy criminal lowlifes. It’s titled “The Dark Side of Illegal Immigration: Nearly One Million Sex Crimes Committed by Illegal Immigrants in the United States.”

With a title like that, you’d think that she has actual data showing that nearly one million sex crimes were committed by illegal immigrants, wouldn’t you? Well, you’d be wrong.

Continue reading