Suppose you’ve got a bunch of data. You believe that there’s a linear
relationship between two of the values in that data, and you want to
find out whether that relationship really exists, and if so, what the properties
of that relationship are.
Once again, I’ll use an example based on the first one that my father showed me. He was working in semiconductor manufacturing. One of the
tests they did was to expose an integrated circuit to radiation, to determine
how much radiation you could expect it to be exposed to before it failed. (These were circuits for satellites, which are exposed to a lot of radiation!)
The expectation was that there was a linear relationship between
failure rate and radiation level. Expose a batch of chips to a given level R of radiation, and you’ll get an X% failure rate. Expose the chips to double the
amount of radiation, and the failure rate will double. This relationship
held up pretty well – even when doubling would have pushed the predicted failure rate above 100%, they’d simply see all of the chips fail.
So we’d start with a list of radiation levels and the failure rates measured at each level. This is a very small dataset; in reality, we’d have hundreds or thousands of data points, but I’m keeping it small for the sake of simplicity.
There are a couple of things that we’d like to know from this data. First,
can we predict the failure rate for a particular exposure? And if so, how reliable will that prediction be?
To answer that, we first need to have some idea of what kind of relationship we’re looking for. In this case, we believe that the relationship should be linear.
(If it’s not, then we wouldn’t use linear regression, and this would be a different post.) So what we’re looking for is a linear equation that captures the relationship in that data. Since it’s experimental data, it’s not perfect – there are experimental errors, and there are extraneous factors. So no line is going to be an exact fit to the data; what we want is the best possible fit.
If you remember your algebra, a linear equation takes
the form y=mx+b, where “m” and “b” are constants. In this case, “x” is the
radiation level, and “y” is the failure rate. So what we want to do is to take
the set of data, and find values for “m” and “b” that give us the best-fit line for the data. The way we define best fit is: the line where the mean-square difference between the measured y-values and the y-values predicted by the line is as small as possible.
Mean-square in the above basically means that we take the difference between
the observed and the predicted, square it, and take the mean. That mean-square
value is what we want to minimize. But why square it? The intuitive answer comes
from an example. What’s a better fit: one where for a given pair of points, the
distances are -10 and 15, or one where the distances are 7 and -2? If we use mean
distance, then the two are the same – the mean distance in each case is 2.5. If we
use mean-square, then we’ve got 162.5 in the first case, and 26.5 in the second. The fit with distances of 7 and -2 is the better one.
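Here’s a quick sketch of that comparison in Python, using the same residuals as in the example above (the distances between observed and predicted values):

# Compare two candidate fits by mean residual vs. mean squared residual.
# The residuals are the ones from the example: (-10, 15) and (7, -2).

def mean(values):
    return sum(values) / len(values)

def mean_square(values):
    return mean([v ** 2 for v in values])

fit_a = [-10, 15]
fit_b = [7, -2]

print(mean(fit_a), mean(fit_b))                # 2.5 and 2.5 - indistinguishable
print(mean_square(fit_a), mean_square(fit_b))  # 162.5 and 26.5 - fit_b is better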
So how do we minimize the mean-square y-distance? It’s actually pretty easy. First
we want to compute the slope. Normally, the slope is defined as ΔY/ΔX. We tweak that a bit, making it (ΔY·ΔX)/(ΔX)². Then we sum that over all of the data points, measuring each ΔX and ΔY as the distance from the mean. So what we end up with is:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
So, for our example data, x̄ = 15 and ȳ = 11. We compute the deviations from those means for each data point. If we sum up (yᵢ − ȳ)(xᵢ − x̄), we get 29.8; if we sum up (xᵢ − x̄)², we get 70. That gives us a slope of 0.42.
Once we have the slope, we need to compute the point where the regression line crosses the y axis – the intercept – which is simple: b = ȳ − m·x̄. So the intercept is 11 − 0.42×15 = 4.7.
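If you’d like to see the whole computation in code, here’s a minimal sketch in Python. The slope and intercept formulas are exactly the ones above; the radiation levels and failure rates in it are made-up numbers for illustration, not the actual measurements from the experiment.

def mean(values):
    return sum(values) / len(values)

def linear_fit(xs, ys):
    """Least-squares fit: m = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)², b = ȳ - m·x̄."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    b = y_bar - m * x_bar
    return m, b

# Hypothetical radiation levels and failure rates, for illustration only.
radiation = [5, 10, 15, 20, 25]
failures = [6, 9, 10, 13, 17]

m, b = linear_fit(radiation, failures)
print(f"failure rate ≈ {m:.2f} * radiation + {b:.2f}")
print(f"predicted failure rate at radiation level 12: {m * 12 + b:.1f}")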
The resulting line – the “best fit” line according to our linear regression – is shown in the diagram. You can see it’s a reasonably good fit. But how good? There’s a standard
measure of the quality of a line-fit in a linear regression, called the correlation coefficient. The correlation coefficient describes how well a simple linear relationship matches a set of data (which connects to an interesting point that I’ll come back to in a moment). The correlation coefficient is unitless, and varies between -1 and +1. If the data is a perfect negative correlation – meaning that as X increases, Y decreases in a perfectly linear fashion – then the correlation coefficient will be -1. If the data is a perfect positive correlation – meaning that as X increases, Y also increases in a perfectly linear fashion – then the correlation coefficient will be +1. If there is absolutely no linear relationship between X and Y, then the correlation coefficient will be zero.
What’s a “good” correlation coefficient? There is no single answer. If you’re working on
an experiment where you’ve done a really good job of isolating out any external factors, and
where you’ve got very precise equipment, the correlation coefficient required to conclude
that you’ve correctly isolated a relationship could be very high – greater than 0.99. On the other hand, in social science work, anything over 0.5 is often considered good enough to
infer a relationship.
The definition of the correlation coefficient is a bit messy. In fact, there are several
different versions of it used for different purposes. The most common one is the Pearson
correlation, which is what I’m going to describe. The basic idea of the Pearson correlation
coefficient is to compute something called the covariance of the two variables in
the equation, and then divide that by the product of the standard deviation of the two
variables. The covariance is something that’s usually defined via probability theory, and I don’t want to get into the details in this post, but the basic idea of it is: for each data point (x,y), take the product of the difference between x and the mean of x, and the difference between y and the mean of y. Average that product over all of the data points, and that’s the covariance of the variables x and y.
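Here’s a sketch of that computation in Python, continuing with the same made-up illustrative data as before. It computes the covariance and the two standard deviations “population style” (dividing by the number of points); those factors of n cancel in the ratio, so you get the same answer with plain sums as long as you’re consistent.

import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance(x, y) / (stddev(x) * stddev(y))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return covariance / (std_x * std_y)

# Hypothetical radiation levels and failure rates, for illustration only.
radiation = [5, 10, 15, 20, 25]
failures = [6, 9, 10, 13, 17]
print(round(pearson_r(radiation, failures), 2))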
For our particular data set, the correlation coefficient comes out to 0.88 – a very
high correlation for such a small data set.
The interesting point is, we constantly hear people talk about correlation. You’ll frequently hear things like “correlation is not causation”, “X correlates with Y”, etc. Most people seem to think that correlation means nothing more than “X is related to Y”, or “if X increases, then Y increases”. In fact, correlation means something very specific: X correlates with Y if and only if there is a linear relationship between X and Y. Nothing more, and nothing less. Correlation means a direct linear relationship.
When you look at data, it’s easy to make the assumption that if, every time you change X, Y changes in a predictable linear fashion, then changing X must cause Y to change. That’s probably the single most common mistake made in the misuse of statistics.
Correlation shows that there is a relationship – but it doesn’t show why.
Real examples of this show up all the time. For a trivial example, my dog is more
likely to have an accident in the house if it’s very cold out. In fact, there’s
a strong inverse correlation between the temperature outside, and the likelihood of my
dog having an accident. Does cold weather cause my dog to have an accident? No. Cold
weather causes me to try to avoid taking him out, and because I don’t take him out
enough, he has more accidents. The cold doesn’t affect my dog’s likelihood to
have an accident directly. He’d be equally likely to have an accident if the weather
were a balmy 70 degrees out, if I didn’t take him out for a walk. He doesn’t care whether
it’s 10 or 70; he just needs to go.