When we look at the data for a population, often the first thing we do
is look at the mean. But even if we know that the distribution
is perfectly normal, the mean alone isn’t enough to tell us what we need to know about the population. We also need
to know something about how the data is spread out around the mean: that is, how wide the bell curve is around the mean.
There’s a basic measure that tells us that: it’s called the standard deviation. The standard deviation describes the spread of the data,
and is the basis for how we compute things like the degree of certainty,
the margin of error, etc.
Suppose we have a population of data points, P={p_{1},…,p_{n}}. We know that the mean is
the sum of the points (p_{i}s) divided by the number of
points |P|. The way to describe the spread is based roughly on the concept
of the average distance of the points from the mean.
So what happens if we naively compute the average of the differences between the data points and the mean? That is, if the mean is M, and the average difference is d, can we use the following?

$$d = \frac{1}{|P|} \sum_{p \in P} (p - M)$$
Unfortunately, that won’t work. If we work it through, what we’d find is that, by the definition of the mean, that average difference d will
be 0. After all, the mean is the balance point of the data: the sum of the differences is the sum of the points minus |P| times M, and M is defined as the sum of the points divided by |P|.
Put another way, the differences for the values larger than the mean (which are positive) exactly cancel the differences for the values smaller than the mean (which are negative), and so the sum, and therefore the average, must be 0.
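To see the cancellation concretely, here is a minimal Python sketch. The function name `mean_difference` and the three sample points are made up for illustration; they aren't from the salary data used later.

```python
def mean_difference(points):
    """Average of the signed differences (p - M) between each point and the mean M."""
    m = sum(points) / len(points)
    return sum(p - m for p in points) / len(points)


# Hypothetical data: the mean is 5, so the differences are -3, -1, and +4.
points = [2, 4, 9]
print(mean_difference(points))
```

Whatever list you feed it, the result is 0 (up to floating-point rounding), which is why the plain average difference tells us nothing about spread.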
How do we get around that? By making all of the distances positive. And how do we do that? Square them. The standard deviation, which is usually written σ, is a root-mean-square measure, which means that it’s the square root of
the mean (average) of the squared differences between the points and the mean. The mean of the squares is also a useful figure in its own right, called the variance; that is, the variance is σ^{2}. The standard deviation written in equational form, where M is the mean, and P is the set of points, is:

$$\sigma = \sqrt{\frac{1}{|P|} \sum_{p \in P} (p - M)^{2}}$$
Let’s run through an example. Take the list of salaries from the mean article: [ 20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35, 37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000 ]. The sum of these is 4789. There are 24 values. So the mean (rounding off to 2 significant figures) is 4789/24 = 200. So what’s the standard deviation?
- First, we’ll compute the sum of the squares of the differences:
(20-200)^{2} + (20-200)^{2} + (22-200)^{2} + (25-200)^{2} + … + (700-200)^{2} + (3000-200)^{2} =
32400+32400+31684+30625+30625+30625+29584+28900+28561+28224+27556+27225+26569+25921+25921+25600+24964+24964+24649+14400+10000+10000+250000+7840000 = 8661397.
- Then we’ll divide by the number of points: 8661397/24 = 360891. So the variance is roughly 360,000.
- Then take the square root of the variance: the square root of 360,000 = 600.
So, for our salaries, the mean is $200,000 with a standard deviation of $600,000. That right there should be enough to give us a good sense that there’s something very strange about the distribution of numbers here – because salaries can’t be less than zero, but the standard deviation is three times the size of the mean!
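The steps above can be sketched in a few lines of Python. This is only an illustration, not the program the original numbers were computed with; the helper name `std_dev` is an assumption, and it takes the mean as a parameter so that both the rounded mean (200) and the exact mean (4789/24) can be tried.

```python
import math

# Salary list from the article (units as given there).
salaries = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
            37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]


def std_dev(points, mean):
    """Population standard deviation: root of the mean squared difference from `mean`."""
    sum_of_squares = sum((p - mean) ** 2 for p in points)
    variance = sum_of_squares / len(points)
    return math.sqrt(variance)


exact_mean = sum(salaries) / len(salaries)  # 4789 / 24
rounded_mean = 200                          # the rounded value used in the text

print(sum((p - rounded_mean) ** 2 for p in salaries))  # 8661397
print(std_dev(salaries, rounded_mean))
print(std_dev(salaries, exact_mean))
```

Both versions come out a little over 600, so rounding the mean to 200 doesn't change the conclusion.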
But what does the standard deviation mean precisely? The best way to define it is in probabilistic terms. In a population P with roughly normal distribution, mean M, and standard deviation σ:
- About 2/3 (roughly 68%) of the values in P will be within the range M +/- σ.
- About 95% of the values will be within the range M +/- 2σ.
- More than 99% (about 99.7%) of the values will be within the range M +/- 3σ.
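These percentages can be checked against the normal distribution directly. Here is a small sketch using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); `normal_coverage` is just an illustrative name.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
```

Because the answer depends only on how many standard deviations you go out, the same fractions hold for any normal distribution, whatever its mean and σ.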
For any population P with mean M and standard deviation σ, regardless of whether the distribution is
normal:
- At least 1/2 of the values in P will be within the range M +/- 1.4σ.
- At least 3/4 of the values in P will be within the range M +/- 2σ
- At least 8/9 of the values in P will be within the range M +/- 3σ.
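The distribution-free guarantees come from Chebyshev's inequality (discussed in the comments below): at least 1 - 1/k^2 of the values lie within k standard deviations of the mean. Here is a rough sketch that compares that guarantee with what actually happens in the salary list; the function names are illustrative, not from any library.

```python
import math

salaries = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
            37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (for k > 1)."""
    return 1 - 1 / k ** 2


def fraction_within(points, k):
    """Observed fraction of points within k population standard deviations of the mean."""
    n = len(points)
    mean = sum(points) / n
    sigma = math.sqrt(sum((p - mean) ** 2 for p in points) / n)
    return sum(1 for p in points if abs(p - mean) <= k * sigma) / n


for k in (math.sqrt(2), 2, 3):
    print(f"k={k:.2f}: guaranteed >= {chebyshev_lower_bound(k):.3f}, "
          f"observed {fraction_within(salaries, k):.3f}")
```

For the salary list, 23 of the 24 values fall inside all three ranges, comfortably above each guaranteed minimum. The guarantee is a floor that holds for any distribution, not an estimate of where the data actually sits.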
If you have a population P which is very large, you often want to make
an estimate about the population using a sample, where a sample
is a subset P’ ⊂ P of the population. Since the standard deviation of the sample is generally slightly smaller than the standard deviation of the population as a whole, we add a correction factor for sampled populations. In the equation for the standard deviation, instead of dividing by the size of the sample, |P’|, we divide by the size of the sample minus one: |P’|-1. The ideal correction factor is a lot more complicated, but in practice, the “subtract one from the size of the sample” trick is an excellent approximation, and so it’s used nearly universally.
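A minimal sketch of the two denominators, side by side. The `sample` flag on this hypothetical `std_dev` helper switches between dividing by the number of values and dividing by one less; the five-value sample is made up for illustration.

```python
import math


def std_dev(values, sample=False):
    """Standard deviation; divides by len(values) - 1 when `values` is a sample."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((v - mean) ** 2 for v in values)
    denominator = n - 1 if sample else n
    return math.sqrt(sum_of_squares / denominator)


# Hypothetical sample of five values drawn from a larger population.
sample_values = [22, 28, 35, 40, 80]

print(std_dev(sample_values))               # divide by 5
print(std_dev(sample_values, sample=True))  # divide by 4: slightly larger
```

The corrected figure is always the larger of the two, and the gap shrinks as the sample grows, since dividing by n - 1 instead of n matters less and less.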
Next topic in the basics will be something closely related: confidence intervals and margins of error.
Ooh, I didn’t know you could make statements about the stdev of non-normal distributions. That’s pretty cool.
Mark,
I think you threw off the whole calculation by a typo: you calculated the mean as 200, but then you used the value of 2000 in calculating the deviation.
Great you are zeroing in on my suggestion! I am looking forward to the next installment. I think it is important because a lot of people see some statistical error range, and then see a value outside that range and just dump the statistics, or say this is a counter example. Based on my acquaintances, I think there is a lot of distrust of math and statistics by people who would have no trouble comprehending them because of this very topic of confidence intervals.
Nope, never mind: you still have bad typos in that calculation, but the deviation is still a bit over 600.
JBL:
I wrote 2000, but in the calculations, I actually did the correct thing, and used 200. The real calculations were done with a Haskell program using the list of salaries, so they didn’t get affected by the typos; it’s just my explanatory text that’s wrong. I’ll go fix that. Thanks for the catch!
@JBL
I think his calculation looks right, just the typing was wrong… you’ll notice that what should be (20-200)^2 = 32400, was actually written (20-2000)^2 = 32400. Also, the final (20-2000)^2 should be (3000-200)^2.
Xanthir, it uses the Chebyshev inequality http://en.wikipedia.org/wiki/Chebyshev's_inequality, P(|X-u| >= ks) <= 1/k^2
Sorry, I was too slow. MarkCC, make sure to get that last (20-2000)^2 to (3000-200)^2. BTW, haven’t posted before, but I’ve got to tell you this is definitely top 5 blog material for my interests.
Brent:
Thanks for that catch!
My second comment was trying to communicate the fact that I’d figured out that you’d done the proper calculation and just typed it into your blog wrong … in any event, it looks much better now. I hadn’t known the more general rules for standard deviation — that’s very interesting! I shall have to go meditate on why they are true. (I assume that 1.4 really means the square root of 2, right? Or is it some other similarly-sized number?)
JBL:
Don’t worry – I wrote the response to you after I saw your first comment, and posted it before I saw your second one :-).
To be honest, I actually don’t know if that 1.4 number is the square root of two! The “1.4σ” is something that I’ve had memorized for a long time, and I never thought about the fact that it’s awfully close to the square root of two.
I’ve actually known about standard deviation and bell curves for a bizarrely long time. I’m not sure exactly when I learned it – it was no later than third grade. My father, who’s a physicist who was working in semiconductor manufacturing at the time, had brought home some work – he was analyzing the results of a series of radiation tests on a sample from a production run. I wandered into the dining room where he was doing it, and asked “Hey dad, what are you doing?”. And being the kind of guy that my dad is, he proceeded to explain it to me a little at a time, stopping after each new bit to make sure he hadn’t gotten past me. We got all the way to computing standard deviation! We finally stopped when I couldn’t understand why subtracting one from the size of the population was a reasonable thing to do. (The concept of an approximate error bound was just too much, I think.)
I doubt that I picked up the “1.4” figure then… But it was certainly in my brain by the time I got to middle school.
Now that you’ve piqued my curiosity… When I have some time, I’ll hit wikipedia and mathworld, and see if I can find the derivation of Chebychev’s rule, which is where that 1.4 comes from, so that I can see whether you’re right. I’ll post a comment here with the answer (unless someone beats me to it).
OK, so I couldn’t resist. I just went and looked, and it’s so cool!
It is the square root of two. But what’s more important is that using measure theory, Chebyshev proved that
for any multiple X of σ, no more than 1/X^2 of the values will be outside of the mean +/- Xσ.
I wish I had looked that up before writing this article! But definitely, look up “Chebyshev’s inequality” on wikipedia – they have a very nice explanation of it!
And let me just add here: I think that this blog has one of the most terrific bunch of commenters anywhere on the net. You guys are great!
Chebyshev inequality: (sigma^2=variance, epsilon is an arbitrary parameter, x is a random variable, eta is the mean)
P[|x-eta|>= epsilon]
Whoops; ‘I lost my less than or equal to’ due to meta tags:
P[|x-eta| (greater than or equal to) epsilon ] less than or equal to: sigma^2/epsilon^2
And here I thought I’d get the answer out first, but MCC beat me to it…
That’s very cool! Actually, I was going to write (but decided against it), “And is 3 standing in for rt(10), as well?” Unfortunately, my statistics knowledge is so weak that I’ll need to do a bit of reading to digest why it actually works — in any event, thanks for the excellent posts (in the Basics series, but also more generally)!
Oh, while we’re doing standard deviation, it might be worth spending a paragraph or so on the difference between the terms “standard deviation” and “standard error”.
“Ooh, I didn’t know you could make statements about the stdev of non-normal distributions. That’s pretty cool.”
Posted by: Xanthir, FCD
Certainly. If the distribution is normal, then the statements using the standard deviation are more precise. If the distribution is non-normal, and you don’t know what it is, then you can use the Chebychev inequality, but it’s not as good.
Barry and others: Well, yeah, it’s obvious now. But if you hadn’t heard of the Chebychev inequality, and had only seen stdev applied to normal distributions, then it’s like a miracle. ^_^
Why is it that no-one ever honestly explains the reason for the squaring in the standard deviation?
I’m not blaming you Mark, the textbooks lie about it most of the time too, but it still annoys me to see it.
We don’t need to square the values to make them positive (we can just take the absolute value), we need to square the values because the *variance* (the square of the standard deviation), is the unique measure of spread that is additive. That is, the variance of a sum of uncorrelated random variables is the sum of the variances, a very useful property.
That’s perhaps a little advanced an explanation for a post on the basics, but that’s no excuse for misinformation.
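The additivity claim can be checked exactly on a toy case. This sketch enumerates two independent fair dice (an example chosen here for illustration, not taken from the post) and compares the variance with the mean absolute deviation; all helper names are made up.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def variance(values):
    """Population variance: mean squared difference from the mean."""
    values = list(values)
    m = mean(values)
    return mean((v - m) ** 2 for v in values)


def mean_abs_dev(values):
    """Mean absolute difference from the mean."""
    values = list(values)
    m = mean(values)
    return mean(abs(v - m) for v in values)


die = [Fraction(face) for face in range(1, 7)]
# All 36 equally likely outcomes for the sum of two independent dice.
sums = [x + y for x, y in product(die, die)]

print(variance(die), variance(sums))          # 35/12 35/6
print(mean_abs_dev(die), mean_abs_dev(sums))  # 3/2 35/18
```

The variance of the sum is exactly twice the variance of one die, while the mean absolute deviation of the sum (35/18) is not twice 3/2. That is one concrete sense in which squaring buys something that taking absolute values does not.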
Shouldn’t the article read,
“At least 8/9s of the values in P will be within the range M +/- 3σ.”?
“We don’t need to square the values to make them positive (we can just take the absolute value), we need to square the values because the *variance* (the square of the standard deviation), is the unique measure of spread that is additive. That is, the variance of a sum of uncorrelated random variables is the sum of the variances, a very useful property.
That’s perhaps a little advanced an explanation for a post on the basics, but that’s no excuse for misinformation.”
Can someone comment more on this, please? This is one of the main articles I was looking forward to: an explanation of why the (M – d) is squared rather than taken as an absolute value.
In fact, in general the error is always squared instead of using absolute values. Why is that? The best answer I got was in a numerical class, where the prof said it’s because it’s convenient to take the derivative and minimize the error.
Don’t know if you wanted to go there or not, but maybe some mention of the bootstrapping process of finding out CIs, rather than assume normality & hope that’s right, since one theme so far has been “What if the assumption of normality is off?”
A contributor to the “What are you optimistic about?” question at Edge said he was optimistic that computers will allow us to rely more on the actual patterns in the data, as opposed to hoping that the data are normally distributed and we can just take the mean plus or minus 2 SDs.
http://edge.org/q2007/q07_16.html#kosko
And let me just add here: I think that this blog has one of the most terrific bunch of commenters anywhere on the net. You guys are great!
You set the bar for readership pretty high by blogging only about math. Intelligent people love to babble about all sorts of things they have no knowledge of — look at most blogs about foreign & domestic policy — but, for some odd reason, math is not one of those things.
I found a link in the wikipedia article to another inequality which gives tighter bounds under the restriction that the probability density be unimodal.
http://en.wikipedia.org/wiki/Vysochanskiï-Petunin_inequality
For those who don’t want to click through Canuckistani’s link, the limit imposed by the V-P inequality is that at most 4/(9λ^2) can lie at or outside λ standard deviations from the mean.
Compared to the Chebychev inequality, this is more than twice as good!
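For a side-by-side look at the two bounds, here is a short sketch. It assumes the usual statement of the Vysochanskij-Petunin inequality, which applies to unimodal distributions and only for λ greater than sqrt(8/3), about 1.63; the function names are illustrative.

```python
import math


def chebyshev_tail(lam):
    """Chebyshev: at most this fraction lies lam or more standard deviations from the mean."""
    return 1 / lam ** 2


def vp_tail(lam):
    """Vysochanskij-Petunin bound for unimodal distributions (needs lam > sqrt(8/3))."""
    if lam <= math.sqrt(8 / 3):
        raise ValueError("bound in this form only applies for lam > sqrt(8/3)")
    return 4 / (9 * lam ** 2)


for lam in (2, 3):
    print(f"lambda={lam}: Chebyshev <= {chebyshev_tail(lam):.4f}, "
          f"V-P <= {vp_tail(lam):.4f}")
```

Wherever both apply, the V-P bound is 4/9 of the Chebyshev bound, so the improvement factor is 2.25 regardless of λ.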
I learned the V-P inequality as the Camp-Meidell inequality, without the limit on λ mentioned in the V-P article. It’s never been clear to me, though, whether it requires symmetry as well as unimodality.
malpollyon: one reason for using squared distances rather than absolute distances in variance computations is that the mean is the point that minimizes the sum of squared distances around it; when we get around to linear regression, you’ll realize that the mean is the least-squares estimate of central tendency for a single variable. If you use absolute distances, it turns out that the median is the minimizing point.
The fact that the mean is the point that minimizes the sum of squared distances also explains why the standard deviation of a sample will be smaller than the standard deviation of the population. In general, the population mean will be different from the sample mean, and thus the sum of squares of the sample points around the population mean will be larger than the sum of squares around the sample mean. Note that in the (usually rare) case where you actually know the population mean, rather than estimating it from the sample, you don’t subtract 1 from the denominator when trying to estimate the population SD from the sample.
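Both minimization claims are easy to try on the salary list from the post. This is only a spot check at two candidate centers, not a proof; `sum_squared` and `sum_absolute` are illustrative names.

```python
import statistics

salaries = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
            37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]


def sum_squared(points, center):
    """Sum of squared distances from the points to `center`."""
    return sum((p - center) ** 2 for p in points)


def sum_absolute(points, center):
    """Sum of absolute distances from the points to `center`."""
    return sum(abs(p - center) for p in points)


mean = statistics.mean(salaries)
median = statistics.median(salaries)

print("squared: ", sum_squared(salaries, mean), sum_squared(salaries, median))
print("absolute:", sum_absolute(salaries, mean), sum_absolute(salaries, median))
```

The mean gives the smaller sum of squares and the median gives the smaller sum of absolute distances, matching the description above.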
Aiioe:
Once you have the minimizing point correct (thanks ebohlman!) it can be convenient to make mechanical analogies.
The usual analogy is between data points and masses, because it treats the distribution as measuring spread directly. I won’t discuss it in case the basics series goes into (central) moments of distributions. Besides, it is mentioned in Wikipedia’s article on variance.
In another analogy that I like more it is probably easier to understand the squaring. It is also useful for linear regression. But the connection to spread is indirect, since it is rather about finding the expected value (regression line).
In this analogy, imagine a spring from each fixed data point to the expected value (regression line), which we assume is loose and you pull on.
When you let go of the expected value (regression line) the equilibrium will be decided by the potential energy in each spring. It turns out this gives an additive measure (it sums the energy from each spring) with squared distance in the expression. (Since the force is linear in distance.)
I also was hoping to hear an explanation of why use the sqrt of the sum of squares instead of just using absolute value. This has always made Statistics seem like voodoo to me, and I never bought the profs’ explanation that it’s “convenient”. The discussion in the comments is somewhat helpful but not detailed enough…Mark, I’ll be forever indebted if you clear this up!
In answer to George Nachman’s question – The reason you take the square root is so that SD and mean are in the same units. At least that’s the explanation that’s been floated by me several thousand times.
Recall that the idea of a root-mean-square (RMS) average is steeped in physical phenomena: If you put a volt-meter across an AC line with a diode bridge (so that current only flows in one direction), you’re going to be measuring an RMS voltage, not the “peak” voltage. So there’s a certain intuitive appeal to using this sort of measure in a more pristine mathematical sense. I suspect (but don’t know), because of Chebyshev’s Inequality, that there is a fairly compelling measure-theoretic reason to use the true 2nd-moment (i.e., squared errors) calculation that leads to SD in terms of robustness, etc. I confess I don’t have the background to assess this claim, but as above, it’s been “floated by me” a time or two.
Responding to ebohlman’s question about the V-P (a/k/a Camp-Meidell) Inequality, the requirements are just these:
1.) Uni-modal distribution
2.) Sample mean (approx.) equals median
3.) Monotonic falling-off of the distribution either side of the mode – so it doesn’t have to be symmetric per se, although most distributions will have a tendency to look that way if they obey this restriction; that is, the distribution “skirts” will tend to look approximately symmetric.
With those provisos, it’s a pretty powerful distribution-free “statistic” in itself.
BTW, I’ll second someone’s comment about the level of discussion here – I just happened across this while looking for references to the V-P Inequality, and I’ve been *extremely* impressed by both the quality of posts *and* the degree of respect accorded each participant. I teach online courses that include basic statistics, and have thus far found a number of things just in this thread that I can bring to those courses! So thanks to one and all, and keep on truckin’! I’ll be tuned in, and will try to contribute as able.
Kevin
I second Kevin’s description, this is what I have been told too.
Electrical measurements are another analogy besides the mechanical that helps me understand some probability theory.
Peak voltage is interesting at times. (Especially if you or other sensitive equipment 🙂 risk touching the source potential when grounded, and the pulse width is long enough!) At other times you would like to know about the absolute value of the varying potential (which gives the AC part) on top of the static potential (which gives the DC part).
But these won’t tell you, or tell you easily, how to add up different power sources or drains, which is what RMS values help with. Btw, total RMS values were for that very reason fairly simple to measure even with analog instruments, as the power exhibited by a resistance, which is another reason you see them a lot here.
If you analyze signals as above you can differentiate signal power from noise power to make signal-to-noise ratios (SNR), which is another interesting measure.
And here I suspect Kevin’s suspicion about the robustness of these measures, based in Chebyshev’s inequality (many values close to mean), may have some direct illustrations. At least in connection with the ubiquitousness of normal distributions. Gaussian white noise, with values centered around the mean, is a typical (‘robust’) theoretical noise behavior in many systems.
Everyone here seems to have it together regarding Standard deviation. Perhaps you could enlighten me on how to calculate the SD of a series of values that are all less than one? My specific problem is the series:
0.107
0.110
0.092
0.090
0.120
Larry:
Why would the fact that the values are less than one make any difference?
The procedure is the same – exactly as described in the post above. Compute the mean of your values. Compute the sum of the squares of the differences between the values and the mean. Divide by the number of values (or the number of values minus 1 if you’re working from a sample), and then take the square root.
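For the five values in the question, a quick sketch using Python's standard `statistics` module looks like this. `pstdev` treats the values as the whole population (divide by the number of values) and `stdev` treats them as a sample (divide by one less).

```python
import statistics

# The series from the question above.
values = [0.107, 0.110, 0.092, 0.090, 0.120]

print(statistics.mean(values))    # 0.1038, up to floating-point rounding
print(statistics.pstdev(values))  # whole population: divide by 5
print(statistics.stdev(values))   # sample: divide by 4
```

The results are roughly 0.0113 and 0.0127. Nothing special happens because the values are below one; the deviations are simply small numbers whose squares are smaller still.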