Basic Statistics: Mean and Standard Deviation

Salary	Difference	Square
30	-105	11025
40	-95	9025
40	-95	9025
70	-65	4225
70	-65	4225
100	-35	1225
600	465	216225

0 thoughts on “Basic Statistics: Mean and Standard Deviation”

bill March 24, 2008 at 4:15 pm

Be careful with your comments about the standard deviation; the `1 sd = 68 percent’ and `2 sd = 95 percent’ only work for normal distributions; they’re not true for other distributions (and, in particular, they’re not true for the example you constructed). The question of how to tell whether a distribution is `normal enough’ to use the `1 sd = 68 percent’ and `2 sd = 95 percent’ shortcuts is a problem in and of itself.

Loading...

Reply ↓
bill March 24, 2008 at 4:19 pm

Look up Chebyshev’s Inequality (on Wikipedia, for example); it states that
`At least (1 − 1/k^2) × 100% of the values are within k standard deviations from the mean’
so, for any distribution, around 75 percent of the data will be within 2 sd of the mean; a normal distribution raises that 75 percent to 95 percent.

Loading...

Reply ↓
DrugMonkey March 24, 2008 at 4:21 pm

Mark, mad propz. I am firmly convinced that the lack of a fundamental understanding of the various ways to think about central tendency and (especially) variance of the distribution is one of the biggest problems we have in public policy, public health and other personal thinking and decision making. That relate to math, anyway.
Is it possible to open a daily paper and not be bombarded by articles which muddy concepts of “average”? Which completely ignore the all-critical concept of variance from said “average”? From political polls to the sports page. From the real estate section to the local society-ball philanthropic report. the list goes on and on….

Loading...

Reply ↓
C. Chu March 24, 2008 at 4:27 pm

A very basic yet super useful post that I hope more people read. I’m a college student majoring in engineering (with hopes to be a teacher after retirement) and it always annoys the crap out of me when *anyone*, including professors or whatnot, declare that the average for some exam was a 75% so there’s no curve on a test — but the standard deviation is so high that only a few people actually did better than the average.
I just had a conversation with my girlfriend today (a math major, no less!) about how she’s frustrated with classes because she got some tests back today. She got an 80% on one of them, and the average was an 84% — “I’m worse than average!!!! =( =( =(” But how many kids actually did better than average?
It’s funny to me that most people assume that because something is average, it’s the exact middle of the set — okay, fine, I guess that’s the definition — but how often is there exactly 50% of a population below an average and 50% above?
Anyway, enough of my rambling. Good post.

Loading...

Reply ↓
bwv March 24, 2008 at 5:03 pm

The income illustration is confusing because income is not normally distributed, it follows a Pareto or power law distribution where there often is infinite variance (and in theory sometimes an infinite expectation). If you randomly sampled the wealth of 1,000 or 10,000 people on the planet and inferred the population standard deviation then Bill Gates would be a statistical impossibility (even if the standard deviation was as high as $100,000 Bill Gates would till be a close to a 600,000-sigma outlier)

Loading...

Reply ↓
john March 24, 2008 at 5:35 pm

I hope you’ll include abuse of linear regression in your next post on that topic. Some papers assume that high r squared means a good predictive relationship. R squared can be increased by putting several unrelated sets of samples on the same graph just to stretch out the axes. Also, I learned that to test the ability of a relationship to make predictions, I should randomly split the data and use half of it to generate an equation, then predict values for the other half and compare predictions with reality. There is some software now that rotates values in and out iteratively and I think mixes equation generation with prediction generation. Should that be trusted?

Loading...

Reply ↓
Anthony March 24, 2008 at 5:48 pm

Excellent post! I have been really interested in learning some stats and this really wet my whistle.

Loading...

Reply ↓
BradC March 24, 2008 at 5:57 pm

Good article.
Your first paragraph is missing the actual link to your earlier article here: http://scienceblogs.com/goodmath/2007/01/basics_mean_median_and_mode.php

Loading...

Reply ↓
6EQUJ5 March 24, 2008 at 5:58 pm

When monitoring something that should hold constant to see if it behaves, we could track the mean and standard deviation to see if they both hold constant, but a more insightful approach is to differ the actual and the nominal and monitor the RMS of the residue.
Very nearly: RMS = sqrt( mean**2 + variance )
If the RMS value stays negligibly small, we know not to bother looking further.
If the RMS is too large, we look at the mean to find any bias, and at the standard deviation to find any noise increase.

Loading...

Reply ↓
tcmJOE March 24, 2008 at 6:18 pm

A question I’ve always been a bit hazy on is the distinction between the standard deviations where the term before the sum was 1/N and 1/(N-1). Is there an appropriate time when you use the 1/(N-1) standard deviation?
Thanks!

Loading...

Reply ↓
Erik March 24, 2008 at 7:30 pm

tcmJOE:
A sample variance computed using 1/N is slightly biased (it’s expected value is not exactly the population variance), while the sample variance using 1/(N-1) is unbiased (it’s expected value is the population variance).
While biased, the 1/N method has some advantages, and the bias decreases rapidly as the sample size increases.
I’m not sure if there’s a rule on when to use one or the other, but I hope this helps.

Loading...

Reply ↓
Peter March 24, 2008 at 8:17 pm

Wow! A topic I know about!
I’m a statistician for a living.
Full-population vs. sample is a dichotomy I had not heard before. When you said “two types of statistics” I was thinking you were going to go for descriptive and inferential, which isn’t totally different from what you have, but not exactly the same, either.
The mean can be abused and confused.
What, for example, is the average time you go to bed?
Let’s say that on Saturday you went to bed at 1 AM, and on Sunday at 11PM. Add them up, that’s 12. Divide by 2 = 6. hmmmm. Maybe a 24 hour clock? 1 + 23 = 24, divide by two…. NOON! oops.

Loading...

Reply ↓
Peter March 24, 2008 at 8:31 pm

Also useful are the median and (less commonly used) the trimmed mean, or, even less common, Winsorized mean.
The median is the number that’s got half higher than it, and half lower.
The trimmed mean is the mean, after discarding some of the highest and lowest values (typically 10% or 20% of the highest and 10% or 20% of the lowest); the Winsorized mean doesn’t discard values, it substiutes the highest or lowest that’s left. So…
10 12 15 20 25 35 40 100 200
mean = 50.78
median = 25
20% trimmed mean = 35.29
20% Winsorized mean = 39.89

Loading...

Reply ↓
Peter March 24, 2008 at 8:39 pm

To C. Chu….
You seem to be slightly confusing the median and the mean
the median is the middle of a data set
the mean is the expected value…..it’s the number to guess if you pay a penalty for being wrong, and get a reward for being right.
So, if you were playing a game and had to guess the score of some random student, and got $10 if you were exactly right, $9 if you were off by 1, $8 if you were off by 2, and so on, the best thing to guess would be 80
But that’s not necessarily the middle of the data.

Loading...

Reply ↓
Chas. Owens March 25, 2008 at 12:20 am

Peter: 1am is 25 not 1 in this case: 25+23/2=24 or 12pm. Likewise if you were looking for your average rising time and woke up at 11pm, 12am, 2am, and 3am you would use -1, 0, 2, and 3: -1+0+2+3/4=1.5 or 1:30am. Don’t conflate inability to work with time properly with the confusion surrounding statistics. A better way of dealing with this would be to measure amount of time asleep and awake (going to bed at 1am after 48 hours of being awake is not the same as going to bed at 1am after being awake for 16 hours).

Loading...

Reply ↓
Dave March 25, 2008 at 1:53 am

Actually, Chas. Owens, Peter does know what he’s talking about. It is a statistical problem, and there are whole books on the subject.
I’m having a hard time finding a good introductory source on this, but here’s the wikipedia article.

Loading...

Reply ↓
Peter March 25, 2008 at 6:20 am

To Chas Owens
I was using that as an example of how you can go wrong with the mean. Whether you say the problem is about “working with the data properly” or “confusion about statistics” is, to me, irrelevant. Your method is one way of getting a right answer. (another is to take time since some point the previous day, e.g. hours after noon, but it’s the same concept, mine just avoids negative numbers)
I was just trying to point out that the adage
“There are no routine statistical questions, only questionable statistical routines” applies in all cases, even ones that appear simple

Loading...

Reply ↓
Peter March 25, 2008 at 6:35 am

To Dave
Thanks for that defense, and the links!
It’s even more complex than I thought. Chas Owens solution (which is what I would have suggested) will, I think, work in most cases. It bogs down when the angles (that is, times) are uniformly distributed:
hours after midnight: 0 4 8 12 16 20
that is midnight 4AM 8AM noon 4PM 8PM
mean = 10AM, which is sort of silly, I think we’ll agree.
But if people generally go to bed around the same time (which seems likely) then I think the methods are roughly equivalent, but right now I don’t have time to check.

Loading...

Reply ↓
Jonathan Vos Post March 25, 2008 at 9:19 am

“The mean is a tricky thing. It’s not nearly as informative as you might hope. A very typical example of what’s wrong with it is an old joke:”
The mean person in my town has approximately one breast, one ovary, one testicle, half a penis, and half a vagina.

Loading...

Reply ↓
Mark C. Chu-Carroll March 25, 2008 at 9:52 am

tcmJoe:
If you’ve got full-population data, then you use the “/N” standard deviation. When you’re using sampled data, you use “/N-1”. The basic reasoning is that the probabilistic expectation for samples is that they’ll be narrower than the full population. Using the “N-1” denominator is a compensation for that.

Loading...

Reply ↓
Chris March 25, 2008 at 10:10 am

The square root is generally written σ

No, the standard deviation is generally written σ

Loading...

Reply ↓
bill March 25, 2008 at 11:01 am

I was taught that you use the `N-1′ for sample data ’cause you’ve already used one degree of freedom to calculate the mean.

Loading...

Reply ↓
tcmJOE March 25, 2008 at 1:46 pm

Much appreciated, thank you!
When you say that using 1/(N-1) is a compensation for the slight difference in probabilistic expectation between sample and census data, is there some sort of proof that the variation is compensated by subtracting 1 from N? Would (and I’m just throwing this out) taking something like 1/(N-2) ever be a “more” accurate guess of census deviation in some cases?

Loading...

Reply ↓
Mark C. Chu-Carroll March 25, 2008 at 1:57 pm

tcmJoe:
I’ve never studied the formal derivations of the sampled standard deviation, so I may well be wrong. My father, when he taught me this stuff, told me that it was purely an empirical thing.
The fact that the sample is likely to be narrow should be sort of clear: on a sample of a very large data set, you’re likely to miss the outliers. That’s what narrows the standard deviation. So the fact that some correction will help describe that should be fairly obvious. But the specific “/N-1” correction is, I think, empirical: if you look at the standard deviation of samples versus populations, where you know the population data, “/N-1” is what produces the best result.

Loading...

Reply ↓
Canuckistani March 25, 2008 at 1:59 pm

tcmJOE,
There is! When you do a linear regression, the denominator in the unbiased estimator of the variance is N-p, where p is the number of parameters being estimated. Estimating the mean in the manner described by MarkCC can be viewed as a special case of linear regression where there is only one paremeter being estimated — hence N-1.

Loading...

Reply ↓
Canuckistani March 25, 2008 at 2:28 pm

MarkCC,
It’s not empirical — it’s an expected value, i.e., an integral. You can calculate it if you’ve got the mathematical chops. (I don’t have the chops, but I get the same formula from Bayesian posterior expectations… which is a whole other story.)
Here’s the intuitive explanation (which I got from David MacKay‘s book). When you estimate the distribution mean using the sample mean, the estimated mean minimizes the sum of the squares of the residuals (SSR). Any other estimate of the distribution mean would give a larger SSR — and in particular, the true distribution mean would give a larger SSR. The denominator N-p exactly counteracts (in expectation) the shrinkage of the SSR.
This is what people mean when they say that you use up a degree of freedom estimating the parameters.

Loading...

Reply ↓
Canuckistani March 25, 2008 at 2:34 pm

edit:
…you use up a degree of freedom estimating each parameter.

Loading...

Reply ↓
Chas. Owens March 25, 2008 at 3:38 pm

Peter and Dave:
I misspoke, I should have said “Don’t conflate inability to work with time properly with the confusion surrounding what the mean average and other statistical functions mean.” rather than “Don’t conflate inability to work with time properly with the confusion surrounding statistics.” The article is about what comes out of the mean average function and how it is useful, not how to make sure what goes into the function is meaningful. Another example of meaningless input could be mean(“running shoes”, “socks”, “slacks”, “underwear”, “shirt”) to try to get the average price of the clothes a person is wearing. In this case it is obvious that the the understanding of the data is at fault, not the understanding of the statistical function being used (because they don’t look like numbers like time does). Like the time problem, this is not an issue of the statistical functions producing data that is not very enlightening about the population (as is the case with the salaries from the article), but rather a problem of how to represent the data in such a way that the functions can operate on them. The time issue would be a wonderful thing to bring up if the article where about the GIGO rule, but this article is about what the various statistical functions mean and how to use them to get information about a population.

Loading...

Reply ↓
HJT March 25, 2008 at 6:12 pm

@Chas
Peter’s example was entirely appropriate for the article. If you read the article carefully you will see that MCC uses the example of Bill Gates walking into a homeless shelter to illustrate misuse of mean values. Peter’s bed time example leads to similarly humorous results. He knew full well that it was a silly way to compute a mean.

Loading...

Reply ↓
Peter March 25, 2008 at 6:30 pm

To Chas. Owens
No, it’s not GIGO at all.
The average of “running shoes” and “shirts” is, as you point out, obvious nonsense.
The average time going to bed is perfectly meaningful, and, as pointed out in another comment, your solution (which, I admit, was mine too) wasn’t even fully correct.
Is this a data problem? Well, clearly. But, it’s also a problem with understanding what the mean is.

Loading...

Reply ↓
Chas. Owens March 25, 2008 at 6:59 pm

HJT:
Bill Gates walking into a homeless shelter makes the mean increases the standard deviation, but the mean is still correct. Trying to take the mean of 11pm and 1am produces garbage if you naively average 23 and 1 (producing and average of 6). It isn’t a matter of the result having a large standard deviation, the result is pure garbage because the inputs where pure garbage (as bad as my example with the clothes). The data just doesn’t look like garbage because they are numbers. The example from the article shows how people abuse valid results, the time example is not a valid result.

Loading...

Reply ↓
Chas. Owens March 25, 2008 at 7:19 pm

Peter:
The clothes example suffers from the same problem as the time problem: the values of the population must be converted into a usable form before the mean is taken. It is entirely possible to find out what the mean average of the cost of the clothing a person is wearing, but first you must convert the names of the items of apparel to their monetary value. Both problems have nothing to do with the mean, except that when the mean is presented with garbage its output is also garbage.

Loading...

Reply ↓
Anonymous March 25, 2008 at 9:56 pm

Why is the sum of squares used in the standard deviation instead of the absolute value?
I have wondered this a long time, but no one has been able to give me a good answer.

Loading...

Reply ↓
Alfonz March 25, 2008 at 11:03 pm

This is why I don’t report the mean score when I hand back exams, but the median. We all understand the median: half the class scored higher, and half the class scored lower.

Loading...

Reply ↓
Liam March 26, 2008 at 8:03 am

Ooooh! I’ve always wanted better explanations of statistics than I currently have. Would you also be willing to tackle why kurtosis is important, and why we use standard deviation instead of absolute deviation?

Loading...

Reply ↓
Mark C. Chu-Carroll March 26, 2008 at 11:16 am

Anonymous:
It’s not simple to explain the reason for the square. I might try to explain it in a later post… The simple version is that the root-mean-square comes from the variance (the value of the mean-square-difference), which is intimately related to properties of the distribution. For example, when you try to do line-fitting in linear regression, the best line fit comes from minimizing the variance, not minimizing the mean-difference.

Loading...

Reply ↓
AJS March 26, 2008 at 12:05 pm

@33
The square is used because usually, in real life, you don’t calculate the standard deviation exactly how Mark has shown it here (by taking the difference between each individual sample and the mean, and squaring it).
What you want to find is the sum of (x – µ)² for all values of x, µ being the mean. Which is the sum of (x² – 2µx + µ²) for all values of x. If you simplify this, you will see that you only need the sum of the squares of the values; you don’t need to work out the difference between each sample and the mean.
I guess that in the early days of computers, not needing to revisit data was a definite advantage. You only need to keep running totals of the sum of values of x, the sum of values of x² and a count of values. If you’re sure you know what you’re doing, you can even undo a bad entry by subtracting from the relevant totals.

Loading...

Reply ↓
Mike March 26, 2008 at 5:55 pm

Mark – You have a good memory to recall the name of the testing equipment from your childhood. As it happens, Teradyne Corp. manufactures a range of semiconductor test equipment. I first learned about them from a Harvard Business School case in an MBA course.

Loading...

Reply ↓
Pedro Terán March 26, 2008 at 7:00 pm

Mark, I think you are being somewhat misleading about the square. There is nothing `wrong’ with using the absolute deviation instead of the squared deviation. For mathematical reasons, using the absolute deviation is more consistent (so to speak) with using the median, as oppposed to the mean, as a central estimate, but it’s OK. Even for regression problems, there are many papers about median-absolute-deviation instead of mean-squared-deviation estimation.
(The misleading part is when you say squares are used because they provide the `best’ fit– that’s a tautology, as minimizing the mean squared error is commonly the *definition* of what `best fit’ means.)

Loading...

Reply ↓
wcyee March 26, 2008 at 7:12 pm

I’m playing along in Excel and used the STDEV function. But it gives me a different value (206). When I tried Mark’s method, I got 190. After reading the comments, it turns out this is the N vs N-1 thing. Does this mean the STDEV function assumes I have entered sample data rather than population data? Is that the norm in these kinds of calculation programs? Sorry if this sounds too ignorant … my understaning is limited to what I’ve gleaned from the comments. Thanks!

Loading...

Reply ↓
Peter March 27, 2008 at 6:44 am

to WCYEE
Well, relying on ‘norms’ in statistical software is tricky and fraught with danger. But, here, Excel’s default probably is sensible (unusual, for Excel, to have things make sense!).
It’s very rare to have the whole population of anything.

Loading...

Reply ↓
tcmJOE March 27, 2008 at 10:15 am

Again, thank you for clarifying things.
A request: At some point would you please cover Bayesian statistics? This, of course, could be far in the future.

Loading...

Reply ↓
greta March 27, 2008 at 12:18 pm

Thank you, thank you, thank you!
I have linked your blog to my fellow co-horts at Tiffin University in Ohio! (Hi everyone!) We appreciate, as adult learners taking an online statistic class, the ability to link this new-found knowledge to real life applications!
Your topic was extremely timely for us this week! We are heading into the importance of “hypothesis testing” next week.
Thanks again,
Greta

Loading...

Reply ↓
wcyee March 28, 2008 at 1:00 am

Peter,
Thanks. It makes sense that people using statistics packages are more likely to deal with sample rather than population data. After reading the help files a bit, I discovered that Excel has the STDEVP function for population data that seems to use N rather than N-1 in the calculation. I also tried out R and the sd function assumes sample data (N-1) as well. Thank you for clearing that up.

Loading...

Reply ↓
Ashwin Nanjappa March 29, 2008 at 11:18 pm

Mark, thanks so much for writing on these basics. You have a way of describing these stuff which I love.

Loading...

Reply ↓
Dr Rick March 30, 2008 at 4:09 pm

Mark – if you assume a finite population of N objects and then consider a sample (possibly including duplicates) of n of them, it’s quite easy to analyse the situation with some big meaty-looking sums. You find the expected value of the mean, or variance, over the population by considering the set of all samples and summing over it and dividing by its size in the usual way; the values, and denominators, you want then drop out in the wash. It’s a one-side-of-paper calculation, pleasantly enough.
Interesting things to note:
– if you want sampling WITHOUT replacement/duplication, the unbiased estimator turns out to have the population size in it.
– the square root of the unbiased estimator of the variance is NOT an unbiased estimator of the standard deviation (you need Gamma functions for that, I gather).

Loading...

Reply ↓
Jonathan Vos Post April 8, 2008 at 3:38 pm

New in arXiv as of 8 April 2008.
Quadratic distances on probabilities: A unified foundation
Authors: Bruce G. Lindsay, Marianthi Markatou, Surajit Ray, Ke Yang, Shu-Chuan Chen
Comments: Published in at this http URL the Annals of Statistics (this http URL) by the Institute of Mathematical Statistics (this http URL)
Journal-ref: Annals of Statistics 2008, Vol. 36, No. 2, 983-1006
Subjects: Statistics (math.ST)
This work builds a unified framework for the study of quadratic form distance measures as they are used in assessing the goodness of fit of models. Many important procedures have this structure, but the theory for these methods is dispersed and incomplete. Central to the statistical analysis of these distances is the spectral decomposition of the kernel that generates the distance. We show how this determines the limiting distribution of natural goodness-of-fit tests. Additionally, we develop a new notion, the spectral degrees of freedom of the test, based on this decomposition. The degrees of freedom are easy to compute and estimate, and can be used as a guide in the construction of useful procedures in this class.

Loading...

Reply ↓
Virginia Jones June 5, 2009 at 7:34 pm

What are the similarities and differences between the two kinds of standard deviations and the two kinds of means? we are taking a stats course and are lost! HELP!!!

Loading...

Reply ↓
COMFORT January 5, 2010 at 3:23 pm

please i need to know how to calculate sigma values in a basic program.

Loading...

Reply ↓