{"id":617,"date":"2008-03-24T15:43:27","date_gmt":"2008-03-24T15:43:27","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2008\/03\/24\/basic-statistics-mean-and-standard-deviation\/"},"modified":"2008-03-24T15:43:27","modified_gmt":"2008-03-24T15:43:27","slug":"basic-statistics-mean-and-standard-deviation","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2008\/03\/24\/basic-statistics-mean-and-standard-deviation\/","title":{"rendered":"Basic Statistics: Mean and Standard Deviation"},"content":{"rendered":"<p> Several people have asked me to write a few basic posts on statistics. I&#8217;ve<br \/>\nwritten a few basic posts on the subject &#8211; like, for example, this post on mean, median and mode. But I&#8217;ve never really started from the beginnings, for people<br \/>\nwho really don&#8217;t understand statistics at all.<\/p>\n<p> To begin with: statistics is the mathematical analysis of aggregates. That is, it&#8217;s a set of tools for looking at a large quantity of data about a <em>population<\/em>, and finding ways to measure, analyze, describe, and understand the information about the population.<\/p>\n<p> There are two main kinds of statistics: <em>sampled<\/em> statistics, and<br \/>\n<em>full-population<\/em> statistics. Full-population statistics are<br \/>\ngenerated from information about all members of a population; sampled statistics<br \/>\nare generated by drawing a <em>representative sample<\/em> &#8211; a subset of the population that should have the same pattern of properties as the full population.<\/p>\n<p> My first exposure to statistics was full-population statistics, and that&#8217;s<br \/>\nwhat I&#8217;m going to talk about in the first couple of posts. After that, we&#8217;ll move on to sampled statistics.<\/p>\n<p><!--more--><\/p>\n<p> I learned this stuff from my father. My father was<br \/>\nworking for RCA on semiconductor manufacturing. 
They were producing circuits for<br \/>\nsatellites and military applications. They&#8217;d do a test-run of a particular<br \/>\ndesign and manufacturing process, and then test all of the chips from that<br \/>\nrun. They&#8217;d basically subject them to increasing stress until they failed. They&#8217;d get failure data about every chip in the manufacturing run. My father&#8217;s job was to<br \/>\ntake that data, and use it to figure out the basic failure properties of the run, and whether or not a full production run using that design and process would produce chips with the desired reliability.<\/p>\n<p> One evening, he brought some work home. After dinner, he spread out a ton of<br \/>\nlittle scraps of paper all over our dining room table. I (in third or fourth grade at the time) walked in and asked him what he was doing. So he explained it to me.<\/p>\n<p> The little slips were test results. They were using a test system called, if I remember correctly, a Teradyne. It printed out results on these silly little slips of paper. If you&#8217;ve ever watched &#8220;Space: 1999&#8221;, they were like the slips that come out of the computer on that show.<\/p>\n<p> Together, we went through the slips of paper, taking information off of them,<br \/>\nand putting it into long columns. Then we&#8217;d add up all of the information in<br \/>\nthe column, and start doing the statistics. We did a couple of things. We computed<br \/>\nthe mean and the standard deviation of the data; we did a linear regression;<br \/>\nand we computed a correlation coefficient. I&#8217;m going to explain each of those in turn.<\/p>\n<p> First, we come to the mean. The mean is the average of a set of values. Given<br \/>\na theoretical object which behaved individually exactly as the aggregate information would predict, the behavior of that object is the mean. To compute the mean, you sum up all of the values in the dataset, and divide by the number of<br \/>\nvalues. 
To write it formally, if your data are n values x<sub>1<\/sub>,&#8230;,x<sub>n<\/sub>, then the mean, which is usually written as<br \/>\n<span style=\"text-decoration: overline\">x<\/span>, is defined by:<\/p>\n<p><span style=\"text-decoration: overline\">x<\/span> = (1\/n)&Sigma;<sub>i=1..n<\/sub>x<sub>i<\/sub><\/p>\n<p> The mean is a tricky thing. It&#8217;s not nearly as informative as you might<br \/>\nhope. A very typical example of what&#8217;s wrong with it is an old joke: Bill Gates walks into a homeless shelter, and suddenly, the average person in the shelter is a millionaire.<\/p>\n<p> To be more concrete, suppose you had a set of salaries at a small company. The receptionist makes $30K. Two tech support guys make $40K each. Two programmers make<br \/>\n$70K each. The technical manager makes $100K. And the CEO makes $600K. What&#8217;s the mean salary? (30+40+40+70+70+100+600)\/7 = 950\/7 &asymp; 135.7, which we&#8217;ll round down to 135. So the average salary of an employee is about $135,000. But that&#8217;s more than the second-highest salary! So knowing the mean salary doesn&#8217;t tell you very much on its own.<\/p>\n<p> One fix for that is called the standard deviation. The standard deviation<br \/>\ntells you <em>how much variation<\/em> there is in the data. If everything is<br \/>\nvery close together, the standard deviation will be small. If the data is very<br \/>\nspread out, then the standard deviation will be large.<\/p>\n<p> To compute the standard deviation, for each value in the<br \/>\npopulation, you take the <em>difference<\/em> between that value and the<br \/>\nmean. You <em>square it<\/em>, so that it&#8217;s always positive. Then you take<br \/>\nthose squared differences, and take <em>their<\/em> mean. The result is<br \/>\ncalled the <em>variance<\/em>. The standard deviation is the square root<br \/>\nof the variance. 
The standard deviation is generally written &sigma;, so:<\/p>\n<p>&sigma; = sqrt((1\/n)&Sigma;<sub>i=1..n<\/sub>(x<sub>i<\/sub>-<span style=\"text-decoration: overline\">x<\/span>)<sup>2<\/sup>)<\/p>\n<p> So, let&#8217;s go back to our example. The following table shows, for each salary,<br \/>\nthe salary, the difference between the salary and the mean, and the square<br \/>\nof the difference.<\/p>\n<table border=\"1\">\n<tr>\n<th>Salary<\/th>\n<th>Difference<\/th>\n<th>Square<\/th>\n<\/tr>\n<tr>\n<td>30<\/td>\n<td>-105<\/td>\n<td>11025<\/td>\n<\/tr>\n<tr>\n<td>40<\/td>\n<td>-95<\/td>\n<td>9025<\/td>\n<\/tr>\n<tr>\n<td>40<\/td>\n<td>-95<\/td>\n<td>9025<\/td>\n<\/tr>\n<tr>\n<td>70<\/td>\n<td>-65<\/td>\n<td>4225<\/td>\n<\/tr>\n<tr>\n<td>70<\/td>\n<td>-65<\/td>\n<td>4225<\/td>\n<\/tr>\n<tr>\n<td>100<\/td>\n<td>-35<\/td>\n<td>1225<\/td>\n<\/tr>\n<tr>\n<td>600<\/td>\n<td>465<\/td>\n<td>216225<\/td>\n<\/tr>\n<\/table>\n<p> Now, we take the sum of the squares, which gives us 254975. Then we<br \/>\ndivide by the number of values (7), giving us 36425. Finally, we take<br \/>\nthe square root of that number, giving us about 191. So the standard deviation<br \/>\nof the salaries is about <em>$191,000<\/em>. That&#8217;s pretty darned big, for a mean<br \/>\nof $135,000!<\/p>\n<p> The standard deviation has a very specific meaning when the data is roughly normally distributed (the familiar bell curve). In that case, about 68 percent of the data will be within the range (<span style=\"text-decoration: overline\">x<\/span>-&sigma;, <span style=\"text-decoration: overline\">x<\/span>+&sigma;) (which we usually say as &#8220;within one standard deviation of the mean&#8221;, or even &#8220;within one sigma&#8221;); and about 95 percent of the data is within 2 sigmas of the mean. For heavily skewed data like our salary example, the percentages can be quite different.<\/p>\n<p> What should you take away from this? A couple of things. First,<br \/>\nthat these statistics are about <em>aggregates<\/em>, not individuals. 
Second,<br \/>\nthat when you see someone draw a conclusion from a mean without telling you anything more than the mean, you really don&#8217;t know enough to draw any<br \/>\nparticularly meaningful conclusions about the data. To know how much the mean tells you, you need to know how the data is distributed &#8211; and the easiest way of<br \/>\ndescribing that is by the standard deviation.<\/p>\n<p> Next post, I&#8217;ll talk about something called <em>linear regression<\/em>, which was the next thing my dad taught me when I learned this stuff. Linear regression is a way of taking a bunch of data, and analyzing it to see if there&#8217;s a simple linear relationship between some pair of attributes. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Several people have asked me to write a few basic posts on statistics. I&#8217;ve written a few basic posts on the subject &#8211; like, for example, this post on mean, median and mode. But I&#8217;ve never really started from the beginnings, for people who really don&#8217;t understand statistics at all. 
To begin with: statistics is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[61],"tags":[],"class_list":["post-617","post","type-post","status-publish","format-standard","hentry","category-statistics"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-9X","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/617","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=617"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/617\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=617"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=617"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=617"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}