{"id":619,"date":"2008-03-27T14:19:07","date_gmt":"2008-03-27T14:19:07","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2008\/03\/27\/introduction-to-linear-regression\/"},"modified":"2008-03-27T14:19:07","modified_gmt":"2008-03-27T14:19:07","slug":"introduction-to-linear-regression","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2008\/03\/27\/introduction-to-linear-regression\/","title":{"rendered":"Introduction to Linear Regression"},"content":{"rendered":"<p> Suppose you&#8217;ve got a bunch of data. You believe that there&#8217;s a linear<br \/>\nrelationship between two of the values in that data, and you want to<br \/>\nfind out whether that relationship really exists, and if so, what the properties<br \/>\nof that relationship are.<\/p>\n<p><!--more--><\/p>\n<p> Once again, I&#8217;ll use an example based on the first example that my<br \/>\nfather showed me. He was working on semiconductor manufacturing. One of the<br \/>\ntests they did was to expose an integrated circuit to radiation, to determine<br \/>\nhow much radiation you could expect it to be exposed to before it failed. (These were circuits for satellites, which are exposed to a lot of radiation!)<\/p>\n<p> The expectation was that there was a linear relationship between<br \/>\nfailure rate and radiation level. Expose a batch of chips to a given level R of radiation, and you&#8217;ll get an X% failure rate. Expose the chips to double the<br \/>\namount of radiation, and the failure rate will double. This relationship<br \/>\nheld up pretty well &#8211; even when the model predicts that the doubling would go above 100, they&#8217;d see all chips fail. <\/p>\n<p> So we&#8217;d start with a list of data about failure rates. This is a very small<br \/>\ndataset; in reality, we&#8217;d have hundreds or thousands of data points, but for simplicity, I&#8217;m using a very small set.<\/p>\n<table border=\"1\">\n<tr>\n<th>Exposure<\/th>\n<th>Failure Rate<\/th>\n<\/tr>\n<tr>\n<td>10<\/td>\n<td>8.2<\/td>\n<\/tr>\n<tr>\n<td>12<\/td>\n<td>10.1<\/td>\n<\/tr>\n<tr>\n<td>14<\/td>\n<td>11.8<\/td>\n<\/tr>\n<tr>\n<td>16<\/td>\n<td>10.3<\/td>\n<\/tr>\n<tr>\n<td>18<\/td>\n<td>12.7<\/td>\n<\/tr>\n<tr>\n<td>20<\/td>\n<td>12.9<\/td>\n<\/tr>\n<\/table>\n<p> There are a couple of things that we&#8217;d like to know from this data. First,<br \/>\ncan we predict the failure rate for a particular exposure? And if so, how reliable will that prediction be?<\/p>\n<p> To answer that, we first need to have some idea of what kind of relationship we&#8217;re looking for. In this case, we believe that the relationship should be linear.<br \/>\n(If it&#8217;s not, then we wouldn&#8217;t use linear regression, and this would be a different post.) So what we&#8217;re looking for is a linear equation that captures the relationship in that data. Since it&#8217;s experimental data, it&#8217;s not perfect &#8211; there are experimental errors, there are extraneous factor. So no line is going to be<br \/>\nan exact fit to the data &#8211; so we want to find the best possible fit.<\/p>\n<p> If you remember your algebra, a linear equation takes<br \/>\nthe form y=mx+b, where &#8220;m&#8221; and &#8220;b&#8221; are constants. In this case, &#8220;x&#8221; is the<br \/>\nradiation level, and &#8220;y&#8221; is the failure rate. So what we want to do is to take<br \/>\nthe set of data, find values for &#8220;m&#8221; and &#8220;b&#8221; that provide a <em>best fit<\/em> line for the data. 
The way that we define best fit is a line where, for the given x-values, the mean-square difference between the measured y-values and the y-values predicted by our best-fit line is minimal.

Mean-square in the above basically means that we take the difference between the observed and the predicted value, square it, and take the mean over all of the points. That mean-square value is what we want to minimize. But why square it? The intuitive answer comes from an example. Which is a better fit: one where, for a given pair of points, the distances are -10 and 15, or one where the distances are 7 and -2? If we use the mean distance, the two are the same: the mean distance in each case is 2.5. If we use the mean-square, we get 162.5 in the first case and 26.5 in the second. The "7" and "-2" is the better fit.

So how do we minimize the mean-square y-distance? It's actually pretty easy. First we want to compute the slope. Normally, the slope is defined as ΔY/ΔX. We tweak that a bit, making it (ΔY·ΔX)/(ΔX²), and we measure each point's deviation from the mean rather than from another point. Summing that over all of the data points, what we end up with is:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

For our example data, x̄ = 15 and ȳ = 11. So we compute the following values for our data:

| y − ȳ | x − x̄ | (x − x̄)² |
|-------|-------|-----------|
| -2.8  | -5    | 25        |
| -0.9  | -3    | 9         |
| 0.8   | -1    | 1         |
| -0.7  | 1     | 1         |
| 1.7   | 3     | 9         |
| 1.9   | 5     | 25        |

If we sum up (y − ȳ)(x − x̄), we get 29.8; if we sum up (x − x̄)², we get 70. That gives us a slope of 29.8/70 ≈ 0.426.

Once we have the slope, we need to compute the point where the regression line crosses the y axis, which is simple: b = ȳ − m·x̄. So the intercept is 11 − 0.426×15 ≈ 4.6.

![regress.png: the data points with the best-fit regression line](https://i0.wp.com/scientopia.org/img-archive/goodmath/img_302.png)

The resulting line, the "best fit" line according to our linear regression, is shown in the diagram. You can see it's a reasonably good fit.
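Here's a minimal sketch in Python of the slope and intercept calculation just described, applied to the data from the table. The variable names are my own; the formulas are the ones above.

```python
# Least-squares slope and intercept, using the deviations-from-the-mean formula.
exposure = [10, 12, 14, 16, 18, 20]                  # x: radiation level
failure_rate = [8.2, 10.1, 11.8, 10.3, 12.7, 12.9]   # y: failure rate

n = len(exposure)
x_bar = sum(exposure) / n        # 15
y_bar = sum(failure_rate) / n    # 11

# m = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(exposure, failure_rate))
denominator = sum((x - x_bar) ** 2 for x in exposure)

m = numerator / denominator      # 29.8 / 70, about 0.426
b = y_bar - m * x_bar            # about 4.6

print(f"slope m = {m:.3f}, intercept b = {b:.2f}")
```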
But how good a fit is it? There's a standard measure of the quality of a line-fit in a linear regression, called the *correlation coefficient*. The correlation coefficient describes how well a simple linear relationship matches a set of data (which connects to an interesting point that I'll mention in a moment). The correlation coefficient is unitless, and varies between -1 and +1. If the data is a perfect negative correlation, meaning that as X increases, Y decreases in a perfectly linear fashion, then the correlation coefficient will be -1. If the data is a perfect positive correlation, that is, a perfect relationship where when X increases, Y also increases in a linear fashion, then the correlation coefficient will be +1. If there is absolutely no linear relationship between X and Y, then the correlation coefficient will be zero.

What's a "good" correlation coefficient? There is no single answer. If you're working on an experiment where you've done a really good job of isolating out any external factors, and where you've got very precise equipment, the correlation coefficient required to conclude that you've correctly isolated a relationship could be very high: greater than 0.99. On the other hand, in social science work, anything over 0.5 is often considered good enough to infer a relationship.

The definition of the correlation coefficient is a bit messy. In fact, there are several different versions of it used for different purposes. The most common one is the Pearson correlation, which is what I'm going to describe. The basic idea of the Pearson correlation coefficient is to compute something called the *covariance* of the two variables in the equation, and then divide that by the product of the standard deviations of the two variables. The covariance is something that's usually defined via probability theory, and I don't want to get into the details in this post, but the basic idea of it is: for each data point (x, y), take the product of the difference between x and the mean x, and the difference between y and the mean y. Sum that up for all data points, and that's the *covariance* of the variables x and y.

For our particular data set, the correlation coefficient comes out to 0.88: a very high correlation for such a small data set.

The interesting point is, we constantly hear people talk about correlation. You'll frequently hear things like "correlation is not causation", "X correlates with Y", and so on. Most people seem to think that correlation implies nothing more than "X is related to Y", or "if X increases, then Y increases". In fact, correlation means something very specific: X correlates with Y *if and only if* there is a linear relationship between X and Y. Nothing more, and nothing less. Correlation means a direct linear relationship.

When you look at data, it's easy to make the assumption that if, every time you change X, Y changes in a predictable linear fashion, then changing X must *cause* Y to change. That's probably the single most common mistake made in the misuse of statistics. Correlation shows that there is a relationship, but it doesn't show *why*.
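To make the Pearson calculation concrete, here's a minimal sketch in Python applying it to the radiation data above. The names are my own; since the normalizing factors cancel in the ratio, the sketch works directly with the sums of deviations.

```python
import math

exposure = [10, 12, 14, 16, 18, 20]
failure_rate = [8.2, 10.1, 11.8, 10.3, 12.7, 12.9]

n = len(exposure)
x_bar = sum(exposure) / n
y_bar = sum(failure_rate) / n

# Covariance-style sum: (x_i - x_bar) * (y_i - y_bar), summed over the data.
cov_sum = sum((x - x_bar) * (y - y_bar) for x, y in zip(exposure, failure_rate))

# Sums of squared deviations for each variable.
ss_x = sum((x - x_bar) ** 2 for x in exposure)
ss_y = sum((y - y_bar) ** 2 for y in failure_rate)

# Pearson r: covariance divided by the product of the standard deviations;
# the 1/n factors cancel, leaving this ratio of sums.
r = cov_sum / math.sqrt(ss_x * ss_y)

print(f"correlation coefficient r = {r:.2f}")  # about 0.88
```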
In fact, there&#8217;s<br \/>\na strong inverse correlation between the temperature outside, and the likelihood of my<br \/>\ndog having an accident. Does cold weather cause my dog to have an accident? No. Cold<br \/>\nweather causes me to try to avoid taking him out, and because I don&#8217;t take him out<br \/>\nenough, he has more accidents. The cold doesn&#8217;t affect my dog&#8217;s likelihood to<br \/>\nhave an accident directly. He&#8217;d be equally likely to have an accident if the weather<br \/>\nwere a balmy 70 degrees out, if I didn&#8217;t take him out for a walk. He doesn&#8217;t care whether<br \/>\nit&#8217;s 10 or 70; he just needs to go.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Suppose you&#8217;ve got a bunch of data. You believe that there&#8217;s a linear relationship between two of the values in that data, and you want to find out whether that relationship really exists, and if so, what the properties of that relationship are.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[61],"tags":[],"class_list":["post-619","post","type-post","status-publish","format-standard","hentry","category-statistics"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-9Z","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/619","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=619"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/619\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=619"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=619"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=619"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}