# Iterative Hockey Stick Analysis? Gimme a break!

*April 29, 2010*

This past weekend, my friend Orac sent me a link to an interesting piece of bad math. One of Orac's big interests is vaccination and anti-vaccinationists. The piece is [a newsletter](http://www.soundchoice.org/Images/SCPINewsletter_April_2010.pdf) by a group calling itself the "Sound Choice Pharmaceutical Institute" (SCPI), which purports to show a link between vaccinations and autism. But instead of the usual anti-vax rubbish about thimerosal, they claim that "residual human DNA contaminants from aborted human fetal cells" cause autism.

Orac, among others, [already covered the nonsense](http://scienceblogs.com/insolence/2010/04/when_right_wing_nuts_try_to_do_science.php#more) of that from a biological/medical perspective.
What he didn't cover, and the reason he forwarded the newsletter to me, is the basis of their argument: the claim that they discovered key change points in the autism rate that correlate perfectly with the introduction of various vaccines.

In fact, they claim to have discovered three different inflection points:

1. 1979, the year that the MMR 2 vaccine was approved in the US;
2. 1988, the year that a second dose of MMR 2 was added to the recommended vaccination schedule; and
3. 1995, the year that the chickenpox vaccine was approved in the US.

They claim to have discovered these inflection points using "iterative hockey stick analysis".

First of all, "hockey stick analysis" isn't exactly a standard mathematical term, so we're on shaky ground right away. They describe hockey-stick analysis as a kind of "computational line fitting analysis", but they never identify what the actual method is, and there's no literature on exactly what "iterative hockey stick analysis" is. So I'm working from a best guess. Typically, when you try to fit a line to a set of data points, you use a technique called linear regression. The most common linear regression method is called "least squares", and their graphs look roughly like least-squares fitting, so I'm going to assume that's what they used.

What least-squares linear regression does is pretty simple, but it takes a bit of explanation. Suppose you've got a set of data points where you have good reason to believe there is one independent variable and one dependent variable.
You can plot those points on a standard graph, with the independent variable on the x axis and the dependent variable on the y axis. That gives you a scattering of points. If there really is a linear relationship between the dependent and independent variables, and your measurements were all perfect with no confounding factors, then the points would fall exactly on the line defined by that linear relationship.

But nothing in the real world is ever perfect. Our measurements always have some amount of error, and there are always confounding factors. So the points *never* fall perfectly along a line, and we want some way of defining the *best fit* to a set of data. That is: understanding that there's noise in the data, what line comes closest to describing a linear relationship?

Least squares is one simple way of answering that. For each data point, take the difference between the value predicted by the line and the actual measurement. Square that difference, and then add up all of the squared differences. The best-fit line is the line where that sum is *smallest*. I'll avoid going into detail about why you square it; if you're interested, say so in the comments, and maybe I'll write a basics post about linear regression.

One big catch here is that least-squares linear regression produces a good result *if* the data really has a linear relationship. If it doesn't, then least squares will produce a lousy fit. There are lots of other curve-fitting techniques, which work in different ways.
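To make the least-squares procedure concrete, here's a minimal sketch in Python using NumPy. The data is synthetic, made up purely for illustration: a true line plus random noise.

```python
import numpy as np

# Synthetic data: a true linear relationship y = 2x + 1, plus noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.5, size=x.size)

# Least-squares fit of a line: find the slope m and intercept b that
# minimize the sum of squared differences between m*x + b and y.
m, b = np.polyfit(x, y, deg=1)

# The quantity being minimized: the sum of squared residuals.
sse = float(np.sum((y - (m * x + b)) ** 2))
print(f"slope={m:.2f}  intercept={b:.2f}  SSE={sse:.2f}")
```

Because the noise is small relative to the trend, the recovered slope and intercept come out close to the true values of 2 and 1.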
For instance, if you want to treat your data as perfect, you can use progressively higher-degree fits until you have a polynomial curve that passes exactly through every datum in your data set. You can start by fitting a line to two points: for any two points, there's a line connecting them. Any three points can be fit precisely by a quadratic curve; any four points by a cubic curve. And so on.

Similarly, unless your data is perfectly linear, you can *always* improve a fit by partitioning the data. Just as we can fit a curve exactly to two points, then three, then four, we can fit two lines to a two-way partition of the data and get a closer match; then get closer still with three lines in a three-way partition, and four lines in a four-way partition, and so on, until there is a partition for every pair of adjacent points.

The key takeaway is that no matter *what* your data looks like, if it's not perfectly linear, you can *always* improve the fit by creating a partition.

For "hockey stick analysis", what they're doing is looking for a good place to put a partition. That's a reasonable thing to try to do, but you need to be really careful about it, because, as I described above, you can *always* find a partition. You need to make sure that you're actually finding a genuine change in the basic relationship between the dependent and independent variables, and not just noticing a random correlation.

Identifying change points like that is extremely tricky. To do it, you need to do a lot of work.
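The "partitioning always improves the fit" point is easy to demonstrate. Here's a sketch (my own illustration, not SCPI's method) that fits one line, and then the best two-segment split, to data that is genuinely linear plus noise. There is no real change point, but the split wins anyway:

```python
import numpy as np

def line_sse(x, y):
    """Sum of squared residuals of the least-squares line through (x, y)."""
    m, b = np.polyfit(x, y, deg=1)
    return float(np.sum((y - (m * x + b)) ** 2))

def best_two_segment_sse(x, y):
    """Best total SSE over every two-segment partition of the data."""
    return min(line_sse(x[:k], y[:k]) + line_sse(x[k:], y[k:])
               for k in range(2, len(x) - 1))  # each segment needs >= 2 points

# Genuinely linear data plus noise: there is NO real change point here.
rng = np.random.default_rng(1)
x = np.arange(35, dtype=float)
y = 2.0 * x + rng.normal(0.0, 3.0, size=x.size)

one_line = line_sse(x, y)
two_lines = best_two_segment_sse(x, y)
# The partitioned fit is always at least as good, and with noisy data
# it is essentially always strictly better.
print(f"one line: {one_line:.1f}   best split: {two_lines:.1f}")
```

That's exactly why finding a partition that improves the fit, by itself, proves nothing.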
In particular, you need to examine a large number of partitions of the data, in order to show that there is one specific partition that produces a better result than any of the others. And that's not enough: you can't just select one point that looks good and check whether you get a better match by splitting there. That's only a start; you need to show that the inflection point you chose is really the *best* inflection point. And you also really need to go Bayesian: work out an estimate of the chance of the inflection being an illusion, and show that the quality of the partition you found is better than what you would expect by chance.

Finding a partition point like that is, as you can see, not a simple thing to do. You need a good supply of data: for small datasets, the probability of finding a good-looking partition by chance is quite high. And you need to do a lot of careful analysis.

In general, trying to find *multiple* partition points is simply not feasible unless you have a really huge quantity of data and the slope changes are really dramatic. I'm not going to go into the details; it's basically just more Bayesian analysis. You know that there's a high probability that adding partitions to your data will improve the match quality. You need to determine, given the expected improvement from partitioning based on the distribution of your data, how much better a fit you'd need to find after partitioning to be reasonably certain that the change wasn't an artifact.

Just to show that there's *one* genuine partition point, you need to show a pretty significant change.
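One simple way to estimate the chance that an inflection is an illusion is a Monte Carlo test: simulate data with *no* change point many times, and see how large an improvement the best split finds purely by accident. This is a sketch of the kind of test I mean, not anything SCPI did; the numbers are made up for illustration:

```python
import numpy as np

def line_sse(x, y):
    m, b = np.polyfit(x, y, deg=1)
    return float(np.sum((y - (m * x + b)) ** 2))

def split_improvement(x, y):
    """Relative SSE improvement of the best two-segment fit over one line."""
    whole = line_sse(x, y)
    best = min(line_sse(x[:k], y[:k]) + line_sse(x[k:], y[k:])
               for k in range(2, len(x) - 1))
    return (whole - best) / whole

rng = np.random.default_rng(2)
x = np.arange(35, dtype=float)

# Pretend this is the observed series (here: plain linear data plus noise).
y_obs = 1.5 * x + rng.normal(0.0, 4.0, size=x.size)
observed = split_improvement(x, y_obs)

# Null model: simulate change-point-free data many times, recording how
# big an improvement the best split finds purely by chance each time.
null = [split_improvement(x, 1.5 * x + rng.normal(0.0, 4.0, size=x.size))
        for _ in range(200)]

# Fraction of null runs that beat the observed improvement: a crude p-value.
p_value = float(np.mean([n >= observed for n in null]))
print(f"observed improvement: {observed:.3f}   p ~ {p_value:.2f}")
```

Only when the observed improvement beats almost all of the null runs do you have any business claiming a genuine change point, and with only 35 points the null runs routinely produce sizable improvements.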
Exactly how much depends on how much data you have, what kind of distribution it has, and how well it correlates to the line match; but you can't do it for small changes. To show *two* genuine change points requires an extremely robust change at both points, along with showing that non-linear matches aren't better than the multiple slope changes. To show *three* inflection points is close to impossible; if the slope is shifting that often, it's almost certainly not a linear relationship.

To get down to specifics, the data set purportedly analyzed by SCPI consists of autism rates measured over 35 years. That's just *thirty-five* data points. The chance of being able to reliably identify even *one* slope change in a set of 35 data points is slim at best. Two? Ridiculous. Three? Beyond ridiculous. There's just nowhere *near* enough data to claim that you've got three different inflection points measured from 35 data points.

To make matters worse, the earliest data in their analysis comes from a *different* source than the latest data. They've got some data from the US Department of Education (1970-1987), and some data from the California Department of Developmental Services (1973-1997). And those two are measuring *different* things: the US DOE statistic is based on a count of the number of 19-year-olds with an autism diagnosis (so that data was collected in 1989 through 2006), while the California DDS statistic is based on the autism diagnosis rate for children living in California.

So: guess where one of their slope changes occurs? Go on, guess.

1988.

The slope changed in the year when they switched from mixed data to California DDS data exclusively.
Gosh, you don't think that might be a confounding factor, do you? And gosh, it's by far the largest (and therefore the most likely to be real) of the three slope changes they claim to have identified.

For the third slope change, they don't even show it on the same graph. In fact, to get it, they needed to use an *entirely different* dataset from either of the other two. Which is an interesting choice, given that the CA DDS statistic they used for the second slope change actually appears to show a *decrease* occurring around 1995. But when they switch datasets, ignoring the one they were using before, they find a third slope change in 1995, right when their other dataset shows a *decrease*.

So... let's summarize the problems here:

1. They're using an iterative line-matching technique which is, at best, questionable.
2. They're applying it to a dataset that is orders of magnitude too small to generate a meaningful result for a *single* slope change, but they use it to identify *three* different slope changes.
3. They use mixed datasets that measure different things in different ways, without any sort of meta-analysis to reconcile them.
4. One of the supposed changes occurs at the point of changeover in the datasets.
5. When one of their datasets shows a *decrease* in the slope but another shows an increase, they arbitrarily choose the one that shows an increase.