{"id":706,"date":"2008-11-24T16:25:59","date_gmt":"2008-11-24T16:25:59","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2008\/11\/24\/scale-how-large-quantities-of-information-change-everything\/"},"modified":"2008-11-24T16:25:59","modified_gmt":"2008-11-24T16:25:59","slug":"scale-how-large-quantities-of-information-change-everything","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2008\/11\/24\/scale-how-large-quantities-of-information-change-everything\/","title":{"rendered":"Scale: How Large Quantities of Information Change Everything"},"content":{"rendered":"<p><span class=\"technoratitag\">Technorati Tags: <a href=\"http:\/\/www.technorati.com\/tags\/scale\" rel=\"tag\">scale<\/a>, <a href=\"http:\/\/www.technorati.com\/tags\/computation\" rel=\"tag\">computation<\/a>, <a href=\"http:\/\/www.technorati.com\/tags\/information\" rel=\"tag\">information<\/a><\/span><\/p>\n<p> Since people know I work for Google, I get lots of mail from folks with odd questions, or with complaints about some Google policy, or questions about the way that Google does some particular thing.  Obviously, I can&#8217;t answer questions about Google. And even if I could, I wouldn&#8217;t. This isn&#8217;t a Google blog; this is my blog, which I write as a hobby in my free time.<\/p>\n<p> But there are intersections between my work life and my hobby. One of the ideas that underlies many of the questions I receive, and which also<br \/>\ntouches on both my work and my hobby, is the idea of <em>scale<\/em>. Scale is computer-science talk for how things change as they get bigger. 
In particular, I&#8217;m talking about the scale of information; the amount of information that we use on a daily basis has increased dramatically, and the amount of dramatic, fundamental change that has resulted is both amazing, and amazingly unnoticed by most people.<\/p>\n<p><!--more--><\/p>\n<p> What&#8217;s scale?<\/p>\n<p> Take a bunch of information, and think about how you would analyze it. Suppose you have a list of, say, music tracks. On my computer, I&#8217;ve got 4,426 tracks of music. Now, suppose I wanted to go through those tracks, and summarize what&#8217;s there. If all you want to look at is simple information &#8211; the name of the track, the artist, the length, the genre, etc., it&#8217;s trivial. You could do it in seconds on any modern computer. Heck, you could do it in seconds on any modern <em>cellphone<\/em>.<\/p>\n<p> Now, suppose that instead of 4,000 tracks, it were 4 million. That&#8217;s still<br \/>\nnot a big deal. I can still do all sorts of interesting analysis of 4 million<br \/>\nrecords in minutes. With a modern computer, it&#8217;s no big deal.<\/p>\n<p> But what if I wanted to do something more interesting, like look at the files and try to compute the tempo? Now instead of dealing with four million sets of strings, I&#8217;d need to actually look at the full content of the songs, to compute the tempo. In my music collection, the average track seems to take up a couple of megabytes; let&#8217;s say it&#8217;s just one megabyte per song. To analyze a library of 4,000 songs at one megabyte each is doable on a modern PC, but it&#8217;s going to take a while. To do it for a library of 4 <em>million<\/em> tracks is simply not feasible. 
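The gap between scanning tags and analyzing audio comes down to per-track cost multiplied by track count. A minimal back-of-envelope sketch of that arithmetic, where the per-track costs are purely illustrative assumptions (not measurements of any real analyzer):

```python
# Back-of-envelope scaling arithmetic (illustrative assumptions only):
# reading a track's metadata tag is cheap; analyzing its audio is not.
TAG_COST_S = 1e-6    # assumed: ~1 microsecond per metadata record
AUDIO_COST_S = 1.0   # assumed: ~1 CPU-second per track of audio analysis

for n_tracks in (4_000, 4_000_000):
    tag_seconds = n_tracks * TAG_COST_S
    audio_hours = n_tracks * AUDIO_COST_S / 3600
    print(f"{n_tracks:>9} tracks: tags {tag_seconds:.3f} s, "
          f"audio analysis {audio_hours:,.1f} h")
```

With these assumed numbers, metadata scanning stays within seconds even at 4 million tracks, while audio analysis grows from about an hour to over a month of machine time. Both costs scale linearly; crossing the feasibility line is purely a matter of the multiplier.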
I can buy a disk for my computer large enough to store that many tracks, but there&#8217;s no way that I&#8217;d ever be able to run an interesting sonic analysis and finish it.<\/p>\n<p> Services with music databases like LastFM, Pandora, and iTunes maintain<br \/>\ninformation on tens of millions of tracks, and many of us play with those<br \/>\nevery day without ever thinking about what&#8217;s going on. LastFM and Pandora<br \/>\nanalyze millions of tracks of music, so that when you tell them what you like, they<br \/>\ncan suggest similar things based on the musical structure. Ten years ago, having a computer do that would have been impossible. Today it&#8217;s routine.<\/p>\n<p> That&#8217;s scale &#8211; the way that things change as you increase the size of the problem.<br \/>\nThings get bigger and bigger, and eventually the size of the problem crosses a line so that the increase in size fundamentally changes the problem.<\/p>\n<p> Again, let me try to put these quantities of data into terms<br \/>\nthat give you an intuitive sense of them. I bought my first computer in 1982. It was a<br \/>\nFranklin Ace 1000 &#8211; an Apple II clone. It had a floppy disk drive, which stored<br \/>\n140,000 bytes on a 5 1\/4 inch floppy disk. My next computer was a big step up &#8211; an<br \/>\nAmiga, which stored around 800,000 bytes on a 3 1\/2 inch disk drive. Today, my<br \/>\n<em>cellphone<\/em> has a removable flash drive which can store 8,000,000,000 bytes, on<br \/>\na little tiny card about 1\/4 inch on a side, and about the thickness of a greeting<br \/>\ncard. If I were to convert the storage on my cellphone to floppy disks on my old Apple<br \/>\nclone, it would require about 57,000 disks. That would be a stack of floppy disks<br \/>\nalmost 500 feet high. <\/p>\n<p> We&#8217;re routinely carrying around quantities of data that we simply don&#8217;t have<br \/>\nthe ability to intuitively understand.  A gigabyte is a <em>billion<\/em> letters. 
What does that <em>mean<\/em>?  We just don&#8217;t think about it. The closest we come is<br \/>\nto break it down some other way; instead of calling it a billion characters, we&#8217;ll call it a hundred songs. But the availability of that quantity of data has dramatically changed the way we work, and the things we can do. I used to<br \/>\nhave a little briefcase full of cassettes for my car stereo. I carried this stupid thing around, with 100 little plastic cassettes in it, just to have a decent selection of music to listen to. Now I&#8217;ve got 887 tracks on my Android, and I <em>complain<\/em> about how I don&#8217;t have enough space for a decent selection! That&#8217;s scale &#8211; the<br \/>\namount of data I can afford to carry with me has changed dramatically, and as a result, my expectations about what&#8217;s normal have changed.<\/p>\n<p> The effects of scale are very counter-intuitive. By making the problem bigger,<br \/>\nin some cases, you make it <em>easier<\/em>. There are things that you can do<br \/>\n<em>easily<\/em> with data at massive scale that you couldn&#8217;t do at all with<br \/>\nsmall quantities of data; and there are things that you can easily do with small amounts of data that become impossible when you scale upwards.<\/p>\n<p> We&#8217;ll start with the upside. Scale means that you can do some amazing things<br \/>\nthat used to be impossible. For example, scale means that you can record the entire genomes of large numbers of organisms. Last week, my fellow ScienceBlogger <a href=\"http:\/\/scienceblogs.com\/aetiology\/2008\/11\/new_ebola_subtype_confirmed.php\">Tara Smith at Aetiology<\/a> wrote an article about a recent Ebola outbreak; doctors studying the outbreak wanted to see what strain of Ebola they were dealing with, so they sequenced it, and compared it to the existing library of Ebola genomes, and identified it as a new strain, but related to one of the known strains. 
Scientists now <em>routinely<\/em> sequence the genes of new infectious diseases, and compare them with a range of known variants. In fact, they do more than that; to quote <a href=\"http:\/\/scbysue3.myweb.uga.edu\/virion.htm\">one website<br \/>\ndiscussing this<\/a>:<\/p>\n<blockquote><p>\n&#8220;Due to the explosive growth of the field of Genomics, several<br \/>\ndatabases containing fully sequenced genomes of many viruses have been created and<br \/>\nplaced online for easy access. By taking the newly sequenced Ebola genome data,<br \/>\nscientists were able to compare its composition to that of other viruses. When the<br \/>\ntranslated genome of the Ebola virus and its respective subtypes was compared to that<br \/>\nof other viruses, something of particular interest stood out. There was an almost<br \/>\nidentical match between the immunosuppressive motifs of the oncogenic retroviruses,<br \/>\nmurine leukemia virus (MuLV) and feline leukemia virus (FeLV) and a region of the<br \/>\nEbola genome.&#8221;\n<\/p><\/blockquote>\n<p> To translate that into simpler terms: they didn&#8217;t just compare the genome of the new Ebola variant to other Ebola viruses; that wouldn&#8217;t be particularly difficult, since the Ebola genome is only about 19K of data. But they compared it to the genome of <em>every previously sequenced virus<\/em>, and found nearly identical sequences in Ebola and feline leukemia virus!<\/p>\n<p> That&#8217;s the good side of scale. Things that we wouldn&#8217;t have imagined just a few years ago are not just possible, but they&#8217;re downright routine &#8211; because carrying<br \/>\naround and manipulating massive quantities of data is now routine. Scientists now casually work with massive quantities of information, and by doing that, they&#8217;re finding completely new ways of doing science. 
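At toy scale, the kind of cross-genome comparison described above amounts to looking for short shared subsequences (motifs) between a new sequence and every entry in a library. A deliberately naive sketch, using made-up sequences; real pipelines use indexed search tools like BLAST rather than brute force:

```python
def shared_motifs(query, reference, k=8):
    """Return the set of length-k substrings common to both sequences.

    A brute-force toy version of genome comparison; real tools use
    indexed, seeded search, not exhaustive k-mer intersection.
    """
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    qry_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return qry_kmers & ref_kmers

# Made-up toy sequences -- NOT real viral genomes:
new_virus = "ATGGCGTACCTGATTGCACGTAA"
library = {
    "virus_a": "TTCCTGATTGCACGGGA",   # shares a motif with new_virus
    "virus_b": "GGGAAACCCGGGTTTAA",   # unrelated
}

for name, seq in library.items():
    print(name, sorted(shared_motifs(new_virus, seq)))
```

The point of scale is in the library: when it holds every sequenced virus rather than a handful of close relatives, the same comparison surfaces matches nobody thought to look for.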
Ten years ago, if they could have gotten state of the art equipment, the scientists looking at the Ebola genome could probably have sequenced it. But they would never have <em>dreamt<\/em> of comparing it to feline leukemia virus. But with the scale at which we can now work, it&#8217;s harder to <em>exclude<\/em> some probably irrelevant virus than it is to go ahead and compare it, just on the incredible off-chance that there might be something there. <\/p>\n<p> In some ways, it&#8217;s almost like something out of Douglas Adams&#8217; &#8220;Hitchhiker&#8217;s Guide<br \/>\nto the Galaxy&#8221; made real: we can now generate answers to questions that we didn&#8217;t even<br \/>\nask; there are answers lying in the data we have available for questions that<br \/>\nwe haven&#8217;t even thought of. By searching for apparently meaningful patterns in large<br \/>\nquantities of data, we can find answers before we even know what the questions are.<br \/>\n(Of course, just like in HHGtG, we still need to figure out what the question is.)<br \/>\nScale has dramatically changed science: we can routinely look for things and discover<br \/>\ncorrelations that we don&#8217;t expect, which opens up whole new questions that we never<br \/>\nthought of before.<\/p>\n<p> That&#8217;s the happy side of scale: the ability to mine huge quantities of data means<br \/>\nthat we can look at things and discover things in a whole new way. But there&#8217;s also<br \/>\na downside: things that would be easy with small-scale data become unmanageable on a large scale.<\/p>\n<p> In computer science, we&#8217;ve traditionally drawn a line between things that are really computable, and things that, while theoretically computable, simply take too long on any real computer to be doable. We&#8217;ve drawn that line in terms of something called algorithmic complexity. 
The line has traditionally been that anything whose<br \/>\ncomplexity can be bounded by some polynomial computed from the size of its input is computable, while anything whose complexity is bounded by some exponential in the size of its input is <em>not<\/em>. But scale changes that. When you&#8217;re working with large<br \/>\nscale data, even polynomial complexity isn&#8217;t necessarily doable. Even something<br \/>\nwhose cost grows very slowly, like sorting, becomes close to impossible at scale.<\/p>\n<p> A great example of how even people who deal with massive quantities of data<br \/>\nhaven&#8217;t really grasped what a deep fundamental change we&#8217;re seeing in the scale of<br \/>\nthings comes from the database world. A while ago, a database magazine published an<br \/>\narticle critiquing Google&#8217;s MapReduce for not being sufficiently<br \/>\nrelational-database-like. I wrote <a href=\"http:\/\/scienceblogs.com\/goodmath\/2008\/01\/databases_are_hammers_mapreduc.php\">a<br \/>\npost<\/a> responding to it, explaining why that&#8217;s wrong. I still get mail about that<br \/>\nold post from database folks who think that the stuff that&#8217;s typically done by<br \/>\nMapReduce should be done by relational databases.<\/p>\n<p> Databases were designed to deal with what were (at the time) considered massive<br \/>\nquantities of data. And as database people keep reminding me, modern databases support clustering &#8211; meaning using multiple computers in parallel to make things faster &#8211; just like Google, so that they&#8217;re more scalable than I give them credit for.<\/p>\n<p> But there&#8217;s massive, and then there&#8217;s <em>massive<\/em>. Databases are great<br \/>\nat handling large quantities of data. But there are collections of data that are simply <em>too large<\/em> for current database technology. 
And much of the reason<br \/>\nfor that isn&#8217;t just because databases haven&#8217;t been designed to work with network-scale datasets, but because many of the fundamental database algorithms and representations simply don&#8217;t work on the massive scale.<\/p>\n<p> It&#8217;s easiest to explain this through an example. This past weekend, <a href=\"http:\/\/googleblog.blogspot.com\/2008\/11\/sorting-1pb-with-mapreduce.html\">a team<br \/>\nat Google<\/a> (a team which I have <em>no connection<\/em> to; I know no one involved in<br \/>\nthe task) announced the results of trying to sort one <em>petabyte<\/em> of data,<br \/>\nstored in 10 <em>trillion<\/em> 100-byte records. Doing this required <em>4,000<\/em><br \/>\ncomputers and <em>48,000<\/em> hard drives working together for just over six hours.<\/p>\n<p> When database people talk about clustering, they&#8217;re usually talking about up to a<br \/>\ndozen or so machines. (I just did a quick search of IBM&#8217;s DB information on the web,<br \/>\nand couldn&#8217;t find any reference to using more than 12 CPUs.)<\/p>\n<p> There&#8217;s a big difference between dealing with 12 CPUs and gigabytes of data, and<br \/>\n4,000 CPUs and a petabyte of data stored on 48,000 disk drives! Modern large scale<br \/>\nsystems are increasingly closer to the petabyte scale of large map-reduce than to the<br \/>\ngigabyte scale of databases. According to the post about the petabyte sort results,<br \/>\nGoogle routinely processes <em>20 petabytes<\/em> of data every single day. I&#8217;m sure<br \/>\nthat other large internet companies are seeing similar quantities of data. Both Yahoo and Microsoft are certainly maintaining search indices that have to be in that<br \/>\nrough size range; ISPs need to log traffic information for millions of users;<br \/>\ncellphone companies must be receiving billions of data packets per hour providing them<br \/>\nwith location information about cellphones; and so on. 
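The programming model behind that petabyte sort can be sketched in miniature. This is a single-process toy showing the map, shuffle, and reduce phases on a word count (the standard pedagogical example, not anything from the Google post itself); the real system distributes each phase across thousands of machines:

```python
from collections import defaultdict
from itertools import chain

# Toy, single-process sketch of the MapReduce model:
# map each record to (key, value) pairs, shuffle by key,
# then reduce each key's list of values to one result.

def map_phase(record):
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["scale changes everything",
           "scale changes science"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(r) for r in records)))
print(counts)   # {'scale': 2, 'changes': 2, 'everything': 1, 'science': 1}
```

The reason this model scales is that the map and reduce functions are independent per record and per key, so each phase can be split across as many machines as the data requires; the shuffle is the only step that moves data between them.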
And it&#8217;s not just<br \/>\ninternet companies: maintaining genomic data on thousands of human beings (which is<br \/>\nsomething medical researchers want to do); or maintaining infectious disease tracking<br \/>\ninformation for large populations; or even managing auctions and bids on eBay all involve storing and processing data at enormous scale.<\/p>\n<p> That&#8217;s scale, and that&#8217;s downright <em>scary<\/em>. Just <em>sorting<\/em> one<br \/>\n20th of the data processed by Google every day takes a team of brilliant engineers<br \/>\nto figure out <em>how to do it<\/em>, and the result requires thousands of computers and tens of thousands of disks!<\/p>\n<p> Imagine how many computers it must take to <em>encrypt<\/em> all of that information! Companies like Google, Yahoo, Amazon, and eBay are receiving millions<br \/>\nto billions of requests per day, and they&#8217;re not storing the information about those requests in text files that any bozo can read. Just think about what it means to<br \/>\nencrypt 20 petabytes of data per day.<\/p>\n<p> There&#8217;s another important downside to scale. When we look at large quantities of information, what we&#8217;re really doing is searching for patterns. And being the kind of creatures that we are, and given the nature of the laws of probability, we <em>are<\/em> going to find patterns. Distinguishing between a real legitimate<br \/>\npattern, and something random that just happens to look like a pattern can be somewhere between difficult and impossible. Using things like Bayesian methods<br \/>\nto screen out the false positives can help, but scale means that scientists need to learn new methods &#8211; both the new ways of doing things that they couldn&#8217;t do before, and the new ways of recognizing when they&#8217;ve screwed up.<\/p>\n<p> There&#8217;s the nature of scale. Tasks that were once simple have become hard<br \/>\nor even impossible, because they can&#8217;t be done at scale. 
Tasks that were once<br \/>\nimpossible have become easy because scale makes them possible. Scale<br \/>\nchanges everything.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Technorati Tags: scale, computation, information Since people know I work for Google, I get lots of mail from folks with odd questions, or with complaints about some Google policy, or questions about the way that Google does some particular thing. Obviously, I can&#8217;t answer questions about Google. And even if I could, I wouldn&#8217;t. This [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[79],"tags":[],"class_list":["post-706","post","type-post","status-publish","format-standard","hentry","category-computation"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-bo","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/706","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/w
p\/v2\/comments?post=706"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/706\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=706"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=706"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=706"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}