Petabyte Scale Data-Analysis and the Scientific Method

Once again, there’s a silly article somewhere, and everyone hammers me with requests to write about it. It’s frankly flattering that people see
this sort of nonsense, and immediately think of asking me about it – you folks are going to give me a swelled head!

The article in question is a recent article from Wired magazine, titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”.

The basic idea behind the article is that now that we can retrieve and
analyze truly massive quantities of data, the whole scientific idea of trying to understand why and how things work is now irrelevant. The article uses Google and Craig Venter as examples of how this works:

Google, according to Mr. Anderson, doesn’t care about why
link analysis suggests that page X should be a search result for keyword
Y. And Venter has done work that basically pulls down huge quantities of genetic material from an unknown number of species, and then sequences it. He can discover new species from this data – even though he doesn’t necessarily know what the species is, what it looks like, where it came from, etc.

There’s a tiny grain of truth to the article. Massive quantities
of data and the tools to analyze that data do change science.
Researchers are finding new approaches, new ways of doing research,
based on being able to look at quantities of data that absolutely boggle the mind. Petabyte-scale computing is an amazing tool, and it makes it possible
both to do new things that we’ve never done before, and to do old things in new ways.

But the idea that massive scale data collection and computing renders the
scientific method obsolete? That we no longer need models, or theories, or experiments? That’s blatant silliness.

What it comes down to, ultimately, is another form of what I call the big numbers problem. Humans are terrible at really, truly understanding what big numbers mean. We’re fantastic at understanding numbers we can count on our fingers. We’re pretty good at understanding the meaning of numbers up to thousands or so. We start having trouble with millions. Once you get beyond billions, our instincts go right out the window.

That poor understanding of really big numbers is exploited by creationists who try to convince people that things like evolution are impossible. The arguments are nonsense, but because the numbers are so big, so incomprehensible, you can trick people.

Mr. Anderson is making a similar mistake.

When you start working at petabytes of data, you’re looking at a boggling amount of information – and it’s hard to really understand just what it all means.

The big problem with huge amounts of data is that it’s easy to find false correlations. In a huge amount of data, you expect to find tons of patterns. Many of those patterns will be statistical coincidences. And it can be very difficult to identify the ones that are.

For example, I can almost guarantee that if I were to analyze the complete log of searches on Google for the last week, I could find some pattern that encodes the complete win-loss record for the NY Giants’ last football season. And I’d similarly find another set that will predict the
complete record for their next season. How? That’s fantastically unlikely, right? There are 16 games in the regular season. The chance of randomly discovering the exact win-loss record is very small – 1 in 2^16 – that is, 1 in 65,536. How can I possibly find that win-loss record in unrelated data? How can I possibly find data making that prediction?

The answer is, huge freaking numbers. The win-loss record for a football team’s regular season consists of 16 bits. Search logs for a week of Google contain hundreds of millions of queries – each of which has at least one keyword. Suppose that there are 200 million queries in that period. (I don’t know the real number – but I’m pretty sure that’s actually low.) Suppose that each search on average consists of one five-letter word. (Again, that’s probably very low.) Each query then averages something like 30 bits of information. So the full search log of 200 million queries contains 6 billion bits. Somewhere in those 6 billion bits, I can probably find any possible 16-bit sequence.
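To make that arithmetic concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the log behaves like a stream of independent fair coin flips, which real search queries certainly do not. The function names and the 2-million-bit sample size are invented for this example.

```python
import random


def expected_occurrences(total_bits: int, pattern_bits: int) -> float:
    """Expected number of times one fixed pattern appears in random bits.

    Assumes independent fair bits: each of the (total_bits - pattern_bits + 1)
    starting positions matches with probability 1 / 2**pattern_bits.
    """
    windows = total_bits - pattern_bits + 1
    return windows / 2 ** pattern_bits


def distinct_patterns_seen(total_bits: int, pattern_bits: int, seed: int = 0) -> int:
    """Count how many different pattern_bits-wide windows occur in random bits."""
    rng = random.Random(seed)
    data = rng.getrandbits(total_bits).to_bytes((total_bits + 7) // 8, "big")
    mask = (1 << pattern_bits) - 1
    window = 0
    bits_read = 0
    seen = set()
    for byte in data:
        for shift in range(7, -1, -1):
            # Slide the window one bit to the left and append the next bit.
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_read += 1
            if bits_read >= pattern_bits:
                seen.add(window)
    return len(seen)


if __name__ == "__main__":
    print(expected_occurrences(6_000_000_000, 16))
    print(distinct_patterns_seen(2_000_000, 16), "of", 2 ** 16)
```

Under the coin-flip assumption, any particular 16-bit pattern would be expected to turn up roughly 91,500 times in 6 billion bits. The simulation uses a sample 3,000 times smaller than that, and it should still contain essentially every one of the 65,536 possible patterns.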

And 6 billion bits isn’t much data at all. It’s practically minuscule compared to some of the databases that are being analyzed today.

My point is that patterns in huge quantities of data – even seemingly incredibly unlikely patterns – become downright likely when you’re searching at petabit scale. Our intuitions about what’s likely to happen
as a result of randomness or coincidence are a total failure at massive scale.

If you try to do science without understanding – that is, all you do is look for patterns in data – then you’re likely to “discover” a whole bunch of correlations that don’t mean anything. If you don’t try to understand those correlations, then you can’t tell the real ones from the chance ones. You need understanding, theories, models, to be able to determine what’s meaningful and what’s not, and to determine how to test a correlation for real significance.
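The effect is easy to reproduce on a small scale. The sketch below is an illustration only, with made-up names and sizes: it draws one random 16-point "target" series, then searches thousands of equally random, unrelated series for the one that correlates with it best. Nothing in the data means anything, yet the best match usually looks impressive.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_points=16, n_candidates=10_000, seed=1):
    """Largest |r| between a random target and many unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Every candidate is pure noise, generated independently of the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(best_chance_correlation())
```

With only 16 data points and 10,000 candidates, the strongest correlation found this way is typically well above 0.6, even though every series is noise. Without a theory to say which relationships are plausible, that "discovery" is indistinguishable from a real one.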

Mr. Anderson mentions Google, and the fact that we don’t know why our search algorithms produce particular results for a particular query. That’s absolutely true. Do a search for a particular set of keywords, and we can’t, without a lot of work, figure out just why our algorithms produced that
result. That doesn’t mean that we don’t understand our search. Each component of the search process is well understood and motivated by some real theory of how to discover information from links. The general process is well
understood; the specific results are a black box. Mr. Anderson
is confusing the fact that we don’t know what the result will be for a particular query with the idea that we don’t know why our system works well. Web-search systems aren’t based on any kind of randomness: they find relationships between webpages based on hypotheses about how links between pages can provide information about the subject and quality of the link target. Those hypotheses are implemented and tested – and if they test out – meaning if they produce the results that the hypothesis predicts they should – then they get deployed. It’s basic science: look at data, develop a hypothesis; test the hypothesis. Just because the hypothesis and the test are based on huge quantities of data doesn’t change that.

Mr. Anderson’s other example is the discovery of a species using massive
sequencing. A recent technique that’s been used by some biologists is (very loosely speaking) to gather a quantity of biomass, puree it, sieve out the DNA, and sequence it. Doing this, researchers have found genes from previously unknown species. At the moment, we don’t know much about those new species. Without some more work, we don’t know what they look like. Depending on how much of their genetic material we’ve sequenced, we might not even know what family of species they come from!

But the biologists aren’t working blindly. In terms of genetics, they know what a species looks like. No matter how much data they analyzed, if they didn’t have an idea – a theory – about what kinds of patterns should be in the data, about what a species is, about how to identify a species from a genome fragment – then no amount of computation, no mass of data, could possibly produce the hypothesis that they discovered a new species. It’s only by understanding information, developing theories, and then testing those theories by looking at the data, that they can produce any result.

In other words, the results are the product of the scientific method. It’s just a new way of doing an experiment; a new way of analyzing data. Science is still there. The scientific process is, at its core, still exactly the same as it always was. We just have an extremely powerful new tool that scientists can use as part of that method.

0 thoughts on “Petabyte Scale Data-Analysis and the Scientific Method”

  1. Frank Quednau

    For another example of petabyte-scale data turn to the LHC. Many millions of events per second, of which a minuscule amount will actually be considered for further analysis, each event generating data in the megabyte range. No way you could do anything useful with that data if you don’t have models that you want to test, that tell you what you are actually looking for, solid assumptions about what is deemed a useful event, etc., etc.
    On the contrary, what this world may need in the near future is a new model. That’s my opinion, anyway.

  2. Paul Clapham

    I thought that reminded me of a quote of Darwin’s, and sure enough it did:
    “About thirty years ago there was much talk that geologists ought only to observe and not theorize; and I well remember someone saying that at this rate a man might as well go into a gravel-pit and count the pebbles and describe the colours. How odd it is that anyone should not see that all observation must be for or against some view if it is to be of any service!”

  3. lahdeedah

    Anderson writes in an attention-grabbing style that gets people’s hackles up, of course, but there are ways to make his hypothesis much more defensible and less silly. I’ll make the defense, taking two distinct tacks:
    – first, an argument from practical concerns
    – second, a more-philosophical argument
    Argument 1: Practical
    You can divide the “scientific” effort into two components: (i) the theory-building and understanding-oriented part, and (ii) the development of predictive and diagnostic tools (“applied science”).
    The results from (ii) are usually the economic motivation for doing science — will this mechanical device work? will these chemicals react to form this product? Any attachment to (i) engendered by economic forces is purely practical: unlike, say, witch-doctoring, astrology, or prayer, the predictive and diagnostic tools that have originated from following the scientific method have a strong empirical track record of working, and so it makes a certain amount of sense to use them.
    If we admit the possibility that there might be other ways to obtain working, reliable predictive and diagnostic tools, we can then take guesses at whether or not those tools might also find use, and by whom.
    I can speak from direct experience that limited versions of some such tools are already in use (inside, eg, health insurers and the like): large databases of health histories and related information, and various data-mined correlations between them find use in planning benefits and pricing schemes; it can be very, very helpful to know the difference between p(demographic D contracts condition X in Y years | had condition C in the past T years) and p(demographic D contracts X in Y years | did not have condition C in the past T years) even if we have no clue *why* the difference is what it is.
    Yes, the usual cautions about correlation and causation can be pointed out, and there’s always the chance that the observed pattern is just another phantom found by data-mining, but even with those risks the information thus found is useful, and used every day.
    It’s easy to imagine – and increasingly likely that we will soon start seeing – predictive tools that are really no more than straight-up probability calculations drawing from huge troves of empirical data, despite their potential shortcomings and “unscientificness”. If a giant database of health histories tells me that people with similar health histories and dietary habits who continued eating steaks after age N had life expectancy L but those who stopped eating steaks after age N had life expectancy L’ > L (even after correcting for any other factors I think to correct for), I’m probably going to seriously consider stopping my steak habit.
    Health information (both diagnostics and prediction) is an obvious area where this will play out, but watch for it in sociology, urban planning, real estate, traffic flow engineering, and any other field that studies the aggregate effects of millions of hard-to-measure (let alone rigorously define!) ‘variables’.
    It doesn’t seem unreasonable to assume that, in the somewhat-near future, the relative pragmatic importance of predictive tools derived directly from their field’s first principles will dwindle in comparison to the pragmatic importance of heuristics derived from empirical analysis over a large body of data.
    This has nothing to do with the moral or philosophical merits of having an understanding of the studied phenomena or a theory to explain it, and everything to do with pragmatics and so forth. Calling science obsolete in a scenario in which the growth of the apparent (or at least observed) predictive track record of empirical models outpaces that of the apparent track record of theory-based work by a couple of orders of magnitude is still hyperbole, but not as much as at first glance.
    Argument 2: Philosophical
    There’s the practical argument. The philosophical argument is that there’s a lot less distinction between theory-building and model-building than is sometimes admitted, and moreover that there’s not always even a completely-defensible rationale for why a particular theory is pronounced as “passing” its tests against empirical evidence.
    I’ll work through a concrete historical example involving Newtonian dynamics to explain the latter point.
    Consider a naive application of newton’s laws (as would have been done in the 17th century), giving a simple parabolic formula for the cannonball’s arc.
    In any empirical test the formula derived by naive application of newton’s laws would be somewhat wrong; to defend a strong belief that newton’s laws applied to cannonballs in the face of such evidence you’d either have to take a bit of a leap of faith — that, by repeated, “recursive” application of the principles underlying newton’s laws you could develop more-comprehensive models (eg by including air resistance, spin-and-slippage effects, etc) that would converge to a “perfect” model — or have to apply empirical evidence of a different kind: “well, these laws are always slightly off when we apply them, but they’re still mostly right for cannonballs and for dropped apples and for the planets around the sun and so on, so it seems the problem has to be not with these laws but with our application of them”. That’s an appeal-to-empiricism (usually disguised as an appeal to occam’s razor), and not as strongly defensible as one might like.
    Now let’s consider a what-if: suppose some pre-newtonian king had run bazillions of cannon-firing experiments and employed an army of calculating scribes to apply some machine-learning approach and arrived at a “perfect” formula for a cannonball’s position at time t being fired from a cannon. The complete formula would involve a great # of variables, ranging from the basic questions (how strong is the force of gravity; how much does the cannonball weigh; how much energy is released by the propellant) to very “advanced” items (how does the cannonball surface interact with the atmosphere? how thick is the atmosphere, which way is the wind blowing, what vortex-and-eddy behavior is occurring in the nearby atmosphere, etc.).
    We know now that the “perfect” formula is (in principle) derivable by repeated application of newton’s laws (to keep incorporating and correcting for factors not considered in simpler models). How likely would it be for a human, viewing the “perfect” formula but ignorant of newtonian dynamics, to ever reverse-engineer newton’s laws and how they were applied in this instance to generate this specific formula?
    This scenario shouldn’t seem so far-fetched: one way of characterizing the standard model is that we’ve conducted a modern-day version of the above scenario and found an amazingly successful model with at least a dozen parameters whose values we know — from empirical determination — but don’t understand why they take those values…but the model works. Is this science? Empirical model building?
    What principle would separate the standard model from, say, a formula that produced a survival function for a cohort of adult males given some N input variables that utilized some 40-odd parameters (with values arrived at ‘blindly’ by some machine learning method applied to the empirical data)?
    There’s argument 2. I think Anderson is at least 100 years ahead of himself in terms of something like what he’s talking about taking place, but it’s not as ridiculous as it seems from the headline.

  4. Jonathan Vos Post

    I also feel that the article was, on the surface, silly. But at a deeper level, there is a valid point.
    There is a crisis in science, and there is a crisis in mathematics, with both being due to the difference between human and computer approaches to logic and complexity.
    One of the best articles on the crisis from a mathematical and computational viewpoint is “Whither Mathematics?”, Brian Davies, Notices of the AMS, Vol 52, No. 11, Dec 2005, pp.1350-1356.
    I just an hour ago put a lengthy quotation from that on the “Michael Polanyi and Personal Knowledge” thread of the n-Category Cafe blog, with a reference to Greg Egan fiction and a glimpse at the year 2075.
    Before we address the crises in Astronomy (the Inflation theory tottering, Dark Energy, and the like), Biology (what is a “gene” now that the old paradigm has fallen in a deluge of genomic data?), and Planetary Science (now that comparative planetology has covered much of this solar system and something of 200+ others), we look back at Math.
    The triple crisis, as explained at length in the Brian Davies article may be summarized:
    #1: Kurt Gödel demolishes the Frege, Russell-Whitehead, Hilbert program.
    #2: Computer-assisted proofs have “solved” some important problems, but no human being can individually say why.
    #3: There is sometimes no assurance of global consistency.
    #2 examples include Appel & Haken on four-color theorem (1976), Tom Hales on Kepler problem (1998), the 1970s Finite Simple Group collaboration culminating in the 26 sporadic groups led by the Monster, but with Michael Aschbacher (Caltech) sewing up loose ends through 2004 and still admitting the possibility that there might be another finite simple group out there in Platonic possibility which is different from all others; that skepticism amplified by Jean-Pierre Serre.
    “We have thus arrived at the following situation. A problem that can be formulated in a few sentences has a solution more than ten thousand pages long. The proof has never been written down in its entirety, may never be written down, and as presently envisaged would not be comprehensible to any single individual. The result is important, and has been used in a variety of other problems…. but it might not be correct.”
    See also: “Science in the Looking Glass: What Do Scientists Really Know?”, E. Brian Davies, Oxford U Press, 2003.

  5. john

    Well, to be fair, statisticians have long been aware of the problem of “discover(ing) a whole bunch of correlations that don’t mean anything.” This isn’t something that we’ve somehow overlooked as a profession. Cross-validation, generalized cross-validation and false discovery rates are three approaches to dealing with this, and very effective ones too. Really, if you are engaged in a large-scale data mining exercise these days, you should have a very low fraction of false discoveries if you’ve done it well. (Many people don’t, however; horror stories abound of clients shortcutting statistical analyses to rush to publish what should be intermediate (as in: not final) results.) Pick up a copy of Hastie, Tibshirani and Friedman’s “The Elements of Statistical Learning” before you dismiss the entire idea.
    OTOH, I think we can frame the issue a little differently. Science is based, ultimately, on observation, since without it, the scientific method can’t be applied. Trawling through huge amounts of data with good algorithms provides you with observations (the outputs of the algorithms) that you simply could not get any other way. An extension of the senses, as it were, in the direction of sensing patterns in huge amounts of data. At the moment, we are still in the early stages of learning how to work with these new tools – but that was once true of telescopes and spectroscopes and every new observation technology as well. Now they are just tools. I think the same will happen with data mining.
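As an illustration of the false discovery rate idea mentioned in the comment above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one standard way of controlling that rate. The function name and the sample p-values are invented for this example; a real analysis would more likely use an established statistics library.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha.

    Sort the p-values, find the largest rank k with p_(k) <= alpha * k / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Hypothetical p-values from ten separate tests.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    naive = [i for i, value in enumerate(p) if value < 0.05]
    print("naive p < 0.05:", naive)
    print("Benjamini-Hochberg:", benjamini_hochberg(p))
```

On these made-up numbers, the naive cutoff accepts five results while the procedure keeps only the two strongest. The correction gets stricter as the number of tests grows, which is what keeps a large trawl from drowning in coincidences.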

  6. Patness

    I was hoping you’d do an article on this.
    It’s putting an awful lot of stock in our mathematical methods and systems of logic.

  7. Cooper

    Pretty much that same method was responsible for the discovery of the Pythagorean theorem, as well as a large number of other ‘theorems’ which turned out to be false.

  8. nkirby

    I like the idea contained in lahdeedah’s comment that perhaps what may actually happen is that large scale computation may lead to a decoupling of technology and science. However, I think that part of the modelling and theory-making enterprise is to seek more efficient ways of using our computational resources. Perhaps Einstein’s “any fool can make things complex” will give way to a saying like “any fool can make a computer crash” or as Carl-Erik Froeberg once quipped, “Never in the history of mankind has it been possible to produce so many wrong answers so quickly!”

  9. Dave

    Re: #5. The biggest problem, to my mind, with the supposed 1976 “proof” of the 4-color theorem is with the computer program that did the proof. Nobody ever proved that the program itself was correct. This was a particular bugbear of one of my professors when I was in grad school. This is not to say that I don’t believe the 4-color theorem, just that there was no reason to accept the proof as-is.

  10. vision scientist

    This sort of reasoning reminds me of a paper at CVPR 2008 this year where millions of frontal face images were used to learn a manifold. The fact of the matter is that frontal face images under illumination variation span a 9 dimensional subspace. Unless you account for this fact you can “discover” any arbitrary “manifolds” for such data. It is absolutely imperative that the underlying physics behind generation of data be accounted for before you construct models using petabytes of data.

  11. Stephen Wells

    Charles Darwin put it rather well when he said something to this effect: every observation must be for or against a particular point of view if it is to be of any use. Otherwise, you might as well go into a gravel pit and spend all day numbering the pebbles.
    Just because we now have much more advanced pebble-numbering technology does not mean it’ll give you anything useful without method.

  12. Torbjörn Larsson, OM

    This reminds me of the suggestion to use bayesian theory building to keep all possible hypotheses, correct or falsified, and let bayesian learning sort out the best fit at any particular time and for any particular problem.
    Perhaps you can do that, and even get a terrific fit. But the essence of the scientific method isn’t to use everything that could possibly work at a given time but to throw away what verifiably doesn’t at any time, and keep the simplest of what remains. It makes testing, understanding and application so much easier.

    “Somewhere in that 6 billion bits, I can probably find any possible 16 bit sequence.”

    I need coffee – but isn’t that a 6*10^9/ (16!) or ~ 10^-4 probability to find a certain sequence permutation?

  13. Torbjörn Larsson, OM

    “In any empirical test the formula derived by naive application of newton’s laws would be somewhat wrong; to defend a strong belief that newton’s laws applied to cannonballs in the face of such evidence you’d either have to take a bit of a leap of faith”

    Philosophy indeed, that doesn’t seem to have anything to do with what actually is going on. We use testing on uncertainty. No faith is ever necessary to accept an approximate model, just ‘mindless’ observation.
    Unless you believe in absolute Truth or similar religious stuff.
    You may want to establish trust in the method, but that is another characteristic. (Curiously, another obvious method related thing philosophy wants to avoid analyzing AFAIU.) “Appeal-to-empiricism”, is that like kicking the wall to find out that it hurts? And why would anyone want to do that?
    @ JVP:
    Heh! I was immediately thinking of the same press release. Data mining (with adequate modeling) is the future.

  14. Tercel

    I agree; in fact, I’ll go one better. Mark makes the point (in short) that theory is necessary to make the distinction between a meaningful pattern in data, and a chance occurrence. I feel that theory is also necessary to even look at a data set in the correct way. In other words, you process a data set with a given technique; theory is necessary not only to interpret the results, but also to develop that technique in the first place.
    Particularly with large data sets, it is not always possible to poke around your data, trying things at random, hoping to find something meaningful. Even worse, it is quite possible to apply a series of tools to a data set, maybe guided by intuition, and arrive at something that looks like a result; but without deriving your analysis from theory, you cannot really trust it.
    I say this from my experience as both a student, and now an engineer at a large national laboratory research facility. In fact, even now I am working to design an analysis technique for finding features with specific properties in extremely large images. I can easily write a routine which will pick out hundreds of thousands of such features — so many that the results are as incomprehensible to humans as the original data. Only by basing the analysis on an understanding of the principles which govern the data can I produce a result which is both useful and correct.

