Once again, there’s a silly article somewhere, and everyone hammers me with requests to write about it. It’s frankly flattering that people see
this sort of nonsense, and immediately think of asking me about it – you folks are going to give me a swelled head!
The article in question is a recent piece from Wired magazine by Chris Anderson, titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”.
The basic idea behind the article is that now that we can retrieve and analyze truly massive quantities of data, the whole scientific project of trying to understand why and how things work has become irrelevant. The article uses Google and Craig Venter as examples of how this supposedly works:
Google, according to Mr. Anderson, doesn’t care about why
link analysis suggests that page X should be a search result for keyword
Y. And Venter has done work that basically pulls down huge quantities of genetic material from an unknown number of species, and then sequences it. He can discover new species from this data – even though he doesn’t necessarily know what the species is, what it looks like, where it came from, etc.
There’s a tiny grain of truth to the article. Massive quantities
of data and the tools to analyze that data do change science.
Researchers are finding new approaches, new ways of doing research,
based on being able to look at quantities of data that absolutely boggle the mind. Petabyte-scale computing is an amazing tool, and it makes it possible
both to do new things that we’ve never done before, and to do old things in new ways.
But the idea that massive scale data collection and computing renders the
scientific method obsolete? That we no longer need models, or theories, or experiments? That’s blatant silliness.
What it comes down to, ultimately, is another form of what I call the big numbers problem. Humans are terrible at really, truly understanding what big numbers mean. We’re fantastic at understanding numbers we can count on our fingers. We’re pretty good at understanding the meaning of numbers up to thousands or so. We start having trouble with millions. Once you get beyond billions, our instincts go right out the window.
That poor understanding of really big numbers is exploited by creationists who try to convince people that things like evolution are impossible. The arguments are nonsense, but the numbers are so big, so incomprehensible, that you can use them to trick people.
Mr. Anderson is making a similar mistake.
When you start working at petabytes of data, you’re looking at a boggling amount of information – and it’s hard to really understand just what it all means.
The big problem with huge amounts of data is that it’s easy to find false correlations. In a huge amount of data, you expect to find tons of patterns. Many of those patterns will be statistical coincidences. And it can be very difficult to tell which ones are coincidences and which ones are real.
For example, I can almost guarantee that if I were to analyze the complete log of searches on Google for the last week, I could find some pattern that encodes the complete win-loss record for the NY Giants’ last football season. And I could similarly find another set that would predict the
complete record for their next season. That sounds fantastically unlikely, right? There are 16 games in the regular season. The chance of randomly discovering the exact win-loss record is very small – 1 in 2^16 – that is, 1 in 65,536. How can I possibly find that win-loss record in unrelated data? How can I possibly find data making that prediction?
The answer is, huge freaking numbers. The win-loss record for a football team’s regular season consists of 16 bits. A week of Google’s search logs contains hundreds of millions of queries – each of which has at least one keyword. Suppose that there are 200 million queries in that period. (I don’t know the real number – but I’m pretty sure that’s actually low.) Suppose that each search on average consists of one five-letter word. (Again, that’s probably very low.) Each query then averages something like 30 bits of information. So the full search log of 200 million queries contains 6 billion bits. Somewhere in those 6 billion bits, I can probably find any possible 16-bit sequence.
And 6 billion bits isn’t much data at all. It’s practically minuscule compared to some of the databases that are being analyzed today.
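If you don’t believe that a specific 16-bit record can hide in that much noise, it’s easy to check. Here’s a minimal sketch (my own toy, with a made-up win-loss record, and using far fewer bits than a real search log) showing that any particular 16-bit pattern is all but guaranteed to turn up in a modest pile of random bits:

```python
# A quick sanity check: a specific 16-bit pattern is essentially guaranteed to
# appear somewhere in a long-enough stream of random bits -- and "long enough"
# is tiny compared to a real search log. The record below is made up.
import random

random.seed(42)

n_bits = 10_000_000
# One big random integer, formatted as a string of exactly n_bits 0s and 1s.
bits = format(random.getrandbits(n_bits), f"0{n_bits}b")

# A made-up 16-game win/loss record: 1 = win, 0 = loss.
record = "1011001110100101"

# The probability that a specific 16-bit string never appears is roughly
# (1 - 2**-16) ** (n_bits - 15), which is vanishingly small here.
print("chance the record is absent:", (1 - 2 ** -16) ** (n_bits - 15))
print("record found at bit offset:", bits.find(record))
```

Run it, and the “impossible” 1-in-65,536 pattern shows up after only a few tens of thousands of bits – and that’s with less than a fifth of a percent of my (lowballed) week of search data.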
My point is that patterns in huge quantities of data – even seemingly incredibly unlikely patterns – become downright likely when you’re searching at petabit scale. Our intuitions about what’s likely to happen
as a result of randomness or coincidence are a total failure at massive scale.
If you try to do science without understanding – that is, if all you do is look for patterns in data – then you’re likely to “discover” a whole bunch of correlations that don’t mean anything. If you don’t try to understand those correlations, then you can’t tell the real ones from the chance ones. You need understanding, theories, models, to be able to determine what’s meaningful and what’s not, and to determine how to test a correlation for real significance.
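Here’s a tiny demonstration of exactly that failure mode – my own toy example, nothing from the article. Search enough random candidate series and you’ll “discover” one that correlates beautifully with your target; test it on data you held back, and the “pattern” evaporates:

```python
# A toy demonstration of chance correlations at scale: with enough random
# candidates, one of them will match your target beautifully -- until you
# test it on held-out data.
import random

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 20                      # a short series, like one season's worth of games
target = [random.random() for _ in range(n)]
candidates = [[random.random() for _ in range(n)] for _ in range(100_000)]

# "Discover" the candidate that best matches the target on the first half...
half = n // 2
best = max(candidates, key=lambda c: abs(corr(c[:half], target[:half])))
print("correlation on the data we searched:", corr(best[:half], target[:half]))

# ...then watch the "pattern" evaporate on the half we held out.
print("correlation on the held-out data:   ", corr(best[half:], target[half:]))
```

The “discovered” correlation looks spectacular on the data you searched, and falls apart the moment you ask it to predict anything – which is exactly why you need a model of why a correlation should hold before you trust it.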
Mr. Anderson mentions Google, and the fact that we don’t know why our search algorithms produce particular results for a particular query. That’s absolutely true. Do a search for a particular set of keywords, and we can’t, without a lot of work, figure out just why our algorithms produced that
result. That doesn’t mean that we don’t understand our search. Each component of the search process is well understood and motivated by some real theory of how to discover information from links. The general process is well
understood; the specific results are a black box. Mr. Anderson
is confusing the fact that we don’t know what the result will be for a particular query with the idea that we don’t know why our system works well. Web-search systems aren’t based on any kind of randomness: they find relationships between webpages based on hypotheses about how links between pages can provide information about the subject and quality of the link target. Those hypotheses are implemented and tested, and if they test out – if they produce the results that the hypotheses predict they should – then they get deployed. It’s basic science: look at data, develop a hypothesis, test the hypothesis. Just because the hypothesis and the test are based on huge quantities of data doesn’t change that.
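To make that concrete, here’s a toy version of the kind of hypothesis Mr. Anderson is waving away – my own sketch of the published PageRank idea, not Google’s actual ranking system, with a made-up four-page web:

```python
# A toy power-iteration PageRank over a tiny hand-made link graph. This is a
# sketch of the published "links are endorsements" idea, not Google's real
# system; the graph and damping factor are purely illustrative.
damping = 0.85

# page -> pages it links to (a hypothetical four-page web)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the scores settle down
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for p, outgoing in links.items():
        share = damping * rank[p] / len(outgoing)
        for q in outgoing:
            new_rank[q] += share
    rank = new_rank

# The hypothesis predicts that the heavily-linked page (C) should score highest.
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

The point isn’t the code – it’s that every step in a system like this embodies a testable claim about what links mean, and those claims are exactly the kind of theory Mr. Anderson says we no longer need.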
Mr. Anderson’s other example is the discovery of a species using massive
sequencing. A recent technique that’s been used by some biologists is (very loosely speaking) to gather a quantity of biomass, puree it, sieve out the DNA, and sequence it. Doing this, researchers have found genes from previously unknown species. At the moment, we don’t know much about those new species. Without some more work, we don’t know what they look like. Depending on how much of their genetic material we’ve sequenced, we might not even know what family of species they come from!
But the biologists aren’t working blindly. In terms of genetics, they know what a species looks like. No matter how much data they analyzed, if they didn’t have an idea – a theory – about what kinds of patterns should be in the data, about what a species is, about how to identify a species from a genome fragment – then no amount of computation, no mass of data, could possibly produce the hypothesis that they discovered a new species. It’s only by understanding information, developing theories, and then testing those theories by looking at the data, that they can produce any result.
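Here’s a deliberately cartoonish sketch of what that theory looks like in practice – my own illustration, not Venter’s actual metagenomics pipeline: compare a sequenced fragment against reference sequences using shared k-mers, and flag fragments that don’t resemble anything known.

```python
# A cartoon of "knowing what a species looks like in the data": compare a
# sequenced fragment against known reference sequences by shared k-mers, and
# flag fragments that match nothing well as candidates for something new.
# The sequences and the cutoff below are made up for illustration.
def kmers(seq, k=4):
    """All overlapping length-k substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=4):
    """Jaccard similarity of the two sequences' k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

# Hypothetical reference sequences for known organisms.
references = {
    "known_microbe_1": "ATGGCGTACGTTAGCATCGATCGGATCCGTAGCTAGGCTA",
    "known_microbe_2": "TTGACCGGTAAACCGGTTTAACCGGATATCCGGAAATTCC",
}

# A fragment fished out of the pureed-biomass soup.
fragment = "GGGTTTCACACAGGGTTTCACACAGGGTTTCACACAGGGT"

scores = {name: similarity(fragment, ref) for name, ref in references.items()}
print(scores)
if max(scores.values()) < 0.2:  # arbitrary cutoff, purely for illustration
    print("Doesn't resemble any known reference -- a candidate for something new.")
```

Every piece of that – what a k-mer is, what counts as “similar,” where the cutoff sits – is a hypothesis about what species look like in sequence data. Without those hypotheses, the pile of sequenced DNA is just noise.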
In other words, the results are the product of the scientific method. It’s just a new way of doing an experiment; a new way of analyzing data. Science is still there. The scientific process is, at its core, still exactly the same as it always was. We just have an extremely powerful new tool that scientists can use as part of that method.