Wednesday, July 02, 2008

Edge 248 - The End of Theory

Chris Anderson, editor-in-chief at Wired Magazine, recently created a stir with his provocative article, The End of Theory.

Here's a taste of the article, for those who may have somehow missed it:

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.

This kind of thinking is poised to go mainstream.

So what does this mean for science?

~C4Chaos offered one good interpretation of this article:

This reminded me of the book, The Black Swan (see my review). Theoretical models are useful as starting points and for framing but in the long run our human tendency to categorize (Platonicity) and explain the causes of everything with theories (narrative fallacy) backed up with partial evidence (confirmation bias; fallacy of silent evidence) while concocting models of reality (ludic fallacy) make us blind to Black Swans (i.e. high-impact, hard-to-predict, and rare events beyond the realm of normal expectations).

Many science bloggers disagreed with Anderson's conclusions. Deepak Singh stated bluntly, Chris Anderson, you are wrong.

How can I (or others who actually still do science) take the new paradigms of computing (by the way, bioinformaticians have been using methods typically used in “collective intelligence” for years), and take biology, which is now very much a digital science, and combine them with our scientific reasoning, our ability to take phenomena and develop models that explain those phenomena and do something meaningful with them. I have seen many computer scientists develop some very elegant theoretical models for biological information, but often without any biological context. Yes scientists need to adopt new techniques, develop new theoretical approaches, even rethink the very basic tenets that they know, but to say the scientific method is dead or approaching the end is sensationalist in the least, and completely uneducated in the extreme.

Andrew at the Social Statistics blog also adds his perspective in The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.

1. Anderson has a point--there is definitely a tradeoff between modeling and data. Statistical modeling is what you do to fill in the spaces between data, and as data become denser, modeling becomes less important.

2. That said, if you look at the end result of an analysis, it is often a simple comparison of the "treatment A is more effective than treatment B" variety. In that case, no matter how large your sample size, you'll still have to worry about issues of balance between treatment groups, generalizability, and all the other reasons why people say things like, "correlation is not causation" and "the future is different from the past."

3. Faster computing gives the potential for more modeling along with more data processing. Consider the story of "no pooling" and "complete pooling," leading to "partial pooling" and multilevel modeling. Ideally our algorithms should become better at balancing different sources of information. I suspect this will always be needed.
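
For readers unfamiliar with the pooling vocabulary in Andrew's third point, here is a minimal Python sketch of the three strategies. It is only an illustration under stated assumptions: the function name `pooled_estimates`, the constant `prior_strength`, and the toy data are all hypothetical, and the fixed shrinkage weight stands in for what a real multilevel model would estimate from the data.

```python
from statistics import mean


def pooled_estimates(groups, prior_strength=5.0):
    """Estimate each group's mean three ways.

    groups: dict mapping a group name to a list of observations.
    prior_strength: hypothetical tuning constant (roughly, how many
    observations the grand mean is "worth"). A real multilevel model
    would estimate this from the data instead of fixing it.
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)  # complete pooling: one number for everyone

    no_pool, partial = {}, {}
    for name, values in groups.items():
        n = len(values)
        group_mean = mean(values)          # no pooling: each group on its own
        weight = n / (n + prior_strength)  # more data -> trust the group more
        no_pool[name] = group_mean
        # partial pooling: a compromise between group mean and grand mean
        partial[name] = weight * group_mean + (1 - weight) * grand_mean

    complete = {name: grand_mean for name in groups}
    return no_pool, complete, partial


# Toy data (made up): one well-sampled group, one sparsely sampled group.
groups = {
    "large": [10.0] * 95,
    "small": [20.0] * 5,
}
no_pool, complete, partial = pooled_estimates(groups)
```

With these made-up numbers the sparsely sampled group is pulled well away from its own mean toward the grand mean, while the well-sampled group barely moves. That is the sense in which modeling fills in the spaces between data, and why it matters less where the data are dense.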

More dissent comes from Drew Conway at Zero Intelligence Agents in The Hubris of ‘The End of Theory’.

[M]y reaction is best summarized in poetry, and I turn to an excerpt from the venerable T.S. Eliot and his Choruses from the Rock...

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

Mr. Anderson is well advised to consider these questions posed by Eliot, as his assertion that the “scientific method is becoming obsolete” by the deluge of data stored in the “clouds” of Google is not only misinformed; it presents a dangerous framework for undermining our ability to truly understand individuals as they exist, and have existed, in all aspects of life.

Go read the rest of his entry -- it's one of the better ones.

Now back to Edge issue 248, where there are many responses to Anderson's article. Edge asked George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis, Sean Carroll, Jaron Lanier, Joseph Traub, John Horgan, Bruce Sterling, and Douglas Rushkoff to contribute their insights into The End of Theory -- the results are intriguing.

Here are teaser bits from some of the respondents:
GEORGE DYSON: Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not.

KEVIN KELLY: My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name.

STEWART BRAND: Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.

W. DANIEL HILLIS: Chris Anderson says that "this approach to science — hypothesize, model, test — is becoming obsolete". No doubt the statement is intended to be provocative, but I do not see even a little bit of truth in it. I share his enthusiasm for the possibilities created by petabyte datasets and parallel computing, but I do not see why large amounts of data will undermine the scientific method. We will begin, as always, by looking for simple patterns in what we have observed and use that to hypothesize what is true elsewhere. Where our extrapolations work, we will believe in them, and when they do not, we will make new models and test their consequences. We will extrapolate from the data first and then establish a context later. This is the way science has worked for hundreds of years.

And here is a full response from Douglas Rushkoff, someone I have admired for years, since his first couple of books.
I'm suspicious on a few levels.

First off, I don't think Google has been proven "right." Just effective for the moment. Once advertising itself is revealed to be a temporary business model, then Google's ability to correctly exploit the trajectory of a declining industry will itself be called into question. Without greater context, Google's success is really just a tactic. It's not an extension of human agency (or even corporate agency) but a strategic stab based on the logic of a moment. It is not a guided effort, but a passive response. Does it work? For the moment. Does it lead? Not at all.

Likewise, to determine human choice or make policy or derive science from the cloud denies all of these fields the presumption of meaning.

I watched during the 2004 election as market research firms crunched data in this way for the Kerry and Bush campaigns. They would use information unrelated to politics to identify households more likely to contain "swing" voters. The predictive modeling would employ data points such as whether the voters owned a dog or cat, a two-door or four-door car, how far they traveled to work, how much they owed on their mortgage, to determine what kind of voters were inside. These techniques had no logic to them. Logic was seen as a distraction. All that mattered was the correlations, as determined by computers poring over data.

If it turned out that cat-owners with two-door cars were more likely to vote a certain way or favor a certain issue, then pollsters could instruct their canvassers which telephone call to make to whom. Kids with DVD players containing ads customized for certain households would show up on the doorsteps of homes, play the computer-assembled piece, leave a flyer, and head to the next one.

Something about that process made me cynical about the whole emerging field of bottom-up, anti-taxonomy.

I'm all for a good "folksonomy," such as when kids tag their favorite videos or blog posts. It's how we know which YouTube clip to watch; we do a search and then look for the hit with the most views. But the numbers most certainly do not speak for themselves. By forgetting taxonomy, ontology, and psychology, we forget why we're there in the first place. Maybe the video consumer can forget those disciplines, but what about the video maker?

When I read Anderson's extremely astute arguments about the direction of science, I find myself concerned that science could very well take the same course as politics or business. The techniques of mindless petabyte churn favor industry over consideration, consumption over creation, and—dare I say it—mindless fascism over thoughtful self-government. They are compatible with the ethics-agnostic agendas of corporations much more than they are the more intentional applied science of a community or civilization.

For while agnostic themselves, these techniques are not without bias. While their bias may be less obvious than that of human scientists trained at elite institutions, their bias is nonetheless implicit in the apparently but falsely post-mechanistic and absolutely open approach to data and its implications. It is no more truly open than open markets, and ultimately biased in their favor. Just because we remove the limits and biases of human narrativity from science, does not mean other biases don't rush in to fill the vacuum.

My thoughts:

I doubt that theory will ever go away -- it's an integral part of human consciousness. In an earlier post on a discussion from Seed Magazine, Michael Gazzaniga discusses his discovery of "the Interpreter," a "module" in the left hemisphere of the brain that seeks to explain our actions (even in the absence of any knowledge as to why we did something).

My sense is that this module is also responsible for all kinds of other theoretical activities. Human beings are defined, more than by any other trait, by their desire to make some sense of their world. Information overload will not change this inclination, but it will give us more data to work with.
