
May 06 2013

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.

Let’s do some thought experiments–unfortunately, totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double-headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. And here we see what’s really happening: to remove one head of the double-headed arrow, we use “common sense” to choose between two stories: one that’s merely silly, and another that’s so ludicrous we never even think about it. It seems to work here (for a very limited value of “work”); but if I’ve learned one thing, it’s that good old common sense is frequently neither common nor sensible. For more realistic correlations, it certainly seems ironic that we’re doing all this data analysis just to end up relying on common sense.
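The symmetry is easy to see numerically. Here is a minimal NumPy sketch; the “pickle” and “death” numbers are entirely made up, standing in for any pair of correlated variables. The point is that the correlation statistic itself carries no direction:

```python
import numpy as np

rng = np.random.default_rng(0)
pickles = rng.normal(size=1000)                  # hypothetical consumption scores
deaths = 0.5 * pickles + rng.normal(size=1000)   # an outcome correlated with them

# Pearson correlation is symmetric: corr(A, B) == corr(B, A).
# Nothing in the number tells you which way the causal arrow points.
r_ab = np.corrcoef(pickles, deaths)[0, 1]
r_ba = np.corrcoef(deaths, pickles)[0, 1]
print(r_ab, r_ba)
```

Both calls print the same value; whatever story you attach to it (pickles kill, or dying drives pickle sales) comes from you, not from the statistic.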

Now let’s look at something equally hypothetical that isn’t silly. A drug is correlated with reduced risk of death due to heart failure. Good thing, right? Yes–but why? What if the drug has nothing to do with heart failure, but is really an anti-depressant that makes you feel better about yourself so you exercise more? If you’re in the “correlation is as good as causation” club, it doesn’t make a difference: you win either way. Except that, if the key is really exercise, there might be much better ways to achieve the same result. Certainly much cheaper, since the drug industry will no doubt price the pills at $100 each. (Tangent: I once saw a truck drive up to an orthopedist’s office and deliver Vioxx samples with a street value probably in the millions…) It’s possible, given some really interesting work being done on the placebo effect, that a properly administered sugar pill will make the patient feel better and exercise, yielding the same result. (Though it’s possible that sugar pills only work as placebos if they’re expensive.) I think we’d like to know, rather than just saying that correlation is as good as causation if you have a lot of data.

Perhaps I haven’t gone far enough: with enough data, and enough dimensions to the data, it would be possible to detect the correlations between the drug, psychological state, exercise, and heart disease. But that’s not the point. First, if correlation really is as good as causation, why bother? Second, to analyze data, you have to collect it. And before you collect it, you have to decide what to collect. Data is socially constructed (I promise, this will be the subject of another post), and the data you don’t decide to collect doesn’t exist. Decisions about what data to collect are almost always driven by the stories we want to tell. You can have petabytes of data, but if it isn’t the right data, if it’s data that’s been biased by preconceived notions of what’s important, you’re going to be misled. Indeed, any researcher knows that huge data sets tend to create spurious correlations.
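That last point is easy to demonstrate: with many dimensions and few observations, pure noise reliably produces impressive-looking correlations. A small sketch in NumPy, with all numbers invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 50, 2000   # few observations, many dimensions

outcome = rng.normal(size=n_samples)              # the thing we "want to predict"
noise = rng.normal(size=(n_features, n_samples))  # 2000 columns of pure noise

# Correlate the outcome with every noise column and keep the strongest match.
correlations = [np.corrcoef(outcome, col)[0, 1] for col in noise]
best = np.max(np.abs(correlations))
print(f"strongest correlation found in pure noise: {best:.2f}")
```

The strongest match will look respectable even though every “predictor” is random by construction. Screen enough variables and something will always correlate; that’s exactly why the biased-collection problem can’t be analyzed away after the fact.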

Causation has its own problems, not the least of which is that it’s impossible to prove. Unfortunately, that’s the way the world works. But thinking about cause and how events relate to each other helps us to be more critical about the correlations we discover. As humans we’re storytellers, and an important part of data work is building a story around the data. Mere correlations arising from a gigantic pool of data aren’t enough to satisfy us. But there are good stories and bad ones, and just as it’s possible to be careful in designing your experiments, it’s possible to be careful and ethical in the stories you tell with your data. Those stories may be the closest we will ever get to an understanding of cause; but we have to realize that they’re just stories, that they’re provisional, and that better evidence (which may just be correlations) may force us to retell our stories at any moment. “Correlation is as good as causation” is just an excuse for intellectual sloppiness; it’s an excuse to replace thought with an odd kind of “common sense,” and to shut down the discussion that leads to good stories and understanding.

April 11 2013

Data skepticism

A couple of months ago, I wrote that “big data” is heading toward the trough of a hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.
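Reiley’s point can be sketched in a few lines of simulation. In the hypothetical below (every parameter is invented for illustration), an ad is targeted at users who were already inclined to buy; the ad itself has zero effect, yet a naive saw-the-ad versus didn’t-see-it comparison reports a healthy “lift”:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Latent purchase intent; ad targeting preferentially reaches high-intent users.
intent = rng.normal(size=n)
saw_ad = rng.random(n) < 1 / (1 + np.exp(-intent))          # biased exposure
purchased = rng.random(n) < 1 / (1 + np.exp(-(intent - 1)))  # ad has ZERO effect

# Naive comparison of the two populations, ignoring how they were selected.
naive_lift = purchased[saw_ad].mean() - purchased[~saw_ad].mean()
print(f"naive 'ad effect': {naive_lift:.3f}")
```

The reported lift is entirely selection bias: the exposed population was different before the ad ever ran. A randomized holdout, not more observational data, is what separates the two explanations.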

Skepticism about data is normal, and it’s a good thing. If I had to give a one line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.
