Like a biblical parable, the typical human-behavior experiment is easily told and easily reduced to a message: People who paid with credit cards were more likely to have potato chips in their grocery bags than were people who paid with cash. (So if you want to lose weight, use cash!) I tell this kind of tale all the time (though I try to be careful about jumping from an experimental result to a policy). But the old yeshiva saying applies: "For example" isn't proof. Though many experiments lend themselves to a convincing blog post, the actual, logical case (what the researchers did that's different from what I do here) is supposed to be made by statistical inference. So if there's something wrong with people's statistics, then their results fall into a category much dreaded in science: anecdotes. Hence the shock value of this paper (pdf) in this month's Nature Neuroscience. In 157 neuroscience studies that made such this-group-versus-that-group claims, the authors write, they found fully half got their statistics wrong in at least one instance. Allowing for papers where the error clearly didn't invalidate the researchers' central claim, that still means about a third of these papers (from Science, Nature, Neuron, The Journal of Neuroscience and Nature Neuroscience itself) might be describing effects that aren't real.
The authors, Sander Nieuwenhuis, Birte U. Forstmann and Eric-Jan Wagenmakers, target a common procedure in neuroscience (and in the social sciences as well): Performing an experimental manipulation on two different groups, you find that one reacts in a satisfyingly clear way, while the other does not react that way. (To be more precise, you test for the probability of your data occurring if there were no effect from your treatment. Using one very common standard, when that probability is less than 5 percent, you declare you have a statistically significant finding.)
So there you are testing the effect of a drug on Group 1 (a strain of mutant mice) and, for comparison, on Group 2 (a strain of plain old vanilla lab mice). The drug has a statistically significant effect on Group 1 and a much smaller, statistically insignificant effect on Group 2. Therefore, you conclude that your drug has a different effect on mutant mice than on normal ones. And when you do this, you are wrong.
The reason: The results from Group 1 and from Group 2 are distinct pieces of information. In order to compare them statistically, you have to relate them to one another. You need to know the probability of finding that difference between Group 1's effect and Group 2's, not the probability of either result in isolation. In fact, as this paper points out, the appearance of a statistically significant result in Group 1 and an insignificant result in Group 2 is not, itself, necessarily statistically significant. A large contrast between results from the two groups could be due to a very small difference in the underlying cause.
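To see how this can play out, here is a minimal sketch in Python. The numbers are invented for illustration (they come from no study, and certainly not from the paper), and I'm assuming a simple z-test with known standard errors, which is a simplification of what a real analysis would use. The helper `two_sided_p` is my own, not anything from a statistics package.

```python
import math

def two_sided_p(estimate, std_err):
    """Two-sided p-value for a z-test of `estimate` against zero."""
    z = abs(estimate) / std_err
    # For a standard normal, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Hypothetical drug effects and standard errors (made up for illustration).
effect_mutant, se_mutant = 25.0, 10.0   # Group 1: mutant mice
effect_normal, se_normal = 10.0, 10.0   # Group 2: ordinary lab mice

# Each group tested in isolation against "no effect".
p_mutant = two_sided_p(effect_mutant, se_mutant)
p_normal = two_sided_p(effect_normal, se_normal)

# The comparison that actually matters: is the DIFFERENCE between the
# two effects distinguishable from zero? For independent groups, the
# standard error of the difference combines both standard errors.
diff = effect_mutant - effect_normal
se_diff = math.sqrt(se_mutant**2 + se_normal**2)
p_diff = two_sided_p(diff, se_diff)

print(f"Group 1 alone: p = {p_mutant:.3f}")
print(f"Group 2 alone: p = {p_normal:.3f}")
print(f"Difference between groups: p = {p_diff:.3f}")
```

With these made-up numbers, Group 1 clears the 5 percent bar and Group 2 does not, yet the test on the difference between them doesn't clear it either. So "significant here, not significant there" is not, by itself, evidence that the drug works differently in the two strains.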
This is a lot less compelling than a neat story line (Ben Goldacre at The Guardian called his lucid explanation last week "400 words of pain"). But doing the stats right is the essential underpinning for the narrative version. So I was simply astonished that half the researchers making this sort of claim in the very prestigious sample were, according to the paper, not doing it correctly.
I try, dear reader, to sort out the wheat and the chaff here, worrying about soundness as well as the gee whiz factor, and trying to separate the experiments that actually took place from hype that could be derived from them. But Wagenmakers, who has made himself a scourge of statistical error and woolly thinking in general, has me worried.
I first encountered his skepticism of psychology's methods when he and his co-authors dismantled claims that psychology's standard methods could yield evidence of psychic powers. Then, last May, he and another set of co-authors published this paper (pdf), in which they looked at 855 statistical tests in papers published in 2007 in two major psychology journals, and found that 70 percent would flunk an alternative (and, they say, better) test of significance.
I mean, it would be one thing if a lot of contemporary research on human behavior were superseded, corrected, improved upon or reinterpreted in the future. Given the way science is supposed to work, one of those fates is to be expected. What I can't get my mind around is the possibility that, instead, a great deal of this work, sheaf upon sheaf of it, will turn out to be simply meaningless.
ADDENDUM: The notion that scientists don't get statistics doesn't shock statisticians, it seems. At least, it doesn't shock my favorite statistics guru, Andrew Vickers of Sloan-Kettering, author of this very clear and handy guide to his field. After I sent him the paper by Nieuwenhuis et al., he emailed: "Bad statistics in neuroscience? Isn't that a bit like going out of your way to say that the Mets have a bad record against Atlanta? They lose against pretty much every team and there is no need to go through the sub-group analyses of multiple different opponents. By the same token, the surprise would be if neuroscientists didn't make the same mistakes as everyone else."
It makes sense to me that the oddities of statistical thinking would be no more congenial to scientists than to the rest of us (if your passion is alligator brains or star clusters, there's no particular reason you should cotton to p-values). Perhaps this leads to a "black box" approach to statistical software that helps explain the situation that Nieuwenhuis et al. decry. On the other hand, Goldacre sees things more darkly, suggesting the trouble may be a desire to publish at all costs.
I do think it's a subject we science writers ought to pay more attention to.
Nieuwenhuis, S., Forstmann, B., & Wagenmakers, E. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14 (9), 1105-1107. DOI: 10.1038/nn.2886
Wetzels, R., Matzke, D., Lee, M., Rouder, J., Iverson, G., & Wagenmakers, E. (2011). Statistical Evidence in Experimental Psychology: An Empirical Comparison Using 855 t Tests. Perspectives on Psychological Science, 6 (3), 291-298. DOI: 10.1177/1745691611406923