Re-examining Significant Research: The Problem of False Positives

These days, it seems like wherever you turn, there’s a story of a researcher who has gone over to the dark side. There was Marc Hauser at Harvard, who resigned after a university investigation found him responsible for eight counts of scientific misconduct. There was Frank Fischer, the Rutgers political scientist who committed the equivalent of data fraud in a non-science discipline, by plagiarizing large blocks of text from others with no attribution. Now, there’s Diederik Stapel, who has apparently falsified not just a few, but dozens of sets of data, including materials used by many of his graduate students. And the list goes on (Andrew Gelman details several more cases on his blog). But could it be the case that the behavior is even more prevalent, extending to well-meaning researchers who are doing nothing more than following standard, or at the very least widely accepted, research practices?

The ease of reporting false positives

In a recent paper, Joseph Simmons, Leif Nelson, and Uri Simonsohn, of the University of Pennsylvania and the University of California, Berkeley, argued that more minor lapses in data handling, ones that would not normally be seen as cheating at all but may be no less detrimental to proper research and academic integrity, are widespread in the literature. According to their calculations, it is remarkably easy to report false-positive findings: results that support an effect that, in reality, does not (or may not) exist.

To demonstrate their point, the researchers set out to show that something demonstrably false was true: that certain songs could change listeners’ age. First, they had participants listen to one of two songs, “Kalimba” or the Beatles’ “When I’m Sixty-Four.” Then, in a task that was presented as unrelated, they asked everyone to indicate their own age and their father’s age. Then, they ran a standard analysis of covariance (ANCOVA), with father’s age as the covariate. Lo and behold, they found that after listening to “When I’m Sixty-Four,” as opposed to “Kalimba,” participants were nearly a year and a half younger.

The result was clearly nonsensical. And yet, the analysis had seemingly been sound. So what went wrong?

The culprit is what the authors call researcher degrees of freedom: the many choices a researcher makes while collecting and analyzing data. How much data is collected? How many observations are included? How many participants will there be, and how many elements of data will be gathered per participant? What will the controls be? Which measures or conditions will be combined? In theory, all of these questions should be addressed ahead of time. In practice, they are often addressed as the experiment unfolds, and therein lies the problem.

As the authors argue, each time a post-hoc or in-the-moment decision is made, the integrity of the data is undermined, and the risk of finding an effect that isn’t really there rises. Take, for instance, the decision of when to stop collecting data. Often, researchers have a general number in mind, but may test to see if they are getting the desired effect at various points in the data collection. If they see the effect at a statistically significant level, they are likely to stop. If they don’t, they keep collecting data and, after, say, ten more participants, run the test again. It doesn’t seem necessarily wrong to do so, but as the paper’s authors demonstrate, this flexibility in deciding when to stop raises the false-positive rate by approximately 50% over its nominal level.
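To get a feel for how this plays out, here is a rough simulation sketch in Python. It is not the authors’ code, and the setup is my own simplification rather than the paper’s: two groups are drawn from the same normal distribution (so any "effect" is a false positive), a z-test with a known standard deviation of 1 stands in for the t-test a real study would use, each group starts with 20 participants, and a single extra batch of 10 per group is added if the first test misses significance. All function and variable names are made up for illustration.

```python
import random
from statistics import NormalDist, fmean

# Two-sided 5% cutoff for a z statistic (about 1.96).
Z_CRIT = NormalDist().inv_cdf(0.975)

def significant(a, b):
    """Two-sample z-test, assuming a known standard deviation of 1 per group."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    return abs(fmean(a) - fmean(b)) / se > Z_CRIT

def false_positive_rate(n_sims, start_n, extra_n, rng):
    """Share of null studies declared significant.

    If extra_n > 0, a study that misses significance at start_n per group
    collects extra_n more per group and is tested once more.
    """
    hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: there is no true effect.
        a = [rng.gauss(0, 1) for _ in range(start_n)]
        b = [rng.gauss(0, 1) for _ in range(start_n)]
        found = significant(a, b)
        if not found and extra_n:
            a += [rng.gauss(0, 1) for _ in range(extra_n)]
            b += [rng.gauss(0, 1) for _ in range(extra_n)]
            found = significant(a, b)
        hits += found
    return hits / n_sims

rng = random.Random(0)
fixed = false_positive_rate(10_000, 20, 0, rng)     # sample size set in advance
peeking = false_positive_rate(10_000, 20, 10, rng)  # one extra look allowed
print(f"fixed n: {fixed:.3f}   one extra look: {peeking:.3f}")
```

With the sample size fixed in advance, the false-positive rate should hover near the nominal 5%; with one extra look it should come out noticeably higher. Exact figures will shift with the seed and the number of simulated studies.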

Or, take the number of dependent variables being analyzed. If there are two dependent variables, researchers can test whether a manipulation influenced one, the other, or both of them, and report whichever test comes out significant. That flexibility in analysis almost doubles the rate of false positives.
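The same kind of back-of-the-envelope simulation can illustrate the point about dependent variables. Again, this is a sketch under my own assumptions, not the authors’ procedure: two outcome measures with unit variance, correlated at 0.5, 20 participants per group, no true effect, and a z-test with known standard deviations in place of a t-test. A "finding" is counted if the first measure, the second measure, or their average reaches significance. The names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

# Two-sided 5% cutoff for a z statistic (about 1.96).
Z_CRIT = NormalDist().inv_cdf(0.975)

def significant(a, b, sd):
    """Two-sample z-test, assuming the per-observation sd is known."""
    se = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    return abs(fmean(a) - fmean(b)) / se > Z_CRIT

def draw_group(n, r, rng):
    """Return n (dv1, dv2) pairs; each DV is N(0, 1) and they correlate at r."""
    pairs = []
    for _ in range(n):
        y1 = rng.gauss(0, 1)
        y2 = r * y1 + (1 - r * r) ** 0.5 * rng.gauss(0, 1)
        pairs.append((y1, y2))
    return pairs

def flexible_dv_rates(n_sims, n, r, rng):
    """Return (rate testing only DV1, rate testing DV1, DV2, or their average)."""
    sd_avg = ((1 + r) / 2) ** 0.5  # sd of the average of the two DVs
    single = flexible = 0
    for _ in range(n_sims):
        g1, g2 = draw_group(n, r, rng), draw_group(n, r, rng)
        tests = [
            significant([p[0] for p in g1], [p[0] for p in g2], 1.0),
            significant([p[1] for p in g1], [p[1] for p in g2], 1.0),
            significant([(p[0] + p[1]) / 2 for p in g1],
                        [(p[0] + p[1]) / 2 for p in g2], sd_avg),
        ]
        single += tests[0]        # analysis committed to in advance
        flexible += any(tests)    # report whichever test "works"
    return single / n_sims, flexible / n_sims

single_rate, flexible_rate = flexible_dv_rates(10_000, 20, 0.5, random.Random(1))
print(f"one planned DV: {single_rate:.3f}   any of three tests: {flexible_rate:.3f}")
```

Committing to one measure in advance should keep the rate near 5%, while picking whichever of the three tests succeeds should push it to roughly twice that. The exact numbers depend on the seed and on the assumed correlation between the two measures.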

The authors go on to examine a number of additional practices that add to these researcher degrees of freedom (the choices left open while a study is run and analyzed) and, in so doing, may undermine data integrity, and even allow you to claim, as they did after following these “standard practices,” that a certain song can lower your age. Some amount to so-called data mining: collecting so much data that you are able to run analyses in multiple fashions, controlling or not controlling for something, using or not using something as a covariate, reporting or not reporting a particular measure or effect or statistic. In other words, whatever it takes to get a significant result that will get you a publication. It’s not falsification per se; the data are real and present enough. And it’s not even a question of malice; the practices are so common that most people don’t think twice about them. Didn’t get a significant result? Why not try analyzing the data this way? It gives you one more shot at statistical significance, and who knows, you may be able to squeeze something out of it yet. After all, you don’t want to have wasted time and money on a study that went nowhere.

A biased publication process

The culprits, I think, are two inherent biases: in the publication world and in the mind. Publications don’t like null findings. The pressure to find something is great. Find nothing and you won’t get published. Find something that is too complicated or not strong or striking enough, and you likewise run the risk of non-publication. Is it really so strange, then, in a time when publications equal success, funding, and job-ability, that researchers end up finding something, even if that something was not what they had originally looked for, in a sample of a different size than they’d originally sought, obtained by correcting for things they didn’t think needed correcting for, or just one thing in a sea of non-findings that will never be reported and were not included in the write-up of the experimental design? And that goes back to bias number two: our minds like to confirm what we want to find. If we want to see an effect, we will do our very best to see it, evidence be damned (while that may be putting it a bit too strongly, the idea is in the right spirit).

Correcting the error

Simmons, Nelson, and Simonsohn don’t just point out the errors and run. They propose a set of guidelines that would help address the problem, which include the following: researchers must decide on a rule for stopping data collection before collection begins and report that rule in the article, rather than stopping at an arbitrary point determined by the data themselves; they must list every variable that was collected in a study, instead of listing only those they end up reporting; and they must report all experimental conditions, including those that failed to produce any effect. Reviewers, for their part, must increase their tolerance of results that are not as striking or perfectly formed and must require authors to justify their data collection and analysis decisions.

The recommendations are simple enough. But somehow, I highly doubt they will be followed. After all, it still sounds far better if your journal publishes cool and intriguing and strong studies—not borderline, messy studies that hedge their bets. And those null results? Forget it. And if you’re a researcher, don’t you want to publish with the best of them, in the most high-profile journal you possibly can? Changing the standards would require a severe upheaval of the academic status quo. I just don’t know if academia is ready.

If you’d like to receive information on new posts and other updates, follow Maria on Twitter @mkonnikova
