You may have missed some of the discussion on fraud, errors and biases shaking the scientific community of late, so I will quickly bring you up to speed.
Firstly, a series of fraud cases (Ruggiero, Hauser, Stapel) in Psychology and related fields makes everyone wonder why only internal whistleblowers ever discover major fraud cases like these.
Secondly, a well regarded journal publishes an article by Daryl Bem (2011) claiming that we can feel the future. Wagenmakers et al. (2011) apply a different statistical analysis and claim that Bem’s evidence for precognition is so weak as to be meaningless. The debate continues. Meanwhile a related failed replication paper claims to have trouble getting published.
Thirdly, John Bargh criticises everyone involved in a failed replication of an effect he is particularly well known for. He criticises the experimenters, the journal, even a blogger who wrote about it.
This all happened within the last year and suddenly everyone speaks about replication. Ed Yong wrote about it in nature, the Psychologist had a special issue on it, some researchers set up a big replication project, the blogosphere goes crazy with it.
Some may wonder why replication was singled out as the big issue. Isn’t this about the ruthless, immoral energy of fraudsters? Or about publishers’ craving for articles that create buzz? Or about a researcher’s taste for scandal? Perhaps it is indeed about a series of individual problems related to human nature. But the solution is still a systemic one: replication. It is the only way of overcoming the unfortunate fact that science is only done by mere humans.
This may surprise some people because replication is not done all that much. And the way researchers get rewarded for their work totally goes against doing replications. The field carries on as if there were procedures, techniques and analyses that overcome the need for replication. The most common of which is inferential hypothesis testing.
This way of analysing your data simply asks whether any differences found among the people who were studied would hold up in the population at large. If so, the difference is said to be a ‘statistically significant’ difference. Usually, this is boiled down to a p-value which reports the likelihood of finding the same statistically significant difference again and again in experiment after experiment if in truth the difference didn’t exist at all in the population. So, imagine that women and men in truth were equally intelligent (I have no idea whether they are). Inferential hypothesis testing predicts that 5% of experiments will report a significant difference between male and female IQs. This difference won’t be replicated by the other 95% of experiments.
And this is where replication comes in: the p-value can be thought of as a prediction of how likely failed replications of an effect will be. Needless to say that a prediction is a poor substitute for the real thing.
This was brought home to me by Luck in his great book An Introduction to the Event-Related Potential Technique (2005, p. 251). He basically says that replication is the only approach in science which is not based on assumptions needed to run the aforementioned statistical analyses.
Replication does not depend on assumptions about normality, sphericity, or independence. Replication is not distorted by outliers. Replication is a cornerstone of science. Replication is the best statistic.
In other words, it is the only way of overcoming the human factor involved in choosing how to get to a p-value. You can disagree on many things, but not on the implication of a straight replication. If the effect is consistently replicated, it is real.
For example, Simmons and colleagues (2011) report that researchers can tweak their data easily without anyone knowing. This is not really fraud but it is not something you want to admit, either. Using four ways of tweaking the statistical analysis towards a significant result – which is desirable for publication – resulted in a statistically significant difference having a non-replication likelihood of 60%. Now, this wouldn’t be a problem if anyone actually bothered to do a replication – including the exact same tweaks to the data. It is very likely that the effect wouldn’t hold up.
Many people believe that this is what really happened with Bem’s pre-cognition results. They are perhaps not fraudulous, but the way they were analysed and reported inflated the chances of finding effects which are not real. Similarly, replication is what did not happen with Stapel and other fraudsters. My guess is that if anyone had actually bothered to replicate, it would have become clear that Stapel has a history of unreplicability (see my earlier blog post about the Stapel affair for clues).
So, if we continue to let humans do research, we have to address the weakness inherent in this approach. Replication is the only solution we know of.
Bem, D.J., Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology, 100, 407-425. DOI: 10.1037/a0021524
Luck, S.J. (2005). An Introduction to the Event-Related Potential Technique. London: MIT Press.
Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22, 1359-1366. DOI: 10.1177/0956797611417632
Wagenmakers, E.J., Wetzels, R., Borsboom, D., van der Maas, H. (2011). Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi. Journal of Personality and Social Psychology, 100, 426-432. doi: 10.1037/a0022790