Replicability

Discovering a glaring error in a research paper – a personal account

New York Magazine has published a great article about how grad student Steven Ludeke tried to correct mistakes in the research of Pete Hatemi and Brad Verhulst. Overall, Ludeke summarises his experience as ‘not recommendable’. Back in my undergraduate years I spotted an error in an article by David DeMatteo and did little to correct it. Why?

Christian Bale playing a non-incarcerated American Psycho.

David DeMatteo, assistant professor in Psychology at Drexel University, investigates psychopathy. In 2010, I was a lowly undergraduate student and noticed a glaring mistake in one of his top ten publications, which has since been cited 50 times according to Google Scholar.

The error

The study investigated the characteristics of psychopaths who live among us, the non-incarcerated population. How do these psychopaths manage to avoid prison? DeMatteo et al. (2006) measured their psychopathy in terms of personality features and in terms of overt behaviours. ‘Participants exhibited the core personality features of psychopathy (Factor 1) to a greater extent than the core behavioral features of psychopathy (Factor 2). This finding may be helpful in explaining why many of the study participants, despite having elevated levels of psychopathic characteristics, have had no prior involvement with the criminal justice system.’ (p. 142)

The glaring mistake in this publication is that the Factor 2 score of 7.1 (the behavioural features of psychopathy) is actually higher than the Factor 1 score of 5.2 (the personality features of psychopathy). The numbers tell exactly the opposite story to the words.

[Figure: DeMatteo_mistake.jpg]

The error in short. The numbers obviously do not match up with the statement.

The numbers are given twice in the paper (p. 138 and p. 139), making a typo unlikely. Adjusting the scores for the maxima of the scales they are drawn from (Factor 1: x/x_max = 0.325 < Factor 2: x/x_max = 0.394) or for the sample maxima (Factor 1: x/x_max_obtained = 0.433 < Factor 2: x/x_max_obtained = 0.44375) makes no difference. No outlier rejection is mentioned in the paper.
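For completeness, here is the whole adjustment exercise as a few lines of R. The raw means are the ones reported by DeMatteo et al. (2006); the scale and sample maxima below are back-derived from the ratios I just quoted, so treat them as my assumptions rather than values copied from the paper:

# Does any plausible rescaling reverse the ordering of the two factor scores?
f1_mean <- 5.2   # Factor 1 (personality features), as reported in the paper
f2_mean <- 7.1   # Factor 2 (behavioural features), as reported in the paper

scale_max  <- c(f1 = 16, f2 = 18)   # assumed scale maxima, back-derived from the ratios above
sample_max <- c(f1 = 12, f2 = 16)   # assumed sample maxima, back-derived from the ratios above

c(f1 = f1_mean, f2 = f2_mean) / scale_max    # 0.325 vs 0.394 -> Factor 2 still higher
c(f1 = f1_mean, f2 = f2_mean) / sample_max   # 0.433 vs 0.444 -> Factor 2 still higher

Either way, the behavioural features come out on top.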

In sum, it appears as if DeMatteo and his co-authors interpret their numbers in a way which makes intuitive sense but which is in direct contradiction to their own data. When researchers disagree with their own data, we have a real problem.

The reaction

1) Self doubt. I consulted with my professor (the late Paddy O’Donnel) who confirmed the glaring mistake.

2) Contact the author. I contacted DeMatteo in 2010 but his e-mail response was evasive and did nothing to resolve the issue. I have contacted him again, inviting him to react to this post.

3) Check others’ reactions. I found three publications which cited DeMatteo et al.’s article (Rucevic, 2010; Gao & Raine, 2010; Ullrich et al., 2008) and simply ignored the contradictory numbers. They went with the story that community-dwelling psychopaths show psychopathic personalities more than psychopathic behaviours, even though the data in the article favour exactly the opposite conclusion.

4) Realising my predicament. At this point I realised my options: either I pursued this at full force while finishing my degree and, afterwards, moving on to my Master’s in a different country, or I let it go. I had a suspicion which Ludeke’s story in New York Magazine confirmed: in these situations one has much to lose and little to gain. Pursuing a mistake in the research literature is ‘clearly a bad choice’, according to Ludeke.

The current situation

And now this blog post detailing my experience. Why? Well, on the one hand, I have very little to lose from a disagreement with DeMatteo as I certainly don’t want a career in law psychology research and perhaps not even in research in general. The balance went from ‘little to gain, much to lose’ to ‘little to gain, little to lose’. On the other hand, following my recent blog posts and article (Kunert, 2016) about the replication crisis in Psychology, I have come to the conclusion that science cynicism is not the way forward. So, I finally went fully transparent.

I am not particularly happy with how I handled this whole affair. I have zero documentation of my contact with DeMatteo. So, expect his word to stand against mine soon. I also feel I should have taken a risk earlier in exposing this. But then, I used to be passionate about science and wanted a career in it. I didn’t want to make enemies before I had even started my Master’s degree.

In short, only once I stopped caring about my career in science did I find the space to care about science itself.

— — —

DeMatteo, D., Heilbrun, K., & Marczyk, G. (2006). An empirical investigation of psychopathy in a noninstitutionalized and noncriminal sample Behavioral Sciences & the Law, 24 (2), 133-146 DOI: 10.1002/bsl.667

Gao, Y., & Raine, A. (2010). Successful and unsuccessful psychopaths: A neurobiological model Behavioral Sciences & the Law DOI: 10.1002/bsl.924

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success Psychonomic Bulletin & Review DOI: 10.3758/s13423-016-1030-9

Rucević S (2010). Psychopathic personality traits and delinquent and risky sexual behaviors in Croatian sample of non-referred boys and girls. Law and human behavior, 34 (5), 379-91 PMID: 19728057

Ullrich, S., Farrington, D., & Coid, J. (2008). Psychopathic personality traits and life-success Personality and Individual Differences, 44 (5), 1162-1171 DOI: 10.1016/j.paid.2007.11.008

— — —

Update 16/11/2016: corrected numerical typo in sentence beginning ‘Adjusting the scores for the maxima…’ pointed out to me by Tom Foulsham via twitter (@TomFoulsh).


How to test for music skills

In a new article I evaluate a recently developed test for music listening skills. To my great surprise the test behaves very well. This could open the path to better understand the psychology underlying music listening. Why am I surprised?

I got my first taste of how difficult it is to replicate published scientific results during my very first empirical study as an undergraduate (eventually published as Kunert & Scheepers, 2014). Back then, I used a 25-minute dyslexia screening test to distinguish dyslexic participants from non-dyslexic participants (the Lucid Adult Dyslexia Screener). Even though previous studies had suggested an excellent sensitivity (identifying actually dyslexic readers as dyslexic) of 90% and a moderate to excellent specificity (identifying actually non-dyslexic readers as non-dyslexic) of 66% – 91% (Singleton et al., 2009; Nichols et al., 2009), my own values were worse: 61% sensitivity and 65% specificity. In other words, the dyslexia test only flagged someone with an official dyslexia diagnosis in 11/18 cases and only categorised someone without known reading problems as non-dyslexic in 13/20 cases. The dyslexia screener didn’t perform as suggested by the published literature and I have been suspicious of ability tests ever since.
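For transparency, the arithmetic behind those percentages is nothing more than this (counts as reported above):

# Sensitivity and specificity of the dyslexia screener from the counts above
n_dyslexic_flagged <- 11   # participants with a dyslexia diagnosis flagged by the screener
n_dyslexic_total   <- 18
n_control_cleared  <- 13   # participants without known reading problems categorised as non-dyslexic
n_control_total    <- 20

sensitivity <- n_dyslexic_flagged / n_dyslexic_total   # ~0.61
specificity <- n_control_cleared / n_control_total     # 0.65
round(c(sensitivity = sensitivity, specificity = specificity), 2)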

Five years later I acquired data to look at how music can influence language processing (Kunert et al., 2016) and added a newly proposed music ability measure called PROMS (Law & Zentner, 2012) to the experimental sessions to see how bad it is. I really thought I would see the music listening ability scores derived from the PROMS conflated with things which, on the face of it, have little to do with music (digit span, i.e. the ability to repeat increasingly longer digit sequences), because previous music ability tests had that problem. Similarly, I expected people with more musical training not to have much better PROMS scores. In other words, I expected the PROMS to perform worse than suggested by the people who developed the test, in line with my negative experience with the dyslexia screener.

It then came as a surprise to see that PROMS scores were hardly associated with the ability to repeat increasingly longer digit sequences (either in the same order, i.e. forward digit span, or in reverse order, i.e. backward digit span), see Figure 1A and 1B. This makes the PROMS scores surprisingly robust against variation in working memory, as you would expect from a good music ability test.

[Figure: journal.pone.0159103.g002]

Figure 1. How the brief PROMS (vertical axis) correlates with various validity measures (horizontal axis). Each dot is one participant. Lines are best fit lines with equal weights for each participant (dark) or downweighting unusual participants (light). Inserted correlation values reflect dark line (Pearson r) or a rank-order equivalent of it which is robust to outliers (Spearman rho). Correlation values range from -1 to +1.
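For readers who want to run this kind of validity check on their own data, a minimal sketch of the analysis behind each panel could look as follows. The variable names and simulated values are mine for illustration only, and MASS::rlm is just one reasonable choice for the outlier-downweighting fit:

# Sketch of the correlation/fit approach described in the caption (simulated stand-in data)
library(MASS)   # provides rlm() for the robust fit

set.seed(1)
digit_span <- rnorm(50, mean = 6, sd = 1)                    # stand-in validity measure
proms      <- 0.6 + 0.01 * digit_span + rnorm(50, sd = 0.1)  # stand-in brief PROMS scores

cor(digit_span, proms)                       # Pearson r (reflects the dark line)
cor(digit_span, proms, method = "spearman")  # Spearman rho, robust to outliers

fit_ols    <- lm(proms ~ digit_span)   # equal weight for each participant (dark line)
fit_robust <- rlm(proms ~ digit_span)  # downweights unusual participants (light line)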

The second surprise came when musical training was actually associated with better music skill scores, as one would expect for a good test of music skills, see Figures 1C, 1D, 1E, and 1H. To top it off, the PROMS score was also correlated with the music task performance in the experiment looking at how language influences music processing. This association between the PROMS and musical task accuracy was visible in two independent samples, see Figures 1F and 1G, which is truly surprising because the music task targets harmonic music perception, which is not directly tested by the PROMS.

To conclude, I can honestly recommend the PROMS to music researchers. To my surprise it is a good test which could truly tell us something about where music skills actually come from. I’m glad that this time I have been proven wrong regarding my suspicions about ability tests.

— — —

Kunert R, & Scheepers C (2014). Speed and accuracy of dyslexic versus typical word recognition: an eye-movement investigation. Frontiers in psychology, 5 PMID: 25346708

Kunert R, Willems RM, & Hagoort P (2016). Language influences music harmony perception: effects of shared syntactic integration resources beyond attention. Royal Society open science, 3 (2) PMID: 26998339

Kunert R, Willems RM, & Hagoort P (2016). An Independent Psychometric Evaluation of the PROMS Measure of Music Perception Skills. PloS one, 11 (7) PMID: 27398805

Law LN, & Zentner M (2012). Assessing musical abilities objectively: construction and validation of the profile of music perception skills. PloS one, 7 (12) PMID: 23285071

Nichols SA, McLeod JS, Holder RL, & McLeod HS (2009). Screening for dyslexia, dyspraxia and Meares-Irlen syndrome in higher education. Dyslexia, 15 (1), 42-60 PMID: 19089876

Singleton, C., Horne, J., & Simmons, F. (2009). Computerised screening for dyslexia in adults Journal of Research in Reading, 32 (1), 137-152 DOI: 10.1111/j.1467-9817.2008.01386.x
— — —

A critical comment on “Contextual sensitivity in scientific reproducibility”

Psychological science is surprisingly difficult to replicate (Open Science Collaboration, 2015). Researchers are desperate to find out why. A new study in the prestigious journal PNAS (Van Bavel et al., 2016) claims that unknown contextual factors of psychological phenomena (“hidden moderators”) are to blame. The more an effect is sensitive to unknown contextual factors, the less likely a successful replication is. In this blog post I will lay out why I am not convinced by this argument.

Before I start I should say that I really appreciate that the authors of this paper make their point with reference to data and analyses thereof. I believe that this is a big improvement on the state of the replicability debate of a few years back when it was dominated by less substantiated opinions. Moreover, they share their key data and some analysis code, following good scientific practice. Still, I am not convinced by their argument. Here’s why:

1) No full engagement with the opposite side of the argument

Van Bavel et al.’s (2016) suggested influence of replication contexts on replication success cannot explain the following patterns in the data set they used (Open Science Collaboration, 2015):

a) replication effect sizes are mostly lower than original effect sizes. Effects might well “vary by [replication] context” (p. 2) but why the consistent reduction in effect size when replicating an effect?

b) internal conceptual replications are not related to independent replication success (Kunert, 2016). This goes directly against Van Bavel et al.’s (2016) suggestion that “conceptual replications can even improve the probability of successful replications” (p. 5).

c) why are most original effects just barely statistically significant (see previous blog post)?

I believe that all three patterns point to some combination of questionable research practices affecting the original studies. Nothing in Van Bavel et al.’s (2016) article manages to convince me otherwise.

2) The central result completely depends on how you define ‘replication success’

The central claim of the article is based on the correlation between one measure of replication success (subjective judgment by replication team of whether replication was successful) and one measure of the contextual sensitivity of a replicated effect. While the strength of the association (r = -.23) is statistically significant (p = .024), it doesn’t actually provide convincing evidence for either the null or the alternative hypothesis according to a standard Bayesian JZS correlation test (BF01 = 1). [For all analyses: R-code below.]

Moreover, another measure of replication success (reduction of effect size between original and replication study) is so weakly correlated with the contextual sensitivity variable (r = -.01) as to provide strong evidence for a lack of association between contextual sensitivity and replication success (BF01 = 12, notice that even the direction of the correlation is in the wrong direction according to Van Bavel et al.’s (2016) account).

[Figure: Bevel_figure]

[Update: The corresponding values for the other measures of replication success are: replication p < .05 (r = -0.18; p = .0721; BF01 = 2.5), original effect size in 95%CI of replication effect size (r = -.3, p = .0032, BF10 = 6). I could not locate the data column for whether the meta-analytic effect size is different from zero.]

3) The contextual sensitivity variable could be confounded

How do we know which original effects were plagued by hidden moderators (i.e. by unknown context sensitivity) if, well, these moderators are hidden? Three of the authors of the article simply rated all replicated studies for contextual sensitivity without knowing each study’s replication status (although the overall replication results were already public at that point). The authors provide evidence that the ratings are reliable, but no one knows whether they are valid.

For example, the raters tried not to be influenced by ‘whether the specific replication attempt in question would succeed’ (p. 2). Still, all raters knew they would benefit (in the form of a prestigious publication) from a significant association between their ratings and replication success. How do we know that the ratings do not simply reflect some sort of implicit replicability doubt? From another PNAS study (Dreber et al., 2015) we know that scientists can predict replication success before a replication study is run.

Revealing hidden moderators

My problem with the contextual sensitivity account claiming that unknown moderators are to blame for replication failures is not so much that it is an unlikely explanation. I agree with Van Bavel et al. (2016) that some psychological phenomena are more sensitive to replication contexts than others. I would equally welcome it if scientific authors were more cautious in generalising their results.

My problem is that this account is so general as to be nearly unfalsifiable, and an unfalsifiable account is scientifically useless. Somehow unknown moderators always get invoked once a replication attempt has failed. All sorts of wild claims could be retrospectively claimed to be true within the context of the original finding.

In short: a convincing claim that contextual factors are to blame for replication failures needs to reveal the crucial replication contexts and then show that they indeed influence replication success. The proof of the unknown pudding is in the eating.

— — —
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research Proceedings of the National Academy of Sciences, 112 (50), 15343-15347 DOI: 10.1073/pnas.1516179112

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success Psychonomic Bulletin & Review DOI: 10.3758/s13423-016-1030-9

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

Van Bavel, J.J., Mende-Siedlecki, P., Brady, W.J., & Reinero, D.A. (2016). Contextual sensitivity in scientific reproducibility PNAS
— — —

########################################################################################################
# Script for article "A critical comment on "Contextual sensitivity in scientific reproducibility""    #
# Submitted to Brain's Idea                                                                            #
# Responsible for this file: R. Kunert (rikunert@gmail.com)                                            # 
########################################################################################################   
 
# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))
 
if(!require(BayesMed)){install.packages('BayesMed')} #Bayesian analysis of correlation
library(BayesMed)
 
if(!require(Hmisc)){install.packages('Hmisc')} #correlations
library(Hmisc)
 
if(!require(reshape2)){install.packages('reshape2')}#melt function
library(reshape2)
 
if(!require(grid)){install.packages('grid')} #arranging figures
library(grid)
 
if(!require(MASS)){install.packages('MASS')} #rlm() used by stat_smooth(method = 'rlm') below
library(MASS)
 
#get raw data from OSF website
info <- GET('https://osf.io/pra2u/?action=download', write_disk('rpp_Bevel_data.csv', overwrite = TRUE)) #downloads data file from the OSF
RPPdata <- read.csv("rpp_Bevel_data.csv")[1:100, ]
colnames(RPPdata)[1] <- "ID" # Change first column name
 
#------------------------------------------------------------------------------------------------------------
#2) The central result completely depends on how you define 'replication success'----------------------------
 
#replication with subjective judgment of whether it replicated
rcorr(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary, type = 'spearman')
#As far as I know there is currently no Bayesian Spearman rank correlation analysis. Therefore, use standard correlation analysis with raw and ranked data and hope that the result is similar.
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary)#parametric Bayes factor test
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C), rank(RPPdata$Replicate_Binary))#parametric Bayes factor test
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
 
#replication with effect size reduction
rcorr(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)], RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)], type = 'spearman')
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)], RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)])
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)]), rank(RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)]))
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
 
#------------------------------------------------------------------------------------------------------------
#Figure 1----------------------------------------------------------------------------------------------------
 
#general look
theme_set(theme_bw(12)+#remove gray background, set font-size
            theme(axis.line = element_line(colour = "black"),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.background = element_blank(),
                  panel.border = element_blank(),
                  legend.title = element_blank(),
                  legend.key = element_blank(),
                  legend.position = "top",
                  legend.direction = 'vertical'))
 
#Panel A: replication success measure = binary replication team judgment
dat_box = melt(data.frame(dat = c(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1],
                                  RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0]),
                          replication_status = c(rep('replicated', sum(RPPdata$Replicate_Binary == 1)),
                                                 rep('not replicated', sum(RPPdata$Replicate_Binary == 0)))),
               id = c('replication_status'))
 
#draw basic box plot
plot_box = ggplot(dat_box, aes(x=replication_status, y=value)) +
  geom_boxplot(size = 1.2,#line size
               alpha = 0.3,#transparency of fill colour
               width = 0.8,#box width
               notch = T, notchwidth = 0.8,#notch setting               
               show_guide = F,#do not show legend
               fill='black', color='grey40') +  
  labs(x = "Replication status", y = "Context sensitivity score")#axis titles
 
#add mean values and rhythm effect lines to box plot
 
#prepare data frame
dat_sum = melt(data.frame(dat = c(mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1]),
                                  mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0])),
                          replication_status = c('replicated', 'not replicated')),
               id = 'replication_status')
 
#add mean values
plot_box = plot_box +
  geom_line(data = dat_sum, mapping = aes(y = value, group = 1),
            size= c(1.5), color = 'grey40')+
  geom_point(data = dat_sum, size=12, shape=20,#dot rim
             fill = 'grey40',
             color = 'grey40') +
  geom_point(data = dat_sum, size=6, shape=20,#dot fill
             fill = 'black',
             color = 'black')
plot_box
 
#Panel B: replication success measure = effect size reduction
dat_corr = data.frame("x" = RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)],
                      "y" = RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)])#plotted data
 
plot_corr = ggplot(dat_corr, aes(x = x, y = y))+
  geom_point(size = 2) +#add points
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression")) +
  stat_smooth(method = "rlm", size = 1, se = FALSE,
              aes(colour = "robust regression")) +
  labs(x = "Effect size reduction (original - replication)", y = "Contextual sensitivity score") +#axis labels
  scale_color_grey()+#colour scale for lines
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression"),
              lty = 2)
plot_corr
 
#arrange figure with both panels
multi.PLOT(plot_box + ggtitle("Replication success = replication team judgment"),
           plot_corr + ggtitle("Replication success = effect size stability"),
           cols=2)


Is Replicability in Economics better than in Psychology?

Colin Camerer and colleagues recently published a Science article on the replicability of behavioural economics. ‘It appears that there is some difference in replication success’ between psychology and economics, they write, given their reproducibility rate of 61% and psychology’s of 36%. I took a closer look at the data to find out whether there really are any substantial differences between fields.

Commenting on the replication success rates in psychology and economics, Colin Camerer is quoted as saying: “It is like a grade of B+ for psychology versus A– for economics.” Unsurprisingly, his team’s Science paper also includes speculation as to what contributes to economics’ “relatively good replication success”. However, such speculation is premature as it is not established whether economics actually displays better replicability than the only other research field which has tried to estimate its replicability (that would be psychology). Let’s check the numbers in Figure 1.

[Figure: RPP_EERP_replicability]

Figure 1. Replicability in economics and psychology. Panel A displays replication p-values of originally significant effects. Note that the bottom 25% quartile is at p = .001 and p = .0047 respectively and, thus, not visible here. Panel B displays the effect size reduction from original to replication study. Violin plots display density, i.e. thicker parts represent more data points.

Looking at the left panel of Figure 1, you will notice that the p-values of the replication studies in economics tend to be lower than in psychology, indicating that economics is more replicable. In order to formally test this, I define a replication success as p < .05 (the typical threshold for proclaiming that an effect was found) and count successes in both data sets. In economics, there are 11 successes and 7 failures. In psychology, there are 34 successes and 58 failures. When comparing these proportions formally with a Bayesian contingency table test, the resulting Bayes factor of BF10 = 1.77 indicates that the evidence for a replicability difference between economics and psychology is barely worth mentioning. Otherwise said, the replicability projects in economics and psychology were too small to tell whether one field produces more replicable effects than the other.

However, a different measure of replicability which doesn’t depend on an arbitrary cut-off at p = .05 might give a clearer picture. Figure 1’s right panel displays the difference between the effect sizes reported in the original publications and those observed by the replication teams. You will notice that most effect size differences are negative, i.e. when replicating an experiment you will probably observe a (much) smaller effect compared to what you read in the original paper. For a junior researcher like me this is an endless source of self-doubt and frustration.

Are effect sizes more similar between original and replication studies in economics compared to psychology? Figure 1B doesn’t really suggest that there is a huge difference. The Bayes factor of a Bayesian t-test comparing the right and left distributions of Figure 1B supports this impression. The null hypothesis of no difference is favored BF01 = 3.82 times more than the alternative hypothesis of a difference (or BF01 = 3.22 if you are an expert and insist on using Cohen’s q). In Table 1, I give some more information for the expert reader.

The take-home message is that there is not enough information to claim that economics displays better replicability than psychology. This also means that psychologists cannot simply adopt whatever factors are speculated to drive the replication success in economics. Instead, we should look elsewhere for inspiration: the simulations of different research practices showing time and again what leads to high replicability (big sample sizes, pre-registration, …) and what does not (publication bias, questionable research practices, …). For the moment, psychologists should not look towards economists to find role models of replicability.

Table 1. Comparison of Replicability in Economics and Psychology.

| Measure | Economics | Psychology | Bayes Factor (a) | Posterior median [95% Credible Interval] (1) |
| --- | --- | --- | --- | --- |
| Independent replications p < .05 | 11 out of 18 | 34 out of 92 | BF10 = 1.77 | 0.95 [-0.03; 1.99] |
| Effect size reduction (simple subtraction) | M = 0.20 (SD = 0.20) | M = 0.20 (SD = 0.21) | BF01 = 3.82 | 0.02 [-0.12; 0.15] |
| Effect size reduction (Cohen’s q) | M = 0.27 (SD = 0.36) | M = 0.22 (SD = 0.26) | BF01 = 3.22 | 0.03 [-0.15; 0.17] |

(a) Assumes normality. See for yourself whether you believe this assumption is met.

(1) Log odds for proportions. Difference values for quantities.

— — —

Camerer, C., Dreber, A., Forsell, E., Ho, T., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics Science DOI: 10.1126/science.aaf0918

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716
— — —

PS: I am indebted to Alex Etz and EJ Wagenmakers who presented a similar analysis of parts of the data on the OSF website: https://osf.io/p743r/

— — —

R code for reproducing Figure and Table (drop me a line if you find a mistake):

# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))

if(!require(BayesFactor)){install.packages('BayesFactor')} #Bayesian analysis
library(BayesFactor)

if(!require(BEST)){install.packages('BEST')} #distribution overlap
library(BEST)#requires JAGS version 3

if(!require(xlsx)){install.packages('xlsx')} #for reading excel sheets
library(xlsx)

#How many draws are to be taken from posterior distribution for BF and Credible Interval calculations? The more samples the more precise the estimate and the slower the calculation.
draws = 10000 * 10#BayesFactor package standard = 10000

##########################################################################################################################################################################################
#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Figures 1: p-values

#get RPP raw data from OSF website
RPPdata <- get.OSFfile(code='https://osf.io/fgjvw/',dfCln=T)$df
# Select the completed replication studies
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O), !is.na(T.pval.USE.R))

#get EERP raw data from local xls file based on Table S1 of Camerer et al., 2016 (just write me an e-mail if you want it)
EERPdata = read.xlsx("EE_RP_data.xls", 1)

# Restructure the data to "long" format: Study type will be a factor
df1 <- dplyr::select(RPPdata, starts_with("T."))
df <- data.frame(p.value = as.numeric(c(as.character(EERPdata$p_rep),
                                        df1$T.pval.USE.R[df1$T.pval.USE.O < .05])),
                 grp = factor(c(rep("Economics", times = length(EERPdata$p_rep)),
                                rep("Psychology", times = sum(df1$T.pval.USE.O < .05)))))

# Create some variables for plotting
df$grpN <- as.numeric(df$grp)
probs <- seq(0, 1, .25)

# VQP PANEL A: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(df$grpN), function(gr) quantile(round(df$p.value[df$grpN==gr], digits=4), probs, na.rm=T, type=3))
freqs <- ldply(unique(df$grpN), function(gr) table(cut(df$p.value[df$grpN==gr], breaks=qtiles[gr,], na.rm=T, include.lowest=T, right=T)))
labels <- sapply(unique(df$grpN), function(gr) levels(cut(round(df$p.value[df$grpN==gr], digits=4), breaks = qtiles[gr,], na.rm=T, include.lowest=T, right=T)))

# Check the Quantile bins!
Economics <- cbind(freq = as.numeric(t(freqs[1,])))
rownames(Economics) <- labels[,1]
Economics

Psychology <- cbind(freq = as.numeric(t(freqs[2,])))
rownames(Psychology) <- labels[,2]
Psychology

# Get regular violinplot using package ggplot2
g.pv <- ggplot(df, aes(x=grp, y=p.value)) + geom_violin(aes(group=grp), scale="width", color="grey30", fill="grey30", trim=T, adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv, qtiles, probs)
# Garnish
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05), linetype=2) +
  ggtitle("A") + xlab("") + ylab("replication p-value") +
  mytheme
# View
g.pv1

## Uncomment to save panel A as a separate file
# ggsave("RPP_F1_VQPpv.eps", plot=g.pv1)

#calculate counts
sum(as.numeric(as.character(EERPdata$p_rep)) <= .05)#How many economic effects 'worked' upon replication?
sum(as.numeric(as.character(EERPdata$p_rep)) > .05)#How many economic effects 'did not work' upon replication?
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] <= .05)#How many psychological effects 'worked' upon replication?
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] > .05)#How many psychological effects 'did not work' upon replication?

#prepare BayesFactor analysis
data_contingency = matrix(c(sum(as.numeric(as.character(EERPdata$p_rep)) <= .05),#row 1, col 1
                            sum(as.numeric(as.character(EERPdata$p_rep)) > .05),#row 2, col 1
                            sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] <= .05),#row 1, col 2
                            sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] > .05)),#row 2, col 2
                          nrow = 2, ncol = 2, byrow = F)#2 x 2 contingency table: replication success x field
bf = contingencyTableBF(data_contingency, sampleType = "indepMulti", fixedMargin = "cols")#run BayesFactor comparison
sprintf('BF10 = %1.2f', exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log

#Parameter estimation
chains = posterior(bf, iterations = draws)#draw samples from the posterior
odds_ratio = (chains[,"omega[1,1]"] * chains[,"omega[2,2]"]) / (chains[,"omega[2,1]"] * chains[,"omega[1,2]"])
sprintf('Median = %1.2f [%1.2f; %1.2f]',
        median(log(odds_ratio)),#posterior median of the log odds ratio (economics versus psychology)
        quantile(log(odds_ratio), 0.025),#lower edge of the 95% Credible Interval
        quantile(log(odds_ratio), 0.975))#upper edge of the 95% Credible Interval
#plot(mcmc(log(odds_ratio)), main = "Log Odds Ratio")

# VQP PANEL B: reduction in effect size -------------------------------------------------

econ_r_diff = as.numeric(as.character(EERPdata$r_rep)) - as.numeric(as.character(EERPdata$r_orig))
psych_r_diff = as.numeric(df1$T.r.R) - as.numeric(df1$T.r.O)
df <- data.frame(EffectSizeDifference = c(econ_r_diff, psych_r_diff[!is.na(psych_r_diff)]),
                 grp = factor(c(rep("Economics", times = length(econ_r_diff)),
                                rep("Psychology", times = length(psych_r_diff[!is.na(psych_r_diff)])))))

# Create some variables for plotting
df$grpN <- as.numeric(df$grp)
probs <- seq(0, 1, .25)

# Get effect size quantiles and frequencies from data
qtiles <- ldply(unique(df$grpN), function(gr) quantile(df$EffectSizeDifference[df$grpN==gr], probs, na.rm=T, type=3, include.lowest=T))
freqs <- ldply(unique(df$grpN), function(gr) table(cut(df$EffectSizeDifference[df$grpN==gr], breaks=qtiles[gr,], na.rm=T, include.lowest=T)))
labels <- sapply(unique(df$grpN), function(gr) levels(cut(round(df$EffectSizeDifference[df$grpN==gr], digits=4), breaks = qtiles[gr,], na.rm=T, include.lowest=T, right=T)))

# Check the Quantile bins!
Economics <- cbind(freq = as.numeric(t(freqs[1,])))
rownames(Economics) <- labels[,1]
Economics

Psychology <- cbind(freq = as.numeric(t(freqs[2,])))
rownames(Psychology) <- labels[,2]
Psychology

# Get regular violinplot using package ggplot2
g.es <- ggplot(df, aes(x=grp, y=EffectSizeDifference)) +
  geom_violin(aes(group=grpN), scale="width", fill="grey40", color="grey40", trim=T, adjust=1)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es, qtiles=qtiles, probs=probs)
# Garnish
g.es1 <- g.es0 +
  ggtitle("B") + xlab("") + ylab("Replicated - Original Effect Size r") +
  scale_y_continuous(breaks=c(-.25, -.5, -0.75, -1, 0, .25, .5, .75, 1), limits=c(-1, 0.5)) + mytheme
# View
g.es1

# # Uncomment to save panel B as a separate file
# ggsave("RPP_F1_VQPes.eps", plot=g.es1)

# VIEW panels in one plot using the multi.PLOT() function from C-3PR
multi.PLOT(g.pv1, g.es1, cols=2)

# SAVE combined plots as PDF
pdf("RPP_Figure1_vioQtile.pdf", pagecentre=T, width=20, height=8, paper = "special")
multi.PLOT(g.pv1, g.es1, cols=2)
dev.off()

#Effect Size Reduction (simple subtraction)-------------------------------------------------

#calculate means and standard deviations
mean(econ_r_diff)#mean ES reduction of economic effects
sd(econ_r_diff)#Standard Deviation of ES reduction of economic effects
mean(psych_r_diff[!is.na(psych_r_diff)])#mean ES reduction of psychological effects
sd(psych_r_diff[!is.na(psych_r_diff)])#Standard Deviation of ES reduction of psychological effects

#perform BayesFactor analysis
bf = ttestBF(formula = EffectSizeDifference ~ grp, data = df)#Bayesian t-test comparing ES reduction between fields
sprintf('BF01 = %1.2f', 1/exp(bf@bayesFactor$bf[1]))#exponentiate BF10 because stored as natural log, turn into BF01

##Parameter estimation: use BEST package to estimate posterior median and 95% Credible Interval
BESTout = BESTmcmc(econ_r_diff,
                   psych_r_diff[!is.na(psych_r_diff)],
                   priors=NULL, parallel=FALSE)
#plotAll(BESTout)
sprintf('Median = %1.2f [%1.2f; %1.2f]',
        median(BESTout$mu1 - BESTout$mu2),#posterior median of the group difference (economics minus psychology)
        quantile(BESTout$mu1 - BESTout$mu2, 0.025),#lower edge of the 95% Credible Interval
        quantile(BESTout$mu1 - BESTout$mu2, 0.975))#upper edge of the 95% Credible Interval

#Effect Size Reduction (Cohen's q)-------------------------------------------------

#prepare function to calculate Cohen's q (difference between Fisher z-transformed correlations)
Cohenq <- function(r1, r2) {
  fis_r1 = 0.5 * (log((1+r1)/(1-r1)))
  fis_r2 = 0.5 * (log((1+r2)/(1-r2)))
  fis_r1 - fis_r2
}

#calculate means and standard deviations
econ_Cohen_q = Cohenq(as.numeric(as.character(EERPdata$r_rep)), as.numeric(as.character(EERPdata$r_orig)))
psych_Cohen_q = Cohenq(as.numeric(df1$T.r.R), as.numeric(df1$T.r.O))
mean(econ_Cohen_q)#mean ES reduction of economic effects
sd(econ_Cohen_q)#Standard Deviation of ES reduction of economic effects
mean(psych_Cohen_q[!is.na(psych_Cohen_q)])#mean ES reduction of psychological effects
sd(psych_Cohen_q[!is.na(psych_Cohen_q)])#Standard Deviation of ES reduction of psychological effects

#perform BayesFactor analysis
dat_bf <- data.frame(EffectSizeDifference = c(econ_Cohen_q,
                                              psych_Cohen_q[!is.na(psych_Cohen_q)]),
                     grp = factor(c(rep("Economics", times = length(econ_Cohen_q)),
                                    rep("Psychology", times = length(psych_Cohen_q[!is.na(psych_Cohen_q)])))))#prepare BayesFactor analysis
bf = ttestBF(formula = EffectSizeDifference ~ grp, data = dat_bf)#Bayesian t-test comparing Cohen's q between fields
sprintf('BF01 = %1.2f', 1/exp(bf@bayesFactor$bf[1]))#exponentiate BF10 because stored as natural log, turn into BF01

#Parameter estimation: use BEST package to estimate posterior median and 95% Credible Interval
BESTout = BESTmcmc(econ_Cohen_q,
                   psych_Cohen_q[!is.na(psych_Cohen_q)],
                   priors=NULL, parallel=FALSE)
#plotAll(BESTout)
sprintf('Median = %1.2f [%1.2f; %1.2f]',
        median(BESTout$mu1 - BESTout$mu2),#posterior median of the group difference (economics minus psychology)
        quantile(BESTout$mu1 - BESTout$mu2, 0.025),#lower edge of the 95% Credible Interval
        quantile(BESTout$mu1 - BESTout$mu2, 0.975))#upper edge of the 95% Credible Interval

Are pre-registrations the solution to the replication crisis in Psychology?

Most psychology findings are not replicable. What can be done? In his Psychological Science editorial, Stephen Lindsay advertises pre-registration as a solution, writing that “Personally, I aim never again to submit for publication a report of a study that was not preregistered”. I took a look at whether pre-registrations are effective and feasible [TL;DR: maybe and possibly].

[I updated the blog post using comments by Cortex editor Chris Chambers, see below for full comments. It turns out that many of my concerns have already been addressed. Updates in square brackets.]

A recent study published in Science found that the majority of Psychological research cannot be reproduced by independent replication teams (Open Science Collaboration, 2015). I believe that this is due to questionable research practices (LINK) and that internal replications are no solution to this problem (LINK). However, might pre-registrations be the solution? I don’t think so. The reason why I am pessimistic is three-fold.

What is a pre-registration? A pre-registered study submits its design and analysis before data is acquired. After data acquisition the pre-registered data analysis plan is executed and the results can confidently be labelled confirmatory (i.e. more believable). Analyses not specified before are labelled exploratory (i.e. less believable). Some journals offer peer-review of the pre-registration document. Once it has been approved, the chances of the journal accepting a manuscript based on the proposed design and analysis are supposedly very high. [Chris Chambers: “for more info on RRs see https://osf.io/8mpji/wiki/home/”]

 

1) Pre-registration does not remove all incentives to employ questionable research practices

Pre-registrations should enforce honesty about post hoc changes in the design/analysis. Ironically, the efficacy of pre-registrations is itself dependent on the honesty of researchers. The reason is simple: including the information that an experiment was pre-registered is optional. So, if the planned analysis is optimal, a researcher can boost its impact by revealing that the entire experiment was pre-registered. If not, s/he deletes the pre-registration document and proceeds as if it had never existed, a novel questionable research practice (anyone want to invent a name for it? Optional forgetting?).

Defenders of pre-registration could counter that peer-reviewed pre-registrations are different because there is no incentive to deviate from the planned design/analysis. Publication is guaranteed if the pre-registered study is executed as promised. However, two motives remove this publication advantage:

1a) the credibility boost of presenting a successful post hoc design or analysis decision as a priori can still be achieved by publishing the paper in a different journal which is unaware of the pre-registration document.

1b) the credibility loss of a wider research agenda due to a single unsuccessful experiment can still be avoided by simply withdrawing the study from the journal and forgetting about it.

The take-home message is that one can opt-in and out of pre-registration as one pleases. The maximal cost is the rejection of one peer-reviewed pre-registered paper at one journal. Given that paper rejection is the most normal thing in the world for a scientist these days, this threat is not effective.

[Chris Chambers: “all pre-registrations made now on the OSF become public within 4 years – so as far as I understand, it is no longer possible to register privately and thus game the system in the way you describe, at least on the OSF.”]

2) Pre-registrations did not clean up other research fields

Note that the argument so far assumes that when the pre-registration document is revealed, it is effective in stopping undisclosed post hoc design/analysis decisions. The medical sciences, in which randomized controlled trials have had to be pre-registered since a 2004 decision by journal editors, teach us that this is not so. There are four aspects to this surprising ineffectiveness of pre-registrations:

2a) Many pre-registered studies are not published. For example, Chan et al. (2004a,b) could not locate the publications of 54% – 63% of the pre-registered studies. It’s possible that this is due to the aforementioned publication bias (see 1b above), or other reasons (lack of funding, manuscript under review…).

2b) Medical authors feel free to frequently deviate from their planned designs/analyses. For example 31% – 62% of randomized controlled trials changed at least one primary outcome between pre-registration and publication (Mathieu et al., 2009; Chan et al., 2004a,b). If you thought that psychological scientists are somehow better than medical ones, early indications are that this is not so (Franco et al., 2015).

[Figure: pre-registration deviations in psych science]

2c) Deviations from pre-registered designs/analyses are not discovered because 66% of journal reviewers do not consult the pre-registration document (Mathieu et al., 2013).

2d) In the medical sciences pre-registration documents are usually not peer-reviewed and quite often sloppy. For example, Mathieu et al., (2013) found 37% of trials to be post-registered (the pointless exercise of registering a study which has already taken place), and 17% of pre-registrations being too imprecise to be useful.

[Chris Chambers: “The concerns raised by others about reviewers not checking protocols apply to clinical trial registries but this is moot for RRs because checking happens at an editorial level (if not at both an editorial and reviewer level) and there is continuity of the review process from protocol through to study completion.”]

3) Pre-registration is a practical nightmare for early career researchers

Now, one might argue that pre-registering is still better than not pre-registering. In terms of non-peer-reviewed pre-registration documents, this is certainly true. However, their value is limited because they can be written so vaguely as to be useless (see 2d) and they can simply be deleted if they ‘stand in the way of a good story’, i.e. if an exploratory design/analysis choice gets reported as confirmatory (see 1a).

The story is different for peer-reviewed pre-registrations. They are impractical because of one factor which tenured decision makers sometimes forget: time. Most research is done by junior scientists who have temporary contracts running anywhere between a few months and five years [reference needed]. These people cannot wait for a peer-review decision which, on average, takes something like one year and ten months (Nosek & Bar-Anan, 2012). This is the submission-to-publication-time distribution for one prominent researcher (Brian Nosek):

[Figure: hist_publication_times]

What does this mean? As a case study, let’s take Richard Kunert, a fine specimen of a junior researcher, who was given three years of funding by the Max-Planck-Gesellschaft in order to obtain a PhD. Given Brian Nosek’s experience with his articles, and assuming Richard submits three pre-registration documents on day 1 of his 3-year PhD, each individual document has an 84.6% chance of being accepted within three years. The chance that all three will be accepted is 60.6% (0.846^3). This scenario is obviously unrealistic because it leaves no time for setting up the studies and for actually carrying them out.

For the more realistic case of one year of piloting and one year of actually carrying out the studies, Richard has a 2.2% chance (0.282^3) that all three studies are peer-reviewed at the pre-registration stage and published. However, Richard is not silly (or so I have heard), so he submits 5 studies, hoping that at least three of them will eventually be carried out. In this case he has a 14% chance that at least three studies are peer-reviewed at the pre-registration stage and published. Only if Richard submits 10 or more pre-registration documents for peer-review after 1 year of piloting does he have a more than 50% chance of being left with at least 3 studies to carry out within 1 year.
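For anyone who wants to check these numbers, the binomial arithmetic is easy to reproduce. A minimal sketch in R, using the per-document acceptance probabilities quoted above (variable names are mine):

# Probability that a single pre-registration is accepted in time (taken from the text above)
p_3yr <- 0.846   # accepted within the full three years
p_1yr <- 0.282   # accepted within the one year left before the studies must start

p_3yr^3                                 # all 3 of 3 accepted within three years: ~0.61
p_1yr^3                                 # all 3 of 3 accepted in the realistic scenario: ~0.02
1 - pbinom(2, size = 5,  prob = p_1yr)  # at least 3 of 5 accepted: ~0.14
1 - pbinom(2, size = 10, prob = p_1yr)  # at least 3 of 10 accepted: ~0.56, i.e. just over 50%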

For all people who hate numbers, let me put it into plain words. Peer-review is so slow that requiring PhD students to only perform pre-registered studies means the overwhelming majority of PhD students will fail their PhD requirements in their funded time. In this scenario cutting-edge, world-leading science will be done by people flipping burgers to pay the rent because funding ran out too quickly.

[Chris Chambers: “Average decision times from Cortex, not including time taken by authors to make revisions: initial trial = 5 days; Stage 1 provisional acceptance = 9 weeks (1-3 rounds of in-depth review); Stage 2 full acceptance = 4 weeks”]

What to do

The arrival of pre-registration in the field of Psychology is undoubtedly a good sign for science. However, given what we know now, no one should be under the illusion that this instrument is the solution to the replication crisis which psychological researchers are facing. At the most, it is a tiny piece of a wider strategy to make Psychology what it has long claimed to be: a robust, evidence based, scientific enterprise.

 

[Please do yourself a favour and read the comments below. You won’t get better people commenting than this.]

— — —

Chan AW, Krleza-Jerić K, Schmid I, & Altman DG (2004). Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ : Canadian Medical Association journal = journal de l’Association medicale canadienne, 171 (7), 735-40 PMID: 15451835

Chan, A., Hróbjartsson, A., Haahr, M., Gøtzsche, P., & Altman, D. (2004). Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials JAMA, 291 (20) DOI: 10.1001/jama.291.20.2457

Franco, A., Malhotra, N., & Simonovits, G. (2015). Underreporting in Psychology Experiments: Evidence From a Study Registry Social Psychological and Personality Science DOI: 10.1177/1948550615598377

Lindsay, D. (2015). Replication in Psychological Science Psychological Science DOI: 10.1177/0956797615616374

Mathieu, S., Boutron, I., Moher, D., Altman, D.G., & Ravaud, P. (2009). Comparison of Registered and Published Primary Outcomes in Randomized Controlled Trials JAMA, 302 (9) DOI: 10.1001/jama.2009.1242

Mathieu, S., Chan, A., & Ravaud, P. (2013). Use of Trial Register Information during the Peer Review Process PLoS ONE, 8 (4) DOI: 10.1371/journal.pone.0059910

Nosek, B., & Bar-Anan, Y. (2012). Scientific Utopia: I. Opening Scientific Communication Psychological Inquiry, 23 (3), 217-243 DOI: 10.1080/1047840X.2012.692215

Open Science Collaboration (2015). PSYCHOLOGY. Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443
— — —

Correcting for Human Researchers – The Rediscovery of Replication

We need to control for this.

You may have missed some of the discussion on fraud, errors and biases shaking the scientific community of late, so I will quickly bring you up to speed.
Firstly, a series of fraud cases (Ruggiero, Hauser, Stapel) in Psychology and related fields makes everyone wonder why only internal whistleblowers ever discover major fraud cases like these.
Secondly, a well regarded journal publishes an article by Daryl Bem (2011) claiming that we can feel the future. Wagenmakers et al. (2011) apply a different statistical analysis and claim that Bem’s evidence for precognition is so weak as to be meaningless. The debate continues. Meanwhile a related failed replication paper claims to have trouble getting published.
Thirdly, John Bargh criticises everyone involved in a failed replication of an effect he is particularly well known for. He criticises the experimenters, the journal, even a blogger who wrote about it.
This all happened within the last year and suddenly everyone speaks about replication. Ed Yong wrote about it in Nature, the Psychologist had a special issue on it, some researchers set up a big replication project, the blogosphere goes crazy with it.
Some may wonder why replication was singled out as the big issue. Isn’t this about the ruthless, immoral energy of fraudsters? Or about publishers’ craving for articles that create buzz? Or about a researcher’s taste for scandal? Perhaps it is indeed about a series of individual problems related to human nature. But the solution is still a systemic one: replication. It is the only way of overcoming the unfortunate fact that science is only done by mere humans.
This may surprise some people because replication is not done all that much. And the way researchers get rewarded for their work totally goes against doing replications. The field carries on as if there were procedures, techniques and analyses that overcome the need for replication. The most common of which is inferential hypothesis testing.
This way of analysing your data simply asks whether any differences found among the people who were studied would hold up in the population at large. If so, the difference is said to be ‘statistically significant’. Usually, this is boiled down to a p-value, which reports how likely a difference at least as large as the observed one would be if, in truth, no difference existed in the population at all. So, imagine that women and men were in truth equally intelligent (I have no idea whether they are). With the conventional 5% significance threshold, inferential hypothesis testing will nonetheless declare a significant difference between male and female IQs in 5% of experiments. This difference won’t be replicated by the other 95% of experiments.
And this is where replication comes in: the p-value can be thought of as a prediction of how likely failed replications of an effect will be. Needless to say that a prediction is a poor substitute for the real thing.
This was brought home to me by Luck in his great book An Introduction to the Event-Related Potential Technique (2005, p. 251). He basically says that replication is the only approach in science which is not based on assumptions needed to run the aforementioned statistical analyses.
Replication does not depend on assumptions about normality, sphericity, or independence. Replication is not distorted by outliers. Replication is a cornerstone of science. Replication is the best statistic.
In other words, it is the only way of overcoming the human factor involved in choosing how to get to a p-value. You can disagree on many things, but not on the implication of a straight replication. If the effect is consistently replicated, it is real.
For example, Simmons and colleagues (2011) report that researchers can easily tweak their data without anyone knowing. This is not really fraud, but it is not something you want to admit either. Combining four ways of tweaking the statistical analysis towards a significant result – which is desirable for publication – produced ‘statistically significant’ differences with a non-replication likelihood of about 60%. Now, this wouldn’t be a problem if anyone actually bothered to do a replication – including the exact same tweaks to the data. It is very likely that the effect wouldn’t hold up.
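To make the mechanism concrete, here is a toy simulation of just one such researcher degree of freedom: measuring two correlated dependent variables and reporting whichever comes out significant. The numbers and variable names are mine for illustration; they are not taken from Simmons et al. (2011):

# Toy simulation: two groups with no true difference, two correlated DVs, report the better p-value
set.seed(123)
n_sims <- 10000
n <- 20                                            # participants per group
false_pos <- replicate(n_sims, {
  dv1 <- rnorm(2 * n)                              # no true effect on either DV
  dv2 <- 0.5 * dv1 + sqrt(0.75) * rnorm(2 * n)     # second DV, correlated ~.5 with the first
  idx <- rep(c(TRUE, FALSE), each = n)             # arbitrary split into two equal groups
  p1 <- t.test(dv1[idx], dv1[!idx])$p.value
  p2 <- t.test(dv2[idx], dv2[!idx])$p.value
  min(p1, p2) < .05                                # 'significant' if either DV works out
})
mean(false_pos)                                    # clearly above the nominal 0.05 (roughly 0.08-0.10)

Stacking several such degrees of freedom on top of each other is what pushes the false-positive rate towards the 60% figure mentioned above.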
Many people believe that this is what really happened with Bem’s pre-cognition results. They are perhaps not fraudulent, but the way they were analysed and reported inflated the chances of finding effects which are not real. Similarly, replication is what did not happen with Stapel and other fraudsters. My guess is that if anyone had actually bothered to replicate, it would have become clear that Stapel has a history of unreplicability (see my earlier blog post about the Stapel affair for clues).
So, if we continue to let humans do research, we have to address the weakness inherent in this approach. Replication is the only solution we know of.

———————————————————————–

Bem, D.J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology, 100, 407-425. DOI: 10.1037/a0021524
Luck, S.J. (2005). An Introduction to the Event-Related Potential Technique. London: MIT Press.
Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22, 1359-1366. DOI: 10.1177/0956797611417632
Wagenmakers, E.J., Wetzels, R., Borsboom, D., van der Maas, H. (2011). Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi. Journal of Personality and Social Psychology, 100, 426-432. doi: 10.1037/a0022790