
A critical comment on “Contextual sensitivity in scientific reproducibility”

Psychological science is surprisingly difficult to replicate (Open Science Collaboration, 2015). Researchers are desperate to find out why. A new study in the prestigious journal PNAS (Van Bavel et al., 2016) claims that unknown contextual factors of psychological phenomena (“hidden moderators”) are to blame. The more an effect is sensitive to unknown contextual factors, the less likely a successful replication is. In this blog post I will lay out why I am not convinced by this argument.

Before I start I should say that I really appreciate that the authors of this paper make their point with reference to data and analyses thereof. I believe that this is a big improvement on the state of the replicability debate of a few years back when it was dominated by less substantiated opinions. Moreover, they share their key data and some analysis code, following good scientific practice. Still, I am not convinced by their argument. Here’s why:

1) No full engagement with the opposite side of the argument

Van Bavel et al.’s (2016) suggested influence of replication contexts on replication success cannot explain the following patterns in the data set they used (Open Science Collaboration, 2015):

a) replication effect sizes are mostly lower than original effect sizes. Effects might well “vary by [replication] context” (p. 2) but why the consistent reduction in effect size when replicating an effect?

b) internal conceptual replications are not related to independent replication success (Kunert, 2016). This goes directly against Van Bavel et al.’s (2016) suggestion that “conceptual replications can even improve the probability of successful replications” (p. 5).

c) why are most original effects just barely statistically significant (see previous blog post)?

I believe that all three patterns point to some combination of questionable research practices affecting the original studies. Nothing in Van Bavel et al.’s (2016) article manages to convince me otherwise.

2) The central result completely depends on how you define ‘replication success’

The central claim of the article is based on the correlation between one measure of replication success (the replication team's subjective judgment of whether the replication was successful) and one measure of the contextual sensitivity of a replicated effect. While the association (r = -.23) is statistically significant (p = .024), it does not provide convincing evidence for either the null or the alternative hypothesis according to a standard Bayesian JZS correlation test (BF01 = 1). [For all analyses: R-code below.]

Moreover, another measure of replication success (the reduction of effect size between original and replication study) is so weakly correlated with the contextual sensitivity variable (r = -.01) as to provide strong evidence for a lack of association between contextual sensitivity and replication success (BF01 = 12; note that the correlation even points in the direction opposite to the one predicted by Van Bavel et al.'s (2016) account).

[Figure: context sensitivity score by replication status (Panel A: replication success = replication team judgment) and against effect size reduction (Panel B: replication success = effect size stability).]

[Update: The corresponding values for the other measures of replication success are: replication p < .05 (r = -.18, p = .0721, BF01 = 2.5); original effect size within the 95% CI of the replication effect size (r = -.3, p = .0032, BF10 = 6). I could not locate the data column indicating whether the meta-analytic effect size differs from zero.]

3) The contextual sensitivity variable could be confounded

How do we know which original effects were plagued by hidden moderators (i.e., by unknown context sensitivity) if, well, these moderators are hidden? Three of the authors of the article simply rated all replicated studies for contextual sensitivity without knowing each individual study's replication status (although the overall replication outcomes were already publicly known at that point). The authors provide evidence that the ratings are reliable, but not that they are valid.

For example, the raters tried not to be influenced by ‘whether the specific replication attempt in question would succeed’ (p. 2). Still, all raters knew they would benefit (in the form of a prestigious publication) from a significant association between their ratings and replication success. How do we know that the ratings do not simply reflect some sort of implicit replicability doubt? From another PNAS study (Dreber et al., 2015) we know that scientists can predict replication success before a replication study is run.

Revealing hidden moderators

My problem with the contextual sensitivity account claiming that unknown moderators are to blame for replication failures is not so much that it is an unlikely explanation. I agree with Van Bavel et al. (2016) that some psychological phenomena are more sensitive to replication contexts than others. I would equally welcome it if scientific authors were more cautious in generalising their results.

My problem is that this account is so general as to be nearly unfalsifiable, and an unfalsifiable account is scientifically useless. Somehow, unknown moderators always get invoked once a replication attempt has failed. All sorts of wild claims could retrospectively be declared true 'within the context of the original finding'.

In short: a convincing claim that contextual factors are to blame for replication failures needs to reveal the crucial replication contexts and then show that they indeed influence replication success. The proof of the unknown pudding is in the eating.

— — —
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research Proceedings of the National Academy of Sciences, 112 (50), 15343-15347 DOI: 10.1073/pnas.1516179112

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success Psychonomic Bulletin & Review DOI: 10.3758/s13423-016-1030-9

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

Van Bavel, J.J., Mende-Siedlecki, P., Brady, W.J., & Reinero, D.A. (2016). Contextual sensitivity in scientific reproducibility PNAS
— — —

########################################################################################################
# Script for article "A critical comment on "Contextual sensitivity in scientific reproducibility""    #
# Submitted to Brain's Idea                                                                            #
# Responsible for this file: R. Kunert (rikunert@gmail.com)                                            # 
########################################################################################################   
 
# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))
 
if(!require(BayesMed)){install.packages('BayesMed')} #Bayesian analysis of correlation
library(BayesMed)
 
if(!require(Hmisc)){install.packages('Hmisc')} #correlations
library(Hmisc)
 
if(!require(reshape2)){install.packages('reshape2')}#melt function
library(reshape2)
 
if(!require(grid)){install.packages('grid')} #arranging figures
library(grid)
 
if(!require(MASS)){install.packages('MASS')} #robust regression (rlm) used in the Figure 1 correlation panel
library(MASS)
 
#get raw data from OSF website
info <- GET('https://osf.io/pra2u/?action=download', write_disk('rpp_Bevel_data.csv', overwrite = TRUE)) #downloads data file from the OSF
RPPdata <- read.csv("rpp_Bevel_data.csv")[1:100, ]
colnames(RPPdata)[1] <- "ID" # Change first column name
 
#------------------------------------------------------------------------------------------------------------
#2) The central result completely depends on how you define 'replication success'----------------------------
 
#replication with subjective judgment of whether it replicated
rcorr(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary, type = 'spearman')
#As far as I know there is currently no Bayesian Spearman rank correlation analysis. Therefore, use standard correlation analysis with raw and ranked data and hope that the result is similar.
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary)#parametric Bayes factor test
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C), rank(RPPdata$Replicate_Binary))#parametric Bayes factor test
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
 
#replication with effect size reduction
rcorr(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)], RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)], type = 'spearman')
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)], RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)])
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)]), rank(RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)]))
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
 
#------------------------------------------------------------------------------------------------------------
#Figure 1----------------------------------------------------------------------------------------------------
 
#general look
theme_set(theme_bw(12)+#remove gray background, set font-size
            theme(axis.line = element_line(colour = "black"),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.background = element_blank(),
                  panel.border = element_blank(),
                  legend.title = element_blank(),
                  legend.key = element_blank(),
                  legend.position = "top",
                  legend.direction = 'vertical'))
 
#Panel A: replication success measure = binary replication team judgment
dat_box = melt(data.frame(dat = c(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1],
                                  RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0]),
                          replication_status = c(rep('replicated', sum(RPPdata$Replicate_Binary == 1)),
                                                 rep('not replicated', sum(RPPdata$Replicate_Binary == 0)))),
               id = c('replication_status'))
 
#draw basic box plot
plot_box = ggplot(dat_box, aes(x=replication_status, y=value)) +
  geom_boxplot(size = 1.2,#line size
               alpha = 0.3,#transparency of fill colour
               width = 0.8,#box width
               notch = T, notchwidth = 0.8,#notch setting               
               show_guide = F,#do not show legend
               fill='black', color='grey40') +  
  labs(x = "Replication status", y = "Context sensitivity score")#axis titles
 
#add mean values and a connecting line to the box plot
 
#prepare data frame
dat_sum = melt(data.frame(dat = c(mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1]),
                                  mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0])),
                          replication_status = c('replicated', 'not replicated')),
               id = 'replication_status')
 
#add mean values
plot_box = plot_box +
  geom_line(data = dat_sum, mapping = aes(y = value, group = 1),
            size= c(1.5), color = 'grey40')+
  geom_point(data = dat_sum, size=12, shape=20,#dot rim
             fill = 'grey40',
             color = 'grey40') +
  geom_point(data = dat_sum, size=6, shape=20,#dot fill
             fill = 'black',
             color = 'black')
plot_box
 
#Panel B: replication success measure = effect size reduction
dat_corr = data.frame("x" = RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)],
                      "y" = RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)])#plotted data
 
plot_corr = ggplot(dat_corr, aes(x = x, y = y))+
  geom_point(size = 2) +#add points
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression")) +
  stat_smooth(method = "rlm", size = 1, se = FALSE,
              aes(colour = "robust regression")) +
  labs(x = "Effect size reduction (original - replication)", y = "Contextual sensitivity score") +#axis labels
  scale_color_grey()+#colour scale for lines
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression"),
              lty = 2)
plot_corr
 
#arrange figure with both panels
multi.PLOT(plot_box + ggtitle("Replication success = replication team judgment"),
           plot_corr + ggtitle("Replication success = effect size stability"),
           cols=2)



Yet more evidence for questionable research practices in original studies of Reproducibility Project: Psychology

The replicability of psychological research is surprisingly low. Why? In this blog post I present new evidence showing that questionable research practices contributed to failures to replicate psychological effects.

Quick recap. A recent publication in Science claims that only around 40% of psychological findings are replicable, based on 100 replication attempts in the Reproducibility Project Psychology (Open Science Collaboration, 2015). A few months later, a critical commentary in the same journal made all sorts of claims, including that the surprisingly low 40% replication success rate is due to replications having been unfaithful to the original studies’ methods (Gilbert et al., 2016). A little while later, I published an article in Psychonomic Bulletin & Review re-analysing the data by the 100 replication teams (Kunert, 2016). I found evidence for questionable research practices being at the heart of failures to replicate, rather than the unfaithfulness of replications to original methods.

However, my previous re-analysis depended on replication teams having done good work. In this blog post I will show that even when just looking at the original studies in the Reproducibility Project: Psychology one cannot fail to notice that questionable research practices were employed by the original discoverers of the effects which often failed to replicate. The reanalysis I will present here is based on the caliper test introduced by Gerber and colleagues (Gerber & Malhotra, 2008; Gerber et al., 2010).

The idea of the caliper test is simple. The research community has decided that an entirely arbitrary threshold of p = 0.05 distinguishes between effects which might just be due to chance (p > 0.05) and effects which are more likely due to something other than chance (p < 0.05). If researchers want to game the system, they slightly rig their methods and analyses to push their p-values just below the arbitrary border between 'statistical fluke' and 'interesting effect'. Alternatively, they just don't publish anything which came up p > 0.05. Such behaviour should lead to an unlikely pile-up of p-values just below 0.05 compared to just above 0.05.
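As a toy illustration of the logic (made-up p-values, not the RPP data; the real analysis follows in the script at the end of this post), the test converts p-values to z-scores and compares the counts just above and just below the significance border:

#Toy caliper test: compare counts just below vs just above the p = .05 border
p_values <- c(0.049, 0.048, 0.047, 0.051, 0.046, 0.044, 0.052, 0.058)  #made-up two-sided p-values
z_values <- qnorm(1 - p_values / 2)        #convert p-values to z-scores
caliper  <- c(1.76, 2.16)                  #10% caliper around the z = 1.96 threshold
over  <- sum(z_values > 1.96 & z_values <= caliper[2])   #just significant
under <- sum(z_values >= caliper[1] & z_values <= 1.96)  #just non-significant
binom.test(over, over + under, p = 0.5)    #unbiased reporting predicts roughly 50/50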

The figure below shows the data of the Reproducibility Project: Psychology. On the horizontal axis I plot z-values which are related to p-values. The higher the z-value the lower the p-value. On the vertical axis I just show how many z-values I found in each range. The dashed vertical line is the arbitrary threshold between p < .05 (significant effects on the right) and p > .05 (non-significant effects on the left).

[Figure: density of z-values for original studies and independent replications; the dashed vertical line marks the p = .05 threshold.]

The independent replications in blue show many z-values left of the dashed line, i.e. replication attempts which were unsuccessful. Otherwise the blue distribution is relatively smooth. There is certainly nothing fishy going on around the arbitrary p = 0.05 threshold. The blue curve looks very much like what I would expect psychological research to be if questionable research practices did not exist.

However, the story is completely different for the green distribution representing the original effects. Just right of the arbitrary p = 0.05 threshold there is a surprising clustering of z-values. It’s as if the human mind magically leads to effects which are just about significant rather than just about not significant. This bump immediately to the right of the dashed line is a clear sign that original authors used questionable research practices. This behaviour renders psychological research unreplicable.

For the expert reader, the formal analysis of the caliper test is shown in the table below using both a Bayesian analysis and a classical frequentist analysis. The conclusion is clear. There is no strong evidence for replication studies failing the caliper test, indicating that questionable research practices were probably not employed. The original studies do not pass the caliper test, indicating that questionable research practices were employed.

 

Caliper | over caliper (significant) | below caliper (non-sign.) | Binomial test | Bayesian proportion test | Posterior median [95% Credible Interval]1

10% caliper (1.76 < z < 1.96 versus 1.96 < z < 2.16)
Original | 9 | 4 | p = 0.267 | BF10 = 1.09 | 0.53 [-0.36; 1.55]
Replication | 3 | 2 | p = 1 | BF01 = 1.30 | 0.18 [-1.00; 1.45]

15% caliper (1.67 < z < 1.96 versus 1.96 < z < 2.25)
Original | 17 | 4 | p = 0.007 | BF10 = 12.9 | 1.07 [0.24; 2.08]
Replication | 4 | 5 | p = 1 | BF01 = 1.54 | -0.13 [-1.18; 0.87]

20% caliper (1.57 < z < 1.96 versus 1.96 < z < 2.35)
Original | 29 | 4 | p < 0.001 | BF10 = 2813 | 1.59 [0.79; 2.58]
Replication | 5 | 5 | p = 1 | BF01 = 1.64 | 0.00 [-0.99; 0.98]

1 Based on 100,000 draws from the posterior distribution of log odds.

 

As far as I know, this is the first analysis showing that data from the original studies of the Reproducibility Project: Psychology point to questionable research practices [I have since been made aware of others, see this comment below]. Instead of sloppy science on the part of independent replication teams, this analysis rather points to original investigators employing questionable research practices. This alone could explain the surprisingly low replication rates in psychology.

Psychology failing the caliper test is by no means a new insight. Huge text-mining analyses have shown that psychology as a whole tends to fail the caliper test (Kühberger et al., 2014; Head et al., 2015). The analysis I have presented here links this result to replicability. If a research field employs questionable research practices (as indicated by the caliper test), then it can no longer claim to deliver insights which stand the replication test (as indicated by the Reproducibility Project: Psychology).

It is time to get rid of questionable research practices. There are enough ideas for how to do so (e.g., Asendorpf et al., 2013; Ioannidis, Munafò, Fusar-Poli, Nosek, & Lakens, 2014). The Reproducibility Project: Psychology shows why there is no time to waste: it is currently very difficult to distinguish an interesting psychological effect from a statistical fluke. I doubt that this state of affairs is what psychological researchers get paid for.

PS: full R-code for recreating all analyses and figures is posted below. If you find mistakes please let me know.

PPS: I am indebted to Jelte Wicherts for pointing me to this analysis.

Update 25/4/2016:

I adjusted the text to clarify that the caliper test cannot distinguish between many different questionable research practices, following a tweet by .

I toned down the language somewhat, following a tweet by .

I added reference to Uli Schimmack’s analysis by linking his comment.

— — —

Asendorpf, J., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J., Fiedler, K., Fiedler, S., Funder, D., Kliegl, R., Nosek, B., Perugini, M., Roberts, B., Schmitt, M., van Aken, M., Weber, H., & Wicherts, J. (2013). Recommendations for Increasing Replicability in Psychology European Journal of Personality, 27 (2), 108-119 DOI: 10.1002/per.1919

Gerber, A., & Malhotra, N. (2008). Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results? Sociological Methods & Research, 37 (1), 3-30 DOI: 10.1177/0049124108318973

Gerber, A., Malhotra, N., Dowling, C., & Doherty, D. (2010). Publication Bias in Two Political Behavior Literatures American Politics Research, 38 (4), 591-613 DOI: 10.1177/1532673X09350979

Gilbert, D., King, G., Pettigrew, S., & Wilson, T. (2016). Comment on “Estimating the reproducibility of psychological science” Science, 351 (6277), 1037-1037 DOI: 10.1126/science.aad7243

Head ML, Holman L, Lanfear R, Kahn AT, & Jennions MD (2015). The extent and consequences of p-hacking in science. PLoS biology, 13 (3) PMID: 25768323

Ioannidis JP, Munafò MR, Fusar-Poli P, Nosek BA, & David SP (2014). Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in cognitive sciences, 18 (5), 235-41 PMID: 24656991

Kühberger A, Fritz A, & Scherndl T (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PloS one, 9 (9) PMID: 25192357

Kunert R (2016). Internal conceptual replications do not increase independent replication success. Psychonomic bulletin & review PMID: 27068542

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

— — —

##################################################################################################################
# Script for article "Questionable research practices in original studies of Reproducibility Project: Psychology"#
# Submitted to Brain's Idea (status: published)                                                                                               #
# Responsible for this file: R. Kunert (rikunert@gmail.com)                                                      # 
##################################################################################################################    
 
##########################################################################################################################################################################################
#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Figure 1: p-value density
 
# source functions
if(!require(httr)){install.packages('httr')}
library(httr)
info <- GET('https://osf.io/b2vn7/?action=download', write_disk('functions.r', overwrite = TRUE)) #downloads data file from the OSF
source('functions.r')
 
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))
 
if(!require(BayesFactor)){install.packages('BayesFactor')} #Bayesian analysis
library(BayesFactor)
 
if(!require(BEST)){install.packages('BEST')} #distribution overlap
library(BEST)#requires JAGS version 3
 
if(!require(RCurl)){install.packages('RCurl')} #
library(RCurl)#
 
#the following few lines are an excerpt of the Reproducibility Project: Psychology's 
# masterscript.R to be found here: https://osf.io/vdnrb/
 
# Read in Tilburg data
info <- GET('https://osf.io/fgjvw/?action=download', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file
 
#for studies with exact p-values
id <- MASTER$ID[!is.na(MASTER$T_pval_USE..O.) & !is.na(MASTER$T_pval_USE..R.)]
 
#FYI: turn p-values into z-scores 
#z = qnorm(1 - (pval/2)) 
 
#prepare data point for plotting
dat_vis <- data.frame(p = c(MASTER$T_pval_USE..O.[id],
                            MASTER$T_pval_USE..R.[id], MASTER$T_pval_USE..R.[id]),
                      z = c(qnorm(1 - (MASTER$T_pval_USE..O.[id]/2)),
                            qnorm(1 - (MASTER$T_pval_USE..R.[id]/2)),
                            qnorm(1 - (MASTER$T_pval_USE..R.[id]/2))),
                      Study_set= c(rep("Original Publications", length(id)),
                                   rep("Independent Replications", length(id)),
                                   rep("zIndependent Replications2", length(id))))
 
#prepare plotting colours etc
cols_emp_in = c("#1E90FF","#088A08")#colour_definitions of area
cols_emp_out = c("#08088a","#0B610B")#colour_definitions of outline
legend_labels = list("Independent\nReplication", "Original\nStudy")
 
#execute actual plotting
density_plot = ggplot(dat_vis, aes(x=z, fill = Study_set, color = Study_set, linetype = Study_set))+
  geom_density(adjust =0.6, size = 1, alpha=1) +  #density plot call
  scale_linetype_manual(values = c(1,1,3)) +#outline line types
  scale_fill_manual(values = c(cols_emp_in,NA), labels = legend_labels)+#specify the to be used colours (area fill)
  scale_color_manual(values = cols_emp_out[c(1,2,1)])+#specify the to be used colours (outline)
  labs(x="z-value", y="Density")+ #add axis titles  
  ggtitle("Reproducibility Project: Psychology") +#add title
  geom_vline(xintercept = qnorm(1 - (0.05/2)), linetype = 2) +
  annotate("text", x = 2.92, y = -0.02, label = "p < .05", vjust = 1, hjust = 1)+
  annotate("text", x = 1.8, y = -0.02, label = "p > .05", vjust = 1, hjust = 1)+
  theme(legend.position="none",#remove legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line  = element_line(colour = "black"),#clean look
        text = element_text(size=18),
        plot.title=element_text(size=30))
density_plot
 
#common legend
#draw a nonsense bar graph which will provide a legend
legend_labels = list("Independent\nReplication", "Original\nStudy")
dat_vis <- data.frame(ric = factor(legend_labels, levels=legend_labels), kun = c(1, 2))
dat_vis$ric = relevel(dat_vis$ric, "Original\nStudy")
nonsense_plot = ggplot(data=dat_vis, aes(x=ric, y=kun, fill=ric)) + 
  geom_bar(stat="identity")+
  scale_fill_manual(values=cols_emp_in[c(2,1)], name=" ") +
  theme(legend.text=element_text(size=18))
#extract legend
tmp <- ggplot_gtable(ggplot_build(nonsense_plot)) 
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") 
leg_plot <- tmp$grobs[[leg]]
#combine plots
grid.arrange(grobs = list(density_plot,leg_plot), ncol = 2, widths = c(2,0.4))
 
#caliper test according to Gerber et al.
 
#turn p-values into z-values
z_o = qnorm(1 - (MASTER$T_pval_USE..O.[id]/2))
z_r = qnorm(1 - (MASTER$T_pval_USE..R.[id]/2))
 
#How many draws are to be taken from posterior distribution for BF and Credible Interval calculations? The more samples the more precise the estimate and the slower the calculation.
draws = 10000 * 10#BayesFactor package standard = 10000
 
#choose one of the calipers (the 10% caliper is active; uncomment another one to change it)
z_c = c(1.76, 2.16)#10% caliper
#z_c = c(1.67, 2.25)#15% caliper
#z_c = c(1.57, 2.35)#20% caliper
 
#calculate counts
print(sprintf('Originals: over caliper N = %d', sum(z_o <= z_c[2] & z_o >= 1.96)))
print(sprintf('Originals: under caliper N = %d', sum(z_o >= z_c[1] & z_o <= 1.96)))
print(sprintf('Replications: over caliper N = %d', sum(z_r <= z_c[2] & z_r >= 1.96)))
print(sprintf('Replications: under caliper N = %d', sum(z_r >= z_c[1] & z_r <= 1.96)))
 
#formal caliper test: originals
#Bayesian analysis
bf = proportionBF(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2)
sprintf('Bayesian test of single proportion: BF10 = %1.2f', exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log
#sample from posterior
samples_o = proportionBF(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2,
                       posterior = TRUE, iterations = draws)
plot(samples_o[,"logodds"])
sprintf('Posterior Median = %1.2f [%1.2f; %1.2f]',
        median(samples_o[,"logodds"]),#Median 
        quantile(samples_o[,"logodds"], 0.025),#Lower edge of 95% Credible Interval
        quantile(samples_o[,"logodds"], 0.975))#Higher edge of 95% Credible Interval
#classical frequentist test
bt = binom.test(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2)
sprintf('Binomial test: p = %1.3f', bt$p.value)#
 
#formal caliper test: replications
bf = proportionBF(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2)
sprintf('Bayesian test of single proportion: BF01 = %1.2f', 1/exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log, turn into BF01
#sample from posterior
samples_r = proportionBF(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2,
                       posterior = TRUE, iterations = draws)
plot(samples_r[,"logodds"])
sprintf('Posterior Median = %1.2f [%1.2f; %1.2f]',
        median(samples_r[,"logodds"]),#Median 
        quantile(samples_r[,"logodds"], 0.025),#Lower edge of 95% Credible Interval
        quantile(samples_r[,"logodds"], 0.975))#Higher edge of 95% Credible Interval
#classical frequentist test
bt = binom.test(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2)
sprintf('Binomial test: p = %1.3f', bt$p.value)#
 
#possibly of interest: overlap of posteriors
#postPriorOverlap(samples_o[,"logodds"], samples_r[,"logodds"])#overlap of distributions


Is Replicability in Economics better than in Psychology?

Colin Camerer and colleagues recently published a Science article on the replicability of behavioural economics. ‘It appears that there is some difference in replication success’ between psychology and economics, they write, given their reproducibility rate of 61% and psychology’s of 36%. I took a closer look at the data to find out whether there really are any substantial differences between fields.

Commenting on the replication success rates in psychology and economics, Colin Camerer is quoted as saying: “It is like a grade of B+ for psychology versus A– for economics.” Unsurprisingly, his team’s Science paper also includes speculation as to what contributes to economics’ “relatively good replication success”. However, such speculation is premature as it is not established whether economics actually displays better replicability than the only other research field which has tried to estimate its replicability (that would be psychology). Let’s check the numbers in Figure 1.

[Figure 1]

Figure 1. Replicability in economics and psychology. Panel A displays replication p-values of originally significant effects. Note that the bottom quartile (25%) lies at p = .001 and p = .0047, respectively, and is thus not visible here. Panel B displays the effect size reduction from original to replication study. Violin plots display density, i.e. thicker parts represent more data points.

Looking at the left panel of Figure 1, you will notice that the p-values of the replication studies in economics tend to be lower than in psychology, suggesting that economics is more replicable. In order to test this formally, I define a replication success as p < .05 (the typical threshold for proclaiming that an effect was found) and count successes in both data sets. In economics, there are 11 successes and 7 failures. In psychology, there are 34 successes and 58 failures. When comparing these proportions with a Bayesian contingency table test, the resulting Bayes factor of BF10 = 1.77 indicates evidence for a replicability difference that is barely worth mentioning. Put differently, the replicability projects in economics and psychology were too small to show that one field produces more replicable effects than the other.
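For the curious, here is a compact sketch of this test with the counts quoted above (the full data pipeline appears in the script at the end of this post; the BayesFactor package is assumed):

#Bayesian contingency table test of replication success (p < .05) by field
library(BayesFactor)
counts <- matrix(c(11, 7,    #Economics: replicated, not replicated
                   34, 58),  #Psychology: replicated, not replicated
                 nrow = 2, byrow = FALSE,
                 dimnames = list(c("replicated", "not replicated"),
                                 c("Economics", "Psychology")))
bf <- contingencyTableBF(counts, sampleType = "indepMulti", fixedMargin = "cols")
exp(bf@bayesFactor$bf)  #BF10 (stored internally as a natural log); reported above as 1.77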

However, a different measure of replicability which doesn’t depend on an arbitrary cut-off at p = .05 might give a clearer picture. Figure 1’s right panel displays the difference between the effect sizes reported in the original publications and those observed by the replication teams. You will notice that most effect size differences are negative, i.e. when replicating an experiment you will probably observe a (much) smaller effect compared to what you read in the original paper. For a junior researcher like me this is an endless source of self-doubt and frustration.

Are effect sizes more similar between original and replication studies in economics compared to psychology? Figure 1B doesn’t really suggest that there is a huge difference. The Bayes factor of a Bayesian t-test comparing the right and left distributions of Figure 1B supports this impression. The null hypothesis of no difference is favored BF01 = 3.82 times more than the alternative hypothesis of a difference (or BF01 = 3.22 if you are an expert and insist on using Cohen’s q). In Table 1, I give some more information for the expert reader.
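For readers unfamiliar with Cohen's q: it is simply the difference between the Fisher r-to-z transformed correlations, as in this small sketch (the full script below defines the same quantity):

#Cohen's q: difference between Fisher r-to-z transformed correlation coefficients
cohen_q <- function(r_original, r_replication) atanh(r_original) - atanh(r_replication)
cohen_q(0.50, 0.30)  #example: original r = .50, replication r = .30 gives q of about 0.24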

The take-home message is that there is not enough information to claim that economics displays better replicability than psychology. This means psychologists cannot simply adopt the factors speculated to underlie economics' replication success. Instead, we should look elsewhere for inspiration: the simulations of different research practices showing time and again what leads to high replicability (big sample sizes, pre-registration, …) and what does not (publication bias, questionable research practices, …). For the moment, psychologists should not look towards economists to find role models of replicability.

Table 1. Comparison of Replicability in Economics and Psychology.

Measure | Economics | Psychology | Bayes Factora | Posterior median [95% Credible Interval]1
Independent replications p < .05 | 11 out of 18 | 34 out of 92 | BF10 = 1.77 | 0.95 [-0.03; 1.99]
Effect size reduction (simple subtraction) | M = 0.20 (SD = 0.20) | M = 0.20 (SD = 0.21) | BF01 = 3.82 | 0.02 [-0.12; 0.15]
Effect size reduction (Cohen's q) | M = 0.27 (SD = 0.36) | M = 0.22 (SD = 0.26) | BF01 = 3.22 | 0.03 [-0.15; 0.17]

a Assumes normality. See for yourself whether you believe this assumption is met.

1 Log odds for proportions. Difference values for quantities.

— — —

Camerer, C., Dreber, A., Forsell, E., Ho, T., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics Science DOI: 10.1126/science.aaf0918

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716
— — —

PS: I am indebted to Alex Etz and EJ Wagenmakers who presented a similar analysis of parts of the data on the OSF website: https://osf.io/p743r/

— — —

R code for reproducing Figure and Table (drop me a line if you find a mistake):

# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))

if(!require(BayesFactor)){install.packages('BayesFactor')} #Bayesian analysis
library(BayesFactor)

if(!require(BEST)){install.packages('BEST')} #distribution overlap
library(BEST)#requires JAGS version 3

if(!require(xlsx)){install.packages('xlsx')} #for reading excel sheets
library(xlsx)

#How many draws are to be taken from posterior distribution for BF and Credible Interval calculations? The more samples the more precise the estimate and the slower the calculation.
draws = 10000 * 10#BayesFactor package standard = 10000

##########################################################################################################################################################################################
#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Figure 1: p-values

#get RPP raw data from OSF website
RPPdata <- get.OSFfile(code='https://osf.io/fgjvw/',dfCln=T)$df
# Select the completed replication studies
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O), !is.na(T.pval.USE.R))

#get EERP raw data from local xls file based on Table S1 of Camerer et al., 2016 (just write me an e-mail if you want it)
EERPdata = read.xlsx("EE_RP_data.xls", 1)

# Restructure the data to "long" format: Study type will be a factor
df1 <- dplyr::select(RPPdata, starts_with("T."))
df <- data.frame(p.value = as.numeric(c(as.character(EERPdata$p_rep),
                                        df1$T.pval.USE.R[df1$T.pval.USE.O < .05])),
                 grp = factor(c(rep("Economics", times = length(EERPdata$p_rep)),
                                rep("Psychology", times = sum(df1$T.pval.USE.O < .05)))))

# Create some variables for plotting
df$grpN <- as.numeric(df$grp)
probs <- seq(0,1,.25)

# Get ggplot2 theme predefined in C-3PR (needed for the plots below)
mytheme <- gg.theme("clean")

# VQP PANEL A: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(df$grpN), function(gr) quantile(round(df$p.value[df$grpN==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(df$grpN), function(gr) table(cut(df$p.value[df$grpN==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(df$grpN), function(gr) levels(cut(round(df$p.value[df$grpN==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Check the Quantile bins!
Economics <- cbind(freq=as.numeric(t(freqs[1,])))
rownames(Economics) <- labels[,1]
Economics

Psychology <- cbind(freq=as.numeric(t(freqs[2,])))
rownames(Psychology) <- labels[,2]
Psychology

# Get regular violinplot using package ggplot2
g.pv <- ggplot(df,aes(x=grp,y=p.value)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("A") + xlab("") + ylab("replication p-value") +
  mytheme
# View
g.pv1

## Uncomment to save panel A as a separate file
# ggsave("RPP_F1_VQPpv.eps",plot=g.pv1)

#calculate counts
sum(as.numeric(as.character(EERPdata$p_rep)) <= .05)#How many economic effects 'worked' upon replication?
sum(as.numeric(as.character(EERPdata$p_rep)) > .05)#How many economic effects 'did not work' upon replication?
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] <= .05)#How many psychological effects 'worked' upon replication?
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] > .05)#How many psychological effects 'did not work' upon replication?

#prepare BayesFactor analysis
data_contingency = matrix(c(sum(as.numeric(as.character(EERPdata$p_rep)) <= .05),#row 1, col 1
                            sum(as.numeric(as.character(EERPdata$p_rep)) > .05),#row 2, col 1
                            sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] <= .05),#row 1, col 2
                            sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] > .05)),#row 2, col 2
                          nrow = 2, ncol = 2, byrow = F)
bf = contingencyTableBF(data_contingency, sampleType = "indepMulti", fixedMargin = "cols")#run BayesFactor comparison
sprintf('BF10 = %1.2f', exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log

#Parameter estimation
chains = posterior(bf, iterations = draws)#draw samples from the posterior
odds_ratio = (chains[,"omega[1,1]"] * chains[,"omega[2,2]"]) / (chains[,"omega[2,1]"] * chains[,"omega[1,2]"])
sprintf('Median = %1.2f [%1.2f; %1.2f]',
        median(log(odds_ratio)),#Median of the log odds ratio (Economics versus Psychology)
        quantile(log(odds_ratio), 0.025),#Lower edge of 95% Credible Interval
        quantile(log(odds_ratio), 0.975))#Higher edge of 95% Credible Interval
#plot(mcmc(log(odds_ratio)), main = "Log Odds Ratio")

# VQP PANEL B: reduction in effect size -------------------------------------------------

econ_r_diff = as.numeric(as.character(EERPdata$r_rep)) - as.numeric(as.character(EERPdata$r_orig))
psych_r_diff = as.numeric(df1$T.r.R) - as.numeric(df1$T.r.O)
df <- data.frame(EffectSizeDifference = c(econ_r_diff, psych_r_diff[!is.na(psych_r_diff)]),
                 grp = factor(c(rep("Economics", times = length(econ_r_diff)),
                                rep("Psychology", times = length(psych_r_diff[!is.na(psych_r_diff)])))))

# Create some variables for plotting
df$grpN <- as.numeric(df$grp)
probs <- seq(0,1,.25)

# Get effect size quantiles and frequencies from data
qtiles <- ldply(unique(df$grpN), function(gr) quantile(df$EffectSizeDifference[df$grpN==gr],probs,na.rm=T,type=3,include.lowest=T))
freqs  <- ldply(unique(df$grpN), function(gr) table(cut(df$EffectSizeDifference[df$grpN==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T)))
labels <- sapply(unique(df$grpN), function(gr) levels(cut(round(df$EffectSizeDifference[df$grpN==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Check the Quantile bins!
Economics <- cbind(freq=as.numeric(t(freqs[1,])))
rownames(Economics) <- labels[,1]
Economics

Psychology <- cbind(freq=as.numeric(t(freqs[2,])))
rownames(Psychology) <- labels[,2]
Psychology

# Get regular violinplot using package ggplot2
g.es <- ggplot(df,aes(x=grp,y=EffectSizeDifference)) +
  geom_violin(aes(group=grpN),scale="width",fill="grey40",color="grey40",trim=T,adjust=1)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es,qtiles=qtiles,probs=probs)
# Garnish
g.es1 <- g.es0 +
  ggtitle("B") + xlab("") + ylab("Replicated - Original Effect Size r") +
  scale_y_continuous(breaks=c(-.25,-.5, -0.75, -1, 0,.25,.5,.75,1),limits=c(-1,0.5)) + mytheme
# View
g.es1

## Uncomment to save panel B as a separate file
# ggsave("RPP_F1_VQPes.eps",plot=g.es1)

# VIEW panels in one plot using the multi.PLOT() function from C-3PR
multi.PLOT(g.pv1,g.es1,cols=2)

# SAVE combined plots as PDF
pdf("RPP_Figure1_vioQtile.pdf",pagecentre=T, width=20,height=8 ,paper = "special")
multi.PLOT(g.pv1,g.es1,cols=2)
dev.off()

#Effect Size Reduction (simple subtraction)-------------------------------------------------

#calculate means and standard deviations
mean(econ_r_diff)#mean ES reduction of economic effects
sd(econ_r_diff)#Standard Deviation of ES reduction of economic effects
mean(psych_r_diff[!is.na(psych_r_diff)])#mean ES reduction of psychological effects
sd(psych_r_diff[!is.na(psych_r_diff)])#Standard Deviation of ES reduction of psychological effects

#perform BayesFactor analysis
bf = ttestBF(formula = EffectSizeDifference ~ grp, data = df)#Bayesian t-test of the difference/similarity between the two fields
sprintf('BF01 = %1.2f', 1/exp(bf@bayesFactor$bf[1]))#exponentiate BF10 because stored as natural log, turn into BF01

##Parameter estimation: use BEST package to estimate posterior median and 95% Credible Interval
BESTout = BESTmcmc(econ_r_diff,
                   psych_r_diff[!is.na(psych_r_diff)],
                   priors=NULL, parallel=FALSE)
#plotAll(BESTout)
sprintf('Median = %1.2f [%1.2f; %1.2f]',
        median(BESTout$mu1 - BESTout$mu2),#Median of the ES reduction difference (Economics minus Psychology)
        quantile(BESTout$mu1 - BESTout$mu2, 0.025),#Lower edge of 95% Credible Interval
        quantile(BESTout$mu1 - BESTout$mu2, 0.975))#Higher edge of 95% Credible Interval

#Effect Size Reduction (Cohen's q)-------------------------------------------------

#prepare function to calculate Cohen's q
Cohenq <- function(r1, r2) {
  fis_r1 = 0.5 * (log((1+r1)/(1-r1)))
  fis_r2 = 0.5 * (log((1+r2)/(1-r2)))
  fis_r1 - fis_r2
}

#calculate means and standard deviations
econ_Cohen_q = Cohenq(as.numeric(as.character(EERPdata$r_rep)), as.numeric(as.character(EERPdata$r_orig)))
psych_Cohen_q = Cohenq(as.numeric(df1$T.r.R), as.numeric(df1$T.r.O))
mean(econ_Cohen_q)#mean ES reduction of economic effects
sd(econ_Cohen_q)#Standard Deviation of ES reduction of economic effects
mean(psych_Cohen_q[!is.na(psych_Cohen_q)])#mean ES reduction of psychological effects
sd(psych_Cohen_q[!is.na(psych_Cohen_q)])#Standard Deviation of ES reduction of psychological effects

#perform BayesFactor analysis
dat_bf <- data.frame(EffectSizeDifference = c(econ_Cohen_q,
                                              psych_Cohen_q[!is.na(psych_Cohen_q)]),
                     grp = factor(c(rep("Economics", times = length(econ_Cohen_q)),
                                    rep("Psychology", times = length(psych_Cohen_q[!is.na(psych_Cohen_q)])))))#prepare BayesFactor analysis
bf = ttestBF(formula = EffectSizeDifference ~ grp, data = dat_bf)#Bayesian t-test of the difference/similarity between the two fields
sprintf('BF01 = %1.2f', 1/exp(bf@bayesFactor$bf[1]))#exponentiate BF10 because stored as natural log, turn into BF01

#Parameter estimation: use BEST package to estimate posterior median and 95% Credible Interval
BESTout = BESTmcmc(econ_Cohen_q,
                   psych_Cohen_q[!is.na(psych_Cohen_q)],
                   priors=NULL, parallel=FALSE)
#plotAll(BESTout)
sprintf('Median = %1.2f [%1.2f; %1.2f]',
        median(BESTout$mu1 - BESTout$mu2),#Median of the Cohen's q difference (Economics minus Psychology)
        quantile(BESTout$mu1 - BESTout$mu2, 0.025),#Lower edge of 95% Credible Interval
        quantile(BESTout$mu1 - BESTout$mu2, 0.975))#Higher edge of 95% Credible Interval

Are internal replications the solution to the replication crisis in Psychology? No.

Most Psychology findings are not replicable. What can be done? Stanford psychologist Michael Frank has an idea: cumulative study sets with internal replication. 'If I had to advocate for a single change to practice, this would be it.' I took a look at whether this makes any difference.

A recent paper in the journal Science has tried to replicate 97 statistically significant effects (Open Science Collaboration, 2015). In only 35 cases this was successful. Most findings were suddenly a lot weaker upon replication. This has led to a lot of soul searching among psychologists. Fortunately, the authors of the Science paper have made their data freely available. So, soul searching can be accompanied by trying out different ideas for improvements.

What can be done to solve Psychology’s replication crisis?

One idea to improve the situation is to demand that study authors replicate their own experiments in the same paper. Stanford psychologist Michael Frank writes:

If I had to advocate for a single change to practice, this would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. […] If you show me a one-off study and I fail to replicate it in my lab, I will tend to suspect that you got lucky or p-hacked your way to a result. But if you show me a package of studies with four internal replications of an effect, I will believe that you know how to get that effect – and if I don’t get it, I’ll think that I’m doing something wrong.
If this argument were true, then the 41 studies which were successfully conceptually replicated in their own paper should show higher rates of replication than the 56 studies which were not. Of the 41 internally replicated studies, 19 were replicated once, 10 twice, 8 thrice, and 4 more than three times. I will treat all of these as equally internally replicated.

Are internal replications the solution? No.

[Figure: violin plots of effect size reduction (left) and independent replication p-values (right) for internally replicated versus not internally replicated effects.]

So, do the data from the Reproducibility Project show a difference? I made so-called violin plots; thicker parts represent more data points. In the left plot you see the reduction in effect sizes from a bigger original effect to a smaller replicated effect. The reduction associated with internally replicated effects (left) and effects which were only reported once in a paper (right) is more or less the same. In the right plot you can see the p-value of the replication attempt. The dotted line represents the arbitrary 0.05 threshold used to determine statistical significance. Again, replicators appear to have had as hard a task with effects that were found more than once in a paper as with effects which were only found once.

If you do not know how to read these plots, don't worry. Just focus on this key comparison: 29% of internally replicated effects could also be replicated by an independent team (one effect was below p = .055 and is not counted here). The equivalent number for not internally replicated effects is 41%. A contingency table Bayes factor test (Gunel & Dickey, 1974) shows that the null hypothesis of no difference is 1.97 times more likely than the alternative; see the sketch after this paragraph. In other words, the 12%-point replication advantage for non-internally-replicated effects does not provide convincing evidence for an unexpected reversed replication advantage. The 12%-point difference is not due to statistical power: power was 92% on average for both internally replicated and not internally replicated studies. So, the picture doesn't support internal replications at all. They are hardly the solution to Psychology's replication problem according to this data set.
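The code posted below only reproduces the figure, so here is a hedged sketch of the Bayes factor test just described, using the approximate counts implied by the percentages above (about 12 of 41 internally replicated and 23 of 56 not internally replicated effects replicating independently); the BayesFactor package's contingencyTableBF is based on the Gunel & Dickey (1974) priors:

#Contingency table Bayes factor: independent replication success by internal replication status
library(BayesFactor)
counts <- matrix(c(12, 29,   #internally replicated: replicated, not replicated (approximate counts)
                   23, 33),  #not internally replicated: replicated, not replicated (approximate counts)
                 nrow = 2, byrow = FALSE)
bf <- contingencyTableBF(counts, sampleType = "indepMulti", fixedMargin = "cols")
1 / exp(bf@bayesFactor$bf)  #BF01: evidence for 'no difference' (reported above as about 1.97)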

The problem with internal replications

I believe that internal replications do not prevent many questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give you just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams were unsuccessful in replication attempts of Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).

What can be done?

In my previous post I have shown evidence for questionable research practices in this data set. These lead to less replicable results. Pre-registering studies makes questionable research practices a lot harder and science more reproducible. It would be interesting to see data on whether this hunch is true.

[update 7/9/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to take into account the different denominators for replicated and unreplicated effects. Lee Jussim pointed me to this.]

[update 24/10/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to provide correct numbers, Bayesian analysis and power comparison.]

— — —
Bem DJ (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of personality and social psychology, 100 (3), 407-25 PMID: 21280961

Galak, J., LeBoeuf, R., Nelson, L., & Simmons, J. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103 (6), 933-948 DOI: 10.1037/a0029709

Gunel, E., & Dickey, J. (1974). Bayes Factors for Independence in Contingency Tables. Biometrika, 61(3), 545–557. http://doi.org/10.2307/2334738

Open Science Collaboration (2015). PSYCHOLOGY. Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Ritchie SJ, Wiseman R, & French CC (2012). Failing the future: three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PloS one, 7 (3) PMID: 22432019

Schimmack U (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17 (4), 551-66 PMID: 22924598
— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between internal replication and independent reproducibility of an effect

#Richard Kunert for Brain's Idea 5/9/2015

# a lot of code was taken from the reproducibility project code here https://osf.io/vdnrb/

# installing/loading the packages:
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))

#loading the data
RPPdata <- get.OSFfile(code='https://osf.io/fgjvw/',dfCln=T)$df
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O),!is.na(T.pval.USE.R), complete.cases(RPPdata$T.r.O,RPPdata$T.r.R))#97 studies with significant effects

#prepare IDs for internally replicated effects and non-internally replicated effects
idIntRepl <- RPPdata$Successful.conceptual.replications.O > 0
idNotIntRepl <- RPPdata$Successful.conceptual.replications.O == 0

# Get ggplot2 themes predefined in C-3PR
mytheme <- gg.theme("clean")

#restructure data in data frame
dat <- data.frame(EffectSizeDifference = as.numeric(c(c(RPPdata$T.r.R[idIntRepl]) - c(RPPdata$T.r.O[idIntRepl]),
                                                          c(RPPdata$T.r.R[idNotIntRepl]) - c(RPPdata$T.r.O[idNotIntRepl]))),
                  ReplicationPValue = as.numeric(c(RPPdata$T.pval.USE.R[idIntRepl],
                                                   RPPdata$T.pval.USE.R[idNotIntRepl])),
                  grp=factor(c(rep("Internally Replicated Studies",times=sum(idIntRepl)),
                               rep("Internally Unreplicated Studies",times=sum(idNotIntRepl))))
  )

# Create some variables for plotting
dat$grp <- as.numeric(dat$grp)
probs   <- seq(0,1,.25)

# VQP PANEL A: reduction in effect size -------------------------------------------------

# Get effect size difference quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$EffectSizeDifference[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$EffectSizeDifference[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$EffectSizeDifference[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.es <- ggplot(dat,aes(x=grp,y=EffectSizeDifference)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es,qtiles,probs)
# Garnish (what does this word mean???)
g.es1 <- g.es0 +
  ggtitle("Effect size reduction") + xlab("") + ylab("Replicated - Original Effect Size") + 
  xlim("Internally Replicated", "Not Internally Replicated") +
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.es1


# VQP PANEL B: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$ReplicationPValue[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$ReplicationPValue[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$ReplicationPValue[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.pv <- ggplot(dat,aes(x=grp,y=ReplicationPValue)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish (I still don't know what this word means!)
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("Independent replication p-value") + xlab("") + ylab("Independent replication p-value") + 
  xlim("Internally Replicated", "Not Internally Replicated")+
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.pv1

#put two plots together
multi.PLOT(g.es1,g.pv1,cols=2)

Why are Psychological findings mostly unreplicable?

Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.

Sometimes the title of a paper just sounds incredible. Estimating the reproducibility of psychological science. No one had ever systematically, empirically investigated this for any science. Doing so would require huge resources. The countless authors of this paper, which appeared in Science last week, went to great lengths to try anyway, and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e. a nominal 92% chance of finding each effect should it exist as claimed by the original discoverers), 89 statistically significant effects should have popped up. Only 35 did. Why did the other 54 studies fail to replicate?

The team behind this article also produced 95% Confidence Intervals of the replication study effect sizes. Despite their name, only 83% of them should contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes much smaller in the replication?
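A quick back-of-the-envelope check of these numbers in R (a minimal sketch; the 92% power, the 83% expected coverage and the counts are simply the figures quoted above):

n_studies <- 97
power <- 0.92 # average replication power reported for the project
n_studies * power # ~89 expected significant replications

# How surprising are only 35 successes if each replication truly had a 92% chance?
binom.test(x = 35, n = n_studies, p = power, alternative = "less")

# Same logic for the confidence-interval criterion: ~83% of replication CIs
# should capture the original effect size, but only 47% (about 46 of 97) did.
binom.test(x = round(0.47 * n_studies), n = n_studies, p = 0.83, alternative = "less")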

One reason for poor replication: sampling until significant

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a p-value below the arbitrary threshold of 0.05. The consequence is this:

[Figure: sample size versus measured effect size for original (red) and replication (green) studies (left panel) and the posterior distributions of the two correlations (right panel)]

Focus on the left panel first. The green replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected, as the replicators wanted to be sure that they had sufficient statistical power to find their effects: expecting a small effect (lower on the vertical axis) makes you plan for more participants (further right on the horizontal axis). The replicators simply sampled their pre-determined number and then analysed the data. Such a practice produces only a moderate correlation between measured effect size and sample size because, at the moment the sample size is fixed, the effect size that will eventually be measured is still uncertain.

The red original studies show a stronger relation between the effect size they found and their sample size. They must have done more than just smart a priori power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size measured in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the expected effect size estimated before the experiment, hence the weaker association with the actually measured effect size.
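To make this mechanism concrete, here is a minimal, entirely hypothetical simulation (my own sketch, not the analysis behind the figure): each simulated experimenter adds participants in batches of ten and stops as soon as the t-test dips below .05 or a maximum sample size is reached. The batch size, the true effect and the maximum N are arbitrary choices.

# Toy simulation of optional stopping (illustrative only)
set.seed(1)
one_study <- function(true_d = 0.2, batch = 10, max_n = 200) {
  x <- numeric(0)
  repeat {
    x <- c(x, rnorm(batch, mean = true_d)) # add another batch of participants
    if (t.test(x)$p.value < .05 || length(x) >= max_n) break # peek at the p-value
  }
  c(n = length(x), d = mean(x) / sd(x)) # final sample size and measured effect size
}
res <- t(replicate(1000, one_study()))
cor(res[, "n"], res[, "d"], method = "spearman") # strongly negative, as in the red studies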

The right panel shows a Bayesian correlation analysis of the data. What you are looking at is the belief in the strength of each correlation, i.e. the posterior distribution. The overlap of the two distributions can be used as a measure of how plausible it is that the correlations do not differ; here the overlap is less than 7%. If you are more inclined towards frequentist statistics, the associated p-value is .001 (Pearson and Filon’s z = 3.355). Either way, there is strong evidence that original studies display a stronger negative correlation between sample size and measured effect size than replication studies.
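If you want to try something similar yourself, here is a minimal sketch using the BayesFactor package (my suggested route, not necessarily the exact analysis behind the figure; the two data sets below are simulated placeholders standing in for the original and replication columns):

library(BayesFactor)
# placeholder data for sample sizes and effect sizes; substitute the real columns here
set.seed(3)
n_orig <- rpois(97, 80);  es_orig <- rnorm(97, 0.4 - 0.002 * n_orig, 0.15)
n_repl <- rpois(97, 120); es_repl <- rnorm(97, 0.3 - 0.0005 * n_repl, 0.15)

# posterior samples of each correlation
post_orig <- posterior(correlationBF(y = es_orig, x = n_orig), iterations = 10000)
post_repl <- posterior(correlationBF(y = es_repl, x = n_repl), iterations = 10000)
rho_orig <- as.numeric(post_orig[, "rho"])
rho_repl <- as.numeric(post_repl[, "rho"])

# rough overlap of the two posterior distributions via their estimated densities
d1 <- density(rho_orig, from = -1, to = 1, n = 512)
d2 <- density(rho_repl, from = -1, to = 1, n = 512)
sum(pmin(d1$y, d2$y)) * diff(d1$x[1:2])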

The approach which – I believe – has been followed by the original research teams should be accompanied by adjustments of the p-value (see Lakens, 2014 for how to do this). If not, you misrepresent your stats and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). It is estimated that 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in Psychology are far lower than what they should be.

So, one approach to boosting replication rates might be to do what we claim to do anyway and what the replication studies actually did: acquire the data first, analyse them second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

[24/10/2015: Added Bayesian analysis and changed figure. Code below is from old figure.]

[27/11/2015: Adjusted percentage overlap of posterior distributions.]

— — —
John LK, Loewenstein G, & Prelec D (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23 (5), 524-32 PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses European Journal of Social Psychology, 44 (7), 701-710 DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349 (6251) PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between sample size and effect size from data provided by the reproducibility project https://osf.io/vdnrb/

#Richard Kunert for Brain's Idea 3/9/2015
#load necessary libraries
library(httr)
library(Hmisc)
library(ggplot2)
library(cocor)

#get raw data from OSF website
info <- GET('https://osf.io/fgjvw/?action=download', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file

#restrict studies to those with appropriate data
studies <- MASTER$ID[!is.na(MASTER$T_r..O.) & !is.na(MASTER$T_r..R.)] ##to keep track of which studies are which
studies <- studies[-31] ##remove one problem study with an absurdly high sample size (N = 230,047)

#set font size for plotting
theme_set(theme_gray(base_size = 30))

#prepare correlation coefficients
dat_rank <- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
                       sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
                       effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
                       effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm = rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O, type = "spearman")#yes, I know the type specification is superfluous
corr_R_Spearm = rcorr(dat_rank$effect_size_R, dat_rank$sample_size_R, type = "spearman")

#compare Spearman correlation coefficients using cocor (data needs to be ranked in order to produce Spearman correlations!)
htest = cocor(formula = ~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
              data = dat_rank, return.htest = FALSE)

#visualisation
#prepare data frame
dat_vis <- data.frame(study = rep(c("Original", "Replication"), each = length(studies)),
                      sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]), cbind(MASTER$T_N_R_for_tables[studies])),
                      effect_size = rbind(cbind(MASTER$T_r..O.[studies]), cbind(MASTER$T_r..R.[studies])))

#The plotting call
ggplot(data = dat_vis, aes(x = sample_size, y = effect_size, group = study)) + #the basic scatter plot
  geom_point(aes(color = study), shape = 1, size = 4) + #specify marker size and shape
  scale_colour_hue(l = 50) + # Use a slightly darker palette than normal
  geom_smooth(method = lm,   # Add linear regression lines
              se = FALSE,    # Don't add shaded confidence region
              size = 2,
              aes(color = study)) + #colour lines according to data points for consistency
  geom_text(aes(x = 750, y = 0.46,
                label = sprintf("Spearman rho = %1.3f (p = %1.3f)",
                                corr_O_Spearm$r[1,2], corr_O_Spearm$P[1,2]),
                color = "Original", hjust = 0)) + #add text about Spearman correlation coefficient of original studies
  guides(color = guide_legend(title = NULL)) + #avoid additional legend entry for text
  geom_text(aes(x = 750, y = 0.2,
                label = sprintf("Spearman rho = %1.3f (p = %1.3f)",
                                corr_R_Spearm$r[1,2], corr_R_Spearm$P[1,2]),
                color = "Replication", hjust = 0)) + #add text about Spearman correlation coefficient of replication studies
  geom_text(x = 1500, y = 0.33,
            label = sprintf("Difference: Pearson & Filon z = %1.3f (p = %1.3f)",
                            htest@pearson1898$statistic, htest@pearson1898$p.value),
            color = "black", hjust = 0) + #add text about testing difference between correlation coefficients
  guides(color = guide_legend(title = NULL)) + #avoid additional legend entry for text
  ggtitle("Sampling until significant versus a priori power analysis") + #add figure title
  labs(x = "Sample Size", y = "Effect size r") #add axis titles

Neuroscience is Forgetting

The scientific enterprise, like anything else humans are doing, is subject to fashions, trends and hypes. Old ideas get replaced by new ones; good new ideas spread; progress is made. However, is Neuroscience forgetting where it came from?

This is not that hard to investigate in a quantitative way because every source that is cited by an article is referenced. Below, I plot the ages of all references found in reviews and review-like articles (opinions, perspectives…) published in a well known Neuroscience journal (Nature Reviews Neuroscience). As you can see, the relative majority of articles which a review cites are very recent, i.e. less than two years old. Twenty-year-old sources hardly ever get mentioned.
[Figure 1: Forgetting Rate in Nature Reviews Neuroscience]
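For anyone who wants to reproduce this kind of plot, the analysis boils down to a reference-age histogram. Here is a minimal sketch with made-up data (the real analysis used my own code on reference lists from the NRN website, see below):

library(ggplot2)
# toy data: one row per cited source, with the year of the citing review and of the cited source
set.seed(4)
refs <- data.frame(citing_year = sample(2000:2011, 500, replace = TRUE))
refs$cited_year <- refs$citing_year - rgeom(500, prob = 0.15)
refs$age <- refs$citing_year - refs$cited_year
ggplot(refs, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Age of cited source (years)", y = "Number of references")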
One could argue that the pattern seen above – the very fast rate of decline which levels off near zero – simply reflects the publication rate in Neuroscience: if more is published, more can get referenced. However, compare the above plot with the one below, which shows the publication rate in the Neurosciences from 1975 to 2011. There is an increase in publication rate, but it is not nearly as steep as the citation pattern above would suggest.
[Figure 2: Neuroscience Publication Rate]
Something else is happening.
Notice the white bars in Figure 2. These show the overall number of reviews in Neuroscience. While in 1975 only 46 Neuroscience reviews got published, in 2011 it was nearly 3000. Given that it is impossible to read anything like 40000 original articles each year, it makes sense to only read the summaries. Perhaps, once summarised, an original article’s gist survives in the form of a brief mention in a review while its details are simply forgotten.
Another possibility is the decline effect I blogged about before. Old articles which proved unreliable should be forgotten in order to edge closer to some scientific truth. They may have suffered from publication bias, sub-optimal techniques and/or sampling bias.
So, Neuroscience is indeed forgetting. However, whether this is all that bad is another question.

——————————————————————————————

Figures:
1) Data processed with own code. Original data from NRN website.
2) Data from Web of Science.

Correcting for Human Researchers – The Rediscovery of Replication

We need to control for this.

You may have missed some of the discussion on fraud, errors and biases shaking the scientific community of late, so I will quickly bring you up to speed.
Firstly, a series of fraud cases (Ruggiero, Hauser, Stapel) in Psychology and related fields makes everyone wonder why only internal whistleblowers ever discover major fraud cases like these.
Secondly, a well regarded journal publishes an article by Daryl Bem (2011) claiming that we can feel the future. Wagenmakers et al. (2011) apply a different statistical analysis and claim that Bem’s evidence for precognition is so weak as to be meaningless. The debate continues. Meanwhile a related failed replication paper claims to have trouble getting published.
Thirdly, John Bargh criticises everyone involved in a failed replication of an effect he is particularly well known for. He criticises the experimenters, the journal, even a blogger who wrote about it.
This all happened within the last year and suddenly everyone speaks about replication. Ed Yong wrote about it in Nature, The Psychologist had a special issue on it, some researchers set up a big replication project, and the blogosphere has gone crazy with it.
Some may wonder why replication was singled out as the big issue. Isn’t this about the ruthless, immoral energy of fraudsters? Or about publishers’ craving for articles that create buzz? Or about a researcher’s taste for scandal? Perhaps it is indeed about a series of individual problems related to human nature. But the solution is still a systemic one: replication. It is the only way of overcoming the unfortunate fact that science is only done by mere humans.
This may surprise some people because replication is not done all that much. And the way researchers get rewarded for their work totally goes against doing replications. The field carries on as if there were procedures, techniques and analyses that overcome the need for replication. The most common of which is inferential hypothesis testing.
This way of analysing your data asks whether any differences found among the people who were studied would also hold in the population at large. If so, the difference is said to be ‘statistically significant’. Usually this is boiled down to a p-value, which reports how likely a difference at least as large as the observed one would be if in truth there were no difference at all in the population. So, imagine that women and men were in truth equally intelligent (I have no idea whether they are). With the conventional .05 threshold, inferential hypothesis testing will still declare a significant difference between male and female IQs in 5% of experiments. This difference won’t be replicated by the other 95% of experiments.
And this is where replication comes in: the p-value can be thought of as a prediction of how likely failed replications of an effect will be. Needless to say, a prediction is a poor substitute for the real thing.
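A toy simulation makes this concrete (my own sketch; all numbers are arbitrary): when two groups truly do not differ, about 5% of experiments will nonetheless cross the .05 threshold.

set.seed(1)
p_values <- replicate(10000, {
  iq_women <- rnorm(30, mean = 100, sd = 15) # no true difference between groups
  iq_men   <- rnorm(30, mean = 100, sd = 15)
  t.test(iq_women, iq_men)$p.value
})
mean(p_values < .05) # close to 0.05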
This was brought home to me by Luck in his great book An Introduction to the Event-Related Potential Technique (2005, p. 251). He basically says that replication is the only approach in science which is not based on assumptions needed to run the aforementioned statistical analyses.
“Replication does not depend on assumptions about normality, sphericity, or independence. Replication is not distorted by outliers. Replication is a cornerstone of science. Replication is the best statistic.”
In other words, it is the only way of overcoming the human factor involved in choosing how to get to a p-value. You can disagree on many things, but not on the implication of a straight replication. If the effect is consistently replicated, it is real.
For example, Simmons and colleagues (2011) report that researchers can tweak their data easily without anyone knowing. This is not really fraud, but it is not something you want to admit either. Combining four ways of tweaking the statistical analysis towards a significant result – which is desirable for publication – produced ‘statistically significant’ differences with a non-replication likelihood of around 60%. Now, this wouldn’t be a problem if anyone actually bothered to do a replication – including the exact same tweaks to the data. It is very likely that the effect wouldn’t hold up.
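To see how quickly such flexibility adds up, here is a small, hypothetical simulation in the spirit of Simmons et al. (2011) – my own toy choice of degrees of freedom, not their exact scenario: with no true effect, an experimenter measures two correlated outcomes and reports whichever of three analyses happens to be significant.

set.seed(2)
flexible_study <- function(n = 20) {
  group <- rep(0:1, each = n) # two groups, no true difference
  dv1 <- rnorm(2 * n)
  dv2 <- 0.5 * dv1 + sqrt(0.75) * rnorm(2 * n) # second, correlated outcome
  ps <- c(t.test(dv1 ~ group)$p.value,
          t.test(dv2 ~ group)$p.value,
          t.test((dv1 + dv2) / 2 ~ group)$p.value)
  min(ps) < .05 # report whichever analysis "worked"
}
mean(replicate(10000, flexible_study())) # noticeably above the nominal 5%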
Many people believe that this is what really happened with Bem’s pre-cognition results. They are perhaps not fraudulent, but the way they were analysed and reported inflated the chances of finding effects which are not real. Similarly, replication is what did not happen with Stapel and other fraudsters. My guess is that if anyone had actually bothered to replicate, it would have become clear that Stapel had a history of unreplicability (see my earlier blog post about the Stapel affair for clues).
So, if we continue to let humans do research, we have to address the weakness inherent in this approach. Replication is the only solution we know of.

———————————————————————–

Bem, D.J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology, 100, 407-425. DOI: 10.1037/a0021524
Luck, S.J. (2005). An Introduction to the Event-Related Potential Technique. London: MIT Press.
Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). False Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22, 1359-1366. DOI: 10.1177/0956797611417632
Wagenmakers, E.J., Wetzels, R., Borsboom, D., van der Maas, H. (2011). Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi. Journal of Personality and Social Psychology, 100, 426-432. doi: 10.1037/a0022790

Whose discovery? Declining Effect Sizes vs. Growing Influence

It is good practice in scientific publications to cite the original paper which established an idea, technique or model. However, there are worrying signs that these original papers exaggerated their effect sizes and are thus less trustworthy than later studies. Torn between granting credit where credit is due and citing the most solid scientific findings, how should one decide?
A good example of a scientific breakthrough which established an idea comes from the neurosciences. Today, much of brain imaging is concerned with the localisation of function: Where in the brain is x? Whether we want to localise vision, intelligence or consciousness does not appear to matter to some researchers. There is much to say about this approach, its assumptions and its value. However, in this post I want to talk about its beginnings.
The first time a successful brain localisation was claimed was in the 19th century by anatomist Paul Broca. His discovery of the left inferior frontal cortex’s (the area below your left temple) importance for speech is still widely influential. After one of his patients with apparently normal general mental abilities but impaired speech died, Broca did a post mortem and found a lesion in his left frontal cortex. The same was true for 12 other patients. This was a breakthrough in the quest to link the brain to behaviour. The brain area Paul Broca identified as crucial for speech output is called Broca’s area to this day. The speech impairments resulting from its damage are still called Broca’s aphasia.
Discoveries like these can be game changers. However, there are worrying signs that effect sizes shrink as results get replicated. Below is an example from psycholinguistics. Two sentences can mean practically identical things, such as ‘Peter gives the toy to Anna.’ (a so-called Prepositional Object) and ‘Peter gives Anna the toy.’ (a so-called Direct Object). When language users have to decide which construction to use, they show a strong tendency to unconsciously adopt the one they heard just before. This effect is called syntactic priming and is typically associated with Kathryn Bock’s 1986 article in the journal Cognitive Psychology. It has been used to argue for convergence of syntax in dialogue, shared mental resources for comprehension and production, or overlap of the various languages available to a speaker.
As can be seen below, since its initial characterisation by Bock the effect size of Prepositional vs. Direct Object priming has declined. Because this is actually a new, unpublished finding, let me elaborate on what the figure shows (you can skip the next two paragraphs in case you trust me). On the vertical axis I show a standard way of reporting the priming effect size. Mind that the way this is calculated can differ between studies, so I recalculated all values ignoring answers not classifiable as either a Prepositional or a Direct Object. On the horizontal axis is simply the publication year. The thirteen data points refer to all studies I could find on Web of Science in December 2011 which replicated Bock’s initial finding using the same task (picture description), totalling 1169 participants. The standard way of looking for associations between two variables (here, effect size and time) is the Pearson correlation coefficient, and it is significantly negative, indicating that the later a study was published, the lower its effect size.
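For the curious, one common way of computing this priming effect size is the difference in Prepositional Object use after the two prime types, counting only classifiable responses (a sketch with made-up counts, not data from any particular study):

# responses classifiable as Prepositional Object (PO) or Direct Object (DO)
po_after_po <- 37; do_after_po <- 53 # after PO primes
po_after_do <- 25; do_after_do <- 61 # after DO primes
priming_effect <- po_after_po / (po_after_po + do_after_po) -
                  po_after_do / (po_after_do + do_after_do)
priming_effect # proportion PO after PO primes minus proportion PO after DO primes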

One should not over-interpret this finding. It could be due to a) unusual effect sizes, b) publication bias or c) task differences. What happens if I control for these things?
a) To check whether unusually high values early on or unusually low values later on carry the effect, I repeat the analysis with a so-called non-parametric test (Kendall’s tau). This test ranks all values and correlates the ranks. The effect size is still significantly negatively associated with publication year: extreme values (also called outliers) do not carry the effect.
b) Furthermore, I show the studies’ sample sizes via the size of the data points. Some may argue that small studies tend to produce more extreme values and that the extremely low ones are not published because of publication bias, i.e. journals’ tendency to only accept significant findings. This publication pattern may have declined over the years due to higher sample sizes or less publication pressure. To control for this possibility I statistically control for sample size, and the decline effect still holds: differences in sample size do not carry the effect.
c) Finally, there is a small difference in methodology for two publications (three effect sizes marked in red): whether the priming went from comprehension to production with a repetition of the prime (as in Bock’s original article) or whether it was slightly different. Excluding these two studies only strengthens the negative correlations (r = -0.71 (p = 0.02); tau = -0.69 (p = 0.01); weighted r = -0.73 (p = 0.02)). A minimal code sketch of these three checks follows below.
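Here is the promised sketch, with made-up numbers standing in for the thirteen published effect sizes:

# placeholder data: publication year, priming effect size, sample size
priming <- data.frame(year   = c(1986, 1990, 1995, 2000, 2005, 2008, 2011),
                      effect = c(0.23, 0.20, 0.16, 0.12, 0.10, 0.09, 0.08),
                      n      = c(48, 60, 80, 96, 120, 140, 150))
cor.test(priming$year, priming$effect)                     # Pearson correlation
cor.test(priming$year, priming$effect, method = "kendall") # Kendall tau, robust to outliers
# sample-size-weighted correlation via the weighted covariance matrix
cov.wt(priming[, c("year", "effect")], wt = priming$n, cor = TRUE)$cor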
If someone has another idea how this decline effect came about, I would be happy to hear it. For the moment it is clear that Bock’s syntactic priming effect mysteriously declines with every year.
In terms of Paul Broca’s discovery, the decline effect has to be recast in terms of what his finding stands for: the localisation of a cognitive function in the brain. Evidence for a decline would then be both the fact that the same brain area can be associated with more than one function and the fact that speech production is associated with more than one brain area.
What sort of functions is Broca’s area associated with today? I checked Brainscanr, a free online tool developed by Voytek which shows how likely two terms are to co-occur in the literature. As can be seen below, Broca’s area is indeed related to language production. However, it is also associated with language comprehension and even mirror neurons. The one-area-one-function mapping does not hold for Broca’s area.
But what about speech production? Perhaps Broca’s area is able to handle different jobs, but each job still resides in a specific place in the brain. Again, I use a free online tool, Neurosynth, to check this. For this example, Neurosynth aggregates 44 studies showing various brain areas associated with speech production. Broca’s area is indeed present, but so is its right-hemisphere homologue, and medial areas (in the middle of the brain) also light up. So, the decline effect can be shown quite nicely for the brain localisation of cognitive functions as well.
Jonah Lehrer gives more examples from drug, memory and sexual selection research. Mind that the decline effect is not a case of fraud-ish research but instead a good example of how the scientific process refines its methods over time and edges ever closer to agreeing on some basic truths about the world. The problem is that the truth is typically more difficult (read: messier) than initially thought.
But who is to take credit for the discovery then? The original scientist whose finding did not fully stand the test of time or the later scientist whose result is more believable? To me, the current agreement appears to be to credit the former (at least as long as he or she is still alive) and possibly add some findings by the latter if need be. Does this amount to good scientific practice?


Bock, J.K. (1986). Syntactic Persistence in Language Production. Cognitive Psychology, 18, 335-387. doi: 10.1016/0010-0285(86)90004-6

[12/4/2012: UPDATE]

I recently discovered a study which should have been included in the small-scale meta-analysis. The revised values are:

correlations for analysis as reported in figure 1: r = -0.46 (p = 0.07); tau = -0.31 (p = 0.12); weighted r = -0.45 (p = 0.09)

correlations for analysis without the two studies which slightly changed the design: r = -0.52 (p = 0.07); tau = -0.37 (p = 0.12); weighted r = -0.48 (p = 0.12)

Thus, the original finding of a decline effect in the syntactic priming literature is at most only marginally significant.