**Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.**

Sometimes the title of a paper just sounds incredible. **Estimating the reproducibility of psychological science**. No one had ever systematically and empirically investigated this for any science, not least because doing so requires huge resources. The many authors of this paper, which appeared in Science last week, went to great lengths to try anyway, and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e., a nominal 92% chance of finding each effect if it exists as claimed by the original discoverers), 89 statistically significant effects should have popped up. Only 35 did. Why did the other 54 expected replications fail to materialise?
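Where that expected 89 comes from is simple arithmetic. As a back-of-the-envelope sketch of my own (not a calculation from the paper): if all 97 effects are real and each replication has a 92% chance of detecting its effect, the number of significant replications follows a binomial distribution.

```r
# Expected number of significant replications if all 97 original
# effects are real and each replication attempt has 92% power.
n_studies <- 97
power     <- 0.92
n_studies * power                                        # expectation: ~89
qbinom(c(0.025, 0.975), size = n_studies, prob = power)  # plausible range
```

Even the lower end of that plausible range sits far above the 35 significant replications that were actually observed.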

The team behind this article also produced 95% confidence intervals for the replication effect sizes. Despite their name, only 83% of these intervals should be expected to contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes so much smaller in the replications?
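The 83% figure follows from the fact that the original estimate is itself noisy: both studies carry sampling error, so the replication interval has to capture a moving target. A quick simulation of my own illustrates this, assuming equal standard errors in original and replication (the paper's actual derivation may differ in detail):

```r
# How often does a replication's 95% CI contain the ORIGINAL point
# estimate, when both estimates are noisy? (toy simulation)
set.seed(1)
se      <- 0.1   # assumed standard error, identical in both studies
true_es <- 0.3   # assumed true effect size
orig <- rnorm(1e5, true_es, se)  # original point estimates
repl <- rnorm(1e5, true_es, se)  # independent replication estimates
mean(abs(orig - repl) < 1.96 * se)  # about .83, not .95
```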

**One reason for poor replication: sampling until significant**

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a *p*-value below the arbitrary threshold of 0.05. The consequence is this:

Focus on the left panel first. The green replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected as the replicators wanted to be sure that they had sufficient statistical power to find their effects. Expecting small effects (lower on vertical axis) makes you plan in more participants (further right on horizontal axis). The replicators simply sampled their pre-determined number, and then analysed the data. Apparently, such a practice leads to a moderate correlation between measured effect size and sample size because what the measured effect size will be is uncertain when you start sampling.

The red original studies show a stronger relation between the effect size they found and their sample size. They must have done more than just smart *a priori* power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size *measured* in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the *expected* effect size estimated before the experiment, hence the weaker association with the actually measured effect size.
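This mechanism is easy to demonstrate in a toy simulation (my own illustration, not an analysis from the paper): add participants in batches, test after every batch, and stop as soon as *p* < .05 or a maximum sample size is reached. The measured effect size then predicts the final sample size almost by construction.

```r
# Toy simulation of "sampling until significant" with a small true effect.
set.seed(1)
sample_until_significant <- function(d = 0.2, batch = 10, n_max = 200) {
  x <- numeric(0)
  repeat {
    x <- c(x, rnorm(batch, mean = d))             # add a batch of participants
    if (t.test(x)$p.value < .05 || length(x) >= n_max) break
  }
  c(n = length(x), d_measured = mean(x) / sd(x))  # final N and measured effect
}
sims <- t(replicate(2000, sample_until_significant()))
cor(sims[, "n"], sims[, "d_measured"], method = "spearman")  # strongly negative
```

Studies that happen to start with a few extreme participants stop early with a large measured effect; the rest are dragged out to the maximum sample size with a small one, exactly the pattern in the red points.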

The right panel shows a Bayesian correlation analysis of the data. What you are looking at is the posterior distribution: the belief in the strength of each correlation after seeing the data. The overlap of the two distributions can be used as a measure of the belief that the correlations do not differ. That overlap is less than 7%. If you are more inclined towards frequentist statistics, the associated *p*-value is .001 (Pearson and Filon's *z* = 3.355). Either way, there is strong evidence that the original studies display a stronger negative correlation between sample size and measured effect size than the replication studies.

The approach which – I believe – was followed by the original research teams should be accompanied by an adjustment of the *p*-value (see Lakens, 2014 for how to do this). If it is not, you misrepresent your statistics and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). An estimated 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in psychology are far lower than they should be.
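What the missing correction costs can also be simulated (again a sketch of my own, in the spirit of Simmons et al., 2011): with no true effect at all, peeking after every batch of ten participants and stopping at the first *p* < .05 inflates the false positive rate well beyond the nominal 5%.

```r
# False positive rate under uncorrected optional stopping,
# when the true effect is exactly zero.
set.seed(1)
false_positive <- function(batch = 10, n_max = 100) {
  x <- numeric(0)
  while (length(x) < n_max) {
    x <- c(x, rnorm(batch))                    # true effect is exactly zero
    if (t.test(x)$p.value < .05) return(TRUE)  # stop at first "significant" peek
  }
  FALSE
}
mean(replicate(5000, false_positive()))  # well above the nominal .05
```

Sequential analyses (Lakens, 2014) keep this rate at 5% by spending the alpha level across the planned peeks.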

So, one way to boost replication rates might be to do what we claim to do anyway, and what the replication studies actually did: acquire the data first, analyse them second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

[24/10/2015: Added Bayesian analysis and changed figure. Code below is from old figure.]

[27/11/2015: Adjusted percentage overlap of posterior distributions.]

— — —

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532. PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701-710. DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251). PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

```r
## Estimating the association between sample size and effect size from data
## provided by the reproducibility project https://osf.io/vdnrb/
# Richard Kunert for Brain's Idea 3/9/2015

# load necessary libraries
library(httr)
library(Hmisc)
library(ggplot2)
library(cocor)

# get raw data from OSF website
info <- GET('https://osf.io/fgjvw/?action=download',
            write_disk('rpp_data.csv', overwrite = TRUE)) # downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # change first column name to ID to be able to load .csv file

# restrict studies to those with appropriate data
studies <- MASTER$ID[!is.na(MASTER$T_r..O.) & !is.na(MASTER$T_r..R.)] # to keep track of which studies are which
studies <- studies[-31] # remove one problem study with absurdly high sample size (N = 23,0047)

# set font size for plotting
theme_set(theme_gray(base_size = 30))

# prepare correlation coefficients
dat_rank <- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
                       sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
                       effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
                       effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm <- rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O,
                       type = "spearman") # yes, I know the type specification is superfluous
corr_R_Spearm <- rcorr(dat_rank$effect_size_R, dat_rank$sample_size_R,
                       type = "spearman")

# compare Spearman correlation coefficients using cocor
# (data needs to be ranked in order to produce Spearman correlations!)
htest <- cocor(formula = ~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
               data = dat_rank, return.htest = FALSE)

# visualisation
# prepare data frame
dat_vis <- data.frame(study = rep(c("Original", "Replication"), each = length(studies)),
                      sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]),
                                          cbind(MASTER$T_N_R_for_tables[studies])),
                      effect_size = rbind(cbind(MASTER$T_r..O.[studies]),
                                          cbind(MASTER$T_r..R.[studies])))

# the plotting call
ggplot(data = dat_vis, aes(x = sample_size, y = effect_size, group = study)) + # the basic scatter plot
  geom_point(aes(color = study), shape = 1, size = 4) + # specify marker size and shape
  scale_colour_hue(l = 50) + # use a slightly darker palette than normal
  geom_smooth(method = lm,   # add linear regression lines
              se = FALSE,    # don't add shaded confidence region
              size = 2,
              aes(color = study)) + # colour lines according to data points for consistency
  geom_text(aes(x = 750, y = 0.46,
                label = sprintf("Spearman rho = %1.3f (p = %1.3f)",
                                corr_O_Spearm$r[1, 2], corr_O_Spearm$P[1, 2]),
                color = "Original", hjust = 0)) + # add text about Spearman correlation coefficient of original studies
  geom_text(aes(x = 750, y = 0.2,
                label = sprintf("Spearman rho = %1.3f (p = %1.3f)",
                                corr_R_Spearm$r[1, 2], corr_R_Spearm$P[1, 2]),
                color = "Replication", hjust = 0)) + # add text about Spearman correlation coefficient of replication studies
  geom_text(x = 1500, y = 0.33,
            label = sprintf("Difference: Pearson & Filon z = %1.3f (p = %1.3f)",
                            htest@pearson1898$statistic, htest@pearson1898$p.value),
            color = "black", hjust = 0) + # add text about testing difference between correlation coefficients
  guides(color = guide_legend(title = NULL)) + # avoid additional legend entry for text
  ggtitle("Sampling until significant versus a priori power analysis") + # add figure title
  labs(x = "Sample Size", y = "Effect size r") # add axis titles
```

Great post! Note that selective stopping is not the only explanation for the negative correlation between sample size and effect size.

If we assume that people choose sample sizes more or less at random, and then publish only the significant comparisons (p<0.05), we will observe the same negative correlation.

Because with a small sample size, significance will only be achieved if the effect size happens to be large. Whereas larger studies could have smaller effect sizes (and smaller ones are more likely to arise than large ones if we assume that the true effect size is small or zero.)

In other words publication bias, by itself, could explain the correlation. In practice, I suspect it's a combination of Questionable Research Practices that explain it.

Thanks for the interesting post. Noticed “stastically,” happy for the comment to be deleted when corrected.