Richard Kunert

Yet more evidence for questionable research practices in original studies of Reproducibility Project: Psychology

The replicability of psychological research is surprisingly low. Why? In this blog post I present new evidence showing that questionable research practices contributed to failures to replicate psychological effects.

Quick recap. A recent publication in Science claims that only around 40% of psychological findings are replicable, based on 100 replication attempts in the Reproducibility Project: Psychology (Open Science Collaboration, 2015). A few months later, a critical commentary in the same journal made all sorts of claims, including that the surprisingly low 40% replication success rate is due to replications having been unfaithful to the original studies’ methods (Gilbert et al., 2016). A little while later, I published an article in Psychonomic Bulletin & Review re-analysing the data from the 100 replication teams (Kunert, 2016). I found evidence that questionable research practices, rather than the unfaithfulness of replications to the original methods, are at the heart of failures to replicate.

However, my previous re-analysis depended on replication teams having done good work. In this blog post I will show that even when just looking at the original studies in the Reproducibility Project: Psychology one cannot fail to notice that questionable research practices were employed by the original discoverers of the effects which often failed to replicate. The reanalysis I will present here is based on the caliper test introduced by Gerber and colleagues (Gerber & Malhotra, 2008; Gerber et al., 2010).

The idea of the caliper test is simple. The research community has decided that an entirely arbitrary threshold of p = 0.05 distinguishes between effects which might just be due to chance (p > 0.05) and effects which are more likely due to something other than chance (p < 0.05). If researchers want to game the system, they slightly rig their methods and analyses to push their p-values just below the arbitrary border between ‘statistical fluke’ and ‘interesting effect’. Alternatively, they simply don’t publish anything which came out at p > 0.05. Such behaviour should lead to an unlikely number of p-values just below 0.05 compared to just above 0.05.
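To make this concrete, here is a minimal sketch of a caliper test in R, using made-up p-values purely for illustration (the real analysis on the Reproducibility Project data follows further down this post):

# Minimal illustration of the caliper test with made-up p-values (for the real
# analysis on the Reproducibility Project data, see the full script below).
p <- c(0.03, 0.04, 0.045, 0.048, 0.049, 0.06, 0.11, 0.21)  # hypothetical published p-values
z <- qnorm(1 - p / 2)              # convert two-sided p-values to z-values

caliper <- c(1.76, 2.16)           # 10% caliper around the z = 1.96 threshold
over  <- sum(z > 1.96 & z < caliper[2])   # just significant
under <- sum(z > caliper[1] & z < 1.96)   # just not significant

binom.test(over, over + under, p = 1/2)   # are 'just significant' results over-represented?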

The figure below shows the data of the Reproducibility Project: Psychology. On the horizontal axis I plot z-values which are related to p-values. The higher the z-value the lower the p-value. On the vertical axis I just show how many z-values I found in each range. The dashed vertical line is the arbitrary threshold between p < .05 (significant effects on the right) and p > .05 (non-significant effects on the left).

[Figure: RPP_density_plot — density of z-values for original studies and independent replications]

The independent replications in blue show many z-values left of the dashed line, i.e. replication attempts which were unsuccessful. Otherwise the blue distribution is relatively smooth. There is certainly nothing fishy going on around the arbitrary p = 0.05 threshold. The blue curve looks very much like what I would expect psychological research to be if questionable research practices did not exist.

However, the story is completely different for the green distribution representing the original effects. Just right of the arbitrary p = 0.05 threshold there is a surprising clustering of z-values. It’s as if the human mind magically leads to effects which are just about significant rather than just about not significant. This bump immediately to the right of the dashed line is a clear sign that original authors used questionable research practices. This behaviour renders psychological research unreplicable.

For the expert reader, the formal analysis of the caliper test is shown in the table below using both a Bayesian analysis and a classical frequentist analysis. The conclusion is clear. There is no strong evidence for replication studies failing the caliper test, indicating that questionable research practices were probably not employed. The original studies do not pass the caliper test, indicating that questionable research practices were employed.

 

| | over caliper (significant) | below caliper (non-sign.) | Binomial test | Bayesian proportion test | posterior median [95% Credible Interval] (1) |
|---|---|---|---|---|---|
| 10% caliper (1.76 < z < 1.96 versus 1.96 < z < 2.16) | | | | | |
| Original | 9 | 4 | p = 0.267 | BF10 = 1.09 | 0.53 [-0.36; 1.55] |
| Replication | 3 | 2 | p = 1 | BF01 = 1.30 | 0.18 [-1.00; 1.45] |
| 15% caliper (1.67 < z < 1.96 versus 1.96 < z < 2.25) | | | | | |
| Original | 17 | 4 | p = 0.007 | BF10 = 12.9 | 1.07 [0.24; 2.08] |
| Replication | 4 | 5 | p = 1 | BF01 = 1.54 | -0.13 [-1.18; 0.87] |
| 20% caliper (1.57 < z < 1.96 versus 1.96 < z < 2.35) | | | | | |
| Original | 29 | 4 | p < 0.001 | BF10 = 2813 | 1.59 [0.79; 2.58] |
| Replication | 5 | 5 | p = 1 | BF01 = 1.64 | 0.00 [-0.99; 0.98] |

(1) Based on 100,000 draws from the posterior distribution of log odds.
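As a quick check, the 20% caliper row for the original studies can be reproduced directly from the counts in the table. This is only a sketch; the full script at the bottom of this post derives the counts from the raw data:

# Check of the 20% caliper result for the original studies, using the counts
# from the table above (29 just-significant versus 4 just-non-significant effects).
library(BayesFactor)

binom.test(29, 29 + 4, p = 1/2)          # frequentist caliper test, p < 0.001
bf <- proportionBF(29, 29 + 4, p = 1/2)  # Bayesian test of a single proportion
exp(bf@bayesFactor$bf)                   # BF10 (stored as a natural log), close to the ~2813 in the table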

 

As far as I know, this is the first analysis showing that data from the original studies of the Reproducibility Project: Psychology point to questionable research practices [I have since been made aware of others, see this comment below]. Instead of sloppy science on the part of independent replication teams, this analysis rather points to original investigators employing questionable research practices. This alone could explain the surprisingly low replication rates in psychology.

Psychology failing the caliper test is by no means a new insight. Large-scale analyses have shown that psychology as a whole tends to fail the caliper test (Kühberger et al., 2014; Head et al., 2015). The analysis I have presented here links this result to replicability. If a research field employs questionable research practices (as indicated by the caliper test) then it can no longer claim to deliver insights which stand the replication test (as indicated by the Reproducibility Project: Psychology).

It is time to get rid of questionable research practices. There are enough ideas for how to do so (e.g., Asendorpf et al., 2013; Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014). The Reproducibility Project: Psychology shows why there is no time to waste: it is currently very difficult to distinguish an interesting psychological effect from a statistical fluke. I doubt that this state of affairs is what psychological researchers get paid for.

PS: full R-code for recreating all analyses and figures is posted below. If you find mistakes please let me know.

PPS: I am indebted to Jelte Wicherts for pointing me to this analysis.

Update 25/4/2016:

I adjusted the text to clarify that the caliper test cannot distinguish between different questionable research practices, following a tweet.

I toned down the language somewhat following a tweet.

I added reference to Uli Schimmack’s analysis by linking his comment.

— — —

Asendorpf, J., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J., Fiedler, K., Fiedler, S., Funder, D., Kliegl, R., Nosek, B., Perugini, M., Roberts, B., Schmitt, M., van Aken, M., Weber, H., & Wicherts, J. (2013). Recommendations for Increasing Replicability in Psychology European Journal of Personality, 27 (2), 108-119 DOI: 10.1002/per.1919

Gerber, A., & Malhotra, N. (2008). Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results? Sociological Methods & Research, 37 (1), 3-30 DOI: 10.1177/0049124108318973

Gerber, A., Malhotra, N., Dowling, C., & Doherty, D. (2010). Publication Bias in Two Political Behavior Literatures American Politics Research, 38 (4), 591-613 DOI: 10.1177/1532673X09350979

Gilbert, D., King, G., Pettigrew, S., & Wilson, T. (2016). Comment on “Estimating the reproducibility of psychological science” Science, 351 (6277), 1037-1037 DOI: 10.1126/science.aad7243

Head ML, Holman L, Lanfear R, Kahn AT, & Jennions MD (2015). The extent and consequences of p-hacking in science. PLoS biology, 13 (3) PMID: 25768323

Ioannidis JP, Munafò MR, Fusar-Poli P, Nosek BA, & David SP (2014). Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in cognitive sciences, 18 (5), 235-41 PMID: 24656991

Kühberger A, Fritz A, & Scherndl T (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PloS one, 9 (9) PMID: 25192357

Kunert R (2016). Internal conceptual replications do not increase independent replication success. Psychonomic bulletin & review PMID: 27068542

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

— — —

##################################################################################################################
# Script for article "Questionable research practices in original studies of Reproducibility Project: Psychology"#
# Submitted to Brain's Idea (status: published)                                                                                               #
# Responsible for this file: R. Kunert (rikunert@gmail.com)                                                      # 
##################################################################################################################    
 
##########################################################################################################################################################################################
#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Figure 1: p-value density
 
# source functions
if(!require(httr)){install.packages('httr')}
library(httr)
info <- GET('https://osf.io/b2vn7/?action=download', write_disk('functions.r', overwrite = TRUE)) #downloads data file from the OSF
source('functions.r')
 
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))
 
if(!require(BayesFactor)){install.packages('BayesFactor')} #Bayesian analysis
library(BayesFactor)
 
if(!require(BEST)){install.packages('BEST')} #distribution overlap
library(BEST)#requires JAGS version 3
 
if(!require(RCurl)){install.packages('RCurl')} #
library(RCurl)#
 
#the following few lines are an excerpt of the Reproducibility Project: Psychology's 
# masterscript.R to be found here: https://osf.io/vdnrb/
 
# Read in Tilburg data
info <- GET('https://osf.io/fgjvw/?action=download', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file
 
#for studies with exact p-values
id <- MASTER$ID[!is.na(MASTER$T_pval_USE..O.) & !is.na(MASTER$T_pval_USE..R.)]
 
#FYI: turn p-values into z-scores 
#z = qnorm(1 - (pval/2)) 
 
#prepare data point for plotting
dat_vis <- data.frame(p = c(MASTER$T_pval_USE..O.[id],
                            MASTER$T_pval_USE..R.[id], MASTER$T_pval_USE..R.[id]),
                      z = c(qnorm(1 - (MASTER$T_pval_USE..O.[id]/2)),
                            qnorm(1 - (MASTER$T_pval_USE..R.[id]/2)),
                            qnorm(1 - (MASTER$T_pval_USE..R.[id]/2))),
                      Study_set= c(rep("Original Publications", length(id)),
                                   rep("Independent Replications", length(id)),
                                   rep("zIndependent Replications2", length(id))))
 
#prepare plotting colours etc
cols_emp_in = c("#1E90FF","#088A08")#colour_definitions of area
cols_emp_out = c("#08088a","#0B610B")#colour_definitions of outline
legend_labels = list("Independent\nReplication", "Original\nStudy")
 
#execute actual plotting
density_plot = ggplot(dat_vis, aes(x=z, fill = Study_set, color = Study_set, linetype = Study_set))+
  geom_density(adjust =0.6, size = 1, alpha=1) +  #density plot call
  scale_linetype_manual(values = c(1,1,3)) +#outline line types
  scale_fill_manual(values = c(cols_emp_in,NA), labels = legend_labels)+#specify the to be used colours (outline)
  scale_color_manual(values = cols_emp_out[c(1,2,1)])+#specify the to be used colours (area)
  labs(x="z-value", y="Density")+ #add axis titles  
  ggtitle("Reproducibility Project: Psychology") +#add title
  geom_vline(xintercept = qnorm(1 - (0.05/2)), linetype = 2) +
  annotate("text", x = 2.92, y = -0.02, label = "p < .05", vjust = 1, hjust = 1)+
  annotate("text", x = 1.8, y = -0.02, label = "p > .05", vjust = 1, hjust = 1)+
  theme(legend.position="none",#remove legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line  = element_line(colour = "black"),#clean look
        text = element_text(size=18),
        plot.title=element_text(size=30))
density_plot
 
#common legend
#draw a nonsense bar graph which will provide a legend
legend_labels = list("Independent\nReplication", "Original\nStudy")
dat_vis <- data.frame(ric = factor(legend_labels, levels=legend_labels), kun = c(1, 2))
dat_vis$ric = relevel(dat_vis$ric, "Original\nStudy")
nonsense_plot = ggplot(data=dat_vis, aes(x=ric, y=kun, fill=ric)) + 
  geom_bar(stat="identity")+
  scale_fill_manual(values=cols_emp_in[c(2,1)], name=" ") +
  theme(legend.text=element_text(size=18))
#extract legend
tmp <- ggplot_gtable(ggplot_build(nonsense_plot)) 
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") 
leg_plot <- tmp$grobs[[leg]]
#combine plots
grid.arrange(grobs = list(density_plot,leg_plot), ncol = 2, widths = c(2,0.4))
 
#caliper test according to Gerber et al.
 
#turn p-values into z-values
z_o = qnorm(1 - (MASTER$T_pval_USE..O.[id]/2))
z_r = qnorm(1 - (MASTER$T_pval_USE..R.[id]/2))
 
#How many draws are to be taken from posterior distribution for BF and Credible Interval calculations? The more samples the more precise the estimate and the slower the calculation.
draws = 10000 * 10#BayesFactor package standard = 10000
 
#choose one of the calipers (the 10% caliper is active by default; comment it out and uncomment another one to switch)
z_c = c(1.76, 2.16)#10% caliper
#z_c = c(1.67, 2.25)#15% caliper
#z_c = c(1.57, 2.35)#20% caliper
 
#calculate counts
print(sprintf('Originals: over caliper N = %d', sum(z_o <= z_c[2] & z_o >= 1.96)))
print(sprintf('Originals: under caliper N = %d', sum(z_o >= z_c[1] & z_o <= 1.96)))
print(sprintf('Replications: over caliper N = %d', sum(z_r <= z_c[2] & z_r >= 1.96)))
print(sprintf('Replications: under caliper N = %d', sum(z_r >= z_c[1] & z_r <= 1.96)))
 
#formal caliper test: originals
#Bayesian analysis
bf = proportionBF(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2)
sprintf('Bayesian test of single proportion: BF10 = %1.2f', exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log
#sample from posterior
samples_o = proportionBF(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2,
                       posterior = TRUE, iterations = draws)
plot(samples_o[,"logodds"])
sprintf('Posterior Median = %1.2f [%1.2f; %1.2f]',
        median(samples_o[,"logodds"]),#Median 
        quantile(samples_o[,"logodds"], 0.025),#Lower edge of 95% Credible Interval
        quantile(samples_o[,"logodds"], 0.975))#Higher edge of 95% Credible Interval
#classical frequentist test
bt = binom.test(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2)
sprintf('Binomial test: p = %1.3f', bt$p.value)#
 
#formal caliper test: replications
bf = proportionBF(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2)
sprintf('Bayesian test of single proportion: BF01 = %1.2f', 1/exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log, turn into BF01
#sample from posterior
samples_r = proportionBF(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2,
                       posterior = TRUE, iterations = draws)
plot(samples_r[,"logodds"])
sprintf('Posterior Median = %1.2f [%1.2f; %1.2f]',
        median(samples_r[,"logodds"]),#Median 
        quantile(samples_r[,"logodds"], 0.025),#Lower edge of 95% Credible Interval
        quantile(samples_r[,"logodds"], 0.975))#Higher edge of 95% Credible Interval
#classical frequentist test
bt = binom.test(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2)
sprintf('Binomial test: p = %1.3f', bt$p.value)#
 
#possibly of interest: overlap of posteriors
#postPriorOverlap(samples_o[,"logodds"], samples_r[,"logodds"])#overlap of distribitions


10 things I learned while working for the Dutch science funding council (NWO)

 

The way science is currently funded is very controversial. During the last 6 months I was on a break from my PhD and worked for the organisation funding science in the Netherlands (NWO). These are 10 insights I gained.


1) Belangenverstrengeling

This is the first word I learned when arriving in The Hague. There is an anal obsession with avoiding (any potential for) conflicts of interest (belangenverstrengeling in Dutch). It might not seem a big deal to you, but it is a big deal at NWO.

 

2) Work ethic

Work e-mails on Sunday evening? Check. Unhealthy deadline obsession? Check. Stories of burn-out diagnoses? Check. In short, I found no evidence for the mythical low work ethic of NWO. My colleagues seemed to be in a perfectly normal, modern, semi-stressful job.

 

3) Perks

While the career prospects at NWO are somewhat limited, there are some nice perks to working in The Hague, including an affordable, good canteen, free fruit all day, a subsidised in-house gym, free massages (unsurprisingly, with a waiting list from hell), a free health check … The work atmosphere is, perhaps as a result, quite pleasant.

 

4) Closed access

Incredible but true, NWO does not have access to the pay-walled research literature it funds. Among other things, I was tasked with checking that research funds were appropriately used. You can imagine that this is challenging if the end-product of science funding (scientific articles) is beyond reach. Given a Herculean push to make all Dutch scientific output open access, this problem will soon be a thing of the past.

 

5) Peer-review

NWO itself does not generally assess grant proposals in terms of content (except for very small grants). What it does is organise peer-review, very similar to the peer-review of journal articles. My impression is that the peer-review quality is similar if not better at NWO compared to the journals that I have published in. NWO has minimum standards for reviewers and tries to diversify the national/scientific/gender background of the reviewer group assigned to a given grant proposal. I very much doubt that this is the case for most scientific journals.

 

6) NWO peer-reviewed

NWO itself also applies for funding, usually to national political institutions, businesses, and the EU. Got your grant proposal rejected at NWO? Find comfort in the thought that NWO itself also gets rejected.

 

7) Funding decisions in the making

In many ways my fears for how it is decided who gets funding were confirmed. Unfortunately, I cannot share more information other than to say: science has a long way to go before focussing rewards on good scientists doing good research.

 

8) Not funding decisions

I worked on grants which were not tied to some societal challenge, political objective, or business need. The funds I helped distribute are meant to simply facilitate the best science, no matter what that science is (often blue sky research, Vernieuwingsimpuls for people in the know). Approximately 10% of grant proposals receive funding. In other words, bad apples do not get funding. Good apples also do not get funding. Very good apples equally get zero funding. Only outstanding/excellent/superman apples get funding. If you think you are good at what you do, do not apply for grant money through the Vernieuwingsimpuls. It’s a waste of time. If, on the other hand, you haven’t seen someone as excellent as you for a while, then you might stand a chance.

 

9) Crisis response

Readers of this blog will be well aware that the field of psychology is currently going through something of a revolution related to depressingly low replication rates of influential findings (Open Science Collaboration, 2015; Etz & Vandekerckhove, 2016; Kunert, 2016). To my surprise, NWO wants to play its part in overcoming the replication crisis engulfing science. I arrived at a fortunate moment and presented my ideas about the problem and potential solutions to NWO. I am glad NWO will set aside money just for replicating findings.

 

10) No civil servant life for me

Being a junior policy officer at NWO turned out to be more or less the job I thought it would be. It was monotonous, cognitively relaxing, and low on responsibilities. In other words, quite different to doing a PhD. Other PhD students standing at the precipice of a burn out might also want to consider this as an option to get some breathing space. For me, it was just that, but not more than that.

— — —

This blog post does not represent the views of my former or current employers. NWO did not endorse this blog post. As far as I know, NWO doesn’t even know that this blog post exists.

— — —

Etz, A., & Vandekerckhove, J. (2016). A Bayesian Perspective on the Reproducibility Project: Psychology PLOS ONE, 11 (2) DOI: 10.1371/journal.pone.0149794

Kunert R (2016). Internal conceptual replications do not increase independent replication success. Psychonomic bulletin & review PMID: 27068542

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

Is Replicability in Economics better than in Psychology?

Colin Camerer and colleagues recently published a Science article on the replicability of behavioural economics. ‘It appears that there is some difference in replication success’ between psychology and economics, they write, given their reproducibility rate of 61% and psychology’s of 36%. I took a closer look at the data to find out whether there really are any substantial differences between fields.

Commenting on the replication success rates in psychology and economics, Colin Camerer is quoted as saying: “It is like a grade of B+ for psychology versus A– for economics.” Unsurprisingly, his team’s Science paper also includes speculation as to what contributes to economics’ “relatively good replication success”. However, such speculation is premature as it is not established whether economics actually displays better replicability than the only other research field which has tried to estimate its replicability (that would be psychology). Let’s check the numbers in Figure 1.

[Figure 1: RPP_EERP_replicability]

Figure 1. Replicability in economics and psychology. Panel A displays replication p-values of originally significant effects. Note that the bottom quartile is at p = .001 and p = .0047 respectively and is, thus, not visible here. Panel B displays the effect size reduction from original to replication study. Violin plots display density, i.e. thicker parts represent more data points.

Looking at the left panel of Figure 1, you will notice that the p-values of the replication studies in economics tend to be lower than in psychology, indicating that economics is more replicable. In order to formally test this, I define a replication success as p < .05 (the typical threshold for proclaiming that an effect was found) and count successes in both data sets. In economics, there are 11 successes and 7 failures. In psychology, there are 34 successes and 58 failures. When comparing these proportions formally with a Bayesian contingency table test, the resulting Bayes Factor of BF10 = 1.77 indicates that the replicability difference between economics and psychology is so small as to be worth no more than a bare mention. Put differently, the replicability projects in economics and psychology were too small to say that one field produces more replicable effects than the other.
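For the curious, this comparison can be sketched in a few lines of R using the counts just mentioned; the full script at the end of this post does the same starting from the raw data:

# Minimal sketch of the contingency-table comparison using the counts quoted above
# (11/7 replication successes/failures in economics, 34/58 in psychology).
library(BayesFactor)

counts <- matrix(c(11, 7,    # economics: successes, failures
                   34, 58),  # psychology: successes, failures
                 nrow = 2, byrow = FALSE)
bf <- contingencyTableBF(counts, sampleType = "indepMulti", fixedMargin = "cols")
exp(bf@bayesFactor$bf)  # BF10, stored as a natural log; should be close to 1.77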

However, a different measure of replicability which doesn’t depend on an arbitrary cut-off at p = .05 might give a clearer picture. Figure 1’s right panel displays the difference between the effect sizes reported in the original publications and those observed by the replication teams. You will notice that most effect size differences are negative, i.e. when replicating an experiment you will probably observe a (much) smaller effect compared to what you read in the original paper. For a junior researcher like me this is an endless source of self-doubt and frustration.

Are effect sizes more similar between original and replication studies in economics compared to psychology? Figure 1B doesn’t really suggest that there is a huge difference. The Bayes factor of a Bayesian t-test comparing the right and left distributions of Figure 1B supports this impression. The null hypothesis of no difference is favored BF01 = 3.82 times more than the alternative hypothesis of a difference (or BF01 = 3.22 if you are an expert and insist on using Cohen’s q). In Table 1, I give some more information for the expert reader.
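For reference, Cohen’s q is simply the difference between Fisher z-transformed correlations. A minimal sketch, mirroring the Cohenq() helper in the script below, with made-up correlations as an example:

# Cohen's q: difference between Fisher z-transformed correlations,
# here used to express how much an effect shrank from original to replication.
cohen_q <- function(r1, r2) atanh(r1) - atanh(r2)  # atanh() is the Fisher z-transform

cohen_q(0.30, 0.50)  # e.g. a replication r of .30 versus an original r of .50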

The take-home message is that there is not enough information to claim that economics displays better replicability than psychology. Unfortunately, psychologists shouldn’t just adopt the speculative factors contributing to the replication success in economics. Instead, we should look elsewhere for inspiration: the simulations of different research practices showing time and again what leads to high replicability (big sample sizes, pre-registration, …) and what not (publication bias, questionable research practices…). For the moment, psychologists should not look towards economists to find role models of replicability.

Table 1. Comparison of Replicability in Economics and Psychology.

| | Economics | Psychology | Bayes Factor (a) | Posterior median [95% Credible Interval] (1) |
|---|---|---|---|---|
| Independent Replications p < .05 | 11 out of 18 | 34 out of 92 | BF10 = 1.77 | 0.95 [-0.03; 1.99] |
| Effect size reduction (simple subtraction) | M = 0.20 (SD = 0.20) | M = 0.20 (SD = 0.21) | BF01 = 3.82 | 0.02 [-0.12; 0.15] |
| Effect size reduction (Cohen’s q) | M = 0.27 (SD = 0.36) | M = 0.22 (SD = 0.26) | BF01 = 3.22 | 0.03 [-0.15; 0.17] |

(a) Assumes normality. See for yourself whether you believe this assumption is met.

(1) Log odds for proportions. Difference values for quantities.

— — —

Camerer, C., Dreber, A., Forsell, E., Ho, T., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics Science DOI: 10.1126/science.aaf0918

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716
— — —

PS: I am indebted to Alex Etz and EJ Wagenmakers who presented a similar analysis of parts of the data on the OSF website: https://osf.io/p743r/

— — —

R code for reproducing Figure and Table (drop me a line if you find a mistake):

# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))

if(!require(BayesFactor)){install.packages('BayesFactor')} #Bayesian analysis
library(BayesFactor)

if(!require(BEST)){install.packages('BEST')} #distribution overlap
library(BEST)#requires JAGS version 3

if(!require(xlsx)){install.packages('xlsx')} #for reading excel sheets
library(xlsx)

#How many draws are to be taken from posterior distribution for BF and Credible Interval calculations? The more samples the more precise the estimate and the slower the calculation.
draws = 10000 * 10#BayesFactor package standard = 10000

##########################################################################################################################################################################################
#-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Figures 1: p-values

#get RPP raw data from OSF website
RPPdata <- get.OSFfile(code='https://osf.io/fgjvw/',dfCln=T)$df
# Select the completed replication studies
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O),!is.na(T.pval.USE.R))

#get EERP raw data from local xls file based on Table S1 of Camerer et al., 2016 (just write me an e-mail if you want it)
EERPdata = read.xlsx("EE_RP_data.xls", 1)

# Restructure the data to "long" format: Study type will be a factor
df1 <- dplyr::select(RPPdata,starts_with("T."))
df <- data.frame(p.value=as.numeric(c(as.character(EERPdata$p_rep),
df1$T.pval.USE.R[df1$T.pval.USE.O < .05])),
grp=factor(c(rep("Economics",times=length(EERPdata$p_rep)),
rep("Psychology",times=sum(df1$T.pval.USE.O < .05)))))

# Create some variables for plotting
df$grpN <- as.numeric(df$grp)
probs <- seq(0,1,.25)

# VQP PANEL A: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(df$grpN),function(gr) quantile(round(df$p.value[df$grpN==gr],digits=4),probs,na.rm=T,type=3))
freqs <- ldply(unique(df$grpN),function(gr) table(cut(df$p.value[df$grpN==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(df$grpN),function(gr)levels(cut(round(df$p.value[df$grpN==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Check the Quantile bins!
Economics <-cbind(freq=as.numeric(t(freqs[1,])))
rownames(Economics) <- labels[,1]
Economics

Psychology <-cbind(freq=as.numeric(t(freqs[2,])))
rownames(Psychology) <- labels[,2]
Psychology

# Get regular violinplot using package ggplot2
g.pv <- ggplot(df,aes(x=grp,y=p.value)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
ggtitle("A") + xlab("") + ylab("replication p-value") +
mytheme
# View
g.pv1

## Uncomment to save panel A as a separate file
# ggsave("RPP_F1_VQPpv.eps",plot=g.pv1)

#calculate counts
sum(as.numeric(as.character(EERPdata$p_rep)) <= .05)#How many economic effects 'worked' upon replication?
sum(as.numeric(as.character(EERPdata$p_rep)) > .05)#How many economic effects 'did not work' upon replication?
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] <= .05)#How many psychological effects 'worked' upon replication?
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] > .05)#How many psychological effects 'did not work' upon replication?

#prepare BayesFactor analysis
data_contingency = matrix(c(sum(as.numeric(as.character(EERPdata$p_rep)) <= .05),#row 1, col 1
sum(as.numeric(as.character(EERPdata$p_rep)) > .05),#row 2, col 1
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] <= .05),#row 1, col 2
sum(df1$T.pval.USE.R[df1$T.pval.USE.O < .05] > .05)),#row 2, col 2
nrow = 2, ncol = 2, byrow = F)#prepare BayesFactor analysis
bf = contingencyTableBF(data_contingency, sampleType = "indepMulti", fixedMargin = "cols")#run BayesFactor comparison
sprintf('BF10 = %1.2f', exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log

#Parameter estimation
chains = posterior(bf, iterations = draws)#draw samples from the posterior
odds_ratio = (chains[,"omega[1,1]"] * chains[,"omega[2,2]"]) / (chains[,"omega[2,1]"] * chains[,"omega[1,2]"])
sprintf('Median = %1.2f [%1.2f; %1.2f]',
median(log(odds_ratio)),#posterior median of the log odds ratio (economics versus psychology replication success)
quantile(log(odds_ratio), 0.025),#lower edge of 95% Credible Interval
quantile(log(odds_ratio), 0.975))#higher edge of 95% Credible Interval
#plot(mcmc(log(odds_ratio)), main = "Log Odds Ratio")

# VQP PANEL B: reduction in effect size -------------------------------------------------

econ_r_diff = as.numeric(as.character(EERPdata$r_rep)) - as.numeric(as.character(EERPdata$r_orig))
psych_r_diff = as.numeric(df1$T.r.R) - as.numeric(df1$T.r.O)
df <- data.frame(EffectSizeDifference= c(econ_r_diff, psych_r_diff[!is.na(psych_r_diff)]),
grp=factor(c(rep("Economics",times=length(econ_r_diff)),
rep("Psychology",times=length(psych_r_diff[!is.na(psych_r_diff)])))))

# Create some variables for plotting
df$grpN <- as.numeric(df$grp)
probs <- seq(0,1,.25)

# Get effect size quantiles and frequencies from data
qtiles <- ldply(unique(df$grpN),function(gr) quantile(df$EffectSizeDifference[df$grpN==gr],probs,na.rm=T,type=3,include.lowest=T))
freqs <- ldply(unique(df$grpN),function(gr) table(cut(df$EffectSizeDifference[df$grpN==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T)))
labels <- sapply(unique(df$grpN),function(gr)levels(cut(round(df$EffectSizeDifference[df$grpN==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Check the Quantile bins!
Economics <-cbind(freq=as.numeric(t(freqs[1,])))
rownames(Economics) <- labels[,1]
Economics

Psychology <-cbind(freq=as.numeric(t(freqs[2,])))
rownames(Psychology) <- labels[,2]
Psychology

# Get regular violinplot using package ggplot2
g.es <- ggplot(df,aes(x=grp,y=EffectSizeDifference)) +
geom_violin(aes(group=grpN),scale="width",fill="grey40",color="grey40",trim=T,adjust=1)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es,qtiles=qtiles,probs=probs)
# Garnish
g.es1 <- g.es0 +
ggtitle("B") + xlab("") + ylab("Replicated - Original Effect Size r") +
scale_y_continuous(breaks=c(-.25,-.5, -0.75, -1, 0,.25,.5,.75,1),limits=c(-1,0.5)) + mytheme
# View
g.es1

# # Uncomment to save panel B as a separate file
# ggsave("RPP_F1_VQPes.eps",plot=g.es1)

# VIEW panels in one plot using the multi.PLOT() function from C-3PR
multi.PLOT(g.pv1,g.es1,cols=2)

# SAVE combined plots as PDF
pdf("RPP_Figure1_vioQtile.pdf",pagecentre=T, width=20,height=8 ,paper = "special")
multi.PLOT(g.pv1,g.es1,cols=2)
dev.off()

#Effect Size Reduction (simple subtraction)-------------------------------------------------

#calculate means and standard deviations
mean(econ_r_diff)#mean ES reduction of economic effects
sd(econ_r_diff)#Standard Deviation ES reduction of economic effects
mean(psych_r_diff[!is.na(psych_r_diff)])#mean ES reduction of psychological effects
sd(psych_r_diff[!is.na(psych_r_diff)])#Standard Deviation ES reduction of psychological effects

#perform BayesFactor analysis
bf = ttestBF(formula = EffectSizeDifference ~ grp, data = df)#Bayesian t-test to test the difference/similarity between the previous two
sprintf('BF01 = %1.2f', 1/exp(bf@bayesFactor$bf[1]))#exponentiate BF10 because stored as natural log, turn into BF01

##Parameter estimation: use BEST package to estimate posterior median and 95% Credible Interval
BESTout = BESTmcmc(econ_r_diff,
psych_r_diff[!is.na(psych_r_diff)],
priors=NULL, parallel=FALSE)
#plotAll(BESTout)
sprintf('Median = %1.2f [%1.2f; %1.2f]',
median(BESTout$mu1 - BESTout$mu2),#posterior median of the difference in effect size reduction (economics minus psychology)
quantile(BESTout$mu1 - BESTout$mu2, 0.025),#lower edge of 95% Credible Interval
quantile(BESTout$mu1 - BESTout$mu2, 0.975))#higher edge of 95% Credible Interval

#Effect Size Reduction (Cohen's q)-------------------------------------------------

#prepare function to calculate Cohen's q
Cohenq <- function(r1, r2) {
fis_r1 = 0.5 * (log((1+r1)/(1-r1)))
fis_r2 = 0.5 * (log((1+r2)/(1-r2)))
fis_r1 - fis_r2
}

#calculate means and standard deviations
econ_Cohen_q = Cohenq(as.numeric(as.character(EERPdata$r_rep)), as.numeric(as.character(EERPdata$r_orig)))
psych_Cohen_q = Cohenq(as.numeric(df1$T.r.R), as.numeric(df1$T.r.O))
mean(econ_Cohen_q)#mean ES reduction of economic effects
sd(econ_Cohen_q)#Standard Deviation ES reduction of economic effects
mean(psych_Cohen_q[!is.na(psych_Cohen_q)])#mean ES reduction of psychological effects
sd(psych_Cohen_q[!is.na(psych_Cohen_q)])#Standard Deviation ES reduction of psychological effects

#perform BayesFactor analysis
dat_bf <- data.frame(EffectSizeDifference = c(econ_Cohen_q,
psych_Cohen_q[!is.na(psych_Cohen_q)]),
grp=factor(c(rep("Economics",times=length(econ_Cohen_q)),
rep("Psychology",times=length(psych_Cohen_q[!is.na(psych_Cohen_q)])))))#prepare BayesFactor analysis
bf = ttestBF(formula = EffectSizeDifference ~ grp, data = dat_bf)#Bayesian t-test to test the difference/similarity between the previous two
#no nullInterval specified: two-sided comparison of the two fields' effect size reduction (Cohen's q)
sprintf('BF01 = %1.2f', 1/exp(bf@bayesFactor$bf[1]))#exponentiate BF10 because stored as natural log, turn into BF01

#Parameter estimation: use BEST package to estimate posterior median and 95% Credible Interval
BESTout = BESTmcmc(econ_Cohen_q,
psych_Cohen_q[!is.na(psych_Cohen_q)],
priors=NULL, parallel=FALSE)
#plotAll(BESTout)
sprintf('Median = %1.2f [%1.2f; %1.2f]',
median(BESTout$mu1 - BESTout$mu2),#posterior median of the difference in effect size reduction, Cohen's q (economics minus psychology)
quantile(BESTout$mu1 - BESTout$mu2, 0.025),#lower edge of 95% Credible Interval
quantile(BESTout$mu1 - BESTout$mu2, 0.975))#higher edge of 95% Credible Interval

How language changes the way you hear music

In a new paper I, together with Roel Willems and Peter Hagoort, show that music and language are tightly coupled in the brain. Get the gist in a 180 second youtube clip and then try out what my participants did.

The task my participants had to do might sound very abstract to you, so let me make it concrete. Listen to these two music pieces and tell me which one sounds ‘finished’:

I bet you thought the second one ended a bit in an odd way. How do you know? You use your implicit knowledge of harmonic relations in Western music for such a ‘finished judgement’. All we did in the paper was to see whether an aspect of language grammar (syntax) can influence your ability to hear these harmonic relations, as revealed by ‘finished judgements’. The music pieces we used for this sounded very similar to what you just heard:

It turns out that reading syntactically difficult sentences while hearing the music reduced the feeling that music pieces like this did actually end well. This indicated that processing language syntax draws on brain resources which are also responsible for music harmony.

Difficult syntax: The surgeon consoled the man and the woman put her hand on his forehead.

Easy syntax: The surgeon consoled the man and the woman because the operation had not been successful.

Curiously, sentences with a difficult meaning had no influence on the ‘finished judgements’.

Difficult meaning: The programmer let his mouse run around on the table after he had fed it.

Easy meaning: The programmer let his field mouse run around on the table after he had fed it.

Because only language syntax influenced ‘finished judgements’, we believe that music and language share a common syntax processor of some kind. This conclusion is in line with a number of other studies which I blogged about before.

What this paper adds is that we rule out an attentional link between music and language as the source of the effect. In other words, difficult syntax doesn’t simply distract you and thereby disables your music hearing. Its influence is based on a common syntax processor instead.

In the end, I tested 278 participants across 3 pre-tests, 2 experiments, and 1 post-test. Judge for yourself whether it was worth it by reading the freely available paper here.

— — —

Kunert R, & Slevc LR (2015). A Commentary on: “Neural overlap in processing music and speech”. Frontiers in human neuroscience, 9 PMID: 26089792

Kunert, R., Willems, R., & Hagoort, P. (2016). Language influences music harmony perception: effects of shared syntactic integration resources beyond attention Royal Society Open Science, 3 (2) DOI: 10.1098/rsos.150685

Are pre-registrations the solution to the replication crisis in Psychology?

Most psychology findings are not replicable. What can be done? In his Psychological Science editorial, Stephen Lindsay advertises pre-registration as a solution, writing that “Personally, I aim never again to submit for publication a report of a study that was not preregistered”. I took a look at whether pre-registrations are effective and feasible [TL;DR: maybe and possibly].

[I updated the blog post using comments by Cortex editor Chris Chambers, see below for full comments. It turns out that many of my concerns have already been addressed. Updates in square brackets.]

A recent study published in Science found that the majority of Psychological research cannot be reproduced by independent replication teams (Open Science Collaboration, 2015). I believe that this is due to questionable research practices (LINK) and that internal replications are no solution to this problem (LINK). However, might pre-registrations be the solution? I don’t think so. The reason why I am pessimistic is three-fold.

What is a pre-registration? A pre-registered study submits its design and analysis before data is acquired. After data acquisition the pre-registered data analysis plan is executed and the results can confidently be labelled confirmatory (i.e. more believable). Analyses not specified before are labelled exploratory (i.e. less believable). Some journals offer peer-review of the pre-registration document. Once it has been approved, the chances of the journal accepting a manuscript based on the proposed design and analysis are supposedly very high. [Chris Chambers: “for more info on RRs see https://osf.io/8mpji/wiki/home/”]

 

1) Pre-registration does not remove all incentives to employ questionable research practices

Pre-registrations should enforce honesty about post hoc changes in the design/analysis. Ironically, the efficacy of pre-registrations is itself dependent on the honesty of researchers. The reason is simple: including the information that an experiment was pre-registered is optional. So, if the planned analysis is optimal, a researcher can boost its impact by revealing that the entire experiment was pre-registered. If not, s/he deletes the pre-registration document and proceeds as if it had never existed, a novel questionable research practice (anyone want to invent a name for it? Optional forgetting?).

Defenders of pre-registration could counter that peer-reviewed pre-registrations are different because there is no incentive to deviate from the planned design/analysis. Publication is guaranteed if the pre-registered study is executed as promised. However, two motives remove this publication advantage:

1a) the credibility boost of presenting a successful post hoc design or analysis decision as a priori can still be achieved by publishing the paper in a different journal which is unaware of the pre-registration document.

1b) the credibility loss of a wider research agenda due to a single unsuccessful experiment can still be avoided by simply withdrawing the study from the journal and forgetting about it.

The take-home message is that one can opt-in and out of pre-registration as one pleases. The maximal cost is the rejection of one peer-reviewed pre-registered paper at one journal. Given that paper rejection is the most normal thing in the world for a scientist these days, this threat is not effective.

[Chris Chambers: “all pre-registrations made now on the OSF become public within 4 years – so as far as I understand, it is no longer possible to register privately and thus game the system in the way you describe, at least on the OSF.”]

2) Pre-registrations did not clean up other research fields

Note that the argument so far assumes that when the pre-registration document is revealed, it is effective in stopping undisclosed post hoc design/analysis decisions. The medical sciences, in which randomized controlled trials have had to be pre-registered since a 2004 decision by journal editors, teach us that this is not so. There are four aspects to this surprising ineffectiveness of pre-registrations:

2a) Many pre-registered studies are not published. For example, Chan et al. (2004a,b) could not locate the publications of 54% – 63% of the pre-registered studies. It’s possible that this is due to the aforementioned publication bias (see 1b above), or other reasons (lack of funding, manuscript under review…).

2b) Medical authors feel free to frequently deviate from their planned designs/analyses. For example 31% – 62% of randomized controlled trials changed at least one primary outcome between pre-registration and publication (Mathieu et al., 2009; Chan et al., 2004a,b). If you thought that psychological scientists are somehow better than medical ones, early indications are that this is not so (Franco et al., 2015).

[Figure: pre-registration deviations in psychological science]

2c) Deviations from pre-registered designs/analyses are not discovered because 66% of journal reviewers do not consult the pre-registration document (Mathieu et al., 2013).

2d) In the medical sciences pre-registration documents are usually not peer-reviewed and quite often sloppy. For example, Mathieu et al., (2013) found 37% of trials to be post-registered (the pointless exercise of registering a study which has already taken place), and 17% of pre-registrations being too imprecise to be useful.

[Chris Chambers: “The concerns raised by others about reviewers not checking protocols apply to clinical trial registries but this is moot for RRs because checking happens at an editorial level (if not at both an editorial and reviewer level) and there is continuity of the review process from protocol through to study completion.”]

3) Pre-registration is a practical night-mare for early career researchers

Now, one might argue that pre-registering is still better than not pre-registering. In terms of non-peer-reviewed pre-registration documents, this is certainly true. However, their value is limited because they can be written so vaguely as to be useless (see 2d) and they can simply be deleted if they ‘stand in the way of a good story’, i.e. if an exploratory design/analysis choice gets reported as confirmatory (see 1a).

The story is different for peer-reviewed pre-registrations. They are impractical because of one factor which tenured decision makers sometimes forget: time. Most research is done by junior scientists who have temporary contracts running anywhere between a few months and five years [reference needed]. These people cannot wait for a peer-review decision which, on average, takes something like one year and ten months (Nosek & Bar-Anan, 2012). This is the submission-to-publication-time distribution for one prominent researcher (Brian Nosek):

[Figure: hist_publication_times — histogram of Brian Nosek’s submission-to-publication times]

What does this mean? As a case study, let’s take Richard Kunert, a fine specimen of a junior researcher, who was given three years of funding by the Max-Planck-Gesellschaft in order to obtain a PhD. Given the experience of Brian Nosek with his articles, and assuming Richard submits three pre-registration documents on day 1 of his 3-year PhD, each individual document has an 84.6% chance of being accepted within three years. The chance that all three will be accepted is 60.6% (0.846³). This scenario is obviously unrealistic because it leaves no time for setting up the studies and for actually carrying them out.

For the more realistic case of one year of piloting and one year of actually carrying out the studies, Richard has a 2.2% chance (0.282³) that all three studies are peer-reviewed at the pre-registration stage and published. However, Richard is not silly (or so I have heard), so he submits 5 studies, hoping that at least three of them will eventually be carried out. In this case he has a 14% chance that at least three studies are peer-reviewed at the pre-registration stage and published. Only if Richard submits 10 or more pre-registration documents for peer-review after 1 year of piloting does he have a more than 50% chance of being left with at least 3 studies to carry out within 1 year.
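For those who want to check the numbers, here is the arithmetic in a few lines of R; the per-study acceptance probabilities of 0.846 (within three years) and 0.282 (within one year) are the ones derived above from Brian Nosek’s submission-to-publication times:

# Probability sketch for the PhD timing argument above.
# p3yr / p1yr: probability that a single pre-registration is accepted within
# 3 years / within 1 year (both taken from the text above).
p3yr <- 0.846
p1yr <- 0.282

p3yr^3                                    # all 3 of 3 accepted within 3 years: ~0.61
p1yr^3                                    # all 3 of 3 accepted within 1 year:  ~0.02
sum(dbinom(3:5, size = 5, prob = p1yr))   # at least 3 of 5 accepted within 1 year: ~0.14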

For all people who hate numbers, let me put it into plain words. Peer-review is so slow that requiring PhD students to only perform pre-registered studies means the overwhelming majority of PhD students will fail their PhD requirements in their funded time. In this scenario cutting-edge, world-leading science will be done by people flipping burgers to pay the rent because funding ran out too quickly.

[Chris Chambers: “Average decision times from Cortex, not including time taken by authors to make revisions: initial trial = 5 days; Stage 1 provisional acceptance = 9 weeks (1-3 rounds of in-depth review); Stage 2 full acceptance = 4 weeks”]

What to do

The arrival of pre-registration in the field of Psychology is undoubtedly a good sign for science. However, given what we know now, no one should be under the illusion that this instrument is the solution to the replication crisis which psychological researchers are facing. At the most, it is a tiny piece of a wider strategy to make Psychology what it has long claimed to be: a robust, evidence based, scientific enterprise.

 

[Please do yourself a favour and read the comments below. You won’t get better people commenting than this.]

— — —

Chan AW, Krleza-Jerić K, Schmid I, & Altman DG (2004). Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ : Canadian Medical Association journal = journal de l’Association medicale canadienne, 171 (7), 735-40 PMID: 15451835

Chan, A., Hróbjartsson, A., Haahr, M., Gøtzsche, P., & Altman, D. (2004). Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials JAMA, 291 (20) DOI: 10.1001/jama.291.20.2457

Franco, A., Malhotra, N., & Simonovits, G. (2015). Underreporting in Psychology Experiments: Evidence From a Study Registry Social Psychological and Personality Science DOI: 10.1177/1948550615598377

Lindsay, D. (2015). Replication in Psychological Science Psychological Science DOI: 10.1177/0956797615616374

Mathieu, S., Boutron, I., Moher, D., Altman, D.G., & Ravaud, P. (2009). Comparison of Registered and Published Primary Outcomes in Randomized Controlled Trials JAMA, 302 (9) DOI: 10.1001/jama.2009.1242

Mathieu, S., Chan, A., & Ravaud, P. (2013). Use of Trial Register Information during the Peer Review Process PLoS ONE, 8 (4) DOI: 10.1371/journal.pone.0059910

Nosek, B., & Bar-Anan, Y. (2012). Scientific Utopia: I. Opening Scientific Communication Psychological Inquiry, 23 (3), 217-243 DOI: 10.1080/1047840X.2012.692215

Open Science Collaboration (2015). PSYCHOLOGY. Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443
— — —

Are internal replications the solution to the replication crisis in Psychology? No.

Most Psychology findings are not replicable. What can be done? Stanford psychologist Michael Frank has an idea: cumulative study sets with internal replication. ‘If I had to advocate for a single change to practice, this would be it.’ I took a look at whether this makes any difference.

A recent paper in the journal Science has tried to replicate 97 statistically significant effects (Open Science Collaboration, 2015). In only 35 cases this was successful. Most findings were suddenly a lot weaker upon replication. This has led to a lot of soul searching among psychologists. Fortunately, the authors of the Science paper have made their data freely available. So, soul searching can be accompanied by trying out different ideas for improvements.

What can be done to solve Psychology’s replication crisis?

One idea to improve the situation is to demand study authors to replicate their own experiments in the same paper. Stanford psychologist Michael Frank writes:

If I had to advocate for a single change to practice, this would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. […] If you show me a one-off study and I fail to replicate it in my lab, I will tend to suspect that you got lucky or p-hacked your way to a result. But if you show me a package of studies with four internal replications of an effect, I will believe that you know how to get that effect – and if I don’t get it, I’ll think that I’m doing something wrong.
If this argument were true, then the 41 studies which were successfully conceptually replicated in their own paper should show higher rates of replication than the 56 studies which were not. Of the 41 internally replicated studies, 19 were replicated once, 10 twice, 8 thrice, and 4 more than three times. I will treat all of these as equally internally replicated.

Are internal replications the solution? No.

[Figure: internal_external_replication — effect size reduction and replication p-values for internally replicated versus non-internally replicated effects]

So, does the data from the reproducibility project show a difference? I made so-called violin plots; thicker parts represent more data points. In the left plot you see the reduction in effect sizes from a bigger original effect to a smaller replicated effect. The reduction associated with internally replicated effects (left) and effects which were only reported once in a paper (right) is more or less the same. In the right plot you can see the p-value of the replication attempt. The dotted line represents the arbitrary 0.05 threshold used to determine statistical significance. Again, replicators appear to have had as hard a task with effects that were found more than once in a paper as with effects which were only found once.

If you do not know how to read these plots, don’t worry. Just focus on this key comparison. 29% of internally replicated effects could also be replicated by an independent team (1 effect was below p = .055 and is not counted here). The equivalent number of not internally replicated effects is 41%. A contingency table Bayes factor test (Gunel & Dickey, 1974) shows that the null hypothesis of no difference is 1.97 times more likely than the alternative. In other words, the 12 %-point replication advantage for non-replicated effects does not provide convincing evidence for an unexpected reversed replication advantage. The 12%-point difference is not due to statistical power. Power was 92% on average in the case of internally replicated and not internally replicated studies. So, the picture doesn’t support internal replications at all. They are hardly the solution to Psychology’s replication problem according to this data set.

The problem with internal replications

I believe that internal replications do not prevent many of the questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times in a single paper. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams failed to replicate Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).
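
Schimmack’s argument can be illustrated with a quick back-of-the-envelope calculation. Assuming, hypothetically, a healthy 80% power for each individual study, a package in which all eight studies come out significant is itself a fairly unlikely outcome under honest reporting:

# Illustration of Schimmack's (2012) point, assuming a hypothetical 80% power per study:
# even then, 8 out of 8 significant results are an unlikely outcome under honest reporting.
power_per_study <- 0.80
n_studies <- 8
power_per_study^n_studies                                    # ~0.17: chance that all 8 turn out significant
dbinom(n_studies, size = n_studies, prob = power_per_study)  # same value via the binomial distribution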

What can be done?

In my previous post I have shown evidence for questionable research practices in this data set. These lead to less replicable results. Pre-registering studies makes questionable research practices a lot harder and science more reproducible. It would be interesting to see data on whether this hunch is true.

[update 7/9/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to take into account the different denominators for replicated and unreplicated effects. Lee Jussim pointed me to this.]

[update 24/10/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to provide correct numbers, Bayesian analysis and power comparison.]

— — —
Bem DJ (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of personality and social psychology, 100 (3), 407-25 PMID: 21280961

Galak, J., LeBoeuf, R., Nelson, L., & Simmons, J. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103 (6), 933-948 DOI: 10.1037/a0029709

Gunel, E., & Dickey, J. (1974). Bayes Factors for Independence in Contingency Tables. Biometrika, 61(3), 545–557. http://doi.org/10.2307/2334738

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Ritchie SJ, Wiseman R, & French CC (2012). Failing the future: three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PloS one, 7 (3) PMID: 22432019

Schimmack U (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17 (4), 551-66 PMID: 22924598
— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between internal replication and independent reproducibility of an effect

#Richard Kunert for Brain's Idea 5/9/2015

# a lot of code was taken from the reproducibility project code here https://osf.io/vdnrb/

# installing/loading the packages:
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))

#loading the data
RPPdata <- get.OSFfile(code='https://osf.io/fgjvw/',dfCln=T)$df
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O),!is.na(T.pval.USE.R), complete.cases(RPPdata$T.r.O,RPPdata$T.r.R))#97 studies with significant effects

#prepare IDs for internally replicated effects and non-internally replicated effects
idIntRepl <- RPPdata$Successful.conceptual.replications.O > 0
idNotIntRepl <- RPPdata$Successful.conceptual.replications.O == 0

# Get ggplot2 themes predefined in C-3PR
mytheme <- gg.theme("clean")

#restructure data in data frame
dat <- data.frame(EffectSizeDifference = as.numeric(c(c(RPPdata$T.r.R[idIntRepl]) - c(RPPdata$T.r.O[idIntRepl]),
                                                          c(RPPdata$T.r.R[idNotIntRepl]) - c(RPPdata$T.r.O[idNotIntRepl]))),
                  ReplicationPValue = as.numeric(c(RPPdata$T.pval.USE.R[idIntRepl],
                                                   RPPdata$T.pval.USE.R[idNotIntRepl])),
                  grp=factor(c(rep("Internally Replicated Studies",times=sum(idIntRepl)),
                               rep("Internally Unreplicated Studies",times=sum(idNotIntRepl))))
  )

# Create some variables for plotting
dat$grp <- as.numeric(dat$grp)
probs   <- seq(0,1,.25)

# VQP PANEL A: reduction in effect size -------------------------------------------------

# Get effect size difference quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$EffectSizeDifference[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$EffectSizeDifference[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$EffectSizeDifference[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.es <- ggplot(dat,aes(x=grp,y=EffectSizeDifference)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es,qtiles,probs)
# Garnish the plot (i.e. add the title, axis labels, and theme)
g.es1 <- g.es0 +
  ggtitle("Effect size reduction") + xlab("") + ylab("Replicated - Original Effect Size") + 
  xlim("Internally Replicated", "Not Internally Replicated") +
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.es1


# VQP PANEL B: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$ReplicationPValue[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$ReplicationPValue[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$ReplicationPValue[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.pv <- ggplot(dat,aes(x=grp,y=ReplicationPValue)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish again (add the significance threshold line, title, axis labels, and theme)
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("Independent replication p-value") + xlab("") + ylab("Independent replication p-value") + 
  xlim("Internally Replicated", "Not Internally Replicated")+
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.pv1

#put two plots together
multi.PLOT(g.es1,g.pv1,cols=2)

Why are Psychological findings mostly unreplicable?

Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.

Sometimes the title of a paper just sounds incredible. Estimating the reproducibility of psychological science. No one had ever systematically, empirically investigated this for any science. Doing so would require huge resources. The many authors of this paper, which appeared in Science last week, went to great lengths to try anyway, and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e. a nominal 92% chance of finding each effect should it exist as claimed by the original discoverers), 89 statistically significant effects should have popped up. Only 35 did. Why weren’t 54 more studies replicated?
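
To see just how surprising this is, here is a quick back-of-the-envelope check, assuming all 97 original effects were real and each replication indeed had 92% power:

# If all 97 original effects were real and each replication had 92% power,
# how many significant replications should we expect, and how surprising are only 35?
n_effects <- 97
power <- 0.92
n_effects * power                            # expected significant replications: ~89
pbinom(35, size = n_effects, prob = power)   # probability of 35 or fewer: practically zero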

The team behind this article also produced 95% Confidence Intervals of the replication study effect sizes. Despite their name, only 83% of them should contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes much smaller in the replication?
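
The 83% figure follows from the fact that both the original and the replication estimate carry sampling error. Assuming, for simplicity, equal standard errors in the two studies, the expected coverage works out as follows:

# Why should only ~83% of replication 95% CIs contain the original effect size?
# Both estimates carry sampling error. Assuming equal standard errors (SE) in
# original and replication, their difference has standard error sqrt(2) * SE, so:
2 * pnorm(1.96 / sqrt(2)) - 1   # ~0.834, i.e. roughly 83% expected coverage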

One reason for poor replication: sampling until significant

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a p-value below the arbitrary threshold of 0.05. The consequence is this:

KunertFig5

Focus on the left panel first. The green replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected, as the replicators wanted to be sure that they had sufficient statistical power to find their effects. Expecting small effects (lower on the vertical axis) makes you plan in more participants (further right on the horizontal axis). The replicators simply sampled their pre-determined number and then analysed the data. Such a practice leads to only a moderate correlation between measured effect size and sample size, because the effect size that will actually be measured is still unknown when the sample size is fixed.

The red original studies show a stronger relation between the effect size they found and their sample size. They must have done more than just smart a priori power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size measured in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the expected effect size estimated before the experiment, hence the weaker association with the actually measured effect size.
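
This mechanism is easy to demonstrate in a small simulation. The sketch below is not the original studies’ actual procedure, just a minimal ‘sample, test, sample more if not yet significant’ routine with one fixed true effect; even so, it produces the strong negative link between final sample size and measured effect size described above.

# Minimal simulation of 'sampling until significant' (a sketch, not the original
# studies' actual procedures). One fixed true effect; participants are added in
# batches and sampling stops as soon as the t-test reaches p < .05 (or at n_max).
set.seed(1)
simulate_optional_stopping <- function(true_d = 0.3, n_start = 10, n_step = 10, n_max = 100) {
  x <- rnorm(n_start, mean = true_d)
  while (t.test(x)$p.value >= 0.05 && length(x) < n_max) {
    x <- c(x, rnorm(n_step, mean = true_d))        # add another batch and re-test
  }
  c(n = length(x), d = mean(x) / sd(x))            # final sample size and measured effect size
}
res <- t(replicate(2000, simulate_optional_stopping()))
cor(res[, "n"], res[, "d"], method = "spearman")   # strongly negative: early stops = inflated effects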

The right panel shows a Bayesian correlation analysis of the data. What you are looking at is the belief in the strength of the correlation, called the posterior distribution. The overlap of the distributions can be used as a measure of the belief that the correlations are not different. The overlap is less than 7%. If you are more inclined to believe in frequentist statistics, the associated p-value is .001 (Pearson and Filon’s z = 3.355). Therefore, there is strong evidence that original studies display a stronger negative correlation between sample size and measured effect size than replication studies.

The approach which – I believe – has been followed by the original research teams should be accompanied by adjustments of the p-value (see Lakens, 2014 for how to do this). If not, you misrepresent your stats and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). It is estimated that 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in Psychology are far lower than what they should be.

So, one approach to boosting replication rates might be to do what we claim to do anyway and what the replication studies have actually done: acquiring data first, analysing them second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

[24/10/2015: Added Bayesian analysis and changed figure. Code below is from old figure.]

[27/11/2015: Adjusted percentage overlap of posterior distributions.]

— — —
John LK, Loewenstein G, & Prelec D (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23 (5), 524-32 PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses European Journal of Social Psychology, 44 (7), 701-710 DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between sample size and effect size from data provided by the reproducibility project https://osf.io/vdnrb/

#Richard Kunert for Brain's Idea 3/9/2015
#load necessary libraries
library(httr)
library(Hmisc)
library(ggplot2)
library(cocor)

#get raw data from OSF website
info <- GET('https://osf.io/fgjvw/?action=download', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file

#restrict studies to those with appropriate data
studies <- MASTER$ID[!is.na(MASTER$T_r..O.) & !is.na(MASTER$T_r..R.)] ##to keep track of which studies are which
studies <- studies[-31] ##remove one problem study with absurdly high sample size (N = 23,0047)

#set font size for plotting
theme_set(theme_gray(base_size = 30))

#prepare correlation coefficients
dat_rank <- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
                       sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
                       effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
                       effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm = rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O, type = "spearman")#yes, I know the type specification is superfluous
corr_R_Spearm = rcorr(dat_rank$effect_size_R, dat_rank$sample_size_R, type = "spearman")#fixed: was dat$sample_size_R, which does not exist

#compare Spearman correlation coefficients using cocor (data needs to be ranked in order to produce Spearman correlations!)
htest = cocor(formula=~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
              data = dat_rank, return.htest = FALSE)

#visualisation
#prepare data frame
dat_vis <- data.frame(study = rep(c("Original", "Replication"), each=length(studies)),
                      sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]), cbind(MASTER$T_N_R_for_tables[studies])),
                      effect_size = rbind(cbind(MASTER$T_r..O.[studies]), cbind(MASTER$T_r..R.[studies])))

#The plotting call
ggplot(data=dat_vis, aes(x=sample_size, y=effect_size, group=study)) +#the basic scatter plot
geom_point(aes(color=study),shape=1,size=4) +#specify marker size and shape
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm,   # Add linear regression lines
se=FALSE,    # Don't add shaded confidence region
size=2,
aes(color=study))+#colour lines according to data points for consistency
geom_text(aes(x=750, y=0.46,
label=sprintf("Spearman rho = %1.3f (p = %1.3f)",
corr_O_Spearm$r[1,2], corr_O_Spearm$P[1,2]),
color="Original", hjust=0)) +#add text about Spearman correlation coefficient of original studies
guides(color = guide_legend(title=NULL)) + #avoid additional legend entry for text
geom_text(aes(x=750, y=0.2,
label=sprintf("Spearman rho = %1.3f (p = %1.3f)",
corr_R_Spearm$r[1,2], corr_R_Spearm$P[1,2]),
color="Replication", hjust=0))+#add text about Spearman correlation coefficient of replication studies
geom_text(x=1500, y=0.33,
label=sprintf("Difference: Pearson &amp; Filon z = %1.3f (p = %1.3f)",
htest@pearson1898$statistic, htest@pearson1898$p.value),
color="black", hjust=0)+#add text about testing difference between correlation coefficients
guides(color = guide_legend(title=NULL))+#avoid additional legend entry for text
ggtitle("Sampling until significant versus a priori power analysis")+#add figure title
labs(x="Sample Size", y="Effect size r")#add axis titles

Do music and language share brain resources?

When you listen to some music and when you read a book, does your brain use the same resources? This question goes to the heart of how the brain is organised – does it make a difference between cognitive domains like music and language? In a new commentary I highlight a successful approach which helps to answer this question.

On some isolated island in academia, the tree of knowledge has the form of a brain.

How do we read? What is the brain doing in this picture?

When reading the following sentence, check carefully when you are surprised at what you are reading:

After | the trial | the attorney | advised | the defendant | was | likely | to commit | more crimes.

I bet it was on the segment was. You probably thought that the defendant was advised, rather than that someone else was advised about the defendant. Once you read the word was you need to reinterpret what you have just read. In 2009 Bob Slevc and colleagues found that background music can change your reading of this kind of sentence. If you hear a chord which is harmonically unexpected, you have even more trouble with the reinterpretation of the sentence upon reading was.

Why does music influence language?

Why would an unexpected chord be problematic for reading surprising sentences? The most straightforward explanation is that unexpected chords are odd, so they draw your attention. To test this simple explanation, Slevc tried out an unexpected instrument playing the chord in a harmonically expected way. No effect on reading. Apparently, not just any odd chord changes your reading. The musical oddity has to stem from the harmony of the chord. Why this is the case is a matter of debate among scientists. What this experiment makes clear, though, is that music can influence language via shared resources which have something to do with harmony processing.

Why ignore the fact that music influences language?

None of this was mentioned in a recent review by Isabelle Peretz and colleagues on this topic. They looked at where in the brain music and language show activations, as revealed in MRI brain scanners. This is just one way to find out whether music and language share brain resources. They concluded that ‘the question of overlap between music and speech processing must still be considered as an open question’. Peretz and colleagues call for ‘converging evidence from several methodologies’ but fail to mention the evidence from non-MRI methodologies.1

Sure, one has to focus on something, but it annoys me that people tend to focus on methods (especially fancy, expensive methods like MRI scanners) rather than answers (especially answers from elegant but cheap research into human behaviour, like reading). So I decided to write a commentary together with Bob Slevc. We list no fewer than ten studies which used a similar approach to the one outlined above. Why ignore these results?

Had Peretz and colleagues truly looked at ‘converging evidence from several methodologies’, they would have asked themselves why music sometimes influences language and why it sometimes does not. The debate is in full swing and already beyond the previous question of whether music and language share brain resources. Instead, researchers ask what kind of resources are shared.

So, yes, music and language appear to share some brain resources. Perhaps this is not easily visible in MRI brain scanners. Looking at how people read with chord sequences played in the background is how one can show this.

— — —
Kunert, R., & Slevc, L.R. (2015). A commentary on “Neural overlap in processing music and speech” (Peretz et al., 2015). Frontiers in Human Neuroscience. doi: 10.3389/fnhum.2015.00330

Peretz I, Vuvan D, Lagrois MÉ, & Armony JL (2015). Neural overlap in processing music and speech. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 370 (1664) PMID: 25646513

Slevc LR, Rosenberg JC, & Patel AD (2009). Making psycholinguistics musical: self-paced reading time evidence for shared processing of linguistic and musical syntax. Psychonomic bulletin & review, 16 (2), 374-81 PMID: 19293110
— — —

1 Except for one ECoG study.

DISCLAIMER: The views expressed in this blog post are not necessarily shared by Bob Slevc.

Why does humanity get smarter and smarter?

Intelligence tests have to be adjusted all the time because people score higher and higher. If the average human of today went 105 years back in time, s/he would score 130, be considered gifted, and join clubs for highly intelligent people. How can that be?

IQ_increase_base_graphic_v.2_English

The IQ growth

The picture above shows the development of humanity’s intelligence between 1909 and 2013. According to IQ-scores people got smarter and smarter. During the last 105 years, people’s scores increased by as much as 30 IQ-points. That is equivalent to the difference between intellectual disability and normal intelligence. Ever since the discovery of this effect by James Flynn, the underlying reason has been hotly debated. A new analysis combines all available studies into one overall picture in order to find answers.

Jakob Pietschnig and Martin Voracek included all available data pertaining to IQ increases from one generation to another: nearly 4 million test takers in 105 years. They found that IQ scores sometimes increased faster and sometimes more slowly. Check the difference between the 1920s and WWII in the figure above. Moreover, different aspects of intelligence change at different speeds. So-called crystallized intelligence (knowledge about facts) increased only at a rate of 0.2 points per year. So-called fluid intelligence (abstract problem solving), on the other hand, increased much faster at 0.4 points per year.

Five reasons for IQ growth

Five reasons appear to come together to explain this phenomenon:

1) better schooling: IQ growth is stronger in adults than in children, probably because adults stay longer and longer in school.

2) more experience with multiple choice tests: since the 1990s the multiple choice format has become common in schools and universities. Modern test takers are no longer put off by this way of asking questions in IQ tests and might resort to smart guessing.

3) less malnutrition: the slow IQ growth during the world wars might have something to do with a lack of nutrients and energy which the brain needs.

4) better health care: the less sick you are, the more optimally your brain can develop.

5) less lead poisoning: since the 1970s lead has been phased out of paint and gasoline, removing an obstacle to healthy neural development.

Am I really smarter than my father?

According to the Flynn effect, my generation is 8 IQ-points smarter than that of my parents. But this only relates to performance on IQ tests. I somehow doubt that more practical, less abstract, areas show the same effect. Perhaps practical intelligence is just more difficult to measure. It is possible that we have not really become more intelligent thinkers but instead more abstract thinkers.

— — —
Pietschnig J, & Voracek M (2015). One Century of Global IQ Gains: A Formal Meta-Analysis of the Flynn Effect (1909-2013). Perspectives on psychological science : a journal of the Association for Psychological Science, 10 (3), 282-306 PMID: 25987509

— — —

Figure: self made, based on data in Figure 1 in Pietschnig & Voracek (2015, p. 285)

The scientific community’s Galileo affair (you’re the Pope)

Science is in crisis. Everyone in the scientific community knows about it but few want to talk about it. The crisis is one of honesty. A junior scientist (like me) asks himself a question similar to Galileo’s in 1633: how much honesty is desirable in science?

Galileo versus Pope: guess what role the modern scientist plays.

Science Wonderland

According to nearly all empirical scientific publications that I have read, scientists allegedly work like this:

Introduction, Methods, Results, Discussion

Scientists call this ‘the story’ of the paper. This ‘story framework’ is so entrenched in science that the vast majority of scientific publications are required to be organised according to its structure: 1) Introduction, 2) Methods, 3) Results, 4) Discussion. My own publication is no exception.

Science Reality

However, virtually all scientists know that ‘the story’ is not really true. It is merely an ideal-case-scenario. Usually, the process looks more like this:

questionable research practices

Scientists call some of the added red arrows questionable research practices (or QRP for short). The red arrows stand for (going from left to right, top to bottom):

1) adjusting the hypothesis based on the experimental set-up. This is particularly common when a) working with an old data set, b) the set-up is accidentally different from the intended one, etc.

2) changing design details (e.g., how many participants, how many conditions to include, how many/which measures of interest to focus on) depending on the results these changes produce.

3) analysing until results are easy to interpret.

4) analysing until results are statistically desirable (‘significant results’), i.e. so-called p-hacking.

5) hypothesising after results are known (so-called HARKing).

The outcome is a collection of blatantly unrealistic ‘stories’ in scientific publications. Compare this to the more realistic literature on clinical trials for new drugs. More than half the drugs fail the trial (Goodman, 2014). In contrast, nearly all ‘stories’ in the wider scientific literature are success stories. How?

Joseph Simmons and colleagues (2011) give an illustration of how to produce spurious successes. They simulated the situation of researchers engaging in the second point above (changing design details based on results). Let’s assume that the hypothesised effect is not real. How many experiments will erroneously find an effect at the conventional 5% significance criterion? Well, 5% of experiments should (scientists have agreed that this number is low enough to be acceptable). However, thanks to the questionable research practices outlined above this number can be boosted. For example, sampling participants until the result is statistically desirable leads to up to 22% of experiments reporting a ‘significant result’ even though there is no effect to be found. It is estimated that 70% of US psychologists have done this (John et al., 2012). When such a practice is combined with other, similar design changes, up to 61% of experiments falsely report a significant effect. Why do we do this?
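
A small simulation makes the point concrete. The sketch below uses just one possible stopping rule, not the exact set-up of Simmons and colleagues: there is no true effect, but the data are tested after every ten additional participants and collection stops at the first p < .05. The long-run rate of ‘significant’ outcomes ends up well above the nominal 5% (the exact figure depends on the stopping rule).

# Sketch in the spirit of Simmons et al. (2011): the null is true (mean = 0),
# but we peek after every 10 added participants (up to 100) and stop at the
# first p < .05. The long-run rate of 'significant' results far exceeds 5%.
set.seed(1)
peek_until_significant <- function(n_start = 10, n_step = 10, n_max = 100) {
  x <- rnorm(n_start)                                   # no true effect
  while (t.test(x)$p.value >= 0.05 && length(x) < n_max) {
    x <- c(x, rnorm(n_step))                            # add another batch and re-test
  }
  t.test(x)$p.value < 0.05                              # did we end up 'significant'?
}
mean(replicate(5000, peek_until_significant()))         # well above the nominal 0.05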

The Pope of 1633 is back

If we know that the scientific literature is unrealistic, why don’t we just drop the pretense and tell it as it is? The reason is simple: because you like the scientific wonderland of success stories. If you are a scientist reader, you like to base the evaluation of scientific manuscripts on the ‘elegance’ (simplicity, straightforwardness) of the text. This leaves no room for telling you what really happened. You also like to base the evaluation of your colleagues on the quantity and the ‘impact’ of their scientific output. QRPs are essentially a career requirement in such a system. If you are a lay reader, you like the research you fund (via tax money) to be sexy, easy and simple. Scientific data are as messy as the real world, but the reported results are not. They are meant to be easily digestible (‘elegant’) insights.

In 1633 it did not matter much whether Galileo admitted to the heliocentric world view which was deemed blasphemous. The idea was out there to conquer the minds of the renaissance world. Today’s Galileo moment is also marked by an inability to admit to scientific facts (i.e. the so-called ‘preliminary’ research results which scientists obtain before applying questionable research practices). But this time the role of the Pope is played both by political leaders/ the lay public and scientists themselves. Actual scientific insights get lost before they can see the light of day.

There is a great movement to remedy this situation, including pressure to share data (e.g., at PLoS ONE), replication initiatives (e.g., RRR1, the Reproducibility Project), and the opportunity to pre-register experiments. However, these remedies only focus on scientific practice, as if Galileo were at fault and the concept of blasphemy was fine. Maybe we should start looking into how we got into this mess in the first place. Focus on the Pope.

— — —
John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices with Incentives for Truth-Telling SSRN Electronic Journal DOI: 10.2139/ssrn.1996631

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

Picture: Joseph Nicolas Robert-Fleury [Public domain], via Wikimedia Commons

PS: This post necessarily reflects the field of science that I am familiar with (Psychology, Cognitive Science, Neuroscience). The situation may well be different in other scientific fields.