Richard Kunert

Are pre-registrations the solution to the replication crisis in Psychology?

Most psychology findings are not replicable. What can be done? In his Psychological Science editorial, Stephen Lindsay advertises pre-registration as a solution, writing that “Personally, I aim never again to submit for publication a report of a study that was not preregistered”. I took a look at whether pre-registrations are effective and feasible [TL;DR: maybe and possibly].

[I updated the blog post using comments by Cortex editor Chris Chambers, see below for full comments. It turns out that many of my concerns have already been addressed. Updates in square brackets.]

A recent study published in Science found that the majority of Psychological research cannot be reproduced by independent replication teams (Open Science Collaboration, 2015). I believe that this is due to questionable research practices (LINK) and that internal replications are no solution to this problem (LINK). However, might pre-registrations be the solution? I don’t think so. The reason why I am pessimistic is three-fold.

What is a pre-registration? In a pre-registered study, the design and analysis are submitted before data are acquired. After data acquisition the pre-registered data analysis plan is executed and the results can confidently be labelled confirmatory (i.e. more believable). Analyses not specified beforehand are labelled exploratory (i.e. less believable). Some journals offer peer-review of the pre-registration document. Once it has been approved, the chances of the journal accepting a manuscript based on the proposed design and analysis are supposedly very high. [Chris Chambers: "for more info on RRs see"]


1) Pre-registration does not remove all incentives to employ questionable research practices

Pre-registrations should enforce honesty about post hoc changes in the design/analysis. Ironically, the efficacy of pre-registrations is itself dependent on the honesty of researchers. The reason is simple: including the information that an experiment was pre-registered is optional. So, if the planned analysis is optimal, a researcher can boost its impact by revealing that the entire experiment was pre-registered. If not, s/he deletes the pre-registration document and proceeds as if it had never existed, a novel questionable research practice (anyone want to invent a name for it? Optional forgetting?).

Defenders of pre-registration could counter that peer-reviewed pre-registrations are different because there is no incentive to deviate from the planned design/analysis. Publication is guaranteed if the pre-registered study is executed as promised. However, two motives remove this publication advantage:

1a) the credibility boost of presenting a successful post hoc design or analysis decision as a priori can still be achieved by publishing the paper in a different journal which is unaware of the pre-registration document.

1b) the credibility loss of a wider research agenda due to a single unsuccessful experiment can still be avoided by simply withdrawing the study from the journal and forgetting about it.

The take-home message is that one can opt-in and out of pre-registration as one pleases. The maximal cost is the rejection of one peer-reviewed pre-registered paper at one journal. Given that paper rejection is the most normal thing in the world for a scientist these days, this threat is not effective.

[Chris Chambers: “all pre-registrations made now on the OSF become public within 4 years – so as far as I understand, it is no longer possible to register privately and thus game the system in the way you describe, at least on the OSF.”]

2) Pre-registrations did not clean up other research fields

Note that the argument so far assumes that when the pre-registration document is revealed, it is effective in stopping undisclosed post hoc design/analysis decisions. The medical sciences, in which randomized controlled trials have had to be pre-registered since a 2004 decision by journal editors, teach us that this is not so. There are four aspects to this surprising ineffectiveness of pre-registrations:

2a) Many pre-registered studies are not published. For example, Chan et al. (2004a,b) could not locate the publications of 54% – 63% of the pre-registered studies. It’s possible that this is due to the aforementioned publication bias (see 1b above), or other reasons (lack of funding, manuscript under review…).

2b) Medical authors feel free to frequently deviate from their planned designs/analyses. For example 31% – 62% of randomized controlled trials changed at least one primary outcome between pre-registration and publication (Mathieu et al., 2009; Chan et al., 2004a,b). If you thought that psychological scientists are somehow better than medical ones, early indications are that this is not so (Franco et al., 2015).

[Figure: pre-registration deviations in psychological science]

2c) Deviations from pre-registered designs/analyses are not discovered because 66% of journal reviewers do not consult the pre-registration document (Mathieu et al., 2013).

2d) In the medical sciences pre-registration documents are usually not peer-reviewed and quite often sloppy. For example, Mathieu et al. (2013) found 37% of trials to be post-registered (the pointless exercise of registering a study which has already taken place) and 17% of pre-registrations to be too imprecise to be useful.

[Chris Chambers: “The concerns raised by others about reviewers not checking protocols apply to clinical trial registries but this is moot for RRs because checking happens at an editorial level (if not at both an editorial and reviewer level) and there is continuity of the review process from protocol through to study completion.”]

3) Pre-registration is a practical nightmare for early career researchers

Now, one might argue that pre-registering is still better than not pre-registering. In terms of non-peer-reviewed pre-registration documents, this is certainly true. However, their value is limited because they can be written so vaguely as to be useless (see 2d) and they can simply be deleted if they ‘stand in the way of a good story’, i.e. if an exploratory design/analysis choice gets reported as confirmatory (see 1a).

The story is different for peer-reviewed pre-registrations. They are impractical because of one factor which tenured decision makers sometimes forget: time. Most research is done by junior scientists who have temporary contracts running anywhere between a few months and five years [reference needed]. These people cannot wait for a peer-review decision which, on average, takes something like one year and ten months (Nosek & Bar-Anan, 2012). This is the submission-to-publication-time distribution for one prominent researcher (Brian Nosek):


What does this mean? As a case study, let’s take Richard Kunert, a fine specimen of a junior researcher, who was given three years of funding by the Max-Planck-Gesellschaft in order to obtain a PhD. Given the experience of Brian Nosek with his articles, and assuming Richard submits three pre-registration documents on day 1 of his 3-year PhD, each individual document has an 84.6% chance of being accepted within three years. The chance that all three will be accepted is 60.6% (0.846³). This scenario is obviously unrealistic because it leaves no time for setting up the studies and for actually carrying them out.

For the more realistic case of one year of piloting and one year of actually carrying out the studies, Richard has a 2.2% chance (0.282³) that all three studies are peer-reviewed at the pre-registration stage and published. However, Richard is not silly (or so I have heard), so he submits 5 studies, hoping that at least three of them will eventually be carried out. In this case he has a 14% chance that at least three studies are peer-reviewed at the pre-registration stage and published. Only if Richard submits 10 or more pre-registration documents for peer review after 1 year of piloting does he have a more than 50% chance of being left with at least 3 studies to carry out within 1 year.
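
For readers who want to check the arithmetic, here is a small R sketch of where these percentages come from (my own illustration; the per-study acceptance probabilities of 84.6% within three years and 28.2% within one year are taken from the text, and each submission is treated as an independent trial):

# probability sketch for the scenario above (not the original code)
p_3yr <- 0.846  # chance that one pre-registration is accepted within 3 years
p_1yr <- 0.282  # chance that one pre-registration is accepted within 1 year of review time

p_3yr^3                                 # all 3 of 3 accepted within 3 years: ~0.61
p_1yr^3                                 # all 3 of 3 accepted within 1 year: ~0.022
1 - pbinom(2, size = 5, prob = p_1yr)   # at least 3 of 5 accepted within 1 year: ~0.14
1 - pbinom(2, size = 10, prob = p_1yr)  # at least 3 of 10 accepted within 1 year: ~0.57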

For all people who hate numbers, let me put it into plain words. Peer-review is so slow that requiring PhD students to only perform pre-registered studies means the overwhelming majority of PhD students will fail their PhD requirements in their funded time. In this scenario cutting-edge, world-leading science will be done by people flipping burgers to pay the rent because funding ran out too quickly.

[Chris Chambers: “Average decision times from Cortex, not including time taken by authors to make revisions: initial trial = 5 days; Stage 1 provisional acceptance = 9 weeks (1-3 rounds of in-depth review); Stage 2 full acceptance = 4 weeks”]

What to do

The arrival of pre-registration in the field of Psychology is undoubtedly a good sign for science. However, given what we know now, no one should be under the illusion that this instrument is the solution to the replication crisis which psychological researchers are facing. At most, it is a tiny piece of a wider strategy to make Psychology what it has long claimed to be: a robust, evidence-based, scientific enterprise.


[Please do yourself a favour and read the comments below. You won’t get better people commenting than this.]

— — —

Chan AW, Krleza-Jerić K, Schmid I, & Altman DG (2004). Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ : Canadian Medical Association journal = journal de l’Association medicale canadienne, 171 (7), 735-40 PMID: 15451835

Chan, A., Hróbjartsson, A., Haahr, M., Gøtzsche, P., & Altman, D. (2004). Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials JAMA, 291 (20) DOI: 10.1001/jama.291.20.2457

Franco, A., Malhotra, N., & Simonovits, G. (2015). Underreporting in Psychology Experiments: Evidence From a Study Registry Social Psychological and Personality Science DOI: 10.1177/1948550615598377

Lindsay, D. (2015). Replication in Psychological Science Psychological Science DOI: 10.1177/0956797615616374

Mathieu, S., Boutron, I., Moher, D., Altman, D.G., & Ravaud, P. (2009). Comparison of Registered and Published Primary Outcomes in Randomized Controlled Trials JAMA, 302 (9) DOI: 10.1001/jama.2009.1242

Mathieu, S., Chan, A., & Ravaud, P. (2013). Use of Trial Register Information during the Peer Review Process PLoS ONE, 8 (4) DOI: 10.1371/journal.pone.0059910

Nosek, B., & Bar-Anan, Y. (2012). Scientific Utopia: I. Opening Scientific Communication Psychological Inquiry, 23 (3), 217-243 DOI: 10.1080/1047840X.2012.692215

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443
— — —

Are internal replications the solution to the replication crisis in Psychology? No.

Most Psychology findings are not replicable. What can be done? Stanford psychologist Michael Frank has an idea: cumulative study sets with internal replication. ‘If I had to advocate for a single change to practice, this would be it.’ I took a look at whether this makes any difference.

A recent paper in the journal Science has tried to replicate 97 statistically significant effects (Open Science Collaboration, 2015). In only 35 cases was this successful. Most findings were suddenly a lot weaker upon replication. This has led to a lot of soul searching among psychologists. Fortunately, the authors of the Science paper have made their data freely available. So, soul searching can be accompanied by trying out different ideas for improvements.

What can be done to solve Psychology’s replication crisis?

One idea to improve the situation is to demand study authors to replicate their own experiments in the same paper. Stanford psychologist Michael Frank writes:

If I had to advocate for a single change to practice, this would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. […] If you show me a one-off study and I fail to replicate it in my lab, I will tend to suspect that you got lucky or p-hacked your way to a result. But if you show me a package of studies with four internal replications of an effect, I will believe that you know how to get that effect – and if I don’t get it, I’ll think that I’m doing something wrong.

If this argument were true, then the 41 studies which were successfully conceptually replicated in their own paper should show higher rates of replication than the 56 studies which were not. Of the 41 internally replicated studies, 19 were replicated once, 10 twice, 8 thrice, and 4 more than three times. I will treat all of these as equally internally replicated.

Are internal replications the solution? No.


So, do the data from the reproducibility project show a difference? I made so-called violin plots; thicker parts represent more data points. In the left plot you see the reduction in effect sizes from a bigger original effect to a smaller replicated effect. The reduction associated with internally replicated effects (left) and effects which were only reported once in a paper (right) is more or less the same. In the right plot you can see the p-value of the replication attempt. The dotted line represents the arbitrary 0.05 threshold used to determine statistical significance. Again, replicators appear to have had as hard a task with effects that were found more than once in a paper as with effects which were only found once.

If you do not know how to read these plots, don’t worry. Just focus on this key comparison. 29% of internally replicated effects could also be replicated by an independent team (1 effect was below p = .055 and is not counted here). The equivalent number for not internally replicated effects is 41%. A contingency table Bayes factor test (Gunel & Dickey, 1974) shows that the null hypothesis of no difference is 1.97 times more likely than the alternative. In other words, the 12%-point replication advantage for non-replicated effects does not provide convincing evidence for an unexpected reversed replication advantage. The 12%-point difference is not due to statistical power: power was 92% on average for both internally replicated and not internally replicated studies. So, the picture doesn’t support internal replications at all. They are hardly the solution to Psychology’s replication problem according to this data set.
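
For the statistically curious, here is a minimal R sketch of such a contingency-table Bayes factor test using the BayesFactor package, which implements the Gunel & Dickey (1974) priors. The cell counts below are reconstructed from the rounded percentages in the text and the sampling model is my assumption, so the result will approximate rather than exactly reproduce the 1.97 reported above:

library(BayesFactor)

# rows: internally replicated vs. not; columns: independently replicated vs. not
counts <- matrix(c(12, 41 - 12,    # roughly 29% of 41 internally replicated effects (reconstructed counts)
                   23, 56 - 23),   # roughly 41% of 56 not internally replicated effects (reconstructed counts)
                 nrow = 2, byrow = TRUE)

bf10 <- contingencyTableBF(counts, sampleType = "indepMulti", fixedMargin = "rows")
1 / extractBF(bf10)$bf  # Bayes factor in favour of the null hypothesis of no difference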

The problem with internal replications

I believe that internal replications do not prevent many questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give you just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams were unsuccessful in replication attempts of Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).

What can be done?

In my previous post I have shown evidence for questionable research practices in this data set. These lead to less replicable results. Pre-registering studies makes questionable research practices a lot harder and science more reproducible. It would be interesting to see data on whether this hunch is true.

[update 7/9/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to take into account the different denominators for replicated and unreplicated effects. Lee Jussim pointed me to this.]

[update 24/10/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to provide correct numbers, Bayesian analysis and power comparison.]

— — —
Bem DJ (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of personality and social psychology, 100 (3), 407-25 PMID: 21280961

Galak, J., LeBoeuf, R., Nelson, L., & Simmons, J. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103 (6), 933-948 DOI: 10.1037/a0029709

Gunel, E., & Dickey, J. (1974). Bayes Factors for Independence in Contingency Tables. Biometrika, 61(3), 545–557.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Ritchie SJ, Wiseman R, & French CC (2012). Failing the future: three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PloS one, 7 (3) PMID: 22432019

Schimmack U (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17 (4), 551-66 PMID: 22924598
— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between internal replication and independent reproducibility of an effect

#Richard Kunert for Brain's Idea 5/9/2015

# a lot of code was taken from the reproducibility project code here

# installing/loading the packages (ggplot2 for plotting, plyr for ldply, dplyr for filter);
# the helper functions get.OSFfile(), gg.theme() and vioQtile() come from the
# Reproducibility Project's C-3PR toolbox, which needs to be sourced first (see the link mentioned above)
library(ggplot2)
library(plyr)
library(dplyr)

#loading the data (the OSF file code was left blank in the original post)
RPPdata <- get.OSFfile(code='',dfCln=T)$df
# keep studies with usable original/replication p-values and effect sizes
# (the column names in this filter call are a plausible reconstruction of the garbled original)
RPPdata <- dplyr::filter(RPPdata, !$T.pval.USE.O), !$T.pval.USE.R),
                         complete.cases(RPPdata$T.r.O, RPPdata$T.r.R))#97 studies with significant effects

#prepare IDs for internally replicated effects and non-internally replicated effects
idIntRepl <- RPPdata$Successful.conceptual.replications.O > 0
idNotIntRepl <- RPPdata$Successful.conceptual.replications.O == 0

# Get ggplot2 themes predefined in C-3PR
mytheme <- gg.theme("clean")

#restructure data in data frame
dat <- data.frame(EffectSizeDifference = as.numeric(c(RPPdata$T.r.R[idIntRepl] - RPPdata$T.r.O[idIntRepl],
                                                      RPPdata$T.r.R[idNotIntRepl] - RPPdata$T.r.O[idNotIntRepl])),
                  ReplicationPValue = as.numeric(c(RPPdata$T.pval.USE.R[idIntRepl],
                                                   RPPdata$T.pval.USE.R[idNotIntRepl])), # second vector reconstructed from context
                  grp = factor(c(rep("Internally Replicated Studies", times = sum(idIntRepl)),
                                 rep("Internally Unreplicated Studies", times = sum(idNotIntRepl)))))

# Create some variables for plotting
dat$grp <- as.numeric(dat$grp)
probs   <- seq(0,1,.25)

# VQP PANEL A: reduction in effect size -------------------------------------------------

# Get effect size difference quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$EffectSizeDifference[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$EffectSizeDifference[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$EffectSizeDifference[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violin plot using package ggplot2 <- ggplot(dat, aes(x = grp, y = EffectSizeDifference)) +
  geom_violin(aes(group = grp), scale = "width", color = "grey30", fill = "grey30", trim = T, adjust = .7)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(,qtiles,probs)
# Garnish (what does this word mean???)
g.es1 <- g.es0 +
  ggtitle("Effect size reduction") + xlab("") + ylab("Replicated - Original Effect Size") + 
  xlim("Internally Replicated", "Not Internally Replicated") +
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.es1

# VQP PANEL B: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$ReplicationPValue[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$ReplicationPValue[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$ReplicationPValue[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.pv <- ggplot(dat,aes(x=grp,y=ReplicationPValue)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish (I still don't know what this word means!)
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("Independent replication p-value") + xlab("") + ylab("Independent replication p-value") + 
  xlim("Internally Replicated", "Not Internally Replicated")+
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.pv1

#put two plots together (the combining call was missing in the original post; gridExtra is one way to do it)
library(gridExtra)
grid.arrange(g.es1, g.pv1, ncol = 2)

Why are Psychological findings mostly unreplicable?

Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.

Sometimes the title of a paper just sounds incredible. Estimating the reproducibility of psychological science. No one had ever systematically, empirically investigated this for any science. Doing so would require huge resources. The countless authors on this paper which appeared in Science last week went to great lengths to try anyway and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e. a nominal 92% chance of finding the effect should it exist as claimed by the original discoverers), 89 statistically significant effects should pop up. Only 35 did. Why weren’t 54 more studies replicated?
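
To make the expectation explicit, here is a quick R sketch of the arithmetic (my own illustration, not part of the paper’s analysis):

power      <- 0.92
n_attempts <- 97
round(power * n_attempts)      # expected successful replications if all original effects were real: 89
35 / n_attempts                # observed replication rate: roughly 0.36
pbinom(35, n_attempts, power)  # chance of 35 or fewer successes given 92% power: practically zero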

The team behind this article also produced 95% Confidence Intervals of the replication study effect sizes. Despite their name, only 83% of them should contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes much smaller in the replication?

One reason for poor replication: sampling until significant

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a p-value below the arbitrary threshold of 0.05. The consequence is this:


Focus on the left panel first. The green replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected as the replicators wanted to be sure that they had sufficient statistical power to find their effects. Expecting small effects (lower on vertical axis) makes you plan in more participants (further right on horizontal axis). The replicators simply sampled their pre-determined number, and then analysed the data. Apparently, such a practice leads to a moderate correlation between measured effect size and sample size because what the measured effect size will be is uncertain when you start sampling.

The red original studies show a stronger relation between the effect size they found and their sample size. They must have done more than just smart a priori power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size measured in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the expected effect size estimated before the experiment, hence the weaker association with the actually measured effect size.

The right panel shows a Bayesian correlation analysis of the data. What you are looking at is the belief in the strength of the correlation, called the posterior distribution. The overlap of the distributions can be used as a measure of believing that the correlations are not different. The overlap is less than 7%. If you are more inclined to believe in frequentist statistics, the associated p-value is .001 (Pearson and Filon’s z = 3.355). Therefore, there is strong evidence that original studies display a stronger negative correlation between sample size and measured effect size than replication studies.

The approach which – I believe – has been followed by the original research teams should be accompanied by adjustments of the p-value (see Lakens, 2014 for how to do this). If not, you misrepresent your stats and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). It is estimated that 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in Psychology are far lower than what they should be.

So, one approach to boosting replication rates might be to do what we claim to do anyway and what the replication studies have actually done: acquiring data first, analysing it second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

[24/10/2015: Added Bayesian analysis and changed figure. Code below is from old figure.]

[27/11/2015: Adjusted percentage overlap of posterior distributions.]

— — —
John LK, Loewenstein G, & Prelec D (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23 (5), 524-32 PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses European Journal of Social Psychology, 44 (7), 701-710 DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between sample size and effect size from data provided by the reproducibility project

#Richard Kunert for Brain's Idea 3/9/2015
#load necessary libraries
library(httr)    # GET, write_disk
library(Hmisc)   # rcorr
library(cocor)   # comparing correlations
library(ggplot2) # plotting

#get raw data from OSF website (the URL was left blank in the original post)
info <- GET('', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file

#restrict studies to those with appropriate data
studies <- MASTER$ID[!$T_r..O.) & !$T_r..R.)] ##to keep track of which studies are which
studies <- studies[-31] ##remove one problem study with absurdly high sample size (N = 230,047)

#set font size for plotting
theme_set(theme_gray(base_size = 30))

#prepare correlation coefficients
dat_rank <- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
                       sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
                       effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
                       effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm = rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O, type = "spearman")#yes, I know the type specification is superfluous
corr_R_Spearm = rcorr(dat_rank$effect_size_R, dat_rank$sample_size_R, type = "spearman")

#compare Spearman correlation coefficients using cocor (data needs to be ranked in order to produce Spearman correlations!)
htest = cocor(formula=~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
data = dat_rank, return.htest = FALSE)

#prepare data frame
dat_vis <- data.frame(study = rep(c("Original", "Replication"), each=length(studies)),
                      sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]), cbind(MASTER$T_N_R_for_tables[studies])),
                      effect_size = rbind(cbind(MASTER$T_r..O.[studies]), cbind(MASTER$T_r..R.[studies])))

#The plotting call
ggplot(data=dat_vis, aes(x=sample_size, y=effect_size, group=study)) +#the basic scatter plot
geom_point(aes(color=study),shape=1,size=4) +#specify marker size and shape
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm,   # Add linear regression lines
se=FALSE,    # Don't add shaded confidence region
aes(color=study))+#colour lines according to data points for consistency
geom_text(aes(x=750, y=0.46,
label=sprintf("Spearman rho = %1.3f (p = %1.3f)",
corr_O_Spearm$r[1,2], corr_O_Spearm$P[1,2]),
color="Original", hjust=0)) +#add text about Spearman correlation coefficient of original studies
guides(color = guide_legend(title=NULL)) + #avoid additional legend entry for text
geom_text(aes(x=750, y=0.2,
label=sprintf("Spearman rho = %1.3f (p = %1.3f)",
corr_R_Spearm$r[1,2], corr_R_Spearm$P[1,2]),
color="Replication", hjust=0))+#add text about Spearman correlation coefficient of replication studies
geom_text(x=1500, y=0.33,
label=sprintf("Difference: Pearson &amp; Filon z = %1.3f (p = %1.3f)",
htest@pearson1898$statistic, htest@pearson1898$p.value),
color="black", hjust=0)+#add text about testing difference between correlation coefficients
guides(color = guide_legend(title=NULL))+#avoid additional legend entry for text
ggtitle("Sampling until significant versus a priori power analysis")+#add figure title
labs(x="Sample Size", y="Effect size r")#add axis titles

Do music and language share brain resources?

When you listen to some music and when you read a book, does your brain use the same resources? This question goes to the heart of how the brain is organised – does it make a difference between cognitive domains like music and language? In a new commentary I highlight a successful approach which helps to answer this question.

On some isolated island in academia, the tree of knowledge has the form of a brain.

How do we read? What is the brain doing in this picture?

When reading the following sentence, check carefully when you are surprised at what you are reading:

After | the trial | the attorney | advised | the defendant | was | likely | to commit | more crimes.

I bet it was on the segment was. You probably thought that the defendant was advised, rather than that someone else was advised about the defendant. Once you read the word was you need to reinterpret what you have just read. In 2009 Bob Slevc and colleagues found out that background music can change your reading of this kind of sentence. If you hear a chord which is harmonically unexpected, you have even more trouble with the reinterpretation of the sentence on reading was.

Why does music influence language?

Why would an unexpected chord be problematic for reading surprising sentences? The most straightforward explanation is that unexpected chords are odd. So they draw your attention. To test this simple explanation, Slevc tried out an unexpected instrument playing the chord in a harmonically expected way. No effect on reading. Apparently, not just any odd chord changes your reading. The musical oddity has to stem from the harmony of the chord. Why this is the case is a matter of debate between scientists. What this experiment makes clear, though, is that music can influence language via shared resources which have something to do with harmony processing.

Why ignore the fact that music influences language?

None of this was mentioned in a recent review by Isabelle Peretz and colleagues on this topic. They looked at where in the brain music and language show activations, as revealed in MRI brain scanners. This is just one way to find out whether music and language share brain resources. They concluded that ‘the question of overlap between music and speech processing must still be considered as an open question’. Peretz and colleagues call for ‘converging evidence from several methodologies’ but fail to mention the evidence from non-MRI methodologies.1

Sure, one has to focus on something, but it annoys me that people tend to focus on methods (especially fancy expensive methods like MRI scanners), rather than answers (especially answers from elegant but cheap research into human behaviour like reading). So I decided to write a commentary together with Bob Slevc. We list no less than ten studies which used a similar approach to the one outlined above. Why ignore these results?

If only Peretz and colleagues had truly looked at ‘converging evidence from several methodologies’, they would have asked themselves why music sometimes influences language and why it sometimes does not. The debate is in full swing and already beyond the previous question of whether music and language share brain resources. Instead, researchers ask what kind of resources are shared.

So, yes, music and language appear to share some brain resources. Perhaps this is not easily visible in MRI brain scanners. Looking at how people read with chord sequences played in the background is how one can show this.

— — —
Kunert, R., & Slevc, L.R. (2015). A commentary on “Neural overlap in processing music and speech” (Peretz et al., 2015) Frontiers in Human Neuroscience : doi: 10.3389/fnhum.2015.00330

Peretz I, Vuvan D, Lagrois MÉ, & Armony JL (2015). Neural overlap in processing music and speech. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 370 (1664) PMID: 25646513

Slevc LR, Rosenberg JC, & Patel AD (2009). Making psycholinguistics musical: self-paced reading time evidence for shared processing of linguistic and musical syntax. Psychonomic bulletin & review, 16 (2), 374-81 PMID: 19293110
— — —

1 Except for one ECoG study.

DISCLAIMER: The views expressed in this blog post are not necessarily shared by Bob Slevc.

Why does humanity get smarter and smarter?

Intelligence tests have to be adjusted all the time because people score higher and higher. If the average human of today went 105 years back in time, s/he would score 130, be considered as gifted, and join clubs for highly intelligent people. How can that be?

The IQ growth

The picture above shows the development of humanity’s intelligence between 1909 and 2013. According to IQ-scores people got smarter and smarter. During the last 105 years, people’s scores increased by as much as 30 IQ-points. That is equivalent to the difference between intellectual disability and normal intelligence. Ever since the discovery of this effect by James Flynn, the underlying reason has been hotly debated. A new analysis combines all available studies into one overall picture in order to find answers.

Jakob Pietschnig and Martin Voracek included all available data pertaining to IQ increases from one generation to another: nearly 4 million test takers in 105 years. They found that IQ scores sometimes increased faster and sometimes more slowly. Check the difference between the 1920s and WWII in the figure above. Moreover, different aspects of intelligence change at different speeds. So-called crystallized intelligence (knowledge about facts) increased only at a rate of 0.2 points per year. So-called fluid intelligence (abstract problem solving), on the other hand, increased much faster at 0.4 points per year.

Five reasons for IQ growth

Five reasons appear to come together to explain this phenomenon:

1) better schooling: IQ growth is stronger in adults than in children, probably because adults stay longer and longer in school.

2) more experience with multiple choice tests: since the 1990s the multiple choice format has become common in schools and universities. Modern test takers are no longer put off by this way of asking questions in IQ tests and might resort to smart guessing.

3) less malnutrition: the slow IQ growth during the world wars might have something to do with a lack of nutrients and energy which the brain needs

4) better health care: the less sick you are, the more your brain can develop optimally

5) less lead poisoning: since the 1970s lead was phased out in paint and gasoline, removing an obstacle for healthy neural development

 Am I really smarter than my father?

According to the Flynn effect, my generation is 8 IQ-points smarter than that of my parents. But this only relates to performance on IQ tests. I somehow doubt that more practical, less abstract, areas show the same effect. Perhaps practical intelligence is just more difficult to measure. It is possible that we have not really become more intelligent thinkers but instead more abstract thinkers.

— — —
Pietschnig J, & Voracek M (2015). One Century of Global IQ Gains: A Formal Meta-Analysis of the Flynn Effect (1909-2013). Perspectives on psychological science : a journal of the Association for Psychological Science, 10 (3), 282-306 PMID: 25987509

— — —

Figure: self made, based on data in Figure 1 in Pietschnig & Voracek (2015, p. 285)

The scientific community’s Galileo affair (you’re the Pope)

Science is in crisis. Everyone in the scientific community knows about it but few want to talk about it. The crisis is one of honesty. A junior scientist (like me) asks himself a similar question to Galileo in 1633: how much honesty is desirable in science?

Galileo versus Pope: guess what role the modern scientist plays.

Science Wonderland

According to nearly all empirical scientific publications that I have read, scientists allegedly work like this:

[Figure: Introduction → Methods → Results → Discussion]

Scientists call this ‘the story’ of the paper. This ‘story framework’ is so entrenched in science that the vast majority of scientific publications are required to be organised according to its structure: 1) Introduction, 2) Methods, 3) Results, 4) Discussion. My own publication is no exception.

Science Reality

However, virtually all scientists know that ‘the story’ is not really true. It is merely an ideal-case-scenario. Usually, the process looks more like this:

[Figure: the same process with added red arrows, i.e. questionable research practices]

Scientists call some of the added red arrows questionable research practices (or QRP for short). The red arrows stand for (going from left to right, top to bottom):

1) adjusting the hypothesis based on the experimental set-up. This is particularly true when a) working with an old data-set, b) the set-up is accidentally different from the intended one, etc.

2) changing design details (e.g., how many participants, how many conditions to include, how many/which measures of interest to focus on) depending on the results these changes produce.

3) analysing until results are easy to interpret.

4) analysing until results are statistically desirable (‘significant results’), i.e. so-called p-hacking.

5) hypothesising after results are known (so-called HARKing).

The outcome is a collection of blatantly unrealistic ‘stories’ in scientific publications. Compare this to the more realistic literature on clinical trials for new drugs. More than half the drugs fail the trial (Goodman, 2014). In contrast, nearly all ‘stories’ in the wider scientific literature are success stories. How?

Joseph Simmons and colleagues (2011) give an illustration of how to produce spurious successes. They simulated the situation of researchers engaging in the second point above (changing design details based on results). Let’s assume that the hypothesised effect is not real. How many experiments will erroneously find an effect at the conventional 5% significance criterion? Well, 5% of experiments should (scientists have agreed that this number is low enough to be acceptable). However, thanks to the questionable research practices outlined above, this number can be boosted. For example, sampling participants until the result is statistically desirable leads to up to 22% of experiments reporting a ‘significant result’ even though there is no effect to be found. It is estimated that 70% of US psychologists have done this (John et al., 2012). When such a practice is combined with other, similar design changes, up to 61% of experiments falsely report a significant effect. Why do we do this?
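
To see how this works, here is a small simulation sketch in the spirit of Simmons and colleagues (the exact stopping rule below, starting with 10 participants per group, testing after every additional pair and stopping at p < .05 or at 50 per group, is my own choice rather than their procedure):

# optional stopping with no true effect: the false-positive rate climbs far above the nominal 5%
set.seed(1)

sample_until_significant <- function(n_start = 10, n_max = 50, alpha = .05) {
  a <- rnorm(n_start); b <- rnorm(n_start)          # two groups, no true difference
  while (length(a) < n_max) {
    if (t.test(a, b)$p.value < alpha) return(TRUE)  # 'significant': stop sampling and report
    a <- c(a, rnorm(1)); b <- c(b, rnorm(1))        # otherwise add one participant per group
  }
  t.test(a, b)$p.value < alpha                      # final test at the maximum sample size
}

mean(replicate(2000, sample_until_significant()))   # proportion of 'significant' experiments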

The Pope of 1633 is back

If we know that the scientific literature is unrealistic, why don’t we just drop the pretense and tell it as it is? The reason is simple: because you like the scientific wonderland of success stories. If you are a scientist reader, you like to base the evaluation of scientific manuscripts on the ‘elegance’ (simplicity, straightforwardness) of the text. This leaves no room for telling you what really happened. You also like to base the evaluation of your colleagues on the quantity and the ‘impact’ of their scientific output. QRPs are essentially a career requirement in such a system. If you are a lay reader, you like the research you fund (via tax money) to be sexy, easy and simple. Scientific data are as messy as the real world but the reported results are not. They are meant to be easily digestible (‘elegant’) insights.

In 1633 it did not matter much whether Galileo admitted to the heliocentric world view which was deemed blasphemous. The idea was out there to conquer the minds of the renaissance world. Today’s Galileo moment is also marked by an inability to admit to scientific facts (i.e. the so-called ‘preliminary’ research results which scientists obtain before applying questionable research practices). But this time the role of the Pope is played both by political leaders/ the lay public and scientists themselves. Actual scientific insights get lost before they can see the light of day.

There is a great movement to remedy this situation, including pressure to share data (e.g., at PLoS ONE), replication initiatives (e.g., RRR1, reproducibility project), the opportunity to pre-register experiments etc. However, these remedies only focus on scientific practice, as if Galileo was at fault and the concept of blasphemy was fine. Maybe we should start looking into how we got into this mess in the first place. Focus on the Pope.

— — —
John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices with Incentives for Truth-Telling SSRN Electronic Journal DOI: 10.2139/ssrn.1996631

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

Picture: Joseph Nicolas Robert-Fleury [Public domain], via Wikimedia Commons

PS: This post necessarily reflects the field of science that I am familiar with (Psychology, Cognitive Science, Neuroscience). The situation may well be different in other scientific fields.



The real reason why new pop music is so incredibly bad

You have probably heard that Pink Floyd recently published their new album Endless River. Will this bring back the wonderful world of good music after the endless awfulness of the popular music scene in the last 20 years or so? Is good music, as we know it from the 60s and 70s, back for good? The reasons behind the alleged endless awfulness of pop music these days suggest otherwise. We shouldn’t be throwing stones at new music but instead at our inability to like it.

Pink Floyd 1973

When we were young we learned to appreciate Pink Floyd.

Daniel Levitin was asked at a recent music psychology conference in Toronto why old music is amazing and new music is awful. He believed that modern record companies are there to make money. In the olden days, on the other hand, they were there to make music and ready to hold on to musicians which needed time to become successful. More interestingly, he reminded the public that many modern kidz would totally disagree with the implication that modern music is awful. How can it be that new music is liked by young people if so much of it is often regarded as quite bad?

Everything changes for the better after a few repetitions

The answer to the mystery has nothing to do with flaws in modern music but instead with our brain. When adults hear new music they often hate it at first. After repeated listening they tend to find it more and more beautiful. For example, Marcia Johnson and colleagues (1985) played Korean melodies to American participants and found that hearing a new melody led to low liking ratings, a melody heard once before to higher ratings and even more exposure to higher than higher ratings. Even Korsakoff patients – who could hardly remember having heard individual melodies before – showed this effect, i.e. without them realising it they probably never forget melodies.

This so-called mere exposure effect is all that matters to me: a robust, medium-strong, generally applicable, evolutionarily plausible effect (Bornstein, 1989). You can do what you like, it applies to all sorts of stimuli. However, there is one interesting exception here. Young people do not show the mere exposure effect, no relationship between ‘repeat the stimulus’ and ‘give good feeling’ (Bornstein, 1989). As a result, adults need a lot more patience before they like a new song as much as young people do. No wonder adults are only satisfied with the songs they already know from their youth in the 60s and 70s. Probably, when looking at the music scene in 2050 the current generation will equally hate it and wish the Spice Girls back (notice the gradual rise of 90’s parties already).

I listened to it –> I like it

So, when it comes to an allegedly awful present and great past, ask yourself: how deep is your love for the old music itself rather than its repeated listening? Listen repeatedly to any of a million love songs and you will end up appreciating it. Personally, I give new music a chance and sometimes it manages to relight my fire. Concerning Endless River, if it’s not love at first sight, do not worry. The new Pink Floyd album sure is good (depending on how many times you listen to it).

— — —
Bornstein, R. (1989). Exposure and affect: Overview and meta-analysis of research, 1968-1987. Psychological Bulletin, 106 (2), 265-289 DOI: 10.1037/0033-2909.106.2.265

Johnson MK, Kim JK, & Risse G (1985). Do alcoholic Korsakoff’s syndrome patients acquire affective reactions? Journal of experimental psychology. Learning, memory, and cognition, 11 (1), 22-36 PMID: 3156951
— — —

Figure: By PinkFloyd1973.jpg: TimDuncan derivative work: Mr. Frank (PinkFloyd1973.jpg) [CC-BY-3.0], via Wikimedia Commons

— — —

PS: Yes, I did hide 29 Take That song titles in this blog post. Be careful, you might like 90’s pop music a little bit more due to this exposure.






Dyslexia: trouble reading ‘four’

Dyslexia affects about every tenth reader. It shows up when trying to read, especially when reading fast. But it is still not fully clear what words dyslexic readers find particularly hard. So, I did some research to find out, and I published the article today.

Carl Spitzweg: the bookworm

The bookworm (presumably non-dyslexic)

Imagine seeing a new word ‘bour’. How would you pronounce it? Similar to ‘four’, similar to ‘flour’ or similar to ‘tour’? It is impossible to know. Therefore, words such as ‘four’, ‘flour’ and ‘tour’ are said to be inconsistent – one doesn’t know how to pronounce them when encountering them for the very first time. Given this pronunciation challenge, I, together with my co-author Christoph Scheepers, hypothesised that such words would be more difficult for readers generally, and for dyslexic readers especially.

Finding evidence for a dyslexia-specific problem is challenging because dyslexic participants tend to be slower than non-dyslexic people in most tasks that they do. So, if you force them to be as quick as typical readers they will seem to be bad readers even though they might be merely slow readers. Therefore, we adopted a new task that gave people a very long time to judge whether a bunch of letters is a word or not.

It turns out that inconsistent words like ‘four’ slow down both dyslexic and typical readers. But on top of that dyslexic readers never quite reach the same accuracy as typical readers with these words. It is as if the additional challenge these words pose can, with time, be surmounted in normal readers while dyslexic readers have trouble no matter how much time you give them. In other words, dyslexic people aren’t just slow. At least for some words they have trouble no matter how long they look at them.

This is my very first publication based on work I did more than four years ago. You should check out whether the waiting was worth it. The article is free to access here. I hope it will convince you that dyslexia is a real challenge to investigate. Still, the pay-off to fully understanding it is enormous: helping dyslexic readers cope in a literate society.

— — —
Kunert, R., & Scheepers, C. (2014). Speed and accuracy of dyslexic versus typical word recognition: an eye-movement investigation Frontiers in Psychology, 5 DOI: 10.3389/fpsyg.2014.01129
— — —

Picture: Carl Spitzweg [Public domain or Public domain], via Wikimedia Commons

Old people are immune against the cocktail party effect

Imagine standing at a cocktail party and somewhere your name gets mentioned. Your attention is immediately grabbed by the sound of your name. It is a classic psychological effect with a new twist: old people are immune.

Someone mention my name?

The so-called cocktail party effect has fascinated researchers for a long time. Even though you do not consciously listen to a conversation around you, your own name can grab your attention. That means that unbeknownst to you, you follow the conversations around you. You check them for salient information like your name, and if it occurs you quickly switch attention to where your name was mentioned.

The cocktail party simulated in the lab

In the lab this is investigated slightly differently. Participants attend to one ear and, for example, repeat whatever they hear there. Their name is embedded in what they hear coming into the other (unattended) ear. After the experiment one simply asks ‘Did you hear your own name?’ In a recent paper published by Moshe Naveh-Benjamin and colleagues (in press), around half of the young student participants noticed their name in such a set-up. Compare this to old people aged around 70: next to nobody (only six out of 76 participants) noticed their name being mentioned in the unattended ear.

Why this age difference? Do old people simply not hear well? Unlikely: when the name was played to the ear that they attended to, 45% of old people noticed their names. Clearly, many old people can hear their names, but they do not notice them if they are not paying attention to that ear. Young people do not show such a sharp distinction. Half the time they notice their names, even when concentrating on something else.

Focusing the little attention that is available

Naveh-Benjamin and colleagues instead suggest that old people simply have less attention. When they focus on a conversation, they give it their everything. Nothing is left for the kind of unconscious checking of conversations which young people can do so well.

At the next cocktail party you can safely gossip about your old boss. Just avoid mentioning the name of the young new colleague who just started.


— — —

Naveh-Benjamin M, Kilb A, Maddox GB, Thomas J, Fine HC, Chen T, & Cowan N (2014). Older adults do not notice their names: A new twist to a classic attention task. Journal of experimental psychology. Learning, memory, and cognition PMID: 24820668

— — —


By Financial Times (Patrón cocktail bar) [CC-BY-2.0], via Wikimedia Commons

Why are ethical standards higher in science than in business and media?

Facebook manipulates user content in the name of science? Scandalous! It manipulates user content in the name of profit? No worries! Want to run a Milgram study these days? Get bashed by your local ethics committee! Want to show it on TV? No worries. Why do projects which seek knowledge have higher ethical standards than projects which seek profit?

Over half a million people were this mouse.

Just as we were preparing to leave for our well-deserved summer holidays this year, research was shaken by the fall-out from a psychological study (Kramer et al., 2014) which manipulated Facebook content. Many scientists objected to the study’s failure to ask for ‘informed consent’, and I think they are right. However, many ordinary people objected to something else. Here’s how Alex Hern put it over at the Guardian:

At least when a multinational company, which knows everything about us and controls the very means of communication with our loved ones, acts to try and maximise its profit, it’s predictable. There’s something altogether unsettling about the possibility that Facebook experiments on its users out of little more than curiosity.

Notice the opposition between ‘maximise profit’, which is somehow thought to be okay, and ‘experimenting on users’, which is not. I genuinely do not understand this distinction. Suppose the study had never been published in PNAS but instead in the company’s report to shareholders (as a new means of emotionally enhancing advertisements), would there have been the same outcry? I doubt it. Why not?

Having issues with TV experimentation versus scientific experimentation?

Was the double standard around the Facebook study the exception? I do not think so. In the following YouTube clip you see the classic Milgram experiment re-created for British TV. The participants’ task is to follow the experimenter’s instructions to give electric shocks to another participant (who is actually an actor) for bad task performance. Electric shocks increase in strength until they are allegedly lethal. People are obviously distressed in this task.

Yesterday, the New Scientist called the classic Milgram experiment one of ‘the most unethical [experiments] ever carried out’. Why is this okay for TV? Now, imagine a hybrid case. Would it be okay if the behaviour shown on TV was scientifically analysed and published in a respectable journal? I guess that would somehow be fine. Why is it okay to run the study with a TV camera involved, but not when the TV camera is switched off? This is not a rhetorical question. I actually do not grasp the underlying principle.

Why is ‘experimenting on people’ bad?

In my experience, ethical guidelines are a real burden on researchers. And this is a good thing because society holds researchers to a high ethical standard. Practically all modern research on humans involves strong ethical safeguards. Compare this to business and media. I do not understand why projects seeking private gains (profit for shareholders) have a lower ethical standard than research. Surely the generation of public knowledge is more in the public interest than private profit-making or TV entertainment.

— — —

Kramer AD, Guillory JE, & Hancock JT (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111 (24), 8788-90 PMID: 24889601

Milgram, S. (1963). Behavioral Study of obedience The Journal of Abnormal and Social Psychology, 67 (4), 371-378 : doi: 10.1037/h0040525

— — —

picture: from