Richard Kunert

Are internal replications the solution to the replication crisis in Psychology? No.

Most Psychology findings are not replicable. What can be done? Stanford psychologist Michael Frank has an idea: cumulative study sets with internal replication. ‘If I had to advocate for a single change to practice, this would be it.’ I took a look at whether this makes any difference.

A recent paper in the journal Science tried to replicate 97 statistically significant effects (Open Science Collaboration, 2015). Only 35 of these replications were successful. Most findings were suddenly a lot weaker upon replication. This has led to a lot of soul searching among psychologists. Fortunately, the authors of the Science paper have made their data freely available, so the soul searching can be accompanied by trying out different ideas for improvement.

What can be done to solve Psychology’s replication crisis?

One idea to improve the situation is to demand that study authors replicate their own experiments in the same paper. Stanford psychologist Michael Frank writes:

If I had to advocate for a single change to practice, this would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. […] If you show me a one-off study and I fail to replicate it in my lab, I will tend to suspect that you got lucky or p-hacked your way to a result. But if you show me a package of studies with four internal replications of an effect, I will believe that you know how to get that effect – and if I don’t get it, I’ll think that I’m doing something wrong.
If this argument were true, then the 41 studies which were successfully conceptually replicated in their own paper should show higher rates of replication than the 56 studies which were not. Of the 41 internally replicated studies, 19 were replicated once, 10 twice, 8 thrice, and 4 more than three times. I will treat all of these as equally internally replicated.

Are internal replications the solution? No.


So, do the data from the reproducibility project show a difference? I made so-called violin plots; thicker parts represent more data points. In the left plot you see the reduction in effect sizes from a bigger original effect to a smaller replicated effect. The reduction associated with internally replicated effects (left) and effects which were only reported once in a paper (right) is more or less the same. In the right plot you can see the p-value of the replication attempt. The dotted line represents the arbitrary 0.05 threshold used to determine statistical significance. Again, replicators appear to have had as hard a task with effects that were found more than once in a paper as with effects which were only found once.

If you do not know how to read these plots, don’t worry. Just focus on this key comparison. Of the 41 internally replicated effects, 17 (41%) could also be replicated by an independent team (one further effect with a p-value just below .055 is not counted here). Of the 56 effects which were not internally replicated, the same number, 17 (30%), could be replicated independently. So, the picture doesn’t support internal replications strongly: internal replications provide, at best, only a small advantage to the independent reproducibility of effects. According to this data set, they are hardly the solution to Psychology’s replication problem.
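For the record, here is that comparison in R, using the counts from the paragraph above (the two-proportion test at the end is my own addition, not part of the original analysis):

# independently replicated effects among internally replicated vs. not internally replicated studies
replicated <- c(internal = 17, no_internal = 17)  # counts taken from the text above
total      <- c(internal = 41, no_internal = 56)
round(replicated / total, 2)  # 0.41 vs. 0.30
prop.test(replicated, total)  # the difference between the two proportions is unimpressive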

The problem with internal replications

I believe that internal replications do not prevent many questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give you just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams were unsuccessful in replication attempts of Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).

What can be done?

In my previous post I have shown evidence for questionable research practices in this data set. These lead to less replicable results. Pre-registering studies makes questionable research practices a lot harder and science more reproducible. It would be interesting to see data on whether this hunch is true.

[update 7/9/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to take into account the different denominators for replicated and unreplicated effects. Lee Jussim pointed me to this.]

— — —
Bem DJ (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of personality and social psychology, 100 (3), 407-25 PMID: 21280961

Galak, J., LeBoeuf, R., Nelson, L., & Simmons, J. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103 (6), 933-948 DOI: 10.1037/a0029709

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Ritchie SJ, Wiseman R, & French CC (2012). Failing the future: three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PloS one, 7 (3) PMID: 22432019

Schimmack U (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17 (4), 551-66 PMID: 22924598
— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between internal replication and independent reproducibility of an effect

#Richard Kunert for Brain's Idea 5/9/2015

# a lot of code was taken from the reproducibility project code here

# installing/loading the packages:
# (get.OSFfile(), gg.theme() and vioQtile() are helper functions from the Reproducibility
#  Project's C-3PR code referenced above; source that script before running)
library(plyr)      # ldply()
library(dplyr)     # filter()
library(ggplot2)   # violin plots
library(gridExtra) # grid.arrange(), one way to combine the two panels at the end

#loading the data
RPPdata <- get.OSFfile(code='',dfCln=T)$df
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O), !is.na(T.pval.USE.R),
                         complete.cases(RPPdata$T.r.O, RPPdata$T.r.R))#97 studies with significant effects

#prepare IDs for internally replicated effects and non-internally replicated effects
idIntRepl <- RPPdata$Successful.conceptual.replications.O > 0
idNotIntRepl <- RPPdata$Successful.conceptual.replications.O == 0

# Get ggplot2 themes predefined in C-3PR
mytheme <- gg.theme("clean")

#restructure data in data frame
dat <- data.frame(EffectSizeDifference = as.numeric(c(RPPdata$T.r.R[idIntRepl] - RPPdata$T.r.O[idIntRepl],
                                                      RPPdata$T.r.R[idNotIntRepl] - RPPdata$T.r.O[idNotIntRepl])),
                  ReplicationPValue = as.numeric(c(RPPdata$T.pval.USE.R[idIntRepl],
                                                   RPPdata$T.pval.USE.R[idNotIntRepl])),
                  grp = factor(c(rep("Internally Replicated Studies", times = sum(idIntRepl)),
                                 rep("Internally Unreplicated Studies", times = sum(idNotIntRepl)))))

# Create some variables for plotting
dat$grp <- as.numeric(dat$grp)
probs   <- seq(0,1,.25)

# VQP PANEL A: reduction in effect size -------------------------------------------------

# Get effect size difference quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$EffectSizeDifference[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$EffectSizeDifference[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$EffectSizeDifference[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.es <- ggplot(dat,aes(x=grp,y=EffectSizeDifference)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es,qtiles,probs)
# Garnish (what does this word mean???)
g.es1 <- g.es0 +
  ggtitle("Effect size reduction") + xlab("") + ylab("Replicated - Original Effect Size") + 
  xlim("Internally Replicated", "Not Internally Replicated") +
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.es1

# VQP PANEL B: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$ReplicationPValue[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$ReplicationPValue[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$ReplicationPValue[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.pv <- ggplot(dat,aes(x=grp,y=ReplicationPValue)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish (I still don't know what this word means!)
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("Independent replication p-value") + xlab("") + ylab("Independent replication p-value") + 
  xlim("Internally Replicated", "Not Internally Replicated")+
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.pv1

#put two plots together (gridExtra)
grid.arrange(g.es1, g.pv1, ncol = 2)

Why are Psychological findings mostly unreplicable?

Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.

Sometimes the title of a paper just sounds incredible. Estimating the reproducibility of psychological science. No one had ever systematically, empirically investigated this for any science. Doing so would require huge resources. The countless authors of this paper, which appeared in Science last week, went to great lengths to try anyway, and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e. a nominal 92% chance of finding each effect should it exist as claimed by the original discoverers), about 89 statistically significant effects should have popped up. Only 35 did. Why weren’t 54 more studies replicated?
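The expected and observed numbers in two lines of R (simple arithmetic, not the paper’s analysis code):

0.92 * 97  # replications expected to reach significance at 92% power: about 89
35 / 97    # proportion that actually reached significance: about 36%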

The team behind this article also produced 95% Confidence Intervals of the replication study effect sizes. Despite their name, only 83% of them should contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes much smaller in the replication?

One reason for poor replication: sampling until significant

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a p-value below the arbitrary threshold of 0.05. The consequence is this:


The blue replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected, as the replicators wanted to be sure that they had sufficient statistical power to find their effects. Expecting small effects (lower on the vertical axis) makes you plan for more participants (further right on the horizontal axis). The replicators simply sampled their pre-determined number and then analysed the data. Such a practice leads to only a moderate correlation between measured effect size and sample size, because the effect size that will eventually be measured is still uncertain when the sample size is fixed.

The red original studies show a significantly stronger relation between the effect size they found and their sample size. They must have done more than just smart a priori power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size measured in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the expected effect size estimated before the experiment, hence the weaker association with the actually measured effect size.
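To make this mechanism concrete, here is a toy simulation in R. It is my own illustration with made-up parameters, not the original analysis: one set of simulated experiments keeps adding participants until a one-sided test turns significant, the other fixes its sample size before seeing any data. Only the former produces a strong link between the final sample size and the measured effect size, even though there is no true effect in either case.

set.seed(1)  # everything below uses base R only

# sample in batches until the effect is significant in the predicted direction (or N runs out)
sample_until_significant <- function(n_start = 10, n_step = 5, n_max = 100) {
  x <- rnorm(n_start)  # no true effect: population mean is zero
  while (t.test(x, alternative = "greater")$p.value >= .05 && length(x) < n_max) {
    x <- c(x, rnorm(n_step))  # add a few more participants and test again
  }
  c(n = length(x), d = mean(x) / sd(x))  # final sample size and measured effect size
}

# decide on the sample size before the experiment and stick to it
fixed_n <- function(n_min = 10, n_max = 100) {
  n <- sample(n_min:n_max, 1)
  x <- rnorm(n)
  c(n = n, d = mean(x) / sd(x))
}

flexible <- t(replicate(2000, sample_until_significant()))
planned  <- t(replicate(2000, fixed_n()))

cor(flexible[, "n"], flexible[, "d"], method = "spearman")  # strongly negative
cor(planned[, "n"],  planned[, "d"],  method = "spearman")  # close to zero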

The approach which – I believe – has been followed by the original research teams should be accompanied by adjustments of the p-value (see Lakens, 2014 for how to do this). If not, you misrepresent your stats and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). It is estimated that 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in Psychology are far lower than what they should be.

So, one approach to boosting replication rates might be to do what we claim to do anyway and what the replication studies actually did: acquire the data first, analyse them second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

— — —
John LK, Loewenstein G, & Prelec D (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23 (5), 524-32 PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses European Journal of Social Psychology, 44 (7), 701-710 DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between sample size and effect size from data provided by the reproducibility project

#Richard Kunert for Brain's Idea 3/9/2015
#load necessary libraries
library(httr)    # GET(), write_disk()
library(Hmisc)   # rcorr()
library(cocor)   # cocor()
library(ggplot2) # plotting

#get raw data from OSF website
info <- GET('', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file

#restrict studies to those with appropriate data
studies<-MASTER$ID[!is.na(MASTER$T_r..O.) & !is.na(MASTER$T_r..R.)]##to keep track of which studies are which
studies<-studies[-31]##remove one problem study with absurdly high sample size (N = 230,047)

#set font size for plotting
theme_set(theme_gray(base_size = 30))

#prepare correlation coefficients
dat_rank <- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
                  sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
                  effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
                  effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm = rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O, type = "spearman")#yes, I know the type specification is superfluous
corr_R_Spearm = rcorr(dat_rank$effect_size_R, dat_rank$sample_size_R, type = "spearman")

#compare Spearman correlation coefficients using cocor (data needs to be ranked in order to produce Spearman correlations!)
htest = cocor(formula=~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
      data = dat_rank, return.htest = FALSE)

#prepare data frame
dat_vis <- data.frame(study = rep(c("Original", "Replication"), each=length(studies)),
                  sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]), cbind(MASTER$T_N_R_for_tables[studies])),
                  effect_size = rbind(cbind(MASTER$T_r..O.[studies]), cbind(MASTER$T_r..R.[studies])))

#The plotting call
ggplot(data=dat_vis, aes(x=sample_size, y=effect_size, group=study)) +#the basic scatter plot
  geom_point(aes(color=study),shape=1,size=4) +#specify marker size and shape
  scale_colour_hue(l=50) + # Use a slightly darker palette than normal
  geom_smooth(method=lm,   # Add linear regression lines
              se=FALSE,    # Don't add shaded confidence region
              aes(color=study))+#colour lines according to data points for consistency
  geom_text(aes(x=750, y=0.46,
            label=sprintf("Spearman rho = %1.3f (p = %1.3f)", 
                                        corr_O_Spearm$r[1,2], corr_O_Spearm$P[1,2]),
            color="Original", hjust=0)) +#add text about Spearman correlation coefficient of original studies
  guides(color = guide_legend(title=NULL)) + #avoid additional legend entry for text
  geom_text(aes(x=750, y=0.2,
                label=sprintf("Spearman rho = %1.3f (p = %1.3f)", 
                              corr_R_Spearm$r[1,2], corr_R_Spearm$P[1,2]),
                color="Replication", hjust=0))+#add text about Spearman correlation coefficient of replication studies
  geom_text(x=1500, y=0.33,
                label=sprintf("Difference: Pearson & Filon z = %1.3f (p = %1.3f)", 
                              htest@pearson1898$statistic, htest@pearson1898$p.value),
                color="black", hjust=0)+#add text about testing difference between correlation coefficients
  guides(color = guide_legend(title=NULL))+#avoid additional legend entry for text
  ggtitle("Sampling until significant versus a priori power analysis")+#add figure title
  labs(x="Sample Size", y="Effect size r")#add axis titles

Do music and language share brain resources?

When you listen to some music and when you read a book, does your brain use the same resources? This question goes to the heart of how the brain is organised – does it make a difference between cognitive domains like music and language? In a new commentary I highlight a successful approach which helps to answer this question.

On some isolated island in academia, the tree of knowledge has the form of a brain.

How do we read? What is the brain doing in this picture?

When reading the following sentence, check carefully when you are surprised at what you are reading:

After | the trial | the attorney | advised | the defendant | was | likely | to commit | more crimes.

I bet it was on the segment was. You probably thought that the defendant was advised, rather than that someone else was advised about the defendant. Once you read the word was, you need to reinterpret what you have just read. In 2009 Bob Slevc and colleagues found out that background music can change your reading of this kind of sentence. If you hear a chord which is harmonically unexpected, you have even more trouble with the reinterpretation of the sentence upon reading was.

Why does music influence language?

Why would an unexpected chord be problematic for reading surprising sentences? The most straightforward explanation is that unexpected chords are odd, so they draw your attention. To test this simple explanation, Slevc tried out an unexpected instrument playing the chord in a harmonically expected way. No effect on reading. Apparently, not just any odd chord changes your reading; the musical oddity has to stem from the harmony of the chord. Why this is the case is a matter of debate between scientists. What this experiment makes clear, though, is that music can influence language via shared resources which have something to do with harmony processing.

Why ignore the fact that music influences language?

None of this was mentioned in a recent review by Isabelle Peretz and colleagues on this topic. They looked at where in the brain music and language show activations, as revealed in MRI brain scanners. This is just one way to find out whether music and language share brain resources. They concluded that ‘the question of overlap between music and speech processing must still be considered as an open question’. Peretz and colleagues call for ‘converging evidence from several methodologies’ but fail to mention the evidence from non-MRI methodologies.1

Sure, one has to focus on something, but it annoys me that people tend to focus on methods (especially fancy, expensive methods like MRI scanners) rather than on answers (especially answers from elegant but cheap research into human behaviour, like reading). So I decided to write a commentary together with Bob Slevc. We list no fewer than ten studies which used a similar approach to the one outlined above. Why ignore these results?

Had Peretz and colleagues truly looked at ‘converging evidence from several methodologies’, they would have asked themselves why music sometimes influences language and why it sometimes does not. The debate is in full swing and already beyond the previous question of whether music and language share brain resources. Instead, researchers now ask what kind of resources are shared.

So, yes, music and language appear to share some brain resources. Perhaps this is not easily visible in MRI brain scanners. Looking at how people read with chord sequences played in the background is how one can show this.

— — —
Kunert, R., & Slevc, L.R. (2015). A commentary on “Neural overlap in processing music and speech” (Peretz et al., 2015). Frontiers in Human Neuroscience. doi: 10.3389/fnhum.2015.00330

Peretz I, Vuvan D, Lagrois MÉ, & Armony JL (2015). Neural overlap in processing music and speech. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 370 (1664) PMID: 25646513

Slevc LR, Rosenberg JC, & Patel AD (2009). Making psycholinguistics musical: self-paced reading time evidence for shared processing of linguistic and musical syntax. Psychonomic bulletin & review, 16 (2), 374-81 PMID: 19293110
— — —

1 Except for one ECoG study.

DISCLAIMER: The views expressed in this blog post are not necessarily shared by Bob Slevc.

Why does humanity get smarter and smarter?

Intelligence tests have to be adjusted all the time because people score higher and higher. If the average human of today went 105 years back in time, s/he would score 130, be considered as gifted, and join clubs for highly intelligent people. How can that be?

The IQ growth

The picture above shows the development of humanity’s intelligence between 1909 and 2013. According to IQ-scores people got smarter and smarter. During the last 105 years, people’s scores increased by as much as 30 IQ-points. That is equivalent to the difference between intellectual disability and normal intelligence. Ever since the discovery of this effect by James Flynn, the underlying reason has been hotly debated. A new analysis combines all available studies into one overall picture in order to find answers.

Jakob Pietschnig and Martin Voracek included all available data pertaining to IQ increases from one generation to another: nearly 4 million test takers in 105 years. They found that IQ scores sometimes increased faster and sometimes more slowly. Check the difference between the 1920s and WWII in the figure above. Moreover, different aspects of intelligence change at different speeds. So-called crystallized intelligence (knowledge about facts) increased only at a rate of 0.2 points per year. So-called fluid intelligence (abstract problem solving), on the other hand, increased much faster at 0.4 points per year.
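Just to see whether these numbers hang together, a quick back-of-the-envelope calculation in R (my own arithmetic, not from the meta-analysis):

30 / 105   # average gain per year across all of intelligence: roughly 0.3 IQ points
0.2 * 105  # implied gain for crystallized intelligence over the 105 years: about 21 points
0.4 * 105  # implied gain for fluid intelligence over the 105 years: about 42 points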

Five reasons for IQ growth

Five reasons appear to come together to explain this phenomenon:

1) better schooling: IQ growth is stronger in adults than in children, probably because adults stay longer and longer in school.

2) more experience with multiple choice tests: since the 1990s the multiple choice format has become common in schools and universities. Modern test takers are no longer put off by this way of asking questions in IQ tests and might resort to smart guessing.

3) less malnutrition: the slow IQ growth during the world wars might have something to do with a lack of nutrients and energy which the brain needs.

4) better health care: the less sick you are, the more your brain can develop optimally.

5) less lead poisoning: since the 1970s, lead has been phased out of paint and gasoline, removing an obstacle to healthy neural development.

Am I really smarter than my father?

According to the Flynn effect, my generation is 8 IQ-points smarter than that of my parents. But this only relates to performance on IQ tests. I somehow doubt that more practical, less abstract, areas show the same effect. Perhaps practical intelligence is just more difficult to measure. It is possible that we have not really become more intelligent thinkers but instead more abstract thinkers.

— — —
Pietschnig J, & Voracek M (2015). One Century of Global IQ Gains: A Formal Meta-Analysis of the Flynn Effect (1909-2013). Perspectives on psychological science : a journal of the Association for Psychological Science, 10 (3), 282-306 PMID: 25987509

— — —

Figure: self made, based on data in Figure 1 in Pietschnig & Voracek (2015, p. 285)

The scientific community’s Galileo affair (you’re the Pope)

Science is in crisis. Everyone in the scientific community knows about it but few want to talk about it. The crisis is one of honesty. A junior scientist (like me) asks himself a similar question to Galileo in 1633: how much honesty is desirable in science?

Galileo versus Pope: guess what role the modern scientist plays.

Science Wonderland

According to nearly all empirical scientific publications that I have read, scientists allegedly work like this:

Introduction, Methods, Results, Discussion

Scientists call this ‘the story’ of the paper. This ‘story framework’ is so entrenched in science that the vast majority of scientific publications are required to be organised according to its structure: 1) Introduction, 2) Methods, 3) Results, 4) Discussion. My own publication is no exception.

Science Reality

However, virtually all scientists know that ‘the story’ is not really true. It is merely an ideal-case-scenario. Usually, the process looks more like this:

questionable research practices

Scientists call some of the added red arrows questionable research practices (or QRP for short). The red arrows stand for (going from left to right, top to bottom):

1) adjusting the hypothesis based on the experimental set-up. This is particularly likely when a) working with an old data set, or b) when the set-up is accidentally different from the intended one, etc.

2) changing design details (e.g., how many participants, how many conditions to include, how many/which measures of interest to focus on) depending on the results these changes produce.

3) analysing until results are easy to interpret.

4) analysing until results are statistically desirable (‘significant results’), i.e. so-called p-hacking.

5) hypothesising after results are known (so-called HARKing).

The outcome is a collection of blatantly unrealistic ‘stories’ in scientific publications. Compare this to the more realistic literature on clinical trials for new drugs. More than half the drugs fail the trial (Goodman, 2014). In contrast, nearly all ‘stories’ in the wider scientific literature are success stories. How?

Joseph Simmons and colleagues (2011) give an illustration of how to produce spurious successes. They simulated the situation of researchers engaging in the second point above (changing design details based on results). Let’s assume that the hypothesised effect is not real. How many experiments will erroneously find an effect at the conventional 5% significance criterion? Well, 5% of experiments should (scientists have agreed that this number is low enough to be acceptable). However, thanks to the questionable research practices outlined above this number can be boosted. For example, sampling participants until the result is statistically desirable leads to up to 22% of experiments reporting a ‘significant result’ even though there is no effect to be found. It is estimated that 70% of US psychologists have done this (John et al., 2012). When such a practice is combined with other, similar design changes, up to 61% of experiments falsely report a significant effect. Why do we do this?
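The core of this argument is easy to check with a few lines of R. This is a minimal sketch of my own, not Simmons and colleagues’ simulation code, and the exact false-positive rate depends on the assumed stopping rule:

set.seed(1)

# two groups with no true difference; keep adding participants until p < .05 or the maximum N is reached
false_positive <- function(n_start = 20, n_step = 10, n_max = 50) {
  a <- rnorm(n_start)
  b <- rnorm(n_start)
  while (t.test(a, b)$p.value >= .05 && length(a) < n_max) {
    a <- c(a, rnorm(n_step))
    b <- c(b, rnorm(n_step))
  }
  t.test(a, b)$p.value < .05
}

mean(replicate(10000, false_positive()))  # clearly above the nominal 5%
mean(replicate(10000, t.test(rnorm(50), rnorm(50))$p.value < .05))  # fixed N for comparison: about 5%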

The Pope of 1633 is back

If we know that the scientific literature is unrealistic why don’t we just drop the pretense and just tell it as it is? The reason is simple: because you like the scientific wonderland of success stories. If you are a scientist reader, you like to base the evaluation of scientific manuscripts on the ‘elegance’ (simplicity, straight-forwardness) of the text. This leaves no room for telling you what really happened. You also like to base the evaluation of your colleagues on the quantity and the ‘impact’ of their scientific output. QRPs are essentially a career requirement in such a system. If you are a lay reader, you like the research you fund (via tax money) to be sexy, easy and simple. Scientific data are as messy as the real world but the reported results are not. They are meant to be easily digestible (‘elegant’) insights.

In 1633 it did not matter much whether Galileo admitted to the heliocentric world view which was deemed blasphemous. The idea was out there to conquer the minds of the renaissance world. Today’s Galileo moment is also marked by an inability to admit to scientific facts (i.e. the so-called ‘preliminary’ research results which scientists obtain before applying questionable research practices). But this time the role of the Pope is played both by political leaders/ the lay public and scientists themselves. Actual scientific insights get lost before they can see the light of day.

There is a great movement to remedy this situation, including pressure to share data (e.g., at PLoS ONE), replication initiatives (e.g., RRR1, reproducibility project), the opportunity to pre-register experiments etc. However, these remedies only focus on scientific practice, as if Galileo was at fault and the concept of blasphemy was fine. Maybe we should start looking into how we got into this mess in the first place. Focus on the Pope.

— — —
John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices with Incentives for Truth-Telling SSRN Electronic Journal DOI: 10.2139/ssrn.1996631

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

Picture: Joseph Nicolas Robert-Fleury [Public domain], via Wikimedia Commons

PS: This post necessarily reflects the field of science that I am familiar with (Psychology, Cognitive Science, Neuroscience). The situation may well be different in other scientific fields.



The real reason why new pop music is so incredibly bad

You have probably heard that Pink Floyd recently published their new album Endless River. Will this bring back the wonderful world of good music after the endless awfulness of the popular music scene in the last 20 years or so? Is good music, as we know it from the 60s and 70s, back for good? The reasons behind the alleged endless awfulness of pop music these days suggest otherwise. We shouldn’t be throwing stones at new music but instead at our inability to like it.

Pink Floyd 1973

When we were young we learned to appreciate Pink Floyd.

Daniel Levitin was asked at a recent music psychology conference in Toronto why old music is amazing and new music is awful. He believed that modern record companies are there to make money. In the olden days, on the other hand, they were there to make music and ready to hold on to musicians which needed time to become successful. More interestingly, he reminded the public that many modern kidz would totally disagree with the implication that modern music is awful. How can it be that new music is liked by young people if so much of it is often regarded as quite bad?

Everything changes for the better after a few repetitions

The answer to the mystery has nothing to do with flaws in modern music but instead with our brain. When adults hear new music they often hate it at first. After repeated listening they tend to find it more and more beautiful. For example, Marcia Johnson and colleagues (1985) played Korean melodies to American participants and found that hearing a new melody led to low liking ratings, a melody heard once before to higher ratings and even more exposure to higher than higher ratings. Even Korsakoff patients – who could hardly remember having heard individual melodies before – showed this effect, i.e. without them realising it they probably never forget melodies.

This so-called mere exposure effect is all that matters to me: a robust, medium-strong, generally applicable, evolutionarily plausible effect (Bornstein, 1989). You can do what you like, it applies to all sorts of stimuli. However, there is one interesting exception here. Young people do not show the mere exposure effect, no relationship between ‘repeat the stimulus’ and ‘give good feeling’ (Bornstein, 1989). As a result, adults need a lot more patience before they like a new song as much as young people do. No wonder adults are only satisfied with the songs they already know from their youth in the 60s and 70s. Probably, when looking at the music scene in 2050 the current generation will equally hate it and wish the Spice Girls back (notice the gradual rise of 90’s parties already).

I listened to it –> I like it

So, when it comes to an allegedly awful present and great past, ask yourself: how deep is your love for the old music itself rather than its repeated listening? Listen repeatedly to any of a million love songs and you will end up appreciating it. Personally, I give new music a chance and sometimes it manages to relight my fire. Concerning Endless River, if it’s not love at first sight, do not worry. The new Pink Floyd album sure is good (depending on how many times you listen to it).

— — —
Bornstein, R. (1989). Exposure and affect: Overview and meta-analysis of research, 1968-1987. Psychological Bulletin, 106 (2), 265-289 DOI: 10.1037/0033-2909.106.2.265

Johnson MK, Kim JK, & Risse G (1985). Do alcoholic Korsakoff’s syndrome patients acquire affective reactions? Journal of experimental psychology. Learning, memory, and cognition, 11 (1), 22-36 PMID: 3156951
— — —

Figure: By PinkFloyd1973.jpg: TimDuncan derivative work: Mr. Frank (PinkFloyd1973.jpg) [CC-BY-3.0 (, via Wikimedia Commons

— — —

PS: Yes, I did hide 29 Take That song titles in this blog post. Be careful, you might like 90’s pop music a little bit more due to this exposure.






Dyslexia: trouble reading ‘four’

Dyslexia affects about every tenth reader. It shows up when trying to read, especially when reading fast. But it is still not fully clear what words dyslexic readers find particularly hard. So, I did some research to find out, and I published the article today.

Carl Spitzweg: the bookworm

The bookworm (presumably non-dyslexic)

Imagine seeing a new word ‘bour’. How would you pronounce it? Similar to ‘four’, similar to ‘flour’ or similar to ‘tour’? It is impossible to know. Therefore, words such as ‘four’, ‘flour’ and ‘tour’ are said to be inconsistent – one doesn’t know how to pronounce them when encountering them for the very first time. Given this pronunciation challenge, I, together with my co-author Christoph Scheepers, hypothesised that such words would be more difficult for readers generally, and for dyslexic readers especially.

Finding evidence for a dyslexia-specific problem is challenging because dyslexic participants tend to be slower than non-dyslexic people in most tasks that they do. So, if you force them to be as quick as typical readers, they will seem like bad readers even though they might merely be slow readers. Therefore, we adopted a new task that gave people a very long time to judge whether a string of letters is a word or not.

It turns out that inconsistent words like ‘four’ slow down both dyslexic and typical readers. But on top of that dyslexic readers never quite reach the same accuracy as typical readers with these words. It is as if the additional challenge these words pose can, with time, be surmounted in normal readers while dyslexic readers have trouble no matter how much time you give them. In other words, dyslexic people aren’t just slow. At least for some words they have trouble no matter how long they look at them.

This is my very first publication based on work I did more than four years ago. You should check out whether the waiting was worth it. The article is free to access here. I hope it will convince you that dyslexia is a real challenge to investigate. Still, the pay-off to fully understanding it is enormous: helping dyslexic readers cope in a literate society.

— — —
Kunert, R., & Scheepers, C. (2014). Speed and accuracy of dyslexic versus typical word recognition: an eye-movement investigation Frontiers in Psychology, 5 DOI: 10.3389/fpsyg.2014.01129
— — —

Picture: Carl Spitzweg [Public domain or Public domain], via Wikimedia Commons

Old people are immune against the cocktail party effect

Imagine standing at a cocktail party and somewhere your name gets mentioned. Your attention is immediately grabbed by the sound of your name. It is a classic psychological effect with a new twist: old people are immune.

Someone mention my name?

The so-called cocktail party effect has fascinated researchers for a long time. Even though you do not consciously listen to a conversation around you, your own name can grab your attention. That means that unbeknownst to you, you follow the conversations around you. You check them for salient information like your name, and if it occurs you quickly switch attention to where your name was mentioned.

The cocktail party simulated in the lab

In the lab this is investigated slightly differently. Participants listen to one ear and, for example, repeat whatever they hear. Their name is embedded in what comes into the other (unattended) ear. After the experiment one simply asks: ‘Did you hear your own name?’ In a recent paper published by Moshe Naveh-Benjamin and colleagues (in press), around half of the young student participants noticed their name in such a set-up. Compare this to old people aged around 70: next to nobody (only six out of 76 participants) noticed their name being mentioned in the unattended ear.

Why this age difference? Do old people simply not hear well? Unlikely: when the name was played to the ear that they attended to, 45% of old people noticed their names. Clearly, many old people can hear their names, but they do not notice them if they do not pay attention to that ear. Young people do not show such a sharp distinction. Half the time they notice their names, even when concentrating on something else.

Focusing the little attention that is available

Naveh-Benjamin and colleagues instead suggest that old people simply have less attention. When they focus on a conversation, they give it everything they have. Nothing is left for the kind of unconscious monitoring of surrounding conversations which young people do so well.

At the next cocktail party you can safely gossip about your old boss. Just avoid mentioning the name of the young new colleague who just started.


— — —

Naveh-Benjamin M, Kilb A, Maddox GB, Thomas J, Fine HC, Chen T, & Cowan N (2014). Older adults do not notice their names: A new twist to a classic attention task. Journal of experimental psychology. Learning, memory, and cognition PMID: 24820668

— — —


By Financial Times (Patrón cocktail bar) [CC-BY-2.0 (, via Wikimedia Commons

Why are ethical standards higher in science than in business and media?

Facebook manipulates user content in the name of science? Scandalous! It manipulates user content in the name of profit? No worries! Want to run a Milgram study these days? Get bashed by your local ethics committee! Want to show it on TV? No worries. Why do projects which seek knowledge have higher ethical standards than projects which seek profit?

Over half a million people were this mouse.

Just as we were preparing to leave for our well-deserved summer holidays this year, research was shaken by the fall-out from a psychological study (Kramer et al., 2014) which manipulated Facebook content. Many scientists objected to the study’s failure to ask for ‘informed consent’, and I think they are right. However, many ordinary people objected to something else. Here’s how Alex Hern put it over at the Guardian:

At least when a multinational company, which knows everything about us and controls the very means of communication with our loved ones, acts to try and maximise its profit, it’s predictable. There’s something altogether unsettling about the possibility that Facebook experiments on its users out of little more than curiosity.

Notice the opposition between ‘maximising profit’, which is somehow thought to be okay, and ‘experimenting on users’, which is not. I genuinely do not understand this distinction. Suppose the study had never been published in PNAS but instead in the company’s report to shareholders (as a new means of emotionally enhancing advertisements): would there have been the same outcry? I doubt it. Why not?

Having issues with TV experimentation versus scientific experimentation?

Was the double standard around the Facebook study the exception? I do not think so. In the following YouTube clip you see the classic Milgram experiment re-created for British TV. The participants’ task is to follow the experimenter’s instructions and deliver electric shocks to another participant (who is actually an actor) for bad task performance. The shocks increase in strength until they are allegedly lethal. People are obviously distressed by this task.

Yesterday, the New Scientist called the classic Milgram experiment one of ‘the most unethical [experiments] ever carried out’. Why is this okay for TV? Now, imagine a hybrid case. Would it be okay if the behaviour shown on TV were scientifically analysed and published in a respectable journal? I guess that would somehow be fine. Why is it okay to run the study with a TV camera involved, but not when the TV camera is switched off? This is not a rhetorical question. I actually do not grasp the underlying principle.

Why is ‘experimenting on people’ bad?

In my experience, ethical guidelines are a real burden on researchers. And this is a good thing, because society holds researchers to a high ethical standard. Practically all modern research on humans involves strong ethical safeguards. Compare this to business and media. I do not understand why projects seeking private gains (profit for shareholders) have a lower ethical standard than research. Surely the generation of public knowledge is more in the public interest than private profit making or TV entertainment.

— — —

Kramer AD, Guillory JE, & Hancock JT (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America, 111 (24), 8788-90 PMID: 24889601

Milgram, S. (1963). Behavioral study of obedience. The Journal of Abnormal and Social Psychology, 67 (4), 371-378. doi: 10.1037/h0040525

— — —

picture: from

How to increase children’s patience in 5 seconds

A single act increases adults’ compliance with researchers. The same act makes students more likely to volunteer to solve math problems in front of others. Moreover, it makes four-year-olds more patient. What sounds like a miracle cure to everyday problems is actually the oldest trick in the book: human touch.

How do researchers know this? Here is one experiment. In a recently published study (Leonard et al., 2014), four- and five-year-old children were asked to wait for ten minutes in front of candy. The experimenter told them to wait before eating the candy because he had to finish paperwork. How long would children wait before calling the experimenter in because they wanted to eat the candy earlier? Four-year-olds waited for about six minutes while five-year-olds waited for about eight minutes. The task was similar to the classic Marshmallow test shown in the video.


The positive effect of touch

However, it all depends on whether the experimenter gave children a friendly touch on the back during the request to wait. If the experimenter did, four-year-olds waited for seven minutes (versus five minutes without touch) and five-year-olds waited for nine minutes (versus seven minutes without touch). A simple, five-second-long touch made four-year-olds behave as patiently as five-year-olds. It’s surprising how simple and fast the intervention is.

Touch across the ages

This result fits nicely into a wider literature on the benefits of a friendly touch. Already back in the eighties, Patterson and colleagues (1986) found that adults spent more time helping with the tedious task of scoring personality tests if they were touched by the experimenter. Interestingly, the touch on the shoulder was hardly ever reported as noteworthy. In the early noughties, Guéguen picked this effect up and moved it to the real world. He showed that touch also increases adults’ willingness to help by looking after a large dog (Guéguen & Fischer-Lokou, 2002) as well as students’ willingness to volunteer to solve a math problem in front of a class (Guéguen, 2004).

The reason underlying these effects remains a bit mysterious. Does the touch on the back reduce the anxiety of being faced with a new, possibly difficult, task? Does it increase the rapport between experimenter and experimental participant? Does it make time fly by because being touched feels good? Well, time will tell.

Touch your child?

There are obvious sexual connotations related to touching people, unfortunately this includes touching children. As a result, some schools in the UK have adopted a ‘no touch’ policy: teachers are never allowed to touch children. Research shows that such an approach comes at a cost: children behave less patiently when they are not touched. Should society deny itself the benefits of people innocently touching each other?


Guéguen N, & Fischer-Lokou J (2002). An evaluation of touch on a large request: a field setting. Psychological reports, 90 (1), 267-9 PMID: 11898995

Guéguen, N. (2004). Nonverbal Encouragement of Participation in a Course: the Effect of Touching Social Psychology of Education, 7 (1), 89-98 DOI: 10.1023/B:SPOE.0000010691.30834.14

Leonard JA, Berkowitz T, & Shusterman A (2014). The effect of friendly touch on delay-of-gratification in preschool children. Quarterly journal of experimental psychology (2006), 1-11 PMID: 24666195

Patterson, M., Powell, J., & Lenihan, M. (1986). Touch, compliance, and interpersonal affect Journal of Nonverbal Behavior, 10 (1), 41-50 DOI: 10.1007/BF00987204