Why are psychological findings mostly unreplicable?

Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.

Sometimes the title of a paper just sounds incredible. Estimating the reproducibility of psychological science. No one had ever systematically, empirically investigated this for any science. Doing so would require huge resources. The countless authors on this paper, which appeared in Science last week, went to great lengths to try anyway, and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e. a nominal 92% chance of finding each effect should it exist as claimed by the original discoverers), 89 statistically significant effects should have popped up. Only 35 did. Why did 54 more studies fail to replicate?
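The expected count follows directly from the power figure: with a nominal 92% chance per study, replications behave roughly like independent Bernoulli trials. A quick sketch in R (the numbers come from the text; the independence assumption is mine):

```r
# Expected number of significant replications if all 97 original
# effects were real and each replication had 92% power
n_studies <- 97
power     <- 0.92
expected  <- n_studies * power  # expected successes under the nominal power
observed  <- 35                 # what actually replicated

round(expected)  # 89

# How surprising is 35 or fewer successes if 92% power were accurate?
pbinom(observed, size = n_studies, prob = power)  # vanishingly small
```

The binomial tail probability makes the discrepancy concrete: 35 successes out of 97 is essentially impossible if the original effects were as robust as claimed.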

The team behind this article also produced 95% confidence intervals of the replication study effect sizes. Despite their name, only 83% of them should be expected to contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes much smaller in the replication?
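The 83% figure falls out of the fact that the original point estimate is itself noisy, so a replication's 95% interval captures it less than 95% of the time even when both studies estimate the same true effect. A toy simulation (the true effect size, sample size and equal-standard-error assumption are all mine, chosen only for illustration):

```r
# Why should only ~83% of replication 95% CIs contain the original
# point estimate? Because the original estimate is itself noisy.
set.seed(1)
n_sim  <- 10000
true_d <- 0.4                     # hypothetical true effect
n      <- 50                      # hypothetical per-study sample size
se     <- sqrt(2 / n)             # rough SE of a standardised mean difference

orig <- rnorm(n_sim, true_d, se)  # original point estimates
repl <- rnorm(n_sim, true_d, se)  # replication point estimates

# is the original estimate inside the replication's 95% CI?
covered <- abs(repl - orig) < 1.96 * se
mean(covered)                     # ~0.83, not 0.95
```

With equal standard errors, the difference between the two estimates has standard deviation sqrt(2) times larger than either CI assumes, which is exactly where the ~83% comes from.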

One reason for poor replication: sampling until significant

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a p-value below the arbitrary threshold of 0.05. The consequence is this:


Focus on the left panel first. The green replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected, as the replicators wanted to be sure that they had sufficient statistical power to find their effects. Expecting small effects (lower on the vertical axis) makes you plan in more participants (further right on the horizontal axis). The replicators simply sampled their pre-determined number and then analysed the data. Such a practice leads to only a moderate correlation between measured effect size and sample size, because the effect size you will end up measuring is still uncertain when you start sampling.

The red original studies show a stronger relation between the effect size they found and their sample size. They must have done more than just smart a priori power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size measured in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the expected effect size estimated before the experiment, hence the weaker association with the actually measured effect size.
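This optional-stopping story can be made concrete with a toy simulation (the true effect size, batch size and stopping cap below are invented for illustration; this is a sketch of the suspected practice, not the original studies' actual procedure):

```r
# Toy simulation of "sampling until significant": test after every
# batch of participants, stop as soon as p < .05 or a cap is reached
set.seed(1)
sample_until_sig <- function(d = 0.3, batch = 10, n_max = 300) {
  x <- numeric(0)
  repeat {
    x <- c(x, rnorm(batch, mean = d))          # collect another batch
    if (length(x) >= 20 &&
        t.test(x)$p.value < .05) break         # peek: stop if significant
    if (length(x) >= n_max) break              # give up eventually
  }
  c(n = length(x), d_hat = mean(x) / sd(x))    # final N and effect size
}

res <- t(replicate(1000, sample_until_sig()))
# early stops pair small samples with inflated effects,
# so N and measured effect size become strongly negatively related
cor(res[, "n"], res[, "d_hat"], method = "spearman")
```

Runs that got lucky early stop with a small N and an overestimated effect; runs with equivocal starts drag on, ending with a large N and a modest effect. That is precisely the pattern visible in the red original studies.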

The right panel shows a Bayesian correlation analysis of the data. What you are looking at is the belief in the strength of the correlation, called the posterior distribution. The overlap of the distributions can be used as a measure of the belief that the correlations are not different. The overlap is less than 7%. If you are more inclined to believe in frequentist statistics, the associated p-value is .001 (Pearson and Filon’s z = 3.355). Therefore, there is strong evidence that original studies display a stronger negative correlation between sample size and measured effect size than replication studies.

The approach which – I believe – has been followed by the original research teams should be accompanied by adjustments of the p-value (see Lakens, 2014 for how to do this). If not, you misrepresent your stats and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). It is estimated that 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in Psychology are far lower than what they should be.
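The false-positive inflation caused by uncorrected peeking, of the kind simulated by Simmons et al. (2011), can be sketched in a few lines (the peeking schedule below is my own arbitrary choice, not the one used in their paper):

```r
# How much does peeking inflate false positives? Here the null is
# true (d = 0), yet we test after every additional 10 participants
set.seed(1)
peek_fp <- replicate(5000, {
  x <- rnorm(100)                  # no effect exists
  checks <- seq(20, 100, by = 10)  # peek at N = 20, 30, ..., 100
  any(sapply(checks, function(n) t.test(x[1:n])$p.value < .05))
})
mean(peek_fp)  # well above the nominal .05
```

Even this modest schedule of nine looks roughly triples the nominal 5% false-positive rate, which is why sequential sampling demands corrected thresholds.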

So, one approach to boosting replication rates might be to do what we claim to do anyway and what the replication studies actually did: acquiring data first, analysing them second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

[24/10/2015: Added Bayesian analysis and changed figure. Code below is from old figure.]

[27/11/2015: Adjusted percentage overlap of posterior distributions.]

— — —
John LK, Loewenstein G, & Prelec D (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23 (5), 524-32 PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses European Journal of Social Psychology, 44 (7), 701-710 DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between sample size and effect size from data provided by the reproducibility project

#Richard Kunert for Brain's Idea 3/9/2015
#load necessary libraries
library(httr)    #for GET() and write_disk()
library(Hmisc)   #for rcorr()
library(cocor)   #for cocor()
library(ggplot2) #for plotting

#get raw data from OSF website
info <- GET('', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file

#restrict studies to those with appropriate data
studies <- MASTER$ID[!is.na(MASTER$T_r..O.) & !is.na(MASTER$T_r..R.)] ##to keep track of which studies are which
studies<-studies[-31]##remove one problem study with absurdly high sample size (N = 23,0047)

#set font size for plotting
theme_set(theme_gray(base_size = 30))

#prepare correlation coefficients
dat_rank <- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
                       sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
                       effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
                       effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm <- rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O, type = "spearman") #yes, I know the type specification is superfluous
corr_R_Spearm <- rcorr(dat_rank$effect_size_R, dat_rank$sample_size_R, type = "spearman")

#compare Spearman correlation coefficients using cocor (data needs to be ranked in order to produce Spearman correlations!)
htest <- cocor(formula = ~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
               data = dat_rank, return.htest = FALSE)

#prepare data frame for plotting
dat_vis <- data.frame(study = rep(c("Original", "Replication"), each = length(studies)),
                      sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]), cbind(MASTER$T_N_R_for_tables[studies])),
                      effect_size = rbind(cbind(MASTER$T_r..O.[studies]), cbind(MASTER$T_r..R.[studies])))

#The plotting call
ggplot(data = dat_vis, aes(x = sample_size, y = effect_size, group = study)) + #the basic scatter plot
  geom_point(aes(color = study), shape = 1, size = 4) + #specify marker size and shape
  scale_colour_hue(l = 50) + #use a slightly darker palette than normal
  geom_smooth(method = lm,  #add linear regression lines
              se = FALSE,   #don't add shaded confidence regions
              aes(color = study)) + #colour lines according to data points for consistency
  geom_text(aes(x = 750, y = 0.46,
                label = sprintf("Spearman rho = %1.3f (p = %1.3f)",
                                corr_O_Spearm$r[1,2], corr_O_Spearm$P[1,2]),
                color = "Original", hjust = 0)) + #Spearman correlation of original studies
  geom_text(aes(x = 750, y = 0.2,
                label = sprintf("Spearman rho = %1.3f (p = %1.3f)",
                                corr_R_Spearm$r[1,2], corr_R_Spearm$P[1,2]),
                color = "Replication", hjust = 0)) + #Spearman correlation of replication studies
  geom_text(x = 1500, y = 0.33,
            label = sprintf("Difference: Pearson & Filon z = %1.3f (p = %1.3f)",
                            htest@pearson1898$statistic, htest@pearson1898$p.value),
            color = "black", hjust = 0) + #test of the difference between correlation coefficients
  guides(color = guide_legend(title = NULL)) + #avoid additional legend entries for the text layers
  ggtitle("Sampling until significant versus a priori power analysis") + #figure title
  labs(x = "Sample Size", y = "Effect size r") #axis titles

Why does humanity get smarter and smarter?

Intelligence tests have to be adjusted all the time because people score higher and higher. If the average human of today went 105 years back in time, s/he would score 130, be considered gifted, and join clubs for highly intelligent people. How can that be?

The IQ growth

The picture above shows the development of humanity’s intelligence between 1909 and 2013. According to IQ-scores people got smarter and smarter. During the last 105 years, people’s scores increased by as much as 30 IQ-points. That is equivalent to the difference between intellectual disability and normal intelligence. Ever since the discovery of this effect by James Flynn, the underlying reason has been hotly debated. A new analysis combines all available studies into one overall picture in order to find answers.

Jakob Pietschnig and Martin Voracek included all available data pertaining to IQ increases from one generation to another: nearly 4 million test takers in 105 years. They found that IQ scores sometimes increased faster and sometimes more slowly. Check the difference between the 1920s and WWII in the figure above. Moreover, different aspects of intelligence change at different speeds. So-called crystallized intelligence (knowledge about facts) increased only at a rate of 0.2 points per year. So-called fluid intelligence (abstract problem solving), on the other hand, increased much faster at 0.4 points per year.

Five reasons for IQ growth

Five reasons appear to come together to explain this phenomenon:

1) better schooling: IQ growth is stronger in adults than in children, probably because adults stay longer and longer in school.

2) more experience with multiple choice tests: since the 1990s the multiple choice format has become common in schools and universities. Modern test takers are no longer put off by this way of asking questions in IQ tests and might resort to smart guessing.

3) less malnutrition: the slow IQ growth during the world wars might have something to do with a lack of the nutrients and energy which the brain needs.

4) better health care: the less sick you are, the more optimally your brain can develop.

5) less lead poisoning: since the 1970s lead has been phased out in paint and gasoline, removing an obstacle to healthy neural development.

Am I really smarter than my father?

According to the Flynn effect, my generation is 8 IQ-points smarter than that of my parents. But this only relates to performance on IQ tests. I somehow doubt that more practical, less abstract areas show the same effect. Perhaps practical intelligence is just more difficult to measure. It is possible that we have not really become more intelligent thinkers but instead more abstract thinkers.

— — —
Pietschnig J, & Voracek M (2015). One Century of Global IQ Gains: A Formal Meta-Analysis of the Flynn Effect (1909-2013). Perspectives on psychological science : a journal of the Association for Psychological Science, 10 (3), 282-306 PMID: 25987509

— — —

Figure: self made, based on data in Figure 1 in Pietschnig & Voracek (2015, p. 285)

Memory training boosts IQ

Is IQ set in stone once we hit adulthood? ‘Yes, it is’ used to be the received wisdom. A new meta-analysis challenges this view and gives hope to all of us who feel that mother nature should have endowed us with more IQ points. But is the training worth it?

a perfectly realistic depiction of intelligence training

Intelligence increases in adults

I have previously blogged about intelligence training with music (here). Music lessons increase your intelligence by around 3 IQ points. But this has only been shown to work in children. A new paper shows that adults can also improve their IQ. Jacky Au and colleagues make this point based on one big analysis incorporating 20 publications with over 1,000 participants. People did a working memory exercise, i.e. they trained the bit of their mind that holds information online. How? They did the so-called n-back task over and over and over again. Rather than explain the n-back task here, I just invite you to watch the video.
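For readers who prefer code to video, the core of the n-back task is simple: a stimulus counts as a match if it equals the stimulus shown n steps earlier. A minimal sketch of that scoring logic (the letters and the choice of n are made up; this is not the training software used in the studies):

```r
# Flag which stimuli in a sequence are n-back matches,
# i.e. identical to the stimulus presented n steps earlier
is_nback_match <- function(stimuli, n) {
  k <- length(stimuli)
  # the first n stimuli can never be matches
  c(rep(FALSE, n), stimuli[(n + 1):k] == stimuli[1:(k - n)])
}

stimuli <- c("A", "B", "A", "C", "A", "A")
is_nback_match(stimuli, n = 2)
# FALSE FALSE  TRUE FALSE  TRUE FALSE
```

In the actual training task, participants see such a stream one stimulus at a time and must press a button on every match, which is what keeps the information "online" in working memory.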

Increasing memory, increasing intelligence

Of course you cannot change your intelligence if you only do the task once. Do it several times a week over several weeks, however, and your performance should increase, which shows that you trained your working memory. Crucially, you will also improve on seemingly unrelated IQ tests. The meta-analysis takes this as a sign that actual intelligence increases result from n-back training. Working memory training goes beyond improvements on working memory tests alone.

The catch

So, the training is effective. It increases your intelligence by three to four IQ points. But is it efficient? You have to train for around half an hour daily, over a month. Such a training regime will have a considerable impact on your life. Are three to four IQ points enough to compensate for that?

— — —

Au, J., Sheehan, E., Tsai, N., Duncan, G., Buschkuehl, M., & Jaeggi, S. (2014). Improving fluid intelligence with training on working memory: a meta-analysis Psychonomic Bulletin & Review DOI: 10.3758/s13423-014-0699-x

— — —


When to switch on background music

Some things in our daily lives have become so common that we hardly notice them anymore. Background music is one such thing. Whether you are in a supermarket, a gym or a molecular biology laboratory, you can constantly hear it. More than that, even in quiet environments like the office or the library, people get out their mp3-players and play background music. Is this a way of boosting one’s productivity, or are people enjoying music at the cost of getting things done? Research on the effect of background music can give an answer.

A German research team led by Juliane Kämpfe did a meta-analysis of nearly 100 studies on this topic. It turns out that certain tasks benefit from background music. They are noticeably mindless tasks: mundane behaviours like eating or driving as well as sports. Below you can hear how Arnold Schwarzenegger uses this finding to great effect.



Music also has a positive effect on mood regulation, such as controlling your nervousness before a job interview. (I have discussed similar stuff before when looking into why people willingly listen to sad music.)
However, music can also have a detrimental effect. It can draw your attention away from the things you should be focussing on. As a result, a negative influence tends to be seen in situations which require concentration: memorising and text understanding. In other words: don’t play it in a university library as these students did.



So far, so unsurprising. However, one positive effect stands out from the picture I painted above. The German meta-analysis mentions a curious, positive effect of music on simple math tests. This is in line with a recent study by Avila and colleagues who found a positive effect of music on logical reasoning. Could it be that the negative effect of background music on concentration tasks is found because these tasks are nearly always language based? Music and language have been claimed to share a lot of mental resources. This special link between the two modalities could perhaps explain the negative effect. It is too early to tell, but there may be a set of intellectual tasks which benefit from music: the abstract, mathematical or logical ones.
The conclusion is clear. If you want to get things done, choose carefully whether music will aid you or hold you back. Think Arnie or Gangnam Style.

Avila, C., Furnham, A., & McClelland, A. (2012). The influence of distracting familiar vocal music on cognitive performance of introverts and extraverts Psychology of Music, 40 (1), 84-93 DOI: 10.1177/0305735611422672

Kämpfe, J., Sedlmeier, P., & Renkewitz, F. (2011). The impact of background music on adult listeners: A meta-analysis Psychology of Music, 39 (4), 424-448 DOI: 10.1177/0305735610376261

If you liked this post, you may also like:

Mental Fitness – How to Improve your Mind through Bodily Exercise


If you were not entirely indifferent to this post, please leave a comment.

Mental Fitness – How to Improve your Mind through Bodily Exercise
We stood in the middle of the motorway, about to drive onto it in the wrong direction. The windscreen wipers worked madly even though the weather was very dry. One of the passengers screamed, I can’t remember whom.
A German family’s holidays in South Africa can be scary indeed.
What had happened? In psychological jargon, my mother – who drove – was overcome by prepotent responses which were not inhibited by her executive control system. In other words, her German driving habits – drive down a motorway on the right-hand side, indicate using a lever on the left of the steering wheel – were incompatible with the South African traffic system – drive on the left – and the car – the windscreen wipers are activated on the left, the indicator on the right. In order to drive in South Africa my mother needed to hold in mind the correct information about how to drive and at the same time had to stop herself falling back on her usual driving habits. Worse, she had to do so while sitting basically motionless.
A link between immobility and mental performance is suggested by a recent meta-analysis done by Chang and colleagues (just published in Brain Research). They pooled 79 studies with a total of over 2,000 participants and overall found a very small positive effect of a short bout of exercise on cognitive performance.
If you would like slightly superior mental abilities, here is some self-help advice:
If you want to show off your slightly improved concentration during exercise:
– Exercise intensity: doesn’t matter too much.
– Cognitive improvements: executive control
Positive effects are only seen for tasks similar to my mother sitting in a foreign car on a foreign road, i.e. situations where you need to concentrate in order to perform differently from what you are used to or from what is usually obvious. Forget about higher intelligence or better memory while sweating it out.
– People: The better your overall fitness level the more positive the effect.
If you want to mentally perform slightly better just after exercising:
– Exercise intensity: light to intermediate
– Cognitive improvements: executive control, attention, intelligence
– People: unfit or very fit (not moderate)
If there is a small pause of at least a minute between physical exercise and cognitive performance:
– Exercise intensity: light or above (not very light)
– Cognitive improvements: executive control, factual knowledge
– People: any level of bodily fitness
How long should the exercises be to see positive effects?
At least 10 minutes.
For how long do the improvements last?
No more than 15 minutes.
How old are people who show cognitive improvements?
Age effects are not strong but generally any age after primary school will show improvements.
Which type of exercise works best?
Aerobic exercise works. Anaerobic and muscular resistance training regimes may have the opposite effect but more research is needed before strong conclusions can be drawn.
At what time of day are the improvements seen?
In the morning. However, often testing time is not reported, so don’t take my word for it.
So much for the self-help. But what does it all mean? Chang and colleagues interpret their results in terms of some unspecified bodily mechanism related to exercise, e.g. heart rate, increasing to some optimal level. Before it declines back to resting level, one has a limited time window of perhaps 15 minutes in which to show minimally improved cognition. It is a nice illustration of how body and mind are intertwined.
So, had my mother cycled – instead of driven a car – her chances of nearly driving down the wrong side of the motorway would perhaps have been a bit smaller. Also, her chances of riding on a bicycle on any motorway at all would have been smaller, of course. Well, you get my point.
Now think about all the great inventions, all the great ideas, all the great insights that we could have had if only we didn’t spend the 15 minutes after physical exercise with stretching, chatting, and showering. Now go jogging for fifteen minutes and, immediately afterwards, think again.
Chang, Y.K., Labban, J.D., Gapin, J.I., Etnier, J.L. (2012). The effects of acute exercise on cognitive performance: A meta-analysis. Brain Research, 1453, 87-101. doi: 10.1016/j.brainres.2012.02.068