Does twitter or facebook activity influence scientific impact?

Are scientists smart when they promote their work on social media? Isn’t this a waste of time, time which could otherwise be spent in the lab running experiments? Perhaps not. An analysis of all available articles published by PLoS journals suggests that it depends on the platform.

My own twitter activity might best be thought of as learning about science (in the widest sense), while what I do on facebook is really just shameless procrastination. It turns out that this pattern holds more generally and impacts on how to use social media effectively to promote science.

In order to make this claim, I downloaded the twitter and facebook activity associated with every single article published in any journal by the Public Library of Science (PLoS), using this R-script here. PLoS is the open access publisher of the biggest scientific journal, PLoS ONE, as well as a number of smaller, higher-impact journals. The huge amount of data allows me to have a 90% chance of discovering even a small effect (r = .1) if it actually exists.
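For readers who want to check a power claim of this kind, here is a minimal Python sketch. It is not the R script used for this post; it assumes a two-sided test at alpha = .05 and uses the Fisher z approximation for the sampling distribution of a correlation. The function names are made up for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0 (Fisher z approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, atanh(r_hat) is roughly normal with
    # mean atanh(r) and standard deviation 1 / sqrt(n - 3).
    shift = atanh(r) * sqrt(n - 3)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def sample_size_for_power(r, power=0.90, alpha=0.05):
    """Smallest n for which the approximate power reaches the target."""
    n = 4
    while correlation_power(r, n, alpha) < power:
        n += 1
    return n


needed = sample_size_for_power(0.1)      # articles needed for 90% power at r = .1
full_sample_power = correlation_power(0.1, 87649)
```

Under this approximation, roughly a thousand articles already give about 90% power for r = .1, so a sample in the tens of thousands comfortably clears that bar.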

I should add that I limited my sample to those articles published between May 2012 (which is when PLoS started tracking tweets) and January 2015 (in order to allow for at least two years to aggregate citations). The 87,649 remaining articles published in any of the PLoS journals offer the following picture.


There is a small but non-negligible association between impact on twitter (tweets) and impact in the scientific literature (citations): Pearson r = .12, p < .001; Spearman rho = .18, p < .001. This pattern held for nearly every PLoS journal individually as well (all Pearson r ≥ .10 except for PLoS Computational Biology; all Spearman rho ≥ .12 except for PLoS Pathogens). This result is in line with Peoples et al.’s (2016) analysis of twitter activity and citations in the field of ecology.
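The two coefficients reported here answer slightly different questions. Pearson r measures linear association, while Spearman rho is Pearson r computed on ranks, which makes it less sensitive to the heavily skewed counts typical of tweets and citations. The following Python sketch shows both; the counts are made up for illustration and are not PLoS data, and the helper names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up counts for six hypothetical articles (not PLoS data).
tweets = [0, 1, 1, 3, 10, 250]
citations = [2, 0, 5, 4, 9, 12]
r = pearson_r(tweets, citations)
rho = spearman_rho(tweets, citations)
```

A single heavily tweeted article can dominate Pearson r, whereas in the rank-based version it only counts as "the most tweeted article", which is one reason to report both.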

So, twitter might indeed help a bit to promote an article. Does this hold for social media in general? A look at facebook reveals a different picture. The relationship between facebook mentions of an article and its scientific impact is so small as to be practically negligible: Pearson r = .03, p < .001; Spearman rho = .06, p < .001. This pattern of only a tiny association between facebook mentions and citations held for every single PLoS journal (Pearson r ≤ .09, Spearman rho ≤ .08).


In conclusion, twitter can be used for promoting your scientific work in an age of increased competition for scientific reading time (Renear & Palmer, 2009). Facebook, on the other hand, can be used for procrastinating.

Wanna explore the data set yourself? I made a web-app which you can use in RStudio or in your web browser. Have fun with it and tell me what you find.

Why does music training increase intelligence?

We know that music training causes intelligence to increase, but why? In this post I 1) propose a new theory, and 2) falsify it immediately. Given that this particular combination of activities is unpublishable in any academic journal, I invite you to read the whole story here (in under 500 words).

1) Proposing the ISAML

Incredible but true, music lessons improve the one thing that determines why people who are good on one task tend to be better on another task as well: IQ (Schellenberg, 2004; Kaviani et al., 2013; see coverage in previous blog post). Curiously, I have never seen an explanation for why music training would benefit intelligence.

I propose the Improved Sustained Attention through Music Lessons hypothesis (ISAML). The ISAML hypothesis claims that all tasks related to intelligence are dependent to some degree on people attending to them continuously. This ability is called sustained attention. A lapse of attention, caused by insufficient sustained attention, leads to suboptimal answers on IQ tests. Given that music is related to the structuring of attention (Boltz & Jones, 1989) and removes attentional ‘gaps’ (Olivers & Nieuwenhuis, 2005; see coverage in previous blog post), music training might help in attentional control and, thus, in increasing sustained attention. This in turn might have a positive impact on intelligence, see boxes and arrows in Figure 1.


Figure 1. The Improved Sustained Attention through Music Lessons hypothesis (ISAML) in a nutshell. Arrows represent positive associations.

The ISAML does not predict that intelligence is the same as sustained attention. Instead, it predicts that:

a) music training increases sustained attention

b) sustained attention is associated with intelligence

c) music training increases intelligence

2) Evaluating the ISAML

Prediction c is already supported, see above. Does anyone know something about prediction b? Here, I shall evaluate prediction a: does music training increase sustained attention? So far, the evidence looks inconclusive (Carey et al., 2015). Therefore, I will turn to a data set of my own which I gathered in a project together with Suzanne R. Jongman (Kunert & Jongman, in press).

We used a standard test of sustained attention: the digit discrimination test (Jongman et al., 2015). Participants had the mind-bogglingly boring task of clicking a button every time they saw a zero while watching one single digit after another on the screen for ten minutes. A low sustained attention ability is thought to be reflected by worse performance (higher reaction time to the digit zero) at the end of the testing session compared to the beginning, or by overall high reaction times.

Unfortunately for the ISAML, it turns out that there is absolutely no relation between musical training and sustained attention. As you can see in Figure 2A, the reaction time (logged) decrement between the first and last half of reactions to zeroes is not related to musical training years [Pearson r = .03, N = 362, p = .61, 95% CI = [-.076; .129], JZS BF01 with default prior = 7.59; Spearman rho = .05]. The same holds for mean reaction time (logged), see Figure 2B [Pearson r = .02, N = 362, p = .74, 95% CI = [-.086; .120], JZS BF01 = 8.181; Spearman rho = .03].
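Confidence intervals for a correlation like the ones above can be approximated with the Fisher z transformation. The Python sketch below is an illustration under that assumption, not the analysis code behind the reported numbers, and `fisher_ci` is a made-up name. Because the inputs here are the rounded r values, the result will differ slightly from intervals computed on unrounded data.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation (Fisher z method)."""
    z = atanh(r)                      # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return tanh(z - crit * se), tanh(z + crit * se)


low, high = fisher_ci(0.03, 362)      # rounded r and N as reported in the text
```

With N = 362 the interval spans roughly .10 on either side of r, so any correlation this close to zero has an interval that comfortably includes zero.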


Figure 2. The correlation between two different measures of sustained attention (vertical axes) and musical training (horizontal axes) in a sample of 362 participants. High values on vertical axes represent low sustained attention, i.e. the ISAML predicts a negative correlation coefficient. Neither correlation is statistically significant. Light grey robust regression lines show an iterated least squares regression which reduces the influence of unusual data points.
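The caption mentions an iterated least squares regression that downweights unusual data points. One common way to do this is iteratively reweighted least squares with Huber weights, sketched below in Python. This is an assumption about the general technique, not a reproduction of the routine used for the figure; the tuning constant 1.345 and the MAD-based scale estimate are conventional defaults, and all names are hypothetical.

```python
from statistics import median


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def huber_irls(x, y, k=1.345, iterations=50):
    """Robust line fit: refit repeatedly, shrinking the weight of large residuals."""
    weights = [1.0] * len(x)
    intercept, slope = weighted_line(x, y, weights)   # start from ordinary least squares
    for _ in range(iterations):
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust residual scale: median absolute residual, rescaled for normal data.
        scale = median(abs(e) for e in residuals) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 for small residuals, shrinking for large ones.
        weights = [1.0 if abs(e) <= k * scale else k * scale / abs(e) for e in residuals]
        intercept, slope = weighted_line(x, y, weights)
    return intercept, slope


# Made-up data: a clean line y = 1 + 2x with one wild point at the end.
xs = list(range(10))
ys = [1 + 2 * xi for xi in xs]
ys[-1] = 100
ols_intercept, ols_slope = weighted_line(xs, ys, [1.0] * len(xs))
robust_intercept, robust_slope = huber_irls(xs, ys)
```

In this toy example the ordinary least squares slope is pulled far away from 2 by the single outlier, while the reweighted fit stays close to the line followed by the other nine points.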

3) Conclusion

Why on earth is musical training related to IQ increases? I have no idea. The ISAML is not a good account for the intelligence boost provided by music lessons.

The curious effect of a musical rhythm on us

Do you know the feeling of a musical piece moving you? What is this feeling? One common answer by psychological researchers is that what you feel is your attention moving in sync with the music. In a new paper I show that this explanation is mistaken.

Watch the start of the following video and observe carefully what is happening in the first minute or so (you may stop it after that).

Noticed something? Nearly everyone in the audience moved to the rhythm, clapping, moving the head etc. And you? Did you move? I guess not. You probably looked carefully at what people were doing instead. Your reaction illustrates nicely how musical rhythms affect people according to psychological researchers. One very influential theory claims that your attention moves up and down in sync with the rhythm. It treats the rhythm like you treated it. It simply ignores the fact that most people love moving to the rhythm.

The theory: a rhythm moves your attention

Sometimes we have gaps of attention. Sometimes we manage to concentrate really well for a brief moment. A very influential theory, which has been supported in various experiments, claims that these fluctuations in attention are synced to the rhythm when hearing music. Attention is up at rhythmically salient moments, e.g., the first beat in each bar. And attention is down during rhythmically unimportant moments, e.g., off-beat moments.

This makes intuitive sense. Important tones, e.g., those determining the harmonic key of a music piece, tend to occur at rhythmically salient moments. Looking at language rhythm reveals a similar picture. Stressed syllables are important for understanding language and signal moments of rhythmic salience. It makes sense to attend well during moments which include important information.

The test: faster decisions and better learning?

I, together with Suzanne Jongman, asked whether attention really is up at rhythmically salient moments. If so, people should make decisions faster when a background rhythm has a moment of rhythmic importance. As if people briefly concentrated better at that moment. This is indeed what we found. People are faster at judging whether a few letters on the screen are a real word or not, if the letters are shown near a salient moment of a background rhythm, compared to another moment.

However, we went further. People should also learn new words better if they are shown near a rhythmically salient moment. This turned out not to be the case. Whether people have to memorise a new word at a moment when their attention is allegedly up or down (according to a background rhythm) does not matter. Learning is just as good.

What is more, even those people who react really strongly to the background rhythm in terms of speeding up a decision at a rhythmically salient moment (red square in Figure below), even those people do not learn new words better at the same time as they speed up.

It’s as if the speed-up of decisions is unrelated to the learning of new words. That’s weird because both tasks are known to be affected by attention. This makes us doubt that a rhythm affects attention. What could it affect instead?


Figure 1. Every dot is one of 60 participants. How much a background rhythm sped up responses is shown horizontally. How much the same rhythm, at the same time, facilitated pseudoword memorisation is shown on the vertical axis. The red square singles out the people who were most affected by the rhythm in terms of their decision speed. Notice that, at the same time, their learning is unaffected by the rhythm.

The conclusion: a rhythm does not move your attention, it moves your muscles

To our own surprise, a musical rhythm appears not to affect how your attention moves up and down, when your attentional lapses happen, or when you can concentrate well. Instead, it simply appears to affect how fast you can press a button, e.g., when indicating a decision whether a few letters form a word or not.

Thinking back to the video at the start, I guess this just means that people love moving to the rhythm because the urge to do so is a direct consequence of understanding a rhythm. Somewhere in the auditory and motor parts of the brain, rhythm processing happens. However, this has nothing to do with attention. This is why learning a new word shown on the screen – a task without an auditory or motor component – is not affected by a background rhythm.

The paper: the high point of my career

You may read all of this yourself in the paper (here). I will have to admit that in many ways this paper is how I like to see science done and, so, I will shamelessly tell you of its merits. The paper is not too long (7,500 words) but includes no less than 4 experiments with no less than 60 participants each. Each experiment tests the research question individually. However, the experiments build on each other in such a way that their combination makes the overall paper stronger than any experiment individually ever could.

In terms of analyses, we put in everything we could think of. All analyses are Bayesian (subjective Bayes factor) and frequentist (p-values). We report hypothesis testing analyses (Bayes factor, p-values) and parameter estimation analyses (effect sizes, confidence intervals, credible intervals). If you can think of yet another analysis, go for it. We publish the raw data and analysis code alongside the article.

The most important reason why this paper represents my favoured approach to science, though, is because it actually tests a theory. A theory I and my co-author truly believed in. A theory with a more than 30-year history. With a varied supporting literature. With a computational model implementation. With more than 800 citations for two key papers. With, in short, everything you could wish to see in a good theory.

And we falsified it! Instead of thinking of the learning task as ‘insensitive’ or as ‘a failed experiment’, we dug deeper and couldn’t help but conclude that the attention theory of rhythm perception is probably wrong. We actually learned something from our data!

PS: no-one is perfect and neither is this paper. I wish we had pre-registered at least one of the experiments. I also wish the paper was open access (see a free copy here). There is room for improvement, as always.

Kunert R, & Jongman SR (2017). Entrainment to an auditory signal: Is attention involved? Journal of Experimental Psychology: General, 146(1), 77-88. PMID: 28054814

How to write a nature-style review

Nature Reviews Neuroscience is one of the foremost journals in neuroscience. What do its articles look like? How have they developed? This blog post provides answers which might guide you in writing your own reviews.

Read more than you used to

Reviews in Nature Reviews Neuroscience cover more and more ground. Ten years ago, 93 references were the norm. Now, reviews average 150 references. This might be an example of scientific reports in general having to contain more and more information so as not to be labelled ‘premature’, ‘incomplete’, or ‘insufficient’ (Vale, 2015).


Reviews in NRN include more and more references.

Concentrate on the most recent literature

Nature Reviews Neuroscience is not the outlet for your history of neuroscience review. Only 22% of cited articles are more than 10 years old. A full 17% of cited articles were published a mere two years prior to the review being published, i.e. something like one year before the first draft of the review reached Nature Reviews Neuroscience (assuming a fast review process of 1 year).
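Percentages like these come from a simple calculation: subtract each reference's publication year from the review's publication year and count how many ages fall in a given range. A minimal Python sketch follows; the publication years are made up for a hypothetical review, and the function name is illustrative only.

```python
def reference_age_shares(reference_years, review_year):
    """Share of references older than ten years and share exactly two years old."""
    ages = [review_year - year for year in reference_years]
    older_than_ten = sum(age > 10 for age in ages) / len(ages)
    exactly_two = sum(age == 2 for age in ages) / len(ages)
    return older_than_ten, exactly_two


# Made-up publication years for a hypothetical review published in 2016.
example_years = [1995, 2004, 2009, 2012, 2013, 2014, 2014, 2014, 2015, 2015]
old_share, two_year_share = reference_age_shares(example_years, 2016)
```

Running the same calculation on your own draft's reference list gives a quick sense of how it compares with the figures reported here.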


Focus on recent findings. Ignore historical contexts.

If at all, give a historical background early on in your review.

References are given in order of first presentation in Nature Reviews Neuroscience. Dividing this order into quarters allows us to reveal the age distribution of references in the quarter of the review where they are first mentioned. As can be seen in the figure below, the pressure for recency is less severe in the first quarter of your review. It increases thereafter. So, if you want to take a risk and provide a historical context to your review, do so early on.


Ignore historical contexts, especially later in your review. Q = quarter in which reference first mentioned

The change in reference age distributions of the different quarters of reviews is not easily visible. Therefore, I fit a logarithmic model to the distributions (notice dotted line in Figure above) and used its parameter estimates as a representation of how ‘historical’ references are. Of course, the average reference is not historical, hence the negative values. But notice how the parameter estimates become more negative in progressive quarters of the reviews: history belongs at the beginning of a review.
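The post does not spell out the exact form of the logarithmic model, so the following Python sketch rests on an assumption: the share of references at a given age is modelled as a + b * ln(age) and fitted by ordinary least squares, with the slope b read as the 'how historical' estimate. The data are synthetic and the function name is hypothetical.

```python
from math import log


def fit_log_model(ages, shares):
    """Least-squares fit of share = a + b * ln(age); ages must be positive."""
    xs = [log(age) for age in ages]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(shares) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, shares))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic distribution in which older references become steadily rarer.
ages = [1, 2, 4, 8, 16]
shares = [0.30 - 0.05 * log(age) for age in ages]
intercept, slope = fit_log_model(ages, shares)
```

Under this reading, a more negative slope means that the share of references drops off faster with age, i.e. that a section of the review leans more heavily on recent work.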


Ignore historical contexts, especially later in your review: the modeling outcome.

Now, find a topic and write that Nature Reviews Neuroscience review. What are you waiting for?

All the R-code, including the R-markdown script used to generate this blog post, is available at

Discovering a glaring error in a research paper – a personal account

New York Magazine has published a great article about how grad student Steven Ludeke tried to correct mistakes in the research of Pete Hatemi and Brad Verhulst. Overall, Ludeke summarises his experience as ‘not recommendable’. Back in my undergraduate years I spotted an error in an article by David DeMatteo and did little to correct it. Why?

Christian Bale playing a non-incarcerated American Psycho.

David DeMatteo, assistant professor in Psychology at Drexel University, investigates psychopathy. In 2010, I was a lowly undergraduate student and noticed a glaring mistake in one of his top ten publications which has now been cited 50 times according to Google Scholar.

The error

The study investigated the characteristics of psychopaths who live among us, the non-incarcerated population. How do these psychopaths manage to avoid prison? DeMatteo et al. (2006) measured their psychopathy in terms of personality features and in terms of overt behaviours. ‘Participants exhibited the core personality features of psychopathy (Factor 1) to a greater extent than the core behavioral features of psychopathy (Factor 2). This finding may be helpful in explaining why many of the study participants, despite having elevated levels of psychopathic characteristics, have had no prior involvement with the criminal justice system.’ (p. 142)

The glaring mistake in this publication is that Factor 2 scores at 7.1 (the behavioural features of psychopathy) are actually higher than the Factor 1 scores at 5.2 (the personality features of psychopathy). The numbers tell the exact opposite story to the words.


The error in short. The numbers obviously do not match up with the statement.

The numbers are given twice in the paper making a typo unlikely (p. 138 and p. 139). Adjusting the scores for the maxima of the scales that they are from (factor 1 x/x_max = 0.325 < factor 2 x/x_max=0.394) or the sample maximum (factor 1 x/x_max_obtained = 0.433 < factor 2 x/x_max_obtained = 0.44375) makes no difference. No outlier rejection is mentioned in the paper.
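The adjustment described above is plain division, and it can be retraced in a few lines of Python. The scale maxima (16 and 18) and the sample maxima (12 and 16) used below are not stated in this post; they are assumptions inferred from the reported ratios, so treat them as illustrative.

```python
# Mean factor scores as reported in the post.
factor_means = {"factor1": 5.2, "factor2": 7.1}

# ASSUMED maxima, inferred from the ratios quoted in the text (not stated there).
scale_max = {"factor1": 16, "factor2": 18}
sample_max = {"factor1": 12, "factor2": 16}


def proportion_of_max(means, maxima):
    """Express each mean as a proportion of the given maximum."""
    return {name: means[name] / maxima[name] for name in means}


by_scale = proportion_of_max(factor_means, scale_max)
by_sample = proportion_of_max(factor_means, sample_max)

# Under either adjustment, Factor 1 stays below Factor 2.
factor1_lower_everywhere = (
    by_scale["factor1"] < by_scale["factor2"]
    and by_sample["factor1"] < by_sample["factor2"]
)
```

With these assumed maxima the four proportions match the values quoted in the text, and the ordering of the two factors does not flip.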

In sum, it appears as if DeMatteo and his co-authors interpret their numbers in a way which makes intuitive sense but which is in direct contradiction to their own data. When researchers disagree with their own data, we have a real problem.

The reaction

1) Self doubt. I consulted with my professor (the late Paddy O’Donnel) who confirmed the glaring mistake.

2) Contact the author. I contacted DeMatteo in 2010 but his e-mail response was evasive and did nothing to resolve the issue. I have contacted him again, inviting him to react to this post.

3) Check others’ reactions. I found three publications which cited DeMatteo et al.’s article (Rucevic, 2010; Gao & Raine, 2010; Ullrich et al., 2008) and simply ignored the contradictory numbers. They went with the story that community dwelling psychopaths show psychopathic personalities more than psychopathic behaviours, even though the data in the article favours the exact opposite conclusion.

4) Realising my predicament. At this point I realised my options. Either I pursued this full force while finishing a degree and, afterwards, moving on to my Master’s in a different country. Or I let it go. I had a suspicion which Ludeke’s story in New York Magazine confirmed: in these situations one has much to lose and little to gain. Pursuing a mistake in the research literature is ‘clearly a bad choice’ according to Ludeke.

The current situation

And now this blog post detailing my experience. Why? Well, on the one hand, I have very little to lose from a disagreement with DeMatteo as I certainly don’t want a career in law psychology research and perhaps not even in research in general. The balance went from ‘little to gain, much to lose’ to ‘little to gain, little to lose’. On the other hand, following my recent blog posts and article (Kunert, 2016) about the replication crisis in Psychology, I have come to the conclusion that science cynicism is not the way forward. So, I finally went fully transparent.

I am not particularly happy with how I handled this whole affair. I have zero documentation of my contact with DeMatteo. So, expect his word to stand against mine soon. I also feel I should have taken a risk earlier in exposing this. But then, I used to be passionate about science and wanted a career in it. I didn’t want to make enemies before I had even started my Master’s degree.

In short, only once I stopped caring about my career in science did I find the space to care about science itself.

Update 16/11/2016: corrected numerical typo in sentence beginning ‘Adjusting the scores for the maxima…’ pointed out to me by Tom Foulsham via twitter (@TomFoulsh).

How to excel at academic conferences in 5 steps

Academic conferences have been the biggest joy of my PhD and so I want to share with others how to excel at this academic tradition. 


The author (second from right, with can) at his first music cognition conference (SMPC 2013 in Toronto) which – despite appearances – he attended by himself.

1) Socialising

A conference is not all about getting to know facts. It’s all about getting to know people. Go to a conference where you feel you can approach people. Attend every single preparatory excursion/workshop/symposium, every social event, every networking lunch. Sit at a table where you know no-one at all. Talk to the person next to you in every queue. At first, you will have only tiny chats. Later, these first contacts can develop over lunch. Still later you publish a paper together (Kunert & Slevc, 2015). The peer-review process might make you think that academics are awful know-it-alls. At a conference you will discover that they are actually interesting, intelligent and sociable people. Meet them!

2) Honesty

The conference bar is a mythical place where researchers talk about their actual findings, their actual doubts, their actual thoughts. If you want to get rid of the nagging feeling that you are an academic failure, talk to researchers at a conference. You will see that the published literature is a very polished version of what is really going on in research groups. It will help you put your own findings into perspective.

3) Openness

You can get even more out of a conference if you let go of your fear of being scooped and answer other people’s honesty with being open about what you do. I personally felt somewhat isolated with my research project at my institute. Conferences were more or less the only place to meet people with shared academic interests. Being open there didn’t just improve the bond with other academics, it led to concrete improvements of my research (Kunert et al., 2016).

4) Tourism

Get out of the conference hotel and explore the city. More often than not conferences are held in suspiciously nice places. Come a few days early, get rid of your jet-lag while exploring the local sights. Stay a few days longer and gather your thoughts before heading back to normal life. You might never again have an excuse to go to so many nice places so easily.

5) Spontaneity

The most important answer is yes. You might get asked for all sorts of things to do during the conference. Just say yes. I attended the Grand Ole Opry in Nashville. I found myself in a jacuzzi in Redwood, CA. I attended a transvestite bar in Toronto. All with people I barely knew. All with little to no information on what the invitation entailed. Just say yes and see what happens.

It might sound terribly intimidating to go to an academic conference if you have just started your PhD. In this case a national or student-only conference might be a good first step into the academic conference tradition.

Conferences are the absolute highlight of academia. Don’t miss out on them.

How to test for music skills

In a new article I evaluate a recently developed test for music listening skills. To my great surprise the test behaves very well. This could open the path to better understand the psychology underlying music listening. Why am I surprised?

I got my first taste of how difficult it is to replicate published scientific results during my very first empirical study as an undergraduate (eventually published as Kunert & Scheepers, 2014). Back then, I used a 25-minute-long dyslexia screening test to distinguish dyslexic participants from non-dyslexic participants (the Lucid Adult Dyslexia Screener). Even though previous studies had suggested an excellent sensitivity (identifying actually dyslexic readers as dyslexic) of 90% and a moderate to excellent specificity (identifying actually non-dyslexic readers as non-dyslexic) of 66% – 91% (Singleton et al., 2009; Nichols et al., 2009), my own values were worse at 61% sensitivity and 65% specificity. In other words, the dyslexia test only flagged someone with an official dyslexia diagnosis in 11/18 cases and only categorised someone without known reading problems as non-dyslexic in 13/20 cases. The dyslexia screener didn’t perform exactly as suggested by the published literature and I have been suspicious of ability tests ever since.
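Sensitivity and specificity follow directly from the counts given above: 11 of 18 diagnosed participants were flagged, and 13 of 20 participants without known reading problems were cleared. A minimal Python sketch of that arithmetic, with made-up function names:

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of actually dyslexic readers that the test flags as dyslexic."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of actually non-dyslexic readers that the test clears."""
    return true_negatives / (true_negatives + false_positives)


# Counts from the text: 11 of 18 flagged, 13 of 20 cleared.
sens = sensitivity(true_positives=11, false_negatives=18 - 11)
spec = specificity(true_negatives=13, false_positives=20 - 13)
```

Rounded to whole percentages, these reproduce the 61% and 65% quoted in the text. With only 18 and 20 participants per group, each single misclassification shifts the estimate by about five percentage points.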

Five years later I acquired data to look at how music can influence language processing (Kunert et al., 2016) and added a newly proposed music ability measure called PROMS (Law & Zentner, 2012) to the experimental sessions to see how bad it is. I really thought I would see the music listening ability scores derived from the PROMS to be conflated with things which on the face of it have little to do with music (digit span, i.e. the ability to repeat increasingly longer digit sequences), because previous music ability tests had that problem. Similarly, I expected people with better music training to not have that much better PROMS scores. In other words, I expected the PROMS to perform worse than suggested by the people who developed the test, in line with my negative experience with the dyslexia screener.

It then came as a surprise to see that PROMS scores were hardly associated with the ability to repeat increasingly longer digit sequences (either in the same order, i.e. forward digit span, or in reverse order, i.e. backward digit span), see Figure 1A and 1B. This makes the PROMS scores surprisingly robust against variation in working memory, as you would expect from a good music ability test.


Figure 1. How the brief PROMS (vertical axis) correlates with various validity measures (horizontal axis). Each dot is one participant. Lines are best fit lines with equal weights for each participant (dark) or downweighting unusual participants (light). Inserted correlation values reflect dark line (Pearson r) or a rank-order equivalent of it which is robust to outliers (Spearman rho). Correlation values range from -1 to +1.

The second surprise came when musical training was actually associated with better music skill scores, as one would expect for a good test of music skills, see Figures 1C, 1D, 1E, and 1H. To top it off, the PROMS score was also correlated with the music task performance in the experiment looking at how language influences music processing. This association between the PROMS and musical task accuracy was visible in two independent samples, see Figures 1F and 1G, which is truly surprising because the music task targets harmonic music perception which is not directly tested by the PROMS.

To conclude, I can honestly recommend the PROMS to music researchers. To my surprise it is a good test which could truly tell us something about where music skills actually come from. I’m glad that this time I have been proven wrong regarding my suspicions about ability tests.

— — —

A critical comment on “Contextual sensitivity in scientific reproducibility”

Psychological science is surprisingly difficult to replicate (Open Science Collaboration, 2015). Researchers are desperate to find out why. A new study in the prestigious journal PNAS (Van Bavel et al., 2016) claims that unknown contextual factors of psychological phenomena (“hidden moderators”) are to blame. The more an effect is sensitive to unknown contextual factors, the less likely a successful replication is. In this blog post I will lay out why I am not convinced by this argument.

Before I start I should say that I really appreciate that the authors of this paper make their point with reference to data and analyses thereof. I believe that this is a big improvement on the state of the replicability debate of a few years back when it was dominated by less substantiated opinions. Moreover, they share their key data and some analysis code, following good scientific practice. Still, I am not convinced by their argument. Here’s why:

1) No full engagement with the opposite side of the argument

Van Bavel et al.’s (2016) suggested influence of replication contexts on replication success cannot explain the following patterns in the data set they used (Open Science Collaboration, 2015):

a) replication effect sizes are mostly lower than original effect sizes. Effects might well “vary by [replication] context” (p. 2) but why the consistent reduction in effect size when replicating an effect?

b) internal conceptual replications are not related to independent replication success (Kunert, 2016). This goes directly against Van Bavel et al.’s (2016) suggestion that “conceptual replications can even improve the probability of successful replications” (p. 5).

c) why are most original effects just barely statistically significant (see previous blog post)?

I believe that all three patterns point to some combination of questionable research practices affecting the original studies. Nothing in Van Bavel et al.’s (2016) article manages to convince me otherwise.

2) The central result completely depends on how you define ‘replication success’

The central claim of the article is based on the correlation between one measure of replication success (subjective judgment by replication team of whether replication was successful) and one measure of the contextual sensitivity of a replicated effect. While the strength of the association (r = -.23) is statistically significant (p = .024), it doesn’t actually provide convincing evidence for either the null or the alternative hypothesis according to a standard Bayesian JZS correlation test (BF01 = 1). [For all analyses: R-code below.]

Moreover, another measure of replication success (reduction of effect size between original and replication study) is so weakly correlated with the contextual sensitivity variable (r = -.01) as to provide strong evidence for a lack of association between contextual sensitivity and replication success (BF01 = 12; notice that even the sign of the correlation is the opposite of what Van Bavel et al.’s (2016) account predicts).


[Update: The corresponding values for the other measures of replication success are: replication p < .05 (r = -0.18; p = .0721; BF01 = 2.5), original effect size in 95%CI of replication effect size (r = -.3, p = .0032, BF10 = 6). I could not locate the data column for whether the meta-analytic effect size is different from zero.]

3) The contextual sensitivity variable could be confounded

How do we know which original effects were plagued by hidden moderators (i.e. by unknown context sensitivity) if, well, these moderators are hidden? Three of the authors of the article simply rated all replicated studies for contextual sensitivity without knowing each study’s replication status (but after the replication success of each study was known in general). The authors provide evidence for the ratings to be reliable but no one knows whether they are valid.

For example, the raters tried not to be influenced by ‘whether the specific replication attempt in question would succeed’ (p. 2). Still, all raters knew they would benefit (in the form of a prestigious publication) from a significant association between their ratings and replication success. How do we know that the ratings do not simply reflect some sort of implicit replicability doubt? From another PNAS study (Dreber et al., 2015) we know that scientists can predict replication success before a replication study is run.

Revealing hidden moderators

My problem with the contextual sensitivity account claiming that unknown moderators are to blame for replication failures is not so much that it is an unlikely explanation. I agree with Van Bavel et al. (2016) that some psychological phenomena are more sensitive to replication contexts than others. I would equally welcome it if scientific authors were more cautious in generalising their results.

My problem is that this account is so general as to be nearly unfalsifiable, and an unfalsifiable account is scientifically useless. Somehow unknown moderators always get invoked once a replication attempt has failed. All sorts of wild claims could retrospectively be declared true within the context of the original finding.

In short: a convincing claim that contextual factors are to blame for replication failures needs to reveal the crucial replication contexts and then show that they indeed influence replication success. The proof of the unknown pudding is in the eating.

Yet more evidence for questionable research practices in original studies of the Reproducibility Project: Psychology

The replicability of psychological research is surprisingly low. Why? In this blog post I present new evidence showing that questionable research practices contributed to failures to replicate psychological effects.

Quick recap. A recent publication in Science claims that only around 40% of psychological findings are replicable, based on 100 replication attempts in the Reproducibility Project: Psychology (Open Science Collaboration, 2015). A few months later, a critical commentary in the same journal made all sorts of claims, including that the surprisingly low 40% replication success rate is due to replications having been unfaithful to the original studies’ methods (Gilbert et al., 2016). A little while later, I published an article in Psychonomic Bulletin & Review re-analysing the data of the 100 replication teams (Kunert, 2016). I found evidence for questionable research practices being at the heart of failures to replicate, rather than the unfaithfulness of replications to original methods.

However, my previous re-analysis depended on replication teams having done good work. In this blog post I will show that even when just looking at the original studies in the Reproducibility Project: Psychology one cannot fail to notice that questionable research practices were employed by the original discoverers of the effects which often failed to replicate. The reanalysis I will present here is based on the caliper test introduced by Gerber and colleagues (Gerber & Malhotra, 2008; Gerber et al., 2010).

The idea of the caliper test is simple. The research community has decided that an entirely arbitrary threshold of p = 0.05 distinguishes between effects which might just be due to chance (p > 0.05) and effects which are more likely due to something other than chance (p < 0.05). If researchers want to game the system they slightly rig their methods and analyses to push their p-values just below the arbitrary border between ‘statistical fluke’ and ‘interesting effect’. Alternatively, they just don’t publish anything which came up p > 0.05. Such behaviour should lead to an implausibly large number of p-values just below 0.05 compared to just above 0.05.
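To make the counting step concrete, here is a minimal Python sketch (not the R-code used for this post). It assumes the threshold sits at z = 1.96 and that a caliper of, say, 10 % spans 1.96 × 0.9 to 1.96 × 1.1. The function names and the example z-values are invented purely for illustration.

```python
CRITICAL_Z = 1.96  # z-value corresponding to a two-sided p = .05


def caliper_bounds(width, critical_z=CRITICAL_Z):
    """Return the (lower, upper) z-bounds of a caliper, e.g. width=0.10 for 10 %."""
    return critical_z * (1 - width), critical_z * (1 + width)


def caliper_counts(z_values, width, critical_z=CRITICAL_Z):
    """Count z-values just below and just over the significance threshold."""
    lower, upper = caliper_bounds(width, critical_z)
    below = sum(lower < z < critical_z for z in z_values)
    over = sum(critical_z < z < upper for z in z_values)
    return below, over


# Hypothetical z-values, made up for this example (not the project data).
example_z = [0.40, 1.80, 1.90, 2.00, 2.05, 2.10, 2.50, 3.20]
print(caliper_counts(example_z, width=0.10))
```

Without questionable research practices, the two counts should be roughly equal. A large surplus just over the threshold is the warning sign.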

The figure below shows the data of the Reproducibility Project: Psychology. On the horizontal axis I plot z-values which are related to p-values. The higher the z-value the lower the p-value. On the vertical axis I just show how many z-values I found in each range. The dashed vertical line is the arbitrary threshold between p < .05 (significant effects on the right) and p > .05 (non-significant effects on the left).
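For readers who want to see how z-values and p-values relate, a small Python sketch follows. It assumes two-sided p-values from a standard normal distribution, which is what places the dashed line at roughly z = 1.96. The function name is my own.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


for z in (1.0, 1.96, 3.0):
    print(f"z = {z:.2f} -> p = {two_sided_p(z):.4f}")
```

A z-value of 1.96 lands almost exactly on p = .05, and larger z-values give smaller p-values.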


The independent replications in blue show many z-values left of the dashed line, i.e. replication attempts which were unsuccessful. Otherwise the blue distribution is relatively smooth. There is certainly nothing fishy going on around the arbitrary p = 0.05 threshold. The blue curve looks very much like what I would expect psychological research to be if questionable research practices did not exist.

However, the story is completely different for the green distribution representing the original effects. Just right of the arbitrary p = 0.05 threshold there is a surprising clustering of z-values. It’s as if the human mind magically leads to effects which are just about significant rather than just about not significant. This bump immediately to the right of the dashed line is a clear sign that original authors used questionable research practices. This behaviour renders psychological research unreplicable.

For the expert reader, the formal analysis of the caliper test is shown in the table below using both a Bayesian analysis and a classical frequentist analysis. The conclusion is clear. There is no strong evidence for replication studies failing the caliper test, indicating that questionable research practices were probably not employed. The original studies do not pass the caliper test, indicating that questionable research practices were employed.


| Caliper | Studies | Over caliper (sign.) | Below caliper (non-sign.) | Binomial test | Bayesian proportion test | Posterior median [95% credible interval]¹ |
| --- | --- | --- | --- | --- | --- | --- |
| 10 % (1.76 < z < 1.96 versus 1.96 < z < 2.16) | Original | 9 | 4 | p = 0.267 | BF10 = 1.09 | 0.53 [-0.36; 1.55] |
| | Replication | 3 | 2 | p = 1 | BF01 = 1.30 | 0.18 [-1.00; 1.45] |
| 15 % (1.67 < z < 1.96 versus 1.96 < z < 2.25) | Original | 17 | 4 | p = 0.007 | BF10 = 12.9 | 1.07 [0.24; 2.08] |
| | Replication | 4 | 5 | p = 1 | BF01 = 1.54 | -0.13 [-1.18; 0.87] |
| 20 % (1.57 < z < 1.96 versus 1.96 < z < 2.35) | Original | 29 | 4 | p < 0.001 | BF10 = 2813 | 1.59 [0.79; 2.58] |
| | Replication | 5 | 5 | p = 1 | BF01 = 1.64 | 0.00 [-0.99; 0.98] |

¹ Based on 100,000 draws from the posterior distribution of log odds.
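The binomial test column can be checked by hand. The following Python sketch (not the original R-code) assumes a two-sided exact binomial test against a 50/50 split between ‘over caliper’ and ‘below caliper’. The function name is my own.

```python
from math import comb


def binomial_two_sided_p(over, below, prob=0.5):
    """Exact two-sided binomial p-value for `over` successes in over + below trials.

    Sums the probabilities of all outcomes that are at most as likely as the
    observed one under the null hypothesis.
    """
    n = over + below
    pmf = [comb(n, k) * prob**k * (1 - prob) ** (n - k) for k in range(n + 1)]
    observed = pmf[over]
    # Small tolerance so that equally likely outcomes are not lost to rounding.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-7)))


# Counts taken from the table: (over caliper, below caliper)
for label, over, below in [("Original, 10 %", 9, 4),
                           ("Original, 15 %", 17, 4),
                           ("Original, 20 %", 29, 4),
                           ("Replication, 20 %", 5, 5)]:
    print(f"{label}: p = {binomial_two_sided_p(over, below):.3f}")
```

Under these assumptions the sketch reproduces the p-values in the table. The Bayesian columns depend on a prior that is not described in this post, so they are not recomputed here.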


As far as I know, this is the first analysis showing that data from the original studies of the Reproducibility Project: Psychology point to questionable research practices [I have since been made aware of others, see this comment below]. Instead of sloppy science on the part of independent replication teams, this analysis rather points to original investigators employing questionable research practices. This alone could explain the surprisingly low replication rates in psychology.

Psychology failing the caliper test is by no means a new insight. Huge text-mining analyses have shown that psychology as a whole tends to fail the caliper test (Kühberger et al., 2013; Head et al., 2015). The analysis I have presented here links this result to replicability. If a research field employs questionable research practices (as indicated by the caliper test) then it can no longer claim to deliver insights which stand the replication test (as indicated by the Reproducibility Project: Psychology).

It is time to get rid of questionable research practices. There are enough ideas for how to do so (e.g., Asendorpf et al., 2013; Ioannidis, Munafò, Fusar-Poli, Nosek, & Lakens, 2014). The Reproducibility Project: Psychology shows why there is no time to waste: it is currently very difficult to distinguish an interesting psychological effect from a statistical fluke. I doubt that this state of affairs is what psychological researchers get paid for.

PS: full R-code for recreating all analyses and figures is posted below. If you find mistakes please let me know.

PPS: I am indebted to Jelte Wicherts for pointing me to this analysis.

Update 25/4/2015:

I adjusted the text to clarify that the caliper test cannot distinguish between many different questionable research practices, following a tweet.

I toned down the language somewhat, following a tweet.

I added reference to Uli Schimmack’s analysis by linking his comment.

10 things I learned while working for the Dutch science funding council (NWO)


The way science is currently funded is very controversial. During the last 6 months I was on a break from my PhD and worked for the organisation funding science in the Netherlands (NWO). These are 10 insights I gained.


1) Belangenverstrengeling

This is the first word I learned when arriving in The Hague. There is an anal obsession with avoiding (any potential for) conflicts of interest (belangenverstrengeling in Dutch). It might not seem a big deal to you, but it is a big deal at NWO.


2) Work ethic

Work e-mails on Sunday evening? Check. Unhealthy deadline obsession? Check. Stories of burn-out diagnoses? Check. In short, I found no evidence for the mythical low work ethic of NWO. My colleagues seemed to be in a perfectly normal, modern, semi-stressful job.


3) Perks

While the career prospects at NWO are somewhat limited, there are some nice perks to working in The Hague including: an affordable, good canteen, free fruit all day, subsidised in-house gym, free massage (unsurprisingly, with a waiting list from hell), free health check … The work atmosphere is, perhaps as a result, quite pleasant.


4) Closed access

Incredible but true, NWO does not have access to the pay-walled research literature it funds. Among other things, I was tasked with checking that research funds were appropriately used. You can imagine that this is challenging if the end-product of science funding (scientific articles) is beyond reach. Given a Herculean push to make all Dutch scientific output open access, this problem will soon be a thing of the past.


5) Peer-review

NWO itself does not generally assess grant proposals in terms of content (except for very small grants). What it does is organise peer-review, very similar to the peer-review of journal articles. My impression is that the peer-review quality is similar if not better at NWO compared to the journals that I have published in. NWO has minimum standards for reviewers and tries to diversify the national/scientific/gender background of the reviewer group assigned to a given grant proposal. I very much doubt that this is the case for most scientific journals.


6) NWO peer-reviewed

NWO itself also applies for funding, usually to national political institutions, businesses, and the EU. Got your grant proposal rejected at NWO? Find comfort in the thought that NWO itself also gets rejected.


7) Funding decisions in the making

In many ways my fears for how it is decided who gets funding were confirmed. Unfortunately, I cannot share more information other than to say: science has a long way to go before focussing rewards on good scientists doing good research.


8) Not funding decisions

I worked on grants which were not tied to some societal challenge, political objective, or business need. The funds I helped distribute are meant to simply facilitate the best science, no matter what that science is (often blue sky research, Vernieuwingsimpuls for people in the know). Approximately 10% of grant proposals receive funding. In other words, bad apples do not get funding. Good apples also do not get funding. Very good apples equally get zero funding. Only outstanding/excellent/superman apples get funding. If you think you are good at what you do, do not apply for grant money through the Vernieuwingsimpuls. It’s a waste of time. If, on the other hand, you haven’t seen someone as excellent as you for a while, then you might stand a chance.


9) Crisis response

Readers of this blog will be well aware that the field of psychology is currently going through something of a revolution related to depressingly low replication rates of influential findings (Open Science Collaboration, 2015; Etz & Vandekerckhove, 2016; Kunert, 2016). To my surprise, NWO wants to play its part in overcoming the replication crisis engulfing science. I arrived at a fortunate moment, presenting my ideas of the problem and potential solutions to NWO. I am glad NWO will set aside money just for replicating findings.


10) No civil servant life for me

Being a junior policy officer at NWO turned out to be more or less the job I thought it would be. It was monotonous, cognitively relaxing, and low on responsibilities. In other words, quite different to doing a PhD. Other PhD students standing at the precipice of a burnout might also want to consider this as an option to get some breathing space. For me, it was just that, but not more than that.

This blog post does not represent the views of my former or current employers. NWO did not endorse this blog post. As far as I know, NWO doesn’t even know that this blog post exists.

