The growing divide between higher and low impact scientific journals

Ten years ago the Public Library of Science started one big lower impact and a series of smaller higher impact journals. Over the years these publication outlets diverged. The growing divide between standard and top journals might mirror wider trends in scholarly publishing.

There are roughly two kinds of journals in the Public Library of Science (PLoS): low impact (IF = 3.06) and higher impact (3.9 < IF < 13.59) journals. There is only one low impact journal, PLoS ONE, which is bigger in terms of output than all the other journals in PLoS combined. Its editorial policy is fundamentally different to the higher impact journals in that it does not require novelty or ‘strong results’. All it requires is methodological soundness.

Comparing PLoS ONE to the other PLoS journals then offers the opportunity to plot the growing divide between ‘high impact’ and ‘standard’ research papers. I will follow the hypothesis that more and more information is required for a publication (Vale, 2015). More information could be mirrored in three values: the number of references, authors, or pages.

And indeed, the higher impact PLoS journal articles have longer and longer reference sections, a rise of 24% from 46 to 57 over the last ten years (Pearson r = .11, Spearman rho = .11), see also my previous blog post for a similar pattern in another high impact journal outside of PLoS.


The lower impact PLoS ONE journal articles, on the other hand, practically did not change in the same period (Pearson r = .01, Spearman rho = -.00).


The diverging pattern between higher and low impact journals can also be observed with the number of authors per article. While in 2006 the average article in a higher impact PLoS journal was authored by 4.7 people, the average article in 2016 was written by 7.8 authors, a steep rise of 68% (Pearson r = .12, Spearman rho = .19).


And again, the low impact PLoS ONE articles do not exhibit the same change, remaining more or less unchanged (Pearson r = .01, Spearman rho = .02).


Finally, the number of pages per article tells the same story of runaway information density in higher impact journals and little to no change in PLoS ONE. Limiting myself to articles published until late november 2014(when lay-out changes complicate the comparison), the average higher impact journal article grew substantially in higher impact journals (Pearson r = .16, Spearman rho = .13) but not in PLoS ONE (Pearson r = .03, Spearman rho = .02).



So, overall, it is true that more and more information is required for a publication in a high impact journal. No similar rise in information density is seen in PLoS ONE. The publication landscape has changed. More effort is now needed for a high impact publication compared to ten years ago.

Wanna explore the data set yourself? I made a web-app which you can use in RStudio or in your web browser. Have fun with it and tell me what you find.

— — —
Vale, R.D. (2015). Accelerating scientific publication in biology Proceedings of the National Academy of Sciences, 112, 13439-13446 DOI: 10.1101/022368

The slowing down of the biggest scientific journal

PLoS ONE started 11 years ago to disruptively change scholarly publishing. By now it is the biggest scientific journal out there. Why has it become so slow?

Many things changed at PLoS ONE over the years, reflecting general trends in how researchers publish their work. For one thing, PLoS ONE grew enourmously. After publishing only 137 articles in its first year, the number of articles published per year peaked in 2013 at 31,522.


However, as shown in the figure above, since then they have declined by nearly a third. In 2016 only 21,655 articles were published in PLoS ONE. The decline could be due to a novel open data policy implemented in March 2014, a slight increase in the cost to publish in October 2015, or a generally more crowded market place for open access mega journals like PloS ONE (Wakeling et al., 2016).

However, it might also be that authors are becoming annoyed with PLoS ONE for getting slower. In its first year, it took 95 days on average to get an article from submission to publication in PLoS ONE. In 2016 it took a full 172 days. This renders PLoS ONE no longer the fastest journal published by PLoS, a title it held for nine years.


The graph below shows the developemtn of PLoS ONE in more detail by plotting each article’s review and publication speed against its publication date, i.e. each blue dot represents one of the 159,000 PLoS ONE articles.


What can explain the increasingly poor publication speed of PLoS ONE? Most people might think it is the sheer volume of manuscripts the journal has to process. Processing more articles might simply slow a journal down. However, this slow down continued until  2015, i.e. beyond the peak in publication output in 2013. Below, I show a more thorough analysis which reiterates this point. The plot shows each article in PLoS ONE in terms of its time from submission to publication and the number of articles published around the same time (30 days before and after). There is a link, for sure (Pearson r = .13, Spearman rho = .15), but it is much weaker than I would have thought.


Moreover, when controlling for publication date via a partial correlation, the pattern above becomes much weaker (partial Pearson r = .05, partial Spearman rho = .11). This suggests that much of PLoS ONE’s slow down is simply due to the passage of time. Perhaps, during this time scientific articles changed, requiring a longer time to evaluate whether they are suitable for the journal.

For example, it might be that articles these days include more information which takes longer to be assessed by scientific peers. More information could be mirrored in three values: the number of authors (information contributors), the reference count (information links), the page count (space for information). However, the number of authors per article has not changed over the years (Pearson r = .01, Spearman rho = .02). Similarly, there is no increase in the length of the reference sections over the years (r = .01; rho = -.00). Finally, while articles have indeed become longer in terms of page count (see graph below), the change is probably just due to a new lay-out in January 2015.


Perhaps, it takes longer to go through peer-review at PLoS ONE these days because modern articles are increasingly complex and interdisciplinary. A very small but reliable correlation between subject categories per article and publication date supports this possibility somewhat, see below. It is possible that PLoS ONE simply finds it increasingly difficult to look for the right experts to assess the scientific validity of an article because articles have become more difficult to pin down in terms of the expertise they require.


Having celebrated its 10 year anniversary, PLoS ONE can be proud to have revolutionized scholarly publishing. However, whether PLoS ONE itself will survive in the new publishing environment it helped to create remains to be seen. The slowing down of its publication process is certainly a sign that PLoS ONE needs to up its game in order to remain competitive.

Wanna explore the data set yourself? I made a web-app which you can use in RStudio or in your web browser. Have fun with it and tell me what you find.

— — —
Wakeling S, Willett P, Creaser C, Fry J, Pinfield S, & Spezi V (2016). Open-Access Mega-Journals: A Bibliometric Profile. PloS one, 11 (11) PMID: 27861511

Do twitter or facebook activity influence scientific impact?

Are scientists smart when they promote their work on social media? Isn’t this a waste of time, time which could otherwise be spent in the lab running experiments? Perhaps not. An analysis of all available articles published by PLoS journals suggests otherwise.

My own twitter activity might best be thought of as learning about science (in the widest sense), while what I do on facebook is really just shameless procrastination. It turns out that this pattern holds more generally and impacts on how to use social media effectively to promote science.

In order to make this claim, I downloaded the twitter and facebook activity associated with every single article published in any journal by the Public Library of Science (PLoS), using this R-script here. PLoS is the open access publisher of the biggest scientific journal PLoS ONE as well as a number of smaller, more high impact journals. The huge amount of data allows me to have a 90% chance of discovering even a small effect (r = .1) if it actually exists.

I should add that I limited my sample to those articles published after May 2012 (which is when PLoS started tracking tweets) and January 2015 (in order to allow for at least two years to aggregate citations). The 87,649 remaining articles published in any of the PLoS journals offer the following picture.


There is a small but non-negligible association between impact on twitter (tweets) and impact in the scientific literature (citations): Pearson r = .12, p < .001; Spearman rho = .18, p < .001. This pattern held for nearly every PLoS journal individually as well (all Pearson r ≥ .10 except for PLoS Computational Biology; all Spearman rho ≥ .12 except for PLoS Pathogens). This result is in line with Peoples et al.’s (2016) analysis of twitter activity and citations in the field of ecology.

So, twitter might indeed help a bit to promote an article. Does this hold for social media in general? A look at facebook reveals a different picture. The relationship between facebook mentions of an article and its scientific impact is so small as to be practically negligible: Pearson r = .03, p < .001; Spearman rho = .06, p < .001. This pattern of only a tiny association between facebook mentions and citations held for every single PLoS journal (Pearson r ≤ .09, Spearman rho ≤ .08).


In conclusion, twitter can be used for promoting your scientific work in an age of increased competition for scientific reading time (Renear & Palmer, 2009). Facebook, on the other hand, can be used for procrastinating.

Wanna explore the data set yourself? I made a web-app which you can use in RStudio or in your web browser. Have fun with it and tell me what you find.

— — —
Peoples BK, Midway SR, Sackett D, Lynch A, & Cooney PB (2016). Twitter Predicts Citation Rates of Ecological Research. PloS one, 11 (11) PMID: 27835703

Renear AH, & Palmer CL (2009). Strategic reading, ontologies, and the future of scientific publishing. Science (New York, N.Y.), 325 (5942), 828-32 PMID: 19679805

How to write a nature-style review

Nature Reviews Neuroscience is one of the foremost journals in neuroscience. What do its articles look like? How have they developed? This blog post provides answers which might guide you in writing your own reviews.

Read more than you used to

Reviews in Nature Reviews Neuroscience cover more and more ground. Ten years ago, 93 references were the norm. Now, reviews average 150 references. This might be an example of scientific reports in general having to contain more and more information so as not to be labelled ‘premature’, ‘incomplete’, or ‘insufficient’ (Vale, 2015).


Reviews in NRN include more and more references.

Concentrate on the most recent literature

Nature Reviews Neuroscience is not the outlet for your history of neuroscience review. Only 22% of cited articles are more than 10 years old. A full 17% of cited articles were published a mere two years prior to the review being published, i.e. something like one year before the first draft of the review reached Nature Reviews Neuroscience (assuming a fast review process of 1 year).


Focus on recent findings. Ignore historical contexts.

If at all, give a historical background early on in your review.

References are given in order of first presentation in Nature Reviews Neuroscience. Dividing this order in quarters allows us to reveal the age distribution of references in the quarter of the review where they are first mentioned. As can be seen in the figure below, the pressure for recency is less severe in the first quarter of your review. It increases thereafter. So, if you want to take a risk and provide a historical context to your review, do so early on.


Ignore historical contexts, especially later in your review. Q = quarter in which reference first mentioned

The change in reference age distributions of the different quarters of reviews is not easily visible. Therefore, I fit a logarithmic model to the distributions (notice dotted line in Figure above) and used its parameter estimates as a representation of how ‘historical’ references are. Of course, the average reference is not historical, hence the negative values. But notice how the parameter estimates become more negative in progressive quarters of the reviews: history belongs at the beginning of a review.


Ignore historical contexts, especially later in your review: the modeling outcome.

Now, find a topic and write that Nature Review Neuroscience review. What are you waiting for?

— — —

Vale, R. (2015). Accelerating scientific publication in biology Proceedings of the National Academy of Sciences, 112 (44), 13439-13446 DOI: 10.1073/pnas.1511912112

— — —

All the R-code, including the R-markdown script used to generate this blog post, is available at

Discovering a glaring error in a research paper – a personal account

New York Magazine has published a great article about how grad student Steven Ludeke tried to correct mistakes in the research of Pete Hatemi and Brad Verhulst. Overall, Ludeke summarises his experience as ‘not recommendable’. Back in my undergraduate years I spotted an error in an article by David DeMatteo and did little to correct it. Why?

Christian Bale playing a non-incarcerated American Psycho.

David DeMatteo, assistant professor in Psychology at Drexel University, investigates psychopathy. In 2010, I was a lowly undergraduate student and noticed a glaring mistake in one of his top ten publications which has now been cited 50 times according to Google Scholar.

The error

The study investigated the characteristics of psychopaths who live among us, the non-incarcerated population. How do these psychopaths manage to avoid prison? DeMatteo et al. (2006) measured their psychopathy in terms of personality features and in terms of overt behaviours. ‘Participants exhibited the core personality features of psychopathy (Factor 1) to a greater extent than the core behavioral features of psychopathy (Factor 2). This finding may be helpful in explaining why many of the study participants, despite having elevated levels of psychopathic characteristics, have had no prior involvement with the criminal justice system.’ (p. 142)

The glaring mistake in this publication is that Factor 2 scores at 7.1 (the behavioural features of psychopathy) are actually higher than the Factor 1 scores at 5.2 (the personality features of psychopathy). The numbers tell the exactly opposite story to the words.


The error in short. The numbers obviously do not match up with the statement.

The numbers are given twice in the paper making a typo unlikely (p. 138 and p. 139). Adjusting the scores for the maxima of the scales that they are from (factor 1 x/x_max = 0.325 < factor 2 x/x_max=0.394) or the sample maximum (factor 1 x/x_max_obtained = 0.433 < factor 2 x/x_max_obtained = 0.44375) makes no difference. No outlier rejection is mentioned in the paper.

In sum, it appears as if DeMatteo and his co-authors interpret their numbers in a way which makes intuitive sense but which is in direct contradiction to their own data. When researchers disagree with their own data, we have a real problem.

The reaction

1) Self doubt. I consulted with my professor (the late Paddy O’Donnel) who confirmed the glaring mistake.

2) Contact the author. I contacted DeMatteo in 2010 but his e-mail response was evasive and did nothing to resolve the issue. I have contacted him again, inviting him to react to this post.

3) Check others’ reactions. I found three publications which cited DeMatteo et al.’s article (Rucevic, 2010; Gao & Raine, 2010; Ullrich et al., 2008) and simply ignored the contradictory numbers. They went with the story that community dwelling psychopaths show psychopathic personalities more than psychopathic behaviours, even though the data in the article favours the exactly opposite conclusion.

4) Realising my predicament. At this point I realised my options. Either I pursued this full force while finishing a degree and, afterwards, moving on to my Master’s in a different country. Or I let it go. I had a suspicion which Ludeke’s story in New York Magazine confirmed: in these situations one has much to lose and little to gain. Pursuing a mistake in the research literature is ‘clearly a bad choice’ according to Ludeke.

The current situation

And now this blog post detailing my experience. Why? Well, on the one hand, I have very little to lose from a disagreement with DeMatteo as I certainly don’t want a career in law psychology research and perhaps not even in research in general. The balance went from ‘little to gain, much to lose’ to ‘little to gain, little to lose’. On the other hand, following my recent blog posts and article (Kunert, 2016) about the replication crisis in Psychology, I have come to the conclusion that science cynicism is not the way forward. So, I finally went fully transparent.

I am not particularly happy with how I handled this whole affair. I have zero documentation of my contact with DeMatteo. So, expect his word to stand against mine soon. I also feel I should have taken a risk earlier in exposing this. But then, I used to be passionate about science and wanted a career in it. I didn’t want to make enemies before I had even started my Master’s degree.

In short, only once I stopped caring about my career in science did I find the space to care about science itself.

— — —

DeMatteo, D., Heilbrun, K., & Marczyk, G. (2006). An empirical investigation of psychopathy in a noninstitutionalized and noncriminal sample Behavioral Sciences & the Law, 24 (2), 133-146 DOI: 10.1002/bsl.667

Gao, Y., & Raine, A. (2010). Successful and unsuccessful psychopaths: A neurobiological model Behavioral Sciences & the Law DOI: 10.1002/bsl.924

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success Psychonomic Bulletin & Review DOI: 10.3758/s13423-016-1030-9

Rucević S (2010). Psychopathic personality traits and delinquent and risky sexual behaviors in Croatian sample of non-referred boys and girls. Law and human behavior, 34 (5), 379-91 PMID: 19728057

Ullrich, S., Farrington, D., & Coid, J. (2008). Psychopathic personality traits and life-success Personality and Individual Differences, 44 (5), 1162-1171 DOI: 10.1016/j.paid.2007.11.008

— — —

Update 16/11/2016: corrected numerical typo in sentence beginning ‘Adjusting the scores for the maxima…’ pointed out to me by Tom Foulsham via twitter (@TomFoulsh).

How to excel at academic conferences in 5 steps

Academic conferences have been the biggest joy of my PhD and so I want to share with others how to excel at this academic tradition. 


The author (second from right, with can) at his first music cognition conference (SMPC 2013 in Toronto) which – despite appearances – he attended by himself.

1) Socialising

A conference is not all about getting to know facts. It’s all about getting to know people. Go to a conference where you feel you can approach people. Attend every single preparatory excursion/workshop/symposium, every social event, every networking lunch. Sit at a table where you know no-one at all. Talk to the person next to you in every queue. At first, you will have only tiny chats. Later, these first contacts can develop over lunch. Still later you publish a paper together (Kunert & Slevc, 2015). The peer-review process might make you think that academics are awful know-it-alls. At a conference you will discover that they are actually interesting, intelligent and sociable people. Meet them!

2) Honesty

The conference bar is a mythical place where researchers talk about their actual findings, their actual doubts, their actual thoughts. If you want to get rid of the nagging feeling that you are an academic failure, talk to researchers at a conference. You will see that the published literature is a very polished version of what is really going on in research groups. It will help you put your own findings into perspective.

3) Openness

You can get even more out of a conference if you let go of your fear of being scooped and answer other people’s honesty with being open about what you do. I personally felt somewhat isolated with my research project at my institute. Conferences were more or less the only place to meet people with shared academic interests. Being open there didn’t just improve the bond with other academics, it led to concrete improvements of my research (Kunert et al., 2016).

4) Tourism

Get out of the conference hotel and explore the city. More often than not conferences are held in suspiciously nice places. Come a few days early, get rid of your jet-lag while exploring the local sights. Stay a few days longer and gather your thoughts before heading back to normal life. You might never again have an excuse to go to so many nice places so easily.

5) Spontaneity

The most important answer is yes. You might get asked for all sorts of things to do during the conference. Just say yes. I attended the Gran’ Ol Opry in Nashville. I found myself in a jacuzzi in Redwood, CA. I attended a transvestite bar in Toronto. All with people I barely knew. All with little to no information on what the invitation entailed. Just say yes and see what happens.

It might sound terribly intimidating to go to an academic conference if you just started your PhD. In this case a national or student only conference might be a good first step into the academic conference tradition.

Conferences are the absolute highlight of academia. Don’t miss out on them.

— — —

Kunert R, & Slevc LR (2015). A Commentary on: “Neural overlap in processing music and speech”. Frontiers in human neuroscience, 9 PMID: 26089792

Kunert R, Willems RM, & Hagoort P (2016). Language influences music harmony perception: effects of shared syntactic integration resources beyond attention. Royal Society open science, 3 (2) PMID: 26998339

A critical comment on “Contextual sensitivity in scientific reproducibility”

Psychological science is surprisingly difficult to replicate (Open Science Collaboration, 2015). Researchers are desperate to find out why. A new study in the prestigious journal PNAS (Van Bavel et al., 2016) claims that unknown contextual factors of psychological phenomena (“hidden moderators”) are to blame. The more an effect is sensitive to unknown contextual factors, the less likely a successful replication is. In this blog post I will lay out why I am not convinced by this argument.

Before I start I should say that I really appreciate that the authors of this paper make their point with reference to data and analyses thereof. I believe that this is a big improvement on the state of the replicability debate of a few years back when it was dominated by less substantiated opinions. Moreover, they share their key data and some analysis code, following good scientific practice. Still, I am not convinced by their argument. Here’s why:

1) No full engagement with the opposite side of the argument

Van Bavel et al.’s (2016) suggested influence of replication contexts on replication success cannot explain the following patterns in the data set they used (Open Science Collaboration, 2015):

a) replication effect sizes are mostly lower than original effect sizes. Effects might well “vary by [replication] context” (p. 2) but why the consistent reduction in effect size when replicating an effect?

b) internal conceptual replications are not related to independent replication success (Kunert, 2016). This goes directly against Van Bavel et al.’s (2016) suggestion that “conceptual replications can even improve the probability of successful replications” (p. 5).

c) why are most original effects just barely statistically significant (see previous blog post)?

I believe that all three patterns point to some combination of questionable research practices affecting the original studies. Nothing in Van Bavel et al.’s (2016) article manages to convince me otherwise.

2) The central result completely depends on how you define ‘replication success’

The central claim of the article is based on the correlation between one measure of replication success (subjective judgment by replication team of whether replication was successful) and one measure of the contextual sensitivity of a replicated effect. While the strength of the association (r = -.23) is statistically significant (p = .024), it doesn’t actually provide convincing evidence for either the null or the alternative hypothesis according to a standard Bayesian JZS correlation test (BF01 = 1). [For all analyses: R-code below.]

Moreover, another measure of replication success (reduction of effect size between original and replication study) is so weakly correlated with the contextual sensitivity variable (r = -.01) as to provide strong evidence for a lack of association between contextual sensitivity and replication success (BF01 = 12, notice that even the direction of the correlation is in the wrong direction according to Van Bavel et al.’s (2016) account).


[Update: The corresponding values for the other measures of replication success are: replication p < .05 (r = -0.18; p = .0721; BF01 = 2.5), original effect size in 95%CI of replication effect size (r = -.3, p = .0032, BF10 = 6). I could not locate the data column for whether the meta-analytic effect size is different from zero.]

3) The contextual sensitivity variable could be confounded

How do we know which original effects were plagued by hidden moderators (i.e. by unknown context sensitivity) if, well, these moderators are hidden? Three of the authors of the article simply rated all replicated studies for contextual sensitivity without knowing each study’s replication status (but after the replication success of each study was known in general). The authors provide evidence for the ratings to be reliable but no one knows whether they are valid.

For example, the raters tried not to be influenced by ‘whether the specific replication attempt in question would succeed’ (p. 2). Still, all raters knew they would benefit (in the form of a prestigious publication) from a significant association between their ratings and replication success. How do we know that the ratings do not simply reflect some sort of implicit replicability doubt? From another PNAS study (Dreber et al., 2015) we know that scientists can predict replication success before a replication study is run.

Revealing hidden moderators

My problem with the contextual sensitivity account claiming that unknown moderators are to blame for replication failures is not so much that it is an unlikely explanation. I agree with Van Bavel et al. (2016) that some psychological phenomena are more sensitive to replication contexts than others. I would equally welcome it if scientific authors were more cautious in generalising their results.

My problem is that this account is so general as to be nearly unfalsifiable, and an unfalsifiable account is scientifically useless. Somehow unknown moderators always get invoked once a replication attempt has failed. All sorts of wild claims could be retrospectively claimed to be true within the context of the original finding.

In short: a convincing claim that contextual factors are to blame for replication failures needs to reveal the crucial replication contexts and then show that they indeed influence replication success. The proof of the unknown pudding is in the eating.

— — —
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research Proceedings of the National Academy of Sciences, 112 (50), 15343-15347 DOI: 10.1073/pnas.1516179112

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success Psychonomic Bulletin & Review DOI: 10.3758/s13423-016-1030-9

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

Van Bavel, J.J., Mende-Siedlecki, P., Brady, W.J., & Reinero, D.A. (2016). Contextual sensitivity in scientific reproducibility PNAS
— — —

# Script for article "A critical comment on "Contextual sensitivity in scientific reproducibility""    #
# Submitted to Brain's Idea                                                                            #
# Responsible for this file: R. Kunert (                                            # 
# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
if(!require(BayesMed)){install.packages('BayesMed')} #Bayesian analysis of correlation
if(!require(Hmisc)){install.packages('Hmisc')} #correlations
if(!require(reshape2)){install.packages('reshape2')}#melt function
if(!require(grid)){install.packages('grid')} #arranging figures
#get raw data from OSF website
info <- GET('', write_disk('rpp_Bevel_data.csv', overwrite = TRUE)) #downloads data file from the OSF
RPPdata <- read.csv("rpp_Bevel_data.csv")[1:100, ]
colnames(RPPdata)[1] <- "ID" # Change first column name
#2) The central result completely depends on how you define 'replication success'----------------------------
#replication with subjective judgment of whether it replicated
rcorr(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary, type = 'spearman')
#As far as I know there is currently no Bayesian Spearman rank correlation analysis. Therefore, use standard correlation analysis with raw and ranked data and hope that the result is similar.
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary)#parametric Bayes factor test
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C), rank(RPPdata$Replicate_Binary))#parametric Bayes factor test
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#replication with effect size reduction
rcorr(RPPdata$ContextVariable_C[!$FXSize_Diff)], RPPdata$FXSize_Diff[!$FXSize_Diff)], type = 'spearman')
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C[!$FXSize_Diff)], RPPdata$FXSize_Diff[!$FXSize_Diff)])
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C[!$FXSize_Diff)]), rank(RPPdata$FXSize_Diff[!$FXSize_Diff)]))
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#Figure 1----------------------------------------------------------------------------------------------------
#general look
theme_set(theme_bw(12)+#remove gray background, set font-size
            theme(axis.line = element_line(colour = "black"),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.background = element_blank(),
                  panel.border = element_blank(),
                  legend.title = element_blank(),
                  legend.key = element_blank(),
                  legend.position = "top",
                  legend.direction = 'vertical'))
#Panel A: replication success measure = binary replication team judgment
dat_box = melt(data.frame(dat = c(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1],
                                  RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0]),
                          replication_status = c(rep('replicated', sum(RPPdata$Replicate_Binary == 1)),
                                                 rep('not replicated', sum(RPPdata$Replicate_Binary == 0)))),
               id = c('replication_status'))
#draw basic box plot
plot_box = ggplot(dat_box, aes(x=replication_status, y=value)) +
  geom_boxplot(size = 1.2,#line size
               alpha = 0.3,#transparency of fill colour
               width = 0.8,#box width
               notch = T, notchwidth = 0.8,#notch setting               
               show_guide = F,#do not show legend
               fill='black', color='grey40') +  
  labs(x = "Replication status", y = "Context sensitivity score")#axis titles
#add mean values and rhythm effect lines to box plot
#prepare data frame
dat_sum = melt(data.frame(dat = c(mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1]),
                                  mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0])),
                          replication_status = c('replicated', 'not replicated')),
               id = 'replication_status')
#add mean values
plot_box = plot_box +
  geom_line(data = dat_sum, mapping = aes(y = value, group = 1),
            size= c(1.5), color = 'grey40')+
  geom_point(data = dat_sum, size=12, shape=20,#dot rim
             fill = 'grey40',
             color = 'grey40') +
  geom_point(data = dat_sum, size=6, shape=20,#dot fill
             fill = 'black',
             color = 'black')
#Panel B: replication success measure = effect size reduction
dat_corr = data.frame("x" = RPPdata$FXSize_Diff[!$FXSize_Diff)],
                      "y" = RPPdata$ContextVariable_C[!$FXSize_Diff)])#plotted data
plot_corr = ggplot(dat_corr, aes(x = x, y = y))+
  geom_point(size = 2) +#add points
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression")) +
  stat_smooth(method = "rlm", size = 1, se = FALSE,
              aes(colour = "robust regression")) +
  labs(x = "Effect size reduction (original - replication)", y = "Contextual sensitivity score") +#axis labels
  scale_color_grey()+#colour scale for lines
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression"),
              lty = 2)
#arrange figure with both panels
multi.PLOT(plot_box + ggtitle("Replication success = replication team judgment"),
           plot_corr + ggtitle("Replication success = effect size stability"),

Created by Pretty R at

Yet more evidence for questionable research practices in original studies of Reproducibility Project: Psychology

The replicability of psychological research is surprisingly low. Why? In this blog post I present new evidence showing that questionable research practices contributed to failures to replicate psychological effects.

Quick recap. A recent publication in Science claims that only around 40% of psychological findings are replicable, based on 100 replication attempts in the Reproducibility Project Psychology (Open Science Collaboration, 2015). A few months later, a critical commentary in the same journal made all sorts of claims, including that the surprisingly low 40% replication success rate is due to replications having been unfaithful to the original studies’ methods (Gilbert et al., 2016). A little while later, I published an article in Psychonomic Bulletin & Review re-analysing the data by the 100 replication teams (Kunert, 2016). I found evidence for questionable research practices being at the heart of failures to replicate, rather than the unfaithfulness of replications to original methods.

However, my previous re-analysis depended on replication teams having done good work. In this blog post I will show that even when just looking at the original studies in the Reproducibility Project: Psychology one cannot fail to notice that questionable research practices were employed by the original discoverers of the effects which often failed to replicate. The reanalysis I will present here is based on the caliper test introduced by Gerber and colleagues (Gerber & Malhotra, 2008; Gerber et al., 2010).

The idea of the caliper test is simple. The research community has decided that an entirely arbitrary threshold of p = 0.05 distinguishes between effects which might just be due to chance (p > 0.05) and effects which are more likely due to something other than chance (p < 0.05). If researchers want to game the system they slightly rig their methods and analyses to push their p-values just below the arbitrary border between ‘statistical fluke’ and ‘interesting effect’. Alternatively, they just don’t publish anything which came up p > 0.05. Such behaviour should lead to an unlikely amount of p-values just below 0.05 compared to just above 0.05.

The figure below shows the data of the Reproducibility Project: Psychology. On the horizontal axis I plot z-values which are related to p-values. The higher the z-value the lower the p-value. On the vertical axis I just show how many z-values I found in each range. The dashed vertical line is the arbitrary threshold between p < .05 (significant effects on the right) and p > .05 (non-significant effects on the left).


The independent replications in blue show many z-values left of the dashed line, i.e. replication attempts which were unsuccessful. Otherwise the blue distribution is relatively smooth. There is certainly nothing fishy going on around the arbitrary p = 0.05 threshold. The blue curve looks very much like what I would expect psychological research to be if questionable research practices did not exist.

However, the story is completely different for the green distribution representing the original effects. Just right of the arbitrary p = 0.05 threshold there is a surprising clustering of z-values. It’s as if the human mind magically leads to effects which are just about significant rather than just about not significant. This bump immediately to the right of the dashed line is a clear sign that original authors used questionable research practices. This behaviour renders psychological research unreplicable.

For the expert reader, the formal analysis of the caliper test is shown in the table below using both a Bayesian analysis and a classical frequentist analysis. The conclusion is clear. There is no strong evidence for replication studies failing the caliper test, indicating that questionable research practices were probably not employed. The original studies do not pass the caliper test, indicating that questionable research practices were employed.


over caliper


below caliper (non-sign.) Binomial test Bayesian proportion test posterior median

[95% Credible Interval]1

10 % caliper (1.76 < z < 1.96 versus 1.96 < z < 2.16)

Original 9 4 p = 0.267 BF10 = 1.09 0.53

[-0.36; 1.55]

Replication 3 2 p = 1 BF01 = 1.30 0.18

[-1.00; 1.45]

15 % caliper (1.67 < z < 1.96 versus 1.96 < z < 2.25)

Original 17 4 p = 0.007 BF10 = 12.9 1.07

[0.24; 2.08]

Replication 4 5 p = 1 BF01 = 1.54 -0.13

[-1.18; 0.87]

20 % caliper (1.76 < z < 1.57 versus 1.96 < z < 2.35)

Original 29 4 p < 0.001 BF10 = 2813 1.59

[0.79; 2.58]

Replication 5 5 p = 1 BF01 = 1.64 0.00

[-0.99; 0.98]

1Based on 100,000 draws from the posterior distribution of log odds.


As far as I know, this is the first analysis showing that data from the original studies of the Reproducibility Project: Psychology point to questionable research practices [I have since been made aware of others, see this comment below]. Instead of sloppy science on the part of independent replication teams, this analysis rather points to original investigators employing questionable research practices. This alone could explain the surprisingly low replication rates in psychology.

Psychology failing the caliper test is by no means a new insight. Huge text-mining analyses have shown that psychology as a whole tends to fail the caliper test (Kühberger et al., 2013, Head et al., 2015). The analysis I have presented here links this result to replicability. If a research field employs questionable research practices (as indicated by the caliper test) then it can no longer claim to deliver insights which stand the replication test (as indicated by the Reproducibility Project: Psychology).

It is time to get rid of questionable research practices. There are enough ideas for how to do so (e.g., Asendorpf et al., 2013; Ioannidis, Munafò, Fusar-Poli, Nosek, & Lakens, 2014). The Reproducibility Project: Psychology shows why there is no time to waste: it is currently very difficult to distinguish an interesting psychological effect from a statistical fluke. I doubt that this state of affairs is what psychological researchers get paid for.

PS: full R-code for recreating all analyses and figures is posted below. If you find mistakes please let me know.

PPS: I am indebted to Jelte Wicherts for pointing me to this analysis.

Update 25/4/2015:

I adjusted text to clarify that caliper test cannot distinguish between many different questionable research practices, following tweet by .

I toned down the language somewhat following tweet by .

I added reference to Uli Schimmack’s analysis by linking his comment.

— — —

Asendorpf, J., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J., Fiedler, K., Fiedler, S., Funder, D., Kliegl, R., Nosek, B., Perugini, M., Roberts, B., Schmitt, M., van Aken, M., Weber, H., & Wicherts, J. (2013). Recommendations for Increasing Replicability in Psychology European Journal of Personality, 27 (2), 108-119 DOI: 10.1002/per.1919

Gerber, A., & Malhotra, N. (2008). Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results? Sociological Methods & Research, 37 (1), 3-30 DOI: 10.1177/0049124108318973

Gerber, A., Malhotra, N., Dowling, C., & Doherty, D. (2010). Publication Bias in Two Political Behavior Literatures American Politics Research, 38 (4), 591-613 DOI: 10.1177/1532673X09350979

Gilbert, D., King, G., Pettigrew, S., & Wilson, T. (2016). Comment on “Estimating the reproducibility of psychological science” Science, 351 (6277), 1037-1037 DOI: 10.1126/science.aad7243

Head ML, Holman L, Lanfear R, Kahn AT, & Jennions MD (2015). The extent and consequences of p-hacking in science. PLoS biology, 13 (3) PMID: 25768323

Ioannidis JP, Munafò MR, Fusar-Poli P, Nosek BA, & David SP (2014). Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in cognitive sciences, 18 (5), 235-41 PMID: 24656991

Kühberger A, Fritz A, & Scherndl T (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PloS one, 9 (9) PMID: 25192357

Kunert R (2016). Internal conceptual replications do not increase independent replication success. Psychonomic bulletin & review PMID: 27068542

Open Science Collaboration (2015). Estimating the reproducibility of psychological science Science, 349 (6251) DOI: 10.1126/science.aac4716

— — —

# Script for article "Questionable research practices in original studies of Reproducibility Project: Psychology"#
# Submitted to Brain's Idea (status: published)                                                                                               #
# Responsible for this file: R. Kunert (                                                      # 
#Figure 1: p-value density
# source functions
info <- GET('', write_disk('functions.r', overwrite = TRUE)) #downloads data file from the OSF
if(!require(devtools)){install.packages('devtools')} #RPP functions
if(!require(BayesFactor)){install.packages('BayesFactor')} #Bayesian analysis
if(!require(BEST)){install.packages('BEST')} #distribution overlap
library(BEST)#requires JAGS version 3
if(!require(RCurl)){install.packages('RCurl')} #
#the following few lines are an excerpt of the Reproducibility Project: Psychology's 
# masterscript.R to be found here:
# Read in Tilburg data
info <- GET('', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER <- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] <- "ID" # Change first column name to ID to be able to load .csv file
#for studies with exact p-values
id <- MASTER$ID[!$T_pval_USE..O.) & !$T_pval_USE..R.)]
#FYI: turn p-values into z-scores 
#z = qnorm(1 - (pval/2)) 
#prepare data point for plotting
dat_vis <- data.frame(p = c(MASTER$T_pval_USE..O.[id],
                            MASTER$T_pval_USE..R.[id], MASTER$T_pval_USE..R.[id]),
                      z = c(qnorm(1 - (MASTER$T_pval_USE..O.[id]/2)),
                            qnorm(1 - (MASTER$T_pval_USE..R.[id]/2)),
                            qnorm(1 - (MASTER$T_pval_USE..R.[id]/2))),
                      Study_set= c(rep("Original Publications", length(id)),
                                   rep("Independent Replications", length(id)),
                                   rep("zIndependent Replications2", length(id))))
#prepare plotting colours etc
cols_emp_in = c("#1E90FF","#088A08")#colour_definitions of area
cols_emp_out = c("#08088a","#0B610B")#colour_definitions of outline
legend_labels = list("Independent\nReplication", "Original\nStudy")
#execute actual plotting
density_plot = ggplot(dat_vis, aes(x=z, fill = Study_set, color = Study_set, linetype = Study_set))+
  geom_density(adjust =0.6, size = 1, alpha=1) +  #density plot call
  scale_linetype_manual(values = c(1,1,3)) +#outline line types
  scale_fill_manual(values = c(cols_emp_in,NA), labels = legend_labels)+#specify the to be used colours (outline)
  scale_color_manual(values = cols_emp_out[c(1,2,1)])+#specify the to be used colours (area)
  labs(x="z-value", y="Density")+ #add axis titles  
  ggtitle("Reproducibility Project: Psychology") +#add title
  geom_vline(xintercept = qnorm(1 - (0.05/2)), linetype = 2) +
  annotate("text", x = 2.92, y = -0.02, label = "p < .05", vjust = 1, hjust = 1)+
  annotate("text", x = 1.8, y = -0.02, label = "p > .05", vjust = 1, hjust = 1)+
  theme(legend.position="none",#remove legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line  = element_line(colour = "black"),#clean look
        text = element_text(size=18),
#common legend
#draw a nonsense bar graph which will provide a legend
legend_labels = list("Independent\nReplication", "Original\nStudy")
dat_vis <- data.frame(ric = factor(legend_labels, levels=legend_labels), kun = c(1, 2))
dat_vis$ric = relevel(dat_vis$ric, "Original\nStudy")
nonsense_plot = ggplot(data=dat_vis, aes(x=ric, y=kun, fill=ric)) + 
  scale_fill_manual(values=cols_emp_in[c(2,1)], name=" ") +
#extract legend
tmp <- ggplot_gtable(ggplot_build(nonsense_plot)) 
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") 
leg_plot <- tmp$grobs[[leg]]
#combine plots
grid.arrange(grobs = list(density_plot,leg_plot), ncol = 2, widths = c(2,0.4))
#caliper test according to Gerber et al.
#turn p-values into z-values
z_o = qnorm(1 - (MASTER$T_pval_USE..O.[id]/2))
z_r = qnorm(1 - (MASTER$T_pval_USE..R.[id]/2))
#How many draws are to be taken from posterior distribution for BF and Credible Interval calculations? The more samples the more precise the estimate and the slower the calculation.
draws = 10000 * 10#BayesFactor package standard = 10000
#choose one of the calipers
#z_c = c(1.76, 2.16)#10% caliper
#z_c = c(1.67, 2.25)#15% caliper
#z_c = c(1.57, 2.35)#20% caliper
#calculate counts
print(sprintf('Originals: over caliper N = %d', sum(z_o <= z_c[2] & z_o >= 1.96)))
print(sprintf('Originals: under caliper N = %d', sum(z_o >= z_c[1] & z_o <= 1.96)))
print(sprintf('Replications: over caliper N = %d', sum(z_r <= z_c[2] & z_r >= 1.96)))
print(sprintf('Replications: under caliper N = %d', sum(z_r >= z_c[1] & z_r <= 1.96)))
#formal caliper test: originals
#Bayesian analysis
bf = proportionBF(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2)
sprintf('Bayesian test of single proportion: BF10 = %1.2f', exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log
#sample from posterior
samples_o = proportionBF(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2,
                       posterior = TRUE, iterations = draws)
sprintf('Posterior Median = %1.2f [%1.2f; %1.2f]',
        quantile(samples_o[,"logodds"], 0.025),#Lower edge of 95% Credible Interval
        quantile(samples_o[,"logodds"], 0.975))#Higher edge of 95% Credible Interval
#classical frequentist test
bt = binom.test(sum(z_o <= z_c[2] & z_o >= 1.96), sum(z_o >= z_c[1] & z_o <= z_c[2]), p = 1/2)
sprintf('Binomial test: p = %1.3f', bt$p.value)#
#formal caliper test: replications
bf = proportionBF(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2)
sprintf('Bayesian test of single proportion: BF01 = %1.2f', 1/exp(bf@bayesFactor$bf))#exponentiate BF10 because stored as natural log, turn into BF01
#sample from posterior
samples_r = proportionBF(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2,
                       posterior = TRUE, iterations = draws)
sprintf('Posterior Median = %1.2f [%1.2f; %1.2f]',
        quantile(samples_r[,"logodds"], 0.025),#Lower edge of 95% Credible Interval
        quantile(samples_r[,"logodds"], 0.975))#Higher edge of 95% Credible Interval
#classical frequentist test
bt = binom.test(sum(z_r <= z_c[2] & z_r >= 1.96), sum(z_r >= z_c[1] & z_r <= z_c[2]), p = 1/2)
sprintf('Binomial test: p = %1.3f', bt$p.value)#
#possibly of interest: overlap of posteriors
#postPriorOverlap(samples_o[,"logodds"], samples_r[,"logodds"])#overlap of distribitions

Created by Pretty R at

Are internal replications the solution to the replication crisis in Psychology? No.

Most Psychology findings are not replicable. What can be done? Stanford psychologist Michael Frank has an idea : Cumulative study sets with internal replication. ‘If I had to advocate for a single change to practice, this would be it.’ I took a look whether this makes any difference.

A recent paper in the journal Science has tried to replicate 97 statistically significant effects (Open Science Collaboration, 2015). In only 35 cases this was successful. Most findings were suddenly a lot weaker upon replication. This has led to a lot of soul searching among psychologists. Fortunately, the authors of the Science paper have made their data freely available. So, soul searching can be accompanied by trying out different ideas for improvements.

What can be done to solve Psychology’s replication crisis?

One idea to improve the situation is to demand study authors to replicate their own experiments in the same paper. Stanford psychologist Michael Frank writes:

If I had to advocate for a single change to practice, this would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. […] If you show me a one-off study and I fail to replicate it in my lab, I will tend to suspect that you got lucky or p-hacked your way to a result. But if you show me a package of studies with four internal replications of an effect, I will believe that you know how to get that effect – and if I don’t get it, I’ll think that I’m doing something wrong.
If this argument were true, then the 41 studies which were successfully, conceptually replicated in their own paper should show higher rates of replication than the 56 studies which were not. Of the 41 internally replicated studies, 19 were replicated once, 10 twice, 8 thrice, 4 more than three times. I will treat all of these as equally internally replicated.

Are internal replications the solution? No.


So, does the data by the reprocucibility project show a difference? I made so-called violin plots, thicker parts represent more data points. In the left plot you see the reduction in effect sizes from a bigger original effect to a smaller replicated effect. The reduction associated with internally replicated effects (left) and effects which were only reported once in a paper (right) is more or less the same. In the right plot you can see the p-value of the replication attempt. The dotted line represents the arbitrary 0.05 threshold used to determine statistical significance. Again, replicators appear to have had as hard a task with effects that were found more than once in a paper as with effects which were only found once.

If you do not know how to read these plots, don’t worry. Just focus on this key comparison. 29% of internally replicated effects could also be replicated by an independent team (1 effect was below p = .055 and is not counted here). The equivalent number of not internally replicated effects is 41%. A contingency table Bayes factor test (Gunel & Dickey, 1974) shows that the null hypothesis of no difference is 1.97 times more likely than the alternative. In other words, the 12 %-point replication advantage for non-replicated effects does not provide convincing evidence for an unexpected reversed replication advantage. The 12%-point difference is not due to statistical power. Power was 92% on average in the case of internally replicated and not internally replicated studies. So, the picture doesn’t support internal replications at all. They are hardly the solution to Psychology’s replication problem according to this data set.

The problem with internal replications

I believe that internal replications do not prevent many questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give you just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams were unsuccessful in replication attempts of Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).

What can be done?

In my previous post I have shown evidence for questionable research practices in this data set. These lead to less replicable results. Pre-registering studies makes questionable research practices a lot harder and science more reproducible. It would be interesting to see data on whether this hunch is true.

[update 7/9/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to take into account the different denominators for replicated and unreplicated effects. Lee Jussim pointed me to this.]

[update 24/10/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to provide correct numbers, Bayesian analysis and power comparison.]

— — —
Bem DJ (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of personality and social psychology, 100 (3), 407-25 PMID: 21280961

Galak, J., LeBoeuf, R., Nelson, L., & Simmons, J. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103 (6), 933-948 DOI: 10.1037/a0029709

Gunel, E., & Dickey, J. (1974). Bayes Factors for Independence in Contingency Tables. Biometrika, 61(3), 545–557.

Open Science Collaboration (2015). PSYCHOLOGY. Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Ritchie SJ, Wiseman R, & French CC (2012). Failing the future: three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PloS one, 7 (3) PMID: 22432019

Schimmack U (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17 (4), 551-66 PMID: 22924598
— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between internal replication and independent reproducibility of an effect

#Richard Kunert for Brain's Idea 5/9/2015

# a lot of code was taken from the reproducibility project code here

# installing/loading the packages:

#loading the data
RPPdata <- get.OSFfile(code='',dfCln=T)$df
RPPdata <- dplyr::filter(RPPdata, !,!, complete.cases(RPPdata$T.r.O,RPPdata$T.r.R))#97 studies with significant effects

#prepare IDs for internally replicated effects and non-internally replicated effects
idIntRepl <- RPPdata$Successful.conceptual.replications.O > 0
idNotIntRepl <- RPPdata$Successful.conceptual.replications.O == 0

# Get ggplot2 themes predefined in C-3PR
mytheme <- gg.theme("clean")

#restructure data in data frame
dat <- data.frame(EffectSizeDifference = as.numeric(c(c(RPPdata$T.r.R[idIntRepl]) - c(RPPdata$T.r.O[idIntRepl]),
                                                          c(RPPdata$T.r.R[idNotIntRepl]) - c(RPPdata$T.r.O[idNotIntRepl]))),
                  ReplicationPValue = as.numeric(c(RPPdata$T.pval.USE.R[idIntRepl],
                  grp=factor(c(rep("Internally Replicated Studies",times=sum(idIntRepl)),
                               rep("Internally Unreplicated Studies",times=sum(idNotIntRepl))))

# Create some variables for plotting
dat$grp <- as.numeric(dat$grp)
probs   <- seq(0,1,.25)

# VQP PANEL A: reduction in effect size -------------------------------------------------

# Get effect size difference quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$EffectSizeDifference[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$EffectSizeDifference[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$EffectSizeDifference[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2 <- ggplot(dat,aes(x=grp,y=EffectSizeDifference)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(,qtiles,probs)
# Garnish (what does this word mean???)
g.es1 <- g.es0 +
  ggtitle("Effect size reduction") + xlab("") + ylab("Replicated - Original Effect Size") + 
  xlim("Internally Replicated", "Not Internally Replicated") +
  mytheme + theme(axis.text.x = element_text(size=20))
# View

# VQP PANEL B: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$ReplicationPValue[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$ReplicationPValue[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$ReplicationPValue[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.pv <- ggplot(dat,aes(x=grp,y=ReplicationPValue)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish (I still don't know what this word means!)
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("Independent replication p-value") + xlab("") + ylab("Independent replication p-value") + 
  xlim("Internally Replicated", "Not Internally Replicated")+
  mytheme + theme(axis.text.x = element_text(size=20))
# View

#put two plots together

Why are Psychological findings mostly unreplicable?

Take 97 psychological effects from top journals which are claimed to be robust. How many will replicate? Brian Nosek and his huge team tried it out and the results were sobering, to say the least. How did we get here? The data give some clues.

Sometimes the title of a paper just sounds incredible. Estimating the reproducibility of psychological science. No one had ever systematically, empirically investigated this for any science. Doing so would require huge resources. The countless authors on this paper which appeared in Science last week went to great lengths to try anyway and their findings are worrying.

When they tried to replicate 97 statistically significant effects with 92% power (i.e. a nominal 92% chance of finding the effect should it exist as claimed by the original discoverers), 89 statistically significant effect should pop up. Only 35 did. Why weren’t 54 more studies replicated?

The team behind this article also produced 95% Confidence Intervals of the replication study effect sizes. Despite their name, only 83% of them should contain the original effect size (see here why). Only 47% actually did. Why were most effect sizes much smaller in the replication?

One reason for poor replication: sampling until significant

I believe much has to do with so-called questionable research practices which I blogged about before. The consequences of this are directly visible in the openly available data of this paper. Specifically, I am focussing on the widespread practice of sampling more participants until a test result is statistically desirable, i.e. until you get a p-value below the arbitrary threshold of 0.05. The consequence is this:


Focus on the left panel first. The green replication studies show a moderate relation between the effect size they found and their pre-determined sample size. This is to be expected as the replicators wanted to be sure that they had sufficient statistical power to find their effects. Expecting small effects (lower on vertical axis) makes you plan in more participants (further right on horizontal axis). The replicators simply sampled their pre-determined number, and then analysed the data. Apparently, such a practice leads to a moderate correlation between measured effect size and sample size because what the measured effect size will be is uncertain when you start sampling.

The red original studies show a stronger relation between the effect size they found and their sample size. They must have done more than just smart a priori power calculations. I believe that they sampled until their effect was statistically significant, going back and forth between sampling and analysing their data. If, by chance, the first few participants showed the desired effect quite strongly, experimenters were happy with overestimating their effect size and stopped early. These would be red data values in the top left of the graph. If, on the other hand, the first few participants gave equivocal results, the experimenters continued for as long as necessary. Notice how this approach links sample size to the effect size measured in the experiment, hence the strong statistical relation. The approach by the replicators links the sample size merely to the expected effect size estimated before the experiment, hence the weaker association with the actually measured effect size.

The right panel shows a Bayesian correlation analysis of the data. What you are looking at is the belief in the strength of the correlation, called the posterior distribution. The overlap of the distributions can be used as a measure of believing that the correlations are not different. The overlap is less than 7%. If you are more inclined to believe in frequentist statistics, the associated p-value is .001 (Pearson and Filon’s z = 3.355). Therefore, there is strong evidence that original studies display a stronger negative correlation between sample size and measured effect size than replication studies.

The approach which – I believe – has been followed by the original research teams should be accompanied by adjustments of the p-value (see Lakens, 2014 for how to do this). If not, you misrepresent your stats and lower the chances of replication, as shown in simulation studies (Simmons et al., 2011). It is estimated that 70% of psychological researchers have sampled until their result was statistically significant without correcting their results for this (John et al., 2012). This might very well be one of the reasons why replication rates in Psychology are far lower than what they should be.

So, one approach to boosting replication rates might be to do what we claim to do anyways and what the replication studies have actually done: aquiring data first, analysing it second. Alternatively, be open about what you did and correct your results appropriately. Otherwise, you might publish nothing more than a fluke finding with no basis.

[24/10/2015: Added Bayesian analysis and changed figure. Code below is from old figure.]

[27/11/2015: Adjusted percentage overlap of posterior distributions.]

— — —
John LK, Loewenstein G, & Prelec D (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23 (5), 524-32 PMID: 22508865

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses European Journal of Social Psychology, 44 (7), 701-710 DOI: 10.1002/ejsp.2023

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science (New York, N.Y.), 349 (6251) PMID: 26315443

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant Psychological Science, 22 (11), 1359-1366 DOI: 10.1177/0956797611417632

— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between sample size and effect size from data provided by the reproducibility project

#Richard Kunert for Brain's Idea 3/9/2015
#load necessary libraries

#get raw data from OSF website
info &lt;- GET('', write_disk('rpp_data.csv', overwrite = TRUE)) #downloads data file from the OSF
MASTER &lt;- read.csv("rpp_data.csv")[1:167, ]
colnames(MASTER)[1] &lt;- "ID" # Change first column name to ID to be able to load .csv file

#restrict studies to those with appropriate data
studies&lt;-MASTER$ID[!$T_r..O.) &amp; !$T_r..R.)] ##to keep track of which studies are which
studies&lt;-studies[-31]##remove one problem study with absurdly high sample size (N = 23,0047)

#set font size for plotting
theme_set(theme_gray(base_size = 30))

#prepare correlation coefficients
dat_rank &lt;- data.frame(sample_size_O = rank(cbind(MASTER$T_N_O_for_tables[studies])),
sample_size_R = rank(cbind(MASTER$T_N_R_for_tables[studies])),
effect_size_O = rank(cbind(MASTER$T_r..O.[studies])),
effect_size_R = rank(cbind(MASTER$T_r..R.[studies])))
corr_O_Spearm = rcorr(dat_rank$effect_size_O, dat_rank$sample_size_O, type = "spearman")#yes, I know the type specification is superfluous
corr_R_Spearm = rcorr(dat_rank$effect_size_R, dat$sample_size_R, type = "spearman")

#compare Spearman correlation coefficients using cocor (data needs to be ranked in order to produce Spearman correlations!)
htest = cocor(formula=~sample_size_O + effect_size_O | sample_size_R + effect_size_R,
data = dat_rank, return.htest = FALSE)

#prepare data frame
dat_vis &lt;- data.frame(study = rep(c("Original", "Replication"), each=length(studies)),
sample_size = rbind(cbind(MASTER$T_N_O_for_tables[studies]), cbind(MASTER$T_N_R_for_tables[studies])),
effect_size = rbind(cbind(MASTER$T_r..O.[studies]), cbind(MASTER$T_r..R.[studies])))

#The plotting call
ggplot(data=dat_vis, aes(x=sample_size, y=effect_size, group=study)) +#the basic scatter plot
geom_point(aes(color=study),shape=1,size=4) +#specify marker size and shape
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm,   # Add linear regression lines
se=FALSE,    # Don't add shaded confidence region
aes(color=study))+#colour lines according to data points for consistency
geom_text(aes(x=750, y=0.46,
label=sprintf("Spearman rho = %1.3f (p = %1.3f)",
corr_O_Spearm$r[1,2], corr_O_Spearm$P[1,2]),
color="Original", hjust=0)) +#add text about Spearman correlation coefficient of original studies
guides(color = guide_legend(title=NULL)) + #avoid additional legend entry for text
geom_text(aes(x=750, y=0.2,
label=sprintf("Spearman rho = %1.3f (p = %1.3f)",
corr_R_Spearm$r[1,2], corr_R_Spearm$P[1,2]),
color="Replication", hjust=0))+#add text about Spearman correlation coefficient of replication studies
geom_text(x=1500, y=0.33,
label=sprintf("Difference: Pearson &amp; Filon z = %1.3f (p = %1.3f)",
htest@pearson1898$statistic, htest@pearson1898$p.value),
color="black", hjust=0)+#add text about testing difference between correlation coefficients
guides(color = guide_legend(title=NULL))+#avoid additional legend entry for text
ggtitle("Sampling until significant versus a priori power analysis")+#add figure title
labs(x="Sample Size", y="Effect size r")#add axis titles