False alarm: A critical comment on “Contextual sensitivity in scientific reproducibility”

Psychological science is surprisingly difficult to replicate (Open Science Collaboration, 2015). Researchers are desperate to find out why. A new study in the prestigious journal PNAS (Van Bavel et al., 2016) claims that unknown contextual factors of psychological phenomena (“hidden moderators”) are to blame. The more an effect is sensitive to unknown contextual factors, the less likely a successful replication is. In this blog post I will lay out why I am not convinced by this argument.

Before I start, I should say that I really appreciate that the authors of this paper make their point with reference to data and analyses thereof. I believe this is a big improvement on the state of the replicability debate of a few years back, when it was dominated by less substantiated opinions. Moreover, they share their key data and some analysis code, following good scientific practice. Still, I am not convinced by their argument. Here’s why:

1) No full engagement with the opposite side of the argument

Van Bavel et al.’s (2016) suggested influence of replication contexts on replication success cannot explain the following patterns in the data set they used (Open Science Collaboration, 2015):

a) replication effect sizes are mostly lower than original effect sizes. Effects might well “vary by [replication] context” (p. 2) but why the consistent reduction in effect size when replicating an effect?

b) internal conceptual replications are not related to independent replication success (Kunert, 2016). This goes directly against Van Bavel et al.’s (2016) suggestion that “conceptual replications can even improve the probability of successful replications” (p. 5).

c) why are most original effects just barely statistically significant (see previous blog post)?

I believe that all three patterns point to some combination of questionable research practices affecting the original studies. Nothing in Van Bavel et al.’s (2016) article manages to convince me otherwise.

2) The central result completely depends on how you define ‘replication success’

The central claim of the article is based on the correlation between one measure of replication success (the replication team’s subjective judgment of whether the replication was successful) and one measure of the contextual sensitivity of a replicated effect. While the association (r = -.23) is statistically significant (p = .024), it does not actually provide convincing evidence for either the null or the alternative hypothesis according to a standard Bayesian JZS correlation test (BF01 = 1). [For all analyses: R-code below.]

Moreover, another measure of replication success (the reduction of effect size between original and replication study) is so weakly correlated with the contextual sensitivity variable (r = -.01) as to provide strong evidence for a lack of association between contextual sensitivity and replication success (BF01 = 12; note that the correlation even runs in the direction opposite to the one predicted by Van Bavel et al.’s (2016) account).

[Figure 1: context sensitivity scores split by replication status (replication team judgment, Panel A) and plotted against effect size reduction (Panel B); code for this figure is below.]

[Update: The corresponding values for the other measures of replication success are: replication p < .05 (r = -.18, p = .0721, BF01 = 2.5); original effect size within the 95% CI of the replication effect size (r = -.3, p = .0032, BF10 = 6). I could not locate the data column for whether the meta-analytic effect size is different from zero.]
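The code for these two additional measures is not part of the script below, so here is a minimal sketch of how they could be computed in the same way. The column names Replication_P_R and O_within_CI_R are placeholders (I do not know what these variables are called in rpp_Bevel_data.csv), and the sketch assumes that RPPdata, Hmisc and BayesMed have already been loaded as in the script at the end of this post.

#minimal sketch for the additional replication success measures (untested)
#NB: Replication_P_R and O_within_CI_R are placeholder column names, adjust to the actual data file
#replication success = replication significant at p < .05
sig_repl = as.numeric(RPPdata$Replication_P_R < .05)
rcorr(RPPdata$ContextVariable_C, sig_repl, type = 'spearman')
bf = jzs_cor(RPPdata$ContextVariable_C, sig_repl)#parametric Bayes factor test
1/bf$BayesFactor#BF01

#replication success = original effect size within 95% CI of replication effect size
rcorr(RPPdata$ContextVariable_C, RPPdata$O_within_CI_R, type = 'spearman')
bf = jzs_cor(RPPdata$ContextVariable_C, RPPdata$O_within_CI_R)#parametric Bayes factor test
bf$BayesFactor#BF10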

3) The contextual sensitivity variable could be confounded

How do we know which original effects were plagued by hidden moderators (i.e. by unknown context sensitivity) if, well, these moderators are hidden? Three of the authors of the article simply rated all replicated studies for contextual sensitivity without knowing each individual study’s replication status (but after the overall replication results had become publicly known). The authors provide evidence that the ratings are reliable, but no one knows whether they are valid.

For example, the raters tried not to be influenced by ‘whether the specific replication attempt in question would succeed’ (p. 2). Still, all raters knew they would benefit (in the form of a prestigious publication) from a significant association between their ratings and replication success. How do we know that the ratings do not simply reflect some sort of implicit replicability doubt? From another PNAS study (Dreber et al., 2015) we know that scientists can predict replication success before a replication study is run.

Revealing hidden moderators

My problem with the contextual sensitivity account, which blames unknown moderators for replication failures, is not so much that it is an unlikely explanation. I agree with Van Bavel et al. (2016) that some psychological phenomena are more sensitive to replication contexts than others. I would equally welcome it if scientific authors were more cautious in generalising their results.

My problem is that this account is so general as to be nearly unfalsifiable, and an unfalsifiable account is scientifically useless. Somehow, unknown moderators always get invoked once a replication attempt has failed. All sorts of wild claims could retrospectively be declared true ‘within the context of the original finding’.

In short: a convincing claim that contextual factors are to blame for replication failures needs to reveal the crucial replication contexts and then show that they indeed influence replication success. The proof of the unknown pudding is in the eating.

— — —
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343-15347. DOI: 10.1073/pnas.1516179112

Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-016-1030-9

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251). DOI: 10.1126/science.aac4716

Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. PNAS.
— — —

########################################################################################################
# Script for article "A critical comment on "Contextual sensitivity in scientific reproducibility""    #
# Submitted to Brain's Idea                                                                            #
# Responsible for this file: R. Kunert (rikunert@gmail.com)                                            # 
########################################################################################################   
 
# source functions
if(!require(devtools)){install.packages('devtools')} #RPP functions
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))
 
if(!require(BayesMed)){install.packages('BayesMed')} #Bayesian analysis of correlation
library(BayesMed)
 
if(!require(Hmisc)){install.packages('Hmisc')} #correlations
library(Hmisc)
 
if(!require(reshape2)){install.packages('reshape2')}#melt function
library(reshape2)
 
if(!require(grid)){install.packages('grid')} #arranging figures
library(grid)
 
if(!require(MASS)){install.packages('MASS')} #robust regression (rlm) used in Figure 1, Panel B
library(MASS)
 
#get raw data from OSF website
info <- GET('https://osf.io/pra2u/?action=download', write_disk('rpp_Bevel_data.csv', overwrite = TRUE)) #downloads data file from the OSF
RPPdata <- read.csv("rpp_Bevel_data.csv")[1:100, ]
colnames(RPPdata)[1] <- "ID" # Change first column name
 
#------------------------------------------------------------------------------------------------------------
#2) The central result completely depends on how you define 'replication success'----------------------------
 
#replication with subjective judgment of whether it replicated
rcorr(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary, type = 'spearman')
#As far as I know there is currently no Bayesian Spearman rank correlation analysis. Therefore, use standard correlation analysis with raw and ranked data and hope that the result is similar.
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C, RPPdata$Replicate_Binary)#parametric Bayes factor test
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C), rank(RPPdata$Replicate_Binary))#parametric Bayes factor test
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
 
#replication with effect size reduction
rcorr(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)], RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)], type = 'spearman')
#parametric Bayes factor test
bf = jzs_cor(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)], RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)])
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
#parametric Bayes factor test with ranked data
bf = jzs_cor(rank(RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)]), rank(RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)]))
plot(bf$alpha_samples)
1/bf$BayesFactor#BF01 provides support for null hypothesis over alternative
 
#------------------------------------------------------------------------------------------------------------
#Figure 1----------------------------------------------------------------------------------------------------
 
#general look
theme_set(theme_bw(12)+#remove gray background, set font-size
            theme(axis.line = element_line(colour = "black"),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  panel.background = element_blank(),
                  panel.border = element_blank(),
                  legend.title = element_blank(),
                  legend.key = element_blank(),
                  legend.position = "top",
                  legend.direction = 'vertical'))
 
#Panel A: replication success measure = binary replication team judgment
dat_box = melt(data.frame(dat = c(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1],
                                  RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0]),
                          replication_status = c(rep('replicated', sum(RPPdata$Replicate_Binary == 1)),
                                                 rep('not replicated', sum(RPPdata$Replicate_Binary == 0)))),
               id = c('replication_status'))
 
#draw basic box plot
plot_box = ggplot(dat_box, aes(x=replication_status, y=value)) +
  geom_boxplot(size = 1.2,#line size
               alpha = 0.3,#transparency of fill colour
               width = 0.8,#box width
               notch = T, notchwidth = 0.8,#notch setting               
               show_guide = F,#do not show legend
               fill='black', color='grey40') +  
  labs(x = "Replication status", y = "Context sensitivity score")#axis titles
 
#add mean values and rhythm effect lines to box plot
 
#prepare data frame
dat_sum = melt(data.frame(dat = c(mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 1]),
                                  mean(RPPdata$ContextVariable_C[RPPdata$Replicate_Binary == 0])),
                          replication_status = c('replicated', 'not replicated')),
               id = 'replication_status')
 
#add mean values
plot_box = plot_box +
  geom_line(data = dat_sum, mapping = aes(y = value, group = 1),
            size= c(1.5), color = 'grey40')+
  geom_point(data = dat_sum, size=12, shape=20,#dot rim
             fill = 'grey40',
             color = 'grey40') +
  geom_point(data = dat_sum, size=6, shape=20,#dot fill
             fill = 'black',
             color = 'black')
plot_box
 
#Panel B: replication success measure = effect size reduction
dat_corr = data.frame("x" = RPPdata$FXSize_Diff[!is.na(RPPdata$FXSize_Diff)],
                      "y" = RPPdata$ContextVariable_C[!is.na(RPPdata$FXSize_Diff)])#plotted data
 
plot_corr = ggplot(dat_corr, aes(x = x, y = y))+
  geom_point(size = 2) +#add points
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression")) +
  stat_smooth(method = "rlm", size = 1, se = FALSE,
              aes(colour = "robust regression")) +
  labs(x = "Effect size reduction (original - replication)", y = "Contextual sensitivity score") +#axis labels
  scale_color_grey()+#colour scale for lines
  stat_smooth(method = "lm", size = 1, se = FALSE,
              aes(colour = "least squares regression"),
              lty = 2)
plot_corr
 
#arrange figure with both panels
multi.PLOT(plot_box + ggtitle("Replication success = replication team judgment"),
           plot_corr + ggtitle("Replication success = effect size stability"),
           cols=2)



Are internal replications the solution to the replication crisis in Psychology? No.

Most Psychology findings are not replicable. What can be done? Stanford psychologist Michael Frank has an idea: cumulative study sets with internal replication. ‘If I had to advocate for a single change to practice, this would be it.’ I took a look at whether this makes any difference.

A recent paper in the journal Science reported attempts to replicate 97 statistically significant effects (Open Science Collaboration, 2015). Only 35 of these attempts were successful. Most findings were suddenly a lot weaker upon replication. This has led to a lot of soul-searching among psychologists. Fortunately, the authors of the Science paper have made their data freely available, so the soul-searching can be accompanied by trying out different ideas for improvements.

What can be done to solve Psychology’s replication crisis?

One idea to improve the situation is to demand that study authors replicate their own experiments in the same paper. Stanford psychologist Michael Frank writes:

If I had to advocate for a single change to practice, this would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. […] If you show me a one-off study and I fail to replicate it in my lab, I will tend to suspect that you got lucky or p-hacked your way to a result. But if you show me a package of studies with four internal replications of an effect, I will believe that you know how to get that effect – and if I don’t get it, I’ll think that I’m doing something wrong.
If this argument were true, then the 41 studies which were successfully conceptually replicated in their own paper should show higher rates of independent replication than the 56 studies which were not. Of the 41 internally replicated studies, 19 were replicated once, 10 twice, 8 thrice, and 4 more than three times. I will treat all of these as equally internally replicated.

Are internal replications the solution? No.

[Figure: violin plots of effect size reduction (left panel) and independent replication p-values (right panel) for internally replicated versus not internally replicated effects; code for this figure is below.]

So, do the data from the Reproducibility Project show a difference? I made so-called violin plots; thicker parts represent more data points. In the left plot you see the reduction in effect sizes from a bigger original effect to a smaller replicated effect. The reduction associated with internally replicated effects (left) and effects which were only reported once in a paper (right) is more or less the same. In the right plot you can see the p-value of the replication attempt. The dotted line represents the arbitrary 0.05 threshold used to determine statistical significance. Again, replicators appear to have had as hard a task with effects that were found more than once in a paper as with effects which were only found once.

If you do not know how to read these plots, don’t worry. Just focus on this key comparison: 29% of internally replicated effects could also be replicated by an independent team (one effect with p < .055 is not counted here). The equivalent number for not internally replicated effects is 41%. A contingency table Bayes factor test (Gunel & Dickey, 1974) shows that the null hypothesis of no difference is 1.97 times more likely than the alternative. In other words, the 12-percentage-point replication advantage for effects that were not internally replicated does not provide convincing evidence for an unexpected reversed replication advantage. The difference is also not due to statistical power, which was 92% on average for both internally replicated and not internally replicated studies. So, the picture doesn’t support internal replications at all. They are hardly the solution to Psychology’s replication problem according to this data set.
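The figure script below only produces the plots, so here is a minimal sketch of how the contingency table Bayes factor could be computed with the BayesFactor R package, which implements the Gunel & Dickey (1974) test. The counts are reconstructed from the percentages reported above (roughly 12 of 41 internally replicated and 23 of 56 not internally replicated effects replicating independently), so treat this as an illustration rather than the exact analysis; the resulting Bayes factor may deviate slightly from the 1.97 quoted in the text.

#illustrative sketch: contingency table Bayes factor test (Gunel & Dickey, 1974)
#counts below are reconstructed from the percentages in the text and therefore approximate
if(!require(BayesFactor)){install.packages('BayesFactor')}
library(BayesFactor)

counts = matrix(c(12, 41 - 12,  #internally replicated: independently replicated / not
                  23, 56 - 23), #not internally replicated: independently replicated / not
                nrow = 2, byrow = TRUE)

bf = contingencyTableBF(counts, sampleType = 'indepMulti', fixedMargin = 'rows')
1/extractBF(bf)$bf#BF01: evidence for 'no difference' over 'a difference'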

The problem with internal replications

I believe that internal replications do not prevent many questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give you just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams were unsuccessful in replication attempts of Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).

What can be done?

In my previous post I showed evidence for questionable research practices in this data set. These practices lead to less replicable results. Pre-registering studies makes questionable research practices a lot harder and science more reproducible. It would be interesting to see data on whether this hunch is true.

[update 7/9/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to take into account the different denominators for replicated and unreplicated effects. Lee Jussim pointed me to this.]

[update 24/10/2015: Adjusted claims in paragraph starting ‘If you do not know how to read these plots…’ to provide correct numbers, Bayesian analysis and power comparison.]

— — —
Bem DJ (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of personality and social psychology, 100 (3), 407-25 PMID: 21280961

Galak, J., LeBoeuf, R., Nelson, L., & Simmons, J. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103 (6), 933-948 DOI: 10.1037/a0029709

Gunel, E., & Dickey, J. (1974). Bayes Factors for Independence in Contingency Tables. Biometrika, 61(3), 545–557. http://doi.org/10.2307/2334738

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251). PMID: 26315443

Ritchie SJ, Wiseman R, & French CC (2012). Failing the future: three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PloS one, 7 (3) PMID: 22432019

Schimmack U (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological methods, 17 (4), 551-66 PMID: 22924598
— — —

code for reproducing the figure (if you find mistakes, please tell me!):

## Estimating the association between internal replication and independent reproducibility of an effect

#Richard Kunert for Brain's Idea 5/9/2015

# a lot of code was taken from the reproducibility project code here https://osf.io/vdnrb/

# installing/loading the packages:
library(devtools)
source_url('https://raw.githubusercontent.com/FredHasselman/toolboxR/master/C-3PR.R')
in.IT(c('ggplot2','RColorBrewer','lattice','gridExtra','plyr','dplyr','httr','extrafont'))

#loading the data
RPPdata <- get.OSFfile(code='https://osf.io/fgjvw/',dfCln=T)$df
RPPdata <- dplyr::filter(RPPdata, !is.na(T.pval.USE.O),!is.na(T.pval.USE.R), complete.cases(RPPdata$T.r.O,RPPdata$T.r.R))#97 studies with significant effects

#prepare IDs for internally replicated effects and non-internally replicated effects
idIntRepl <- RPPdata$Successful.conceptual.replications.O > 0
idNotIntRepl <- RPPdata$Successful.conceptual.replications.O == 0

# Get ggplot2 themes predefined in C-3PR
mytheme <- gg.theme("clean")

#restructure data in data frame
dat <- data.frame(EffectSizeDifference = as.numeric(c(c(RPPdata$T.r.R[idIntRepl]) - c(RPPdata$T.r.O[idIntRepl]),
                                                          c(RPPdata$T.r.R[idNotIntRepl]) - c(RPPdata$T.r.O[idNotIntRepl]))),
                  ReplicationPValue = as.numeric(c(RPPdata$T.pval.USE.R[idIntRepl],
                                                   RPPdata$T.pval.USE.R[idNotIntRepl])),
                  grp=factor(c(rep("Internally Replicated Studies",times=sum(idIntRepl)),
                               rep("Internally Unreplicated Studies",times=sum(idNotIntRepl))))
  )

# Create some variables for plotting
dat$grp <- as.numeric(dat$grp)
probs   <- seq(0,1,.25)

# VQP PANEL A: reduction in effect size -------------------------------------------------

# Get effect size difference quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$EffectSizeDifference[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$EffectSizeDifference[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$EffectSizeDifference[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.es <- ggplot(dat,aes(x=grp,y=EffectSizeDifference)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.es0 <- vioQtile(g.es,qtiles,probs)
# Garnish (what does this word mean???)
g.es1 <- g.es0 +
  ggtitle("Effect size reduction") + xlab("") + ylab("Replicated - Original Effect Size") + 
  xlim("Internally Replicated", "Not Internally Replicated") +
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.es1


# VQP PANEL B: p-value -------------------------------------------------

# Get p-value quantiles and frequencies from data
qtiles <- ldply(unique(dat$grp),
                function(gr) quantile(round(dat$ReplicationPValue[dat$grp==gr],digits=4),probs,na.rm=T,type=3))
freqs  <- ldply(unique(dat$grp),
                function(gr) table(cut(dat$ReplicationPValue[dat$grp==gr],breaks=qtiles[gr,],na.rm=T,include.lowest=T,right=T)))
labels <- sapply(unique(dat$grp),
                 function(gr)levels(cut(round(dat$ReplicationPValue[dat$grp==gr],digits=4), breaks = qtiles[gr,],na.rm=T,include.lowest=T,right=T)))

# Get regular violinplot using package ggplot2
g.pv <- ggplot(dat,aes(x=grp,y=ReplicationPValue)) + geom_violin(aes(group=grp),scale="width",color="grey30",fill="grey30",trim=T,adjust=.7)
# Cut at quantiles using vioQtile() in C-3PR
g.pv0 <- vioQtile(g.pv,qtiles,probs)
# Garnish (I still don't know what this word means!)
g.pv1 <- g.pv0 + geom_hline(aes(yintercept=.05),linetype=2) +
  ggtitle("Independent replication p-value") + xlab("") + ylab("Independent replication p-value") + 
  xlim("Internally Replicated", "Not Internally Replicated")+
  mytheme + theme(axis.text.x = element_text(size=20))
# View
g.pv1

#put two plots together
multi.PLOT(g.es1,g.pv1,cols=2)

The scientific community’s Galileo affair (you’re the Pope)

Science is in crisis. Everyone in the scientific community knows about it but few want to talk about it. The crisis is one of honesty. A junior scientist (like me) asks himself a question similar to the one Galileo faced in 1633: how much honesty is desirable in science?

Galileo versus Pope: guess what role the modern scientist plays.

Science Wonderland

According to nearly all empirical scientific publications that I have read, scientists allegedly work like this:

[Figure: the idealised linear ‘story’: Introduction → Methods → Results → Discussion.]

Scientists call this ‘the story’ of the paper. This ‘story framework’ is so entrenched in science that the vast majority of scientific publications are required to be organised according to its structure: 1) Introduction, 2) Methods, 3) Results, 4) Discussion. My own publication is no exception.

Science Reality

However, virtually all scientists know that ‘the story’ is not really true. It is merely an ideal-case-scenario. Usually, the process looks more like this:

[Figure: the same process with added red arrows representing questionable research practices.]

Scientists call some of the added red arrows questionable research practices (or QRPs for short). The red arrows stand for (going from left to right, top to bottom):

1) adjusting the hypothesis based on the experimental set-up. This happens particularly when a) working with an old data set, or b) the set-up turns out to be accidentally different from the intended one, etc.

2) changing design details (e.g., how many participants, how many conditions to include, how many/which measures of interest to focus on) depending on the results these changes produce.

3) analysing until results are easy to interpret.

4) analysing until results are statistically desirable (‘significant results’), i.e. so-called p-hacking.

5) hypothesising after results are known (so-called HARKing).

The outcome is a collection of blatantly unrealistic ‘stories’ in scientific publications. Compare this to the more realistic literature on clinical trials for new drugs. More than half the drugs fail the trial (Goodman, 2014). In contrast, nearly all ‘stories’ in the wider scientific literature are success stories. How?

Joseph Simmons and colleagues (2011) give an illustration of how to produce spurious successes. They simulated the situation of researchers engaging in the second point above (changing design details based on results). Let’s assume that the hypothesised effect is not real. How many experiments will erroneously find an effect at the conventional 5% significance criterion? Well, 5% of experiments should (scientists have agreed that this number is low enough to be acceptable). However, thanks to the questionable research practices outlined above this number can be boosted. For example, sampling participants until the result is statistically desirable leads to up to 22% of experiments reporting a ‘significant result’ even though there is no effect to be found. It is estimated that 70% of US psychologists have done this (John et al., 2012). When such a practice is combined with other, similar design changes, up to 61% of experiments falsely report a significant effect. Why do we do this?
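To make the optional stopping point concrete, here is a minimal simulation sketch in R, in the spirit of (but not identical to) Simmons et al. (2011). There is no true effect in these simulated experiments, and the starting sample size, batch size and maximum sample size are arbitrary choices for illustration; the exact false positive rate depends on them, but with settings like these it ends up well above the nominal 5%.

#minimal simulation sketch: 'sampling until significant' inflates false positives
#there is NO true effect, yet far more than 5% of simulated experiments end up 'significant'
set.seed(1)
n_sims = 5000    #number of simulated experiments
n_start = 20     #initial sample size per group
n_max = 50       #give up once this sample size per group is reached
batch_size = 10  #participants added per group before testing again

one_experiment = function(){
  a = rnorm(n_start)#group A, true effect = 0
  b = rnorm(n_start)#group B, true effect = 0
  repeat{
    if(t.test(a, b)$p.value < .05) return(TRUE)#stop as soon as the result is 'significant'
    if(length(a) >= n_max) return(FALSE)       #or once the maximum sample size is reached
    a = c(a, rnorm(batch_size))
    b = c(b, rnorm(batch_size))
  }
}

mean(replicate(n_sims, one_experiment()))#false positive rate, well above the nominal 5%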

The Pope of 1633 is back

If we know that the scientific literature is unrealistic, why don’t we just drop the pretence and tell it as it is? The reason is simple: because you like the scientific wonderland of success stories. If you are a scientist reader, you like to base the evaluation of scientific manuscripts on the ‘elegance’ (simplicity, straight-forwardness) of the text. This leaves no room for telling you what really happened. You also like to base the evaluation of your colleagues on the quantity and the ‘impact’ of their scientific output. QRPs are essentially a career requirement in such a system. If you are a lay reader, you like the research you fund (via tax money) to be sexy, easy and simple. Scientific data are as messy as the real world, but the reported results are not. They are meant to be easily digestible (‘elegant’) insights.

In 1633 it did not matter much whether Galileo admitted to the heliocentric world view, which was deemed blasphemous. The idea was out there to conquer the minds of the Renaissance world. Today’s Galileo moment is also marked by an inability to admit to scientific facts (i.e. the so-called ‘preliminary’ research results which scientists obtain before applying questionable research practices). But this time the role of the Pope is played both by political leaders/the lay public and by scientists themselves. Actual scientific insights get lost before they can see the light of day.

There is a great movement to remedy this situation, including pressure to share data (e.g., at PLoS ONE), replication initiatives (e.g., RRR1, the Reproducibility Project), the opportunity to pre-register experiments, etc. However, these remedies only focus on scientific practice, as if Galileo were at fault and the concept of blasphemy were fine. Maybe we should start looking into how we got into this mess in the first place. Focus on the Pope.

— — —
John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices with Incentives for Truth-Telling. SSRN Electronic Journal. DOI: 10.2139/ssrn.1996631

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632

— — —

Picture: Joseph Nicolas Robert-Fleury [Public domain], via Wikimedia Commons

PS: This post necessarily reflects the field of science that I am familiar with (Psychology, Cognitive Science, Neuroscience). The situation may well be different in other scientific fields.

 

 

canine confirmation confound – lessons from poorly performing drug detection dogs

Intuitively, the use of police dogs as drug detectors makes sense. Dogs are known to have a better sense of smell than their human handlers. Furthermore, they cooperate easily. Still, compared to the generally good picture sniffer dogs have in the public eye, their performance as drug detectors in real life is terrible. The reason why scent dogs get used anyway holds important lessons for behavioural researchers working with animals or humans.

Survey data coming out of Australia paint an appalling picture of sniffer dog abilities. Their noses hardly ever detect the drugs they are trained on. For example, only about 6% of regular ecstasy users in possession of drugs reported that they were found out by a sniffer dog they saw (Hickey et al., 2012). But once the dogs bark, you can be pretty sure that a drug was found, right? No, you can’t be sure at all. A review by the ombudsman for New South Wales found that nearly three quarters of dog alerts did not result in any drugs being found (NSW Ombudsman, 2006). It’s clear: using sniffer dogs to detect drugs just does not work very well.
[Image: a military drug detection dog with its handler.]

Both looking in the same direction. Who is following whom?

This raises the question why scent dogs are used at all. My guess is that they perform a lot better in ability demonstrations than in real life. This is because in demonstration scenarios their handlers know the right answer. This answer can then be read off unconscious behavioural cues and thus guide the dog. This is exactly what a Californian research team found (Lit et al., 2011). When an area was marked so as to make the handler believe that it contained an illicit substance, more than 80% of the time the handler reported that his/her dog had found the substance. However, the researchers in this study misled the dog handlers and in fact never hid any illicit substances, i.e. every alarm was a false alarm. Interestingly, when an area was not marked, significantly fewer dog alerts were reported. This suggests that the handlers control to a large extent when their own dog responds. Apparently, sniffer dogs game the system by trusting not just their nose but also their handler when it comes to looking for drugs. This trick won’t work, though, if the handler himself doesn’t have a clue either, as in real-life scenarios.
The deeper issue is that good test design has to exclude the possibility that the participant can game it. The most famous case where this went wrong was a horse called Clever Hans. Early last century this horse made waves because it could allegedly count and do all sorts of computations. Hans, however, was clever in a different way than people realised. He only knew the answer if the person asking the question and recording the response also knew the answer. Clearly, Hans gamed the system by reading off the right answers from behavioural cues sent out by the experimenter.
Whether reading research papers or designing studies, remember Hans! Remember that the person handling the participant during a test should never know the right answer. If s/he does, the research is more likely to produce the intended result for unintended reasons. This can happen with scent dogs (Lit et al., 2011), with horses, but also with adult humans (see the Bargh controversy elicited by Doyen et al., 2012). Unfortunately, after 100 years of living with this knowledge, reviewers are starting to notice that the lesson has been forgotten (see Beran, 2012). Drug detection dogs show where this loss leads us.

——————————————————————————————————–

Beran, M.J. (2012). Did you ever hear the one about the horse that could count? Front. Psychology, 3 DOI: 10.3389/fpsyg.2012.00357

Doyen S, Klein O, Pichon CL, & Cleeremans A (2012). Behavioral priming: it’s all in the mind, but whose mind? PloS one, 7 (1) PMID: 22279526

Hickey S, McIlwraith F, Bruno R, Matthews A, & Alati R (2012). Drug detection dogs in Australia: More bark than bite? Drug and alcohol review, 31 (6), 778-83 PMID: 22404555

Lit L, Schweitzer JB, & Oberbauer AM (2011). Handler beliefs affect scent detection dog outcomes. Animal cognition, 14 (3), 387-94 PMID: 21225441

NSW Ombudsman (2006). Review of the Police Powers (Drug Detection Dogs) Act 2001. Sydney: Office of the New South Wales Ombudsman.
—————————————————————————————————————————-

If you liked this post you may also like:
Correcting for Human Researchers – the Rediscovery of Replication

images:

1) By U.S. Navy photo by Photographer’s Mate 3rd Class Douglas G. Morrison [Public domain], via Wikimedia Commons

—————————————————————————————————————————-

If you were not entirely indifferent to this post, please leave a comment.