Abstract
“To boldly go where no man has gone before”! Exploring and innovating—isn't this why we are in science after all? But as exciting as this may be, others must be able to confirm our results through competent replication. As Karl Popper famously put it: “Single occurrences that cannot be reproduced are of no significance to science” (Popper, 1935). However, despite its status as a founding principle of modern science, replication is often viewed as pedestrian and unoriginal. Academia rewards the explorer, not the replicator. Not surprisingly, the current biomedical literature is dominated by exploration, while results are rarely confirmed. But how many of our discoveries are robust and will hold up if others try to reproduce them?
Only after pharmaceutical companies reported a high failure rate when they tried to replicate pivotal findings of academic researchers, and after several fields started systematic replication efforts, did it begin to dawn on many scientists that we may be in the midst of a “replication crisis” (Baker, 2016). At the same time, meta‐research exposed numerous validity threats in the biomedical literature. Many studies were found wanting in measures to prevent selection, detection, or attrition bias, among other deficits in design, analysis, and reporting. In addition to low internal validity, small sample sizes and thus low statistical power appear to contribute to a lack of robustness in the results of many studies—which may then be hard to replicate (Button et al, 2013).
During the past decade, many studies have analyzed and quantified substantial shortcomings of scientific rigor in the life sciences. A number of remedies have been proposed, and journals, funders, and learned societies have begun to adopt policies to improve the quality of the science they support or publish. These efforts are of utmost importance. Non‐reproducible results waste resources and potentially endanger patients (Yarborough et al, 2018). But even if we explore the complex biology of organisms with the highest methodological standards, confirmation of our results through replication will often remain elusive. In the following, I will explore some of the complexities that haunt efforts to reproduce scientific results. I will focus on issues that are rarely covered in the current discussion, which is dominated by diffuse terminology and by a binary view of replication that knows only success or failure.
Defining reproducibility
A conceptual framework for “reproducibility” was laid out by Goodman et al (2016), which distinguishes three basic concepts: (i) Methods reproducibility, which requires that the procedures of a study be described in such detail that an expert can faithfully repeat it. (ii) Results reproducibility—often referred to as “replication”—which is established by a technically competent repetition of the study. Such a replication can be exact, using identical conditions, or conceptual, using altered conditions; the latter helps to extend the causal claim to previously unsampled settings. (iii) Inferential reproducibility, which relates to the question of whether a reanalysis or replication of a study would come to qualitatively similar, if not exactly the same, conclusions.
But what do we actually mean when we say that a study was “replicated”, or that “it could not be reproduced”? We might, for example, focus on statistical significance and P‐values, in other words, evaluate the replication effect against the null hypothesis. Another approach is to compare the replication effect against the original effect size: Is the original effect size within the 95% confidence interval of the effect size estimate from the replication? Or, a meta‐analysis could combine original and replication effects and scrutinize the cumulative evidence. Lastly, we could ask experts to subjectively assess the results: “Did it replicate?” By now, it should be obvious that the common question “Was the study reproducible?” is imprecise. A binary “yes” or “no” lacks information on the type of reproducibility and leads to “sizeless science”, comparable to the common fallacy of focusing on arbitrary significance levels instead of looking at effect sizes and the variance of the data.
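To make this concrete, here is a minimal sketch (the effect sizes and standard errors are purely illustrative assumptions, not taken from any particular study) showing how the three quantitative criteria can disagree for the same pair of studies:

```python
# Illustrative sketch: three ways to ask "did it replicate?" can give
# different answers for the same pair of studies (all numbers assumed).
from scipy.stats import norm

d_orig, se_orig = 0.50, 0.20   # original effect size and standard error (assumed)
d_rep,  se_rep  = 0.20, 0.10   # replication effect size and standard error (assumed)

# (1) Test the replication effect against the null hypothesis
p_rep = 2 * (1 - norm.cdf(abs(d_rep) / se_rep))           # ~0.046, "significant"

# (2) Is the original effect inside the replication's 95% confidence interval?
ci_low, ci_high = d_rep - 1.96 * se_rep, d_rep + 1.96 * se_rep
orig_in_rep_ci = ci_low <= d_orig <= ci_high              # False: CI ~ (0.00, 0.40)

# (3) Fixed-effect meta-analysis of both estimates (inverse-variance weights)
w_orig, w_rep = 1 / se_orig**2, 1 / se_rep**2
d_pooled = (w_orig * d_orig + w_rep * d_rep) / (w_orig + w_rep)   # ~0.26

print(p_rep, orig_in_rep_ci, d_pooled)
```

By the first criterion the study “replicated”, by the second it did not, and the pooled estimate points to a real but much smaller effect; a binary verdict hides exactly this information.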
The meaning of failed replication
But matters get even more complicated when we try to interpret a failure to replicate. For many scientists, a positive result has a priori more authority than a negative one. On the other hand, once a failed replication makes it into the publication record, the implicit assumption often is that the original result was a false positive. But what if the failed replication was falsely negative? Conversely, does successful replication imply that the original result was correct, or could both be falsely positive?
In fact, most researchers overestimate the additional evidence generated by replications, in particular strict (exact) ones. As a warning to those who find the fuss about reproducibility irrelevant because they always replicate their own important results anyway: Replication studies often have little more power than flipping a coin! If an exact replication is performed of an original effect that was significant with a P‐value just below 0.05, and if the effect found in the original experiment equals the true population effect, the probability of again obtaining a P‐value of 0.05 or smaller is only approximately 50% (Goodman, 1992).
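The arithmetic behind this 50% figure is short. A minimal sketch, under the two assumptions just stated (a two-sided test that just reached P = 0.05, and an observed effect that equals the true effect):

```python
# Sketch of Goodman's (1992) argument, under two assumptions: the original
# study just reached two-sided P = 0.05 (z ~ 1.96), and the observed effect
# equals the true population effect.
from scipy.stats import norm

z_crit = norm.ppf(1 - 0.05 / 2)   # ~1.96, the two-sided 5% threshold
z_true = z_crit                   # assumed: true effect = originally observed effect

# In an exact replication with the same design and sample size, the replication
# z-statistic is approximately Normal(z_true, 1).
p_success = 1 - norm.cdf(z_crit, loc=z_true, scale=1)
print(f"Probability that the replication reaches P < 0.05: {p_success:.2f}")  # 0.50
```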
In addition, the ability to confirm or refute results critically depends on the provision of sufficient detail in the original study, and on the competence of the replicator. To make matters worse, the “known unknowns” of tacit knowledge and the “unknown unknowns” of hidden moderators may further confound replication. Consequently, contextual sensitivity and putative researcher incompetence are the most popular explanations put forward by those who are informed that their results “could not be reproduced”. Moreover, replication has a social aspect, as failure appears to stigmatize those whose results could not be confirmed. At the same time, trying to reproduce the studies of others carries another stigma: These must be unoriginal scientists, with an affinity for embarrassing fellow researchers.
Cutting‐edge research produces false positives
With all these complexities and complications, shall we abandon replication as futile and continue to focus our efforts on exploration? Definitely not, since competent exploration at the frontiers of biology and pathophysiology must often lead to results that cannot be replicated, simply because they were falsely positive. It comes as a surprise to many scientists that research pushing the boundaries of what is currently known must result in a plethora of false‐positive results. In fact, the more original initial findings are, the less likely it is that they can subsequently be confirmed. This is the straightforward consequence of the fact that cutting‐edge research must operate with low base rates, which means that it is unlikely that the tested hypotheses are actually true. On the flip side, the more mainstream and less novel research is, the likelier it is that its hypotheses are true. This can easily be framed statistically. For example, if only 10% of the hypotheses a field tests are actually true, and it accepts Type I errors (false positives) at 5% and Type II errors (false negatives) at 20% (i.e., 80% power), then almost 40% of its positive results will be false positives. In other words, under those conditions the positive predictive value (PPV) of results is much worse than the Type I error level would suggest. And those conditions are actually overly optimistic for much of biomedical research, in which power is below 50% (Button et al, 2013). This has grave consequences, in particular as many researchers confuse PPV and significance level, nurturing the delusion that they falsely accept their hypotheses in only 5% of cases (Colquhoun, 2014).
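The arithmetic behind these numbers is simple enough to write down. A sketch using the base rate, error levels, and power quoted above:

```python
# Positive predictive value (PPV) under the conditions stated in the text.
prior = 0.10   # 10% of tested hypotheses are actually true (the base rate)
alpha = 0.05   # Type I error level (false positives)
power = 0.80   # 1 - Type II error (Type II error = 20%)

true_positives  = prior * power          # 0.08  of all tests
false_positives = (1 - prior) * alpha    # 0.045 of all tests
ppv = true_positives / (true_positives + false_positives)

print(f"PPV = {ppv:.2f}; {1 - ppv:.0%} of 'significant' findings are false")
# PPV ~ 0.64, i.e. roughly 36% ("almost 40%") of positive results are false.
# With power at 0.5 (Button et al, 2013), the PPV drops further, to ~0.53.
```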
Confirmation to the rescue
The only remedy we have against drowning in a sea of false‐positive results is the systematic and competent confirmation of pivotal studies. Results are pivotal if they form the basis for further investigation, if they directly or indirectly impact on human health—for instance by informing the design of future clinical development, including trials in humans (Yarborough et al, 2018)—or if they challenge accepted evidence. Exploration must be sensitive, since it must be able to faithfully capture rare but critical results, but confirmation must be highly specific, since further research based on a false‐positive or non‐robust finding is wasteful and unethical (Macleod et al, 2014).
Since discovery is unavoidably linked to high false‐positive rates and cannot support confirmatory inference, dedicated investigation is needed to validate pivotal results. Any study must describe its design and analysis in sufficient detail and provide all the necessary information, including the source data of the results. Confirmatory studies need to be sufficiently powered, which often means substantially larger sample sizes than in the original study (Simonsohn, 2015). In addition, exploratory and confirmatory investigation differ in many aspects. While exploration may start without any hypothesis (“unbiased”), a proper hypothesis is the obligatory starting point of any confirmation. Exploration investigates biological mechanisms or screens interventions, the results of which need confirmation. Confirmation of the hypothesis is the default primary endpoint of any confirmatory investigation. The protocol of the confirmatory study ought to be published (preregistration) before the first experiment takes place.
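As an illustration of why the confirmatory sample usually has to grow, here is a hedged sketch (a generic two-group power calculation with assumed effect sizes, not the specific procedure of Simonsohn, 2015): because exploratory estimates are often inflated by selection, the confirmation should be powered for a smaller, more plausible effect.

```python
# Sketch: per-group sample size for 80% power at alpha = 0.05 (two-sample t-test),
# using assumed effect sizes; original estimates are often inflated by selection.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
d_reported  = 0.8   # effect size reported by the exploratory study (assumed)
d_plausible = 0.5   # smaller effect the confirmation should still detect (assumed)

n_exploratory  = power_calc.solve_power(effect_size=d_reported,  alpha=0.05, power=0.8)
n_confirmatory = power_calc.solve_power(effect_size=d_plausible, alpha=0.05, power=0.8)
print(round(n_exploratory), round(n_confirmatory))   # ~26 vs ~64 per group
```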
Both exploration and confirmation need to be of high internal validity. This means that they need to effectively control biases, use validated reagents and biologicals, and so on. However, generalizability is of greater importance in confirmation than in exploration; confirmation therefore also needs to be of high external validity. Multi‐laboratory studies increase external validity and reduce the burden of increased sample sizes (Llovera et al, 2015). Importantly, as exploration aims at finding what might work or be “true”, the Type II error level needs to be low and therefore statistical power high. Conversely, as confirmation tries to weed out the false positives, the Type I error (false positives) is the major concern. These are but a few of the idiosyncrasies of exploratory and confirmatory investigation.
A culture of reproducibility
Beyond such technicalities, we need a culture change with respect to research reproducibility. We must reinstate reproducibility to its position as a cornerstone of science, conceptually inseparable from exploration. This implies educating our students better about its role and complexities. We need to fund and incentivize those individuals and teams that confirm (or refute) the findings of others. We should not stigmatize those whose findings could not be confirmed. On the contrary, competent work that others regarded as relevant enough to attempt to replicate should be held in high esteem, even if the results could not be confirmed. Journals might even create a new type of article that ties exploration to an independent, statistically rigorous confirmation. Combining the flexibility of basic research with the rigor of clinical trials would be particularly advantageous for animal studies of disease therapies (Mogil & Macleod, 2017).
Reproducibility is not an end in itself. Trivial findings may be highly reproducible, and non‐reproducible results may actually be true. But the new knowledge mapped through exploration is only useful if it is robust and reproducible. We need to find a way to archive failed replication attempts in a systematic and transparent manner, linked to the studies that they set out to replicate. Let us rethink research reproducibility in education, scientific practice, publishing, funding, and how we incentivize and reward researchers.
Conflict of interest
The author declares that he has no conflict of interest.
References
- Baker M (2016) 1,500 scientists lift the lid on reproducibility. Nature 533: 452–454
- Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14: 365–376
- Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of P values. R Soc Open Sci 1: 140216
- Goodman SN (1992) A comment on replication, P‐values and evidence. Stat Med 11: 875–879
- Goodman SN, Fanelli D, Ioannidis JPA (2016) What does research reproducibility mean? Sci Transl Med 8: 341ps12
- Llovera G, Hofmann K, Roth S, Salas‐Pérdomo A, Ferrer‐Ferrer M, Perego C, Zanier ER, Mamrak U, Rex A, Party H, Agin V, Fauchon C, Orset C, Haelewyn B, De Simoni M‐G, Dirnagl U, Grittner U, Planas AM, Plesnila N, Vivien D et al (2015) Results of a preclinical randomized controlled multicenter trial (pRCT): anti‐CD49d treatment for acute brain ischemia. Sci Transl Med 7: 299ra121
- Macleod MR, Michie S, Roberts I, Dirnagl U, Chalmers I, Ioannidis JPA, Salman RA‐S, Chan A‐W, Glasziou P (2014) Biomedical research: increasing value, reducing waste. Lancet 383: 101–104
- Mogil JS, Macleod MR (2017) No publication without confirmation. Nature 542: 409–411
- Popper K (1935) Logik der Forschung. Berlin: Springer
- Simonsohn U (2015) Small telescopes: detectability and the evaluation of replication results. Psychol Sci 26: 559–569
- Yarborough M, Bredenoord A, D'Abramo F, Joyce NC, Kimmelman J, Ogbogu U, Sena E, Strech D, Dirnagl U (2018) The bench is closer to the bedside than we think: uncovering the ethical ties between preclinical researchers in translational neuroscience and patients in clinical trials. PLoS Biol 16: e2006343