Abstract
The purported reproducibility crisis has triggered both studies analyzing the problem and proposals to address it. However, it is important not to stifle exploratory, cutting‐edge research.
Subject Categories: Molecular Biology of Disease, S&S: Economics & Business, S&S: Health & Disease
The debate over a reproducibility crisis has been simmering for years now, amplified by growing concerns over a number of reproducibility studies that have failed to replicate previous positive results. Additional evidence from large meta‐analyses of past papers also points to a lack of reproducibility in biomedical research, with potentially dire consequences for drug development and investment into research. One of the largest meta‐analyses concluded that low levels of reproducibility, at best around 50% of all preclinical biomedical research, were delaying lifesaving therapies, increasing pressure on research budgets and raising the costs of drug development [1]. The paper claimed that about US$28 billion a year was spent largely fruitlessly on preclinical research in the USA alone.
A problem of statistics
However, the assertion that a 50% level of reproducibility equates to a crisis, or that many of the original studies were really fruitless, has been disputed by some specialists in replication. “A 50% level of reproducibility is generally reported as being bad, but that is a complete misconstrual of what to expect”, commented Jeffrey Mogil, who holds the Canada Research Chair in Genetics of Pain at McGill University in Montreal. “There is no way you could expect 100% reproducibility, and if you did, then the studies could not have been very good in the first place. If people could replicate published studies all the time then they could not have been cutting edge and pushing the boundaries”.
One reason not to expect 100% reproducibility in preclinical studies is that cutting‐edge or exploratory research deals with a great deal of uncertainty and with competing hypotheses, of which only a few can be correct. After all, there would be no need to conduct experiments at all if the outcome were completely predictable. For that reason, an initial preclinical study cannot establish a hypothesis as absolutely false or true, but must rely on the weight of the evidence, usually with a significance test as the tiebreaker. The interpretation of experiments typically treats a P‐value below 0.05 as the gold standard for statistical significance, which creates a sharp but somewhat arbitrary cut‐off. It means that if a study's results fall only just on the significant side of that cut‐off, a replication has a substantial probability of failing to confirm them, explained Malcolm Macleod, who specializes in meta‐analysis of animal studies of neurological diseases at Edinburgh University in the UK. “A replication of a study that was significant just below P = 0.05, all other things being equal and the null hypothesis being indeed false, has only a 50% chance to again end up with a ‘significant’ P‐value on replication”, said Macleod. “So many of the so‐called ‘replication studies’ may have been false negatives”.
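A back‐of‐the‐envelope simulation makes the arithmetic behind Macleod's point concrete. The sketch below uses purely hypothetical numbers, not figures from any of the studies discussed here: 50 animals per group and a two‐sided test at P < 0.05, with the true effect assumed to be exactly the size that places a study on the significance threshold. Under those assumptions, a same‐sized replication reaches “significance” only about half the time.

```python
import numpy as np

# Illustrative simulation (hypothetical numbers, not from the article):
# assume the true group difference is exactly the size that puts a two-sample
# comparison right on the two-sided P = 0.05 threshold, then ask how often a
# same-sized replication comes out "significant" again.
rng = np.random.default_rng(0)
n = 50                        # animals per group (assumed)
sigma = 1.0                   # within-group standard deviation (assumed)
z_crit = 1.96                 # two-sided alpha = 0.05
se = sigma * np.sqrt(2 / n)   # standard error of the difference in means
true_diff = z_crit * se       # true effect sitting exactly on the threshold

n_sims = 100_000
hits = 0
for _ in range(n_sims):
    a = rng.normal(0.0, sigma, n)
    b = rng.normal(true_diff, sigma, n)
    z = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    hits += abs(z) > z_crit

print(f"Share of replications reaching P < 0.05: {hits / n_sims:.2f}")  # roughly 0.50
```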
… the assertion that a 50% level of reproducibility equates to a crisis, or that many of the original studies were really fruitless, has been disputed by some specialists in replication
For this reason, replication studies need even greater statistical power than the original, Macleod argued, since their purpose is to confirm or refute previous results. They need to have “higher n's” than the original studies; otherwise, the replication study is no more likely to be correct than the original.
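The same hypothetical numbers can be used to put a rough figure on what “higher n's” means in practice. The sketch below is a normal‐approximation power calculation under assumed values, not a prescription from Macleod: an original study of 50 animals per group that was only just significant implies an observed standardized effect of roughly 0.4, and a replication powered at 90% to detect that same effect would need well over twice the original sample size per group.

```python
import math
from scipy.stats import norm

# Rough power arithmetic with the same hypothetical figures as above
# (50 per group, two-sided alpha = 0.05): a just-significant original study
# implies an observed standardized effect of about 0.4, and a replication
# powered at 90% for that same effect needs a much larger sample.
alpha, power = 0.05, 0.90
n_orig = 50
z_a = norm.ppf(1 - alpha / 2)         # about 1.96
z_b = norm.ppf(power)                 # about 1.28

d_obs = z_a * math.sqrt(2 / n_orig)   # effect size right at the threshold, about 0.39

# normal-approximation sample size per group for a two-sample comparison
n_rep = 2 * ((z_a + z_b) / d_obs) ** 2
print(f"Observed effect size d = {d_obs:.2f}")
print(f"Replication needs about {math.ceil(n_rep)} animals per group vs. {n_orig} originally")
```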
This leads to a fundamental problem for the life sciences, especially preclinical research: a huge vested interest in positive results has militated against replication. Authors have grants and careers at stake, journals need strong stories to generate headlines, pharmaceutical companies have invested large amounts of money in positive results, and patients yearn for new therapies. There is also a divergence of interest between different parties in the overall research and development pipeline. Preclinical researchers need freedom to explore the borders of knowledge, while clinical researchers rely on replication to weed out false positives.
To address this dichotomy, Mogil and Macleod have proposed a new strategy for conducting health‐relevant studies. “Malcolm is studying clinical trials and replicability itself, while I'm at the preclinical end of the spectrum, so our needs and takes on the problem are not the same but our analysis of the potential solution is very similar”, commented Mogil. They suggest a three‐stage process to publication whereby the first stage allows for exploratory studies that generate or support hypotheses away from the yoke of statistical rigour, followed by a second confirmatory study, performed with the highest levels of rigour by an independent laboratory. A paper would then only be published after successful completion of both stages. A third stage, involving multiple centres, could then create the foundation for human clinical trials to test new drug candidates or therapies.
“The idea of this compromise is that I get left alone to fool around and not get every single preliminary study powered to statistical significance, with a lot of waste in money and time”, Mogil explained. “But then at some point I have to say ‘I've fooled around enough that I'm so convinced by my hypothesis that I'm willing to let someone else take over’”. Mogil is aware that this would require the establishment and funding of a network of laboratories to perform confirmatory or replication studies. “I think this is a perfect thing for funding agencies, so I'm trying to get the NIH (National Institutes of Health) to give a consortium of pain labs a contract”, he added.
Reproducibility projects
There have been various projects to reproduce results, but these merely helped to define the scale of the problem rather than provide solutions, according to Mogil. The two most prominent reproducibility projects, one for psychology and one for cancer research, were set up by the Center for Open Science, a non‐profit organization founded in 2013 to “increase the openness, integrity, and reproducibility of scientific research”.
The cancer study reported its results in January 2017 [2], but it raised as many questions as it answered. Two out of five studies “substantially reproduced” the original findings, although not all experiments met the threshold of statistical significance. Two others yielded “uninterpretable results”, and one failed to replicate a paper by Erkki Ruoslahti, from the cancer research centre at Sanford Burnham Prebys Medical Discovery Institute in La Jolla, California. His original study had identified a peptide that appears to help anti‐cancer drugs penetrate tumours [3].
… replication studies need even greater statistical power than the original, Macleod argued, given that the reason for doing them is to confirm or refute previous results.
Ruoslahti has been hotly disputing the results of that replication, arguing that it was a limited study comprising just a single experiment and that the associated meta‐analysis ignored previous reproduction of his results by three generations of postdocs. “I do disagree with the idea of reproducibility studies”, he said. “If only one experiment is done without any troubleshooting, the result is a tossup. Anything more extensive would be prohibitively costly. So many things can go wrong in an experiment done by someone for the first time. Instead I think we should let the scientific process run its course. Findings that are not correct will disappear because others can't reproduce them or publish divergent results, after an adequate try and hopefully also explaining why the results are different”. Ruoslahti has received support from Tim Errington, manager of the Center for Open Science's cancer reproducibility project, who agreed that a single failure to replicate should not invalidate a paper.
Method descriptions and biology
Nonetheless, the cancer reproducibility project highlighted a wider problem: that experimental methods or environmental conditions are often not reported in sufficient detail to recreate the original set‐up accurately. Indeed, the most obvious conclusion was that many papers provide too little detail about their methods, according to Errington. As a result, replication teams have to devote many hours to chasing down protocols and reagents, which had often been developed by students or postdocs no longer with the team.
The exposure of such discrepancies is itself a positive result from the replication study, Errington asserted, and it has sparked efforts to make experiments more repeatable. “The original authors, just like all scientists, are focused on conducting their current research projects, writing up their results for publication, and writing grants and job applications for advancement”, Errington noted. “Digging through old lab notebooks to find what was previously published is not as high a priority. This points to a gap that can be filled by making this information readily available to complement the publication at the time of publication. We demonstrate a way to do this with each Replication Study, where the underlying methods/data/analysis scripts are made available using https://osf.io. And unique materials that are not available can be made available for the research community to reuse, for replication or new investigations”.
Another major factor that can cause replication to fail is the biology itself. By way of example, the effect of a drug might depend on the particular metabolic or immunological state of an animal, asserted Hanno Würbel from the Division of Animal Welfare at the University of Bern in Switzerland, who has a longstanding interest in reproducibility in research. “If a treatment effect, for example a drug effect, is conditional on some phenotypic characteristics, such as the drug only working under conditions of stress, then it seems inappropriate to speak of a ‘failed’ replication. In that case both study outcomes would be ‘true’ within a certain range of conditions”, he explained. Nonetheless, discrepancies between original and replication studies could indeed enrich research. “Provided all studies were done well, different outcomes of replicate studies would be informative in telling us that conditions matter, and that we need to search further to establish the range of conditions under which a given treatment works”, Würbel said.
Preclinical researchers need freedom to explore the borders of knowledge, while clinical researchers rely on replication to weed out false positives.
Another related issue is the high level of standardization intended to make results as generally valid and reproducible as possible. But, as Würbel emphasized, this can actually have the opposite effect. “The standard approach to evidence generation in preclinical animal research are single‐laboratory studies conducted under highly standardized conditions, often for both the genotype of the animals and the conditions under which they are reared, housed and tested”, he said. “Because of this, you can never know for sure whether a study outcome has or hasn't got external validity. If you think about it, this means that replication studies are inherently required by the very nature of the standard approach to preclinical animal research. Yet results of single‐laboratory studies conducted under highly standardized conditions are still being sold, that is still getting published, as if the results were externally valid and reproducible, but without any proof. And then people are surprised when replication studies ‘fail’”.
Rodent animal models in particular have been highly standardized as inbred strains with the aim of making results more repeatable by eliminating genetic differences. But this also means that results cannot readily be generalized, and that different strains can yield different results. This has long been appreciated in some fields, such as ageing research, where genetic differences can have a huge impact. Steve Austad at the University of Michigan, USA, realized as early as 1999 that relying on genetically homogeneous rodents for ageing research often led to conclusions that tended to reflect strain‐specific idiosyncrasies. He therefore advocated the development of pathogen‐free stocks from wild‐trapped progenitors for the study of ageing and late‐life pathophysiology [4]. This work has since inspired calls for greater genetic diversity among laboratory rodents used in preclinical research on drug development and ageing.
Publishing negative and confirmation studies
However, perhaps the biggest elephant in the room is publication bias towards positive results and away from the null hypothesis. This affects replication insofar as not just negative but also confirmation studies tend not to get reported. This distortion can also make failure to replicate more likely by encouraging false positives in the first place. Macleod therefore urges the whole research community to adopt a more upbeat approach to null results, as these can be just as valuable as positive ones. “Maybe we should think of studies as means to provide information. If a study provides information, non‐replication is on par with initial reports”, he remarked. “In any case, replication, if well done, and regardless of whether results are in agreement or at variance with the original study, adds information that can be aggregated with the initial study and furthers our evidence”.
… experimental methods or environmental conditions are often not reported in sufficient detail to recreate the original set up accurately
There has been some recognition of the need to promote null results, notably through the Journal of Negative Results in Biomedicine (JNRBM). Surprisingly, BioMed Central is scheduled to cease publishing the journal in September, on the grounds that its mission has been accomplished. The publisher argues that results which would previously have remained unpublished are now appearing in other journals. Many, though, would contend that null results are still greatly underrepresented in the literature and that there is a shortage of both resources and motivation for replication studies in general.
On that front, there are some promising initiatives though, such as StudySwap, a platform hosted by the Center for Open Science, to help biologists find suitable collaborators for replication studies. “StudySwap allows scholars from all over the world to replicate effects before they ever get published”, explained Martin Schweinsberg, assistant professor of organizational behaviour at the European School of Management and Technology, Berlin. “Such pre‐publication independent replications (PPIRs) are important because they allow scientists to better understand the phenomenon they're studying before they submit a paper for publication”. According to Christopher Chartier from the Department of Psychology at Ashland University, Ohio, USA, a common use of StudySwap will be for two or more researchers concurrently collecting data at several different sites to combine the samples for analysis. “These types of study would result in larger and more diverse samples than any of the individuals”, he remarked. However, StudySwap is currently run by volunteers and is not yet geared up for large‐scale replication work. This will require active support from major funding agencies, and there are welcome signs of this happening, according to Brian Nosek, executive director of the Center for Open Science. “For example, the NWO (Netherlands Organisation for Scientific Research) has a 3 million Euro funding line for replications of important discoveries”, he said.
The role of journals and funders
Journals also have an important role to play, Nosek added, by providing incentives for replications. He highlighted the TOP Guidelines (http://cos.io/top/) from the Center for Open Science, which specify replication as one of their eight key standards for advancing transparency in research and publication. “TOP is gaining traction across research communities with about 3,000 journal signatories so far”, said Nosek. “COS also offers free training services for doing reproducible research, and fosters adoption of incentives that can make research more open and reproducible, such as badges to acknowledge open practices”. Another COS initiative called Registered Reports (http://cos.io/rr/) promotes a publishing model where peer review is conducted prior to the outcomes of the research being known.
Many […] would contend that null results are still greatly underrepresented in the literature, and that there is a shortage of both resources and motivation for replication studies in general
There is no shortage of projects focusing on replication, and there are signs of funding bodies devoting resources, as in the Netherlands. “This is changing, and research assessment structures are I think open to measures beyond the grants in/papers out approach”, Macleod said. “The real challenge is with individual institutions and their policy and practices around academic promotion, tenure, and so on, which are still largely wedded to outdated measures such as Impact Factor and author position”.
Journals in particular have a responsibility and could help by changing outdated methods of reward and insisting on more detailed method descriptions, according to Matt Hodgkinson, Head of Research Integrity at Hindawi, one of the largest open‐access journal publishers. “Incentive structures now reward publication volume and being first”, he said. “Citations are currently counted and more is considered better for the authors and journal, which can perversely reward controversial findings that fail to replicate. Instead funders and institutions should reward quality of reporting, replicability, and collaborations”. Hodgkinson added that journals should also abandon space constraints on the methods sections to allow authors to describe the experimental procedures and conditions in much more detail. In fact, a growing number of journals and publishers encourage authors to provide more details on experiments and to format their methods section so as to make it easier for other researchers to reproduce their results.
Whatever measures are taken to improve the reproducibility of biomedical research, they will have to involve all actors, from researchers and funding agencies to journals and, eventually, the commercial players that are the main customers of academic research. In addition, improvement may also require new ways to conduct research and validate experimental data.
References
- 1. Freedman LP, Cockburn IM, Simcoe TS (2015) The economics of reproducibility in preclinical research. PLoS Biol 13: e1002165
- 2. Baker M, Dolgin E (2017) Reproducibility project yields muddy results. Nature 541: 269–270
- 3. Sugahara KN, Teesalu T, Karmali PP, Kotamraju VR, Agemy L, Greenwald DR, Ruoslahti E (2010) Coadministration of a tumor‐penetrating peptide enhances the efficacy of cancer drugs. Science 328: 1031–1035
- 4. Miller RA, Austad S, Burke D, Chrisp C, Dysko R, Galecki A, Jackson A, Monnier V (1999) Exotic mice as models for aging research: polemic and prospectus. Neurobiol Aging 20: 217–231