Author manuscript; available in PMC: 2015 Oct 1.
Published in final edited form as: J Bone Miner Res. 2014 Oct;29(10):2131–2140. doi: 10.1002/jbmr.2293

Reproducibility of Results in Preclinical Studies: A Perspective From the Bone Field

Stavros C Manolagas 1, Henry M Kronenberg 2
PMCID: PMC4356005  NIHMSID: NIHMS668478  PMID: 24916175

Abstract

The biomedical research enterprise—and the public support for it—is predicated on the belief that discoveries and the conclusions drawn from them can be trusted to build a body of knowledge which will be used to improve human health. As in all other areas of scientific inquiry, knowledge and understanding grow by layering new discoveries upon earlier ones. The process self-corrects and distills knowledge by discarding false ideas and unsubstantiated claims. Although self-correction is inexorable in the long-term, in recent years biomedical scientists and the public alike have become alarmed and deeply troubled by the fact that many published results cannot be reproduced. The chorus of concern reached a high pitch with a recent commentary from the NIH Director, Francis S. Collins, and Principal Deputy Director, Lawrence A. Tabak, and their announcement of specific plans to enhance reproducibility of preclinical research that relies on animal models. In this invited perspective, we highlight the magnitude of the problem across biomedical fields and address the relevance of these concerns to the field of bone and mineral metabolism. We also suggest how our specialty journals, our scientific organizations, and our community of bone and mineral researchers can help to overcome this troubling trend.

Keywords: GENETIC ANIMAL MODELS, BONE HISTOMORPHOMETRY, BONE QCT/MICROCT

NIH Proposals to Enhance Reproducibility of Preclinical Research

In recent years, a growing number of studies have documented concerns about a surprising lack of reproducibility in scientific studies, particularly studies involving animals that have relevance to human biology and disease (called “preclinical studies” here). NIH Director, Francis S. Collins, and Principal Deputy Director, Lawrence A. Tabak, have responded to these concerns with an article and a series of proposals.(1) Here we will consider the context of the Collins-Tabak article, their proposals, and the relevance of these concerns to the bone field.

In their announcement of specific plans to enhance reproducibility of preclinical research that relies on animal models, Collins and Tabak(1) stated at the outset that scientific misconduct is a rare cause of irreproducibility in such studies. However, they identified several other factors commonly contributing to the lack of reproducibility; among them, poor training of investigators in experimental design, increased emphasis on making provocative statements, overinterpretation of results, designing experiments to uncover new avenues of inquiry rather than to provide definitive proof for any single question, and insufficient reporting or withholding of technical details. Pointedly, they blamed funding agencies for uncritically encouraging the overvaluation of research published in high-profile journals; academic centers granting promotion, tenure, and even cash rewards to investigators publishing in such journals; and scientific publishers for exacerbating the problem.

To deal with the crisis, Collins and Tabak announced immediate and substantive actions. First, the NIH will develop a training module on enhancing reproducibility and transparency of research findings with an emphasis on good experimental design. Second, the NIH will institute the use of checklists for reviewers on NIH panels and the assignment of at least one reviewer on each panel with the specific task to vigorously evaluate the “scientific premise” of grant applications, especially when a costly human clinical trial is proposed based on animal experimentation. Third, the NIH will explore means to provide greater transparency, including the creation of a data discovery index (DDI) to allow public access to primary data and give credit to the owner of the primary data set. In December 2013, the NIH also launched an online forum called “PubMed Commons” (http://www.ncbi.nlm.nih.gov/pubmedcommons/) to encourage commentary and discourse about published articles. To deal more specifically with the shortcomings of the academic incentive system and the excessive emphasis on publishing in high profile journals, the NIH is contemplating changes in the format of the biographical sketch form to emphasize the significance of advances resulting from the work of the applicant and his or her specific role. Additionally, the NIH is considering grant mechanisms to allow more flexibility and a longer period of funding than the current average of 4 years, and anonymizing the peer review process.

Last, but not least, Collins and Tabak asked that the research community, publishers, universities, industry, professional organizations, and patient-advocacy groups take steps to reset the self-corrective process of scientific inquiry. In response to the same concerns, major scientific journals, such as Nature and Science, announced changes in their practices, including abolishing length limits on the methods section and adopting checklists for editors and reviewers to ensure that critical experimental design features are incorporated into reports, along with presentation of more primary data.(2,3)

How Did We Get Here?

In an influential essay published in 2005, John Ioannidis,(4) a Stanford epidemiologist who focuses on health research and policy, concluded that most published research findings are likely to be false. Using an analysis of the positive predictive value (PPV) associated with a study, he showed that the probability that a research claim is true depends on the power and bias of the study, the number of other studies on the same question, and, more importantly, the ratio of true to no relationships among the thousands and millions of relationships that may be hypothesized and postulated in each scientific field. He concluded that a “research finding is less likely to be true when the studies conducted in a field are smaller, when effect sizes are smaller, when there is a greater number and lesser pre-selection of tested relationships, where there is greater flexibility in designs, definitions, outcomes, and analytical modes, when there is greater financial and other interest and prejudice, and when more teams are involved in a scientific field in chase of statistical significance.”(4) In a different article, Ioannidis and Trikalinos(5) coined the term “Proteus phenomenon” to describe the rollercoaster of alternating extreme research claims and extreme opposite refutations, a common event particularly in early studies of genetic markers of human disease.
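Ioannidis' central formula is simple enough to state directly: if R is the prior odds that a tested relationship is true, then PPV = (1 − β)R / ((1 − β)R + α). The sketch below is our own illustration of that relationship (the parameter values are arbitrary choices, not figures from the paper):

```python
def positive_predictive_value(r, power, alpha):
    """Ioannidis' PPV: the probability that a statistically significant
    finding reflects a true relationship.

    r     -- prior odds that a tested relationship is true (R in the paper)
    power -- 1 - beta, probability of detecting a true relationship
    alpha -- significance threshold (false-positive rate under the null)
    """
    return (power * r) / (power * r + alpha)

# A well-powered study in a field where 1 in 10 tested hypotheses is true:
print(positive_predictive_value(0.1, 0.8, 0.05))  # ~0.62

# The same field with underpowered studies: most "positive" findings are false.
print(positive_predictive_value(0.1, 0.2, 0.05))  # ~0.29
```

Note how the PPV collapses as power falls or as the field tests ever more speculative hypotheses (smaller r), exactly the corollaries Ioannidis lists.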

In a large and comprehensive survey of the quality of the experimental design of preclinical studies, funded by the UK’s National Centre for the Replacement, Refinement, and Reduction of Animals in Research and the NIH’s Office of Laboratory Animal Welfare, published in 2009, the authors collected data from 271 papers reporting original studies carried out in UK and U.S. publicly-funded research institutions.(6) Only 59% of the papers had stated the hypothesis or objective of the study and the number and characteristics of the animals used. Eighty-seven percent did not use randomization or blinding, and only 70% of the publications that used statistical methods described their methods and presented results with measures of error or variability. As a result of this survey, a set of guidelines consisting of a checklist of 20 items, referred to as Animal Research: Reporting of In Vivo Experiments (ARRIVE), was developed in consultation with scientists, statisticians, journal editors (including Nature Cell Biology and Science), and several funding agencies, such as the UK Medical Research Council, Wellcome Trust, and the Royal Society(7) (Table 1). Sadly, 2 years after the announcement of the ARRIVE guidelines, there was very little improvement in reporting standards, suggesting that authors, referees, and editors, in general, are ignoring the guidelines.(8)

Table 1.

ARRIVE Guidelines for Reporting In Vivo Animal Experiments

Item Recommendation
Title 1 Provide as accurate and concise a description of the content of the article as possible.
Abstract 2 Provide an accurate summary of the background, research objectives (including details of the species or strain of animal used), key methods, principal findings, and conclusions of the study.
Introduction
 Background 3
  1. Include sufficient scientific background (including relevant references to previous work) to understand the motivation and context for the study, and explain the experimental approach and rationale.

  2. Explain how and why the animal species and model being used can address the scientific objectives and, where appropriate, the study’s relevance to human biology.

 Objectives 4 Clearly describe the primary and any secondary objectives of the study, or specific hypotheses being tested.
Methods
 Ethical statement 5 Indicate the nature of the ethical review permissions, relevant licenses (eg, Animal [Scientific Procedures] Act 1986), and national or international guidelines for care and use of animals, that cover the research.
 Study design 6 For each experiment, give brief details of the study design, including:
  1. The number of experimental and control groups.

  2. Any steps taken to minimize the effects of subjective bias when allocating animals to treatment (eg, randomization procedure) and when assessing results (eg, if done, describe who was blinded and when).

  3. The experimental unit (eg, a single animal, group, or cage of animals).

    A time-line diagram or flow chart can be useful to illustrate how complex study designs were carried out.

 Experimental procedures 7 For each experiment and each experimental group, including controls, provide precise details of all procedures carried out. For example:
  1. How (eg, drug formulation and dose, site and route of administration, anesthesia and analgesia used [including monitoring], surgical procedure, method of euthanasia). Provide details of any specialist equipment used, including supplier(s).

  2. When (eg, time of day).

  3. Where (eg, home cage, laboratory, water maze).

  4. Why (eg, rationale for choice of specific anesthetic, route of administration, drug dose used).

 Experimental animals 8
  1. Provide details of the animals used, including species, strain, sex, developmental stage (eg, mean or median age plus age range), and weight (eg, mean or median weight plus weight range).

  2. Provide further relevant information such as the source of animals, international strain nomenclature, genetic modification status (eg, knockout or transgenic), genotype, health/immune status, drug-naïve or test-naïve, previous procedures, etc.

 Housing and husbandry 9 Provide details of:
  1. Housing (eg, type of facility, eg, specific pathogen free [SPF]; type of cage or housing; bedding material; number of cage companions; tank shape and material, etc. for fish).

  2. Husbandry conditions (eg, breeding program, light/dark cycle, temperature, quality of water, etc. for fish, type of food, access to food and water, environmental enrichment).

  3. Welfare-related assessments and interventions that were carried out before, during or after the experiment.

 Sample size 10
  1. Specify the total number of animals used in each experiment and the number of animals in each experimental group.

  2. Explain how the number of animals was decided. Provide details of any sample size calculation used.

  3. Indicate the number of independent replications of each experiment, if relevant.

 Allocating animals to experimental groups 11
  1. Give full details of how animals were allocated to experimental groups, including randomization or matching if done.

  2. Describe the order in which the animals in the different experimental groups were treated and assessed.

 Experimental outcomes 12 Clearly define the primary and secondary experimental outcomes assessed (eg, cell death, molecular markers, behavioral changes).
 Statistical methods 13
  1. Provide details of the statistical methods used for each analysis.

  2. Specify the unit of analysis for each dataset (eg, single animal, group of animals, single neuron).

  3. Describe any methods used to assess whether the data met the assumptions of the statistical approach.

Results
 Baseline data 14 For each experimental group, report relevant characteristics and health status of animals (eg, weight, microbiological status, and drug-naïve or test-naïve) before treatment or testing (this information can often be tabulated).
 Numbers analyzed 15
  1. Report the number of animals in each group included in each analysis. Report absolute numbers (eg, 10/20, not 50%).

  2. If any animals or data were not included in the analysis, explain why.

 Outcomes and estimation 16 Report the results for each analysis carried out, with a measure of precision (eg, standard error or confidence interval).
 Adverse events 17
  1. Give details of all important adverse events in each experimental group.

  2. Describe any modifications to the experimental protocols made to reduce adverse events.

Discussion

 Interpretation/scientific implications 18
  1. Interpret the results, taking into account the study objectives and hypotheses, current theory, and other relevant studies in the literature.

  2. Comment on the study limitations including any potential sources of bias, any limitations of the animal model, and the imprecision associated with the results.

  3. Describe any implications of your experimental methods or findings for the replacement, refinement, or reduction (the 3Rs) of the use of animals in research.

 Generalizability/translation 19 Comment on whether, and how, the findings of this study are likely to translate to other species or systems, including any relevance to human biology.
 Funding 20 List all funding sources (including grant number) and the role of the funder(s) in the study.
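Item 10 of the checklist above asks authors to explain how group sizes were decided. For the common case of a two-sided, two-sample comparison of means, a standard normal-approximation power calculation looks like the following (an illustrative sketch of ours, not part of the guidelines; the exact t-based answer is slightly larger):

```python
import math
from statistics import NormalDist

def animals_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate animals per group for a two-sided, two-sample
    comparison of means, where effect_size is the expected difference
    divided by the SD (Cohen's d).  Normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # e.g. 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

print(animals_per_group(1.0))  # 16 per group to detect a 1-SD effect
print(animals_per_group(0.5))  # 63 per group to detect a 0.5-SD effect
```

The quadrupling of the required sample size when the effect size halves illustrates why underpowered studies are so common, and why item 10 asks for the calculation to be reported rather than merely performed.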

A different survey of 100 articles published in Cancer Research in 2011 found that only 20% of the papers reported that the animals were randomly allocated to treatment groups, just 2% reported that the observers were blinded, and none reported the methods used to determine the number of animals per group.(9) Serious methodological shortcomings that could introduce bias were also revealed in the analysis of several hundred studies reporting findings from animal models of stroke, Parkinson’s disease, and multiple sclerosis.(10) The situation was as grim in a review of 76 high-impact articles (with more than 500 citations) that lacked the methodological information needed for an informed evaluation of the results.(11)

Poor reporting of data and replication problems notwithstanding, it has also become painfully clear that “the gold standards of statistical validity are not as reliable as many scientists assume.”(12–16) In an article in Nature with the exact quotation in the title, Regina Nuzzo(16) cites the analysis of Charles Lambdin(17) on how statistics and p values have become a numbers game and the tool of “‘a sterile intellectual rake’ who ravishes science but leaves it with no progeny.” Nuzzo(16) warns readers that p values are of little help when one ignores another, very critical piece of information: the chance that the effect was there in the first place. In other words, the more implausible the hypothesis, the more likely it is that an exciting finding is a false alarm, irrespective of what the p value says (Fig. 1). Nuzzo’s paper reinforces the message of the Ioannidis paper.

Fig. 1.

Fig. 1

The limited relevance of p values in the face of implausible hypotheses. Reproduced from Nuzzo R., Nature. 2014; 506:150–2.
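This point can be checked with a short simulation: when only a small fraction of tested hypotheses are true, a large share of the results that clear p < 0.05 are false alarms. The prior, power, and alpha below are illustrative choices of ours, not figures from Nuzzo's article:

```python
import random

random.seed(1)

TRIALS, PRIOR_TRUE, POWER, ALPHA = 100_000, 0.1, 0.8, 0.05

false_hits = true_hits = 0
for _ in range(TRIALS):
    if random.random() < PRIOR_TRUE:
        # True effect: detected with probability equal to the study's power.
        if random.random() < POWER:
            true_hits += 1
    else:
        # No effect: under the null, p < ALPHA occurs with probability ALPHA.
        if random.random() < ALPHA:
            false_hits += 1

false_discovery = false_hits / (false_hits + true_hits)
print(f"Share of 'significant' results that are false: {false_discovery:.0%}")
```

With these assumptions roughly a third of the significant findings are false, even though every one of them individually passed the conventional significance test.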

In 2012, the U.S. National Institute of Neurological Disorders and Stroke convened academic researchers and educators, reviewers, journal editors, and representatives from funding agencies, disease advocacy communities, and the pharmaceutical industry to discuss the causes of deficient reporting. The results of these deliberations were distilled into “a core set of reporting standards for rigorous study design” published in Nature under the title, “A call for transparent reporting to optimize the predictive value of preclinical research”(18) (Table 2). Further, a recent series of articles in the January 11, 2014 issue of The Lancet provides thoughtful suggestions from different groups of experts on how to increase the value of biomedical research and reduce waste.(19–23)

Table 2.

Set of Reporting Standards for Rigorous Study Design

Randomization
  • Animals should be assigned randomly to the various experimental groups, and the method of randomization reported.

  • Data should be collected and processed randomly or appropriately blocked.

Blinding
  • Allocation concealment: the investigator should be unaware of the group to which the next animal taken from a cage will be allocated.

  • Blinded conduct of the experiment: animal caretakers and investigators conducting the experiments should be blinded to the allocation sequence.

  • Blinded assessment of outcome: investigators assessing, measuring, or quantifying experimental outcomes should be blinded to the intervention.

Sample-size estimation
  • An appropriate sample size should be computed when the study is being designed and the statistical method of computation reported.

  • Statistical methods that take into account multiple evaluations of the data should be used when an interim evaluation is carried out.

Data handling
  • Rules for stopping data collection should be defined in advance.

  • Criteria for inclusion and exclusion of data should be established prospectively.

  • How outliers will be defined and handled should be decided when the experiment is being designed, and any data removed before analysis should be reported.

  • The primary end point should be prospectively selected. If multiple end points are to be assessed, then appropriate statistical corrections should be applied.

  • Investigators should report on data missing because of attrition or exclusion.

  • Pseudo replicate issues need to be considered during study design and analysis.

  • Investigators should report how often a particular experiment was performed and whether results were substantiated by repetition under a range of conditions.
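The randomization and allocation-concealment standards above are straightforward to implement. One common approach is to generate (and archive) the complete allocation sequence before any animal is handled, so that the person taking the next animal from a cage has no influence over its group. A generic sketch, not tied to any particular study:

```python
import random

def randomize(animal_ids, groups, seed=20140101):
    """Randomly allocate animals to equally sized groups.

    Generating the full sequence up front, with a recorded seed, makes the
    allocation both reproducible and concealed from the experimenter at the
    moment each animal is assigned.
    """
    ids = list(animal_ids)
    if len(ids) % len(groups):
        raise ValueError("group sizes would be unequal")
    random.Random(seed).shuffle(ids)
    size = len(ids) // len(groups)
    return {g: sorted(ids[i * size:(i + 1) * size])
            for i, g in enumerate(groups)}

# 20 animals, two groups of 10, allocation fixed before the experiment begins.
allocation = randomize(range(1, 21), ["vehicle", "treatment"])
print(allocation)
```

Reporting the seed and the method (per the standards above) lets a reader or auditor regenerate the exact allocation.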

A Glance at the Lay Public Perspective

Not surprisingly, articles in nonscientific publications, such as The New York Times, have picked up the theme and describe the situation as “science is in crisis, just when we needed it most.”(24) Wisely, the editorialist of this particular article, titled “Scientific Pride and Prejudice,” points out that “a researcher cannot separate in advance the productive prejudices that enable understanding from the prejudices that hinder it.” Another editorial in The Economist, titled “Unreliable research: Trouble at the lab,”(25) concludes with a quote from Bruce Alberts—a former editor of Science: “[scientists themselves] need to develop a value system where simply moving on from one’s mistakes without publicly acknowledging them severely damages, rather than protects, a scientific reputation.”

Closer to Home

Methodological and conceptual advances have led to enormous progress in the bone field over the last couple of decades. Indeed, modern mouse genetics has led to an explosion of information about bone biology, both in the basic and clinical realms, and has changed entirely the criteria for validity of claims. It is no longer enough to see what happens to bone cells in culture to guess at what happens in vivo. Mechanisms can now be addressed in molecular detail with the proper animal models. For example, the combined use of imaginative cell culture studies followed by genetic manipulation in mice led to the discovery of the importance of RANKL and to the development of an antibody to RANKL as a rational therapy for the treatment of osteoporosis.(26–28) Studies in rats and mice complemented human studies and led to the development of intermittent PTH administration as an anabolic therapy for skeletal disorders associated with low bone mass.(29,30) Thus, mice and rats have been invaluable tools in the bone field for the elucidation of seminal mechanisms of metabolic bone diseases as well as faithful models of the effects of growth factors, cytokines, hormones, drugs, and gravity on human bone. Nevertheless, during this period, the field of bone and mineral metabolism has witnessed a number of reproducibility problems in the arena of preclinical studies. Not surprisingly, all of the concerns expressed in the critiques of current science summarized earlier also apply to studies in the bone field. Although uncommon, irreproducibility of preclinical studies should be of concern in a field with a very successful record of major advances toward the improvement of human health in a relatively short period of time.

In addition to the general concerns expressed by Collins and Tabak,(1) some challenges associated with common research strategies used by preclinical bone scientists are worth considering in the context of the concerns already discussed. Although the power of genetic manipulation of mice has led to an enormous increase in skeletal research, results using these mice need to be interpreted cautiously for two reasons: first, of course, rodents differ in many important ways from humans and these differences may make them uncertain predictors of some aspects of human physiology; second, genetic manipulations, like all approaches in science, have their inevitable limitations and require controls that are not always obvious. We consider each of these issues in turn.

First, about the problems of applying lessons learned from mice to humans: some differing results across species may reflect the differing structures and cellular relationships of human and rodent bone. For example, many groups have shown that bisphosphonates and PTH have additive effects on bone mass in rodents,(31) but such additive effects are generally not seen in human studies.(32,33) Another example is the effect in mice of osteocalcin on carbohydrate metabolism, an area of current debate reflecting uncertainty about differences between species in everything from vitamin K levels to the determinants of glucose disposal (summarized in Andrews(34)).

The second reason for interpreting mouse genetic manipulations cautiously is more subtle: the new genetic methodologies are straightforward to apply, but the relevant controls needed are not always self-evident. Each transgenic mouse is an adventure requiring careful navigation. Routine generation of transgenic mice leads to insertion of the transgene into unpredictable, possibly multiple, sites in the genome. Each insertion involves disruption of the gene located at the insertion site, and the local environment affects the expression levels of the inserted transgene. Thus, to be sure that a particular transgenic mouse reflects the properties of the transgene in a predictable way, investigators must examine the properties of multiple, independently derived mouse lines. Even with the use of strategies of homologous recombination to increase the likelihood of targeted gene manipulation, the altered gene (often a “knockout”) is altered in all cells in the organism. Too often investigators assume that the abnormalities in the knockout mouse reflect direct effects of the knockout in bone, but, in fact, these universal knockouts may well generate more complicated, indirect effects on bone tissue. Therefore, inferences from the phenotypes of “universal” gene knockout mice should always be complemented by other studies that argue for direct consequences of the knockout in bone (if that is the inference claimed). For example, in vitro studies with isolated bone cells from such knockout mice can address certain specific predictions of the hypothesis that the actions of the knockout result from the loss of gene function in a particular bone cell type. Furthermore, knockouts may not be as straightforward as they seem: sometimes when only part of a gene is removed, the remaining portion can be expressed and cause an unexpected phenotype; at other times, an apparently simple knockout may remove sequences that regulate nearby genes, again causing unanticipated phenotypes.

Some transgenic strategies involve the use of promoter fragments that may have unanticipated expression in unexpected target organs. Investigators often make a gesture toward assessing ectopic gene expression by surveying mRNA levels in RNA preparations from several organs of the transgenic mouse. But these tissue samples are seldom exhaustive and are insensitive to gene expression in cell types that make up a modest fraction of a heterogeneous organ. More exhaustive examination of gene expression in tissue sections from multiple organs allows the inspection of gene expression in cell types that may represent a modest fraction of an RNA preparation made from whole tissue, but, of course, such tissue surveys cannot be truly comprehensive. One always worries that a transgene might be expressed in some important but minor collection of neurons in the brain, for example; such ectopic expression is extremely difficult to eliminate as a possibility by all available methods. All of these concerns are compounded when the transgene drives the expression of cre recombinase. This recombinase may well be active transiently in “unplanned” cell types, so that the tissue surveys just described, in that case, must involve strategies that can detect the consequences of prior, transient expression of the recombinase through the use of, for example, floxed reporter genes that can be easily scored after cre recombinase action.

If the transgenic/gene knockout strategy does not involve time-regulated gene expression, using genes regulated by, for example, tamoxifen or tetracycline derivatives, then the phenotype of the transgenic mice reflects the action of the transgene during the entire development of the mouse. Determining such phenotypes may be an important goal, but they often will not reflect only the action of the manipulated gene in adult mice. Because the role of a gene in adult life may differ from, and/or be quantitatively less dramatic than, its role during development, such studies require cautious conclusions.

The use of tamoxifen and tetracycline derivatives to regulate the timing of gene expression is fraught with its own potential complications requiring careful controls. Tamoxifen, as a selective estrogen receptor modulator, has its own actions on bone, so it should be used as little as possible in all experiments that use tamoxifen-regulated cre recombinase. Each such experiment, of course, should include a control in which tamoxifen is used in a genetic context in which cre recombinase action does not occur (eg, by omitting the floxed target of cre recombinase). Even then, it is almost impossible to eliminate the possibility that the actions of tamoxifen synergize with the effects of the intended gene knockout, in ways that lead to a phenotype different from the phenotypes of tamoxifen action or the gene knockout action each by themselves. Tetracycline derivatives are effective modulators of promoters that can be turned on or off by genetically modified versions of the E. coli tetracycline repressor. All such tetracycline analogs bind avidly to bone-forming surfaces and thus might have actions, for example, on metalloproteinases. Further, this avidity is either a strength or a weakness, depending on the experimental design, because it leads to a prolonged presence of the analog in the bloodstream of treated mice. When tet-off strategies are used, the analog may suppress target gene function for weeks after cessation of administration of the analog. Controls using, for example, suitable reporter mice are needed to determine the functional time course of the action of tetracycline analogs in such experiments.

Another caveat about the use of tamoxifen-regulated or tetracycline-regulated cre recombinase systems involves the issue of leakiness of expression of the transgene. In bone science, one of the major attractions of such transgenes is that they can be turned on for the first time in adult life, thereby avoiding effects during growth and development. However, most transgenes that are designed to be “off” in the absence of tamoxifen or tetracycline derivatives, in fact, have a modest level of basal expression even in the absence of the triggering ligand. Such modest leakiness can lead to important cumulative effects after many months, particularly in long-lived cells such as osteocytes. Consequently, all such studies should include controls in which mice are carefully scrutinized for signs of cre expression, usually through the use of appropriate reporter mice studied at the same age as the experimental mice.

Finally, because phenotypes of mice reflect the effects of multiple genes in addition to the gene manipulated by transgenesis, a particular genetic manipulation may have diverse effects in different genetic backgrounds. This had dramatic consequences, for example, when one of our groups knocked out the gene encoding the PTH/PTHrP receptor. In the C57BL6 strain of mice, almost all of these mice died early in gestation, probably because of abnormal cardiac development, making study of bone development impossible.(35) Only when the same knockout was transferred to a Black Swiss background was it possible to define the bone phenotype in such mice.(36)

With this bewildering array of potential sources of misleading results with transgenic mice, is the solution simply to avoid the use of these techniques? Of course not! The solution to each of the problems just listed involves the use of multiple controls, multiple approaches to asking the same question, and modesty in the strength of the conclusions drawn from the study of any one mouse.

Although we have focused here on the hazards of studying transgenic mice in bone research, such hazards are inevitable whenever new technology is used in science, whether the studies be of humans, mice, or cultured cells; many of today's exciting new methodologies raise analogous problems. For example, the use of shRNA to knock down the expression of genes often leads to only a partial decrease in gene expression and inevitably involves knockdown of “off-target” genes in ways that are hard to predict and control.(37) Thus, all such experiments should involve the use of multiple shRNAs that target differing portions of the targeted gene; such controls have been omitted from many published studies. The powerful new use of “clustered regularly interspaced short palindromic repeats” (CRISPR)-Cas methodology to precisely change specific sequences in individual cells can cause changes in off-target sequences; strategies to minimize these difficult-to-detect off-target effects are still in their developmental stages.(38)

The solutions to these new problems are necessarily tentative and provisional. The scientific community needs to understand the need for multiple approaches to solve every problem and that, necessarily, every conclusion drawn from experiments is tentative and contingent. There is nothing new in this idea, but, as Collins and Tabak(1) noted, some journals appear to discourage cautious and tentative conclusions, precisely in studies that use bold and powerful but incompletely understood technologies. The solution is, of course, not to avoid the use of new technologies, but, quite the opposite, to better understand their limitations and to interpret the results with the caution required by the use of incompletely understood technology. The challenge will be to introduce the safeguards proposed by Collins and Tabak and others, without unnecessarily slowing scientific progress through the introduction of over-elaborate bureaucratic constraints.

Suggestions and Ideas for Tackling the Problem

Recommendations for editors and editorial boards

Major scientific journals and funding agencies in the UK and U.S. have already endorsed and adopted the ARRIVE guidelines. We recommend that JBMR and the other journals in our field adopt and enforce these guidelines immediately. The recommendation of Collins and Tabak to include in each grant review panel at least one reviewer with the specific task of assessing the “scientific premise” (and, we would add, the clinical and physiological context) of proposals for clinical trials is, in our opinion, significant and likely to help. In any event, we propose that the same standard of “scientific premise” be adopted by the editorial boards of subspecialty publications, and even more so of general interest publications, during the review of manuscripts with strong translational claims. We further suggest that editors can no longer afford to act as impartial spectators of a tennis match, following the ball from one end of the court to the other. A good, old-fashioned word for manuscript reviewing used to be “refereeing.” Editors and editorial board members should, in our opinion, indeed referee more often, stepping in to weigh a well-reasoned opinion over a less well-reasoned one. The benefit of improving the reproducibility of published work should outweigh the cost of displeasing some reviewers.

Recommendations for meeting chairs

As mentioned earlier, lack of appropriate training in statistics and other experimental methodologies is a likely source of the irreproducibility problem, one already recognized in many other fields. We, therefore, recommend that the program committees of national and international meetings organized by the American Society for Bone and Mineral Research (ASBMR), the International Bone and Mineral Society (IBMS), the European Calcified Tissue Society (ECTS), and other national and international meetings of the bone field incorporate courses for postdoctoral fellows and young investigators on appropriate reporting of data and the problems of irreproducibility.

Lack of vigorous scientific debate is, in our opinion, another contributor to the irreproducibility problem in our field. We, therefore, recommend that meeting organizing committees promote and encourage debate by incorporating into their programs more sessions specifically addressing controversies in the field. Indeed, we submit that silence in the face of questionable or unconvincing data presented by a fellow scientist, whether out of politeness or fear, and particularly common among young participants, is counterproductive for the advancement of science. Interpretations of data presented at meetings can and must be vigorously questioned in ways that are well within the bounds of respectful discourse.

Recommendations for authors of animal studies in bone

Inconsistencies in reporting the bone phenotype of animal models frequently make it difficult to interpret results and compare findings across studies. Over the years, the ASBMR and other organizations in the field have provided guidelines for reporting the skeletal phenotype of small animals, precisely to assure consistency and reproducibility. We strongly recommend not only that authors consult these guidelines during the preparation of their manuscripts, but also that reviewers and editors of our subspecialty journals demand that the recommendations be strictly followed. The following guidelines are, in our opinion, critical for the reproducibility of studies describing the effects of experimentation in rodents.

Study design and statistics

Please see the ARRIVE Guidelines (Table 1) and the set of reporting standards for rigorous study design (Table 2).

Histomorphometry

In 1987, the ASBMR asked A. Michael Parfitt to form a committee of the Society to develop a unified system of terms suitable for adoption by JBMR as part of its instructions to authors.(39) In 2012, the ASBMR histomorphometry nomenclature committee published an update of its original report, which includes standardized nomenclature recommendations for handling dynamic parameters of bone formation in the setting of low bone turnover.(40) In addition to these two documents, readers may find helpful the practical clarifications of some confusing terms, reproduced here with permission from Robert S. Weinstein(41) (Table 3).

Table 3.

Explanation of Meaning of Confusing Histomorphometry Terms

Active versus inactive: Should be more explicit. May indicate the presence of bone cells or imply their vigor.
Resorptive parameters: Precisely which measurement is used should be clearly stated. The best measure of resorption is the osteoclast number or surface.
Eroded surfaces: The eroded surface is composed of the osteoclast surface plus the reversal surface and is an unfaithful measure of bone resorption. Treatment with an antiresorptive agent will decrease the osteoclast surface and increase the reversal surface.
Formation parameters: Precisely which measurement is used should be clearly stated. Osteoid is not an index of bone formation because it can accumulate with increased bone formation or from delayed mineralization. The only measure of bone formation is the tetracycline-based bone formation rate.
Coupling and balance: Coupling refers to the appearance of osteoblasts at the site where osteoclasts have recently eroded bone cavities. Balance indicates that the new bone deposited by osteoblasts exactly equals the amount previously removed by osteoclasts.
Osteoblasts and osteoclasts (surface or areal referent?): Osteoblasts and osteoclasts can only be on bone surfaces. The appropriate referent is the bone surface. Only osteocytes are correctly reported per bone area.
Activation frequency: The incidence of new remodeling cycles (which usually refers to cancellous bone) can be estimated by dividing the bone formation rate by the wall width, an index of the amount of bone made by a previous team of osteoblasts.
μCT histomorphometry: This is a misnomer. μCT does not reveal excess osteoid, woven bone, or embedded cartilage cores, features that are easily noted with histologic assessment of the cancellous microarchitecture. The quantification of bone cells and the rate of turnover can only be done by a histologic examination.
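The activation frequency estimate described in Table 3 is a simple quotient: bone formation rate divided by wall width. The sketch below is a hedged illustration of that arithmetic only; the function name, units, and example values are hypothetical and do not come from the article.

```python
# Illustrative calculation of activation frequency (Ac.f) per Table 3:
# the bone formation rate divided by the wall width, where wall width
# indexes the amount of bone made by a previous team of osteoblasts.
# Names and units below are hypothetical examples, not a standard API.

def activation_frequency(bone_formation_rate: float, wall_width: float) -> float:
    """Estimate activation frequency (new remodeling cycles per year).

    bone_formation_rate: surface-referent BFR, e.g. in um^3/um^2/year
    wall_width: mean wall width, e.g. in um
    """
    if wall_width <= 0:
        raise ValueError("wall width must be positive")
    return bone_formation_rate / wall_width

# With a made-up BFR of 20 um^3/um^2/year and a wall width of 40 um,
# the estimate is 0.5 new remodeling cycles per year.
print(activation_frequency(20.0, 40.0))  # 0.5
```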

Imaging

Guidelines for precise, accurate, and reproducible dual X-ray absorptiometry measurements in mice in vivo have also been published.(42) More recently, micro–computed tomography (μCT) imaging has become an essential tool for the assessment of trabecular and cortical bone morphology. Lack of consistency in image acquisition, image evaluation, and reporting of bone microstructure assessments in rodents using μCT, due in part to the several different commercially available systems, has become a concern. As a result, in 2010 JBMR published guidelines for the assessment of bone microstructure in rodents using μCT, based on the recommendations of a committee of experts from Europe, Canada, and the United States, headed by Mary Bouxsein.(43) A summary of the committee's μCT recommendations is reproduced here (Table 4).

Table 4.

Recommendations for μCT Imaging of Trabecular and Cortical Bone Morphology

Image acquisition: The methods section should report the following parameters: scan medium, X-ray tube potential, and voxel size, as well as clear descriptions of the size and location of the volume of interest.
Image processing: The methods section should describe any algorithms used for image filtration and the approach used for image segmentation, including the method used to delineate cortical from trabecular bone regions.
Image analysis: 3D algorithms that do not rely on assumptions about the underlying structure should be used to compute trabecular and cortical bone morphometry whenever possible. Whereas tissue mineral density measurements are possible with μCT systems, significant artifacts can be associated with the use of polychromatic X-ray sources, and therefore, these measurements must be conducted with extreme care and interpreted with caution.
Reporting of μCT results: The minimal set of variables that should be used to describe trabecular bone morphometry includes bone volume fraction and trabecular number, thickness, and separation. The minimal set of variables that should be used to describe cortical bone morphometry includes total cross-sectional area, cortical bone area, cortical bone area fraction, and cortical thickness. Other variables also may be appropriate depending on the research question.
Presentation of μCT images: Either 2D or 3D images are appropriate, but the criteria used to select the “representative” image(s) must be described either in the methods section or the figure legend.
Quality control: Investigators should follow manufacturer-specific instructions for regular quality control and document that these instructions are followed. All images should be inspected visually to identify possible scanning artifacts.
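The minimal reporting set in Table 4 lends itself to a mechanical completeness check before manuscript submission. The sketch below is a hypothetical illustration, not an official ASBMR or JBMR tool; the class and field names are assumptions modeled loosely on common histomorphometric abbreviations.

```python
# A minimal sketch (hypothetical, not an official checker) of a reporting
# checklist for the required trabecular and cortical μCT variables listed
# in Table 4. Field names loosely follow common abbreviations (BV/TV,
# Tb.N, Tb.Th, Tb.Sp, Tt.Ar, Ct.Ar, Ct.Ar/Tt.Ar, Ct.Th).
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class MicroCTReport:
    bv_tv: Optional[float] = None        # trabecular bone volume fraction
    tb_n: Optional[float] = None         # trabecular number
    tb_th: Optional[float] = None        # trabecular thickness
    tb_sp: Optional[float] = None        # trabecular separation
    tt_ar: Optional[float] = None        # total cross-sectional area
    ct_ar: Optional[float] = None        # cortical bone area
    ct_ar_tt_ar: Optional[float] = None  # cortical bone area fraction
    ct_th: Optional[float] = None        # cortical thickness

def missing_variables(report: MicroCTReport) -> List[str]:
    """Return the names of required variables that were left unreported."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]

# A report missing cortical thickness would be flagged before submission:
partial = MicroCTReport(bv_tv=0.12, tb_n=3.5, tb_th=0.045, tb_sp=0.25,
                        tt_ar=1.8, ct_ar=0.9, ct_ar_tt_ar=0.5)
print(missing_variables(partial))  # ['ct_th']
```

Other variables may of course be appropriate depending on the research question; the point of the check is only that the minimal set is never silently omitted.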

Closing Remarks

In closing, we submit that the onus is upon each member of the bone and mineral research community to accept responsibility for the failure to self-correct in our field and to welcome and respond to the jolting wake-up call of Collins and Tabak. In the long term, publishing first is good, but publishing work that is confirmed and extended by others is far more important and useful. Creativity and originality must be combined with rigorous attention to the limitations of all experimental technologies. The discussion of data should always include alternative explanations and alert readers to existing basic, clinical, and physiological evidence that argues against the proposed hypothesis. Moreover, we strongly endorse Bruce Alberts’ call to develop new ethical standards, “where simply moving on from one’s mistakes without publicly acknowledging them severely damages, rather than protects, a scientific reputation.” Unless the irreproducibility problem is tackled as quickly as possible and public faith in science is restored, the ability of future generations of scientists to enjoy the privilege of public support for their careers may be in serious jeopardy—not a legacy that any one of us would wish to leave behind.

Acknowledgments

The authors’ research is supported by the NIH (P01 AG13918, R01 AR56679 [to SCM]; DK11794 and DK56246 [to HMK]) and the Biomedical Laboratory Research and Development Service of the Veterans Administration Office of Research and Development (I01 BX001405 [to SCM]). We thank Leah Elrod for help with the preparation of the manuscript.

Footnotes

Disclosures

Both authors state that they have no conflicts of interest.

Authors’ roles: The authors contributed equally to the writing of this perspective and share the views described herein.

References

• 1. Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature. 2014;505:612–3. doi: 10.1038/505612a.
• 2. Announcement: reducing our irreproducibility. Nature. 2013;496:398. doi: 10.1038/496398a. Available from: http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852.
• 3. McNutt M. Reproducibility. Science. 2014;343:229. doi: 10.1126/science.1250475.
• 4. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2:e124. doi: 10.1371/journal.pmed.0020124.
• 5. Ioannidis JP, Trikalinos TA. Early extreme contradictory estimates may appear in published research: the Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol. 2005;58:543–9. doi: 10.1016/j.jclinepi.2004.10.019.
• 6. Kilkenny C, Parsons N, Kadyszewski E, et al. Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One. 2009;4:e7824. doi: 10.1371/journal.pone.0007824.
• 7. Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol. 2010;8:e1000412. doi: 10.1371/journal.pbio.1000412.
• 8. Baker D, Lidster K, Sottomayor A, Amor S. Two years later: journals are not yet enforcing the ARRIVE guidelines on reporting standards for pre-clinical animal studies. PLoS Biol. 2014;12:e1001756. doi: 10.1371/journal.pbio.1001756.
• 9. Hess KR. Statistical design considerations in animal studies published recently in Cancer Research. Cancer Res. 2011;71:625. doi: 10.1158/0008-5472.CAN-10-3296.
• 10. Sena E, van der Worp HB, Howells D, Macleod M. How can we improve the pre-clinical development of drugs for stroke? Trends Neurosci. 2007;30:433–9. doi: 10.1016/j.tins.2007.06.009.
• 11. Hackam DG, Redelmeier DA. Translation of research evidence from animals to humans. JAMA. 2006;296:1731–2. doi: 10.1001/jama.296.14.1731.
• 12. Nuzzo R. Scientific method: statistical errors. Nature. 2014;506:150–2. doi: 10.1038/506150a.
• 13. Number crunch. Nature. 2014;506:131–2. doi: 10.1038/506131b.
• 14. Krzywinski M, Altman N. Points of significance: importance of being uncertain. Nat Methods. 2013;10:809–10. doi: 10.1038/nmeth.2613.
• 15. Krzywinski M, Altman N. Points of significance: significance, P values and t-tests. Nat Methods. 2013;10:1041–2. doi: 10.1038/nmeth.2698.
• 16. Matters of significance. Nat Methods. 2013;10:805. doi: 10.1038/nmeth.2638.
• 17. Lambdin C. Significance tests as sorcery: science is empirical—significance tests are not. Theory Psychol. 2012;22:67–90.
• 18. Landis SC, Amara SG, Asadullah K, et al. A call for transparent reporting to optimize the predictive value of preclinical research. Nature. 2012;490:187–91. doi: 10.1038/nature11556.
• 19. Chalmers I, Bracken MB, Djulbegovic B, et al. How to increase value and reduce waste when research priorities are set. Lancet. 2014;383:156–65. doi: 10.1016/S0140-6736(13)62229-1.
• 20. Ioannidis JP, Greenland S, Hlatky MA, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383:166–75. doi: 10.1016/S0140-6736(13)62227-8.
• 21. Al-Shahi SR, Beller E, Kagan J, et al. Increasing value and reducing waste in biomedical research regulation and management. Lancet. 2014;383:176–85. doi: 10.1016/S0140-6736(13)62297-7.
• 22. Chan AW, Song F, Vickers A, et al. Increasing value and reducing waste: addressing inaccessible research. Lancet. 2014;383:257–66. doi: 10.1016/S0140-6736(13)62296-5.
• 23. Glasziou P, Altman DG, Bossuyt P, et al. Reducing waste from incomplete or unusable reports of biomedical research. Lancet. 2014;383:267–76. doi: 10.1016/S0140-6736(13)62228-X.
• 24. Chwe MS-Y. Scientific pride and prejudice. New York Times. 2014 Jan 31; Sunday Review. Available from: http://www.nytimes.com/2014/02/02/opinion/sunday/scientific-pride-and-prejudice.html?_r=0.
• 25. Unreliable research: trouble at the lab. The Economist. 2013 Oct 19. Available from: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble.
• 26. Boyle WJ, Kung Y, Lacey DL, et al. Osteoprotegerin ligand (OPGL) is required for murine osteoclastogenesis. Bone. 1998;23:S189.
• 27. Lacey DL, Tan HL, Lu J, et al. Osteoprotegerin ligand modulates murine osteoclast survival in vitro and in vivo. Am J Pathol. 2000;157:435–48. doi: 10.1016/S0002-9440(10)64556-7.
• 28. Cummings SR, San Martin J, McClung MR, et al. Denosumab for prevention of fractures in postmenopausal women with osteoporosis. N Engl J Med. 2009;361:756–65. doi: 10.1056/NEJMoa0809493.
• 29. Tam CS, Heersche JNM, Murray TM, Parsons JA. Parathyroid hormone stimulates the bone apposition rate independently of its resorptive action: differential effects of intermittent and continuous administration. Endocrinology. 1982;110:506–12. doi: 10.1210/endo-110-2-506.
• 30. Neer RM, Arnaud CD, Zanchetta JR, et al. Effect of parathyroid hormone (1–34) on fractures and bone mineral density in postmenopausal women with osteoporosis. N Engl J Med. 2001;344:1434–41. doi: 10.1056/NEJM200105103441904.
• 31. Wronski TJ, Yen C-F, Qi H, Dann LM. Parathyroid hormone is more effective than estrogen or bisphosphonates for restoration of lost bone mass in ovariectomized rats. Endocrinology. 1993;132:823–31. doi: 10.1210/endo.132.2.8425497.
• 32. Finkelstein JS, Hayes A, Hunzelman JL, Wyland JJ, Lee H, Neer RM. The effects of parathyroid hormone, alendronate, or both in men with osteoporosis. N Engl J Med. 2003;349:1216–26. doi: 10.1056/NEJMoa035725.
• 33. Black DM, Greenspan SL, Ensrud KE, et al. The effects of parathyroid hormone and alendronate alone or in combination in postmenopausal osteoporosis. N Engl J Med. 2003;349:1207–15. doi: 10.1056/NEJMoa031975.
• 34. Andrews N. Skeletal regulation of glucose metabolism: challenges in translation from mouse to man. IBMS BoneKEy. 2013;10:353. doi: 10.1038/bonekey.2013.87.
• 35. Qian J, Colbert MC, Witte D, et al. Midgestational lethality in mice lacking the parathyroid hormone (PTH)/PTH-related peptide receptor is associated with abrupt cardiomyocyte death. Endocrinology. 2003;144:1053–61. doi: 10.1210/en.2002-220993.
• 36. Lanske B, Karaplis AC, Lee K, et al. PTH/PTHrP receptor in early development and Indian hedgehog-regulated bone growth. Science. 1996;273:663–6. doi: 10.1126/science.273.5275.663.
• 37. Echeverri CJ, Beachy PA, Baum B, et al. Minimizing the risk of reporting false positives in large-scale RNAi screens. Nat Methods. 2006;3:777–9. doi: 10.1038/nmeth1006-777.
• 38. Fu Y, Foden JA, Khayter C, et al. High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nat Biotechnol. 2013;31:822–6. doi: 10.1038/nbt.2623.
• 39. Parfitt AM, Drezner MK, Glorieux FH, et al. Bone histomorphometry: standardization of nomenclature, symbols, and units. Report of the ASBMR Histomorphometry Nomenclature Committee. J Bone Miner Res. 1987;2:595–610. doi: 10.1002/jbmr.5650020617.
• 40. Dempster DW, Compston JE, Drezner MK, et al. Standardized nomenclature, symbols, and units for bone histomorphometry: a 2012 update of the report of the ASBMR Histomorphometry Nomenclature Committee. J Bone Miner Res. 2013;28:2–17. doi: 10.1002/jbmr.1805.
• 41. Weinstein RS. Understanding bone histomorphometry: sampling, evaluation, and interpretation. In: Orwoll ES, editor. Atlas of osteoporosis. Philadelphia: Springer; 2009. pp. 13–9.
• 42. Iida-Klein A, Lu SS, Yokoyama K, Dempster DW, Nieves JW, Lindsay R. Precision, accuracy, and reproducibility of dual X-ray absorptiometry measurements in mice in vivo. J Clin Densitom. 2003;6:25–33. doi: 10.1385/jcd:6:1:25.
• 43. Bouxsein ML, Boyd SK, Christiansen BA, Guldberg RE, Jepsen KJ, Muller R. Guidelines for assessment of bone microstructure in rodents using micro-computed tomography. J Bone Miner Res. 2010;25:1468–86. doi: 10.1002/jbmr.141.