Abstract
The implications of irreproducibility in the sciences have recently been brought into the spotlight, although the topic has been debated in the literature for years. A multitude of reasons have been attributed to this issue; one commonly labeled culprit is the overuse of the p value as the determinant of significance by the scientific community. Both scientists and statisticians have questioned the use of null hypothesis testing as the basis of scientific analysis. This survey of the current issues surrounding irreproducibility in research emphasizes potential causes of the problem, the impact it can have on drug development, and efforts being taken to increase the transparency of research findings.
Keywords: p Value, Data Analysis, Statistics, Statistical Modeling, Irreproducibility in Research
THE VALUE OF P VALUE
Researchers use a variety of statistical analytic methods to better quantify the validity and reproducibility of their experiments. A popular method of determining statistical significance is the p value, a measure of how strongly experimental data contradict the null hypothesis. For years, the debate over the value of the p value has continued in the literature [1–4]. Critics argue that the p value creates an artificial metric for establishing significance in a data set. Recently, Leek and Peng published a commentary in Nature discussing the limitations of null hypothesis testing [5]. Rather than focusing their criticism on the p value, the authors argue that data analysis as a whole must be more stringently regulated. The null hypothesis is widely used both in the basic sciences and in clinical/translational research, a field in which experimental findings strongly influence the treatment of patients. It is therefore vital to understand how experimental design, data analysis, and statistical testing may be optimized to generate studies that more effectively improve human health.
Typical Research Methodology
In most research studies, there is a classical progression of experimental steps (Fig. 1). First, a problem is identified through analysis of evidence from prior experimentation and the literature. A hypothesis is proposed to address the problem, and experiments are devised to test it [6]. After the generation of data and appropriate statistical filtering, a p value is calculated to suggest whether the findings are “significant” or “not significant.” The p value is popularly used as a measure of significance because it represents the probability of obtaining a result at least as extreme as the one observed if chance or random error alone were at work. Therefore, a high p value indicates that the observed result is readily explained by chance; in other words, the higher the p value, the weaker the grounds for rejecting the null hypothesis [7]. Cutoff values for judging the significance of a p value are assigned arbitrarily; by convention, a p value larger than the cutoff indicates that the result is plausibly attributable to chance [8]. Unfortunately, many scientists treat the p value as a simple determinant of significance while ignoring other critical features of a data set that can drastically affect it, such as the magnitude of the association and the sample size. These components must be recognized and evaluated to ensure that the p value is an accurate measure of deviation from the null hypothesis. The greatest impact on a data set comes not from the calculation of the p value itself, but from the ways in which the raw data are altered through cleaning and statistical modeling [9]. Typically, data are purged of outliers using various statistical algorithms to generate a “clean” dataset. At this stage, null hypothesis testing may generate a p value that falsely represents the significance of the study [10–12].
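To make this pipeline concrete, the following minimal sketch (our illustration in Python with NumPy and SciPy; none of the data, group sizes, or thresholds come from the original commentary) computes a p value for simulated two-group data and then repeats the test after an arbitrary outlier-removal step, showing how an upstream “cleaning” decision can move the resulting p value.

```python
# A minimal sketch (our illustration, not from the original commentary):
# a two-sample t-test on simulated data, repeated after "cleaning" outliers,
# to show how data handling upstream of the test can move the p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=12)
treated = rng.normal(loc=11.0, scale=2.0, size=12)

# Raw data: p value from the classical two-sample t-test.
t_raw, p_raw = stats.ttest_ind(control, treated)
print(f"raw data:       p = {p_raw:.3f}")

# "Clean" the treated group by discarding points beyond 1.5 SD of its mean,
# a common (and easily abused) outlier rule. The cutoff here is arbitrary.
keep = np.abs(treated - treated.mean()) < 1.5 * treated.std()
t_clean, p_clean = stats.ttest_ind(control, treated[keep])
print(f"after cleaning: p = {p_clean:.3f}")
```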
Figure 1.

A pictorial representation of data processing, from data collection to final statistical analysis.
Implications of Misuse of P Value
An underappreciated but critical aspect of analyzing research findings is the correct interpretation of a calculated p value in the context of the sample set. Many factors define the reliability of a sample, including potential bias and sample size [13]. For example, a smaller sample is less likely to represent the population as a whole, making the p value generated from its data less likely to be replicated if the experiment were repeated. Additionally, the p value is limited in that it provides no information about the magnitude of an effect [14].
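The instability described above is easy to demonstrate by simulation. The hypothetical sketch below (our own, with arbitrary parameters) reruns an identical small experiment many times; the p value swings widely from run to run, while larger samples give far more consistent conclusions.

```python
# A minimal sketch (illustrative only): repeat the same experiment many times
# and watch the p value fluctuate, even though the true effect never changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_p(n, n_repeats=1000, true_shift=1.0):
    """Return p values from n_repeats identical experiments of size n per group."""
    pvals = []
    for _ in range(n_repeats):
        a = rng.normal(0.0, 2.0, size=n)
        b = rng.normal(true_shift, 2.0, size=n)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return np.array(pvals)

for n in (10, 100):
    p = simulate_p(n)
    print(f"n={n:>3}: p ranges {p.min():.4f} to {p.max():.4f}, "
          f"fraction < 0.05 = {(p < 0.05).mean():.2f}")
```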
In basic science research, where the use of animal models is common, careful planning of experiments is essential to minimize the number of animals required to generate usable data. However, this factor must be balanced against the risk of drawing incorrect conclusions from an inadequate sample size. Careful selection of animal models and proper use of controls are critical to choosing the appropriate type of data analysis [15]. Another issue that impacts statistical analysis is the handling of replicates. There are two types of experimental replicates: biological and technical. A technical replicate assays a single sample multiple times and serves as an internal control for the experiment, whereas a biological replicate assays different, independently derived samples [16]. Technical replicates are not independent tests of the hypothesis and therefore should not be used to calculate p values; nevertheless, they are often incorrectly included in statistical calculations [17].
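The consequence of treating technical replicates as independent observations, sometimes called pseudoreplication, can be illustrated with a short sketch. The numbers below are assumptions for illustration (four animals per group, three technical replicates each), not an analysis from any cited study.

```python
# A minimal sketch (assumed numbers: 4 animals per group, 3 technical
# replicates each). Treating the 12 technical measurements as independent
# inflates the effective sample size and can deflate the p value; averaging
# replicates within each animal first is the defensible analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_animals, n_tech = 4, 3

# Each animal has its own true level; technical replicates add small noise.
ctrl_animals = rng.normal(10.0, 1.5, size=n_animals)
trt_animals = rng.normal(11.0, 1.5, size=n_animals)
ctrl_meas = np.repeat(ctrl_animals, n_tech) + rng.normal(0, 0.2, n_animals * n_tech)
trt_meas = np.repeat(trt_animals, n_tech) + rng.normal(0, 0.2, n_animals * n_tech)

# Wrong: pretend all 12 measurements per group are independent (n inflated).
p_wrong = stats.ttest_ind(ctrl_meas, trt_meas).pvalue

# Right: average the technical replicates, test at the animal level (n = 4).
p_right = stats.ttest_ind(ctrl_meas.reshape(n_animals, n_tech).mean(axis=1),
                          trt_meas.reshape(n_animals, n_tech).mean(axis=1)).pvalue
print(f"replicates as independent: p = {p_wrong:.4f}")
print(f"animal-level means:        p = {p_right:.4f}")
```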
In the clinical sciences, the results of studies often guide clinicians in treatment and can have a more immediate impact on human health. All clinical studies harbor some form of bias or other effects that can pull conclusions away from the truth [18]. It is also important to consider the statistical power of a study. For example, pharmaceutical companies may enroll hundreds of patients in a clinical trial; such large samples make it possible to detect minor differences that are highly statistically significant yet may not represent a clinically meaningful difference [19, 20].
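A brief simulation illustrates this overpowering problem. The numbers below are invented for illustration only; with enough patients per arm, even a clinically trivial difference in, say, blood pressure will typically cross the conventional p < 0.05 threshold.

```python
# A minimal sketch (simulated numbers, not from any real trial): with tens of
# thousands of patients per arm, a clinically trivial difference typically
# reaches conventional statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 20_000  # patients per arm (deliberately very large)

# True difference of 0.5 mmHg: statistically detectable, clinically trivial.
placebo = rng.normal(140.0, 15.0, size=n)
drug = rng.normal(139.5, 15.0, size=n)

t, p = stats.ttest_ind(placebo, drug)
print(f"mean difference = {placebo.mean() - drug.mean():.2f} mmHg, p = {p:.4f}")
```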
A Journal That Publishes Without the p Value
The use of the p value has been present in the literature for years and is a central tenet of evidence-based scientific methods [21]. The ongoing debate over the use of the p value in science encouraged the journal Basic and Applied Social Psychology to ban null hypothesis significance testing from its future publications [22]. This journal aims to publish research that integrates basic science with practical, real-world applications. Declaring null hypothesis testing invalid, the journal now calls for more descriptive statistics, such as effect size, a measure of the strength of a phenomenon [23]. It is the first journal to institute a ban of this type. The decision has been met with applause from some and criticism from others in the scientific community who believe that this bold move will have little effect on the quality of the scientific literature.
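Effect sizes of this kind can be reported with very little extra work. The sketch below computes Cohen's d, one standard effect-size measure (the standardized difference in means); the example values are our own and are not drawn from the journal's editorials.

```python
# A minimal sketch of the kind of descriptive statistic the journal favors:
# Cohen's d, the difference in means divided by the pooled standard deviation.
# The formula is standard; the data here are invented for illustration.
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

group_a = np.array([12.1, 11.4, 13.0, 12.7, 11.9])
group_b = np.array([10.2, 10.9, 11.1, 10.5, 10.8])
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}")  # magnitude of the effect
```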
Impacts to Drug Development
A common result of the systematic tailoring of data to achieve a p value that indicates “significance” is that many studies are irreproducible [24]. Validation of scientific data through reproduction of experiments is fundamental to the scientific method. Pharmaceutical companies constantly scan the literature to identify new molecular targets or compounds with high potential for further study. These targets are often identified and published by academic labs. Amgen, a biotechnology company based in California, identified fifty-three promising cancer studies in high-profile journals over the course of ten years. However, only six of these studies, a mere 11%, were both reproducible and showed continued promise as sources of novel targets [25]. Other pharmaceutical companies have also reported difficulty validating promising studies, a problem that drastically hinders the progress of therapeutic development [26]. This lack of reproducibility has been attributed to a variety of factors, including the chronic misuse and misunderstanding of what the p value represents [27].
Future Directions
To help eliminate some of the issues derived from poor experimental design, students-in-training and experienced research scientists alike must learn to critically scrutinize experimental design and identify flaws in published studies. The National Institutes of Health (NIH) is currently developing a training module for postdoctoral trainees that emphasizes the importance of good experimental design [28]. The development of similar initiatives in research centers and universities would be a positive step toward improving the reproducibility of experiments. Publishers of peer-reviewed research can also play a role in improving the reliability and quality of research by enforcing stricter statistical guidelines. Already, high-impact journals such as Nature and Science have made changes by eliminating word counts for methods sections and calling for increased transparency in data handling [29, 30]. These steps are merely the beginning, but they provide hope that the scientific community will begin to recognize the dire need to improve the reliability and reproducibility of research.
The scientific community has urged a push for transparency in the presentation of research. For example, there should be a continued effort to eliminate phrases such as “representative data” and “data not shown” in favor of showing the primary data, creating a culture of transparency [31]. Reporting other statistical values, such as the confidence interval, can provide a more complete picture by going beyond the p value to convey the magnitude and precision of an effect [32].
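As a concrete example of such reporting, the sketch below (illustrative data, with a simple pooled-variance approximation for the degrees of freedom) pairs a 95% confidence interval for a mean difference with the corresponding p value, so that both the magnitude and the precision of the effect are visible.

```python
# A minimal sketch (our illustration): report a 95% confidence interval for a
# mean difference alongside the p value, conveying magnitude and precision
# rather than a bare significant/not-significant verdict.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(5.0, 1.0, size=30)
b = rng.normal(4.4, 1.0, size=30)

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2  # simple approximation; Welch's df would also work
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se
p = stats.ttest_ind(a, b).pvalue
print(f"difference = {diff:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}], p = {p:.4f}")
```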
The p value is inherently a valuable statistical measure. However, the manner in which it has morphed into the sole determinant of significance has led to an epidemic of irreproducible, vastly over-interpreted research. A thorough reform of the experimental process is needed to accelerate the production of scientific research that can be applied more directly and expediently to the treatment and improvement of human health. The actions of the NIH and high-profile journals are pushing in the direction of greater research transparency, but they are only the beginning. Reform of the experimental process must take place at all levels of the scientific ladder, from students first learning the scientific method to experienced researchers training future generations of scientists.
Footnotes
Author Contributions
Dinesh Vyas and Archana collected the material and wrote the manuscript. Dinesh Vyas and Arpita discussed the topic, and Dinesh Vyas supervised the publication of this commentary.
Core Tip
This article provides an overview of the current issues surrounding irreproducibility in research and discusses initiatives being undertaken to rectify them. In this article, we expand upon a recently published commentary in Nature.
REFERENCES
1. Evans SJ, Mills P, Dawson J. The end of the p value? Br. Heart J. 1988;60(3):177–180. doi: 10.1136/hrt.60.3.177. PMID: 3052552; PMCID: PMC1216550.
2. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann. Intern. Med. 1999;130(12):995–1004. doi: 10.7326/0003-4819-130-12-199906150-00008. PMID: 10383371.
3. Dixon P. The p-value fallacy and how to avoid it. Can. J. Exp. Psychol. 2003;57(3):189–202. doi: 10.1037/h0087425. PMID: 14596477.
4. Mogie M. In support of null hypothesis significance testing. Proc. Biol. Sci. 2004;271(Suppl 3):S82–S84. doi: 10.1098/rsbl.2003.0105. PMID: 15101426; PMCID: PMC1810001.
5. Leek JT, Peng RD. Statistics: P values are just the tip of the iceberg. Nature. 2015;520(7549):612. doi: 10.1038/520612a. PMID: 25925460.
6. Guyatt G, Jaeschke R, Heddle N, Cook D, Shannon H, Walter S. Basic statistics for clinicians: 1. Hypothesis testing. CMAJ. 1995;152(1):27–32. PMID: 7804919; PMCID: PMC1337490.
7. Dorey F. In brief: The P value: What is it and what does it tell you? Clin. Orthop. Relat. Res. 2010:2297–2298. doi: 10.1007/s11999-010-1402-9.
8. Biau DJ, Jolles BM, Porcher R. P value and the theory of hypothesis testing: An explanation for new researchers. Clin. Orthop. Relat. Res. 2010;468(3):885–892. doi: 10.1007/s11999-009-1164-4. PMID: 19921345; PMCID: PMC2816758.
9. Motulsky HJ. Common misconceptions about data analysis and statistics. Pharmacol. Res. Perspect. 2015;3(1):e00093. doi: 10.1002/prp2.93. PMID: 25692012; PMCID: PMC4317225.
10. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol. 2015;13(3):e1002106. doi: 10.1371/journal.pbio.1002106. PMID: 25768323; PMCID: PMC4359000.
11. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 2011;22(11):1359–1366. doi: 10.1177/0956797611417632. PMID: 22006061.
12. Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 2014;1(3):140216. doi: 10.1098/rsos.140216. PMID: 26064558; PMCID: PMC4448847.
13. Krzywinski M, Altman N. Points of significance: Importance of being uncertain. Nat. Methods. 2013;10(9):809–810. doi: 10.1038/nmeth.2613. PMID: 24143821.
14. Nuzzo R. Scientific method: Statistical errors. Nature. 2014;506:150–152. doi: 10.1038/506150a.
15. Johnson PD, Besselsen DG. Practical aspects of experimental design in animal research. ILAR J. 2002;43(4):202–206. doi: 10.1093/ilar.43.4.202. PMID: 12391395.
16. Blainey P, Krzywinski M, Altman N. Points of significance: Replication. Nat. Methods. 2014;11(9):879–880. doi: 10.1038/nmeth.3091. PMID: 25317452.
17. Vaux DL, Fidler F, Cumming G. Replicates and repeats–what is the difference and is it significant? A brief discussion of statistics and experimental design. EMBO Rep. 2012;13(4):291–296. doi: 10.1038/embor.2012.36. PMID: 22421999; PMCID: PMC3321166.
18. Lambert J. Statistics in brief: How to assess bias in clinical studies? Clin. Orthop. Relat. Res. 2011;469(6):1794–1796. doi: 10.1007/s11999-010-1538-7. PMID: 20809163; PMCID: PMC3094617.
19. Hochster HS. The power of “p”: On overpowered clinical trials and “positive” results. Gastrointest. Cancer Res. 2008;2(2):108–109. PMID: 19259305; PMCID: PMC2630828.
20. Fernandes-Taylor S, Hyun JK, Reeder RN, Harris AH. Common statistical and research design problems in manuscripts submitted to high-impact medical journals. BMC Res. Notes. 2011;4:304. doi: 10.1186/1756-0500-4-304. PMID: 21854631; PMCID: PMC3224575.
21. Greenhalgh T, Howick J, Maskrey N. Evidence based medicine: A movement in crisis? BMJ. 2014;348:g3725. doi: 10.1136/bmj.g3725. PMID: 24927763; PMCID: PMC4056639.
22. Trafimow D, Marks M. Editorial. Basic and Applied Social Psychology. 2015;37(1).
23. Trafimow D. Editorial. Basic and Applied Social Psychology. 2014;36(1):1–2.
24. Freedman LP, Gibson MC, Ethier SP, Soule HR, Neve RM, Reid YA. Reproducibility: Changing the policies and culture of cell line authentication. Nat. Methods. 2015;12(6):493–497. doi: 10.1038/nmeth.3403. PMID: 26020501.
25. Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012;483(7391):531–533. doi: 10.1038/483531a. PMID: 22460880.
26. Prinz F, Schlange T, Asadullah K. Believe it or not: How much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 2011:712. doi: 10.1038/nrd3439-c1.
27. Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The fickle P value generates irreproducible results. Nat. Methods. 2015;12(3):179–185. doi: 10.1038/nmeth.3288. PMID: 25719825.
28. Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature. 2014;505(7485):612–613. doi: 10.1038/505612a. PMID: 24482835; PMCID: PMC4058759.
29. Nature Publishing Group. Announcement: Reducing our irreproducibility. Nature. 2013;496(7446).
30. McNutt M. Reproducibility. Science. 2014:229. doi: 10.1126/science.1250475.
31. Pulverer B. Significant statistics. EMBO Rep. 2012:280. doi: 10.1038/embor.2012.38.
32. Singh AK, Kelley K, Agarwal R. Interpreting results of clinical trials: A conceptual framework. Clin. J. Am. Soc. Nephrol. 2008:1246–1252. doi: 10.2215/CJN.03580807.
