Abstract
Introduction
Randomized controlled trials (RCTs) represent evidence at the lowest potential risk for bias. Clinicians in all specialties depend upon RCTs to guide patient care. Issues such as statistical discordance, or reporting statistical results that cannot be reproduced, should be uncommon. Our aim was to confirm the statistical reproducibility of published RCTs.
Methods
PubMed was searched using “randomized controlled trial.” Studies were selected using a random number generator. Studies were included if the primary outcome could be reproduced using the data and statistical test reported in the manuscript. The reproduced p-value from our analysis and the published p-value were compared. The primary outcome was the number of studies that reported p-values that differed in statistical significance (crossed p-value=0.05) from the reproduction analysis. Assuming an alpha of 0.05, a power of 0.80, and an estimated rate of statistical discordance of 5% for RCTs, a total of at least 568 studies were required.
Results
Overall, 572 RCTs were selected involving six specialties. Of these, 45% were positive (p<0.05) studies. Eleven (2%) published results that differed from the reproduction analysis and crossed the p=0.05 threshold. All 11 studies were positive studies (while the reproduction analysis demonstrated p≥0.05).
Conclusion
Fewer than 5% of published RCTs reported a discordant p-value that crossed the “p=0.05” threshold. Although the occurrence is uncommon, the existence of even one RCT publishing nonreproducible results is concerning. Future studies should seek to identify why some RCTs report discordant statistics and how to prevent this from occurring.
INTRODUCTION
More than 2.5 million scientific medical papers are published each year.1,2 Of these, randomized controlled trials (RCTs) are considered the highest tier of evidence because the study design minimizes the risk of bias.3 Clinicians often make decisions regarding patient care based on published RCTs.4,5 It is imperative that the methods are clear and the published results are reliable and reproducible. Consolidated Standards of Reporting Trials (CONSORT) guidelines were first published in 1996, with the intent to improve the reporting of RCTs to facilitate appraisal and interpretation (http://www.consort-statement.org/).
Given the rigorous methodology required with the design of RCTs, the final analysis of outcomes can often be performed with basic statistics such as t-test or chi-square. Errors in reporting the final statistical analysis would seem unlikely. The aim of our study was to confirm the statistical reproducibility of published RCTs. We sought to determine the number of RCTs that would report a p-value of the primary outcome that cannot be reproduced and crosses the “p=0.05” threshold.
METHODS
This was a cross-sectional study. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines were followed (https://www.strobe-statement.org/index.php?id=available-checklists). Institutional review board approval was waived as there were no human subjects in this study.
A PubMed search was performed using the term “randomized controlled trial,” sorted by most recent. Studies were randomly selected using a random number generator. Studies were included if the primary outcome could be reproduced using the data and statistical test reported in the published manuscript. Each study was reviewed and the following data were extracted: primary outcome, the denominators used for the intervention arm and control arm, the statistical method(s) used to analyze the primary outcome, journal, impact factor of journal, degree of author(s), and whether or not a statistician was involved.
Two authors (NHD and OAO) independently reviewed each study and re-analyzed the primary outcome using the statistical method reported in the manuscript. All work was reviewed by a PhD statistician who validated any discordant findings and randomly audited results. Any discrepancies were discussed and resolved with the principal investigating author (MKL).
Discordance between the p-value obtained from our analysis and the published p-value was assessed. The primary outcome was the number of studies that reported p-values that differed in statistical significance (crossed p-value of 0.05) from our reproduction analysis. The secondary outcome was the number of studies with p-values that differed by an absolute difference of 0.1 or more (e.g., p-value of 0.05 to 0.15), where the difference is between the p-value calculated by the authors of this study and the p-value reported by the authors of the published RCT. We chose a difference of 0.1 because a 10% difference often represents the minimal clinically important difference in results.6,7 In addition, we reported studies where use of chi-squared approximation was inappropriate due to expected cell count <5 and where proper use of Fisher’s exact test would have changed the significant findings of the study.
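The two outcome definitions above can be written as a small classification rule. The sketch below is illustrative only and is not the authors' analysis code: the function name, the labels, and the rounding used to guard against floating-point error are assumptions, and the label "concordant" simply means that neither definition was met.

```python
def classify_discordance(reported_p, reproduced_p, alpha=0.05, min_gap=0.1):
    """Label a reported/reproduced p-value pair (hypothetical helper).

    Primary outcome: the two p-values fall on opposite sides of alpha.
    Secondary outcome: same side of alpha, absolute difference >= min_gap.
    """
    # A pair "crosses" when exactly one of the two values is below alpha.
    crosses = (reported_p < alpha) != (reproduced_p < alpha)
    if crosses:
        return "crosses threshold"
    # Round to avoid floating-point artifacts such as 0.15 - 0.05 != 0.1.
    gap = round(abs(reported_p - reproduced_p), 10)
    if gap >= min_gap:
        return "differs by >= 0.1"
    return "concordant"


# Study 1 in Table 2: reported 0.033, reproduced 0.066.
print(classify_discordance(0.033, 0.066))
# The example given in the text: 0.05 versus 0.15.
print(classify_discordance(0.05, 0.15))
```

Under this rule a reported p-value of exactly 0.05 is treated as not significant, which matches the p<0.05 definition of a positive study used in this paper.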
Assuming an alpha of 0.05, a power of 0.80, and an estimated rate of statistical discordance of 5% for RCTs, a total of at least 568 studies were required. Categorical data were assessed using Pearson’s chi-square test when all expected cell counts were >5 and two-sided Fisher’s exact test otherwise.8
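The choice between the chi-square approximation and Fisher's exact test for a 2×2 table can be sketched as follows. This is a minimal illustration using only the Python standard library, not the software used in this study. The function names and the example counts (1 of 50 events versus 7 of 50 events) are hypothetical, the chi-square statistic is computed without a continuity correction (an assumption), and all row and column totals are assumed to be nonzero.

```python
from math import comb, erfc, sqrt


def expected_counts(a, b, c, d):
    """Expected cell counts of the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [r * k / n for r in rows for k in cols]


def chi_square_p(a, b, c, d):
    """Pearson chi-square p-value (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square variable with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    return erfc(sqrt(stat / 2))


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum of the probabilities of all tables
    with the same margins that are no more likely than the observed table."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def choose_test(a, b, c, d, min_expected=5):
    """Chi-square only when every expected count exceeds min_expected."""
    if min(expected_counts(a, b, c, d)) > min_expected:
        return "chi-square"
    return "fisher"


# Hypothetical trial: 1/50 events in one arm, 7/50 in the other.
table = (1, 49, 7, 43)
print(choose_test(*table))
print(f"chi-square p = {chi_square_p(*table):.3f}")
print(f"Fisher p     = {fisher_exact_p(*table):.3f}")
```

In this hypothetical table the smallest expected count is 4, so the rule selects Fisher's exact test, and the two tests fall on opposite sides of 0.05.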
RESULTS
A total of 572 RCTs were included, published from 1995 to 2020 (Fig. 1). Trials were published from around the world, with most studies from Europe, Asia, and North America. Most studies (n=372, 65%) included authors with either a master’s degree or PhD, or a statistician. Impact factors ranged from 0.431 to 74.699. The most common specialties included surgery, internal medicine, and obstetrics and gynecology (Table 1).
Figure 1.
Flowsheet of included studies.
Table 1.
Study Characteristics
| Characteristic | All (n=572) | Reproducible p-values (n=512) | Nonreproducible p-values, crosses p=0.05 (n=11) | Nonreproducible p-values, does not cross p=0.05 (n=49) | p-value |
|---|---|---|---|---|---|
| Year, median (IQR) | 2014 (2010, 2017) | 2014 (2010, 2017) | 2015 (2008, 2016) | 2016 (2012, 2018) | 0.012 |
| Continent | | | | | 0.164 |
| Africa | 25 (4%) | 22 (4%) | 2 (18%) | 1 (2%) | |
| Asia | 173 (30%) | 149 (29%) | 5 (45%) | 19 (39%) | |
| Australia/Zealandia | 20 (3%) | 17 (3%) | 0 | 3 (6%) | |
| Europe | 216 (38%) | 199 (39%) | 2 (18%) | 15 (31%) | |
| North America | 125 (22%) | 113 (22%) | 1 (9%) | 11 (22%) | |
| South America | 13 (2%) | 12 (2%) | 1 (9%) | 0 | |
| Masters/PhD/statistician | 372 (65%) | 331 (65%) | 7 (63%) | 34 (69%) | 0.176 |
| Impact factor, median (IQR) | 2.89 (1.92, 5.05) | 2.91 (1.96, 5.18) | 2.99 (1.75, 5.78) | 2.78 (1.70, 4.10) | 0.281 |
| Positive outcome | 257 (45%) | 246 (48%) | 11 (100%) | 0 | <0.001 |
| Specialty | | | | | 0.203 |
| Anesthesiology | 6 (1%) | 5 (1%) | 0 | 1 (2%) | |
| Dermatology | 7 (1%) | 7 (1%) | 0 | 0 | |
| Primary care | 282 (49%) | 258 (50%) | 3 (27%) | 21 (43%) | |
| Primary care: Emergency medicine | 6 (2%) | 5 (2%) | 0 | 1 (5%) | |
| Primary care: Internal medicine | 144 (51%) | 133 (52%) | 3 (100%) | 8 (38%) | |
| Primary care: Obstetrics and gynecology | 83 (29%) | 75 (29%) | 0 | 8 (38%) | |
| Primary care: Pediatrics | 49 (17%) | 45 (17%) | 0 | 4 (19%) | |
| Psychiatry | 4 (1%) | 3 (1%) | 0 | 1 (2%) | |
| Radiology | 5 (1%) | 4 (1%) | 0 | 1 (2%) | |
| Surgery | 268 (47%) | 235 (46%) | 8 (72%) | 25 (51%) | |
| Surgery: General surgery | 238 (89%) | 211 (90%) | 8 (100%) | 19 (76%) | |
| Surgery: Neurosurgery | 3 (1%) | 0 | 0 | 3 (12%) | |
| Surgery: Ophthalmology | 9 (3%) | 8 (3%) | 0 | 1 (4%) | |
| Surgery: Orthopedic surgery | 12 (4%) | 12 (5%) | 0 | 0 | |
| Surgery: Urology | 6 (2%) | 4 (2%) | 0 | 2 (8%) | |
IQR interquartile range
Of the included RCTs, 528 (92%) clearly reported the primary outcome and 546 (95%) clearly reported the statistical method used to analyze outcomes. Nine (2%) of the 572 studies used denominators that differed from those our team was able to extract using the included flowsheets and the intention-to-treat or per-protocol analysis stated in the manuscript. Positive primary outcome (p-value <0.05) was reported in 260 (45%) studies (Table 1).
Eleven (2%) of the 572 RCTs published nonreproducible p-values that crossed a p-value of 0.05 (Table 2). All 11 studies were published as “positive studies” while the reproduction statistics were all ≥0.05. The median (interquartile range, IQR) of the absolute difference between the reproduced and the reported p-value was 0.055 (0.026, 0.104). The 11 RCTs were published in journals with impact factors ranging from 1.60 to 6.30. None of these 11 studies reported the use of a statistician and seven studies (63%) reported authors with either a master’s degree or PhD.
Table 2.
Studies with “Nonreproducible Statistics”
| No. | Year | Specialty | Topic | Statistical test used | Reported p-value | Reproduced p-value |
|---|---|---|---|---|---|---|
| 1 | 2004 | Internal medicine | Catheter management | Fisher’s | 0.033 | 0.066 |
| 2 | 2005 | Surgery | Hernia prevention | Chi-square | <0.05 | 0.160 |
| 3 | 2008 | Internal medicine | Catheter management | Fisher’s | 0.03 | 0.054 |
| 4 | 2010 | Surgery | Hernia repair | Fisher’s | <0.05 | 0.076 |
| 5 | 2011 | Surgery | Esophagectomy | Fisher’s | 0.046 | 0.069 |
| 6 | 2014 | Surgery | Pulmonary lobectomy | Fisher’s | 0.044 | 0.076 |
| 7 | 2015 | Internal medicine | Chronic kidney disease | Chi-square | 0.004 | 0.113 |
| 8 | 2016 | Surgery | Wound management | Fisher’s | 0.02 | 0.096 |
| 9 | 2016 | Surgery | Colorectal surgery | Fisher’s | 0.006 | 0.148 |
| 10 | 2017 | Surgery | Hernia repair | Fisher’s | 0.043 | 0.147 |
| 11 | 2018 | Surgery | Hernia prevention | Fisher’s | 0.012 | 0.101 |
Forty-nine (9%) studies published p-values that differed by 0.1 or more from our reproduction analysis but did not cross 0.05. All studies published p-values of ≥0.05. The median (IQR) of the absolute difference between the reproduced and the reported p-value was 0.200 (0.133, 0.301). Of these, impact factors ranged from 0.55 to 53.30. Twelve (24%) reported the use of a statistician, and 34 (69%) reported authors with either a master’s degree or PhD.
Four (1%) studies had a categorical primary outcome that would have been better assessed using Fisher’s exact test rather than chi-square (Table 3) and proper use of Fisher’s would have changed the significant findings of the study. Of these, none had published protocols reporting their intention to analyze their study using the chi-square statistic. When retested using Fisher’s exact test, all four crossed the 0.05 threshold. All four studies were published as “positive studies” while the reproduction statistics were all ≥0.05. The median (IQR) of the absolute difference between the reproduced and the reported p-value was 0.040 (0.027, 0.065). Of these, impact factors ranged from 0.818 to 1.952. None of these four studies reported the use of a statistician, and one (25%) reported authors with either a master’s degree or PhD.
Table 3.
Studies with “Nonreproducible Results” Due to Inappropriate Use of Chi-Square Given Expected Frequency of <5 Events in at Least One Arm
| No. | Year | Specialty | Topic | Reported p-value using chi-square | Reproduced p-value using Fisher’s exact |
|---|---|---|---|---|---|
| 1 | 2003 | Surgery | Hernia prevention | 0.02 | 0.055 |
| 2 | 2004 | Surgery | Gastric cancer | 0.0365 | 0.061 |
| 3 | 2007 | Surgery | Hernia repair | 0.036 | 0.107 |
| 4 | 2014 | Surgery | Hernia repair | 0.01 | 0.055 |
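The headline proportions in this section can be recomputed from the counts given in the text (11, 49, and 4 of 572 studies). The sketch below also attaches a 95% Wilson score interval to each proportion. The interval is not reported in this study and is added here only as an illustration of the uncertainty around small counts; the function name and the labels are assumptions.

```python
from math import sqrt


def wilson_interval(events, total, z=1.959964):
    """95% Wilson score interval for a binomial proportion (illustrative helper)."""
    p = events / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


TOTAL = 572  # number of included RCTs
counts = {
    "crossed p=0.05": 11,
    "differed by 0.1 or more": 49,
    "chi-square used where Fisher's exact applied": 4,
}

for label, k in counts.items():
    lo, hi = wilson_interval(k, TOTAL)
    print(f"{label}: {k}/{TOTAL} = {100 * k / TOTAL:.1f}% "
          f"(95% CI {100 * lo:.1f}% to {100 * hi:.1f}%)")
```

Rounded to whole percentages, the three proportions agree with the 2%, 9%, and 1% stated in the text.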
DISCUSSION
Among a sample of over 500 randomized controlled trials, 2% reported p-values that could not be reproduced and crossed the threshold of 0.05. Of importance, all of these studies published positive results. In addition, another 9% of studies published statistics that differed from the reproduction results by 0.1 or more but did not cross the 0.05 threshold.
Fisher’s exact and chi-square tests were used most often in analyzing results, in both concordant and discordant studies. Several studies used chi-square in their pre-specified analysis when Fisher’s exact would have been appropriate given the small sample sizes. For our primary analysis, we analyzed results according to the statistical method reported in the paper. However, when using the optimal test, we identified an additional four studies that reported positive results that would have been negative studies had the optimal statistical test been used.
It is unclear what is an acceptable percentage of studies with statistics that cannot be reproduced. While some amount of human error is plausible, among RCTs this seems potentially problematic. First, RCTs require substantial preparation and often years of work. Second, the final analysis of the primary outcome is often “simple” because the design is at low risk for bias. In this study, we only chose results that we could reproduce with “basic statistics” such as Fisher’s or chi-square test. Finally, all of the studies that differed from the reproduced p-values and crossed the p=0.05 threshold crossed in favor of “positive results.” No study incorrectly reported negative results when actually positive results occurred. All of this is concerning and suggestive of possible intentional reporting of incorrect “positive results”.
There is substantial pressure among academic healthcare scientists to publish studies, particularly impactful studies. Promotion, funding, reputation, and invitations to speak are all affected by publications and high-impact publications. In addition, it is well known that positive results are more likely to be published and published in higher-impact journals.9,10 Conflicts of interests have been known to impair judgement and decision-making among scientists resulting in some extremely high-profile examples of violations of medical ethics and scientific integrity.11–13
The next steps in assessing this topic are to identify why these errors are occurring and how to prevent their occurrence. We have considered contacting the authors of these studies to ask why the statistics were discordant. However, there is concern that many of these authors would not respond or would be upset. A screening approach could prevent these statistical errors. Currently, many journals require data and/or a data availability statement prior to manuscript review. Prior to publication acceptance, all RCTs (and perhaps all studies in general) should have a statistical review of the raw data. Several journals have already adopted this approach. This review would serve as a screening for statistical errors. Of course, implementation of this review process would place a huge burden on statisticians and journals. However, knowing that this review process would occur may prevent authors from reporting results that cannot be replicated. Alternatively, it could push authors to simply alter or manipulate their raw data. Any author who has reported concerning outcomes (i.e., discordant statistics that cross 0.05) could be publicly reported and blocked from publishing for a prescribed duration. Preprint servers are another option, providing authors additional opportunities for review and error detection prior to formal submission.
There are a number of limitations to this study. This study only assessed published RCTs and was not able to evaluate unpublished RCTs, which are likely to include more “negative studies.” Only studies with statistical analyses that could be reproduced from data presented in the manuscript were included. Studies with stratified outcomes, time-to-event analysis, or regression analysis also could not be assessed. While we powered our study from an epidemiological sampling perspective, the number of studies with nonreproducible statistical results was small. The results of this study should be replicated among other medical publications and academic specialties. Finally, the reason and cause of nonreproducible results were not directly explored through contacting these authors or journal editors, and we can only surmise possible etiology.
CONCLUSION
RCTs remain the foundation for evidence at the lowest risk of bias and guide treatment and decision-making among healthcare providers. While uncommon, 2% of published trials sampled reported statistical results that could not be reproduced, and all of them erred by reporting “positive” results while the reproduction analysis demonstrated p≥0.05. Efforts should focus on preventing this type of event, which can be facilitated through increased transparency and double-checking of results.
Declarations
Conflict of Interest
The authors declare that they do not have a conflict of interest.
References
- 1. Ware M, Mabe M. The STM Report: An overview of scientific and scholarly journal publishing. International Association of Scientific, Technical and Medical Publishers. Prama House, UK. 2005.
- 2. Fletcher RH, Fletcher SW. Evidence-based approach to the medical literature. J Gen Intern Med. 1997;12.
- 3. Akobeng AK. Understanding randomised controlled trials. Archives of Disease in Childhood. 2005;90:840-4.
- 4. Kwaan MR, Melton GB. Evidence-based medicine in surgical education. Clin Colon Rectal Surg. 2012;25.3:151-5.
- 5. Masic I, Miokovic M, Muhamedagic B. Evidence based medicine - new approaches and challenges. Acta Inform Med. 2008;16.4:219–225.
- 6. Wright A, Hannon J, Hegedus EJ, et al. Clinimetrics corner: a closer look at the minimal clinically important difference (MCID). J Man Manip Ther. 2012;20.3:160-6.
- 7. Mouelhi Y, Jouve E, Castelli C, et al. How is the minimal clinically important difference established in health-related quality of life instruments? Review of anchors and methods. Health Qual Life Outcomes. 2020;18.136.
- 8. Agresti A. An Introduction to Categorical Data Analysis. New York: John Wiley and Sons; 1996.
- 9. Olson CM, Rennie D, Cook D, et al. Publication bias in editorial decision making. JAMA. 2002;287.21:2825–8.
- 10. Chalmers I, Glasziou P. Avoidable waste in the production and reporting of research evidence. Lancet. 2009;374.9683:86-9.
- 11. Mehra MR, Desai SS, Ruschitzka F, et al. Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. Lancet 2020;6736.20:31180-6. [Retracted]
- 12. Mehra MR, Desai SS, Kuy S, et al. Cardiovascular disease, drug therapy, and mortality in Covid-19. N Engl J Med 2020;382.25. [Retracted]
- 13. Eggertson L. Lancet retracts 12-year-old article linking autism to MMR vaccines. CMAJ. 2010;182.4:E199-E200.

