To the Editor,
In their editorial, Drs. Leopold and Porcher highlighted what I believe is the most important misunderstanding between the biostatistics community and the health-research audience it seeks to serve: how (and how not) to use the p value [7].
The p value can help readers interpret results, but it is easily misapprehended, so this tool needs to be used thoughtfully by investigators and assessed carefully by journal editors; it is also worth noting that not every study benefits from inference testing with p values at all [2-4, 11, 13]. The biostatistics community has worked to educate others about the inappropriate importance commonly placed on p values. Even so, many persons without statistical training place great, and often inordinate, importance on the presence or absence of statistical significance as defined by a p value [2-4, 11, 13]. The p value provides a minuscule amount of information (and what little it provides is commonly misinterpreted), which is why methodologists and statisticians continue to implore health researchers to stop relying on the p value for interpretive assistance [2-4, 11, 13].
Current Knowledge on Dissemination Efforts
Although efforts to educate audiences about the limitations of the p value are warranted, the avenues through which statisticians attempt to disseminate their message generally are flawed. The American Statistical Association’s (ASA) statement on p values is a thorough essay with a robust list of references that call the common uses of the p value in medical research into question [13]. The problem is that many of these articles are published in journals that serve a statistical audience, such as The American Statistician [13], rather than in journals targeted at the readers who would gain the most from them, such as clinicians and policy makers. Of course, there are notable exceptions [3, 5, 7], and I commend the CORR editorial team for its efforts to address this issue within its journal.
An alternative approach is to target specific clinical fields directly in a grass-roots-type movement, as was done with OrthoEvidence [8], an online platform that provides summaries of current evidence to the orthopaedic community through a social media news feed. The site’s editorial commentary section, dubbed “OE Originals”, targets orthopaedic surgeons and related clinician groups and discusses pertinent topics [8]. One such editorial, “3 reasons why p-values are dangerous”, provides a brief but effective summary of the issues raised in the ASA’s statement on p values [9, 13]. The key difference is that the OrthoEvidence commentary is written for, and delivered to, the audience most likely to be influenced by its concepts. Perhaps grass-roots efforts such as this will prove more effective at clarifying one of the largest disconnects between the biostatistical field and its clinical audience than publications in statistical journals have been.
If Not the p Value, Then What?
It’s a simple question: If not the p value, then what should we use to decide which clinical research findings matter? To answer it, we need an approach that is easy to use and understand, but that still gives us confidence that the inferences drawn in the research we read are fair.
Perhaps the most likely statistical metric to fill the void is the confidence interval (CI) around the point estimate [4, 7]. Correctly interpreting the CI addresses several of the concerns the research community has with regard to the p value [4]. For example, CIs are already commonplace in most studies. Additionally, CIs give insight into the range of likely magnitudes of the observed effect [4] in ways that are much more helpful than p values alone. They also focus the eye on effect sizes more than p values do; a p value, after all, is something neither patients nor providers can perceive.
When a clinician begins to consider effect sizes, a number of good things happen. First, the clinician starts to think about whether the size of the effect is worth the risk or cost of the treatment, and whether, for smaller effects, a patient would even consider a “statistically significant benefit” of the treatment being studied to be clinically important. The latter involves an important (and underappreciated) concept called the minimum clinically important difference (MCID) [6]. If the lower limit of a 95% CI does not reach the MCID, the clinician reading the article can be fairly certain that the treatment in question will not improve a patient’s health in ways the patient is likely to care about. While CIs are, on the surface, harder to interpret than p values, with a little use they become intuitive, and they provide far more information than the p value alone [4, 10].
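As a sketch of this style of reading, consider a hypothetical trial (the group sizes, outcome scale, and MCID below are assumptions for illustration, not from any study cited here). A CI for a mean difference can exclude zero, making the result “statistically significant,” while its lower limit still falls short of the MCID:

```python
from math import sqrt

def mean_diff_ci(mean_t, sd_t, n_t, mean_c, sd_c, n_c, z=1.96):
    """95% CI for a difference in means (large-sample normal approximation)."""
    diff = mean_t - mean_c
    se = sqrt(sd_t ** 2 / n_t + sd_c ** 2 / n_c)  # standard error of the difference
    return diff - z * se, diff + z * se

# Assumed MCID on a hypothetical 0-100 outcome scale
MCID = 10.0

# Hypothetical improvement scores: treatment vs. control, 60 patients per arm
lo, hi = mean_diff_ci(mean_t=18.0, sd_t=15.0, n_t=60,
                      mean_c=12.0, sd_c=15.0, n_c=60)

statistically_significant = lo > 0      # CI excludes zero
clinically_meaningful = lo >= MCID      # lower limit must reach the MCID
# Here the CI is roughly (0.6, 11.4): "significant," but the lower limit
# is well below the MCID, so the benefit may not matter to patients.
```

This is the distinction the MCID makes explicit: the same interval can clear the bar for statistical significance while leaving real doubt about clinical importance.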
The Fragility Index (FI) is another helpful statistic that should be considered in conjunction with p values, as it provides additional context about the credibility of the results [12]. The FI asks what would happen to a study’s result if a few events in one or another arm had not occurred, and so is a measure of the confidence one can have in the robustness of the p value itself. Imagine a study in which 55 patients received a drug that may reduce the risk of infection after surgery, 55 patients did not, and five were lost to followup in each group. When the results were tallied, eight patients in the control group and one patient who received the drug developed an infection. The p value (depending on how it is calculated) in this hypothetical study is about 0.03, which would be considered statistically significant if the typical (0.05) level were chosen by the investigators. However, if even one more patient in the treatment group developed an infection, which could easily have occurred with five missing, the p value would be greater than 0.09, and so no longer “statistically significant.” As a result, this study would have an FI of 1 (meaning that a different result in only one patient would change one’s impression of the statistical significance of the study at the predetermined level). Freely available online calculators allow readers to explore this in the context of the studies they read [1]. Given that orthopaedic studies often are small, event rates are low, and loss to followup is common, such studies tend to have low FIs; that is, their statistical significance hinges on a very small number of events [12]. The FI helps the reader know how much to trust a p value, and for that reason, it is a useful metric in evidence-based decision making.
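The arithmetic in the hypothetical above can be checked directly. The sketch below (a minimal plain-Python version; the function names are mine, and it assumes a two-arm trial with a dichotomous outcome analyzed by a two-sided Fisher exact test on the 50 patients per arm who completed followup) reproduces the p values and the FI of 1:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)
    def prob(x):  # hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / denom
    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum all tables at least as extreme (probability <= observed, with tolerance)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Add events one at a time to the first arm (the one with fewer events
    in this example) until the result is no longer significant at alpha."""
    flips = 0
    while events_a < n_a:
        p = fisher_two_sided(events_a, n_a - events_a, events_b, n_b - events_b)
        if p >= alpha:
            return flips
        events_a += 1
        flips += 1
    return flips

# The letter's hypothetical: 1/50 infections with the drug vs. 8/50 controls
p_original = fisher_two_sided(1, 49, 8, 42)   # about 0.03
p_one_more = fisher_two_sided(2, 48, 8, 42)   # greater than 0.09
fi = fragility_index(1, 50, 8, 50)            # 1
```

Note that the published FI method adds events to the group with the smaller number of events [12]; the helper above applies that convention to this example by passing the treatment arm first.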
While it is clear within the biostatistical community that the p value should carry less weight in clinical studies, this message—along with suggestions for alternatives—should be more effectively disseminated among clinicians and policymakers.
Footnotes
(Leopold SS, Porcher R. Editorial: Threshold P Values in Orthopaedic Research—We Know the Problem. What is the Solution? Clin Orthop Relat Res. 2018;476:1689-1691).
The author certifies that neither he, nor any members of his immediate family, have any commercial associations (such as consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
The opinions expressed are those of the writers, and do not reflect the opinion or policy of CORR® or The Association of Bone and Joint Surgeons®.
References
- 1. Fragility Index Calculator. Available at: https://clincalc.com/Stats/FragilityIndex.aspx. Accessed April 30, 2019.
- 2. Gelman A, Stern H. The difference between “significant” and “not significant” is not itself statistically significant. Am Stat. 2006;60:328-331.
- 3. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999;130:995-1004.
- 4. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol. 2016;31:337-350.
- 5. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005;294:218-228.
- 6. Johnston BC, Ebrahim S, Carrasco-Labra A, Furukawa TA, Patrick DL, Crawford MW, Hemmelgarn BR, Schunemann HJ, Guyatt GH, Nesrallah G. Minimally important difference estimates and methods: A protocol. BMJ Open. 2015;5:e007953.
- 7. Leopold SS, Porcher R. Editorial: Threshold P values in orthopaedic research-We know the problem. What is the solution? Clin Orthop Relat Res. 2018;476:1689-1691.
- 8. OrthoEvidence Editors. OrthoEvidence. OrthoEvidence Inc. Available at: www.myorthoevidence.com.
- 9. OrthoEvidence Editors. 3 reasons why p-values are dangerous. OrthoEvidence Inc. Available at: https://myorthoevidence.com/Blog/show/16. Accessed April 30, 2019.
- 10. OrthoEvidence Editors. The 1 thing you need to know when interpreting study results with confidence! OrthoEvidence Inc. Available at: https://myorthoevidence.com/Blog/Show/19. Accessed April 30, 2019.
- 11. Schervish MJ. P values: What they are and what they are not. Am Stat. 1996;50:203-206.
- 12. Walsh M, Srinathan SK, McAuley DF, Mrkobrada M, Levine O, Ribic C, Molnar AO, Dattani ND, Burke A, Guyatt G, Thabane L, Walter SD, Pogue J, Devereaux PJ. The statistical significance of randomized controlled trial results is frequently fragile: A case for a Fragility Index. J Clin Epidemiol. 2014;67:622-628.
- 13. Wasserstein RL, Lazar NA. The ASA's statement on p-values: Context, process, and purpose. Am Stat. 2016;70:129-133.
