Where Are We Now?
For years, a steady drumbeat has been building regarding the problems of continuing to rely on null-hypothesis significance testing (for example, p < 0.05) as the primary measure of success in research studies [1]. The drumbeat intensified in 2005 [7] and again more recently [18], when prominent scientists set forth a litany of reasons why interpreting medical research articles based solely on p values as thresholds for “real” findings does not lead to sound statistical inference. These mounting concerns over the reproducibility of scientific findings [17] have contributed to an erosion of public confidence in science over the past several years [9]. Indeed, in 2022, only 29% of United States adults (down from 40% in 2020) reported having a great deal of confidence that scientists will act in the best interests of the public [9]. Our shared mission is to improve the health and quality of our patients’ lives, and public trust, specifically patient trust, is critical to the success of that mission [13].
To a large extent, orthopaedic surgery has led the march away from reliance on p values as the sole arbiters of real findings; instead, we increasingly ask whether a treatment or procedure is beneficial from the patient’s perspective [19]. A quick search of PubMed identifies more than 4000 articles using the terms “patient-reported outcome measures” (PROMs) and “orthopaedics,” most of which were published in the past 3 years. However, this growing inclusion of the patient perspective has raised new questions about standardization, which we must answer before we can accurately interpret results as supporting or opposing a specific intervention. Whereas conventional null-hypothesis testing asks whether a treatment affects people, effect sizes assess how much the treatment affects people [15]. The minimum clinically important difference (MCID) is one measure of effect size; it provides a threshold for the smallest change in a score that patients can notice and perceive as meaningful. Crucially, PROMs and their associated measure of effect, the MCID, can incorporate the patient’s perspective as an alternative to the paternalism inherent in the p value [9].
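To make this distinction concrete, consider a minimal sketch in Python. The data are simulated and the 0-to-100 PROM scale is hypothetical; the point is only that the t test answers “whether the groups differ” while an effect size such as Cohen’s d answers “by how much.”

```python
# Minimal sketch (simulated data, hypothetical PROM scale): p value vs. effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(52, 15, 80)  # hypothetical PROM scores, treatment A
b = rng.normal(45, 15, 80)  # hypothetical PROM scores, treatment B

t_stat, p = stats.ttest_ind(a, b)  # "is there a difference?"

# Cohen's d: group difference in pooled-SD units -- "how big is the difference?"
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (a.mean() - b.mean()) / pooled_sd

print(f"p = {p:.4f}, Cohen's d = {cohens_d:.2f}")
```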
The authors of the current article [3] reviewed 38 articles, encompassing more than 700,000 patients, that calculated MCID values for PROMs after primary TKA. They demonstrated that despite considerable variability in reported values, the MCIDs for the most frequently reported PROMs fell into clusters, and they provide recommendations for choosing MCID values for certain commonly used PROMs. Because some consensus about MCIDs exists, the results of this study can guide researchers choosing or calculating an MCID, and they offer strong support for prioritizing anchor-based methods, particularly when the anchor is based on patient input. Therefore, surgeons should engage in shared decision-making with patients to establish each patient’s “anchor,” that is, the amount of improvement the patient feels is crucial to a good treatment outcome. Similarly, clinical researchers should use anchor-based approaches whenever possible to ground study results in the patient’s own framework.
Where Do We Need To Go?
However, questions remain regarding the optimal use of PROMs and the MCID. Chief among them: as with p values, there are many ways to determine an MCID. Anchor-based approaches use an external indicator (an anchor), which can be either an objective or a subjective measure of interest, whereas distribution-based methods use statistical criteria from the PROM itself (such as the standard error of measurement or a fraction of the standard deviation) to estimate an MCID [12]; a sketch of both families appears below. Understanding how to choose an appropriate MCID, and how best to calculate one, is critically important because providing a measure of effect size is crucial in studies of treatment efficacy [4, 8]. In addition, unlike p values, each MCID is unique to the PROM in question and may vary with treatment and with population size and composition [19]. Therefore, as we increasingly incorporate the patient’s perspective, we must wrestle with the interpretability [4] of these results and learn to evaluate interventions for health improvements that are not only measurable but also meaningful.
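The contrast between the two families of methods can be illustrated with a short, hypothetical sketch. The simulated scores, the three-level anchor question, and the reliability value are assumptions chosen for illustration, not values drawn from the article.

```python
# Minimal sketch contrasting anchor-based and distribution-based MCID estimates.
# All data, the anchor question, and the reliability value are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n = 200
baseline = rng.normal(50, 15, n)   # hypothetical preoperative PROM scores (0-100)
change = rng.normal(12, 10, n)     # hypothetical change in PROM score after surgery

# Hypothetical anchor: each patient rates their own improvement
# (0 = no change, 1 = slightly better, 2 = much better).
anchor = np.clip(np.round(change / 12), 0, 2).astype(int)

# Anchor-based MCID (mean-change method): average score change among patients
# who report feeling "slightly better" -- the smallest change patients
# themselves perceive as meaningful.
mcid_anchor = change[anchor == 1].mean()

# Distribution-based estimates: statistical criteria only, no patient input.
mcid_half_sd = 0.5 * baseline.std(ddof=1)               # half a baseline SD
reliability = 0.90                                      # assumed test-retest reliability
sem = baseline.std(ddof=1) * np.sqrt(1 - reliability)   # standard error of measurement

print(f"anchor-based MCID:           {mcid_anchor:.1f}")
print(f"distribution-based (0.5 SD): {mcid_half_sd:.1f}")
print(f"distribution-based (SEM):    {sem:.1f}")
```

Note the design difference: only the anchor-based estimate is tied to what patients say they can feel, which is why the current article [3] and others favor it when a suitable anchor is available.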
To understand fully where we need to go, we must keep our shared mission in mind: to improve the health and quality of our patients’ lives. We are therefore searching for practical benefit in our treatments and research endeavors [14]. Pogrow [14] argued that practical benefit exists when the unadjusted performance of an experimental group provides a noticeable advantage over an existing benchmark. This sits in stark contrast to the traditional p value approach, in which, with a large enough sample size, one may detect a very small difference between experimental groups regardless of whether that difference means anything; the simulation below makes this concrete. Practical benefit is similar to an anchor-based MCID in which the anchor is the patient’s own benchmark. One advantage of shifting the scientific conversation away from mere statistical significance and toward metrics that more closely resemble practical benefit is that it inherently raises the bar for the reproducibility and, more importantly, the reliability of results [18].
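In the following brief simulation, the sample sizes, the 0.5-point true difference, and the 8-point benchmark are all hypothetical; it shows a result that is “statistically significant” yet falls far below any patient-noticeable benchmark.

```python
# Minimal sketch (simulated data): with very large samples, a trivially small
# difference reaches "significance" while offering no practical benefit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 50_000                              # very large trial arms
treated = rng.normal(50.5, 15, n)       # true difference of only 0.5 points
control = rng.normal(50.0, 15, n)

t_stat, p = stats.ttest_ind(treated, control)
diff = treated.mean() - control.mean()
benchmark = 8.0                         # hypothetical patient-anchored benchmark

print(f"p = {p:.2e}, observed difference = {diff:.2f} points")
print(f"practical benefit? {diff >= benchmark}")  # significant, yet not meaningful
```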
Improving the reproducibility of our scientific results builds confidence that the knowledge gained from any particular study is robust. However, reproducible results alone are not enough to rebuild trust between scientists and the public, nor are they enough to build a trusting clinician-patient relationship. Trust is forward-looking and relies on patient expectations of compassion, reliability, integrity, competence, and open communication [5, 6]. Therefore, to build trust, we must move research results beyond the raw “we found a difference” model and be able to explain why the results are reliable, ethical, and important to the patient.
How Do We Get There?
To increase the practical benefit derived from our experiments, treatments, and interventions, it is not enough simply to eschew the p value. Leopold and Porcher [11] proposed a “backdoor Bayesian” approach that adjusts p value thresholds based on prior probability. Although this approach adds thoughtfulness [18] to the equation, it does not address the frequent misinterpretation of p values, and it reinforces their dichotomization [1]. Instead, many journals have encouraged a combined approach in which “statistical significance” and confidence intervals are used to identify results that warrant more investigation, while multiple measures of effect size are used to convey potential practical benefit. In this way, as the authors of the current article [3] suggest, an MCID can be used to identify differences large enough to matter to a patient, and other anchor-based measures of the importance or size of an effect, such as the substantial clinical benefit, can be used to identify effects above that minimum threshold [2, 10].
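As a hypothetical illustration of this combined approach, the sketch below reads a confidence interval against assumed MCID and substantial clinical benefit (SCB) thresholds rather than stopping at p < 0.05; the data and threshold values are invented for illustration.

```python
# Minimal sketch of the combined approach: report the CI for a group
# difference and interpret it against patient-centered thresholds.
# Data, MCID, and SCB values are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(14, 10, 120)   # hypothetical PROM score changes
control = rng.normal(6, 10, 120)

mcid, scb = 8.0, 15.0               # assumed thresholds for this PROM

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
dof = len(treated) + len(control) - 2
lo, hi = diff + np.array([-1, 1]) * stats.t.ppf(0.975, dof) * se

print(f"difference: {diff:.1f} (95% CI {lo:.1f} to {hi:.1f})")
if lo >= scb:
    print("effect exceeds substantial clinical benefit")
elif lo >= mcid:
    print("effect exceeds the MCID -- large enough for patients to notice")
elif hi < mcid:
    print("any effect is likely too small for patients to notice")
else:
    print("CI spans the MCID -- clinical importance remains uncertain")
```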
In addition to broadening our horizons to include measures other than p values, we need to critically evaluate (and re-evaluate) our chosen metrics to ensure that we maintain reliability and integrity. Devji et al. [4] proposed an instrument to evaluate the credibility of anchor-based MCID estimates, which could be extended to other anchor-based tools. Much as bias-assessment tools do for systematic reviews, a credibility instrument can help build confidence in the reliability of these quickly evolving measurement systems. Similarly, we must pay increasing attention to reproducibility and remember that context matters [16]. As has been noted [19], patients may hold values or expectations that differ according to context. Therefore, the standardized construction of population-specific (not just procedure-specific) MCIDs should be a focus of future research, particularly using anchors relevant to the context of the population in question.
Trust is elusive and difficult to regain once lost. However, asking patients what matters to them, and defining our collective success by those metrics, is one way to demonstrate compassion for the populations we serve. Reliability is an easier concept for most clinician-scientists to grapple with, but it is an essential pillar of the scientific endeavor. Raising the bar so that results have practical benefit will increase the reliability of our reporting and help establish the integrity and competence needed to build trust.
Finally, patients expect open communication in a trusting relationship. Science is a human endeavor and we are all fallible, but embracing these tools will allow us to build trust, communicate our findings openly with patients, and ultimately improve their lives.
Footnotes
This CORR Insights® is a commentary on the article “There are Considerable Inconsistencies Among Minimum Clinically Important Differences in TKA: A Systematic Review” by Deckey and colleagues available at: DOI: 10.1097/CORR.0000000000002440.
The author certifies that there are no funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article related to the author or any immediate family members.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
The opinions expressed are those of the writer, and do not reflect the opinion or policy of CORR® or The Association of Bone and Joint Surgeons®.
References
- 1. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567:305-307.
- 2. Bernstein DN, Nwachukwu BU, Bozic KJ. Value-based health care: moving beyond “minimum clinically important difference” to a tiered system of evaluating successful clinical outcomes. Clin Orthop Relat Res. 2019;477:945-947.
- 3. Deckey DG, Verhey JT, Gerhart CRB, et al. There are considerable inconsistencies among minimum clinically important differences in TKA: a systematic review. Clin Orthop Relat Res. 2023;481:63-80.
- 4. Devji T, Carrasco-Labra A, Qasim A, et al. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. BMJ. 2020;369:m1714.
- 5. Goold SD. Trust, distrust and trustworthiness. J Gen Intern Med. 2002;17:79-81.
- 6. Gopichandran V, Chetlapalli SK. Dimensions and determinants of trust in health care in resource poor settings – a qualitative exploration. PLoS One. 2013;8:e69170.
- 7. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2:e124.
- 8. Karhade AV, Bono CM, Schwab JH, Tobert DG. Minimum clinically important difference. J Bone Joint Surg Am. 2021;103:2331-2337.
- 9. Kennedy B, Tyson A, Funk C. Americans’ trust in scientists, other groups declines. Pew Research Center. February 15, 2022. Available at: https://www.pewresearch.org/science/2022/02/15/americans-trust-in-scientists-other-groups-declines/. Accessed October 8, 2022.
- 10. Leopold SS, Porcher R. Editorial: the minimum clinically important difference-the least we can do. Clin Orthop Relat Res. 2017;475:929-932.
- 11. Leopold SS, Porcher R. Editorial: threshold p values in orthopaedic research-we know the problem. What is the solution? Clin Orthop Relat Res. 2018;476:1689-1691.
- 12. Ousmen A, Touraine C, Deliu N, et al. Distribution- and anchor-based methods to determine the minimally important difference on patient-reported outcome questionnaires in oncology: a structured review. Health Qual Life Outcomes. 2018;16:228.
- 13. Parikh S. Why we must rebuild trust in science. The Pew Charitable Trusts. Winter 2021. Available at: https://www.pewtrusts.org/en/trend/archive/winter-2021/why-we-must-rebuild-trust-in-science. Accessed October 8, 2022.
- 14. Pogrow S. How effect size (practical significance) misleads clinical practice: the case for switching to practical benefit to assess applied research findings. Am Stat. 2019;73(sup1):223-234.
- 15. Sullivan GM, Feinn R. Using effect size-or why the p value is not enough. J Grad Med Educ. 2012;4:279-282.
- 16. Van Bavel JJ, Mende-Siedlecki P, Brady WJ, Reinero DA. Contextual sensitivity in scientific reproducibility. Proc Natl Acad Sci U S A. 2016;113:6454-6459.
- 17. Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat. 2016;70:129-133.
- 18. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05”. Am Stat. 2019;73(sup1):1-19.
- 19. Zuckerman JD. CORR Insights®: substantial inconsistency and variability exists among minimum clinically important differences for shoulder arthroplasty outcomes: a systematic review. Clin Orthop Relat Res. 2022;480:1384-1386.
