Abstract
The fragility index (FI) has been increasingly used to assess the robustness of the results of clinical trials since 2014. It aims at finding the smallest number of event changes that could alter originally statistically significant results. Despite its popularity, some researchers have expressed several concerns about the validity and usefulness of the FI. This article offers a comprehensive review of the FI’s rationale, calculation, software, and interpretation, with emphasis on application to studies in obstetrics and gynecology. We introduce the FI in the settings of individual clinical trials, standard pairwise meta-analyses, and network meta-analyses. We provide worked examples to demonstrate how the FI can be appropriately calculated and interpreted. In addition, we review the limitations of the traditional FI and some solutions proposed in the literature to address these limitations. In summary, we recommend using the FI as a supplemental measure in the reporting of clinical trials and a tool to communicate the robustness of trial results to clinicians. Other considerations that can aid in the FI’s interpretation include the loss to follow-up and the likelihood of data modifications that achieve the loss of statistical significance.
Keywords: confidence interval, fragility index, meta-analysis, P-value, research replicability, statistical significance
Background
The use of statistical significance to determine scientific conclusions has generated much debate in the scientific community.1-3 Considerable evidence indicates that statistical significance is frequently misused and misinterpreted. For example, a large-scale study found that P-values reported in biomedical research articles are often clustered around cutoff values, such as 0.05 and 0.001.4 Significant results are more likely to be published than non-significant ones, leading to publication bias.5-8 Improperly using statistical significance as a basis for decisions about which outcomes to report or which trials to publish is a form of “cherry-picking” that can seriously threaten the validity of scientific findings.9-11
To assess the robustness of statistical significance, Walsh et al.12 proposed the fragility index (FI) for studies with binary outcomes in 2014. Although the idea was not new, as similar concepts date back to the 1990s,13, 14 the FI has been increasingly applied in the biomedical literature over the past decade, particularly since 2019 (Figure 1).15 Appendix A in the Supplementary Materials provides bibliographic details of publications that involve the FI from January 1, 2014 to August 23, 2022. These publications mostly appeared in the specialties of cardiology, critical care medicine, gastroenterology, neurology, oncology, orthopedics, pediatrics, sports medicine, surgery, and urology. The FI began attracting attention from the obstetrics and gynecology community in the past two years.16-18 The increasing use of the FI is likely associated with increasing concerns about the misuse and misinterpretation of P-values. Meanwhile, the usefulness of the FI in clinical practice is itself debated. Several variants have been developed to extend the original FI and address its drawbacks.
Figure 1. Bar plot of the number of research items involving the fragility index based on PubMed as of August 23, 2022.
The abstract of each item was screened. The items using “fragility index” to refer to other concepts (not for assessing study results’ robustness) and retracted items were excluded. The publication year is the time when a research item was published in an issue; if an item was in press, the year of online publication ahead of print was used.
This article aims at offering a comprehensive review of the FI, including its variants, strengths, and limitations. We also provide worked examples to demonstrate the implementations of the FI, with emphasis on application to obstetrics and gynecology literature.
Fragility index for an individual study
The FI was initially developed for two-arm studies with binary outcomes that report counts of events and non-events as 2×2 tables.12 It investigates whether statistically significant results may be altered by small changes in the number of events. Walsh et al.12 proposed to calculate the FI by changing the event status of one patient in the group with fewer events from non-event to event, recalculating the two-sided P-value based on Fisher’s exact test, and iterating this process until the P-value is ≥0.05. The FI is the number of event status modifications required.
Table 1 presents an example adapted from the randomized controlled trial (RCT) by Sigurdardottir et al.19 The original analysis suggested an association between postpartum pelvic floor muscle training and urinary incontinence at the endpoint of 6 months postpartum, with P=0.025 based on Fisher’s exact test. The FI for this dataset and hypothesis test is 2. By changing two non-events to events in the intervention group, the P-value becomes 0.075 (event status modification I in Table 1), and the result is no longer significant. Of note, the event status of 2 patients can be changed in different ways in this example: the changes can also be made in both groups or in the control group only (event status modifications II and III in Table 1), which likewise yield statistically non-significant results.
Table 1.
An example of 2×2 table from the randomized controlled trial from Sigurdardottir et al.19 investigating whether postpartum pelvic floor muscle training reduces urinary incontinence.
| Group | No. of events | No. of non-events | Sample size | Fisher’s exact test P-value |
|---|---|---|---|---|
| Original dataset | | | | |
| Control | 31 | 7 | 38 | 0.025 |
| Intervention | 21 | 16 | 37 | |
| Event status modification I | | | | |
| Control | 31 | 7 | 38 | 0.075 |
| Intervention | 21+2 | 16−2 | 37 | |
| Event status modification II | | | | |
| Control | 31−1 | 7+1 | 38 | 0.083 |
| Intervention | 21+1 | 16−1 | 37 | |
| Event status modification III | | | | |
| Control | 31−2 | 7+2 | 38 | 0.090 |
| Intervention | 21 | 16 | 37 | |
For another outcome of being bothered by urinary symptoms, the intervention group had 10 events out of 37 patients, and the control group had 23 events out of 38 patients. Fisher’s exact test leads to P=0.005, and the FI is 4, suggesting that the statistical significance may be more robust than that of the previous outcome.
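The iterative procedure of Walsh et al.12 described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the online calculator or the R packages. The function names `fisher_exact_p` and `fragility_index` are hypothetical, and the two-sided P-value is assumed to be the sum of the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are events and non-events. Assumed convention:
    sum the probabilities of all tables no more likely than the observed one.
    """
    n1, n2, events = a + b, c + d, a + c
    total = comb(n1 + n2, events)

    def prob(x):  # hypergeometric probability of x events in group 1
        return comb(n1, x) * comb(n2, events - x) / total

    p_obs = prob(a)
    support = range(max(0, events - n2), min(n1, events) + 1)
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-7)))


def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Walsh-type FI: turn non-events into events in the group with fewer events."""
    fi = 0
    fewer_in_group1 = e1 <= e2  # the modified group is fixed at the start
    while fisher_exact_p(e1, n1 - e1, e2, n2 - e2) < alpha:
        if fewer_in_group1:
            e1 += 1
        else:
            e2 += 1
        if e1 > n1 or e2 > n2:
            raise ValueError("ran out of non-events before losing significance")
        fi += 1
    return fi


# Table 1: intervention 21/37 events, control 31/38 events.
print(round(fisher_exact_p(21, 16, 31, 7), 3))  # original dataset
print(fragility_index(21, 37, 31, 38))          # number of modifications
print(round(fisher_exact_p(23, 14, 31, 7), 3))  # event status modification I
```

Under these assumptions, the three printed values correspond to the original P-value, the FI, and the P-value after event status modification I in Table 1. The function returns 0 when the starting table is already non-significant under Fisher’s exact test.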
In addition, by the definition of Walsh et al.,12 Fisher’s exact test is used for recalculations of P-values, regardless of the statistical method originally used to calculate the P-value. As such, the FI could be 0 if the original analysis does not use Fisher’s exact test and has P<0.05 (e.g., based on the relative risk [RR]), but the re-analysis with Fisher’s exact test gives P≥0.05. The FI may also be defined based on the same statistical method as the original analysis.18,20 In this case, the FI should be at least 1. Additionally, the FI originally defined by Walsh et al.12 is restricted to RCTs with a balanced design (i.e., a 1:1 treatment allocation ratio); this restriction can be relaxed, and a more general algorithm of the FI is available for RCTs with an unbalanced design.20, 21
The FI can be easily calculated using the online tool Fragility Index Calculator22; see Figure S1 in the Supplementary Materials for a demonstration. For clinicians with some background in statistical coding, the FI can also be computed with several R packages,23-25 which offer more sophisticated methods and visualization tools than the online tool. Figure S2 briefly shows the FI calculation with the R package “fragility”; detailed instructions for using this package are available in Lin and Chu.26 Of note, the online calculator can sometimes output larger FIs than the R packages, because it follows the original proposal by Walsh et al.12 and modifies outcomes only in the group with fewer events.
Limitations and solutions
Along with the increasing use of the FI to assess the robustness of the results of clinical trials, several concerns have arisen to question its validity and usefulness. Nevertheless, several solutions have also been proposed to address the limitations of the original FI proposed by Walsh et al.12 We discuss some critiques as follows and introduce corresponding solutions.
First, the FI is often criticized because it may be highly associated with the P-value and sample size.27-33 This critique mostly stems from the misconception that the FI aims to replace the P-value; it does not. Instead, the FI only serves as a supplemental measure to aid in the assessment of the robustness of statistical significance. It translates the P-value from a purely statistical concept into a clinically meaningful metric. The FI is intuitive to understand and can serve as a good communication tool, as opposed to being a formal statistical measure. In addition, if researchers want to account for the sample size in the robustness assessment, they may use the “relative” metric fragility quotient (FQ),34 which is simply calculated as the ratio of the FI over the sample size.
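As a worked illustration of the FQ, take the FI of 2 from Table 1 and assume that the sample size in the denominator is the total number of analyzed patients, N = 38 + 37 = 75 (the symbol N and this choice of denominator are assumptions made here for illustration):

```latex
\mathrm{FQ} = \frac{\mathrm{FI}}{N} = \frac{2}{38 + 37} = \frac{2}{75} \approx 0.027
```

That is, changing the event status of roughly 2.7% of the patients would suffice to alter the statistical significance in this example.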
Second, the FI needs to be considered together with other clinical factors, such as the loss to follow-up (LTFU) in the trial being evaluated.35-37 For example, Shochet et al.38 found that 41% of 127 RCTs (mostly in nephrology) had a FI less than the number of LTFU subjects, suggesting that statistical significance could well change had all enrolled individuals been taken into account. Nevertheless, directly comparing the traditional FI with the LTFU has a drawback: the traditional FI fixes the total sample size, an assumption that is violated when lost patients are added back to the analysis. The LTFU-aware FI could help overcome this problem.37
Third, the current literature has no consensus on interpreting the magnitude of the FI.29 Murad et al.39 recently conducted a meta-epidemiologic study consisting of 201 RCTs in cardiology. Based on various definitions of precision from clinical perspectives, they heuristically found that cutoffs of about 19–22 for the FI may be used as a rule of thumb to suggest robust results. Similar practices could be borrowed to interpret the FI in research on obstetrics and gynecology. However, this rule is rough, and its foundations should be further explored on a case-by-case basis. Although precise thresholds are not clearly established, we believe that extremely small FIs, such as 1 or 2, are worthy of reflection. Evidence synthesis platforms, particularly those adopting living systematic reviews (e.g., COVID-NMA and RecMap), offer opportunities for obtaining more trustworthy evidence for decision-makers than relying on single studies with small FIs.40, 41
Fourth, it is unclear whether the event status modifications that change statistical significance in the FI calculation are practically possible.36 For example, for rare events, it is unlikely that a non-event could be modified into an event. In such cases, visualization tools that display all possible event status modifications and the resulting P-values could help with the robustness assessment,20 where researchers may restrict attention to clinically sensible scenarios. To quantitatively address this issue, Baer et al.21 proposed the incidence FIs, i.e., a family of FIs that permit only sufficiently likely event status modifications. In addition, the FI should be interpreted in the context of the clinical question of the trial. For example, in the study by Sigurdardottir et al.19 that had a FI of 2, we should think about whether it is clinically likely that two women who were labeled as continent based on a questionnaire could have been mislabeled and were in fact incontinent. Indeed, the outcome misclassification of 2 women is quite likely in this case.
Fifth, the original FI was proposed only for binary outcomes. As time is often critical for RCTs in many specialties, Bomze et al. extended the original FI to time-to-event data and proposed the survival-inferred FI.42, 43 The survival-inferred FI is defined as the minimum number of reassignments of the best survivors from the intervention group to the control group that lead to a change of statistical significance (e.g., based on the log-rank test). In addition, Caldwell et al.44 recently extended the FI to continuous outcomes. The continuous FI can be calculated with an online tool.45 This metric is well defined when original data (i.e., continuous data for each participant) are available. However, if original data are unavailable, the FI needs to be derived from algorithms that simulate individual continuous data from the sample mean and sample standard deviation. The simulation process may affect the accuracy of the FI value. Of note, both aforementioned methods for survival data and continuous data modify the treatment arms of patients rather than their outcomes; therefore, they do not share the exact spirit of the original FI and may be viewed as FI-like measures.
Sixth, the FI has been criticized as a post hoc measure that cannot be incorporated into the design of the study. This limitation has been addressed by introducing a power-analysis-based sample size calculation that designs the FI to not be small with high probability.46 The procedure is analogous to how traditional sample size calculations design the P-value to be small with high probability. Thus, the FI can be leveraged prospectively, in addition to retrospectively.
Besides the foregoing limitations of the original FI and the corresponding solutions, several variants of the FI have also been proposed. For example, the FI was initially motivated to assess the robustness of statistically significant results, and it can be extended to the reverse FI that assesses the robustness of non-significant results.20, 47 The reverse FI quantifies the minimal event status modifications that change non-significance to significance. Nevertheless, because statistical significance is often the result being pursued, assessing its robustness is usually considered more important than assessing that of non-significance, and the reverse FI may be of less interest than the FI. Figure S3 in the Supplementary Materials gives an example of calculating the reverse FI. Moreover, the FI has been mostly applied to assess the robustness at the commonly used significance level of 0.05, while some other significance levels may be used.1 Calculating the FI at a different significance level is straightforward.20 In addition, we have so far focused on the FI for an individual study; the concept has also been developed for the synthesized results of a standard pairwise meta-analysis or network meta-analysis (NMA), as introduced in detail in the following sections.
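Both the reverse FI and the choice of significance level can be illustrated with one exhaustive search: try every combination of event status changes in the two groups, in order of increasing total size, and stop at the first combination whose significance conclusion differs from the original one. The sketch below is an illustration under stated assumptions, not the algorithm of any cited package. The names `fisher_exact_p` and `flip_index` are hypothetical, Fisher’s exact test is used throughout, and the non-significant dataset (12/40 vs. 18/40 events) is invented for illustration.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    n1, n2, events = a + b, c + d, a + c
    total = comb(n1 + n2, events)

    def prob(x):  # hypergeometric probability of x events in group 1
        return comb(n1, x) * comb(n2, events - x) / total

    p_obs = prob(a)
    support = range(max(0, events - n2), min(n1, events) + 1)
    return min(1.0, sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-7)))


def flip_index(e1, n1, e2, n2, alpha=0.05):
    """Smallest number of event status changes that flips the significance.

    Returns (k, (m1, m2)): k is the FI if the original result is significant
    at level alpha and the reverse FI otherwise; m1 and m2 are the modified
    event counts. Returns (None, (e1, e2)) if no modification flips it.
    """
    significant = fisher_exact_p(e1, n1 - e1, e2, n2 - e2) < alpha
    for k in range(1, n1 + n2 + 1):          # total number of changes
        for d1 in range(-k, k + 1):          # changes made in group 1
            rest = k - abs(d1)               # changes left for group 2
            for d2 in sorted({-rest, rest}):
                m1, m2 = e1 + d1, e2 + d2
                if not (0 <= m1 <= n1 and 0 <= m2 <= n2):
                    continue
                p = fisher_exact_p(m1, n1 - m1, m2, n2 - m2)
                if (p < alpha) != significant:
                    return k, (m1, m2)
    return None, (e1, e2)


# FI of the significant dataset in Table 1 (intervention 21/37, control 31/38).
print(flip_index(21, 37, 31, 38))
# Reverse FI of an invented non-significant dataset (12/40 vs. 18/40 events).
print(flip_index(12, 40, 18, 40))
# The same Table 1 data assessed at a stricter significance level.
print(flip_index(21, 37, 31, 38, alpha=0.01))
```

Because the search allows changes in either group and in either direction, it may return a smaller value than a procedure that modifies only one group. At the stricter level of 0.01, the Table 1 result (P=0.025) is no longer significant, so the last call returns a reverse FI rather than an FI.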
Fragility index for a standard pairwise meta-analysis
As meta-analyses have been increasingly used to synthesize evidence from multiple studies and support decision-making in evidence-based practice,48 similar concerns have also arisen about the robustness of their results. Meta-analyses addressing the same research question could give conflicting conclusions.49, 50 Different choices of study inclusion and exclusion criteria used for a systematic review could lead to entirely opposite meta-analysis results.51 Some systematic reviewers have been found to manipulate the study selection criteria to produce a statistically significant result.11, 52 Although most journals ask systematic reviewers to register their protocols and follow the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) reporting guidelines,53, 54 errors frequently occur when researchers extract data from original articles to perform a meta-analysis,55, 56 and meta-analyses may include studies originally reporting falsified data.57 In a recent large-scale reproducibility study,58 data extraction could not be reproduced in 1762 (17.0%) of 10,386 RCTs. Also, 554 (66.8%) of 829 meta-analyses contained at least one RCT subject to data extraction errors, and the effect direction and statistical significance were changed in a non-ignorable proportion (3.5% to 6.6%) of meta-analyses when such errors were corrected. Therefore, assessing the robustness of meta-analysis results plays a critical role in making reliable decisions.
Atal et al.59 proposed an algorithm to calculate the FI of a pairwise meta-analysis; they also offered an online tool for this calculation.60 Unlike calculating the FI of an individual study, it may be computationally infeasible to enumerate all possible event status modifications for a meta-analysis. Instead, an iterative heuristic process based on the confidence interval (CI) of the synthesized effect estimate was developed to derive the FI of a meta-analysis. Specifically, consider a pairwise meta-analysis comparing an intervention vs. control that has a statistically significant overall effect estimate, with a CI lying above the null value. We aim to move the CI toward the null, so we decrease event counts in the intervention group or increase event counts in the control group. In each iteration, one subject in one study changes status from event to non-event in the intervention group or from non-event to event in the control group, and every such single modification is considered in turn. A meta-analysis is then performed for each of these modified datasets. If none of the resulting CIs covers the null, we choose the event status modification that leads to the CI closest to the null value, i.e., the CI with the smallest lower bound. Starting from this modified dataset, we continue the iterative process until the CI based on a modified dataset covers the null. The FI of this meta-analysis is the total number of event status modifications in the foregoing iterative process. Figure 2 in Atal et al.59 gives a graphical illustration of this process.
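The greedy iteration described above can be sketched as follows. This is a simplified illustration, not the implementation of Atal et al.59 or of any online tool. It assumes an inverse-variance random-effects model for the log RR with the DerSimonian–Laird estimator (rather than Mantel-Haenszel weights), a 0.5 continuity correction for zero cells, and a pooled RR whose CI lies above 1. The function names and the three-study dataset are hypothetical.

```python
import math


def pooled_log_rr(studies, z=1.959964):
    """DerSimonian-Laird random-effects pooled log RR with its 95% CI.

    Each study is a tuple (events_int, n_int, events_ctl, n_ctl).
    """
    y, v = [], []
    for e1, n1, e0, n0 in studies:
        if e1 == 0 or e0 == 0:  # assumed 0.5 continuity correction
            e1, n1, e0, n0 = e1 + 0.5, n1 + 1, e0 + 0.5, n0 + 1
        y.append(math.log((e1 / n1) / (e0 / n0)))
        v.append(1 / e1 - 1 / n1 + 1 / e0 - 1 / n0)
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    half = z * math.sqrt(1 / sum(w_re))
    return mu, mu - half, mu + half


def meta_fragility_index(studies, max_steps=10_000):
    """Greedy FI of a meta-analysis whose pooled RR is significantly above 1."""
    current = [tuple(s) for s in studies]
    if pooled_log_rr(current)[1] <= 0:
        raise ValueError("this sketch assumes the 95% CI lies above RR = 1")
    for step in range(1, max_steps + 1):
        candidates = []
        for i, (e1, n1, e0, n0) in enumerate(current):
            if e1 > 0:   # event -> non-event in the intervention group
                candidates.append(current[:i] + [(e1 - 1, n1, e0, n0)] + current[i + 1:])
            if e0 < n0:  # non-event -> event in the control group
                candidates.append(current[:i] + [(e1, n1, e0 + 1, n0)] + current[i + 1:])
        # Keep the single modification whose CI has the smallest lower bound.
        current = min(candidates, key=lambda s: pooled_log_rr(s)[1])
        if pooled_log_rr(current)[1] <= 0:  # CI now covers the null (log RR = 0)
            return step, current
    raise RuntimeError("significance not altered within max_steps")


# Hypothetical data: (events_int, n_int, events_ctl, n_ctl) for three studies.
studies = [(20, 100, 10, 100), (18, 90, 9, 90), (30, 150, 16, 150)]
fi, modified = meta_fragility_index(studies)
print(fi, modified)
```

Because each iteration commits to the locally best modification, the returned value approximates the minimal number of modifications and is not guaranteed to be the minimum, which is the trade-off of avoiding full enumeration.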
Figure 2. Forest plots of the meta-analysis for the risk of low umbilical cord pH originally reported by Di Mascio et al.61 (A) and the meta-analysis with an event status modification (adding three events to the control group in study 4) for deriving the fragility index (B).
We use the systematic review by Di Mascio et al.61 for demonstrative examples. This review evaluated the effect of delayed vs. immediate pushing in the second stage of labor. One meta-analysis investigated the risk of low umbilical cord pH. The original analysis used the Mantel-Haenszel method, and the effect measure was the RR. We use the same set of approaches (i.e., the random-effects model with the DerSimonian–Laird estimation) and reproduce the meta-analysis (Figure 2A). The overall RR is estimated as 2.00 with a 95% CI (1.30, 3.07), suggesting a statistically significant difference. This meta-analysis has a FI value of 3. By adding three events to the control group in study 4, the overall RR estimate becomes 1.68 with a 95% CI (0.99, 2.85) that covers the null value of 1 (Figure 2B). Figure S4 in the Supplementary Materials demonstrates the FI calculation with the Fragility Index of meta-analyses calculator.60 Of note, although the random-effects model is used, the heterogeneity variance is estimated as 0, so the results would be identical to those from the common-effect model. However, the modified meta-analysis has some heterogeneity; if the common-effect model is used for the FI analysis, the FI value could change.
The FI of 3 seems to suggest that the significance of this meta-analysis is quite fragile. However, the three patients whose outcomes are modified are all in the control group of study 4, where the (unmodified) event probability is 3/85=3.5%. Accounting for all five studies in the meta-analysis, the crude event probability is 30/2253=1.3%. These low event rates suggest that such event status modifications, which turn non-events into events, might not be very likely. Thus, clinical judgment is needed to interpret the FI results. Alternatively, the sufficiently likely construction in Baer et al.21 can directly ensure that the modifications are not unlikely.
Additionally, the outcome modifications were all made in a single study, which inflates the heterogeneity measure I2 of the modified data from 0% to 23%. This happens because the meta-analysis FI (like other fragility measures) makes the outcome modifications that most impact the significance and consequently tends to “pick on” the one study or few studies that are most atypical. This limitation can be addressed by the stochastic fragility measures,62 which ensure that patients from across the data set contribute to altering the significance. Although a script can be written based on the “FragilityTools” package to calculate these measures,24 a user-friendly function does not exist yet.
Another meta-analysis by Di Mascio et al.61 investigated the risk of spontaneous vaginal delivery (Figure S5 in the Supplementary Materials). It has an overall RR estimate of 1.05 with a 95% CI (1.00, 1.10), whose lower bound is just above the null value. This meta-analysis is even more fragile than the previous one, with a FI value of 1. The statistical significance will be lost if one event is removed from the experimental group in study 8. This study originally reports 24 events among 26 participants in the experimental group, leading to an event probability of 92.3%; across all 12 studies, the crude event probability is 2224/2751=80.8%. Having one fewer event in the experimental group may have a relatively low but nonignorable probability. In the FI analysis, the overall RR and its 95% CI were virtually unchanged. This example reflects the problem of using mechanical “bright-line” rules warned against in the American Statistical Association statement on P-values.3 It may be problematic to emphasize a binary decision of statistical significance for nearly the same effect estimates.
Fragility index for a network meta-analysis
NMA extends the traditional pairwise meta-analysis to simultaneously compare multiple treatments. It combines direct and indirect evidence from RCTs with different designs.63-65 Like the issues in traditional pairwise meta-analyses, NMAs on the same topic could differ in definitions of treatment nodes, effect directions, and conclusions of statistical significance, making the robustness of NMA results questionable.66
Xing et al.67 built an algorithm to derive the FI for an NMA based on the concept by Atal et al.59 for a pairwise meta-analysis. The major difficulty in this extension is that an NMA has multiple comparisons, so a greedy algorithm is needed to iterate the process for approximating the minimal event status modifications that alter statistical significance. As different treatment comparisons in an NMA have different P-values and statistical significance, the FI for an NMA is defined for each comparison separately. The algorithm is similar to the aforementioned iterative heuristic process for a pairwise meta-analysis. However, in theory, event status modifications could occur in any treatment group in the NMA, potentially making the process computationally demanding. Xing et al.67 proposed to restrict event status modifications to the two relevant treatment groups only (i.e., only modifying event statuses in treatment groups A and B for the comparison A vs. B). The calculation of the FI for an NMA can be performed with the R package “fragility.”25
We use the systematic review by Nguyen et al.68 for illustrations. It investigated the prevention of vertical transmission of the hepatitis B virus (HBV). The analyses were stratified by the statuses of maternal hepatitis B envelope antigen (HBeAg) and hepatitis B surface antigen (HBsAg) at baseline. We consider the stratum of HBsAg(+) and HBeAg(+) mothers. The treatment network consists of 19 studies investigating a total of 8 treatments (Figure 3).
Figure 3. Treatment network plot of the network meta-analysis by Nguyen et al.68.
The shaded area indicates a three-arm study. The width of the edge between two treatment nodes is proportional to the number of direct comparisons between the two treatments. The node size is proportional to the total sample size of the corresponding treatment. Notation of abbreviations: HBIG, hepatitis B immune globulin; HBV, hepatitis B virus; i.HBIG, HBIG in infants; i.Vaccine, HBV vaccine in infants; i.HBIG+i.Vaccine, HBIG and HBV vaccine in infants; m.HBIG, maternal HBIG; m.LAM, maternal lamivudine; m.LDT, maternal telbivudine; m.TDF, maternal tenofovir disoproxil fumarate.
We first focus on the comparison of i.HBIG+i.Vaccine (HBIG and HBV vaccine in infants) vs. i.Vaccine (HBV vaccine in infants), which has the largest number of RCTs (i.e., 7) that provide direct comparisons. The RR of this comparison from the NMA is 0.52 with a 95% CI (0.30, 0.91), indicating that i.HBIG+i.Vaccine is significantly better than i.Vaccine for preventing vertical transmission of HBV. We use the R package “fragility” to obtain the FI for this comparison; see the code in Figure S6 in the Supplementary Materials. The FI is 2, and the statistical significance is lost if one patient in the i.HBIG+i.Vaccine group in each of studies 10 and 15 experiences an event instead of a non-event. After these event status modifications, the RR becomes 0.59 with a 95% CI (0.34, 1.01). The event probabilities in these two studies are 1/30=3.3% and 3/23=13.0%, respectively. These event status modifications have a relatively low probability, but they are not impossible. Again, the idea of the sufficiently likely construction by Baer et al.21 could be similarly used for NMAs.
To demonstrate a comparison with more robust statistical significance, we consider m.TDF/i.HBIG+i.Vaccine (maternal tenofovir disoproxil fumarate/HBIG and HBV vaccine in infants) vs. placebo/no treatment. The RR from the NMA is 0.02 with a 95% CI of (0.00, 0.07). The FI is 43, which is fairly large; event status modifications involving this many patients are very unlikely to occur, so the significance is unlikely to be altered.
Conclusions
We reviewed the concept of the FI used to assess the robustness of clinical results. The FI can be applied to an individual study, a standard pairwise meta-analysis, and an NMA. We have also introduced several limitations of the FI and some solutions to address these limitations. Because the FI is a relatively new concept, debates on its usefulness are to be expected. In many existing meta-epidemiologic studies that investigated the fragility of RCTs, the FI is usually reported without careful consideration of clinical interpretations, so it is fairly common to see RCTs with very small FI values and claims that many RCTs are fragile. As discussed in this article, some event status modifications that lead to those small FI values may not be practically possible. In future applications of the FI, we recommend that researchers consider the LTFU and the likelihood of data modifications that achieve the loss of statistical significance. In addition, the FI for a pairwise meta-analysis and NMA is a fairly new concept developed in the past three years. We have focused on introducing it with worked examples, but we have not discussed its limitations in depth because of the lack of methodological studies investigating its performance. Many limitations of the FI for individual studies remain relevant to the FI for a pairwise meta-analysis and NMA. Additional factors in meta-analyses, such as between-study heterogeneity, could further complicate the FI calculation and interpretation. More methodological work in these directions is welcome.
Funding:
LL was supported in part by the US National Institutes of Health/National Library of Medicine grant R01 LM012982 and the National Institutes of Health/National Institute of Mental Health grant R03 MH128727. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Conflict of interest: None.
Data availability:
The datasets used in this article are available at https://osf.io/hgrud/.
References
- 1.Benjamin DJ, Berger JO, Johannesson M, et al. Redefine statistical significance. Nature Human Behaviour 2018;2:6–10. [DOI] [PubMed] [Google Scholar]
- 2.Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305–07. [DOI] [PubMed] [Google Scholar]
- 3.Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. The American Statistician 2016;70:129–33. [Google Scholar]
- 4.Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting P values in the biomedical literature, 1990-2015. JAMA 2016;315:1141–48. [DOI] [PubMed] [Google Scholar]
- 5.Abaid LN, Grimes DA, Schulz KF. Reducing publication bias through trial registration. Obstetrics & Gynecology 2007;109:1434–37. [DOI] [PubMed] [Google Scholar]
- 6.Bibens ME, Chong AB, Vassar M. Utilization of clinical trials registries in obstetrics and gynecology systematic reviews. Obstetrics & Gynecology 2016;127:248–53. [DOI] [PubMed] [Google Scholar]
- 7.Lin L, Chu H. Quantifying publication bias in meta-analysis. Biometrics 2018;74:785–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Lin L, Chu H, Murad MH, et al. Empirical comparison of publication bias tests in meta-analysis. Journal of General Internal Medicine 2018;33:1260–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine 2008;358:252–60. [DOI] [PubMed] [Google Scholar]
- 10. Eyding D, Lelgemann M, Grouven U, et al. Reboxetine for acute treatment of major depression: systematic review and meta-analysis of published and unpublished placebo and selective serotonin reuptake inhibitor controlled trials. BMJ 2010;341:c4737.
- 11. Sanchez-Ramos L. Intrapartum amnioinfusion for meconium-stained amniotic fluid: a systematic review of randomised controlled trials. BJOG: An International Journal of Obstetrics & Gynaecology 2008;115:409–10.
- 12. Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. Journal of Clinical Epidemiology 2014;67:622–28.
- 13. Feinstein AR. The unit fragility index: an additional appraisal of "statistical significance" for a contrast of two proportions. Journal of Clinical Epidemiology 1990;43:201–09.
- 14. Walter SD. Statistical significance and fragility criteria for assessing a difference of two proportions. Journal of Clinical Epidemiology 1991;44:1373–78.
- 15. Ho AK. The fragility index for assessing the robustness of the statistically significant results of experimental clinical studies. Journal of General Internal Medicine 2022;37:206–11.
- 16. Pundir J, Achilli C, Bhide P, et al. Risk of foetal harm with letrozole use in fertility treatment: a systematic review and meta-analysis. Human Reproduction Update 2020;27:474–85.
- 17. Pascoal E, Liu M, Lin L, Luketic L. The fragility of statistically significant results in gynaecologic surgery: a systematic review. Journal of Obstetrics and Gynaecology Canada 2022;44:508–14.
- 18. Sanchez-Ramos L, Lin L. Cerclage placement in twin pregnancies with short or dilated cervix does not prevent preterm birth: a fragility index assessment. American Journal of Obstetrics & Gynecology 2022;227:338–39.
- 19. Sigurdardottir T, Steingrimsdottir T, Geirsson RT, Halldorsson TI, Aspelund T, Bø K. Can postpartum pelvic floor muscle training reduce urinary and anal incontinence?: An assessor-blinded randomized controlled trial. American Journal of Obstetrics and Gynecology 2020;222:247.e1–47.e8.
- 20. Lin L. Factors that impact fragility index and their visualizations. Journal of Evaluation in Clinical Practice 2021;27:356–64.
- 21. Baer BR, Gaudino M, Charlson M, Fremes SE, Wells MT. Fragility indices for only sufficiently likely modifications. Proceedings of the National Academy of Sciences 2021;118:e2105254118.
- 22. Fragility Index Calculator. URL: https://clincalc.com/Stats/FragilityIndex.aspx.
- 23. Johnson KW, Rappaport E. R package "fragilityindex" version 0.1.0. URL: https://github.com/kippjohnson/fragilityindex, 2017.
- 24. Baer B, Fremes S, Charlson M, Gaudino M, Wells M. R package "FragilityTools" version 1.0.4. URL: https://github.com/brb225/FragilityTools, 2022.
- 25. Lin L, Chu H. R package "fragility" version 1.3. URL: https://CRAN.R-project.org/package=fragility, 2022.
- 26. Lin L, Chu H. Assessing and visualizing fragility of clinical results with binary outcomes in R using the fragility package. PLOS ONE 2022;17:e0268754.
- 27. Carter RE, McKie PM, Storlie CB. The fragility index: a P-value in sheep's clothing? European Heart Journal 2016;38:346–48.
- 28. Condon TM, Sexton RW, Wells AJ, To M-S. The weakness of fragility index exposed in an analysis of the traumatic brain injury management guidelines: a meta-epidemiological and simulation study. PLOS ONE 2020;15:e0237879.
- 29. Niforatos JD, Zheutlin AR, Chaitoff A, Pescatore RM. The fragility index of practice changing clinical trials is low and highly correlated with P-values. Journal of Clinical Epidemiology 2020;119:140–42.
- 30. Porco TC, Lietman TM. A fragility index: handle with care. Ophthalmology 2018;125:649.
- 31. Li J, O’Connell PJ. The fragility index: the P-value by another name? Transplantation 2022;106:239–40.
- 32. Potter GE. Dismantling the Fragility Index: a demonstration of statistical reasoning. Statistics in Medicine 2020;39:3720–31.
- 33. Schröder A, Muensterer OJ, Oetzmann von Sochaczewski C. Paediatric surgical trials, their fragility index, and why to avoid using it to evaluate results. Pediatric Surgery International 2022;38:1057–66.
- 34. Ahmed W, Fowler RA, McCredie VA. Does sample size matter when interpreting the fragility index? Critical Care Medicine 2016;44:e1142–e43.
- 35. Acuna SA, Sue-Chue-Lam C, Dossa F. The fragility index—P values reimagined, flaws and all. JAMA Surgery 2019;154:674–74.
- 36. Walter SD, Thabane L, Briel M. The fragility of trial results involves more than statistical significance alone. Journal of Clinical Epidemiology 2020;124:34–41.
- 37. Baer BR, Fremes SE, Gaudino M, Charlson M, Wells MT. On clinical trial fragility due to patients lost to follow up. BMC Medical Research Methodology 2021;21:254.
- 38. Shochet LR, Kerr PG, Polkinghorne KR. The fragility of significant results underscores the need of larger randomized controlled trials in nephrology. Kidney International 2017;92:1469–75.
- 39. Murad MH, Kara Balla A, Khan MS, Shaikh A, Saadi S, Wang Z. Thresholds for interpreting the fragility index derived from sample of randomised controlled trials in cardiology: a meta-epidemiologic study. BMJ Evidence-Based Medicine 2022:In press.
- 40. Boutron I, Chaimani A, Meerpohl JJ, et al. The COVID-NMA project: building an evidence ecosystem for the COVID-19 pandemic. Annals of Internal Medicine 2020;173:1015–17.
- 41. Lotfi T, Stevens A, Akl EA, et al. Getting trustworthy guidelines into the hands of decision-makers and supporting their consideration of contextual factors for implementation globally: recommendation mapping of COVID-19 guidelines. Journal of Clinical Epidemiology 2021;135:182–86.
- 42. Bomze D, Meirson T. A critique of the fragility index. The Lancet Oncology 2019;20:e551.
- 43. Bomze D, Asher N, Hasan Ali O, et al. Survival-inferred fragility index of phase 3 clinical trials evaluating immune checkpoint inhibitors. JAMA Network Open 2020;3:e2017675.
- 44. Caldwell J-ME, Youssefzadeh K, Limpisvasti O. A method for calculating the fragility index of continuous outcomes. Journal of Clinical Epidemiology 2021;136:20–25.
- 45. Continuous Fragility Index Calculator. URL: https://jmcaldwell.shinyapps.io/CFIApp/.
- 46. Baer BR, Gaudino M, Fremes SE, Charlson M, Wells MT. The fragility index can be used for sample size calculations in clinical trials. Journal of Clinical Epidemiology 2021;139:199–209.
- 47. Khan MS, Fonarow GC, Friede T, et al. Application of the reverse fragility index to statistically nonsignificant randomized clinical trial results. JAMA Network Open 2020;3:e2012469.
- 48. Niforatos JD, Weaver M, Johansen ME. Assessment of publication trends of systematic reviews and randomized clinical trials, 1995 to 2017. JAMA Internal Medicine 2019;179:1593–94.
- 49. Ioannidis JPA. The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. The Milbank Quarterly 2016;94:485–514.
- 50. Hacke C, Nunan D. Discrepancies in meta-analyses answering the same clinical question were hard to explain: a meta-epidemiological study. Journal of Clinical Epidemiology 2020;119:47–56.
- 51. Palpacuer C, Hammas K, Duprez R, Laviolle B, Ioannidis JPA, Naudet F. Vibration of effects from diverse inclusion/exclusion criteria and analytical choices: 9216 different ways to perform an indirect comparison meta-analysis. BMC Medicine 2019;17:174.
- 52. Xu H, Hofmeyr J, Roy C, Fraser W. Intrapartum amnioinfusion for meconium-stained amniotic fluid: a systematic review of randomised controlled trials. BJOG: An International Journal of Obstetrics & Gynaecology 2007;114:383–90.
- 53. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLOS Medicine 2009;6:e1000097.
- 54. Page MJ, Moher D, Bossuyt PM, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ 2021;372:n160.
- 55. Jones AP, Remmington T, Williamson PR, Ashby D, Smyth RL. High prevalence but low impact of data extraction and reporting errors were found in Cochrane systematic reviews. Journal of Clinical Epidemiology 2005;58:741–42.
- 56. Gøtzsche PC, Hróbjartsson A, Marić K, Tendal B. Data extraction errors in meta-analyses that use standardized mean differences. JAMA 2007;298:430–37.
- 57. Garmendia CA, Nassar Gorra L, Rodriguez AL, Trepka MJ, Veledar E, Madhivanan P. Evaluation of the inclusion of studies identified by the FDA as having falsified data in the results of meta-analyses: the example of the apixaban trials. JAMA Internal Medicine 2019;179:582–84.
- 58. Xu C, Yu T, Furuya-Kanamori L, et al. Validity of data extraction in evidence synthesis practice of adverse events: reproducibility study. BMJ 2022;377:e069155.
- 59. Atal I, Porcher R, Boutron I, Ravaud P. The statistical significance of meta-analyses is frequently fragile: definition of a fragility index for meta-analyses. Journal of Clinical Epidemiology 2019;111:32–40.
- 60. Fragility Index of meta-analyses. URL: https://clinicalepidemio.fr/fragility_ma/.
- 61. Di Mascio D, Saccone G, Bellussi F, et al. Delayed versus immediate pushing in the second stage of labor in women with neuraxial analgesia: a systematic review and meta-analysis of randomized controlled trials. American Journal of Obstetrics & Gynecology 2020;223:189–203.
- 62. Baer BR, Fremes SE, Charlson M, Gaudino M, Wells MT. Fragility measures for typical cases. arXiv preprint arXiv:2201.07093, 2022.
- 63. Cipriani A, Higgins JPT, Geddes JR, Salanti G. Conceptual and technical challenges in network meta-analysis. Annals of Internal Medicine 2013;159:130–37.
- 64. Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Statistics in Medicine 2004;23:3105–24.
- 65. Zhang J, Carlin BP, Neaton JD, et al. Network meta-analysis of randomized clinical trials: reporting the proper summaries. Clinical Trials 2014;11:246–62.
- 66. Naudet F, Schuit E, Ioannidis JPA. Overlapping network meta-analyses on the same topic: survey of published studies. International Journal of Epidemiology 2017;46:1999–2008.
- 67. Xing A, Chu H, Lin L. Fragility index of network meta-analysis with application to smoking cessation data. Journal of Clinical Epidemiology 2020;127:29–39.
- 68. Nguyen HT, Thavorncharoensap M, Phung TL, et al. Comparative efficacy and safety of pharmacologic interventions to prevent mother-to-child transmission of hepatitis B virus: a systematic review and network meta-analysis. American Journal of Obstetrics & Gynecology 2022;227:163–72.
Associated Data
Data Availability Statement
The datasets used in this article are available at https://osf.io/hgrud/.


