Abstract
Objectives:
Network meta-analysis (NMA) is frequently used to synthesize evidence for multiple treatment comparisons, but its complexity may affect the robustness (or fragility) of the results. The fragility index (FI) was recently proposed to assess the fragility of results from clinical studies and from pairwise meta-analyses. We extend the FI to NMAs with binary outcomes.
Methods:
We define the FI for each treatment comparison in NMAs. It quantifies the minimal number of event status modifications needed to alter the comparison's statistical significance. We introduce an algorithm to derive the FI, along with visualizations of the derivation process. A worked example of smoking cessation data is used to illustrate the proposed methods.
Results:
Some treatment comparisons had small FIs; their significance (or non-significance) could be altered by modifying a few events' status. The FIs were related to various factors in the original NMA, such as P-values, event counts, and sample sizes. After modifying event status, treatment ranking measures also changed to different extents.
Conclusion:
Many NMAs include insufficiently compared treatments, small event counts, or small sample sizes; their results are potentially fragile. The FI offers a useful tool to evaluate treatment comparisons' robustness and reliability.
Keywords: fragility index, network meta-analysis, replicability, statistical significance, systematic review, treatment comparison
1. Introduction
Network meta-analysis (NMA) has been widely used as an evidence synthesis tool to compare multiple treatments [1–3]. It simultaneously combines both direct and indirect evidence of treatment comparisons and thus may provide more reliable treatment effect estimates than conventional pairwise meta-analysis [4–11]. Even if some treatments may not be directly compared in head-to-head studies, an NMA permits coherent judgement of clinical studies among all available treatments connected in one network by borrowing evidence from indirect comparisons [12].
One of the fundamental purposes of performing NMAs (as well as pairwise meta-analyses) is to produce more precise results, which are more likely to show statistically significant treatment effects than individual studies [13]. The results' significance is a critical factor in the process of rating evidence and making decisions. It is typically judged by the P-value; traditionally, P-values < 0.05 indicate significant results [14, 15]. However, the threshold of 0.05 may be arbitrary, and drawing conclusions based solely on P-values has long been criticized, as P-values are often misinterpreted as an effect measure [16–19]. In the recent literature, there are ongoing debates on whether P-values should be abandoned or whether the threshold of statistical significance should be lowered [20–26].
Because P-values are often misused and statistically significant results may be more likely to be reported than non-significant ones, individual studies are subject to publication and reporting bias in many scientific areas, which may further raise concerns about research reproducibility and replicability [27–30]. Such concerns are not limited to individual studies; they also apply to meta-analyses. Meta-analyses have been produced in massive numbers, yet many reach conflicting results even though they address the same research question [31, 32]. The discrepancies are partly caused by differences in study selection criteria, subgrouping criteria, treatment definitions, etc. [33–35]. Prospective meta-analyses have recently been advocated; they specify important research procedures (e.g., hypotheses, study selection criteria, and analysis methods) prior to obtaining the results of meta-analysis questions [36]. Compared with traditional retrospective ones, prospective meta-analyses can reduce research waste and bias, and increase transparency and accountability [37, 38]. Similar to clinical trial registrations (e.g., ClinicalTrials.gov), meta-analysis protocols can be registered in PROSPERO (https://www.crd.york.ac.uk/prospero/), where protocols' key features are permanently recorded and maintained. Such registered meta-analyses are recommended by the PRISMA statement (as well as its extension to NMA) [39, 40], and may be of higher quality than non-registered meta-analyses [41–43].
To supplement the use of the P-value, Walsh et al. [44] proposed the fragility index (FI) for individual studies with binary outcomes, which helps researchers appraise the robustness (or fragility) of statistical significance. Early work on fragility existed in the 1990s (e.g., Feinstein [45] and Walter [46]); with the growing concerns about research replicability in recent years, such work has regained many researchers' attention. Specifically, for an individual study with a significant result, the FI is defined as the minimal number of events needed to alter the result to be non-significant by changing certain subjects' event status (i.e., from non-event to event or from event to non-event). Similarly, the FI can be defined for non-significant results by altering them to be significant. Smaller FI values indicate more fragile results. The FI has been investigated in randomized controlled trials in many fields [47–51]. For example, trials focusing on mortality in critical care have been found to frequently rely on a small number of patients [52]. Reporting the FI helped examine results and offer additional treatment recommendations for trials on chronic heart failure [47].
Recently, Atal et al. [53] extended the FI of individual studies to pairwise meta-analysis. It evaluates the robustness of a meta-analysis by modifying the event counts in both treatment arms across studies. Based on a survey of 906 meta-analyses, they found that the significance of many meta-analyses depended on a few patients’ outcomes.
With the rapid development of NMAs, the FI is also needed to assess the fragility of NMA results. This metric might be even more critical in NMAs, because NMAs may include some insufficiently compared treatments [5], potentially making the results fragile. The calculation of the FI for an NMA is more challenging than that for a pairwise meta-analysis, because event modifications in more treatment arms need to be considered to assess their impact on the overall results' significance. We will introduce the details of deriving the FI for an NMA, illustrated by an example of smoking cessation data.
2. Methods
2.1. Definition
The FI of an individual study or a pairwise meta-analysis is a single metric, because only one parameter (either the individual study's treatment effect or the meta-analytic overall effect) is of interest, and the change of its significance determines the FI. However, an NMA compares multiple treatments, and their pairwise comparisons are simultaneously of interest. We propose to assess the fragility of the NMA for each treatment comparison; that is, the FI is a comparison-specific metric under the NMA setting. Formally, the FI for each treatment comparison in an NMA is the minimal number of event status modifications needed to alter this comparison's statistical significance.
2.2. Computational considerations
Because the implementation of the NMA is more computationally demanding than that of an individual study or a pairwise meta-analysis, deriving the FI in the NMA needs to account for computational capacity. We consider a heuristic iterative process for deriving the FI in the NMA, similar to that used by Atal et al. [53] for pairwise meta-analysis. It is based on a specific comparison's confidence interval (CI) via iterative steps (detailed later), instead of on the P-values of all possible event status modifications via an exhaustive search. The exhaustive search is feasible in an individual study; however, in the NMA, it requires enumerating all possible event status modifications in all studies and all arms to identify the minimal modifications. Although the exhaustive search is ideal, it is very time-consuming and impractical for NMAs.
To illustrate this, we tentatively assume that a single event status modification has C cases. The exhaustive search considers up to C + C^2 + ⋯ + C^FI cases of modifications, and performs the NMA with the modified data in each case, so the computation time increases exponentially with the FI value. On the other hand, using the iterative process, the cases of event status modifications in one iteration are based on the modifications determined in the previous iterations, so we only need to consider up to C × FI cases, and the computation time increases linearly with the FI value.
In addition, ideally, we may modify event status of a subject in any study’s any treatment arm to derive a specific comparison’s FI. This approach, however, may be computationally expensive, especially when the NMA contains many treatments. Because the FI in the NMA is a comparison-specific metric, it might not be intuitive to modify event status in treatment arms other than the two treatments associated with the comparison. Therefore, we propose to restrict the event status modifications to the two associated treatment arms when deriving the FI for a specific comparison.
For example, suppose the NMA contains K treatments, and treatment k is included in Nk studies (k = 1, …, K). Without the aforementioned restriction, a single iteration considers an event status modification in up to N1 + N2 + ⋯ + NK treatment groups. With the restriction, for the comparison of treatment 2 vs. 1, a single iteration only needs to consider up to N1 + N2 scenarios, much less than N1 + N2 + ⋯ + NK. Thus, the restriction may substantially improve computational efficiency. As a sensitivity analysis, we will also consider the FI derivation without the restriction in our illustrative example.
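The two counting arguments above can be made concrete with a short sketch (illustrative Python; the values of C, the fragility index, and the per-treatment study counts are hypothetical, chosen only to show the growth rates, not taken from any dataset):

```python
# Bookkeeping for the cost of deriving the FI; all numbers are hypothetical.
C = 10   # candidate single-event modifications available in one iteration
fi = 5   # fragility index that the search eventually reports

# Exhaustive search: every set of up to `fi` single modifications must be
# refit, so the number of NMA fits grows exponentially with the FI.
exhaustive_fits = sum(C**i for i in range(1, fi + 1))

# Iterative (greedy) search: each iteration builds on the single best
# modification from the previous one, so the work grows linearly.
iterative_fits = C * fi

# Restriction to the two arms of the comparison of interest: with K = 4
# treatments and N_k studies including treatment k, an unrestricted
# iteration scans every arm, while the restricted one scans only N1 + N2.
N = {1: 19, 2: 6, 3: 19, 4: 6}         # hypothetical study counts
unrestricted_per_iter = sum(N.values())
restricted_per_iter = N[1] + N[2]      # comparison of treatment 2 vs. 1

print(exhaustive_fits, iterative_fits)             # -> 111110 50
print(unrestricted_per_iter, restricted_per_iter)  # -> 50 25
```

Even for these modest sizes, the exhaustive count dwarfs the iterative one, which is why the greedy process is preferred despite being heuristic.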
2.3. Implementation
For the NMA with a binary outcome, the treatment effects may be measured using the odds ratio (OR), risk ratio, risk difference, etc., which may lead to different FI values. Instead of trying different measures to derive the FI, we suggest that researchers choose the measure pre-specified in the NMA protocol to avoid "fragility-hacking."
For the original NMA and the NMA with modified event status, we perform the analyses using the frequentist method by Rücker [54] via the R package “netmeta” (version 1.0–1). Bayesian methods may be alternatives to implement the contrast- or arm-based NMA, and thus derive the FI based on the comparisons’ credible intervals [3, 7, 8, 55]. However, they may take considerable time for multiple Markov chain Monte Carlo (MCMC) procedures, so they are not used in this article.
We handle zero counts for each comparison within each study according to the Cochrane Handbook [56]. If both treatment arms in a comparison have zero event or non-event counts, then their comparison is removed from the analysis. If only a single arm contains a zero count, then the continuity correction of 0.5 is added to all four data cells in the associated 2×2 table. Of note, in the presence of zero counts, the above removals and/or corrections are applied only to successfully produce the NMA results; they are not used for counting the event status modifications. For example, although a zero event count is corrected as 0.5 in the analysis, the change is still counted as 1, instead of 0.5, when one non-event is modified as event in this arm.
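A minimal sketch of this zero-count handling for a single 2×2 table (illustrative Python; the function name and return convention are ours, not part of any package):

```python
def handle_zero_counts(events_a, total_a, events_b, total_b):
    """Apply the Cochrane-style zero-count handling described above to one
    2x2 table (one pairwise comparison within one study).

    Returns None if the comparison must be removed from the analysis,
    otherwise the four (possibly continuity-corrected) cell counts:
    (events_a, nonevents_a, events_b, nonevents_b).
    """
    nonevents_a = total_a - events_a
    nonevents_b = total_b - events_b

    # Both arms have zero events, or both have zero non-events:
    # the comparison carries no information and is removed.
    if (events_a == 0 and events_b == 0) or (nonevents_a == 0 and nonevents_b == 0):
        return None

    # A single zero cell: add the continuity correction 0.5 to all four cells.
    if 0 in (events_a, nonevents_a, events_b, nonevents_b):
        return (events_a + 0.5, nonevents_a + 0.5, events_b + 0.5, nonevents_b + 0.5)

    return (events_a, nonevents_a, events_b, nonevents_b)
```

As stated above, the 0.5 correction enters only the analysis; modifying one event in a zero-count arm still contributes 1, not 0.5, to the FI.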
2.4. Originally significant treatment comparisons
We detail the iterative algorithm for deriving the FI in the NMA as follows. Recall K denotes the number of treatments in the NMA. Thus, the NMA contains K(K – 1)/2 possible treatment comparisons; the FI is derived for each comparison.
For a comparison that is significant in the original NMA, its FI describes the minimal modifications that make the comparison non-significant. Figure 1 illustrates the steps for deriving the FI of the comparison A vs. B, whose pooled OR is significantly larger than the null value 1. The situations of OR significantly less than 1 or other effect measures are similar. Suppose NA and NB studies include treatments A and B, respectively. Because the OR of A vs. B is larger than 1, we may reduce the event counts in arm A or increase those in arm B to shrink the OR toward 1. Therefore, initiated from the original NMA, in each iteration of modifying a single event status, we have NA cases of reducing by one event in arm A and NB cases of increasing by one event in arm B. For each case, we recalculate the pooled OR and its CI using the NMA with modified data. All cases lead to (NA + NB) CIs. If all these CIs are still above 1, we select the one whose lower bound is closest to 1, in an effort to minimize the steps for altering the comparison’s significance. The next iteration is based on the modified data in the previous iterations. If at least one CI covers 1, we terminate the algorithm, and the FI is the number of iterations.
Figure 1. Illustration of deriving the fragility index of an originally significant treatment comparison in a network meta-analysis.
The associated treatments are A and B, and X represents any treatment arm other than A and B.
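The iteration illustrated in Figure 1 can be sketched as follows (a minimal illustration in Python for the case of a pooled OR significantly above 1; `fit_nma` is a placeholder for refitting the NMA, e.g., via an R routine, and the data layout is our assumption):

```python
def fragility_index_significant(data, arm_a_rows, arm_b_rows, fit_nma, max_iter=1000):
    """Greedy FI derivation for a comparison A vs. B whose pooled OR is
    significantly above 1, mirroring the steps described for Figure 1.

    `data` maps (study, arm) -> [event_count, sample_size]; `arm_a_rows`
    and `arm_b_rows` list the keys holding treatments A and B.
    `fit_nma(data)` must refit the NMA and return the 95% CI of the
    pooled OR of A vs. B as (lower, upper).
    """
    for iteration in range(1, max_iter + 1):
        # NA cases: remove one event from an A arm; NB cases: add one
        # event to a B arm. Both shrink the OR of A vs. B toward 1.
        candidates = [(k, -1) for k in arm_a_rows if data[k][0] > 0]
        candidates += [(k, +1) for k in arm_b_rows if data[k][0] < data[k][1]]
        if not candidates:
            return None  # no admissible modification remains

        best = None
        for key, delta in candidates:
            data[key][0] += delta           # try one single-event change
            lower, _ = fit_nma(data)
            data[key][0] -= delta           # undo the trial change
            if lower <= 1:                  # CI now covers 1: significance altered
                return iteration
            if best is None or lower < best[0]:
                best = (lower, key, delta)  # lower bound closest to 1
        _, key, delta = best
        data[key][0] += delta               # commit the best modification
    return None
```

The originally non-significant case (Section 2.5) follows the same skeleton with the modification directions and stopping rule reversed.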
2.5. Originally non-significant treatment comparisons
If A vs. B in the original NMA is non-significant, its CI covers 1. It may be altered to be significant in two directions, i.e., above or below 1. In practice, clinicians may have an empirical hypothesis or experience about the comparison’s direction; the FI may be derived by modifying event status to move the CI toward such a direction.
Without experts’ opinion, the direction may be determined based on the point estimate of the effect size in the original NMA. For example, suppose the estimated OR is larger than 1. To obtain a significant comparison, we may expect minimal event status modifications if the CI would be moved above 1. This can be achieved by conducting iterations of increasing events in arm A or reducing events in arm B.
In addition, an originally non-significant comparison may never achieve significance (e.g., when the total sample size is small). In such cases, we may define the FI as not available.
2.6. Visualizing fragility
We propose several methods to visualize the fragility of an NMA. We may plot the pooled effect size estimates with CIs, or their P-values, against the iterative steps of deriving the FI; these plots visualize how the significance is altered. To demonstrate the detailed process of event status modifications, we may construct step plots that present the modified event or non-event in each iteration. Furthermore, clinicians are frequently interested in treatment ranking [57], which can be quantified for each treatment by a measure called the P-score [58]. This treatment ranking measure lies within 0–1; a larger value implies a better treatment. We may plot this measure during the process of modifying event status to show the changes of treatment ranking.
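The ranking measure can be computed as the average probability that a treatment is better than each competitor, based on the estimated comparisons and their standard errors; a hedged sketch (illustrative Python; the matrix layout is our assumption, not the interface of any NMA package):

```python
from math import erf, sqrt

def p_scores(effects, ses):
    """Ranking measures (P-scores) from pairwise effect estimates.

    `effects[k][l]` is the estimated effect of treatment k vs. l on a
    scale where positive values favor k (e.g., log odds ratio of a
    beneficial outcome), and `ses[k][l]` is its standard error.
    """
    def phi(z):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    K = len(effects)
    scores = []
    for k in range(K):
        # Mean probability that treatment k beats each competitor.
        probs = [phi(effects[k][l] / ses[k][l]) for l in range(K) if l != k]
        scores.append(sum(probs) / (K - 1))
    return scores
```

Tracking these scores across the modification steps yields the ranking plots described above.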
2.7. Example
We use the NMA dataset on smoking cessation for illustrations. This dataset was originally collected by Hasselblad [59] and re-analyzed by Lu and Ades [60]. It contains 24 studies and compares four treatments. The treatments are indexed as 1) no contact; 2) self-help; 3) individual counseling; and 4) group counseling. They lead to six pairwise comparisons, measured by ORs. Two studies are three-armed, and the remaining 22 studies are two-armed. We derive the FI for each comparison. In our primary analysis, the significance level is α=0.05, corresponding to the confidence level 95% of CIs. To investigate the FI at different significance levels, we also consider α=0.01 and 0.1. Moreover, the FI of a treatment comparison is derived by restricting event status modifications to the associated two treatment arms in our primary analysis; as a sensitivity analysis, we also derive the FI without such a restriction.
The Supplemental File contains the R code for all analyses. The FI in the NMA can also be derived using our R package "fragility."
3. Results
3.1. Original network meta-analysis
In the original NMA, two comparisons (3 vs. 1 and 4 vs. 1) were statistically significant, while the other four comparisons were not. Treatment 4 was likely the best, with the highest treatment ranking measure (0.84), followed by treatment 3 (0.71) and treatment 2 (0.41). Treatment 1 was likely the worst, because its treatment ranking measure 0.05 was close to 0.
3.2. Fragility index
Table 1 gives each comparison’s FI and ORs with 95% CIs before and after event status modifications for deriving the FI. The comparison 3 vs. 1 was the least fragile (FI=32), while 4 vs. 1 was the most fragile (FI=3). In other words, 32 subjects’ event status had to be modified to alter the significance of 3 vs. 1, but only 3 subjects’ event status modifications were sufficient to make 4 vs. 1 no longer significant.
Table 1. Odds ratios with 95% confidence intervals (in square brackets) of treatment comparisons in the original smoking cessation data and those after event status modifications for deriving the fragility index for each comparison.
Results in italic indicate the target treatment comparison for which the fragility index is derived. Treatment indexes: 1) no contact; 2) self-help; 3) individual counseling; and 4) group counseling.
All cells are odds ratios with 95% confidence intervals. Each row corresponds to the fragility index derivation for one comparison: the "Original data" column gives that comparison's odds ratio in the original NMA, and the last six columns give the odds ratios of all six comparisons after the event status modifications used to derive the row comparison's fragility index.

| Comparison | Fragility index | Original data | 2 vs. 1 | 3 vs. 1 | 3 vs. 2 | 4 vs. 1 | 4 vs. 2 | 4 vs. 3 |
|---|---|---|---|---|---|---|---|---|
| 2 vs. 1 | 18 | 1.52 [0.74, 3.13] | *2.15 [1.01, 4.60]* | 2.07 [1.32, 3.21] | 0.96 [0.44, 2.08] | 2.30 [1.01, 5.26] | 1.07 [0.43, 2.63] | 1.11 [0.51, 2.44] |
| 3 vs. 1 | 32 | 2.07 [1.34, 3.18] | 1.36 [0.62, 2.98] | *1.58 [0.99, 2.50]* | 1.16 [0.52, 2.58] | 2.07 [0.87, 4.92] | 1.52 [0.59, 3.94] | 1.31 [0.58, 2.99] |
| 3 vs. 2 | 19 | 1.36 [0.65, 2.87] | 1.00 [0.46, 2.16] | 2.25 [1.44, 3.50] | *2.25 [1.01, 5.01]* | 3.22 [1.39, 7.53] | 3.22 [1.20, 8.63] | 1.43 [0.65, 3.19] |
| 4 vs. 1 | 3 | 2.46 [1.09, 5.52] | 1.49 [0.71, 3.09] | 2.09 [1.35, 3.22] | 1.40 [0.66, 2.99] | *2.24 [0.97, 5.15]* | 1.51 [0.61, 3.72] | 1.07 [0.48, 2.38] |
| 4 vs. 2 | 12 | 1.61 [0.67, 3.93] | 1.20 [0.56, 2.53] | 2.16 [1.39, 3.35] | 1.81 [0.83, 3.93] | 3.13 [1.37, 7.18] | *2.62 [1.02, 6.71]* | 1.45 [0.66, 3.18] |
| 4 vs. 3 | 23 | 1.19 [0.55, 2.56] | 1.74 [0.82, 3.70] | 1.98 [1.26, 3.09] | 1.13 [0.52, 2.45] | 4.41 [1.92, 10.11] | 2.53 [1.01, 6.30] | *2.23 [1.02, 4.90]* |
The small FI of 4 vs. 1 was likely because the treatment 4 arms included far fewer events than other arms across studies (Table 2). Modifying a single event status in this treatment group might lead to a considerable change in the estimated OR. In addition, although 4 vs. 1 had a larger OR (2.46) than other comparisons, its CI was relatively wide, with a lower bound (1.09) close to 1. This also explained why its FI was small.
Table 2. Treatment-specific total event counts and treatment ranking measures (shown in square brackets) in the original smoking cessation data and those after event status modifications for deriving the fragility index for each comparison.
Results in italic correspond to treatment arms associated with the comparison for which the fragility index is derived. Treatment indexes: 1) no contact; 2) self-help; 3) individual counseling; and 4) group counseling.
| Comparison with event status modifications | Treatment arm 1 | Treatment arm 2 | Treatment arm 3 | Treatment arm 4 |
|---|---|---|---|---|
| Original data | 606 [0.05] | 155 [0.41] | 1209 [0.71] | 103 [0.84] |
| 2 vs. 1 (FI=18) | 600 [0.02] | 167 [0.65] | 1209 [0.62] | 103 [0.71] |
| 3 vs. 1 (FI=32) | 633 [0.10] | 155 [0.45] | 1204 [0.62] | 103 [0.83] |
| 3 vs. 2 (FI=19) | 606 [0.17] | 141 [0.18] | 1214 [0.72] | 103 [0.93] |
| 4 vs. 1 (FI=3) | 606 [0.06] | 155 [0.41] | 1209 [0.76] | 100 [0.78] |
| 4 vs. 2 (FI=12) | 606 [0.11] | 144 [0.26] | 1209 [0.70] | 104 [0.93] |
| 4 vs. 3 (FI=23) | 606 [0.02] | 155 [0.44] | 1209 [0.55] | 126 [0.98] |
3.3. Step-by-step changes
Figure 2 presents the estimated ORs with 95% CIs of all six comparisons during the iterations for deriving the FI, and Supplemental Figure S1 presents the changes of P-values. As the event status was modified for deriving the FI of 3 vs. 1, the significance of 4 vs. 1 was also altered (Figure 2B). When deriving the FI of 3 vs. 2, although the event status modifications were only in treatment arms 2 and 3, the OR of 4 vs. 1 and its CI noticeably increased (Figure 2C). These findings supported the notion that 4 vs. 1 was fragile.
Figure 2. Estimated odds ratios and their 95% confidence intervals of six treatment comparisons in the smoking cessation data during the iterations for deriving the fragility index for each comparison.
The odds ratios are presented on a logarithmic scale. The fragility index for each comparison is in bold on the horizontal axis, and the bold vertical lines represent treatment comparisons for which the fragility indexes are derived.
Figure 3 presents the specific study and treatment arm whose event status was modified in each iteration. Because each iteration modified only a single event status, the total event counts of the associated treatment arms changed by 0 or 1 at each step. The event status modifications usually occurred in a few studies. For example, treatment 4 was compared in six studies; only events in studies 1 and 24 were modified for deriving the FIs for the three comparisons involving treatment 4.
Figure 3. Event status modifications in the associated treatment arms across studies in the smoking cessation data during the iterations for deriving the fragility index for each comparison.
One event status is modified in each iteration. The number shown at each event status modification gives the specific study corresponding to the modified event. An asterisk indicates that the event status modification occurs in the same study’s treatment arm as in the previous iteration.
Figure 4 shows the changes of treatment ranking during the process of deriving the FIs. After altering the non-significance of 2 vs. 1, the order of the treatment ranking measures of treatments 2 and 3 switched, while treatments 1 and 4 continued to be the worst and best treatments, respectively (Figure 4A). When deriving the FIs for the other five comparisons, the order of treatment ranking measures did not change, while the measures of some treatments became closer. For example, to alter the non-significance of 3 vs. 2, the treatment ranking measure of treatment 1 increased from 0.05 to 0.17, indicating better performance, while that of treatment 2 decreased from 0.41 to 0.18, indicating worse performance (Figure 4C).
Figure 4. Changes of treatment ranking measures in the smoking cessation data during the iterations for deriving the fragility index for each comparison.
The lines representing associated treatment arms of each comparison are in bold.
3.4. Significance levels of 0.01 and 0.1
Supplemental Table S1 presents the FIs at α=0.01 and 0.1. When α decreased from 0.05 to 0.01, only 3 vs. 1 was significant, and its FI decreased from 32 to 15; 4 vs. 1 was still the most fragile, with FI=5. When α increased from 0.05 to 0.1, 3 vs. 1 and 4 vs. 1 continued to be significant, and their FIs were larger than the corresponding FIs at α=0.05. The FIs of the other four non-significant comparisons became smaller.
3.5. Sensitivity analysis
In the primary analysis, the FI of a comparison was derived by restricting event status modifications to the associated two treatment arms. The sensitivity analysis permitted modifications in all treatment arms. Supplemental Figure S2 presents the step plots. The FIs of all comparisons remained the same, except that the FI of 2 vs. 1 slightly decreased from 18 to 17. However, the computation time was over three times that of the primary analysis.
4. Discussion
4.1. Main findings
This article proposed the FI in NMAs. We introduced an efficient iterative process to derive the FI and several visualization methods, illustrated by a worked example of the smoking cessation data. The statistical significance of some treatment comparisons in the NMA was fragile, which could be altered by modifying a few events’ status.
4.2. Strengths and limitations
First, our proposed FI adapts to the setting of NMAs, where researchers are interested in multiple treatment comparisons, differing from the setting of individual studies or pairwise meta-analyses with only a single comparison of interest. Therefore, the proposed FI is defined as a comparison-specific metric, rather than for the whole NMA. Second, the implementation of NMAs is more computationally demanding than that of individual studies or pairwise meta-analyses; our proposed iterative algorithm derives the FI efficiently. Third, researchers may need information beyond the FI values to assess treatment comparisons' fragility, and the proposed visualization methods may serve as additional tools to aid this assessment. For example, the plots of step-by-step event status modifications for deriving the FIs help researchers identify the sensitive modifications that can alter conclusions. The plots of treatment ranking measures show the changes of each treatment's performance during the modifications. Clinicians may incorporate empirical evidence to assess whether such changes are likely in practice. If these changes are unlikely, then it may be difficult to alter the significance in practice, indicating robust comparisons.
Our study has some limitations. First, the FI in NMAs shares the drawbacks of the FI for individual studies and pairwise meta-analyses. In some cases, the FI may be highly correlated with the P-value and sample size [61–63]; it may not provide much more information than these well-reported metrics. The concept of the FI considered in this article is restricted to studies with binary outcomes, and it might not be suitable for time-to-event data [64]. Johnson et al. [65] extended the FI to survival analysis; more work is needed to explore the FI's applications in this direction [66]. In addition, the FI considered in this article reflects only significance stability; Walter et al. [67] recently suggested that clinical importance and quantitative stability should also be taken into account when assessing the fragility of trial results.
Second, this article used a frequentist NMA model to derive the FI. Alternative frequentist NMA methods are available [68]; they may lead to different treatment effect estimates, P-values, and CIs, and thus different FIs. Researchers should report FIs alongside the method used to perform the NMA. Frequentist methods typically treat comparisons' standard errors within studies as fixed, known values, leading to potential biases [69, 70]; they also need ad hoc corrections for zero counts. Bayesian NMA methods are widely used in current practice [71], and they may avoid the aforementioned problems. They are also advantageous because they permit flexible model structures and can incorporate informative priors [72, 73]. However, it may not be feasible to derive the FI based on Bayesian NMA methods because of limited computational capacity.
Lastly, researchers should carefully check several assumptions (e.g., treatment transitivity, consistency between direct and indirect evidence) for an NMA [4, 60]. This article did not distinguish direct and indirect evidence when modifying event status to derive the FI. Different sources of evidence may have different extents of credibility [74, 75]; event status modifications in comparisons from different sources of evidence may not play an equivalent role. In future studies, it is worthwhile to account for such additional information when assessing the fragility of an NMA.
Supplementary Material
What is new?
Key findings
The fragility index has been recently proposed to assess the robustness (or fragility) of results from clinical studies and from conventional pairwise meta-analyses; we extend it to network meta-analyses with binary outcomes, propose a heuristic algorithm to derive this index, and introduce methods to visualize the fragility.
The fragility index of network meta-analyses quantifies the minimal event status modifications for altering a specific treatment comparison’s statistical significance.
We illustrate the proposed methods using a worked example of smoking cessation data, in which some treatment comparisons’ significance can be altered after modifying a few events’ status.
What this adds to what is known?
Assessing the fragility of network meta-analyses involves more complex mechanisms than that of individual clinical studies or pairwise meta-analyses, as event status modifications in one treatment arm can affect the estimates of all treatment comparisons.
What is the implication and what should change now?
Many network meta-analyses may include treatments that are compared by only a few studies, rare events, or small sample sizes; they are potentially fragile, and the associated treatment comparisons’ robustness can be evaluated by the fragility index.
Financial support:
This research was supported in part by the U.S. National Institutes of Health/National Library of Medicine grant R01 LM012982 (HC and LL) and National Institutes of Health/National Center for Advancing Translational Sciences grant UL1 TR001427 (LL). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Conflict of interest: None.
References
- [1]. Dias S, Sutton AJ, Ades AE, Welton NJ. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Medical Decision Making. 2013;33:607–17.
- [2]. König J, Krahn U, Binder H. Visualizing the flow of evidence in network meta-analysis and characterizing mixed treatment comparisons. Statistics in Medicine. 2013;32:5414–29.
- [3]. Zhang J, Carlin BP, Neaton JD, Soon GG, Nie L, Kane R, et al. Network meta-analysis of randomized clinical trials: reporting the proper summaries. Clinical Trials. 2014;11:246–62.
- [4]. Salanti G. Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Research Synthesis Methods. 2012;3:80–97.
- [5]. Lin L, Xing A, Kofler MJ, Murad MH. Borrowing of strength from indirect evidence in 40 network meta-analyses. Journal of Clinical Epidemiology. 2019;106:41–9.
- [6]. Lumley T. Network meta-analysis for indirect treatment comparisons. Statistics in Medicine. 2002;21:2313–24.
- [7]. Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Statistics in Medicine. 2004;23:3105–24.
- [8]. Hong H, Chu H, Zhang J, Carlin BP. A Bayesian missing data framework for generalized multiple outcome mixed treatment comparisons. Research Synthesis Methods. 2016;7:6–22.
- [9]. Zhang J, Chu H, Hong H, Virnig BA, Carlin BP. Bayesian hierarchical models for network meta-analysis incorporating nonignorable missingness. Statistical Methods in Medical Research. 2017;26:2227–43.
- [10]. Liu Y, DeSantis SM, Chen Y. Bayesian mixed treatment comparisons meta-analysis for correlated outcomes subject to reporting bias. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2018;67:127–44.
- [11]. Lin L, Chu H, Hodges JS. Sensitivity to excluding treatments in network meta-analysis. Epidemiology. 2016;27:562–9.
- [12]. Jansen JP, Fleurence R, Devine B, Itzler R, Barrett A, Hawkins N, et al. Interpreting indirect treatment comparisons and network meta-analysis for health-care decision making: report of the ISPOR Task Force on Indirect Treatment Comparisons Good Research Practices: part 1. Value in Health. 2011;14:417–28.
- [13]. Lin L, Chu H, Hodges JS. On evidence cycles in network meta-analysis. Statistics and Its Interface. 2019: in press.
- [14]. Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting P values in the biomedical literature, 1990–2015. JAMA. 2016;315:1141–8.
- [15]. Furuya-Kanamori L, Xu C, Lin L, Doan T, Chu H, Thalib L, et al. P value–driven methods were underpowered to detect publication bias: analysis of Cochrane review meta-analyses. Journal of Clinical Epidemiology. 2020;118:86–92.
- [16]. Sterne JAC, Davey Smith G. Sifting the evidence—what’s wrong with significance tests? BMJ. 2001;322:226–31.
- [17]. Rozeboom WW. The fallacy of the null-hypothesis significance test. Psychological Bulletin. 1960;57:416–28.
- [18]. Rothman KJ. Significance questing. Annals of Internal Medicine. 1986;105:445–7.
- [19]. Goodman SN. Toward evidence-based medical statistics. 1: the P value fallacy. Annals of Internal Medicine. 1999;130:995–1004.
- [20]. Johnson VE. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences. 2013;110:19313–7.
- [21]. Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. The American Statistician. 2016;70:129–33.
- [22]. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, et al. Redefine statistical significance. Nature Human Behaviour. 2018;2:6–10.
- [23]. Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018;319:1429–30.
- [24]. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567:305–7.
- [25]. Ioannidis JPA. The importance of predefined rules and prespecified statistical analyses: do not abandon significance. JAMA. 2019;321:2067–8.
- [26]. Koletsi D, Solmi M, Pandis N, Fleming PS, Correll CU, Ioannidis JPA. Most recommended medical interventions reach P<0.005 for their primary outcomes in meta-analyses. International Journal of Epidemiology. 2019: in press.
- [27]. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine. 2008;358:252–60.
- [28]. Baker M. Is there a reproducibility crisis? Nature. 2016;533:452–4.
- [29]. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349:aac4716.
- [30]. Whittington CJ, Kendall T, Fonagy P, Cottrell D, Cotgrove A, Boddington E. Selective serotonin reuptake inhibitors in childhood depression: systematic review of published versus unpublished data. The Lancet. 2004;363:1341–5.
- [31]. Ioannidis JPA. The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. The Milbank Quarterly. 2016;94:485–514.
- [32]. Hacke C, Nunan D. Discrepancies in meta-analyses answering the same clinical question were hard to explain: a meta-epidemiological study. Journal of Clinical Epidemiology. 2020;119:47–56.
- [33]. Sun X, Ioannidis JPA, Agoritsas T, Alba AC, Guyatt G. How to use a subgroup analysis: users’ guide to the medical literature. JAMA. 2014;311:405–11.
- [34]. Palpacuer C, Hammas K, Duprez R, Laviolle B, Ioannidis JPA, Naudet F. Vibration of effects from diverse inclusion/exclusion criteria and analytical choices: 9216 different ways to perform an indirect comparison meta-analysis. BMC Medicine. 2019;17:174.
- [35]. Naudet F, Schuit E, Ioannidis JPA. Overlapping network meta-analyses on the same topic: survey of published studies. International Journal of Epidemiology. 2017;46:1999–2008.
- [36]. Seidler AL, Hunter KE, Cheyne S, Ghersi D, Berlin JA, Askie L. A guide to prospective meta-analysis. BMJ. 2019;367:l5342.
- [37]. Watt CA, Kennedy JE. Options for prospective meta-analysis and introduction of registration-based prospective meta-analysis. Frontiers in Psychology. 2017;7:2030.
- [38]. Peat G, Riley RD, Croft P, Morley KI, Kyzas PA, Moons KGM, et al. Improving the transparency of prognosis research: the role of reporting, data sharing, registration, and protocols. PLOS Medicine. 2014;11:e1001671.
- [39]. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Annals of Internal Medicine. 2009;151:264–9.
- [40]. Hutton B, Salanti G, Caldwell DM, Chaimani A, Schmid CH, Cameron C, et al. The PRISMA extension statement for reporting of systematic reviews incorporating network meta-analyses of health care interventions: checklist and explanations. Annals of Internal Medicine. 2015;162:777–84.
- [41]. Dechartres A, Ravaud P, Atal I, Riveros C, Boutron I. Association between trial registration and treatment effect estimates: a meta-epidemiological study. BMC Medicine. 2016;14:100.
- [42]. Tricco AC, Cogo E, Page MJ, Polisena J, Booth A, Dwan K, et al. A third of systematic reviews changed or did not specify the primary outcome: a PROSPERO register study. Journal of Clinical Epidemiology. 2016;79:46–54.
- [43]. Sideri S, Papageorgiou SN, Eliades T. Registration in the international prospective register of systematic reviews (PROSPERO) of systematic review protocols was associated with increased review quality. Journal of Clinical Epidemiology. 2018;100:103–10.
- [44]. Walsh M, Srinathan SK, McAuley DF, Mrkobrada M, Levine O, Ribic C, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. Journal of Clinical Epidemiology. 2014;67:622–8.
- [45]. Feinstein AR. The unit fragility index: an additional appraisal of “statistical significance” for a contrast of two proportions. Journal of Clinical Epidemiology. 1990;43:201–9.
- [46]. Walter SD. Statistical significance and fragility criteria for assessing a difference of two proportions. Journal of Clinical Epidemiology. 1991;44:1373–8.
- [47]. Docherty KF, Campbell RT, Jhund PS, Petrie MC, McMurray JJV. How robust are clinical trials in heart failure? European Heart Journal. 2016;38:338–45.
- [48]. Narayan VM, Gandhi S, Chrouser K, Evaniew N, Dahm P. The fragility of statistically significant findings from randomised controlled trials in the urological literature. BJU International. 2018;122:160–6.
- [49]. Tignanelli CJ, Napolitano LM. The fragility index in randomized clinical trials as a means of optimizing patient care. JAMA Surgery. 2019;154:74–9.
- [50]. Del Paggio JC, Tannock IF. The fragility of phase 3 trials supporting FDA-approved anticancer medicines: a retrospective analysis. The Lancet Oncology. 2019;20:1065–9.
- [51]. Lin L. Factors that impact fragility index and their visualizations. Journal of Evaluation in Clinical Practice. 2020: in press.
- [52]. Ridgeon EE, Young PJ, Bellomo R, Mucchetti M, Lembo R, Landoni G. The fragility index in multicenter randomized controlled critical care trials. Critical Care Medicine. 2016;44:1278–84.
- [53]. Atal I, Porcher R, Boutron I, Ravaud P. The statistical significance of meta-analyses is frequently fragile: definition of a fragility index for meta-analyses. Journal of Clinical Epidemiology. 2019;111:32–40.
- [54]. Rücker G. Network meta-analysis, electrical networks and graph theory. Research Synthesis Methods. 2012;3:312–24.
- [55]. Lin L, Zhang J, Hodges JS, Chu H. Performing arm-based network meta-analysis in R with the pcnetmeta package. Journal of Statistical Software. 2017;80:1–25.
- [56]. Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of Interventions. Chichester, UK: John Wiley & Sons; 2008.
- [57]. Salanti G, Ades AE, Ioannidis JPA. Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial. Journal of Clinical Epidemiology. 2011;64:163–71.
- [58]. Rücker G, Schwarzer G. Ranking treatments in frequentist network meta-analysis works without resampling methods. BMC Medical Research Methodology. 2015;15:58.
- [59]. Hasselblad V. Meta-analysis of multitreatment studies. Medical Decision Making. 1998;18:37–43.
- [60]. Lu G, Ades AE. Assessing evidence inconsistency in mixed treatment comparisons. Journal of the American Statistical Association. 2006;101:447–59.
- [61]. Carter RE, McKie PM, Storlie CB. The fragility index: a P-value in sheep’s clothing? European Heart Journal. 2016;38:346–8.
- [62]. Kruse BC, Vassar BM. Unbreakable? An analysis of the fragility of randomized trials that support diabetes treatment guidelines. Diabetes Research and Clinical Practice. 2017;134:91–105.
- [63]. Niforatos JD, Zheutlin AR, Chaitoff A, Pescatore RM. The fragility index of practice changing clinical trials is low and highly correlated with P-values. Journal of Clinical Epidemiology. 2020;119:140–2.
- [64]. Desnoyers A, Nadler MB, Wilson BE, Amir E. A critique of the fragility index. The Lancet Oncology. 2019;20:e552.
- [65]. Johnson KW, Rappaport E, Shameer K, Glicksberg BS, Dudley JT. fragilityindex: an R package for statistical fragility estimates in biomedicine. bioRxiv. 2019: preprint. doi: 10.1101/562264.
- [66]. Bomze D, Meirson T. A critique of the fragility index. The Lancet Oncology. 2019;20:e551.
- [67]. Walter SD, Thabane L, Briel M. The fragility of trial results involves more than statistical significance alone. Journal of Clinical Epidemiology. 2020;124:34–41.
- [68]. Efthimiou O, Debray TPA, van Valkenhoef G, Trelle S, Panayidou K, Moons KGM, et al. GetReal in network meta-analysis: a review of the methodology. Research Synthesis Methods. 2016;7:236–63.
- [69]. Lin L. Bias caused by sampling error in meta-analysis with small sample sizes. PLOS ONE. 2018;13:e0204056.
- [70]. Jackson D, White IR. When should meta-analysis avoid making hidden normality assumptions? Biometrical Journal. 2018;60:1040–58.
- [71]. Nikolakopoulou A, Chaimani A, Veroniki AA, Vasiliadis HS, Schmid CH, Salanti G. Characteristics of networks of interventions: a description of a database of 186 published networks. PLOS ONE. 2014;9:e86754.
- [72]. Lu G, Ades AE. Modeling between-trial variance structure in mixed treatment comparisons. Biostatistics. 2009;10:792–805.
- [73]. Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JPT. Predicting the extent of heterogeneity in meta-analysis, using empirical data from the Cochrane Database of Systematic Reviews. International Journal of Epidemiology. 2012;41:818–27.
- [74]. Puhan MA, Schünemann HJ, Murad MH, Li T, Brignardello-Petersen R, Singh JA, et al. A GRADE Working Group approach for rating the quality of treatment effect estimates from network meta-analysis. BMJ. 2014;349:g5630.
- [75]. Nikolakopoulou A, Higgins JPT, Papakonstantinou T, Chaimani A, Del Giovane C, Egger M, et al. CINeMA: an approach for assessing confidence in the results of a network meta-analysis. PLOS Medicine. 2020;17:e1003082.