Abstract
Objective:
The fragility index is a clinically interpretable metric increasingly used to assess the robustness of clinical trial results, but it is generally applied post hoc and not incorporated in the sample size calculation. In this manuscript, we propose to base the sample size calculation on the fragility index in a way that supplements the classical pre-fixed alpha and power cutoffs, and we provide a dedicated R software package with the design and analysis tools.
Study Design and Setting:
This approach follows from a novel hypothesis testing framework that is based on the fragility index and builds on the classical testing approach. As case studies, we re-analyse the design of two important trials in cardiovascular medicine, the FAME and FAMOUS-NSTEMI trials.
Results:
The analyses show that the approach returns sample sizes which result in a higher power for the p value based test and, most importantly, a lower and context-dependent Type I error rate for the fragility index based test compared to standard tests.
Conclusion:
Our method allows clinicians to control for the fragility index during clinical trial design.
Keywords: fragility index, p value, statistical significance, research methods, sample size calculation, trial design
1. Introduction
The fragility index is now generally defined for 2 × 2 tables as the minimum number of patients whose status would need to switch from non-event to event or vice versa to change the statistical significance of trial results. It is a measure of the fragility of randomized controlled trials (RCTs). The smaller the index, the more fragile the trial results are, and the larger the index, the less fragile the trial results are [1, 2]. Since the landmark article by Walsh et al. [2] researchers have applied the fragility index to assess trial robustness in many areas of clinical research (see, for instance, [3, 4, 5, 6, 7, 8, 9, 10]).
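The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the FragilityTools implementation: the function names `chi2_pvalue` and `fragility_index` are hypothetical, the significance test is assumed to be Pearson's chi-squared test without continuity correction, and a table that is not significant to begin with is simply assigned an index of 0.

```python
import math

def chi2_pvalue(x1, n1, x2, n2):
    """Two-sided p value of Pearson's chi-squared test (1 df, no continuity correction)."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # an empty margin carries no evidence against the null
    chi2 = (n1 + n2) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-squared with 1 df

def fragility_index(x1, n1, x2, n2, alpha=0.05):
    """Count non-event -> event switches in the arm with fewer events until p >= alpha."""
    if chi2_pvalue(x1, n1, x2, n2) >= alpha:
        return 0  # simplification (assumption): not significant, so nothing to overturn
    modify_first = x1 <= x2  # the arm with fewer events is fixed up front
    switches = 0
    while chi2_pvalue(x1, n1, x2, n2) < alpha:
        if modify_first and x1 < n1:
            x1 += 1
        elif not modify_first and x2 < n2:
            x2 += 1
        else:
            break  # no patients left to switch
        switches += 1
    return switches

# Hypothetical table: 34/426 events in one arm versus 60/426 in the other.
print(fragility_index(34, 426, 60, 426))
```

Any other test that returns a p value for a 2 × 2 table could be substituted for `chi2_pvalue` without changing the loop.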
Traditionally, the statistical significance of RCTs has been evaluated using the p value, a rigid metric with a fixed threshold, by convention α = 0.05. However, by themselves, p values are of limited clinical utility. Shortcomings of p values as a single summary measure have been detailed in the ASA’s statement on p values [11, 12], where it was stressed that the current practice of relying on p values should be updated. In addition to being limited by the assumptions inherent in frequentist statistics, p value based statistical significance can be lost or gained with the alteration of a few events in one arm of a trial [13, 3].
The fragility index was created to partially overcome some of these limitations and to intuitively quantify an interpretable measure of trial robustness. While the fragility index is still related to the p value and associated with its limitations, it provides an interpretable retrospective sensitivity analysis to stress-test the significant result of an RCT and to contextualize its meaning to clinicians. The stress test associated with the fragility index is comparable to using a statistical test based on rejecting the null hypothesis when the fragility index is large. While it is true that a higher fragility index is not conceptually different than a lower p value, the former is more intuitive for the clinical audience in part since the units of “patients” are more interpretable than the units of probability.
Published evidence demonstrates that the fragility index of RCTs in different medical fields is generally low. In an analysis of 399 RCTs published in high-impact medical journals from 2004 to 2010, Walsh et al. [2] found the median fragility index to be 8; that is, 8 patients changing event status would change the conclusions. In 25% of the trials the fragility index was ≤ 3. Other empirical investigations [14, 15, 16] across different fields of science have suggested that there is a substantial concentration of p values in the .041 to .049 range, which indicates the prevalence of barely significant studies.
Specifically, more than a quarter of RCTs supporting current guidelines on myocardial revascularization have a fragility index of 3 or lower, and for over 40% of the trials, the fragility index is lower than the number of patients lost to follow-up [3]. The median fragility index was 5 for the RCTs supporting the 2016 Chest Guidelines and Expert Recommendations for Venous Thrombo-embolism [4]. Similarly low fragility indices have been reported in trauma (median FI=3), critical care (median FI=2), nephrology (median FI=3), spine and sport surgery (median FI=2 for both), and anesthesiology (median FI=4) [5, 6, 7, 8]. However, the RCTs supporting the 2017 guidelines for the treatment of diabetes mellitus have a relatively higher median fragility index of 16 [9]. These values are summarized in Table 1. Traditional sample size calculations are blind to trial fragility and hence inadvertently yield fragile trial designs.
Table 1:
The median fragility indices reported from analyses of several medical areas.
| Trial area | Median fragility index |
|---|---|
| Venous Thrombo-embolism | 5 |
| Trauma | 3 |
| Critical care | 2 |
| Nephrology | 3 |
| Spine and sport surgery | 2 |
| Anesthesiology | 4 |
| Diabetes mellitus | 16 |
Until now, the fragility index has only been retrospectively applied to assess the robustness of published RCTs. While this facilitates interpretation of existent trial data, the current fragility index model lacks utility for guiding design of prospective trials which remain susceptible to weak sample size calculation and fragile conclusions. Sample size calculation for RCTs is a complex endeavor that has clinical, ethical, and resource implications. Fragile RCTs are inconclusive and expose patients to the risks and burden of clinical research without providing definitive answers.
We propose that the fragility index should be applied to the sample size calculation of RCTs at the design stage. We implement the sample size calculation by simultaneously designing for the classical p value based test and a novel fragility index based test. The fragility index based sample size calculation has two main advantages: it assures trial robustness at a design stage and it is based on a statistical threshold that is tailored to the individual trial and research question, rather than based on predetermined and generic alpha and power levels. In Section 2, we describe the proposed sample size calculation algorithm. The algorithm is efficiently implemented in the R package FragilityTools [17, 18]. In Sections 3 and 4, we illustrate the algorithm to redesign the FAME and FAMOUS-NSTEMI trials. In Section 5, we conclude the paper by highlighting the important points and contextualizing the fragility index.
2. Methods
In this section we will introduce the proposed sample size calculation algorithm, discuss how to contextualize its output, and present the elements to consider to choose values for the required input. The next two paragraphs define the notation and underlying concepts used throughout the paper.
Classical statistical testing is based on having a rejection region R for which a discovery (true positive result) is made whenever a test statistic T ∈ R. Rejections declare that an effect size parameter δ of a model is not specified through the null hypothesis H0 : δ = δ0 (where usually δ0 = 0) and instead is in the set specified by the alternative hypothesis Ha. Here we focus on a model for 2 × 2 tables with fixed row sums. Define X ∼ Binomial(n1, θ1) as the number of events in the treatment group (i.e. first row) and Y ∼ Binomial(n2, θ2) similarly for the control group (i.e. second row). The parameter of interest is the effect size δ = θ1−θ2. Two examples of classical statistical tests for this model are Pearson’s chi-squared test and a test which rejects when the fragility index is large. Although this article focuses on 2 × 2 contingency tables, the data type originally studied for the fragility index, the sample size calculation technique described in this section can be applied to any data type for which it is possible to (1) simulate from an alternative distribution, (2) calculate a fragility index, and (3) calculate a p value.
This model and its associated probabilities at parameters θ2 and δ provide a way to evaluate the errors associated with the rejection region: the Type I error rate, α = P(T ∈ R) evaluated under the null value δ = δ0, is the probability of falsely rejecting a true null hypothesis; and the power, π = P(T ∈ R) evaluated at the effect size δ, is the probability of correctly rejecting a false null hypothesis. (For ease of presentation we are using the symbol π for power rather than the traditional 1 − β.) In the case that R consists of large values of T, an observed test statistic T = t can be transformed to produce the p value, p = P(T ≥ t) evaluated under H0, the probability under the null hypothesis H0 that the test statistic is at least as extreme as observed [19]. It is important to note that since the fragility index is a function of the test statistic T it is a random variable with a distribution that depends on the underlying probability model with parameters θ2 and δ.
2.1. Sample size calculations
A major design step of any study is determining the sample size required to produce meaningful inferences. Sample size calculations are commonly done through a power analysis [20]. Power analyses find the fewest number of samples (i.e. patients) for which the desired statistical test has a given Type I error rate and high enough power. In efficacy trials the Type I error rate is constrained because falsely rejecting the null hypothesis (a false positive) is typically considered to have a higher cost than failing to reject a false null hypothesis. In safety trials the focus should be on the control of false negatives which is the equivalent of a power constraint.
Performing a power analysis relies on specifying several terms. A maximum Type I error rate α′ needs to be fixed. The default choice tends to be α′ = 0.05 [21]. Estimates of the event probability θ2 of the control group and the effect size δ need to be found, which usually requires referencing the results of similar studies. A minimum power π′ at the estimated parameters needs to be specified. The default choice tends to be π′ = 0.80 or π′ = 0.90. The output of a power analysis is then the sample size n*, the smallest sample size at which a test with small enough Type I error rate (α ≤ α′) has high enough power (π ≥ π′); any smaller study (n < n*) with α ≤ α′ has power π < π′.
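As a point of reference for the inputs listed above, the following sketch uses the textbook normal-approximation formula for comparing two proportions with equal arms. This closed form is an assumption made for illustration (the approach in this paper relies on simulation), and the function name `n_per_arm` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_arm(theta1, theta2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the Type I error rate
    z_power = NormalDist().inv_cdf(power)          # quantile matching the required power
    p_bar = (theta1 + theta2) / 2                  # pooled event probability under the null
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(theta1 * (1 - theta1) + theta2 * (1 - theta2))
    return math.ceil(((z_alpha * sd_null + z_power * sd_alt) / (theta1 - theta2)) ** 2)

# Design values quoted in Sections 3 and 4 for the two case studies.
print(2 * n_per_arm(0.08, 0.14, alpha=0.05, power=0.80))  # FAME inputs
print(2 * n_per_arm(0.30, 0.15, alpha=0.05, power=0.90))  # FAMOUS-NSTEMI inputs
```

With the design values quoted later for FAME and FAMOUS-NSTEMI, this approximation gives totals of 852 and 322 patients, matching the totals reported for the original trial designs.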
We propose to extend power analyses to simultaneously design for both the p value based test and a second, novel fragility index based test. Viewing the fragility index procedure as a test is crucial for the construction of a sample size calculation which follows the same principles as traditional power analyses. We let 1 − τ′ be the designed power of the fragility index based test, which is analogous to the power π′ of the p value based test. However, instead of directly specifying the Type I error rate of the fragility index based test, we specify the fragility index cutoff φ′ of the test. A higher value of φ′ corresponds to a more stringent test and a lower Type I error rate. Notice that requiring the fragility index to exceed φ′ with probability at least 1 − τ′ under the alternative distribution is equivalent to the τ′ quantile of the fragility index under the alternative distribution being at least φ′. Therefore the power specification of the fragility index based test can alternatively be viewed as constraining the τ′ quantile of the fragility index. We usually take τ′ = 0.50 so that we are constraining the median fragility index, although other, possibly context dependent, values could be used.
The traditional power analysis can be performed through simulation [22] by fixing the model parameters at the alternative values θ2 and δ and the Type I error rate at α′ then increasing the sample size n until a precise estimate of the power reaches π′. The simulation approach allows for estimating the power through the proportion of rejections among many independent draws of samples from the model with the given parameters θ2 and δ.
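The simulation approach just described can be sketched as follows. This is a simplified illustration under stated assumptions: equal arms, Pearson's chi-squared test without continuity correction, and a plain upward search in fixed steps rather than a refined search. The names `simulated_power` and `smallest_n` are hypothetical.

```python
import math
import random

def chi2_pvalue(x1, n1, x2, n2):
    """Two-sided p value of Pearson's chi-squared test (1 df, no continuity correction)."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = (n1 + n2) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2))

def simulated_power(n_arm, theta1, theta2, alpha=0.05, reps=2000, seed=1):
    """Proportion of rejections among independent draws from the alternative model."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x1 = sum(rng.random() < theta1 for _ in range(n_arm))  # Binomial(n_arm, theta1)
        x2 = sum(rng.random() < theta2 for _ in range(n_arm))  # Binomial(n_arm, theta2)
        rejections += chi2_pvalue(x1, n_arm, x2, n_arm) < alpha
    return rejections / reps

def smallest_n(theta1, theta2, target=0.80, start=100, step=25, reps=2000):
    """Increase the per-arm sample size until the estimated power reaches the target."""
    n_arm = start
    while simulated_power(n_arm, theta1, theta2, reps=reps) < target:
        n_arm += step
    return n_arm

print(simulated_power(426, 0.08, 0.14))
```

The estimate is noisy, so the step size and the number of replications trade run time against precision.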
We implement the proposal by running this algorithm a second time with the fragility index in place of the usual p value. The second run of the algorithm fixes 1 − τ′ as the power of the fragility index based test and increases the sample size until the fragility index quantile reaches the fragility index threshold φ′. Notice that this is an inversion compared to the traditional p value based power analysis: instead of fixing the Type I error rate and achieving the power, the second run fixes the power and achieves the fragility index cutoff (in place of the Type I error rate). Like p value based testing, which has a separate significance threshold (corresponding to the Type I error rate) and power, fragility index based testing also has a separate specification of the fragility cutoff φ′ (corresponding to the Type I error rate) and the power. Whichever fragility cutoff is chosen, the power of the fragility index based test can be chosen to be any desired value, and vice versa.
The overall output of the algorithm is then the maximum of the sample sizes between both runs, the p value based sample size and the fragility index-based sample size. Notice that taking the maximum shows that the procedure is simultaneously designing for both the p value based test and the fragility index based test and maintaining the guarantees for both tests: the output sample size satisfies the cutoffs considered in both tests.
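The two runs and the final maximum can be sketched as below. Several choices here are assumptions made to keep the example short: equal arms, a chi-squared test, non-significant tables assigned a fragility index of 0, and a simple bisection over noisy estimates in place of a stochastic approximation search. All function names are hypothetical and are not the FragilityTools interface.

```python
import math
import random

def chi2_pvalue(x1, n1, x2, n2):
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    return math.erfc(math.sqrt((n1 + n2) * (a * d - b * c) ** 2 / denom / 2))

def fragility_index(x1, n1, x2, n2, alpha):
    """Switches needed in the arm with fewer events to lose significance (0 if not significant)."""
    modify_first, switches = x1 <= x2, 0
    while chi2_pvalue(x1, n1, x2, n2) < alpha:
        if modify_first and x1 < n1:
            x1 += 1
        elif not modify_first and x2 < n2:
            x2 += 1
        else:
            break
        switches += 1
    return switches

def simulate(n_arm, theta1, theta2, alpha, tau, reps, rng):
    """Return (estimated power of the p value test, estimated tau quantile of the FI)."""
    fis = []
    for _ in range(reps):
        x1 = sum(rng.random() < theta1 for _ in range(n_arm))
        x2 = sum(rng.random() < theta2 for _ in range(n_arm))
        fis.append(fragility_index(x1, n_arm, x2, n_arm, alpha))
    fis.sort()
    power = sum(fi > 0 for fi in fis) / reps  # FI > 0 exactly when p < alpha
    return power, fis[min(reps - 1, int(tau * reps))]

def smallest_n(criterion, lo=50, hi=1500):
    """Bisection for the smallest n with criterion(n) true, assuming it is monotone in n."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if criterion(mid) else (mid, hi)
    return hi

def design(theta1, theta2, alpha=0.05, power=0.80, tau=0.50, phi=10, reps=300, seed=1):
    rng = random.Random(seed)
    sim = lambda n: simulate(n, theta1, theta2, alpha, tau, reps, rng)
    n_p = smallest_n(lambda n: sim(n)[0] >= power)  # first run: p value based test
    n_fi = smallest_n(lambda n: sim(n)[1] >= phi)   # second run: fragility index quantile
    return n_p, n_fi, max(n_p, n_fi)                # per-arm sizes; the design uses the maximum

print(design(0.08, 0.14, phi=10))
```

Because each bisection step compares a noisy estimate with its target, the returned sizes vary with the seed and the number of replications.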
Since simulations only return noisy estimates of the power and the fragility index quantile, in both cases we use the Polyak-Ruppert averaging algorithm [23] to precisely find the smallest satisfactory sample size. Details of the proposed algorithm are shown in the Algorithm in Figure 1. Note, we calculate fragility indices in the standard way by modifying events or non-events in the control or treatment group with the fewest events. We define negative fragility indices to be the negative of a “reverse” fragility index [24].
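The idea behind averaged stochastic approximation can be illustrated on a toy version of the power search. This sketch is not the algorithm of Figure 1: the noisy "simulation" is generated from a normal-approximation power curve, the step-size constants `c` and `gamma` are arbitrary choices, only the second half of the iterates is averaged, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def approx_power(n_arm, theta1=0.08, theta2=0.14, alpha=0.05):
    """Normal-approximation power, used here only to generate noisy stand-in output."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (theta1 + theta2) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(theta1 * (1 - theta1) + theta2 * (1 - theta2))
    z = (abs(theta1 - theta2) * math.sqrt(n_arm) - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z)

def noisy_power(n_arm, rng, reps=100):
    """Stand-in for a Monte Carlo power estimate based on `reps` simulated trials."""
    p = approx_power(n_arm)
    return sum(rng.random() < p for _ in range(reps)) / reps

def averaged_search(target=0.80, n0=200.0, iters=2000, c=2000.0, gamma=0.7, seed=1):
    """Robbins-Monro updates with slowly decaying steps, followed by iterate averaging."""
    rng = random.Random(seed)
    n_arm, tail = n0, []
    for k in range(1, iters + 1):
        step = c / k ** gamma  # decays more slowly than 1/k
        n_arm = max(2.0, n_arm + step * (target - noisy_power(n_arm, rng)))
        if k > iters // 2:     # discard the early transient before averaging
            tail.append(n_arm)
    return sum(tail) / len(tail)

print(averaged_search())
```

Averaging the iterates smooths out the noise that any single iterate carries, which is why the averaged value is more stable than the last iterate alone.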
Figure 1:

An algorithm to calculate sample size taking into account the fragility index. We run the algorithm twice: first not touching the fragility index and second not touching the p-value. We then output n* as the maximum between those two.
2.2. Finding the resulting Type I error rate and power
Unlike usual power analyses, the Type I error rate and power of the designed statistical test are not roughly equal to the input values α′ and π′, respectively. In fact, there is more than one test being considered: the p value based test and the fragility index based test. In order to interpret the output of the sample size calculation algorithm, we find the Type I error rate and power of the tests which either (1) reject when the p value is less than the significance threshold α′ or (2) reject when the fragility index is larger than the fragility index cutoff φ′. We do this numerically by reporting the proportion of rejections over many independent simulations with the appropriate null or alternative data model. In practice, neither of the two tests will be used exactly, as both the p value (and hence the usual threshold for statistical significance) and the fragility index will be taken into account by practitioners to declare that a discovery is real. The procedure actually used will likely fall somewhere between the two.
For the p value based test, we know by design that the Type I error rate is approximately α′. The power is at least π′ but often larger due to simultaneously designing for a more stringent fragility index based test. For the fragility index based test, the power of the test is approximately 1 − τ′ due to the choice of constraining the τ′ quantile of the fragility index FI. We caution against choosing 1 − τ′ too high, since we feel that by default the consequent sample size inflation is not justified in exchange for making the more stringent test excessively high powered. By default, we will choose τ′ = 0.50. The Type I error rate of the second test never exceeds α′ and is often much smaller, because the strength of evidence is constrained to be high enough that the fragility index is larger than φ′. An example is shown in Table 2 in Section 3.
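The numerical evaluation described in this subsection can be sketched as follows. The assumptions are the same as in any small illustration: equal arms, a chi-squared test, non-significant tables assigned a fragility index of 0, and a null model in which both arms share the control probability θ2. The function names are hypothetical, and with only 1,000 replications the estimates are far noisier than those in Table 2.

```python
import math
import random

def chi2_pvalue(x1, n1, x2, n2):
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    return math.erfc(math.sqrt((n1 + n2) * (a * d - b * c) ** 2 / denom / 2))

def fragility_index(x1, n1, x2, n2, alpha=0.05):
    """Switches needed in the arm with fewer events to lose significance (0 if not significant)."""
    modify_first, switches = x1 <= x2, 0
    while chi2_pvalue(x1, n1, x2, n2) < alpha:
        if modify_first and x1 < n1:
            x1 += 1
        elif not modify_first and x2 < n2:
            x2 += 1
        else:
            break
        switches += 1
    return switches

def rejection_rates(n_arm, theta1, theta2, phi, alpha=0.05, reps=1000, seed=1):
    """Proportion of simulated trials rejected by the p value test and by the FI test."""
    rng = random.Random(seed)
    p_rejections = fi_rejections = 0
    for _ in range(reps):
        x1 = sum(rng.random() < theta1 for _ in range(n_arm))
        x2 = sum(rng.random() < theta2 for _ in range(n_arm))
        fi = fragility_index(x1, n_arm, x2, n_arm, alpha)
        p_rejections += fi > 0     # FI > 0 exactly when p < alpha
        fi_rejections += fi > phi  # the more selective FI based test
    return p_rejections / reps, fi_rejections / reps

# Alternative model (theta1 = 0.08, theta2 = 0.14) gives power; null model gives Type I error.
print("power:", rejection_rates(617, 0.08, 0.14, phi=15))
print("type I:", rejection_rates(617, 0.14, 0.14, phi=15))
```

Since every trial rejected by the fragility index based test is also rejected by the p value based test, its rejection rate can never be the larger of the two.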
Table 2:
The estimated Type I error rate α and power in the FAME trial for the p value based test (reject if p < 0.05) and the fragility index based test (reject if FI > φ′). The sample sizes were designed with α′ = 0.05 and π′ = 0.80. The estimated rates are based on 1,000,000 Monte Carlo samples. The first seven rows vary φ′ at τ′ = 0.50; the last three rows vary τ′ at φ′ = 15.
| τ′ | φ′ | n* | α (reject if p < 0.05) | Power (reject if p < 0.05) | α (reject if FI > φ′) | Power (reject if FI > φ′) |
|---|---|---|---|---|---|---|
| 0.50 | 0 | 851 | 0.05013 | 0.80577 | 0.05013 | 0.80577 |
| 0.50 | 5 | 851 | 0.05022 | 0.80499 | 0.01266 | 0.60180 |
| 0.50 | 10 | 987 | 0.04955 | 0.85882 | 0.00327 | 0.48256 |
| 0.50 | 15 | 1235 | 0.05028 | 0.92563 | 0.00106 | 0.48337 |
| 0.50 | 20 | 1478 | 0.05013 | 0.96070 | 0.00037 | 0.48469 |
| 0.50 | 25 | 1706 | 0.05031 | 0.97911 | 0.00016 | 0.48369 |
| 0.50 | 30 | 1934 | 0.04992 | 0.98898 | 0.00006 | 0.48377 |
| 0.50 | 15 | 1232 | 0.04993 | 0.92397 | 0.00108 | 0.47956 |
| 0.35 | 15 | 1431 | 0.05021 | 0.95539 | 0.00154 | 0.63000 |
| 0.20 | 15 | 1697 | 0.05032 | 0.97836 | 0.00222 | 0.78014 |
In addition, we interpret the sample sizes by finding alternative choices of the parameters α′, π′, and δ which produce the same sample size through a traditional power analysis. These new parameter choices are connected to the traditional p value based test and show how analysts using traditional power analyses would arrive at the same sample size recommendations as they would when taking into account the fragility index. Note that these new parameter choices do not reflect the approximate properties of tests which involve the fragility index: they instead help tie together the proposed sample size calculation and traditional sample size calculations. An example is shown in Table 3 in Section 3.
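The back-calculation described here can be approximated in closed form. The sketch below assumes the normal-approximation power formula for two proportions with equal arms, so its output is close to, but not identical with, simulation-based values; the function names and the bisection bounds are hypothetical choices.

```python
import math
from statistics import NormalDist

N = NormalDist()

def sds(theta1, theta2):
    """Standard deviation terms under the null (pooled) and under the alternative."""
    p_bar = (theta1 + theta2) / 2
    return (math.sqrt(2 * p_bar * (1 - p_bar)),
            math.sqrt(theta1 * (1 - theta1) + theta2 * (1 - theta2)))

def power_at(n_arm, theta1, theta2, alpha):
    """Power that a traditional design attains at a given per-arm sample size."""
    sd_null, sd_alt = sds(theta1, theta2)
    z = (abs(theta1 - theta2) * math.sqrt(n_arm) - N.inv_cdf(1 - alpha / 2) * sd_null) / sd_alt
    return N.cdf(z)

def alpha_at(n_arm, theta1, theta2, power):
    """Significance level at which the required power is met exactly at n_arm."""
    sd_null, sd_alt = sds(theta1, theta2)
    z_alpha = (abs(theta1 - theta2) * math.sqrt(n_arm) - N.inv_cdf(power) * sd_alt) / sd_null
    return 2 * (1 - N.cdf(z_alpha))

def delta_at(n_arm, theta2, alpha, power):
    """Bisection for the effect size theta2 - theta1 whose required sample size is n_arm."""
    lo, hi = 1e-4, theta2 - 1e-4
    for _ in range(60):
        mid = (lo + hi) / 2
        if power_at(n_arm, theta2 - mid, theta2, alpha) < power:
            lo = mid  # effect too small to reach the power at this sample size
        else:
            hi = mid
    return (lo + hi) / 2

n_arm = 1235 / 2  # total sample size 1235 split over two equal arms
print(power_at(n_arm, 0.08, 0.14, alpha=0.05))
print(alpha_at(n_arm, 0.08, 0.14, power=0.80))
print(delta_at(n_arm, 0.14, alpha=0.05, power=0.80))
```

Each function changes one design input and holds the others at their original values, mirroring the one-at-a-time construction of the new parameter choices.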
Table 3:
The new values for traditional sample size calculations which produce the same sample size as the algorithm when choosing one new value and the rest from the old values α′ = 0.05, π′ = 0.80, and δ = 0.06. The estimated rates are based on 1,000,000 Monte Carlo samples.
| τ′ | φ′ | n* | α′ (new) | π′ (new) | δ trad |
|---|---|---|---|---|---|
| 0.50 | 0 | 851 | 0.04799 | 0.80473 | 0.06088 |
| 0.50 | 5 | 851 | 0.04764 | 0.80597 | 0.06096 |
| 0.50 | 10 | 987 | 0.02800 | 0.85853 | 0.05704 |
| 0.50 | 15 | 1235 | 0.01061 | 0.92560 | 0.05140 |
| 0.50 | 20 | 1478 | 0.00419 | 0.96057 | 0.04781 |
| 0.50 | 25 | 1706 | 0.00167 | 0.97934 | 0.04553 |
| 0.50 | 30 | 1934 | 0.00068 | 0.98902 | 0.04326 |
2.3. Choosing the fragility index cutoff φ′
An important element of the proposed sample size calculation is specifying a fragility index cutoff φ′. The value will be used to draw a line for the purposes of the trial design between fragility indices which are “too small” and those which are large enough. The determination of the cutoff is complicated and driven by several elements which do not have standard guidelines. This puts choosing the fragility index cutoff on the same footing that was intended for choosing a p value cutoff. It was pointed out by Neyman and Pearson, the researchers that proposed the hypothesis testing framework with Type I and Type II errors, that the p value cutoff should be chosen with “discretion and understanding” [25] and that how “the balance [between the two kinds of error] should be struck must be left to the investigator” [26].
The fragility index cutoff is a compromise between the importance of achieving solid results and the ethical concerns of exposing randomized patients to a potentially less effective (or more harmful) treatment. A cutoff should take into account the clinical question that is being addressed, how widely the result will be applied, the type of intervention, and the type of outcome considered in the individual trials. In trials that test interventions that may affect survival or lead to major clinical events (coronary or cardiac valve interventions for example) a lower fragility index is likely acceptable, whereas for trials that focus on interventions that affect less catastrophic or enduring clinical events (re-hospitalization, in-hospital stay) a higher fragility index may be appropriate. The financial cost of the intervention can also be considered as part of the impact on a patient. For each trial the fragility index cutoff to be used for sample size calculation must be chosen based on the above considerations and on the published evidence as well as experts’ and, ideally, patients’ opinions, similarly to what is done for the choice of the non-inferiority margin in non-inferiority trials.
Based on our review of the fragility indices in the literature, we choose to consider fragility index cutoffs between 0 and 30, depending on the elements presented above. After specifying a fragility index cutoff, we consider the properties of the designed tests to ensure they are satisfactory. We do not propose a universal standard here; however, we anticipate that in time the framework to choose a cutoff will mature.
3. FAME Trial Example
This section contains the first example of the use of the proposed sample size calculation. The Fractional Flow Reserve versus Angiography for Multivessel Evaluation (FAME) trial was the first large RCT to show the efficacy of using fractional flow reserve measurements to guide percutaneous coronary interventions and is a landmark study in interventional cardiology [27]. The trial compared coronary angioplasty with implantation of drug-eluting stents guided by angiography alone or guided by fractional flow reserve (FFR) measurements in addition to angiography. The primary end point was a set of prespecified major adverse cardiac events (MACE), which is the occurrence of death, nonfatal myocardial infarction, or repeat revascularization, at 1 year.
In this section, we reproduce and extend the sample size calculations based on MACE in the FAME study. Their sample size calculation used the reasonable default values α′ = 0.05 and π′ = 0.80, their assumed event probabilities were θ2 = 0.14 for the angiography group and θ1 = 0.08 for the FFR group, and their analyses were based on a two-sided chi-square test. With these values, they found a total sample size of 852 patients was needed. Our calculations use the same prespecified clinically important values.
3.1. The sample sizes
Figure 2 highlights the possible total sample sizes that the algorithm can output. Throughout, we use α′ = 0.05. The horizontal lines in the figure show the output of standard power analyses. The solid line (for power π′ = 0.80) cuts the vertical axis at the sample size 851, approximately the same as the 852 calculated in the original study.
Figure 2:

A sample size plot for FAME showing the total sample size output of the algorithm for various choices of the minimum tolerable fragility index φ′. The α′ and π′ parameters contribute to the horizontal lines through the p value based test, and the τ′ and φ′ parameters contribute to the points through the FI based test. The maximum over both give the designed sample size.
Along the horizontal axis of Figure 2, many possible choices of the fragility index cutoff φ′ are considered. The points show the output of the second component of the algorithm which finds the sample size so that the τ′ quantile of the fragility index is larger than φ′. The vertical axis consequently shows the sample size needed for the fragility index to be higher than the cutoff φ′ with probability 1−τ′. Recall that the overall output of the algorithm is the maximum of the value of the point and the horizontal line.
For example, if we specify the parameter π′ = 0.80, then the fragility index cutoff φ′ = 10 produces a sample size of max{851, 987} = 987 patients when 1 − τ′ = 0.50 (circular) and max{851, 1424} = 1424 patients when 1 − τ′ = 0.80 (triangular). As the fragility index cutoff is constrained to be higher so that the demanded strength of evidence grows, the calculated sample size also grows. The curve generated by the circular dots (for τ′ = 0.50) intersects the 851 solid horizontal line when the minimum median FI is 8, showing that in this case the default sample size calculation designs a study with a median fragility index of 8. However, at the end of the FAME trial, the fragility index was only 4, perhaps due to random chance or misspecifying the effect sizes at design.
The curve on which the points for each τ′ lie is approximately linear, meaning that we can reasonably interpret the slope. For the circular points corresponding to 1 − τ′ = 0.50 and the median fragility index, the slope is approximately 48. Therefore, on average each additional unit of the fragility index costs 48 patients at design. This metric is very helpful to consider when weighing whether to expand a trial for the sake of decreasing its fragility. For 1−τ′ = 0.80, the slope of the triangular points is approximately 52, showing that the cost of designing for a higher fragility index grows as the designed power of the fragility index based test grows.
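The slope can be checked roughly against the tabulated sample sizes. The sketch below fits a least-squares line to the Table 2 rows with τ′ = 0.50 and φ′ ≥ 10, the range in which the fragility index run determines n*. Restricting to these five points is an assumption; the figure itself may use a finer grid of cutoffs.

```python
# (phi', n*) pairs taken from the Table 2 rows with tau' = 0.50 and phi' >= 10.
points = [(10, 987), (15, 1235), (20, 1478), (25, 1706), (30, 1934)]

mean_phi = sum(phi for phi, _ in points) / len(points)
mean_n = sum(n for _, n in points) / len(points)

# Ordinary least-squares slope: covariance of (phi', n*) over variance of phi'.
s_xy = sum((phi - mean_phi) * (n - mean_n) for phi, n in points)
s_xx = sum((phi - mean_phi) ** 2 for phi, _ in points)
slope = s_xy / s_xx

print(f"{slope:.1f} additional patients per unit of fragility index")
```

The fitted value of about 47 patients per unit is in line with the slope of approximately 48 read off the figure.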
3.2. The resulting tests
Table 2 shows the Type I error rate and power for two natural tests with sample sizes calculated using the algorithm with parameters α′ = 0.05 and π′ = 0.80. The first test is based on the practitioner designing based on the fragility index but only rejecting based on the p value being less than α′, while the second test is based on the practitioner being more selective and only rejecting when the fragility index exceeds their chosen cutoff φ′.
The Type I error rate of the FI-based test gets very small as the fragility index grows. When φ′ = 30, the Type I error rate of the FI-based test is roughly 0.00006, which is three orders of magnitude smaller than the default α′ = 0.05. The Type I error rate being so low for high choices of the cutoff φ′ means that using the sample size calculation then using the FI-based test will lead to a very low chance of falsely overturning standard procedures specified by the null hypothesis. The Type I error rate of the first (p value based) test stays constant at around 0.05.
The power of the FI-based test starts at around 0.80 when the FI-cutoff φ′ = 0 then stabilizes at around 0.50 for φ′ ≥ 10. This behavior is predicted because the sample size output from the algorithm is the maximum of two sample sizes: that which makes the power of the first test at least π′ = 0.80 and that which makes the τ′ quantile of FI under the alternative at least φ′. For φ′ = 0 or 5, the maximum is from the power-based calculation; however, for φ′ ≥ 10, the maximum is from the FI-based calculation. This again shows that the median FI expected by a trial that is designed without taking into account fragility index is around 8. In fact, the power of the test when φ′ ≥ 10 stabilizes at a value slightly less than 0.50 due to the rejection region relying on the discrete test statistic FI.
The power of the p value based test grows from 0.80 to higher than 0.98. The power gets so high since the test designs for the fragility index FI to be high but then only rejects based on whether FI > 0 (i.e. p < 0.05). In Table 3, we explore which values could be used to make a traditional sample size calculation (based on the p value test alone) produce the same output as the proposed calculation (which takes into account the fragility index). Notice that the terms all get more extreme as φ′ increases. The power column in Table 3 and the power of the p value based test column in Table 2 are independent realizations of the same estimator for the same underlying parameter. Their closeness illustrates that the estimates have low noise. However, the discrepancy between the Type I error rate of the FI based test and the new Type I error rate reflects fundamental differences between the fragility index and p values. The effect size δtrad reduces to three-quarters of its original size to give the same sample size as the algorithm when φ′ = 30 and τ′ = 0.50.
The last three rows of Table 2 show the result of fixing α′ but varying the power 1 − τ′ of the FI based test. Designing for a larger power of the FI based test makes the power of the p value based test larger. The Type I error rate of the FI based test increases as the FI based power 1 − τ′ increases because the sample size increases. As the sample size grows larger, larger fragility indices become increasingly common.
Designing without considering the fragility index then testing with the fragility index is common in the literature, since there is currently no method to design taking into account the fragility index. For the FAME trial, the sample size 852 was found without considering the fragility index. Using that sample size but testing based on φ′ = 15 leads to a power of approximately 0.17, which is very low. When the Type I error rate and the power of the test are roughly equal, little can be learned from a rejection.
For φ′ = 15 and 1 − τ′ = 0.50, Table 2 shows that an FI based test with sample size 1235 has Type I error rate α ≈ 0.00106 and power approximately 0.48337. A p value based test with the same sample size and Type I error rate has power approximately 0.542 (estimated over one million Monte Carlo samples). This higher value suggests that using the p value based test could be preferred in this case. The p value based test is the inversion of a statistical procedure that has certain frequentist optimality properties [28], whereas the fragility index based test is not rooted in any such optimality theory.
3.3. Our determination of φ′
The FAME trial addresses a highly relevant research question, as coronary artery disease is among the leading causes of death in the western world and percutaneous interventions are the first-line treatment applied annually in hundreds of thousands of patients. The primary outcome of the trial included highly important clinical events, such as death and myocardial infarction. Based on the above considerations, we postulated that a fragility index cutoff of φ′ = 15 would be appropriate for this trial.
Such a constraint leads the calculated sample size to be 1235, which is roughly 45% higher than without the constraint. At that sample size, the resulting p value based test has power roughly 0.92 and the resulting FI-based test has Type I error rate roughly 0.001. The Type I error rate of the FI-based test is in line with (but slightly more stringent than) the order of magnitude reduction in the default Type I error rate advocated through Bayesian calibration arguments [29, 30].
4. FAMOUS-NSTEMI Trial Example
This section contains the second example of the use of the proposed sample size calculation. The fractional flow reserve vs. angiography in guiding management to optimize outcomes in non-ST-segment elevation myocardial infarction (British Heart Foundation FAMOUS-NSTEMI) randomized trial was designed to assess the management and outcomes of patients with non-ST-segment elevation myocardial infarction. Patients were randomly assigned to fractional flow reserve-guided management or angiography-guided standard care, and the primary outcome was the between-group difference in the proportion of patients allocated to medical management.
In this section, we reproduce and extend the sample size calculations in the FAMOUS-NSTEMI study. Their sample size calculation used the reasonable default values α′ = 0.05 and π′ = 0.90, their assumed event probabilities were θ2 = 0.15 for the angiography group and θ1 = 0.30 for the FFR group, and their analyses were based on a two-sided chi-square test. With these values, they found that a total sample size of 322 patients was needed. Our calculations use the same prespecified clinically important values.
4.1. The sample sizes
The total sample size plot in Figure 3 shows largely the same behavior as the corresponding plot for the FAME trial. The dashed horizontal line (corresponding to π′ = 0.90) shows that a sample size calculation with φ′ = 0 produces the sample size 322, the same as was originally designed. The median fragility index of a trial designed without considering the fragility index is around 10. At the end of their study, however, the fragility index for MACE was only 4. The difference could be due to incorrectly specifying the effect sizes at design or perhaps just random variation. The study had no patients lost to follow-up. Also, the curve following the circular points (for τ′ = 0.50) is less steep than in FAME and has a slope of only 20 patients per unit of the fragility index. Therefore, when designing the FAMOUS-NSTEMI trial, each additional unit of the fragility index costs 20 patients at design.
Figure 3: A sample size plot for FAMOUS-NSTEMI showing the output of the algorithm for various choices of the minimum tolerable fragility index φ′.
4.2. The resulting tests
Table 4 shows the same general trend as in the previous section. However, here the Type I error rate of the fragility index based test decreases more rapidly. Increasing the fragility index cutoff φ′ from 0 to 5 reduces the Type I error rate by roughly an order of magnitude, which happens again when φ′ is increased from 5 to 10.
Table 4:
The estimated Type I error rate α and power in the FAMOUS-NSTEMI trial for the p value based test and the fragility index based test. The sample sizes were found using the algorithm with α′ = 0.05 and π′ = 0.90. The estimated rates are based on 1,000,000 Monte Carlo samples.
| τ′ | φ′ | n* | α (reject if p < 0.05) | Power (reject if p < 0.05) | α (reject if FI > φ′) | Power (reject if FI > φ′) |
|---|---|---|---|---|---|---|
| 0.50 | 0 | 322 | 0.05076 | 0.90654 | 0.05076 | 0.90654 |
| 0.50 | 5 | 322 | 0.05138 | 0.90622 | 0.00483 | 0.71995 |
| 0.50 | 10 | 332 | 0.04983 | 0.91419 | 0.00025 | 0.47410 |
| 0.50 | 15 | 429 | 0.05101 | 0.96588 | 0.00003 | 0.47252 |
| 0.50 | 20 | 521 | 0.05072 | 0.98622 | 0.00001 | 0.47508 |
| 0.50 | 25 | 611 | 0.05024 | 0.99453 | < 0.00001 | 0.48136 |
| 0.50 | 30 | 697 | 0.05057 | 0.99782 | < 0.00001 | 0.47790 |
| | | | | | | |
| 0.50 | 25 | 611 | 0.05070 | 0.99462 | < 0.00001 | 0.48141 |
| 0.35 | 25 | 679 | 0.05067 | 0.99744 | < 0.00001 | 0.63043 |
| 0.20 | 25 | 765 | 0.05064 | 0.99900 | < 0.00001 | 0.77962 |
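The kind of Monte Carlo estimate reported in Table 4 can be sketched as follows. This is an illustration under stated assumptions, not the FragilityTools implementation: the fragility index is computed with the simple procedure of Walsh et al. [2] (adding events to the group with the lower event proportion until significance is lost), the p value comes from Pearson's chi-squared test without continuity correction, the null scenario sets both event probabilities to 0.15, and only 2,000 simulated trials are used. All function names are hypothetical.

```python
import random
from math import erfc, sqrt


def chisq_p_value(e1, n1, e2, n2):
    """Two-sided Pearson chi-squared p value (no continuity correction)."""
    events = e1 + e2
    total = n1 + n2
    if events == 0 or events == total:
        return 1.0  # degenerate table: no evidence of a difference
    stat = total * (e1 * (n2 - e2) - e2 * (n1 - e1)) ** 2 / (
        n1 * n2 * events * (total - events)
    )
    return erfc(sqrt(stat / 2))  # upper tail of chi-squared with 1 df


def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Walsh-style fragility index; 0 if the result is not significant."""
    flips = 0
    if e1 * n2 <= e2 * n1:  # group 1 has the lower event proportion
        while e1 < n1 and chisq_p_value(e1, n1, e2, n2) < alpha:
            e1 += 1
            flips += 1
    else:
        while e2 < n2 and chisq_p_value(e1, n1, e2, n2) < alpha:
            e2 += 1
            flips += 1
    return flips


def simulate(theta1, theta2, n_per_group, phi, n_sims=2000, seed=1):
    """Estimate rejection rates of the p value and fragility index tests."""
    rng = random.Random(seed)
    reject_p = reject_fi = 0
    for _ in range(n_sims):
        e1 = sum(rng.random() < theta1 for _ in range(n_per_group))
        e2 = sum(rng.random() < theta2 for _ in range(n_per_group))
        fi = fragility_index(e1, n_per_group, e2, n_per_group)
        reject_p += fi > 0    # FI > 0 exactly when p < 0.05
        reject_fi += fi > phi  # the fragility index based test
    return reject_p / n_sims, reject_fi / n_sims


# Alternative (design) scenario and an assumed null scenario, n* = 322.
power_p, power_fi = simulate(0.30, 0.15, n_per_group=161, phi=5)
alpha_p, alpha_fi = simulate(0.15, 0.15, n_per_group=161, phi=5)
print(power_p, power_fi, alpha_p, alpha_fi)
```

Because every trial with FI > φ′ also has p < 0.05, the fragility index based rejection rate can never exceed that of the p value based test. With so few simulated trials and a simplified fragility index, the estimates should only be expected to be in the neighborhood of the corresponding row of Table 4.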
Table 5 considers the p value based test and shows which values could be used with a traditional power analysis to output the same sample size as the fragility index aware algorithm. It also shows the same general trend as in the previous section.
Table 5:
The new values for traditional sample size calculations that produce the same sample size as the algorithm when one value is changed and the others are kept at the original values α′ = 0.05, π′ = 0.90, and δ = 0.15. The estimated rates are based on 1,000,000 Monte Carlo samples.
| τ′ | φ′ | n* | α trad | π trad | δ trad |
|---|---|---|---|---|---|
| 0.50 | 0 | 322 | 0.04650 | 0.90625 | 0.14807 |
| 0.50 | 5 | 322 | 0.04650 | 0.90648 | 0.14796 |
| 0.50 | 10 | 332 | 0.04046 | 0.91484 | 0.14592 |
| 0.50 | 15 | 429 | 0.01226 | 0.96601 | 0.12663 |
| 0.50 | 20 | 521 | 0.00385 | 0.98623 | 0.11467 |
| 0.50 | 25 | 611 | 0.00125 | 0.99454 | 0.10458 |
| 0.50 | 30 | 697 | 0.00041 | 0.99780 | 0.09772 |
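To see how a given sample size maps back to a traditional design parameter, the sketch below inverts the usual normal-approximation sample size formula for two proportions. It assumes a pooled variance under the null hypothesis and the design values θ1 = 0.30 and θ2 = 0.15; the function names are hypothetical. Since these are closed-form approximations rather than Monte Carlo estimates, the results are expected to be close to, but not identical with, the entries of Table 5.

```python
from math import sqrt
from statistics import NormalDist

THETA1, THETA2 = 0.30, 0.15  # design event probabilities from Section 4
POOLED = (THETA1 + THETA2) / 2
SD_NULL = sqrt(2 * POOLED * (1 - POOLED))
SD_ALT = sqrt(THETA1 * (1 - THETA1) + THETA2 * (1 - THETA2))
DELTA = THETA1 - THETA2


def power_trad(n_total, alpha=0.05):
    """Approximate power of the two-sided test at a given total sample size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z = (DELTA * sqrt(n_total / 2) - z_alpha * SD_NULL) / SD_ALT
    return NormalDist().cdf(z)


def alpha_trad(n_total, power=0.90):
    """Approximate significance level for which n_total is the required size."""
    z_power = NormalDist().inv_cdf(power)
    z_alpha = (DELTA * sqrt(n_total / 2) - z_power * SD_ALT) / SD_NULL
    return 2 * (1 - NormalDist().cdf(z_alpha))


# Sample sizes n* taken from Table 5.
for n_star in (322, 429, 521, 611, 697):
    print(n_star, round(alpha_trad(n_star), 5), round(power_trad(n_star), 5))
```

The qualitative pattern is the same as in the table: as n* grows, the equivalent traditional significance level shrinks quickly while the equivalent power approaches 1.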
If the trial is designed with φ′ = 0 (i.e., using a traditional power analysis) but tested with the cutoff φ′ = 25, the power of the resulting test is only 0.01, which is extremely small.
4.3. Our determination of φ′
The FAMOUS-NSTEMI trial addresses the same question as the FAME trial (the treatment by percutaneous interventions of patients with coronary artery disease). However, a key difference from FAME is that in FAMOUS-NSTEMI the primary outcome consists of a clinically less important event (management by medical rather than invasive therapy). Because of this, and based on the considerations discussed in Section 2.3, we postulated that a fragility index cutoff of 25 would be appropriate for this trial (φ′ = 25).
Selecting φ′ = 25 raises the calculated sample size to 611 patients, which is roughly 90% higher than the 322 patients required without the constraint. The tests designed with φ′ = 25 have power greater than 0.99 (for the p value based test) and slightly less than 0.50 (for the FI based test).
5. Conclusion
The medical literature is increasingly relying on the fragility index to interpret the results of published RCTs. To unify the design and analysis of studies, we proposed an intuitive algorithm to account for the fragility index at the time of study design and sample size calculation.
The algorithm is implemented in the open source R package FragilityTools [18]. The package includes a toolbox of functions which implement a general fragility index calculation and sample size calculation. The appendix contains example code with explanations. The algorithm is fast; for example, a sample size calculation takes around ten seconds on a standard PC with a 2.80 GHz processor. The package functions flexibly accept any method of calculating p values as an argument: in both examples above, we used Pearson’s chi-squared test rather than Fisher’s exact test. Compared to traditional sample size calculations, the only additional information that needs to be specified is two properties of the fragility index based test: the power 1−τ′ and the fragility index cutoff φ′. By default, τ′ is taken to be 0.50, so that the fragility index based test has power 0.50.
The acceptable fragility indices must be established for each trial based on the available clinical evidence, expert opinions and the potential benefits of the trial findings, similarly to what is generally done for the non-inferiority margin of non-inferiority trials.
The chosen fragility index cutoff is based on clinical considerations regarding the intervention tested, the clinical scenario, and the outcomes considered. Two other elements in choosing the cutoff are the sample size and event probability. The meaningfulness of knowing that a fragility index is larger than a fixed cutoff lessens as the sample size grows [31, 32]. For instance, a study with a sample size of n = 100 and a fragility index of 5 is less fragile than a study with a sample size of n = 1,000,000 and a fragility index of 5. Also, the higher the event probability, the higher the (intuitive) fragility of a trial for the same level of fragility index. For instance, 5 patients having a different outcome becomes more likely as their event probability gets closer to 1/2. In our examples we have considered the sample size and event probabilities to be stable enough that we did not directly consider them for our choices, but alternative scenarios are reasonable.
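One way to make this intuition concrete is to compare a fragility index of 5 with the chance variation in the number of events. The sketch below is purely illustrative: it assumes a simple binomial model for the event count in a group of n patients, and the event probability of 0.15 and the helper name `event_count_sd` are choices made here for the example.

```python
from math import sqrt


def event_count_sd(n, event_probability):
    """Standard deviation of the number of events among n patients."""
    return sqrt(n * event_probability * (1 - event_probability))


fragility_index = 5

# Same fragility index, very different sample sizes.
for n in (100, 1_000_000):
    sd = event_count_sd(n, 0.15)
    print(n, round(sd, 2), round(fragility_index / sd, 3))

# Same sample size, event probability moving toward 1/2.
for p in (0.05, 0.15, 0.50):
    print(p, round(event_count_sd(100, p), 2))
```

Under this model, 5 changed outcomes exceed one standard deviation of the event count when n = 100 but are a small fraction of it when n = 1,000,000, and the standard deviation is largest when the event probability equals 1/2.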
Some authors have pointed out shortcomings of the retrospective use of the fragility index. One simulation study suggested that the fragility index is a repackaging of the p value [33]. Larger fragility indices for a given effect are obtained for larger sample sizes since these are the cases with small p values. Underpowered studies are associated with smaller fragility indices [34].
The relationship between the p value and the fragility index makes sense in our view because they are both measures of evidence against the null hypothesis. The framing of the fragility index procedure as a statistical test was a crucial ingredient in the sample size calculation. However, we recommend that practitioners not apply the fragility index test procedure as a binary decision rule. Introducing the fragility index testing procedure with parameters φ′ and τ′ is an acknowledgement that rigid choices of the comparable parameters α′ and π′ in the p value based test are sufficiently ingrained that another, more intuitive, test needs to be put “on top” of the p value based test to yield a test with flexible properties.
The fragility index based test is always more stringent than the p value based test because the fragility index being positive is equivalent to the p value being less than the significance threshold. The degree of stringency is controlled by the user supplied fragility index cutoff φ′. Since the cutoff should be chosen in a clinically relevant manner (being on the “patient” scale rather than a probability scale), the cutoff drives the Type I error rate of the fragility index testing procedure to be smaller than the default α′ = 0.05 in a clinically meaningful and context dependent way. This lower significance level achieves an important guideline for testing set forth by the statistical community [29, 30]. Further, using the proposed sample size calculation approach, this desirable test will have reasonable power.
Supplementary Material
What is new?
The fragility index is increasingly used to analyse clinical trials. We propose to take this into account during the design of a trial by developing an inferential framework for the fragility index.
We provide an open source R package FragilityTools which efficiently implements an algorithm for sample size calculations that takes into account the traditional design parameters (alpha, beta, control group event rate, effect size) as well as the fragility index.
The fragility index based procedure will tend to have low Type I error rate, which achieves a long standing statistical guideline. The lower Type I error rate is derived in an interpretable way for clinicians and so is context dependent.
Acknowledgements
We thank Irbaz Hameed for helpful conversations and suggesting the FAMOUS-NSTEMI trial for consideration and Robin Alexander for her helpful coding suggestions.
Wells’ research was partially supported by NIH grant U19 AI111143, PCORI IHS-2017C3–8923, and Cornell’s Center for the Social Sciences project on Algorithms, Big Data, and Inequality. The funding sources had no direct role in the paper: the researchers had independence from the funders.
Footnotes
Conflict of interest: The authors have no conflicts of interest to declare.
Ethical approval was not required for this study.
References
- [1] Feinstein AR, The unit fragility index: an additional appraisal of “statistical significance” for a contrast of two proportions, Journal of Clinical Epidemiology 43 (2) (1990) 201–209.
- [2] Walsh M, Srinathan SK, McAuley DF, Mrkobrada M, Levine O, Ribic C, Molnar AO, Dattani ND, Burke A, Guyatt G, et al., The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index, Journal of Clinical Epidemiology 67 (6) (2014) 622–628.
- [3] Gaudino M, Hameed I, Biondi-Zoccai G, Tam DY, Gerry S, Rahouma M, Khan FM, Angiolillo DJ, Benedetto U, Taggart DP, et al., Systematic evaluation of the robustness of the evidence supporting current guidelines on myocardial revascularization using the fragility index, Circulation: Cardiovascular Quality and Outcomes 12 (12) (2019) e006017.
- [4] Edwards E, Wayant C, Besas J, Chronister J, Vassar M, How fragile are clinical trial outcomes that support the CHEST clinical practice guidelines for VTE?, Chest 154 (3) (2018) 512–520.
- [5] Khan M, Evaniew N, Gichuru M, Habib A, Ayeni OR, Bedi A, Walsh M, Devereaux P, Bhandari M, The fragility of statistically significant findings from randomized trials in sports surgery: a systematic survey, The American Journal of Sports Medicine 45 (9) (2017) 2164–2170.
- [6] Evaniew N, Files C, Smith C, Bhandari M, Ghert M, Walsh M, Devereaux PJ, Guyatt G, The fragility of statistically significant findings from randomized trials in spine surgery: a systematic survey, The Spine Journal 15 (10) (2015) 2188–2197.
- [7] Ridgeon EE, Young PJ, Bellomo R, Mucchetti M, Lembo R, Landoni G, The fragility index in multicenter randomized controlled critical care trials, Critical Care Medicine 44 (7) (2016) 1278–1284.
- [8] Shochet LR, Kerr PG, Polkinghorne KR, The fragility of significant results underscores the need of larger randomized controlled trials in nephrology, Kidney International 92 (6) (2017) 1469–1475.
- [9] Kruse BC, Vassar BM, Unbreakable? An analysis of the fragility of randomized trials that support diabetes treatment guidelines, Diabetes Research and Clinical Practice 134 (2017) 91–105.
- [10] Holek M, Bdair F, Khan M, Walsh M, Devereaux P, Walter SD, Thabane L, Mbuagbaw L, Fragility of clinical trials across research fields: A synthesis of methodological reviews, Contemporary Clinical Trials 97 (2020) 106151.
- [11] Wasserstein RL, Lazar NA, The ASA statement on p-values: Context, process, and purpose, The American Statistician 70 (2) (2016) 129–133.
- [12] Wasserstein R, Schirm A, Lazar N, Statistical inference in the 21st century: A world beyond p < 0.05 [special issue], American Statistician 73.
- [13] Lee JJ, Demystify statistical significance—time to move on from the p value to Bayesian analysis, Journal of the National Cancer Institute 103 (1) (2010) 2–3.
- [14] Simonsohn U, Nelson LD, Simmons JP, P-curve: a key to the file-drawer, Journal of Experimental Psychology: General 143 (2) (2014) 534.
- [15] Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD, The extent and consequences of p-hacking in science, PLoS Biol 13 (3) (2015) e1002106.
- [16] Simonsohn U, Nelson LD, Simmons JP, p-curve and effect size: Correcting for publication bias using only significant results, Perspectives on Psychological Science 9 (6) (2014) 666–681.
- [17] R Core Team, R: A language and environment for statistical computing (2020). URL https://www.R-project.org/
- [18] Baer BR, Gaudino M, Fremes SE, Charlson ME, Wells MT, FragilityTools, R package version 1.0.0 (2020). URL https://github.com/brb225/FragilityTools
- [19] DeGroot MH, Schervish MJ, Probability and statistics, Pearson Education, 2012.
- [20] Chow S-C, Wang H, Shao J, Sample size calculations in clinical research, CRC Press, 2007.
- [21] Fisher RA, Statistical methods for research workers, 5th Edition, Oliver and Boyd, Edinburgh and London, 1934.
- [22] Arnold BF, Hogan DR, Colford JM, Hubbard AE, Simulation methods to estimate design power: an overview for applied research, BMC Medical Research Methodology 11 (1) (2011) 1–10.
- [23] Polyak BT, Juditsky AB, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization 30 (4) (1992) 838–855.
- [24] Khan MS, Fonarow GC, Friede T, Lateef N, Khan SU, Anker SD, Harrell FE, Butler J, Application of the reverse fragility index to statistically nonsignificant randomized clinical trial results, JAMA Network Open 3 (8) (2020) e2012469.
- [25] Neyman J, Pearson ES, On the use and interpretation of certain test criteria for purposes of statistical inference: Part I, Biometrika (1928) 175–240.
- [26] Neyman J, Pearson ES, The testing of statistical hypotheses in relation to probabilities a priori, in: Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 29, Cambridge University Press, 1933, pp. 492–510.
- [27] Tonino PA, De Bruyne B, Pijls NH, Siebert U, Ikeno F, van ’t Veer M, Klauss V, Manoharan G, Engstrøm T, Oldroyd KG, et al., Fractional flow reserve versus angiography for guiding percutaneous coronary intervention, New England Journal of Medicine 360 (3) (2009) 213–224.
- [28] Lehmann EL, Romano JP, Testing statistical hypotheses, Springer Science & Business Media: New York, 2006.
- [29] Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, et al., Redefine statistical significance, Nature Human Behaviour 2 (1) (2018) 6–10.
- [30] Ioannidis JP, The proposal to lower p value thresholds to .005, Journal of the American Medical Association 319 (14) (2018) 1429–1430.
- [31] Tignanelli CJ, Napolitano LM, The fragility index in randomized clinical trials as a means of optimizing patient care, JAMA Surgery 154 (1) (2019) 74–79.
- [32] Potter GE, Dismantling the fragility index: A demonstration of statistical reasoning, Statistics in Medicine.
- [33] Carter RE, McKie PM, Storlie CB, The fragility index: a p-value in sheep’s clothing?, European Heart Journal 38 (5) (2017) 346–348.
- [34] Chan A-W, Altman DG, Epidemiology and reporting of randomised trials published in PubMed journals, The Lancet 365 (9465) (2005) 1159–1162.