Abstract
Background
The number of Phase III trials that include a biomarker in design and analysis has increased due to interest in personalised medicine. For genetic mutations and other predictive biomarkers, the trial sample comprises two subgroups, one of which, say B+, is known or suspected to achieve a larger treatment effect than the other, B−. Despite treatment effect heterogeneity, trials often draw patients from both subgroups, since the lower responding subgroup may also gain benefit from the intervention. In this case, regulators/commissioners must decide what constitutes sufficient evidence to approve the drug in the B− population.
Methods and Results
Assuming trial analysis can be completed using generalised linear models, we define and evaluate three frequentist decision rules for approval. For rule one, the significance of the average treatment effect in B− should exceed a pre-defined minimum value, say Z− > L. For rule two, the data from the low-responding group should increase statistical significance. For rule three, the subgroup-treatment interaction should be non-significant, using a type I error chosen to ensure that the estimated difference between the two subgroup effects is acceptable. Rules are evaluated based on conditional power, given that there is an overall significant treatment effect. We show how different rules perform according to the distribution of patients across the two subgroups and when analyses include additional (stratification) covariates, thereby conferring correlation between subgroup effects.
Conclusions
When additional conditions are required for approval of a new treatment in a lower response subgroup, easily applied rules based on minimum effect sizes and relaxed interaction tests are available. Choice of rule is influenced by the proportion of patients sampled from the two subgroups but less so by the correlation between subgroup effects.
Keywords: Clinical trials, regulatory approval, subgroups, enrichment
1 Background
Since the rise of personalised medicine, the number of Phase III trials that include a biomarker in design and analysis has increased. A biomarker has been defined as “A characteristic that is measured as an indicator of normal biological processes, pathogenic processes or responses to an exposure or intervention.”1 Biomarkers of interest are those which are related to important clinical outcomes and may be prognostic (associated with the clinical outcome independently of treatment) or predictive (interact with treatment).2 Predictive biomarkers in contemporary trials are often related to one or more genetic mutations, gene expression or a function of several genetic markers and may be dichotomous, ordinal or continuous. Despite the loss of information, for practical reasons they are often dichotomised. In this paper, we focus on biomarkers that are dichotomous (naturally or by design), and prior to commencing a confirmatory Phase III trial, are expected to be predictive.
In our context, we define sub-populations of patients as either biomarker positive (B+) or negative (B−). A common situation is that the treatment efficacy is a priori assumed to be better (or at least as good) in B+ compared to B− patients. For example,
a drug treatment may have been developed to target a genetic disorder defining group B+,
there may be an untested clinical hypothesis of better efficacy in B+,
empirical data (biological or early clinical) may indicate higher efficacy in B+.
Even if a treatment has been developed to target the biomarker of interest, some of the efficacy in B+ may also be present in B−. Depending on the situation, one may expect no or minimal efficacy in B−, that a large proportion of the efficacy in B+ is retained, or that efficacy in B− is difficult to predict even if the treatment is known to be efficacious in B+. The effect in B− may arise, for example, if the biomarker is intrinsically continuous, so that treatment efficacy varies continuously across its levels. Dichotomisation may then lead to positive, but smaller efficacy in B−. Alternatively, some patients in B− may have an unknown genetic defect acting on the same pharmacological pathway as the B+ genetic mutation, which is the target of the intervention. Moreover, few biomarker tests have 100% sensitivity and specificity in practice, leading to diffusion of efficacy from patients incorrectly classified as B+ or B−. As a result, biomarker level will interact with the treatment, with a higher treatment effect in B+ patients than B− patients.
Although a new treatment may be most effective in the higher responding subgroup, trials often draw patients from both subgroups, since the lower responding subgroup may also gain benefit from the intervention. In this case, a problem emerges when deciding what constitutes sufficient evidence to approve the drug in the B− population. European guidelines state that “Confirmatory trials should reflect the target population to be treated” so that trials will sample from a target population. However, if the target population is heterogeneous, ensuring that the treatment effect is sufficiently large in a lower responding subgroup may still be warranted. Moreover, in order to improve efficiency of a trial, the higher responding subgroup may be oversampled (enriched sample), resulting in under-representation of lower responders relative to the target population. In such a case, regulators and sponsors are concerned that automatic approval for lower responders based on obtaining an overall significant effect could result in harm, for example if side effects outweigh potential benefit in this subgroup. If the low response subgroup makes up a very large proportion of the population, there may also be substantial cost to healthcare providers without the predicted benefits. Therefore, regulators and sponsors may wish to impose additional conditions for approval of the treatment in this subgroup. Current regulatory guidelines acknowledge the importance of heterogeneity in decision making and encourage subgroup analysis in confirmatory trials.3 However, the guidance does not describe specific rules for approval of subgroups when heterogeneity exists.
There is a large literature on subgroup analysis in phase III trials. Much of this literature concerns post hoc exploration of a moderate to large number of subgroups using interaction tests, with issues such as data dredging and multiplicity well documented.4 This study differs in that we are concerned with the situation where there is an overall significant effect, two pre-defined subgroups known to differ in treatment effect, and optimal rules for treatment approval in the lower responding subgroup are required.
Where hypothesis tests are applied to multiple subgroups, it is important to control the family-wise error rate (FWER).5 For two subpopulations B+ and B−, there are different multiple testing procedures that control the FWER.6 In most applications, formal testing focuses on the full population (F) and B+ using either a hierarchical approach (F followed by B+, or B+ followed by F) or by splitting type I error between parallel tests; testing of B− is rarely included in applications for regulatory approval, as power is considered to be limited.7,8 In this study, we concentrate on the situation where the intervention has statistically significant efficacy in F, with significance in the B+ group assumed to follow due to higher efficacy in this subpopulation; conditions for approval in B− are then developed and assessed. A strategy that conditions on significance in B+ rather than F is closely related mathematically and results are expected to be very similar.
Although the study is motivated by trials of drugs targeting specific genetic mutations (see Gonzalez-Martin et al.9 for a recent example), other examples of trials with similar structures define subgroups according to age (adults and children),10 mild and severe disease,11 early and late stage cancers,12 as well as other biomarkers.13 Proposed methods should apply to any trial including two subgroups with known or suspected treatment effect inequality.
We provide a brief literature review of methods and practice in this context in section 2 before describing proposed rules for exponential family models in section 3. Conditional power of the rules is explored in section 4 and applied retrospectively to two published phase III trials in section 5, before briefly discussing implications for future trial design. A discussion completes the paper (section 6).
2 Existing literature
A review of FDA drug approvals with required biomarker testing found that biomarker negative patients were simply excluded from the majority of trials.14 Since exclusion was often not based on clinical evidence, these patients could be denied potential benefit from novel treatments. Moreover, provided that there is a sound biological basis for some benefit in biomarker negative patients, including them may also confirm the clinical utility of the biomarker itself.
Heterogeneity within a target population was also recognised in updated EMA guidance on the investigation of subgroups in confirmatory clinical trials, published in January 2019.3 Whilst the guidance suggests that restriction of a trial population to a sub-population is justified if there are safety concerns or an anticipated lack of efficacy, it also calls for additional trials including the full breadth of the population to provide the best evidence of effect modifiers. Inclusion of the biomarker negative subgroup was highlighted as important, though it can create difficulties in analysis if the treatment effect is small or there are only a small number of such patients, resulting in low power to detect a significant treatment effect. Despite this, patients in this subgroup may still benefit from treatment, and are therefore harmed if the result is discarded for non-significance.
A 2016 review by Ondra et al.15 found that two concepts underpin current methods for assessing subgroup effects, influence and interaction. An ‘influence’ condition sets a threshold that must be met by the treatment estimate of the subgroup of interest, whilst an interaction test sets a difference between treatment effects for two (or more) subgroups, in effect requiring that effects are sufficiently close. These methods may be used for approval of a treatment or as conditions which must be met for the subpopulation to be included in the next stage of analysis (adaptive designs). For example, Stallard et al.16 compared different strategies for choosing which hypotheses to test in the second stage of analysis (either the full population or a subgroup, or both), which used either an influence or interaction test approach. Similarly, Matsui and Crowley17 proposed a sequential design where in the first stage of their analysis they use superiority and futility boundaries to decide which populations go forward for further analysis. This preserves statistical power for detecting various profiles of treatment effects across the subgroups, and allows the biomarker negative population to be tested again if they do not cross the futility boundary.
Despite the development of different adaptive designs, interaction tests appear to be the main method used to assess subgroup heterogeneity. In our (unpublished) targeted systematic review of large clinical trials that carried out subgroup analyses in the New England Journal of Medicine, we found that approximately two thirds used interaction tests to decide whether there was significant treatment effect heterogeneity. Almost all other articles summarised within-subgroup effects and used significance tests with 5% type I error.
Although most of the literature rests on the frequentist paradigm, a Bayesian approach could also be considered.18 By specifying a two-dimensional prior for efficacy in B+ and B−, one can explicitly borrow information from one subpopulation when evaluating the other. This prior should reflect the clinical plausibility of a range of differential treatment effects between the two subpopulations.
This study was partly motivated by the design of the APEX trial, which compared betrixaban with standard dose enoxaparin in medically ill patients at risk of venous thrombosis. The investigators carried out sequential analyses, the first on a subgroup defined by the biomarker D-dimer, the second on a subgroup defined by a combination of D-dimer level and age, and the third of the full population. If any result was negative, then subsequent tests were treated as exploratory. The first subgroup analysis was just above the pre-defined threshold for statistical significance of 5% (p = 0.054), so that the subsequent subgroup analysis (elevated D-dimer level or age ≥ 75 years) and full population analysis had to be treated as exploratory, although hypothesis test statistics were ‘significant’ at p = 0.03 and p = 0.006, respectively. Clearly, such an analysis may have substantial implications for approval of the experimental treatment. A more traditional analysis plan would be to consider the full population first followed by the subgroups; however, conditions for approval in subgroups are less well established.
In our context, we may accept some heterogeneity between subgroups, provided that there is sufficient benefit in the B− subgroup. The issue is in choosing an acceptable difference between the treatment effect in the two subgroups, or equivalently, choosing a relaxed (higher) significance level for the interaction test. On the other hand, decision rules that focus on influence rely solely on the data in the B− subgroup, but require us to pre-specify a minimum bound for the acceptance threshold. In order to avoid the need to specify either a minimum treatment effect or a more relaxed interaction level, a decision rule that does not require additional parameters may also be attractive.
The question of how to deal with approval in a limited sub-population has important implications for maximising the patients who could benefit, which is particularly important for conditions where there are few treatment options. There remains uncertainty about how to address this issue, and how different decision rules perform according to issues such as prevalence of the high responder subgroup in the population and in the trial. We outline a simple strategy to choose appropriate and efficient decision rules in a frequentist framework.
3 Methods
3.1 Generalised linear model and subgroup notation
In practice, phase III clinical trials that have a biomarker-treatment interaction are analysed using linear, generalised linear or survival regression models. We restrict attention in this paper to the wide range of trial outcomes that have Normal, Binomial or Poisson distributions and review the general framework here, defining estimands of interest, estimators and statistics.
Generalised linear models that describe different treatment effects in the two subgroups have a linear predictor of the form

$$\eta_i = \beta_0 + \beta_S S_i + \mu_- T_i (1 - S_i) + \mu_+ T_i S_i + \gamma^\top x_i \qquad (1)$$

where for patient $i$, $T_i = 0, 1$ for control and experimental treatments, $S_i = 0, 1$ for subgroups B− (biomarker negative), B+ (biomarker positive) and $x_i$ is a vector of baseline covariates, usually minimisation or stratification factors, included to increase precision of the treatment effect estimate or to adjust for chance imbalance. We assume that there are N patients in the trial overall, $n = N/2$ in each trial arm and that $\pi n$ in each treatment arm are drawn from sub-population B+, the remaining $(1-\pi)n$ are drawn from sub-population B−.
For the exponential family of distributions, the expected response $E(Y_i)$ and the linear predictor $\eta_i$ are connected through the link function $g$, with $g\{E(Y_i)\} = \eta_i$. For trial outcomes that have Normal, Binomial or Poisson distributions, canonical link functions are the identity, logit and log functions, respectively.
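As a minimal sketch of the three canonical links named above, the following Python fragment pairs each link with its inverse. The dictionary name `LINKS` and the helper `expected_response` are illustrative choices, not notation from the paper.

```python
import math

# Canonical link g and its inverse for the three outcome types in the text.
# Each entry is (link, inverse_link).
LINKS = {
    "normal": (lambda m: m, lambda eta: eta),                    # identity
    "binomial": (lambda p: math.log(p / (1.0 - p)),              # logit
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "poisson": (lambda lam: math.log(lam),                       # log
                lambda eta: math.exp(eta)),
}


def expected_response(family, eta):
    """Map a value of the linear predictor back to the expected response."""
    _, inverse = LINKS[family]
    return inverse(eta)
```

For example, `expected_response("binomial", 0.0)` returns an event probability of 0.5, since a log odds of zero corresponds to even odds.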
We define the estimands of interest in the two sub-populations as the treatment effects $\mu_+$ for the biomarker positive subgroup and $\mu_-$ for the biomarker negative subgroup. Without loss of generality, positive values of $\mu_+$ and $\mu_-$ indicate that the treatment is beneficial. For Normal response variables, these are mean treatment effects in the two subgroups, for Binomial responses they are log odds-ratios and for Poisson responses they are log rate-ratios. Estimators of these estimands can be obtained using maximum likelihood as $\hat\mu_+$ and $\hat\mu_-$. We can also estimate the approximate maximum likelihood (co-)variance components from the information matrix, so that for generalised linear models, approximately,

$$\begin{pmatrix}\hat\mu_+\\ \hat\mu_-\end{pmatrix} \sim N\left(\begin{pmatrix}\mu_+\\ \mu_-\end{pmatrix},\ \begin{pmatrix}\sigma_+^2 & \rho\sigma_+\sigma_-\\ \rho\sigma_+\sigma_- & \sigma_-^2\end{pmatrix}\right)$$

For inference for each group separately, we define Z-statistics

$$Z_+ = \hat\mu_+/\hat\sigma_+ \quad\text{and}\quad Z_- = \hat\mu_-/\hat\sigma_-$$

where $\hat\sigma_+$ and $\hat\sigma_-$ are estimates of the standard errors of $\hat\mu_+$ and $\hat\mu_-$ taken from the information matrix.
3.2 Likelihood assuming no correlation between $\hat\mu_+$ and $\hat\mu_-$
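In order to gain insight into the contribution of π, the proportion of trial patients drawn from the high-responding subgroup B+, it is useful to consider the case where the trial analysis is not adjusted for baseline factors, so that there is zero correlation between the subgroup estimators $\hat\mu_+$ and $\hat\mu_-$. In this case we can write

$$\hat\mu_+ \sim N(\mu_+, \sigma_+^2) \quad\text{and independently}\quad \hat\mu_- \sim N(\mu_-, \sigma_-^2)$$

Specifically, for the response $Y_i$, $i = 1, \dots, N$, with canonical link and assuming 1:1 randomisation between treatment arms (n patients per arm), the variance components can be approximated by:
Normal distribution-Identity link: $\sigma_+^2 = 2\sigma^2/(\pi n)$ and $\sigma_-^2 = 2\sigma^2/\{(1-\pi)n\}$, where $\sigma^2$ is the sampling variance.
Binomial distribution-Logit link: $\sigma_+^2 = \frac{1}{\pi n}\left[\frac{1}{\theta_{0+}(1-\theta_{0+})} + \frac{1}{\theta_{1+}(1-\theta_{1+})}\right]$ and $\sigma_-^2 = \frac{1}{(1-\pi)n}\left[\frac{1}{\theta_{0-}(1-\theta_{0-})} + \frac{1}{\theta_{1-}(1-\theta_{1-})}\right]$, where $\theta_{jk}$ is the probability of an event in treatment arm j and subgroup k.
Poisson distribution-Log link: $\sigma_+^2 = \frac{1}{\pi n}\left[\frac{1}{\lambda_{0+}} + \frac{1}{\lambda_{1+}}\right]$ and $\sigma_-^2 = \frac{1}{(1-\pi)n}\left[\frac{1}{\lambda_{0-}} + \frac{1}{\lambda_{1-}}\right]$, where $\lambda_{jk}$ is the event rate in treatment arm j and subgroup k.
We note that in all three cases the variance includes the term $1/\pi$ or $1/(1-\pi)$, which will facilitate investigation of the influence of the distribution of the sub-populations in the trial.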
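The approximations above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, `frac` stands for π when evaluating B+ and for 1 − π when evaluating B−, `n` is the number of patients per arm, and the Poisson version assumes unit follow-up per patient.

```python
def var_normal(sigma2, frac, n):
    """Variance of a difference in means with frac * n patients per arm."""
    return 2.0 * sigma2 / (frac * n)


def var_binomial(theta0, theta1, frac, n):
    """Large-sample variance of a log odds ratio (control, experimental)."""
    per_arm = frac * n
    return (1.0 / (theta0 * (1.0 - theta0))
            + 1.0 / (theta1 * (1.0 - theta1))) / per_arm


def var_poisson(lam0, lam1, frac, n):
    """Large-sample variance of a log rate ratio, unit follow-up assumed."""
    per_arm = frac * n
    return (1.0 / lam0 + 1.0 / lam1) / per_arm


# Effect of enrichment on the Normal-outcome variance with n = 100 per arm:
# a subgroup holding 20% of patients has four times the variance of one
# holding 80%.
var_small_subgroup = var_normal(1.0, 0.2, 100)
var_large_subgroup = var_normal(1.0, 0.8, 100)
```

In every case the subgroup variance is inversely proportional to its share of the sample, which is the dependence on π that the later sections explore.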
3.2.1 Making inferences in the full population in the general case.
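We write the full population treatment effect as

$$\mu_F = \pi\mu_+ + (1-\pi)\mu_- \qquad (2)$$

That is, the estimand for the full population is a weighted average of the subgroup specific estimands $\mu_+$ and $\mu_-$, with weights given by the proportion of the trial sample drawn from each sub-population, π and $1-\pi$. Note that, for μF to be directly interpretable, these proportions should hold in the target population, otherwise some translation is required.
Since $\hat\mu_F = \pi\hat\mu_+ + (1-\pi)\hat\mu_-$, with $\mathrm{Var}(\hat\mu_+) = \sigma_+^2$ and $\mathrm{Var}(\hat\mu_-) = \sigma_-^2$, we have

$$\sigma_F^2 = \mathrm{Var}(\hat\mu_F) = \pi^2\sigma_+^2 + (1-\pi)^2\sigma_-^2 + 2\pi(1-\pi)\,\mathrm{Cov}(\hat\mu_+, \hat\mu_-)$$

If ρ is the correlation between the estimands, so that $\mathrm{Cov}(\hat\mu_+, \hat\mu_-) = \rho\sigma_+\sigma_-$, the Z-statistic for the full population is

$$Z_F = \frac{\hat\mu_F}{\hat\sigma_F} = \frac{\pi\hat\mu_+ + (1-\pi)\hat\mu_-}{\sqrt{\pi^2\hat\sigma_+^2 + (1-\pi)^2\hat\sigma_-^2 + 2\pi(1-\pi)\rho\hat\sigma_+\hat\sigma_-}} \qquad (3)$$

If there are no additional covariates in the model (covariate coefficients $\gamma = 0$), then $\mathrm{Cov}(\hat\mu_+, \hat\mu_-) = 0$, so that $\sigma_F^2 = \pi^2\sigma_+^2 + (1-\pi)^2\sigma_-^2$ and the correlation ρ is zero.
Alternatively, if $\gamma \neq 0$ then $\mathrm{Cov}(\hat\mu_+, \hat\mu_-) \neq 0$, and the correlation ρ describes the association between estimated effects due to adjustment for covariates. In general, the correlation induced by covariance adjustment is expected to be small.
As an aside, we note that the correlation between the statistics $Z_+$ and $Z_-$ is also equal to ρ.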
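The weighted-average construction can be checked numerically. The sketch below uses a hypothetical helper `full_population` and made-up inputs (subgroup effects 0.6 and 0.3, both with standard error 0.2, and π = 0.5); none of these values come from the paper.

```python
import math


def full_population(mu_pos, mu_neg, sd_pos, sd_neg, pi, rho=0.0):
    """Return (mu_F, sd_F, Z_F) for the sample-weighted average effect.

    mu_pos, mu_neg: subgroup effect estimates for B+ and B-.
    sd_pos, sd_neg: their standard errors.
    pi: proportion of the trial sample drawn from B+.
    rho: correlation between the two subgroup estimates.
    """
    mu_f = pi * mu_pos + (1.0 - pi) * mu_neg
    var_f = ((pi * sd_pos) ** 2 + ((1.0 - pi) * sd_neg) ** 2
             + 2.0 * pi * (1.0 - pi) * rho * sd_pos * sd_neg)
    sd_f = math.sqrt(var_f)
    return mu_f, sd_f, mu_f / sd_f


# Illustrative values only.
mu_f, sd_f, z_f = full_population(0.6, 0.3, 0.2, 0.2, pi=0.5)
_, sd_f_corr, z_f_corr = full_population(0.6, 0.3, 0.2, 0.2, pi=0.5, rho=0.5)
```

With these inputs a positive correlation inflates the full-population standard error, so `z_f_corr` is smaller than `z_f` even though the point estimate is unchanged.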
3.3 Proposed rules for approval of the drug in B−
3.3.1 Sequential testing and conditional power.
We define and evaluate three proposed rules for approval in the lower response population B− conditional on significance in the full population. Recall that we expect the treatment to be as effective or less effective in the B− population, but nevertheless it may be sufficiently effective to warrant approval. In this situation, evaluation of B− is only worthwhile if a significant effect has been established in the full population. Thus, we adopt a sequential testing strategy, first evaluating treatment in the full population and, conditional on a significant result, evaluating the treatment in B−. The conditional power of a decision rule is a natural method for assessing its value in this context; for decision rule $R_n$, conditional power is defined as

$$CP(R_n) = P(R_n \mid Z_F > z_{1-\alpha}) = \frac{P(R_n \cap Z_F > z_{1-\alpha})}{P(Z_F > z_{1-\alpha})} \qquad (4)$$

where $Z_F$ is the Z-statistic in the full population and $z_{1-\alpha}$ is the critical value of its significance test.
(As an aside, for binary data where the analysis adjustment for baseline covariates is not required, this conditional probability can be calculated in closed form.)
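The denominator does not depend on the form of any proposed rule and is given by the distribution of the statistic in the full population

$$P(Z_F > z_{1-\alpha}) = 1 - \Phi\left(z_{1-\alpha} - \frac{\mu_F}{\sigma_F}\right) \qquad (5)$$

where $\Phi$ is the standard normal distribution function and $z_{1-\alpha}$ is the critical value for significance in the full population. Note that the right-hand side is a function of three location parameters $\mu_+$, $\mu_-$ and π, and three variance parameters $\sigma_+$, $\sigma_-$ and ρ (see equations (2) and (3)), and conditional on these the standard normal probability can be obtained from any statistical software.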
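As a sketch of how this denominator might be evaluated, the fragment below uses Python's standard library. It assumes that the full-population statistic is approximately normal with mean μF/σF and unit variance, and that significance means exceeding a critical value of 1.96; the function name `power_full` and that default are illustrative assumptions.

```python
from statistics import NormalDist


def power_full(mu_f, sd_f, crit=1.96):
    """P(Z_F > crit) when Z_F is approximately N(mu_f / sd_f, 1)."""
    return 1.0 - NormalDist().cdf(crit - mu_f / sd_f)


# When the expected statistic equals the critical value, power is one half.
half_power = power_full(1.96, 1.0)
```

Larger expected effects or smaller standard errors move the expected statistic further above the critical value and so increase this probability.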
We now consider conditional power of three classes of decision rule for approval of the drug in , summarised in Table 1.
Table 1.
Rule | Condition | Explanation |
---|---|---|
1 | $Z_- > L$ | The Z-statistic in B− must exceed a pre-defined threshold L |
2 | $Z_F > Z_+$ | Results from B− must increase overall significance |
3 | $(\hat\mu_+ - \hat\mu_-)/S_D < z_{1-\alpha_I}$ | No significant subgroup-treatment interaction at αI level |
Rules 1 and 2 are different types of influence rule, whilst Rule 3 is an interaction test. The algebra for calculating the conditional power of each rule is given in full in Appendix 1.
3.4 Rule 1: the statistic $Z_-$ exceeds a pre-defined threshold L
For the treatment to be acceptable in the B− subgroup, the significance of the average treatment effect should exceed a pre-defined minimum value, say $Z_- > L$. Given prior estimates of $\mu_-$ and its standard deviation, we could calculate the sample size to ensure this threshold is achieved with a given probability. As an absolute minimum, the treatment effect and associated statistic should be positive (L = 0, assuming without loss of generality that positive effects signify treatment benefit), although such a mild condition is unlikely to be acceptable unless the drug has negligible adverse effects and little cost. Alternatively, setting L = 1.96 is a strict condition requiring a significant treatment effect in the subgroup at traditionally accepted levels. This is tantamount to repeating the trial in the sub-population and will not be feasible if either this sub-population is small or difficult to recruit from, or the treatment effect is modest. Despite this, for serious diseases with no alternative effective treatments, a smaller expected treatment effect may be sufficient to outweigh any concerns regarding safety and cost; in this case a value of L should be pre-defined. In general, we may accept an intermediate value for L.
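For rule 1, conditional power is given by the expression

$$CP(R_1) = \frac{P(Z_- > L,\ Z_F > z_{1-\alpha})}{P(Z_F > z_{1-\alpha})}$$

where $Z_F$ is the Z-statistic in the full population, $z_{1-\alpha}$ is the critical value of its significance test, and the denominator is defined in equation (5) and can be obtained from standard statistical software.
For the numerator $P(Z_- > L,\ Z_F > z_{1-\alpha})$, we transform the two statistics to a pair of variables (X, Y) and show in Appendix 1 that the joint distribution of (X, Y) is bivariate normal.
Again, the numerator of the conditional power is a function of three location and three variance parameters $\mu_+$, $\mu_-$, π, $\sigma_+$, $\sigma_-$, ρ, through μF and σF. Conditional on these, the numerator is a bivariate normal probability and can be obtained from any statistical software.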
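One way to evaluate the Rule 1 conditional power without specialist software is sketched below. It is not the Appendix 1 derivation: it assumes that the subgroup statistics are bivariate normal with unit variances and correlation ρ (with |ρ| < 1), that the rule is "Z-statistic in B− exceeds L", that significance means the full-population statistic exceeds 1.96, and it integrates over the B− statistic with Simpson's rule. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()


def rule1_conditional_power(mu_pos, mu_neg, sd_pos, sd_neg, pi, rho=0.0,
                            L=1.0, crit=1.96, steps=4000):
    """P(Z_neg > L | Z_F > crit) by one-dimensional Simpson integration."""
    sd_f = math.sqrt((pi * sd_pos) ** 2 + ((1 - pi) * sd_neg) ** 2
                     + 2 * pi * (1 - pi) * rho * sd_pos * sd_neg)
    # Z_F is a weighted sum of the subgroup statistics: w_pos*Z_pos + w_neg*Z_neg
    w_pos, w_neg = pi * sd_pos / sd_f, (1 - pi) * sd_neg / sd_f
    m_pos, m_neg = mu_pos / sd_pos, mu_neg / sd_neg   # expected Z-statistics
    cond_sd = math.sqrt(1 - rho ** 2)

    def integrand(y):
        # Density of Z_neg at y, times P(Z_F > crit | Z_neg = y).
        need = (crit - w_neg * y) / w_pos             # Z_pos must exceed this
        cond_mean = m_pos + rho * (y - m_neg)
        return PHI.pdf(y - m_neg) * (1 - PHI.cdf((need - cond_mean) / cond_sd))

    lo, hi = max(L, m_neg - 8.0), m_neg + 8.0         # truncate far tails
    if lo >= hi:
        return 0.0
    h = (hi - lo) / steps                             # steps must be even
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    numerator = total * h / 3
    mu_f = pi * mu_pos + (1 - pi) * mu_neg
    denominator = 1 - PHI.cdf(crit - mu_f / sd_f)
    return numerator / denominator


# Expected Z-statistics of 2 (B+) and 1 (B-), equal subgroup sizes, rho = 0.
cp_by_threshold = {L: rule1_conditional_power(2.0, 1.0, 1.0, 1.0, 0.5, L=L)
                   for L in (0.0, 1.0, 1.96)}
```

Raising L can only shrink the region counted in the numerator, so the values in `cp_by_threshold` decrease as L increases.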
3.5 Rule 2: The B− data should increase statistical significance
For interventions with a low adverse event profile, approval may be acceptable provided that the data in B− are not in conflict with those in B+. More formally, we might approve in B− provided that the B− data increase statistical significance, that is, on condition that $Z_F > Z_+$. For rule 2 the conditional power has denominator defined in equation (5) and numerator given by $P(Z_F > Z_+,\ Z_F > z_{1-\alpha})$, where $z_{1-\alpha}$ is the critical value for significance in the full population.
After transforming the statistics to a pair of variables X and Y, we show in Appendix 1 that the joint distribution of X and Y is bivariate normal.
For the conditional power numerator, we calculate the corresponding bivariate normal probability using standard statistical software, conditional on subgroup-specific parameters.
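The Rule 2 conditional power can also be approximated by simulation, which avoids the transformation altogether. The sketch below is a Monte Carlo illustration under assumptions that are not taken from the paper: subgroup estimates are drawn as correlated normal variables with known standard errors, the rule is read as "the full-population statistic exceeds the B+ statistic", the significance threshold is 1.96, and the function name, seed and number of draws are arbitrary.

```python
import math
import random


def rule2_conditional_power(mu_pos, mu_neg, sd_pos, sd_neg, pi, rho=0.0,
                            crit=1.96, draws=100_000, seed=1):
    """Monte Carlo estimate of P(Z_F > Z_pos | Z_F > crit)."""
    rng = random.Random(seed)
    sd_f = math.sqrt((pi * sd_pos) ** 2 + ((1 - pi) * sd_neg) ** 2
                     + 2 * pi * (1 - pi) * rho * sd_pos * sd_neg)
    root = math.sqrt(1 - rho ** 2)
    significant = hits = 0
    for _ in range(draws):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        est_pos = mu_pos + sd_pos * e1
        est_neg = mu_neg + sd_neg * (rho * e1 + root * e2)   # correlation rho
        z_pos = est_pos / sd_pos
        z_f = (pi * est_pos + (1 - pi) * est_neg) / sd_f
        if z_f > crit:                  # condition on overall significance
            significant += 1
            hits += z_f > z_pos         # B- data increased significance
    return hits / significant if significant else float("nan")


# Strong versus harmful effect in B-, same effect in B+ (illustrative values).
cp_strong_neg = rule2_conditional_power(2.0, 3.0, 1.0, 1.0, 0.5)
cp_harmful_neg = rule2_conditional_power(2.0, -1.0, 1.0, 1.0, 0.5)
```

As expected, the rule is far more likely to be met when the B− effect is large than when it points in the wrong direction.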
3.6 Rule 3: No significant subgroup-treatment interaction at αI level
When a range of subgroup effects are explored (often post hoc), it is customary to perform interaction tests to identify specific subgroups for which the treatment appears particularly effective/ineffective for further investigation. From our targeted systematic review of literature, the type I error rate αI is almost invariably set to 5%, with no adjustment for multiplicity. Our objective here is quite different; specifically we use αI as a measure of how confident we are that the two subgroups have different treatment effects, in order to decide whether approval in B− is warranted. In this case, we might choose a value for αI that is greater than 5%, depending on our knowledge of the variation of the treatment effects and the number of trial participants in each of the four subgroup-treatment combinations.
For this rule, the numerator is $P\{(\hat\mu_+ - \hat\mu_-)/S_D < z_{1-\alpha_I},\ Z_F > z_{1-\alpha}\}$, where $z_{1-\alpha_I}$ is the quantile from the standard normal distribution, $z_{1-\alpha}$ is the critical value for significance in the full population and $S_D$ is the standard error of the difference in treatment effects in the two subgroups. We again transform to a pair of variables X and Y and show in Appendix 1 that the joint distribution of X and Y is bivariate normal.
Again, the numerator for conditional power is a bivariate normal probability, which we obtain from standard statistical software, conditional on subgroup-specific parameters.
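A possible numerical treatment of Rule 3 is sketched below. It assumes a one-sided interaction test (standardised difference below the 1 − αI normal quantile), a full-population critical value of 1.96, and that the standardised difference and the full-population statistic are jointly normal and not perfectly correlated; the function name and defaults are hypothetical and the integration is by Simpson's rule.

```python
import math
from statistics import NormalDist

PHI = NormalDist()


def rule3_conditional_power(mu_pos, mu_neg, sd_pos, sd_neg, pi, rho=0.0,
                            alpha_i=0.10, crit=1.96, steps=4000):
    """P(interaction not significant | Z_F > crit), one-sided interaction."""
    cov = rho * sd_pos * sd_neg
    sd_f = math.sqrt((pi * sd_pos) ** 2 + ((1 - pi) * sd_neg) ** 2
                     + 2 * pi * (1 - pi) * cov)
    sd_d = math.sqrt(sd_pos ** 2 + sd_neg ** 2 - 2 * cov)   # SE of difference
    # Covariance between the difference and the full-population estimate.
    cov_df = pi * sd_pos ** 2 - (1 - pi) * sd_neg ** 2 + (1 - 2 * pi) * cov
    r = cov_df / (sd_d * sd_f)
    m_d = (mu_pos - mu_neg) / sd_d       # expected standardised difference
    m_f = (pi * mu_pos + (1 - pi) * mu_neg) / sd_f
    q = PHI.inv_cdf(1 - alpha_i)
    cond_sd = math.sqrt(1 - r ** 2)

    def integrand(x):
        # Density of Z_F at x, times P(standardised difference < q | Z_F = x).
        cond_mean = m_d + r * (x - m_f)
        return PHI.pdf(x - m_f) * PHI.cdf((q - cond_mean) / cond_sd)

    lo, hi = max(crit, m_f - 8.0), m_f + 8.0
    if lo >= hi:
        return float("nan")              # overall significance is implausible
    h = (hi - lo) / steps                # steps must be even
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return (total * h / 3) / (1 - PHI.cdf(crit - m_f))


# Normal outcome, common sampling variance 1, n = 100 per arm, 20% in B+.
sd_pos, sd_neg = math.sqrt(2 / (0.2 * 100)), math.sqrt(2 / (0.8 * 100))
cp_rule3 = rule3_conditional_power(0.6, 0.3, sd_pos, sd_neg, pi=0.2)
unconditional = PHI.cdf(PHI.inv_cdf(0.9) - 0.3 / math.sqrt(sd_pos ** 2 + sd_neg ** 2))
```

In this particular scenario (independent estimates, common sampling variance) the conditional and unconditional values coincide up to numerical error, because the difference and the full-population estimate are uncorrelated.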
3.7 Illustration of proposed rules
Figure 1 illustrates the proposed rules for the (hypothetical) case of independent normally distributed estimands for the two subgroups (on the scale of analysis e.g. linear, log, logistic). The statistics $Z_+$ and $Z_-$ have a bivariate normal distribution with mean (2, 1) and variance diag(1), represented by the unlabelled contours. The condition that the hypothesis test is significant in the full population is represented by the volume under the ($Z_+$, $Z_-$) joint density that lies above and to the right of the red line. The volumes under the ($Z_+$, $Z_-$) joint density that lie above the purple, blue and green lines represent Rule 1 for the cases where L = 1.96 (treatment effect in B− is significant), L = 1 (treatment effect in B− is one standard error) and L = 0 (treatment effect is positive), respectively. The volume that lies above the orange line represents Rule 2 (B− results add to the overall significance) and the volume that lies above the pink line represents Rule 3 (interaction not significant at the one-sided 10% significance level). Conditional power for a certain rule, e.g. Rule 1 with L = 1, is the proportion of the total probability above the red line (significance in the full population) that also lies above the boundary defined by the rule (e.g. above the blue line, so that $Z_- > 1$).
In general, the conditional power for each of these three rules is given by the proportion of the density of (X, Y) that is consistent with the condition defining the rule. As a specific example, consider Rule 1. Figure 2 shows the conditional power for Rule 1 for the case where estimands for B+ and B− are independent (ρ = 0), participants are drawn in equal numbers from the two sub-populations (π = 0.5) and thresholds for approval are set at different values of L.
Because the significance condition in the full population enters into the power calculation only through X, it is independent of the approval threshold L, so that the Y-axis does not change position for different values of L. In contrast, as L increases, the joint density of (X, Y) shifts down the Y-axis and the proportion above zero decreases, thus decreasing the conditional power as expected. As the two estimands are assumed independent, the contours are circular. From the covariance matrix for Rule 1, X and Y will be positively correlated if the two estimands $\mu_+$ and $\mu_-$ are (ρ > 0) and vice versa.
Similar patterns can be found for Rules 2 and 3.
4 Results
4.1 Comparison of conditional power for the proposed rules
The conditional power for all three rules depends on the relative treatment effects in the two subgroups, which is driven by the biological mechanisms of the treatment. Beyond this, we explore how the proportion sampled from each sub-population, π and 1 − π, and the correlation between estimands ρ affect the power of the proposed rules.
4.2 Comparison of conditional power for proposed rules when π = 0.5 and ρ = 0
The top row of Figure 3 shows the relationship between the subgroup-specific statistics for treatment effect and conditional power in the simple case of equal standard errors, equal numbers in the two subgroups and independent estimands. For any value of $Z_-$, power decreases as $Z_+$ increases, due to conditioning on observed $Z_F$; that is, the probability of significance in the full population increases as $Z_+$ increases, thereby increasing the denominator in equation (4). We note that the contour lines curve for Rules 1 and 2, but for Rule 3, which is based on the interaction test, they are straight. The linear contours for Rule 3 do not hold if the estimands for the subgroups are non-independent, nor will they hold if data are binary or counts, see Appendix 1 for further details.
For this illustration, we set L = 1 for Rule 1 and one-sided significance of the interaction to 0.1 for Rule 3. Changing these thresholds will result in a shift in the contour plots, whilst Rule 2 does not require specification of an additional parameter and is fixed. It is possible to closely align the three rules by choosing appropriate values for L and the interaction type I error if necessary.
4.3 Effect of correlation between estimands when π = 0.5 and ρ = 0.5
Including additional covariates in the trial analysis in equation (1) induces correlation between the treatment effect estimands (and therefore the statistics) for the two subgroups. The bottom row of Figure 3 shows how the conditional power changes when there is moderate/strong correlation (ρ = 0.5) between the treatment effect estimates, compared with no correlation in the top row. In all cases, the contours are closer together. Conditional power for Rule 1 changes only slightly since it relies largely on the absolute size of $Z_-$, whilst Rules 2 and 3 rely on both groups to a greater extent.
We note here that, if the two estimands are independent (ρ = 0) and arise from normally distributed data with common sampling variance in the two subgroups, then conditional and unconditional power for Rule 3 are identical, since X and Y in Rule 3 will also be independent. A brief proof is given in Appendix 1. However, for non-normal data or normal data with differential variance in the two subgroups or correlated $\hat\mu_+$ and $\hat\mu_-$, conditional and unconditional power for the interaction test will not be identical.
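The independence claim can be sketched in one line of algebra. The derivation below is an illustration, not the Appendix 1 proof; it assumes ρ = 0, n patients per arm, a proportion π of the sample drawn from B+, and Normal-outcome variances of the subgroup estimates equal to $2\sigma^2/(\pi n)$ in B+ and $2\sigma^2/\{(1-\pi)n\}$ in B−, with the full-population estimate taken as the π-weighted average of the two subgroup estimates.

```latex
\begin{aligned}
\mathrm{Cov}(\hat\mu_+ - \hat\mu_-,\ \hat\mu_F)
  &= \mathrm{Cov}\{\hat\mu_+ - \hat\mu_-,\ \pi\hat\mu_+ + (1-\pi)\hat\mu_-\} \\
  &= \pi\sigma_+^2 - (1-\pi)\sigma_-^2 \\
  &= \frac{2\sigma^2}{n} - \frac{2\sigma^2}{n} = 0 .
\end{aligned}
```

Because the difference and the full-population estimate are jointly normal, zero covariance implies independence, so conditioning on full-population significance does not change the distribution of the interaction statistic. The cancellation holds for any π, but fails as soon as the sampling variances differ between subgroups or ρ is non-zero.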
4.4 Effect of relative subgroup size when sampling variance is homoscedastic across subgroups and ρ = 0.5
We illustrate the influence of subgroup size on conditional power for the case where sampling variance is homoscedastic across biomarker subgroups. This will occur for normally distributed outcomes with the same sampling variance in each subgroup, but does not necessarily hold for other distributions. Figure 4 shows how conditional power for each rule varied if either 20% or 80% of patients came from population B+. The proportion of patients sampled from the two sub-populations has a greater impact on conditional power than correlation between the two treatment effects.
If the trial sample contains a high proportion drawn from the B+ population (bottom row of Figure 4), then Rule 1 (with L = 1) has lower power for small values of $Z_-$. This arises because we are conditioning on significance in the full population, and the B− patients contribute only a small amount to the overall result. Conversely, because B− patients contribute a large amount to the overall analysis in the top row of Figure 4, the overall significance only occurs if there is good power that the observed $Z_-$ exceeds L. Similar effects are observed for Rules 2 and 3, in that the overall-significance condition induces higher conditional power for approving treatment in B− at lower values of $Z_-$.
To provide further insight into the effect of relative sample size of the two subgroups, we plot conditional power against $Z_-$ for three different sampling proportions, 20:80, 50:50 and 80:20 with (ρ = 0) in Figure 5. This shows that, if the proportion of patients from B+ is low (20%) or equal (50%), Rule 1 (with L = 1) has highest conditional power across the plausible range of values of $Z_-$. Conversely, if the proportion of patients from B+ is high (80%), Rule 3 has uniformly highest conditional power. Rule 2 never has highest power in these scenarios, but we note that results are dependent on the chosen values of L for Rule 1 and the Type I error for interaction test, whilst Rule 2 has the advantage of not requiring additional parameters.
5 Illustrative applications
In order to illustrate how these rules might be used in practice, we retrospectively apply them to two completed phase III trials: a small cardiac surgery trial of 352 patients and equal size subgroups,19 and a much larger stroke trial (n = 7513) which evaluated betrixaban.13
5.1 AMAZE trial in cardiac surgery
AMAZE was a cardiac surgical trial in patients with atrial fibrillation (rapid/irregular heart rhythm).19 This multi-centre RCT randomised 352 patients 1:1 to a technique called ablation in addition to planned surgery, or to planned surgery alone (control arm). The primary outcome was return to sinus rhythm at one year post-surgery (binary). Although not part of the intervention, at the discretion of the operating surgeon 150/352 (42.6%) randomised patients had a section of the heart, called the left atrial appendage (LAA), removed during the procedure. This may be considered a more extensive procedure, conferring higher probability of a positive outcome. We define subgroups by whether the LAA was removed (B+) or left intact (B−). Selected results from the AMAZE trial are shown in Table 2.
Table 2.
Patient Group | LAA left intact | LAA removed | Full population F |
---|---|---|---|
Estimand | across subgroups | ||
Log odds ratio (standard error) | 0.461 (0.378) | 1.406 (0.472) | 0.863 (0.320) |
Z-statistic | 1.220 | 2.981 | 2.697 |
We estimate conditional power for approval of ablation in B− (LAA left intact), assuming that the actual trial results were our estimates of group-specific effects before the trial started. For Rule 1, there would be 74% power to observe a treatment effect in B− if the threshold was set at L = 1. Rule 2 is not met in AMAZE, since the full-population Z-statistic (2.697) is lower than that for the LAA-removed subgroup (2.981); assuming observed trial results were “true”, conditional power was only 38%. Finally, Rule 3 (the interaction rule) was just met if we set the decision threshold for the two-sided test statistic to 10% (interaction test p = 0.119). Conditional power based on observed results was 37% for Rule 3; had this been specified during design of the trial, sample size could have been increased to increase confidence that Rule 3 would be met.
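The rule checks in this example can be reproduced from the summary statistics in Table 2. The following is a minimal Python sketch rather than the authors' software: the function names are hypothetical, the interaction test assumes independent subgroup estimates (ρ = 0), and the default thresholds (L = 1 and a 10% two-sided interaction level) are those quoted in the text. It does not reproduce the conditional power figures.

```python
from math import sqrt
from statistics import NormalDist

# Table 2 inputs: (log odds ratio, standard error) from the AMAZE trial.
LOW = (0.461, 0.378)    # LAA left intact (lower-response subgroup)
HIGH = (1.406, 0.472)   # LAA removed (higher-response subgroup)
FULL = (0.863, 0.320)   # full trial population


def z_stat(estimate, se):
    """Wald Z-statistic: estimate divided by its standard error."""
    return estimate / se


def interaction_p_two_sided(low, high):
    """Two-sided p-value for the difference between subgroup effects.

    Assumption: the two subgroup estimates are independent (rho = 0).
    """
    diff = high[0] - low[0]
    se_diff = sqrt(low[1] ** 2 + high[1] ** 2)
    return 2 * (1 - NormalDist().cdf(abs(diff / se_diff)))


def check_rules(low, high, full, L=1.0, alpha_int=0.10):
    """Return whether each rule is met for the lower-response subgroup."""
    return {
        # Rule 1: minimum standardised effect in the low-response subgroup
        "rule1": z_stat(*low) > L,
        # Rule 2: adding the low-response subgroup increases significance
        "rule2": z_stat(*full) > z_stat(*high),
        # Rule 3: interaction test is non-significant at level alpha_int
        "rule3": interaction_p_two_sided(low, high) > alpha_int,
    }


amaze = check_rules(LOW, HIGH, FULL)
p_int = interaction_p_two_sided(LOW, HIGH)
print(amaze, round(p_int, 3))
```

With these inputs Rules 1 and 3 are met and Rule 2 is not, in line with the conclusions above; the interaction p-value from the rounded table entries is close to, but need not equal, the published value.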
5.2 APEX trial in patients at high risk of stroke
Recall that one of the trials motivating this study compared treatment with the anticoagulant betrixaban with standard treatment of enoxaparin amongst hospitalised medically ill patients.13
The primary outcome was a composite of clinical events caused by blood clotting (deep vein thrombosis, non-fatal pulmonary embolism or death from thromboembolism) up to day 42 post-randomisation.
The planned analysis took a sequential testing approach, but rather than starting with the full trial population, the order of testing began with a subgroup with a high chance of treatment response (but smaller treatment effect), followed by testing in two other pre-specified, progressively inclusive cohorts as follows:
1. Compare treatment arms in patients with elevated D-dimer level (for illustration, this can be considered B−).
2. Compare treatment arms in patients with elevated D-dimer or age (for illustration, this can be considered an extended B−).
3. Compare treatment arms in all enrolled patients (full population, n = 7513).
If any test was negative, all subsequent tests were reported as exploratory. We provide selected results from the original trial publication in Table 3.
Table 3.
Patient group | Elevated D-dimer | Non-elevated D-dimer | Full Population F |
---|---|---|---|
Events in treated patients | 132/1914 (6.9%) | 33/1198 (2.8%) | 165/3112 (5.3%) |
Events in control patients | 166/1956 (8.5%) | 67/1218 (5.5%) | 223/3174 (7.0%) |
Relative risk (95%CI) | 0.81 (0.65, 1.00) | 0.50 (0.33, 0.75) | 0.76 (0.63, 0.92) |
Log relative risk (standard error) | −0.21 (0.11) | −0.69 (0.21) | −0.28 (0.10) |
Z-statistic | 1.85 | 3.31 | 2.83 |
As the table shows, the first analysis, which included only patients with elevated D-dimer, was not significant at the traditional threshold (p = 0.054), so subsequent analyses were treated as exploratory, even though the experimental treatment effect was greater in the non-elevated D-dimer subgroup and in the overall analysis. Adopting a sequential strategy, conditional on a significant effect of betrixaban in the full population, our proposed rules for the elevated D-dimer subgroup would result in the following recommendations:
Rule 1: treatment in the elevated D-dimer subgroup would be approved if a one standard error threshold (L = 1) was defined a priori as a clinically acceptable treatment effect. Assuming the trial result was “true”, as for the previous example, the conditional power of this rule was 91%. If a significant effect in this subgroup was necessary, as suggested by the original trial analysis, then Rule 1 was not met, since the observed Z-statistic of 1.85 fell short of conventional significance.
Rule 2 was also not met, because the elevated D-dimer data resulted in a decrease in the Z-statistic for the full population (2.83) compared with the non-elevated D-dimer subgroup alone (3.31). This has arisen because non-elevated D-dimer patients had a much higher treatment effect despite the lower event rate. The conditional power of this rule was only 34%.
There was a significant interaction between subgroup and treatment (one-sided p = 0.0184) so that this trial also fails Rule 3 – the conditional power was 43%.
In summary, using our proposed sequential testing procedure, betrixaban would be recommended for treatment in elevated D-dimer patients only if we were prepared to accept a lower treatment effect than in non-elevated D-dimer patients, and only if the Rule 1 threshold was set at or below the observed Z-statistic of 1.85 standard errors (L ≤ 1.85).
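The Z-statistics in Table 3 can be recovered from the event counts using the usual large-sample standard error for a log relative risk. The sketch below is illustrative only: the function names are hypothetical, the sign of Z is flipped so that fewer events on treatment counts as benefit, and the value 1.96 used for the "significant effect" version of Rule 1 is an assumed conventional two-sided 5% critical value rather than a threshold stated in the source.

```python
from math import log, sqrt

# (events, patients) by arm, taken from Table 3 of the APEX illustration.
COUNTS = {
    "elevated":     {"treated": (132, 1914), "control": (166, 1956)},
    "non_elevated": {"treated": (33, 1198),  "control": (67, 1218)},
    "full":         {"treated": (165, 3112), "control": (223, 3174)},
}


def log_rr_and_se(treated, control):
    """Log relative risk and its large-sample (delta-method) standard error."""
    a, n1 = treated
    c, n0 = control
    log_rr = log((a / n1) / (c / n0))
    se = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return log_rr, se


def z_benefit(treated, control):
    """Z-statistic signed so that fewer events on treatment is positive."""
    log_rr, se = log_rr_and_se(treated, control)
    return -log_rr / se


z = {name: z_benefit(**arms) for name, arms in COUNTS.items()}

rule1_L1 = z["elevated"] > 1.0         # Rule 1 with L = 1
rule1_sig = z["elevated"] > 1.96       # Rule 1 with assumed L = 1.96
rule2 = z["full"] > z["non_elevated"]  # Rule 2: low group adds significance

for name, value in z.items():
    print(f"{name}: Z = {value:.2f}")
print(rule1_L1, rule1_sig, rule2)
```

Computed this way, the three Z-statistics agree with Table 3 to two decimal places, Rule 1 is met only for the lower threshold, and Rule 2 is not met.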
5.3 Implications for trial design
In this context, investigators define and document decision rules for the primary trial analysis during the design stage. Our proposal is that, should a separate decision on approval of a subgroup be required, then a decision rule should be agreed in discussion with regulators or other appropriate decision makers during the design phase. Our evaluation of three potential rules illustrates how to investigate the efficiency of different rules, although parameter inputs will be specific to each trial and will depend on available information around potential efficacy.
Although our rules rely on Z-statistics for hypothesis tests, it is more usual to work with potential treatment effects and their standard deviations when designing a trial. Empirical estimates of variation in the primary outcome are typically available, particularly for the control arm of the trial. This may be a standard deviation for a continuous outcome, or the baseline risk of an event for patients receiving the current best treatment. Given these estimates, the sample size required for an overall significant treatment effect, the proposed sampling proportion, and the Rule 1 threshold L can be decided to ensure that the treatment effect in subgroup B− lies above a minimum treatment effect. In a similar way, the Rule 3 significance level can be chosen to ensure there is sufficient power to find an interaction if the estimated effect in B− is much lower than that in B+.
The stages of design are as follows:
1. Using initial estimates of design parameters, including the sampling proportion π and correlation ρ, calculate the power of the test for the expected value of the treatment effect in the full trial population.
2. Choose a decision rule for recommending treatment in B− based on considerations of clinically important treatment effects, safety and biological mechanisms.
3. Given the sampling proportion π and the expected treatment effect sizes in the two sub-populations, calculate the power of your preferred rule, conditional on a significant overall test.
4. Calculate the power of the sequential testing strategy as the product of the unconditional power from step 1 and the conditional power from step 3.
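The design stages above can be sketched by simulation. The following Python example is a hedged illustration, not the closed-form method of Appendix 1. It assumes that the subgroup estimates are bivariate normal, that the full-population estimate is the weighted average of the subgroup estimates (with `prop_low` the proportion of patients in the lower-response subgroup), and that all tests are one-sided. All names, defaults and input values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def design_power(mu_low, mu_high, se_low, se_high, prop_low, rho=0.0,
                 L=1.0, alpha=0.025, alpha_int=0.05, n_sim=100_000, seed=1):
    """Monte Carlo estimate of overall, conditional and sequential power.

    Assumptions (illustrative, not taken from the paper's equations):
    subgroup estimates are bivariate normal with means mu_*, standard
    errors se_* and correlation rho; the full-population estimate is the
    prop_low-weighted average of the two; all tests are one-sided.
    """
    rng = random.Random(seed)
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)      # critical value, overall test
    z_int = nd.inv_cdf(1 - alpha_int)   # critical value, interaction test
    w = prop_low
    se_full = sqrt((w * se_low) ** 2 + ((1 - w) * se_high) ** 2
                   + 2 * w * (1 - w) * rho * se_low * se_high)
    se_diff = sqrt(se_low ** 2 + se_high ** 2 - 2 * rho * se_low * se_high)

    n_overall = 0
    met = {"rule1": 0, "rule2": 0, "rule3": 0}
    for _ in range(n_sim):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        est_low = mu_low + se_low * e1
        est_high = mu_high + se_high * (rho * e1 + sqrt(1 - rho ** 2) * e2)
        z_full = (w * est_low + (1 - w) * est_high) / se_full
        if z_full <= z_crit:
            continue                    # overall test failed: stop here
        n_overall += 1
        met["rule1"] += est_low / se_low > L
        met["rule2"] += z_full > est_high / se_high
        met["rule3"] += (est_high - est_low) / se_diff < z_int

    overall = n_overall / n_sim                                   # step 1
    conditional = {k: v / n_overall for k, v in met.items()}      # step 3
    sequential = {k: overall * v for k, v in conditional.items()} # step 4
    return overall, conditional, sequential


# Hypothetical design inputs, chosen only to exercise the function.
overall, conditional, sequential = design_power(
    mu_low=0.3, mu_high=0.5, se_low=0.15, se_high=0.15, prop_low=0.5)
print(round(overall, 3), conditional, sequential)
```

Simulation is used here because it makes each stage explicit; for routine design work the bivariate normal expressions in Appendix 1 avoid Monte Carlo error.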
In practice the final power calculations will require an iterative process between calculation and elicitation of expert clinical knowledge of treatment effects and associated variance components, finalised in discussion with regulators or other decision makers.
6 Discussion
6.1 Overview of results
Frequentist rules to assess whether approval of a new treatment should be accepted in a lower response subgroup, conditional on a significant effect overall, have been developed and evaluated. Approval based solely on a significant overall test may be unacceptable if there are severe side effects and/or if the subgroup drawn from the low response population is under-represented due to enrichment sampling. Rules are based either on measures of influence, such as the size of the effect in this subgroup, or the increase in significance due to inclusion of the subgroup, or on the difference in effect size between the groups (interaction). When choosing a rule during trial design, as well as specifying estimates of the expected outcomes and their variance components, investigators must either take a random sample from the full population, in which case the trial will represent clinical practice, or decide the proportion of patients to be sampled from each sub-population. Using conditional power as a measure of efficiency, the proportion of patients drawn from each sub-population had a large impact, but correlation between the groups induced by covariate adjustment was less important. For all rules, conditional power decreased as μB+/σB+ increased for fixed μB−/σB−.
6.2 Discussion of individual rules
After ensuring that , the simplest approach is to perform tests in and separately as part of a closed testing procedure, and allow a more relaxed significance level for (Rule 1). This significance level is related to both the proportion of the trial sample in and the effect size that is acceptable given the safety profile of the treatment. Hence, the level can be set based on prior knowledge of treatment effect and prevalence of low responders in the population. In our illustration, for trials with of trial patients in , Rule 1 had the highest power for our chosen threshold of .
Rule 2 ( patients should increase significance) is rather ad hoc but has the benefit of not requiring specification of an additional parameter. To satisfy Rule 2, must preserve at least some proportion of the estimated efficacy of . Further, the conditional power of Rule 2 decreases as the proportion of patients from decreases, which is also an attractive property. Despite these benefits, Rule 2 never demonstrated highest conditional power in our analyses.
Rule 3 uses an interaction test with a relaxed significance level to recommend approval in the sub-population. Interaction tests in clinical trial publications typically aim to identify heterogeneity between subgroups and are mainly purely exploratory. However, a significant subgroup-treatment interaction at the 5% level may not preclude approval in the lower response group, provided that there is a minimum level of efficacy, particularly if side effects are mild or there are few alternatives for this subgroup. Our more targeted objective is equivalent to testing whether the difference in treatment effects between the two groups is within acceptable limits and can be reconstructed as an equivalence or non-inferiority test. That is, the (interaction test) significance level can be based on a priori estimates of the maximum acceptable difference between the two subgroups ( in equation 1). In our analyses, for trials where of patients arise from , Rule 3 had the highest power to approve B− given an overall significant result.
6.3 Regulator input
In practice, acceptability of these approaches will depend on regulators (for drug trials) or commissioners (for academic trials). Since the conditional power of the rules depends crucially on the values chosen for the parameters L and αI, as well as patient sampling, prevalence of high/low responders and analysis methods, early engagement with regulators/commissioners to discuss these decision rules is worthwhile. Discussions also need to consider potential harms (side effects), in order to set realistic and acceptable targets for efficacy. In practice, investigators/sponsors will be required to pre-specify and document these decision rules in discussion with regulators.
6.4 Strength, weaknesses and future research
One benefit of our proposed decision rules is that closed-form expressions for conditional power are available for continuous, binary and count outcomes (assuming known variances). This makes estimation of sample sizes relatively simple, and a wide range of scenarios can be explored during the design phase.
In our examples, we used retrospective power calculations to show the differences in conditional power for the three rules based on trial results. We stress that these calculations were provided for illustration only and we do not endorse retrospective power calculations to aid interpretation of statistically non-significant trial results (see for example Hoenig and Heisey20).
In common with many statistical methods, there is an underlying assumption of normality when using generalised linear models. This will hold for most adequately powered, phase III trials where analysis is completed on a scale for which the sampling distributions of estimated coefficients can be assumed normal (e.g. logistic, log). For small trials, or for estimands with very skewed distributions, asymptotic approximations may not hold and analyses should be checked using simulations.
In this paper, we provided expressions for the case where patients were randomised 1:1 to the experimental and control arms, although extension to other allocation ratios is straightforward. It would also be relatively straightforward to extend the methods to biomarkers with more than two levels, although the number of patients at each level is likely to be small in this case, resulting in low conditional power for all proposed rules. An exception might be for biomarkers with ordered levels, in which case the subgroup effect and interaction with treatment could be linear terms in the analysis (equation (1)).
For time-to-event outcomes, power of the study depends directly on the number of events occurring rather than on the number of patients, so that power would also depend on recruitment and censoring patterns. Methods would need to be extended to accommodate these features. Further, we have not embedded these results in more formal decision analytic methods, and this would require further specification of costs, harms (side effects) and utilities (benefits) and would depend on the perspective of the investigator (sponsor or health provider).
In summary, in situations where additional conditions are required for approval of a new treatment in a lower response subgroup, easily applied rules based on minimum effect sizes and relaxed interaction tests are available. These depend on trial design characteristics, particularly the proportion of patients sampled from the two subgroups and must be pre-specified and documented in the Statistical Analysis Plan.
Supplemental Material
Supplemental material, sj-zip-1-smm-10.1177_09622802211017574 for Frequentist rules for regulatory approval of subgroups in phase III trials: A fresh look at an old problem by K Edgar, D Jackson, K Rhodes, T Duffy, C-F Burman and LD Sharples in Statistical Methods in Medical Research
Acknowledgements
The project was stimulated by discussions with Sue-Jane Wang of the Food and Drug Administration and benefitted from discussions with David Wright of AstraZeneca. The views and work contained within the paper are those of the authors. Software is available from the corresponding author upon request.
Appendix 1. Conditional power calculations
The following shows how to calculate the conditional power for each rule, given pre-specified parameters , the proportion of patients in (π), and the correlation between the subgroup treatment effects (and therefore the subgroup test statistics and ) ρ.
Given pre-specified treatment effects and for the subgroups and planned sampling proportion from the sub-population π, the full population treatment effect is
We have that the conditional power for each rule is
where, for variance components and and correlation ρ, we have
The denominator does not depend on the form of any proposed rule and is given by
Note that the right-hand side is a function of three location parameters and π, and three variance parameters , ρ (see equations (2) and (3)), and conditional on these the standard normal deviate can be obtained from any statistical software.
A.1 Rule 1 numerator
For rule 1, the numerator of the conditional power is given by the expression
We make the transformation and , and find their distribution as follows
Because ZF and are Z test statistics (i.e. assumed ), the variances of X and Y are given by
The covariance of X and Y is
Therefore, the joint distribution of X and Y is
The numerator is and can be obtained from standard statistical software.
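Where a bivariate normal routine is not to hand, a joint probability of this kind can be approximated by one-dimensional numerical integration. The sketch below uses only the Python standard library and is an illustration under stated assumptions: X and Y have been standardised to unit variance, the correlation is strictly between −1 and 1, and the thresholds, means and correlation passed in are the user's own values for a given rule. The function names are hypothetical and the rule-specific expressions are not reproduced here.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()


def upper_orthant(a, b, mean_x, mean_y, corr, n_steps=2000):
    """P(X > a, Y > b) for unit-variance bivariate normal X and Y.

    Integrates phi(x) * P(Y > b | X = x) over x > a with the composite
    Simpson rule, truncating the range 8 standard deviations above the
    larger of a and mean_x. Requires -1 < corr < 1 and even n_steps.
    """
    lo = a - mean_x                   # lower limit on the centred scale
    hi = max(lo, 0.0) + 8.0           # mass beyond this is negligible
    cond_sd = sqrt(1.0 - corr ** 2)   # standard deviation of Y given X
    h = (hi - lo) / n_steps

    def integrand(u):
        # u = X - mean_x; Y | X is normal with mean mean_y + corr * u
        tail = 1.0 - _STD.cdf((b - mean_y - corr * u) / cond_sd)
        return _STD.pdf(u) * tail

    total = integrand(lo) + integrand(hi)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    return total * h / 3.0


def conditional_power(a, b, mean_x, mean_y, corr):
    """P(Y > b | X > a): joint probability divided by P(X > a)."""
    denom = 1.0 - _STD.cdf(a - mean_x)
    return upper_orthant(a, b, mean_x, mean_y, corr) / denom


# Sanity check: with zero means and thresholds and corr = 0.5, the
# orthant probability has the known value 1/4 + asin(0.5) / (2 * pi).
print(upper_orthant(0.0, 0.0, 0.0, 0.0, 0.5))
```

If Y is not already on a unit-variance scale, its threshold and mean should first be divided by its standard deviation.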
A.2 Rule 2 numerator
The numerator for Rule 2 is given by . Making the transformations, and , the joint distribution of X and Y is found as follows:
X is the same as for Rule 1, so that the expectation and variance of X are again and 1, respectively.
The expectation and variance of Y are
The covariance of X and Y is
The joint distribution of X and Y for Rule 2 is
Again, the numerator for conditional power is , which we obtain from standard statistical software.
A.3 Rule 3 numerator
For this rule the numerator is , where is the quantile from the standard normal distribution.
We make the transformations, and , where .
Again the expectation and variance of X are and 1, respectively.
The expectation and variance of Y are
The covariance of X and Y is given by
Then the joint distribution of X and Y is
Again, the numerator for conditional power is , which we obtain from standard statistical software.
A.4 Equality of conditional and unconditional power for rule 3 when ρ = 0, data are normally distributed and subgroups have the same sampling variance
To explore when conditional and unconditional power are the same, we identify conditions when , or equivalently, the correlation of R3 and is zero.
For Rule 3, conditional power is defined as and we made the transformations, and , so that the conditional power can be written .
Recall that the covariance of X and Y is
If there is no correlation between the two subgroup treatment estimates (ρ = 0), then this becomes
(6) |
Recall that for the normal distribution case with n patients allocated to each treatment arm and common sampling variance in the two subgroups, and . Substituting these into equation (6) results in zero correlation between Rule 3 and the overall significance condition, and shows that the interaction rule is independent of the condition for this case.
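The reasoning behind this result can be illustrated with a short derivation. This is a sketch under assumptions that are ours rather than a restatement of the paper's equations: the full-population estimate is taken to be the π-weighted average of the subgroup estimates, π denotes the proportion of patients in B− (the argument is symmetric if π refers to B+), s is a hypothetical symbol for the common outcome standard deviation, and n is the number of patients per treatment arm.

```latex
\begin{align*}
\hat\mu_F &= \pi\,\hat\mu_{B-} + (1-\pi)\,\hat\mu_{B+}, \qquad
\hat\delta = \hat\mu_{B+} - \hat\mu_{B-}, \\
\operatorname{Cov}(\hat\mu_F, \hat\delta)
  &= (1-\pi)\,\sigma_{B+}^2 - \pi\,\sigma_{B-}^2
     + (2\pi - 1)\,\rho\,\sigma_{B-}\sigma_{B+}. \\
\intertext{With $\rho = 0$, $\sigma_{B-}^2 = 2s^2/(\pi n)$ and
$\sigma_{B+}^2 = 2s^2/\{(1-\pi)n\}$:}
\operatorname{Cov}(\hat\mu_F, \hat\delta)
  &= (1-\pi)\,\frac{2s^2}{(1-\pi)\,n} - \pi\,\frac{2s^2}{\pi n}
   = \frac{2s^2}{n} - \frac{2s^2}{n} = 0 .
\end{align*}
```

Since the overall and interaction Z-statistics are rescaled versions of these two estimates and are jointly normal, zero covariance implies independence, so under these assumptions conditioning on overall significance leaves the power of Rule 3 unchanged.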
For binary responses and using a logit link, and where θjk is the probability of an event in treatment arm j and subgroup k. In this case, the covariance is zero only if
In general, this will not hold. A similar situation applies for count data.
In summary, if the two subgroup estimates are correlated (ρ ≠ 0 due to covariate adjustment) or the sampling variances in the two groups differ, either because the data are not normally distributed or because outcome measurements in the two groups have different variances, then conditional and unconditional power are not necessarily equivalent under Rule 3.
Footnotes
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: KE was supported by a UK National Institute for Health Research Methodology Fellowship.
ORCID iDs: K Edgar https://orcid.org/0000-0002-9954-5531
LD Sharples https://orcid.org/0000-0003-0894-966X
References
- 1. FDA-NIH Biomarker Working Group. BEST (Biomarkers, EndpointS, and other Tools) resource. Silver Spring (MD): Food and Drug Administration (US); Bethesda (MD): National Institutes of Health (US), 2016, www.ncbi.nlm.nih.gov/books/NBK326791/ (accessed 19 May, 2021).
- 2. Oldenhuis CN, Oosting SF, Gietema JA, et al. Prognostic versus predictive value of biomarkers in oncology. Eur J Cancer 2008; 44: 946–953.
- 3. Committee for Medicinal Products for Human Use (CHMP). Guideline on the investigation of subgroups in confirmatory clinical trials. 31 January 2019, EMA/CHMP/539146/2013.
- 4. Brookes ST, Whitley E, Peters TJ, et al. Subgroup analysis in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technol Assess 2001; 5: 33.
- 5. FDA (US Food and Drug Administration). Multiple endpoints in clinical trials. Guidance for industry, 2017, https://www.fda.gov/media/102657/download (accessed 19 May, 2021).
- 6. Burman CF, Sonesson C, Guilbaud O. A recycling framework for the construction of Bonferroni-based multiple tests. Stat Med 2009; 28: 739–761.
- 7. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979; 6: 65–70.
- 8. Spiessens B, Debois M. Adjusted significance levels for subgroup analyses in clinical trials. Contemporary Clin Trials 2010; 31: 647–656.
- 9. Gonzalez-Martin A, Pothuri B, Vergote I, et al. Niraparib in patients with newly diagnosed advanced ovarian cancer PRIMA trial. New Engl J Med 2019; 381: 2391–2402.
- 10. Wechsler ME, Szefler SJ, Ortega VE, et al. Step-up therapy in black children and adults with poorly controlled asthma. New Engl J Med 2019; 381: 1227–1239.
- 11. Asbell PA, Maguire MG, Ying G-S, et al. n-3 Fatty acid supplementation for the treatment of dry eye disease. New Engl J Med 2018; 378: 1681–1690.
- 12. James ND, de Bono JS, Spears MR, et al. Abiraterone for prostate cancer not previously treated with hormone therapy. New Engl J Med 2017; 377: 338–351.
- 13. Gibson CM, Halaby R, Korjian S, et al. The safety and efficacy of full-versus reduced-dose betrixaban in the acute medically Ill VTE (venous thromboembolism) prevention with extended-duration betrixaban (APEX) trial. Am Heart J 2017; 185: 93–100.
- 14. Vivot A, Boutron I, Béraud-Chaulet G, et al. Evidence for treatment-by-biomarker interaction for FDA-approved oncology drugs with required pharmacogenomic biomarker testing. Scientific Rep 2017; 7: 6882.
- 15. Ondra T, Dmitrienko A, Friede T, et al. Methods for identification and confirmation of targeted subgroups in clinical trials: a systematic review. J Biopharmaceut Stat 2016; 26: 99–119.
- 16. Stallard N, Hamborg T, Parsons N, et al. Adaptive designs for confirmatory clinical trials with subgroup selection. J Biopharmaceut Stat 2014; 24: 168–187.
- 17. Matsui S, Crowley J. Biomarker-Stratified Phase III Clinical Trials: enhancement with a subgroup-focused sequential design. Clin Cancer Res 2018; 24: 994–1001.
- 18. Best N, Price RG, Poulinguen IJ, et al. Assessing efficacy in important subgroups using Bayesian dynamic borrowing. Pharmaceut Stat 2021; 20 (3): 551–62.
- 19. Nashef SA, Fynn S, Abu-Omar Y, et al. AMAZE: a randomized controlled trial of adjunct surgery for atrial fibrillation. Eur J Cardio-Thoracic Surg 2018; 54: 729–737.
- 20. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat Assoc 2001; 55: 1–6.