Skip to main content
Sage Choice logoLink to Sage Choice
. 2024 Sep 25;33(11-12):1967–1978. doi: 10.1177/09622802241277764

Testing for a treatment effect in a selected subgroup

Nigel Stallard 1,
PMCID: PMC11577705  PMID: 39319446

Abstract

There is a growing interest in clinical trials that investigate how patients may respond differently to an experimental treatment depending on the basis of some biomarker measured on a continuous scale, and in particular to identify some threshold value for the biomarker above which a positive treatment effect can be considered to have been demonstrated. This can be statistically challenging when the same data are used both to select the threshold and to test the treatment effect in the subpopulation that it defines. This paper describes a hierarchical testing framework to give familywise type I error rate control in this setting and proposes two specific tests that can be used within this framework. One, a simple test based on the estimated value from a linear regression model with treatment by biomarker interaction, is powerful but can lead to type I error rate inflation if the assumptions of the linear model are not met. The other is more robust to these assumptions, but can be slightly less powerful when the assumptions hold.

Keywords: Adaptive enrichment design, family wise error rate control, hierarchical testing, linear regression, subgroup selection

1. Introduction and motivation

Recent advances in the understanding of the heterogeneity of patients and disease types have led to an increasing desire to understand how patients may respond differently to a particular treatment. This in turn has led to development of a number of novel clinical trial approaches that attempt to identify a subgroup of patients that may benefit from an experimental therapy.1,2

One particular approach is the adaptive enrichment design.3,4 This is a two-stage design. In the first stage, patients are recruited from the whole population. Data from these patients are then used to select some subgroup of the population, which may be the whole population, in which a positive treatment effect is anticipated, with patients recruited from this subpopulation in the second stage. One such trial is described by Ho et al. 5 At the end of the trial, the question of interest is whether the treatment can be shown to be effective in the subpopulation selected.

If this is a confirmatory, phase III, randomised controlled trial, as highlighted, for example, by Simon, 6 the final analysis of the trial is challenging. It is often considered desirable to conduct a hypothesis test of the treatment effect in the selected subgroup, but if the final analysis uses data from both stages the fact that stage 1 data are used both to select the subpopulation to be included in the second stage and to assess the effectiveness of the treatment in that subpopulation can lead to inflation of the type I error rate. In order to control the type I error rate it has been suggested that only the data from the second stage might be used in the final analysis 7 or that data from all patients in the first stage are combined with those in the second stage irrespective of whether or not they are in the selected subpopulation, 4 but such approaches are generally inefficient. 8

While there has been some work on methods that enable use of the first stage data in the final analysis and control the type I error rate, these have predominantly been in the case in which the subpopulation is defined on the basis of a single binary biomarker so that in the second stage recruitment is either from the whole population or restricted to a predefined subgroup including only biomarker positive patients.913

There has been more limited work on the setting in which the biomarker is continuous. In this case, while some authors assume that this continuous biomarker divides the patients into two distinct groups with different levels of treatment effect, 14 , it is more common to assume that the biomarker has a monotonic effect on the treatment effect, often with the direction of this assumed, so that, for example, the treatment effect is assumed to be non-decreasing with increasing biomarker values. Selection of the subpopulation thus corresponds to identification of a threshold level for the biomarker so that the subgroup is comprised of all patients with biomarker levels above this threshold. Lin et al. 8 consider this problem in the setting of a single-arm trial, and propose selection of the subgroup that maximises the standardised average subpopulation effect relative to a specified value, which may be a historical treatment effect. Stallard 15 considers the setting of a two-arm study with the subgroup selected to maximise one of the measures including the test statistic comparing the treatment groups in the selected subgroup and the estimated treatment difference in the subgroup. Selection based on maximisation of a test statistic such as the estimated treatment difference can, however, lead to selection of small subgroups when the treatment effect is increasing with the biomarker.

Frieri et al. 16 extend the setting of Lin et al. 8 to that of a comparative trial, with patients randomised to receive either an experimental treatment or a standard treatment considered as a control. Like Lin et al., 8 they assume that the biomarker and the response follow a bivariate normal distribution, so that the expected response is linearly related to the biomarker level, and propose selection of the subpopulation of patients with biomarker levels such that the expected response on the experimental treatment exceeds that on the control treatment by some specified amount. Frieri et al., like Wang et al., 17 who consider time to event data, and Baldi Antognini et al., 18 who consider design issues, focus primarily on estimation of a threshold level for the continuous biomarker above which the treatment effect is positive. The final analysis they propose is, however, based only on the stage 2 patients from the selected subgroup, so does not need to take account of the data-dependent treatment selection but does not use all available data. The focus of the current paper is more on hypothesis testing when data from both stages are used, the aim being to identify a subpopulation in which the null hypothesis that the treatment is ineffective can be rejected whilst allowing for the multiple testing that arises from the same data being used to both identify the subpopulation and conduct the test.

2. Setting and notation

Since, as outlined above, the statistical challenges arise from the use of the same data for both subgroup selection and hypothesis testing within the selected subgroup, we will focus specifically on the analysis of data from stage one. Data from stage two can be added, for example, via a combination test, 19 as described in more detail in the Discussion section below. We thus consider a single-stage study in which a continuous biomarker is used to identify a subpopulation with the treatment effect tested in that subpopulation. Like Lin et al. 8 and Frieri et al., 16 we will specifically consider the setting of a continuous response that can be assumed to be normally distributed, though we will make no similar assumption about the distribution of the biomarker.

In detail, suppose n patients are assigned to two groups, with ti indicating group membership for patient i , i=1,,n , with ti=0 for patients receiving a control treatment and ti=1 for patients receiving an experimental treatment. Let n1=t1++tn and n0=nn1 denote the numbers of patients in treatment groups 1 and 0, respectively. Treatment allocation is typically assigned at random or using a randomly permuted blocks design to ensure that n0=n1=n/2 .

Let xi denote the biomarker value for patient i , i=1,,n . We will condition these values, considering them fixed. Without loss of generality, assume x1x2xn .

Let Yi denote the response for patient i , with yi denoting the observed value of Yi , i=1,,n , and Yi normally distributed and related to xi and ti via a linear regression model with main effects and interaction. That is

Yi=α0+β0xi+αti+βtixi+εi,i=1,,n (1)

with εi independent N(0,σ2) for some unknown σ .

We will assume that larger values of Yi are more desirable, and will further assume that it is known, or can be assumed a priori, that β0 ; that is, if the biomarker has a predictive effect then it is in the direction such that treatment effects are larger for larger biomarker values.

3. Closed testing procedure for strong familywise error rate control

Let θ(x) denote the treatment effect at biomarker level x , so that, from equation (1), we have θ(x)=α+βx , and write θk for θ(xk) . We wish to test the family of null hypotheses Hk:θk0 for k=1,,n . Note that, since x1xn , the assumption β0 means that θ1θn . Hence H1Hn and Hk implies θ(x)0 for all xxk . Conversely, if we reject Hk and conclude θk>0 , this implies we may conclude that θ(x)>0 for all xxk . Rejecting Hk thus leads to the conclusion that the treatment is effective for patients with biomarker level xk or larger.

Since H1Hn , the set {Hk;k=1,,n} is closed under intersection. Applying the closed testing procedure, 20 for given k , the familywise error rate (FWER) is controlled in the strong sense by a procedure under which Hk can be rejected if all intersection hypotheses that are subsets of Hk can be rejected at the nominal level. In this case, this corresponds to the rejection of H1,,Hk .

The FWER is thus strongly controlled by the hierarchical testing procedure illustrated in Figure 1, in which we test each Hk at the nominal level, starting with k=1 . If Hk is not rejected at the nominal level, we do not reject any of Hk,,Hn and conduct no further tests. If Hk is rejected at the nominal level, we reject H1,,Hk , concluding that θ(x)>0 for all xxk , and proceed to step k+1 .

Figure 1.

Figure 1.

The hierarchical testing procedure for the family of hypotheses {Hk,k=1,,n} .

4. A simple regression-based test of Hk

In order to apply the hierarchical testing procedure just described, a test of Hk is required. As noted above, Hk implies θ(xk)0 and hence θ(x)0 for all xxk , thus in particular for xk,,xn . A test of Hk can thus be based on data (xi,yi,ti),i=k,,n . Although this test is not based on the entire data set, and so may lack power for larger k , omitting data (xi,yi,ti),i=1,,k1 means that rejection of Hk , and a conclusion that the treatment is effective at biomarker level xk , cannot arise due to observed data indicating a large treatment effect at higher biomarker levels.

In order to test Hk based on the data (xi,yi,ti),i=k,,n, we can fit the linear model given by (1) to the data (xi,yi,ti),i=k,,n , noting that this requires kn3 for parameter estimates to be identifiable. Fitting the model provides estimates of the treatment and treatment by biomarker interaction effects, α^k and β^k , together with their estimated variances and covariance. The estimated treatment effect at a biomarker value xk is then θ^(xk)=α^k+β^kxk with expected value E(θ^(xk))=αk+xkβk and variance var(θ^(xk))=var(α^k)+2xkcov(α^k,β^k)+xk2var(β^k), where estimates and their variances can be obtained using standard software, for example, using the lm and vcov commands in R, 21 or via expressions in, for example, Madsen and Thyregod. 22 . Since E(θ^(xk))=θ(xk) , we can test Hk using Zk=θ^(xk)/var(θ^(xk)) , which under Hk has an asymptotic distribution that is stochastically no larger than a N(0,1) -distributed random variable.

If β^k>0 , so that the data support a treatment effect that increases with x , a (one-sided) p -value for the test Hk can thus be obtained as

pk=1Φ(Zk)=1Φ{θ^(xk)var(θ^(xk))}. (2)

If β^k<0 , indicating a treatment effect that decreases with x , we do not wish to conclude that θ(x)>0 for all xxk , so set pk=1 .

Similarly, for k>n3 , when it is impossible to fit the linear model (1) to the data (xi,yi,ti),i=k,,n , we set pk=1 and fail to reject Hk . In practice, this is unlikely to be of concern when n is of reasonable size as it is not desirable to conclude that the treatment is effective for a very small subset of the population.

5. A more robust test of Hk

As the estimate θ^(xk) on which the p -value given by (2) is based is taken directly from the linear model (1), it might be anticipated that the test based on this p -value would be sensitive to departure from the linearity assumed, and it is shown below that this can be the case, particularly when the true treatment effect θ(x) is increasing and concave. An alternative, more robust, test of Hk is thus also proposed.

As in the simpler test above, the test of Hk is based on data (xi,yi,ti),i=k,,n , that is, data with xxk . With this procedure, we first identify a value xk* such that a positive treatment effect is anticipated for xxk* , then use data with xk*xxk to test for a treatment effect after adjusting for x to test the hypothesis Hk .

The procedure is illustrated by Figure 2, which shows a plot of y1,,yn against x1xn for a particular data set. Solid points indicate ti=1 and hollow points ti=0 . In order to test Hk , data (xi,yi,ti),i=k,,n , are used. This is illustrated in the figure for k=3 , so that data given by points to the right of the solid vertical line at x3 are not used.

Figure 2.

Figure 2.

Plot showing yi plotted against xi for a single small data set using solid points for ti=1 and hollow points for ti=0 to illustrate notation and data used to test Hk (here with k=3 ). A linear model with treatment by biomarker interaction is fitted to data for patients i=k,,n and the value of x at which the lines cross is denoted by xk* . The value Jk is then the largest i with xix* . Data for patients i=k,,Jk are used to calculate the test statistic, which will be denoted Zk* .

We first fit the linear model (1), with main effects and interaction, to the data (xi,yi,ti),i=k,,n , again requiring that kn3 , to obtain estimates α^k , β^k with estimated treatment effect at a biomarker value x given by θ^(x)=α^k+β^kx . If β^k>0 , so that the data support a treatment effect that increases with x , the estimated treatment effect, θ^(x) is positive for all xxk* where xk*=α^k/β^k .

The fitted lines from model (1) are shown in Figure 2 by the dashed line (group 0) and solid line (group 1). The point at which these lines cross is xk* , shown by the vertical dashed line. Given xk* , we define Jk=argmax{xjxjxk*} , with Jk=0 if {xjxjxk*} is empty, so that xJk is the smallest xj at least as large as xk* .

If β^k>0 and Jkk , the treatment effect in the range xJkxxk adjusted for x can be estimated by fitting a linear model with main effects of treatment and biomarker but no interaction to data (xi,yi,ti),i=k,,Jk assuming kJk2 . A test statistic, Zk* , for Hk can then be obtained as the test for a treatment effect adjusted for x from this model. The data used to obtain fit Zk* lie between the vertical dashed and solid lines in Figure 2. If β^k<0 or Jk<k , the data do not indicate that there are any x<xk with θ(x)>0 and we set Zk*= .

In order to obtain a p -value, pk* , for the test of Hk , the observed value of the test statistic Zk* can be compared to its null distribution. This distribution is derived in the following subsection.

5.1. Null distribution of Zk* , the test statistic for a treatment effect in the selected subgroup

In order to base a test of Zk* , we require its null distribution.

We start by giving a derivation of the distribution of Zk* under the simple null hypothesis α=β=0 .

As described above and illustrated in Figure 2, let xk*=α^k/β^k and Jk=argmin{xjxjxk*} . We wish to obtain the distribution of Zk* , that is, specifically to obtain Pr(Zk*c).

We have

Pr(Zk*c)=j=knPr(Zk*c,Jk=j)

which, since Zk*= if β^k0 , means that

Pr(Zk*c)=j=knPr(Zk*c,Jk=j,β^k>0)

Now, for Jk=j , we must have xk*(xj+1,xj] , that is, xj+1<α^k/β^kxj. If β^k>0 , that is, α^kxj+1β^k>0 and α^k+xjβ^k0, so that

Pr(Zk*c)=j=knPr(Zk*c,α^kxj+1β^k>0,α^k+xjβ^k0,β^k>0) (3)

Since Zk*,α^kxj+1β^k,α^k+xjβ^k and β^k are all linear combinations of coefficients from a linear model involving data from patients i=k,,n , we have

(Zk*,α^kxj+1β^k,α^k+xjβ^k,β^k)=MYk

for some 4×(nk) matrix M , with Yk=(yk,,yn) , so that

(Zk*,α^kxj+1β^k,α^k+xjβ^k,β^k)N(ME(Yk),Mvar(Yk)M) (4)

(see Appendix 1 in the Supplemental material for details).

If σ2 were known, the probability

Pr(Zk*c,α^kxj+1β^k>0,α^k+xjβ^k0,β^k>0) (5)

could thus be given by a multivariate normal tail area.

Since the terms on the left-hand side, all come from linear models adjusting for the biomarker levels, they do not depend on α0 or β0 , so these may be set to 0 in the distribution (4). Hence under the null hypothesis α=β=0 , we can take E(Yk)=0 , the zero vector, so that the four elements of the vector on the left-hand side of equation (4) all have mean 0.

Thus, the value of Zk* is invariant to scaling of Yk , since it is divided by σ . Scaling of Yk will lead to scaling of α^kxj+1β^k , α^k+xjβ^k and β^k , but since in the probability given in equation (5) these are compared with their mean values 0, the probability will remain unchanged. We may thus scale each Yi to have unit variance so that equation (4) becomes

(Zk*,α^kxj+1β^k,α^k+xjβ^k,β^k)N(0,MM) (6)

Denoting by Φ¯(b,μ,Σ) the complementary distribution function giving the probability Pr(Xibi for all i) when X has a multivariate normal distribution with mean μ and variance–covariance matrix Σ , the probability given by equation (5) is equal to Φ¯((c,0,0,0),0,MM). This then enables calculation of Pr(Zk*c) given by equation (3) as required.

In practice, since σ is unknown, the estimated value from the linear regression (1) could be used. This estimate could be obtained separately for each Hk using data (xi,yi,ti),i=k,,n . However, since the estimate may be poorly estimated when the sample size used to fit the models is small, as will be the case when testing Hk for k large or when Jk is close to k , an alternative is to use an estimate of σ obtained from fitting the model (1) to the whole data set when calculating Zk* and the corresponding p -value. This will provide a more precise estimate if (1) holds with var(εi)=σ2 for all i .

The p -value given by (3) was based on the distribution of Zk* under the simple null hypothesis α=β=0 . Supplemental Appendix 2 shows that this distribution is stochastically larger than that under any α,β such that β0 and α+βxk0 , and hence α+βx0 , that is, θ(x)0 , for all xxk , so that a test based on the p -value obtained is thus conservative under any other null scenario.

The test statistic Zk* is obtained by using the data in the range xJkxxk to test for a treatment effect adjusting for x . Alternative tests could fit a model to these data with a treatment effect but without adjusting for x , or adjusting for both x and an interaction between x and treatment, could also be used, with the distribution of the test statistic obtained in a similar way to that of Zk* .

6. Simulation study

6.1. Comparison of operating characteristics

A simulation study was conducted to assess the properties of the testing procedures described above.

Taking n=80 , for each of 10,000 simulated data sets, values t1,,tn were sampled at random without replacement from the set with 40 elements equal to 0 and 40 elements equal to 1 so that n0=n1=40 and values x1,,xn were simulated from a normal distribution with mean 0 and variance 1 and ordered such that x1xn . Values y1,,yn were then simulated from the model

YiN(α0+β0xi+αti+βtixi,1) (7)

with α=0 and β taken to be 0, 0.5, and 1 to illustrate a range of treatment by biomarker interaction effects. As the models used adjust for the biomarker, the results of the hypothesis tests are invariant to α0 and β0 and these were arbitrarily set to zero. Hypotheses H1,,Hn were tested at FWER 0.025 using the tests based on Zk and Zk* as described above.

For β=0 , a type I error corresponds to rejection of any Hk while for β>0 , since α=0 so that θ(x)0 if and only if x<0 , a type I error corresponds to rejection of any Hk with xk<0 . The number of simulations leading to a type I error was recorded. For β>0 , the number of simulations leading to rejection of Hk for each xk>0 was also recorded as a measure of power. Note that the hierarchical testing procedure used, as illustrated in Figure 1, means that this number is increasing in xk (and decreasing in k ).

In order to provide a comparison with the approaches described above, the data were also used to test H1,,Hk using an approach based on that described by Stallard. 15 In this approach, a nominal test of Hk was conducted by fitting the linear model to test for a treatment effect adjusted for a biomarker effect to data sets (xi,yi,ti),i=k,,j for j=k+2,,n and taking the test statistic to be the largest observed from the (nk1) test statistics obtained. A p -value for the test based on this maximum can be obtained as described by Stallard 15 and the hierarchical testing procedure applied as above.

Simulation results are given in Figure 3 and Table 1. The left-hand panels of Figure 3 show the expected values under the simulation models for t=0 (dotted line) and t=1 (solid line) for β equal to 0 in the top row, so that the dashed and solid lines coincide, and to 1 in the second row, together with one simulated data set in each case with hollow points for t=0 and solid points for t=1 given as for illustration of the variability of the data around the expected values. The right-hand panel shows, for the same values of β , the estimated probability of rejecting Hk plotted against xk for k=1,,n using the tests described above based on Zk and Zk* and using the test proposed by Stallard. 15 . This probability corresponds to the type I error rate in the upper plot, and to the type I error for x0 and the power for x>0 in the lower plot. The overall type I error and power for H1 for β{0,0.5,1} in each case are given in Table 1.

Figure 3.

Figure 3.

Simulation models and example data sets and simulation results for β=0 (upper panels) and β=1 (lower panels). Left-hand panels show simulation models and one simulated data set for t=0 (dashed line gives expected values and hollow points give simulated values) and t=1 (solid line gives expected values and solid points give simulated values). Right-hand panels show simulated probability from 10,000 simulations per scenario of rejecting Hk plotted against xk for tests as in Stallard 15 (dashed line) and using Zk (dotted line) and Zk* (solid line). Probabilities are type I error rates for β=0 and for xk0 when β=1 and powers for β=1 and xk>0 .

Table 1.

Estimated familywise type I error rate (FWER) and power from simulation study (estimates based on 10,000 simulations).

Test based on Zk Test based on Zk* Test based on largest test statistic
β FWERa Powerb FWERa Powerb FWERa Powerb
0 0.0233 0.0234 0.0236
0.5 0.0024 0.5159 0.0013 0.2042 0.0005 0.2029
1 0.0070 0.9738 0.0006 0.6563 0.0001 0.6273

aFWER is the probability of rejecting any Hk if β=0 or any Hk with xk0 if β>0 .

bPower is probability of rejecting any Hk with xk>0 if β>0 .

It can be seen that all of the proposed procedures control the familywise type I error rate at the 0.025 level as required. For β=0, the FWER is close to the nominal level. For larger β>0 , the type I error rate reported is the probability of rejecting any Hk with xk0 . Since in this case the treatment effect θ(xk) decreases with xk , for small xk values we can have θ(xk) well below 0, leading to conservatism in the test.

It is also evident that the test using Zk is more powerful than that using Zk* , and that this is slightly more powerful than the test proposed by Stallard 15 that chooses the value of x* that gives the largest test statistic.

The probability of rejecting Hk for specified xk shown in the lower right-hand plot shows the power to detect a treatment effect at different biomarker levels, x . As the hierarchical testing procedure ensures that rejection of Hk leads to rejection of Hj for all jk , this power increases with x , reducing towards the type I error rate as x approaches zero from above, for any testing method. This seems reasonable for the model considered, for which the treatment effect θ(x) is reducing towards zero at x=0 . Since in the simulations the xk values are simulated from a N(0,1) distribution, many of the xk values are close to 0, so that even for xk>0 many true treatment effect values, θk , are small. The power to reject Hk is thus low for many k , leading to low true discovery rates, which, for β=1 are, respectively, estimated from the simulations to be 35%, 13% and 10% for the methods using Zk , Zk* and the approach of Stallard.

6.2. Robustness to departures from the linear model

Both of the tests for Hk described above are based on the linear model (1). If the assumed model is incorrect, neither of the test statistics Zk nor Zk* will follow the distributions obtained and the test may be inaccurate.

In order to assess the robustness of the methods to departures from (1) a number of simulations were conducted with data simulated from other models. In order to assess potential type I error rate inflation, simulations were conducted under null scenarios in which E(Yixi,ti=1)E(Yixi,ti=0)0 for all x1,,xn .

Writing θ(x) for E(Yixi,ti=1)E(Yixi,ti=0) , the first sets of simulations were conducted with data simulated from models with θ(x)=0 for all x but with E(Yixi,ti=1) not linearly related to x , that is, with the biomarker X having no predictive effect but a non-linear prognostic effect. Additional sets of simulations were conducted with E(Yixi,ti=0) constant and θ(x) non-linearly increasing in x but with simulated values of X restricted to ensure that θ(x)0 for all simulated X values. These models correspond to the biomarker X having no prognostic effect but a non-linear predictive effect such that the treatment is not effective for any patients. The first two sets of simulation models thus all correspond to null models in which there is no positive treatment effect at any biomarker level. A third set of simulations were conducted with E(Yixi,ti=0) constant and θ(x) non-linearly increasing in x but with θ(x)>0 for x>0 and θ(x)<0 for x<0 .

The simulation models are given in Table 2, and are also illustrated, along with one example simulated data set in each case, in the left-hand panels of Figures A1 to A3 given in Appendix 3 in the Supplemental material, which are analogous to Figure 3 above. In each case, data were simulated with n0=n1=40 , XN(0,1) and Y normal with variance 1 and mean related to X according to the expression given in the table.

Table 2.

Non-linear models used for simulations reported in Table 3 and in Supplemental Appendix 3.

Description E(Yixi,ti)
Models with prognostic biomarker effect
Step function 1+2I(xi>0)
 Concave 2+0.03(xi4)3
 Convex 2+0.03(xi+4)3
Models with predictive biomarker effect (positive treatment effect for no x values)
 Concavea 2+0.03(xi4)3I(ti=1)
 Convexb 8.3+(0.03(xi+4)310.3)I(ti=1)
Models with predictive biomarker effect (positive treatment effect for x>0 )
Step function 1+(2+I(ti=1))I(xi>0)
 Concave 2+0.03((xi4)3+176)I(ti=1)
 Convex 8.3+0.03((xi+4)364)I(ti=1)

a xi4 to ensure E(Yixi,ti=1)<E(Yixi,ti=0) .

b xi3 to ensure E(Yixi,ti=1)<E(Yixi,ti=0) .

Estimated type I error rates based on 10,000 simulations under each model are given in Table 3 and are also given in Figures A1 to A3 in Appendix 3 in the Supplemental material. It can be seen that there is a slight increase in the type I error rate above the nominal level for tests based on Zk and Zk* when there is a convex prognostic effect of x and for the test based on Zk when there is a concave prognostic effect. Particularly notable, however, is the very large increase in the type I error for the test based on Zk when there is a concave predictive effect of x . The tests proposed by Stallard 15 that do not use a linear model control the type I error rate in all cases, but are again conservative, unsurprisingly particularly when θ(x)<0 for some values of x .

Table 3.

Simulation models and estimated familywise type I error rates from 10 000 simulations to assess robustness to departures from the linear model.

Test based Test based Test based on largest
Model on Zk on Zk* test statistic
Models with prognostic biomarker effect
Familywise type I error rate
Step function 0.0261 0.0226 0.0122
 Concave 0.0365 0.0085 0.0110
 Convex 0.0840 0.0309 0.0124
Models with predictive biomarker effect (positive treatment effect for no x values)
Familywise type I error rate
 Concavea 0.4015 0.0078 0.0029
 Convexb 0.0000 0.0000 0.0000
Models with predictive biomarker effect (positive treatment effect for x>0 )
Familywise type I error rate
Step function 0.0055 0.0025 0.0018
 Concave 0.0172 0.0001 0.0006
 Convex 0.0049 0.000 0.0000
Power (percentage hypotheses correctly rejected)
Step function 0.5005 (20%) 0.5269 (19%) 0.5645 (21%)
 Concave 0.8585 (32%) 0.2298 (4%) 0.1259 (2%)
 Convex 0.9945 (45%)) 0.9712 (23%) 0.9869 (22%)

a xi4 to ensure E(Yixi,ti=1)<E(Yixi,ti=0) .

b xi3 to ensure E(Yixi,ti=1)<E(Yixi,ti=0) .

For models with a positive treatment effect for some x values, estimated power, again based on 10,000 simulations under each model, is also given in Table 3 together with estimated true rejection rates. It can be seen that the power of all methods depends on the true model for the data. In particular, the estimated power is largest for the convex model for which the true treatment effect, θ(x) , is large for larger x , and smallest for the concave model for which θ(x) is relatively small for all x , and particularly for smaller positive x . The test based on Zk has the highest power, consistent with the increased type I error rate, though, for the step function model, all three methods have similar power.

7. Example: Analysis of data from an Alzheimer’s disease study

Schnell et al. 23 report an analysis of data from a trial of an experimental therapy for Alzheimer’s disease. They compare patients receiving low-dose treatment with those receiving a placebo in terms of their change in cognitive impairment, with positive values indicating an improvement, that is, a reduction in cognitive impairment, identifying a positive treatment by age interaction indicating that the treatment is more effective for older patients.

The data from 41 patients, with 16 receiving low-dose treatment and 25 receiving placebo, are available from Schnell et al. 23 and are shown in Figure 4, where the outcome is plotted against age with placebo patients represented by a hollow circle and low-dose patients by a solid circle.

Figure 4.

Figure 4.

Improvement in cognitive impairment and age for patients in low dose (solid points) and placebo groups (hollow points) in Alzheimer’s disease study.

Figure 4 also shows fitted values for the linear model given by (1) that includes terms for treatment, age and a treatment by age interaction fitted to the whole data set. This can be used to test the null hypothesis H1 . From the linear model, the estimate β^ is equal to 0.371, so is greater than zero, indicating that the treatment effect is increasing with age, and the estimated treatment effect at the largest observed age, x1=90 , is 9.44 with standard error 2.86, so that Z1=3.30 , corresponding to a p -value of 0.0005. Fitted treatment effects are positive for patients of age x*=α^/β^=64.5 and above. The value of x* is shown by the vertical dashed line in Figure 4. Using data from patients with xix* to construct Z1* gives Z1*=2.60 . A p -value can be found as described above, and in this case is equal to 0.0036.

In order to control the FWER for testing of hypotheses H1,,Hn , the hierarchical procedure described above and illustrated in Figure 1 is applied. Results of tests for hypotheses Hk,k=1,2, based on Zk and Zk* are given in Table 4. In this data set, x4=x5 and x6=x7 , so that the hypotheses H4 and H5 and the hypotheses H6 and H7 are identical and are tested using the same data (xi,yi,ti),i=4,,n and (xi,yi,ti),i=6,,n respectively. The results for the tests of each of these pairs of hypotheses are thus given in the same line in Table 4.

Table 4.

Results of hypothesis tests for Hk:θ(x)0 for xxk for k=1,,8 in Alzheimer’s disease study data shown in Figure 4.

k xk Zk pk xk* Zk* pk*
1 90 3.30 0.0005 64.5 2.60 0.0036
2 88 3.30 0.0005 64.9 2.54 0.0045
3 87 2.81 0.0025 64.8 2.10 0.0144
4, 5 86 2.98 0.0014 65.8 2.12 0.0141
6, 7 85 2.03 0.0212 64.9 1.45 0.0650
8 84 1.49 0.0687

Using the test based on pk , the hierarchical procedure leads to rejection of hypotheses H1,,H7 at the one-sided 0.025 level, and to retention of hypotheses H8,,Hn without any further testing, since the p -value for the test of H8 is 0.0687, and thus exceeds 0.025. As x7=85 , we can conclude that there is a significant effect of treatment for patients aged 85 and above.

For the test based on pk* hypotheses H1,,H5 are rejected at the one-sided 0.025 level with H6,,Hn retained as p6*=0.0650 exceeds 0.02. In this case, it would be concluded that there is a significant treatment effect for patients aged x5=86 and above.

It is worth noting that the smallest xk for which Hk is rejected, in this case, x7=85 or x4=86 , are considerably larger than x1* , the estimated value above which the treatment effect is positive, which in this case is equal to 64.5 as noted above. This reflects the fact that the subgroup for which there is evidence of benefit is smaller than that for which there is an indication of benefit. 23 While this is a particular feature of a setting such as this when the sample size is low, in general, since we desire to have a low probability of rejecting Hk for xk with θ(xk)=0 , the power will be low for values of x just above this even for larger sample sizes. This is analogous to the setting in a simple comparison of groups where the treatment effect estimate may be positive but not statistically significantly larger than 0.

8. Discussion

This paper has presented an approach for hypothesis testing to compare two groups in a subpopulation selected on the basis of some continuous biomarker when the same data are used for the selection and the hypothesis test. Although motivated by the setting of a clinical trial with a predictive biomarker, the method could be applied in other settings in which a subgroup of a data set is selected.

In the biomarker setting, as described in the Introduction, a popular approach is to use an adaptive enrichment design in which recruitment is restricted to the selected subpopulation in the second stage of a two-stage trial. The method proposed could be used in such a setting by using a combination test to combine evidence from both stages for tests of each hypothesis within the hierarchical testing framework described. Stage two patients could either be recruited from a subpopulation defined to be patients with any biomarker level with a positive estimated effect based on the analysis of the stage one data or from the subpopulation with biomarker values corresponding to a sufficiently small p -value based on the stage one data.

Like the approaches of Lin et al. 8 and Frieri et al., 16 the proposed method is based on a model in which a continuous response is related to the biomarker values via a normal linear model. Unlike these authors, however, we do not assume bivariate normality for the biomarker and response values. Rather, we condition the biomarker values, so that the methods make no assumption about their distribution in the population. The method could thus also be used in settings where biomarker levels are chosen by design or sampled from a restricted range within the population. In the simulations and example presented above, the values x1,,xn at which the null hypotheses are tested are the observed biomarker levels for the n patients in the study, with these patients sampled at random from the population under investigation. An alternative approach might be to either stratify sampling or to use test hypotheses at biomarker levels that do not correspond to all observed values, for example corresponding to certain quantiles of the anticipated or empirical distribution of biomarker levels observed. Such an approach could also ensure that the number of patients with data used for tests of Hk for larger k was not too small.

In contrast to Frieri et al. and Baldi Antognini et al. 18 our focus is not primarily on estimation of the biomarker level above which the experimental treatment is more effective than the control, but on identification of a subpopulation in which a positive treatment effect can be demonstrated. It should also be noted that we do not take as our parameter of interest the average treatment effect in the identified subpopulation. If the treatment effect depends on a continuous biomarker, a group selected with a positive average treatment effect will include patients with biomarker levels such that the expected treatment effect is negative. Our test based on θk , the smallest treatment effect in the selected group, will avoid this.

The methods proposed are based on the assumption that response and biomarker values are linearly related, with the usual assumptions of a normal linear regression model holding. This is a common assumption in this setting and may often be reasonable, but should be checked using the data. The second method presented, based on the test statistic Zk* , appears to be more robust, as indicated in the simulation study presented, but can still lead to type I error rate inflation. The alternative method proposed by Stallard 15 does not rely on linear model assumptions for control of the family wise type I error rate. This could thus be a more appropriate method to use if it was considered that the assumption of a linear relationship between response and biomarkers might not hold. The simulation studies reported above indicate, however, that the gain in robustness of the method of Stallard 15 comes at a cost of a loss in power relative to the approach proposed above. In practice, it is recommended that simulation studies are used to assess both the type I error rate and the power of the approach selected for a range of possible scenarios and sample sizes to ensure that the method is sufficiently robust and the study appropriately powered.

Other tests, including those not based on a linear model or including higher order polynomial terms to allow for a non-linear biomarker effect, could also be used within the hierarchical testing framework described and might have attractive properties in terms of type I error rate control and power. Simulation studies could also be used to assess the power of alternative testing methods under a range of treatment effect scenarios considered to be likely. An assumption that the treatment effect θ(x) is increasing in the biomarker is required, though this may well be considered reasonable in many contexts where a predictive biomarker effect is hypothesised.

It is also assumed that the biomarker levels are measured without error. In some settings, error in these measurements might be considered sufficiently small that it can be ignored and conventional linear regression models, as proposed above, can be used. If this is not considered reasonable, alternative error-in-variables methods could be used to replace the tests proposed above to allow for the uncertainty in the true biomarker levels, 24 though this would still assume that the subsets used in the hierarchical testing approach are correct.

The methods proposed have been developed and applied with a continuous, normally distributed response. The hierarchical testing framework could be used with other settings such as those with binary or time-to-event responses given suitable tests of the individual hypotheses Hk . Extensions of the first method proposed might be possible based on estimated treatment effects from a logistic or Cox regression model, but a derivation of the distribution of the test statistic using an extension of the second method might be more challenging.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802241277764 - Supplemental material for Testing for a treatment effect in a selected subgroup

Supplemental material, sj-pdf-1-smm-10.1177_09622802241277764 for Testing for a treatment effect in a selected subgroup by Nigel Stallard in Statistical Methods in Medical Research

Acknowledgements

The author is grateful to Dr Peter Kimani and three anonymous referees for their helpful comments that have led to improvements in the manuscript. For the purpose of open access, the author has applied a Creative Commons Attribution (CC-BY) licence to any author accepted manuscript version arising from this submission.

Footnotes

Data availability statement: Data sharing is not applicable to this article as no new data were created or analysed in this study.

The author declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding: The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Medical Research Council (grant numbers MR/V038419/1, MR/W021013/1).

ORCID iD: Nigel Stallard https://orcid.org/0000-0001-7781-1512

Supplemental material: Supplemental material for this article is available online.

References

  • 1.Antoniou M, Jorgensen AL, Kolamunnage-Dona R. Biomarker-guided adaptive trial designs in phase II and phase III: A methodological review. PLoS ONE 2016; 11: e0149803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lin J, Bunn V, Liu R. Practical considerations for subgroups quantification, selection and adaptive enrichment in confirmatory trials. Stat Biopharm Res 2019; 11: 407–418. [Google Scholar]
  • 3.Wang SJ, Hung HMJ, O’Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical J 2009; 51: 358–374. [DOI] [PubMed] [Google Scholar]
  • 4.Simon N, Simon R. Adaptive enrichment designs for clinical trials. Biostatistics 2013; 14: 613–625. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Ho TW, Pearlman E, Lewis D, et al. Efficacy and tolerability of rizatriptan in pediatric migraineurs: Results from a randomized, double-blind, placebo-controlled trial using a novel adaptive enrichment design. Cephalagia 2012; 32: 750–765. [DOI] [PubMed] [Google Scholar]
  • 6.Simon N. Adaptive enrichment designs: Applications and challenges. Clin Invest 2015; 5: 383–391. [Google Scholar]
  • 7.Renfro L, Coughlin C, Grothey A, et al. Adaptive randomized phase ii design for biomarker threshold selection and independent evaluation. Chin Clin Oncol 2014; 3: 3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Lin Z, Flournoy N, Rosenberger WF. Inference for a two-stage enrichment design. Ann Stat 2021; 49: 2697–2720. [Google Scholar]
  • 9.Wang SJ, O’Neill RT, Hung HMJ. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharm Stat 2007; 6: 227–244. [DOI] [PubMed] [Google Scholar]
  • 10.Spiessens B, Debois M. Adjusted significance levels for subgroup analyses in clinical trials. Contemp Clin Trials 2010; 31: 647–656. [DOI] [PubMed] [Google Scholar]
  • 11.Friede T, Parsons N, Stallard N. A conditional error function approach for subgroup selection in adaptive clinical trials. Stat Med 2012; 31: 4309–4320. DOI: 10.1002/sim.5541. [DOI] [PubMed] [Google Scholar]
  • 12.Stallard N, Hamborg T, Parsons N, et al. Adaptive designs for confirmatory clinical trials with subgroup selection. J Biopharm Stat 2014; 24: 168–187. [DOI] [PubMed] [Google Scholar]
  • 13.Rosenblum M, Luber B, Thompson R, et al. Group sequential designs with prospectively planned rules for subpopulation enrichment. Stat Med 2016; 35: 3776–3791. [DOI] [PubMed] [Google Scholar]
  • 14.Dioa G, Dong J, Donglin Z, et al. Biomarker threshold adaptive designs for survival endpoints. J Biopharm Stat 2018; 28: 1038–1054. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Stallard N. Adaptive enrichment designs with a continuous biomarker. Biometrics 2023; 79: 9–19. [DOI] [PubMed] [Google Scholar]
  • 16.Frieri R, Rosenberger W, Flournoy N, et al. Design considerations for two-stage enrichment clinical trials. Biometrics 2023; 79: 2565–2576. [DOI] [PubMed] [Google Scholar]
  • 17.Wang T, Wang X, George S, et al. Design and analysis of biomarker-integrated clinical trials with adaptive threshold detection and flexible patient enrichment. J Biopharm Stat 2020; 30: 1060–1076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Baldi Antognini A, Frieri R, Rosenberger W, et al. Optimal design for inference on the threshold of a biomarker. Stat Methods Med Res 2024; 33: 321–343. [DOI] [PubMed] [Google Scholar]
  • 19.Brannath W, Posch M, Bauer P. Recursive combination tests. J Am Stat Assoc 2002; 97: 236–244. [Google Scholar]
  • 20.Markus R, Pertiz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660. [Google Scholar]
  • 21.R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2021. [Google Scholar]
  • 22.Madsen H, Thyregod P. Introduction to general and generalized linear models. Boca Raton: CRC Press, 2010. [Google Scholar]
  • 23.Schnell P, Tang Q, Offen W, et al. A Bayesian credible subgroups approach to identifying patient subgroups with positive treatment effects. Biometrics 2016; 72: 1026–1036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Fuller W. Measurement error models. New York: John Wiley and Sons, 1987. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

sj-pdf-1-smm-10.1177_09622802241277764 - Supplemental material for Testing for a treatment effect in a selected subgroup

Supplemental material, sj-pdf-1-smm-10.1177_09622802241277764 for Testing for a treatment effect in a selected subgroup by Nigel Stallard in Statistical Methods in Medical Research


Articles from Statistical Methods in Medical Research are provided here courtesy of SAGE Publications

RESOURCES