Published in final edited form as: Clin Trials. 2022 May 9;19(5):512–521. doi: 10.1177/17407745221095855

A permutation procedure to detect heterogeneous treatment effects in randomized clinical trials while controlling the Type-I error rate

Jack M Wolf 1, Joseph S Koopmeiners 1, David M Vock 1
PMCID: PMC9529771  NIHMSID: NIHMS1795972  PMID: 35531765

Abstract

Background/Aims:

Secondary analyses of randomized clinical trials often seek to identify subgroups with differential treatment effects. These discoveries can help guide individual treatment decisions based on patient characteristics and identify populations for which additional treatments are needed. Traditional analyses require researchers to prespecify potential subgroups to reduce the risk of reporting spurious results. There is a need for methods that can detect such subgroups without a priori specification while allowing researchers to control the probability of falsely detecting heterogeneous subgroups when treatment effects are uniform across the study population.

Methods:

We propose a permutation procedure for tuning parameter selection that allows for Type-I error control when testing for heterogeneous treatment effects framed within the Virtual Twins procedure for subgroup identification. We verify that the Type-I error rate can be controlled at the nominal rate and investigate the power for detecting heterogeneous effects when present through extensive simulation studies. We apply our method to a secondary analysis of data from a randomized trial of very low nicotine content cigarettes.

Results:

In the absence of Type-I error control, the observed Type-I error rate for Virtual Twins was between 99 and 100%. In contrast, models tuned via the proposed permutation procedure were able to control the Type-I error rate and detect heterogeneous effects when present. An application of our approach to a recently completed trial of very low nicotine content cigarettes identified several variables with potentially heterogeneous treatment effects.

Conclusions:

The proposed permutation procedure allows researchers to engage in secondary analyses of clinical trials for treatment effect heterogeneity while maintaining the Type-I error rate without pre-specifying subgroups.

Keywords: permutation test, subgroup identification, treatment effect heterogeneity, Type-I error, Virtual Twins

Introduction

The primary objective of a randomized controlled trial (RCT) is typically to estimate and test the marginal treatment effect (i.e., the average treatment effect aggregated across the entire population). However, identifying subgroups with a differential response to treatment has long been an important scientific and secondary aim, which has grown in importance in the era of personalized medicine. This focus is motivated by the idea that the treatment effect may vary from individual to individual across the population, commonly referred to as treatment effect heterogeneity. Towards this aim, researchers are interested in identifying characteristics that can explain differences in the treatment effect and subgroups for which the treatment effect is different than the effect in the broader population. Characterizing treatment effect heterogeneity can help effectuate personalized medicine, deepen our understanding of possible treatment mechanisms, and suggest subgroups which may benefit from different or more intensive intervention. For example, in studies evaluating proposed regulations that would affect all members of a population, such as trials of very low nicotine content cigarettes under the broader umbrella of tobacco regulatory science1 which motivate our work here, it is important to identify subgroups that may not have an ideal response to the policy so that additional targeted interventions can be developed in support of these populations2.

Traditionally, subgroup analyses in an RCT must be pre-specified in the statistical analysis plan. Although pre-specification permits easier control of the family-wise error rate, the number of pre-specified subgroups is typically limited, involving only a small number of covariates, and any categorization of a continuous variable must be pre-specified as well.

Given the limitations of pre-specifying subgroups, many statistical methods for evaluating effect heterogeneity and discovering subgroups using flexible, data-adaptive methods have been proposed and studied. Existing tree-based methods include interaction trees3, honest causal trees4, and GUIDE5. Moreover, there exist ensemble methods that leverage the estimates of many models such as random forests of interaction trees6 and STIMA7, which combines a multiple linear regression model with a regression tree to detect interaction effects. Bayesian approaches have also been proposed8,9.

While identifying heterogeneous subgroups is an important scientific aim, there is a long history of finding spurious subgroups that cannot be replicated in subsequent trials10–14. Thus, researchers are naturally concerned with the a priori probability of incorrectly detecting subgroups with differential treatment effects when there is a uniform treatment effect across the study population, and whether this probability can be controlled while maintaining sufficient power to detect heterogeneity when present. This is a particular concern for many data-adaptive approaches, for which such control may be difficult to implement.

In this paper, we propose a permutation procedure for identifying treatment effect heterogeneity with Type-I error rate control. We frame our approach in the context of Virtual Twins15, a popular two-stage approach to subgroup detection that has been widely used and discussed since its original publication16–20. Despite the method's wide usage, there is currently little guidance on how to select the penalty parameters needed to fit such models. Many researchers are left using default software settings, which are typically selected with predictive performance in mind. We address this limitation by conceptualizing Virtual Twins as a hypothesis testing procedure and showing how parameters can be tuned to accurately control the associated Type-I error rate.

The rest of the paper proceeds as follows. First, we establish notation and review Virtual Twins as originally proposed. Next, we propose a permutation procedure to assist tuning parameter selection and Type-I error control. Then, we detail several simulation studies to assess our proposed method’s performance in a variety of scenarios. Finally, we apply this method to data from an RCT of very low nicotine content cigarettes to describe patient characteristics that may impact smokers’ individual responses to the intervention.

Methods

Notation

Consider the data $(Y_i, T_i, X_i)$, $i = 1, \ldots, n$, from an RCT with response $Y_i$, binary treatment indicator $T_i$, and covariates $X_i = (X_{1i}, \ldots, X_{pi})$ (which may be continuous or categorical). We write the conditional mean of $Y_i$ given $T_i$ and $X_i$ as

$$E(Y_i \mid T_i, X_i) = h(X_i) + T_i g(X_i) \quad (1)$$

We assume that $T_i$ is associated with $Y_i$ only through its first moment and that it does not affect the conditional variance or any higher moments of the response. However, we do not require such assumptions about the relationship between all other covariates and the response. The conditional average treatment effect for each subject is $E(Y_i \mid T_i = 1, X_i) - E(Y_i \mid T_i = 0, X_i) = g(X_i)$, which we denote as $Z_i$.

Virtual Twins

Virtual Twins is a two-step approach that first estimates $Z_i$, typically using flexible regression techniques, and then models the estimated $\hat{Z}_i$ using parsimonious and interpretable models. Although in principle an analyst could fit the main effect and interpretable treatment models simultaneously, the two-step procedure provides maximum flexibility to explain variability in the main effects in Step 1 while providing an interpretable model in Step 2.

Step 1

Step 1 consists of estimating subjects' outcomes under both the control and treatment arms. This is accomplished by splitting the data based on the value of $T_i$ and independently fitting flexible regression models $\hat{f}_0(X_i)$ and $\hat{f}_1(X_i)$ to estimate $E(Y_i \mid T_i = 0, X_i)$ and $E(Y_i \mid T_i = 1, X_i)$, respectively. Each subject's estimated conditional average treatment effect is then given by $\hat{Z}_i = \hat{f}_1(X_i) - \hat{f}_0(X_i)$. The original paper proposed using random forests21 to estimate these quantities, but other authors22 have investigated additional approaches to estimating the response surface in Step 1, including linear models fit using the lasso23, MARS24, and super learner25.
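As a concrete illustration, a minimal R sketch of Step 1 with arm-specific random forests is given below; the helper name step1_rf() and its arguments are our own illustrative choices and are not drawn from the original method's software.

```r
# Minimal sketch of Step 1: fit a flexible model within each arm and take the
# difference of the predicted counterfactual means (illustrative names).
library(randomForest)

step1_rf <- function(Y, Trt, X) {
  f0 <- randomForest(x = X[Trt == 0, , drop = FALSE], y = Y[Trt == 0], ntree = 1000)
  f1 <- randomForest(x = X[Trt == 1, , drop = FALSE], y = Y[Trt == 1], ntree = 1000)
  # Z-hat_i = f1-hat(X_i) - f0-hat(X_i) for every subject, regardless of arm
  predict(f1, X) - predict(f0, X)
}
```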

Step 2

In Step 2 the analyst uses a simple and interpretable model, such as a regression tree, to model the estimated conditional average treatment effect $\hat{Z}_i$ as a function of the covariates $X_i$. Variables included in this model are used to determine which covariates modify the treatment effect and to identify subgroups of patients with homogeneous treatment effects. The original paper supports both regression and classification trees. Others22 have explored using the lasso and conditional inference trees as possible Step 2 methods.
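A matching sketch of a regression-tree Step 2 model is below; cp is rpart's complexity parameter, which is the penalty our procedure later tunes, and the helper name is again illustrative.

```r
# Minimal sketch of Step 2: an interpretable regression tree for the estimated
# conditional average treatment effects (illustrative names).
library(rpart)

step2_tree <- function(Zhat, X, cp) {
  dat <- data.frame(Zhat = Zhat, X)
  # Covariates that appear as splits are the candidate treatment effect modifiers
  rpart(Zhat ~ ., data = dat, control = rpart.control(cp = cp))
}
```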

Type-I error rate control

The original presentation of Virtual Twins encourages fitting the Step 2 models using a fixed list of tuning parameters. While this approach has shown acceptable performance, data-adaptive methods for parameter selection based on performance metrics may be advantageous. However, standard data-adaptive methods typically select tuning parameters to maximize predictive performance, which may not be optimal in this context. First, such an approach is not guaranteed to control the Type-I error rate for detecting heterogeneous treatment effects. Second, the estimated conditional average treatment effect (i.e., $\hat{Z}_i$) is a deterministic function of the features and, therefore, data-adaptive methods are likely to overfit the data. To address these limitations, we propose a permutation-based framework to identify appropriate penalty parameters for a variety of Step 2 methods so as to maintain the Type-I error rate for concluding heterogeneity.

Framing as a hypothesis test

Controlling the Type-I error rate requires that we first frame Virtual Twins as a hypothesis test with a null hypothesis of a homogeneous treatment effect. This null hypothesis implies that each subject's treatment effect is equal, with $g(X_i) = \Delta$ for all $i$, so an individual's conditional average response simplifies to:

$$E(Y_i \mid T_i, X_i) = h(X_i) + T_i \Delta \quad (2)$$

We will reject the null hypothesis if the Step 2 model estimating $g(X_i)$ includes any covariates (e.g., a tree with at least one split). Thus, a Type-I error corresponds to rejecting the null hypothesis when $g(X_i)$ is constant.
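For a regression-tree Step 2 model, this rejection rule reduces to checking whether the fitted tree contains at least one split; a small sketch with an illustrative helper name is:

```r
# Reject the null of a homogeneous treatment effect if the Step 2 tree splits
# on any covariate (rpart records "<leaf>" for terminal nodes in frame$var).
detect_heterogeneity <- function(step2_fit) {
  split_vars <- setdiff(as.character(step2_fit$frame$var), "<leaf>")
  length(split_vars) > 0
}
```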

Permutation procedure

We consider a class of Step 2 methods that are fit by specifying a single penalty parameter and for which, for any fixed data set, there exists a sufficiently large penalty parameter such that the fitted model is constant for all inputs. Examples of such methods include regression trees26, conditional inference trees27, and the lasso23. We will henceforth refer to the smallest penalty parameter that achieves this constant model for a fixed data set as the minimal null penalty parameter. Formally, we let $\hat{\theta}_N = \min\{\theta : \hat{g}(X_i, \theta) = d \text{ for all } i\}$ be the minimal null penalty parameter of a given data set, where $\hat{g}(X_i, \theta)$ is the Step 2 model fit to that data set with penalty parameter $\theta$ and $d$ is some constant.
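For instance, with an rpart regression tree as the Step 2 model, the minimal null penalty parameter can be read from the fitted tree's complexity table: grow a dense tree and take the cp value recorded for the root-only tree, above which pruning removes every split. The sketch below is illustrative (the helper name min_null_cp() is ours, not the package's), and analogous quantities exist for the lasso and conditional inference trees.

```r
# Minimal null penalty parameter for a regression-tree Step 2 model:
# grow a dense tree (cp = 0) and take the complexity value recorded for the
# root-only tree (nsplit == 0); any cp above this yields a constant model.
library(rpart)

min_null_cp <- function(Zhat, X) {
  dat <- data.frame(Zhat = Zhat, X)
  fit <- rpart(Zhat ~ ., data = dat, control = rpart.control(cp = 0))
  cptab <- fit$cptable
  unname(cptab[cptab[, "nsplit"] == 0, "CP"])
}
```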

We wish to identify the penalty parameter $\theta_\alpha$ such that the procedure's Type-I error rate is at most $\alpha$ for any fixed $0 \le \alpha \le 1$. We estimate this parameter using a permutation procedure. Permutation tests are typically used to achieve exact inference in small samples by deriving the null distribution of the test statistic without model-based assumptions. Our procedure departs slightly from this framework to describe the null distribution of the minimal null penalty parameter, $\hat{\theta}_N$, which is then used for parameter selection.

The procedure can be summarized as first permuting the treatment indicators to preserve the covariate main effects while eliminating potential treatment by covariate interactions, then fitting the Step 1 model to this permuted data to estimate the conditional average treatment effect under the null model, and finally fitting the Step 2 model to calculate $\hat{\theta}_N$ for the permuted data.

The proposed algorithm is as follows. First, calculate the estimated mean treatment effect $\hat{\Delta}$ and obtain $\tilde{Y}_i = Y_i - \hat{\Delta} I(T_i = 1)$ to set the mean treatment effect to zero before permuting the treatment indicators. Note that testing for heterogeneity with $Y_i$ is equivalent to doing so with $\tilde{Y}_i$. Then, for each $j$, where $j = 1, \ldots, m$ for some large $m$:

  1. Randomly permute the treatment indicator variables of the original data set to obtain $(\tilde{Y}_i, T_i^{(j)}, X_i)$.

  2. Fit the Step 1 model on the permuted data to estimate $\hat{Z}_i^{(j)}$ for each $i$.

  3. Let $\hat{\theta}_N^{(j)}$ be the minimal null penalty parameter for the Step 2 model fit to $\hat{Z}_i^{(j)}$.

Then, let $\hat{\theta}_\alpha$ be the $1 - \alpha$ percentile of $\hat{\theta}_N^{(1)}, \ldots, \hat{\theta}_N^{(m)}$ and fit the Step 2 model for the $\hat{Z}_i$ estimated through Step 1 on the original data using the penalty parameter $\hat{\theta}_\alpha$.
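The three numbered steps above translate directly into code. The R sketch below strings together the illustrative helpers from the earlier sketches (step1_rf(), step2_tree(), min_null_cp()); it is a minimal illustration of the procedure with a regression-tree Step 2 model, not the tehtuner implementation.

```r
# Permutation procedure: tune the Step 2 penalty so that the Type-I error rate
# for detecting treatment effect heterogeneity is approximately alpha.
permutation_tune <- function(Y, Trt, X, alpha = 0.05, m = 200) {
  # Remove the estimated marginal treatment effect (difference in arm means)
  # before permuting the treatment indicators
  delta_hat <- mean(Y[Trt == 1]) - mean(Y[Trt == 0])
  Ytilde    <- Y - delta_hat * (Trt == 1)

  theta_null <- replicate(m, {
    Trt_perm  <- sample(Trt)                    # 1. permute treatment labels
    Zhat_perm <- step1_rf(Ytilde, Trt_perm, X)  # 2. Step 1 on permuted data
    min_null_cp(Zhat_perm, X)                   # 3. minimal null penalty parameter
  })

  # Tuned penalty: the (1 - alpha) percentile of the null distribution
  theta_alpha <- unname(quantile(theta_null, 1 - alpha))

  # Final Step 2 fit on the original (unpermuted) data with the tuned penalty;
  # any split in this tree is evidence of treatment effect heterogeneity
  Zhat <- step1_rf(Ytilde, Trt, X)
  list(theta_alpha = theta_alpha, fit = step2_tree(Zhat, X, cp = theta_alpha))
}
```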

This permutation procedure can be reframed as a permutation test to leverage existing theory. Consider the test statistic $\hat{\theta}_N$ and the accompanying hypothesis test that rejects the null hypothesis of no effect heterogeneity if $\hat{\theta}_N > \theta_\alpha$. We wish to identify the largest critical value $\theta_\alpha$ such that $\Pr(\hat{\theta}_N > \theta_\alpha \mid H_0) \le \alpha$. This can be accomplished by taking the $1 - \alpha$ percentile of the null distribution of $\hat{\theta}_N$, which we generate via the permutation procedure. This permutation test is valid because, after permuting, the data are equivalent in distribution to data generated under the null hypothesis that $E(\tilde{Y}_i \mid T_i, X_i) = h(X_i)$ (recall that the main effect of treatment is 0 with $\tilde{Y}_i$).

Simulation studies

Simulation study design

Data were generated from the model $Y_i = h(X_i) + T_i g(X_i) + \epsilon_i$, where $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, for $i = 1, \ldots, 1000$. Covariate vectors $X_i$ consisted of $p$ covariates with $p = 10, 20, 50$ and continuous and binary variables in a 4:1 ratio. Continuous covariates were generated from a multivariate normal distribution with an AR(1) correlation structure, and binary covariates were simulated from independent Bernoulli distributions. Treatment indicators were randomly assigned to give a 1:1 allocation ratio with 500 patients per arm.

Each subject's conditional average treatment effect was given via $g(X_i)$. We considered a null scenario where $g(X_i)$ was constant for all $X_i$; under this scenario we expected our permutation procedure to falsely detect heterogeneity with probability $\alpha$. We also simulated scenarios where $g$ was a linear or nonlinear function of $X_i$ to assess our permutation procedure's power. The nonlinear $g$ determined the conditional average treatment effect by partitioning the covariate space through several splits at the covariates' true mean values along with one or two interaction terms, depending on the number of covariates. The function $h(X_i)$ can be viewed as a patient's expected outcome under the control condition. We examined scenarios where $h$ was both a linear and a nonlinear function of the covariates. The nonlinear $h$ consisted of linear and quadratic terms, binary indicators for whether a covariate was above its true mean value, and covariate-by-covariate interactions.
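To make the data-generating model concrete, the sketch below draws one data set under the p = 10 null scenario; the specific constants follow our reading of Table 1 and should be taken as an illustrative approximation rather than the authors' exact generator.

```r
# One simulated trial under the null (homogeneous) scenario with p = 10:
# eight correlated continuous covariates, two binary covariates, linear main
# effects, a constant treatment effect g0(X) = 2, and epsilon ~ N(0, 4).
library(MASS)  # for mvrnorm()

n <- 1000
mu    <- rnorm(8, mean = 0, sd = sqrt(3))      # mu ~ N(0, 3 I_8)
Sigma <- 0.7^abs(outer(1:8, 1:8, "-"))         # AR(1) correlation, rho = 0.7
Xc <- mvrnorm(n, mu = mu, Sigma = Sigma)       # continuous covariates
Xb <- matrix(rbinom(2 * n, 1, 0.7), nrow = n)  # binary covariates
X  <- cbind(Xc, Xb)
Trt <- sample(rep(0:1, each = n / 2))          # 1:1 allocation, 500 per arm
h   <- drop(X %*% rep(1.25, 10))               # linear main effects (as read from Table 1)
Y   <- h + 2 * Trt + rnorm(n, sd = 2)          # g0(X) = 2, sigma^2 = 4
```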

We completed 2000 simulated trials for every combination of $p$, $g$, and $h$. Full simulation details are available in Table 1. The average $R^2$ when ignoring any treatment by covariate interactions was about 0.7 in each simulation, to resemble our motivating data. Cohen's $f^2$, which measures the effect size of the interaction, ranged from 0.01 to 0.03 under simulations with effect heterogeneity. Additional details are provided in the supplementary materials. Table S1 displays the average $R^2$ for each scenario, both when ignoring and when including treatment by covariate interactions, as well as Cohen's $f^2$. In addition, we ran a similar simulation study with increased residual variance to achieve an $R^2$ of about 0.2 in all simulated datasets.

Table 1.

Simulation study details. We carried out 2000 simulations under every combination of $h(X_i)$, $g(X_i)$, and number of covariates $p$. Patient outcomes were generated through the model $Y_i = h(X_i) + T_i g(X_i) + \epsilon_i$, where $\epsilon_i \overset{iid}{\sim} N(0, 4)$ and $n = 1000$.

Covariates
 Continuous
  p = 10: $(X_{i1}, \ldots, X_{i8})^T \overset{iid}{\sim} N(\mu, \Sigma)$, $\mu \sim N(0, 3I_8)$, $\Sigma = \mathrm{AR}(1, \rho = 0.7)$
  p = 20: $(X_{i1}, \ldots, X_{i16})^T \overset{iid}{\sim} N(\mu, \Sigma)$, $\mu \sim N(0, 3I_{16})$, $\Sigma = \mathrm{AR}(1, \rho = 0.7)$
  p = 50: $(X_{i1}, \ldots, X_{i40})^T \overset{iid}{\sim} N(\mu, \Sigma)$, $\mu \sim N(0, 3I_{40})$, $\Sigma = \mathrm{AR}(1, \rho = 0.7)$
 Binary
  p = 10: $X_{i9}, X_{i10} \overset{iid}{\sim} \mathrm{Bernoulli}(0.7)$
  p = 20: $X_{i17}, \ldots, X_{i20} \overset{iid}{\sim} \mathrm{Bernoulli}(0.7)$
  p = 50: $X_{i41}, \ldots, X_{i50} \overset{iid}{\sim} \mathrm{Bernoulli}(0.7)$
Covariate Main Effects
 Linear
  p = 10: $h_1(X_i) = X_i\beta$, $\beta_j = 1.25$ for $j = 1, \ldots, 10$
  p = 20: $h_1(X_i) = X_i\beta$, $\beta_j = 1$ for $j = 1, \ldots, 10, 17, 18$; $0$ otherwise
  p = 50: $h_1(X_i) = X_i\beta$, $\beta_j = 1$ for $j = 1, \ldots, 12, 41, 42, 43$; $0$ otherwise
 Nonlinear
  p = 10: $h_2(X_i) = X_{i1} + X_{i2} + X_{i9} + X_{i10} + 2\sum_{j=3}^{5}(X_{ij} - \mu_j)^2 + 2\sum_{j=6}^{8} I(X_{ij} > \mu_j) + \tfrac{1}{2}(X_{i1} - \mu_1)(X_{i2} - \mu_2) - (X_{i1} - \mu_1)X_{i9}$
  p = 20: $h_2(X_i) = X_{i1} + X_{i2} + X_{i17} + X_{i18} + \tfrac{5}{4}\sum_{j=3}^{6}(X_{ij} - \mu_j)^2 + \tfrac{5}{4}\sum_{j=7}^{10} I(X_{ij} > \mu_j) + \tfrac{1}{2}(X_{i1} - \mu_1)(X_{i2} - \mu_2) - (X_{i1} - \mu_1)X_{i17}$
  p = 50: $h_2(X_i) = X_{i1} + X_{i2} + X_{i41} + X_{i42} + X_{i43} + \tfrac{5}{4}\sum_{j=3}^{7}(X_{ij} - \mu_j)^2 + \tfrac{5}{4}\sum_{j=8}^{12} I(X_{ij} > \mu_j) + \tfrac{1}{2}(X_{i1} - \mu_1)(X_{i2} - \mu_2) - (X_{i1} - \mu_1)X_{i42}$
Conditional Average Treatment Effect
 Null
  p = 10, 20, 50: $g_0(X_i) = 2$
 Linear
  p = 10: $g_1(X_i) = m + X_i\beta$, $\beta_j = 1/2$ for $j = 1, 9$; $0$ otherwise; $m = 2 - \tfrac{1}{n}\sum_{i=1}^{n} X_i\beta$
  p = 20: $g_1(X_i) = m + X_i\beta$, $\beta_j = 1/2$ for $j = 1, 2, 10, 17$; $0$ otherwise; $m = 2 - \tfrac{1}{n}\sum_{i=1}^{n} X_i\beta$
  p = 50: $g_1(X_i) = m + X_i\beta$, $\beta_j = 1/2$ for $j = 1, 2, 10, 41$; $0$ otherwise; $m = 2 - \tfrac{1}{n}\sum_{i=1}^{n} X_i\beta$
 Nonlinear
  p = 10: $g_2(X_i) = m + \gamma(X_i)$, $\gamma(X_i) = \tfrac{1}{2}X_{i9} + I(X_{i1} > \mu_1) + \tfrac{1}{8}(X_{i1} - \mu_1)X_{i9}$, $m = 2 - \tfrac{1}{n}\sum_{i=1}^{n}\gamma(X_i)$
  p = 20: $g_2(X_i) = m + \gamma(X_i)$, $\gamma(X_i) = \tfrac{1}{2}X_{i17} + I(X_{i1} > \mu_1) + \tfrac{1}{2}I(X_{i2} > \mu_2) + \tfrac{1}{4}I(X_{i10} > \mu_{10}) + \tfrac{1}{4}(X_{i1} - \mu_1)X_{i17}$, $m = 2 - \tfrac{1}{n}\sum_{i=1}^{n}\gamma(X_i)$
  p = 50: $g_2(X_i) = m + \gamma(X_i)$, $\gamma(X_i) = \tfrac{1}{2}X_{i41} + I(X_{i1} > \mu_1) + \tfrac{1}{2}I(X_{i2} > \mu_2) + \tfrac{1}{4}I(X_{i10} > \mu_{10}) + \tfrac{1}{4}(X_{i1} - \mu_1)X_{i41} + \tfrac{1}{8}(X_{i2} - \mu_2)(X_{i10} - \mu_{10})$, $m = 2 - \tfrac{1}{n}\sum_{i=1}^{n}\gamma(X_i)$

Methods Considered

We implemented our permutation procedure using various methods for fitting the Step 1 and Step 2 models, the performance of which has been evaluated in the context of Virtual Twins22.

In Step 1 we considered using random forests and super learner25. Random forests consisted of 1000 trees. The library for the super learner models used a linear model tuned using a lasso (L1) penalty, MARS24, and a random forest.

We considered the lasso23, regression trees26, and conditional inference trees27 as Step 2 methods, all of which require the specification of a penalty parameter before they can be fit, which we tuned to control the Type-I error rate. The lasso is a linear regression method that penalizes the L1-norm of the non-intercept regression coefficients to perform variable selection and regularization; our permutation procedure tuned the weight of this penalty term. Regression trees recursively partition the covariate space to identify subgroups of the data with similar response values. A dense tree is fit and then the optimal subtree is selected to minimize a weighted loss function that combines the mean squared error and the number of terminal nodes; the weight assigned to the number of terminal nodes is the complexity parameter, which we tuned through our procedure. Like regression trees, conditional inference trees also recursively partition the covariate space to identify groups with similar responses. Unlike regression trees, which are fit using measures of information, conditional inference trees use a significance test for variable selection; the tree will split only if a significance test comparing the mean outcome between the two candidate subgroups has a test statistic greater than some pre-specified threshold, which we tuned through our permutation procedure.
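As an illustration of where the tuned parameter enters each method, the sketch below maps a tuned value theta onto the corresponding argument of standard R fitting functions (glmnet's lambda, rpart's cp, and partykit's mincriterion). The wrapper itself is our own illustrative construction assuming numeric covariates; in practice theta is produced on each method's own scale by the permutation procedure.

```r
# Where the tuned penalty parameter enters each candidate Step 2 method.
library(glmnet)    # lasso
library(rpart)     # regression tree
library(partykit)  # conditional inference tree

fit_step2 <- function(Zhat, X, method = c("lasso", "rtree", "ctree"), theta) {
  method <- match.arg(method)
  dat <- data.frame(Zhat = Zhat, X)
  switch(method,
    # Weight on the L1 penalty of the non-intercept coefficients
    lasso = glmnet(x = as.matrix(X), y = Zhat, lambda = theta),
    # Complexity parameter weighting the number of terminal nodes
    rtree = rpart(Zhat ~ ., data = dat, control = rpart.control(cp = theta)),
    # Threshold the split criterion must exceed before any split is made
    ctree = partykit::ctree(Zhat ~ ., data = dat,
                            control = partykit::ctree_control(mincriterion = theta))
  )
}
```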

We tested all combinations of the discussed Step 1 and Step 2 models using our permutation procedure with α = 0.2 and α = 0.05 and using standard parameter selection techniques. Additional details are available in Table S2.

Summaries of performance

We evaluated the performance of our permutation procedure using the following metrics:

  • Type-I error rate and power: For each simulated trial we recorded whether the Step 2 model included any covariates or not. If the model included at least one covariate, we concluded that the model detected treatment effect heterogeneity. The average of this value across many simulations corresponds to the Type-I error rate or the power, depending on the absence or presence of treatment effect heterogeneity, respectively.

  • Sensitivity and specificity: Consider the partition $X_i = (X_i^{\text{Heterogeneous}}, X_i^{\text{Constant}})$ such that $g(X_i) = g(X_i^{\text{Heterogeneous}})$ for all $i$. We calculated the proportion of covariates in $X_i^{\text{Heterogeneous}}$ that were included in the Step 2 model (sensitivity) as well as the proportion of covariates in $X_i^{\text{Constant}}$ that were not included (specificity).

  • Individual treatment effect mean squared error: We assessed accuracy in modeling the conditional average treatment effect by calculating the mean squared error $\sum_{i=1}^{n}[\hat{g}(X_i) - Z_i]^2 / n$, where $\hat{g}(X_i)$ is the fitted Step 2 model.
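A small sketch of how these per-trial summaries might be computed, given the set of covariates selected by the Step 2 model, the truly heterogeneous covariates, and the fitted and true conditional average treatment effects (all names illustrative):

```r
# Per-trial performance summaries: detection of heterogeneity, sensitivity and
# specificity of variable selection, and MSE of the estimated CATE.
trial_metrics <- function(selected, heterogeneous, all_covariates, g_hat, Z) {
  constant <- setdiff(all_covariates, heterogeneous)
  c(detected    = length(selected) > 0,
    sensitivity = mean(heterogeneous %in% selected),
    specificity = mean(!(constant %in% selected)),
    mse         = mean((g_hat - Z)^2))
}
```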

Results

Simulation Results

Table 2 summarizes the proportion of simulated trials in which at least one covariate was included in the Step 2 model across different data generating models for g(Xi), h(Xi) and number of covariates. In scenarios with a homogeneous treatment effect, this corresponds to the Type-I error rate. Implementations using standard approaches to selecting tuning parameters had Type-I error rates of nearly 100%. In contrast, when using our permutation procedure, we observed empirical Type-I error rates approximately equal to the targeted values. This metric is a model’s empirical power in scenarios with effect heterogeneity. Across all such scenarios, the highest power was obtained when using a super learner model in Step 1 and the lasso to fit the Step 2 model (regardless of whether g(Xi) was linear or nonlinear). These trends held regardless of the number of covariates but the power for a specific combination of methods tended to increase with the number of covariates (See Table S3 for results when p = 20). We observed similar trends with less power in supplemental simulations with a lower overall R2 value (Table S4).

Table 2.

Proportion of simulations in which at least one covariate was included in the Step 2 model across all tested combinations of Step 1 (columns) and Step 2 (rows) methods. This corresponds to the Type I error rate for scenarios with homogeneous treatment effects and the power otherwise.

	p = 10	p = 50
Step 2 method	RF	SL	RF	SL
Homogeneous Treatment Effects; Linear Main Effects

 LASSO(α = 0.2) 0.22* 0.21 0.19 0.18*
 R.Tree(α = 0.2) 0.21 0.21 0.20 0.21
 C.Tree(α = 0.2) 0.22* 0.22* 0.19 0.20
 LASSO(α = 0.05) 0.05 0.06 0.05 0.05
 R.Tree(α = 0.05) 0.05 0.06 0.06 0.06
 C.Tree(α = 0.05) 0.05 0.06 0.05 0.04
Homogeneous Treatment Effects; Nonlinear Main Effects

 LASSO(α = 0.2) 0.19 0.22 0.21 0.20
 R.Tree(α = 0.2) 0.19 0.20 0.21 0.20
 C.Tree(α = 0.2) 0.19 0.21 0.21 0.19
 LASSO(α = 0.05) 0.06 0.05 0.06* 0.06
 R.Tree(α = 0.05) 0.06 0.06 0.06 0.06
 C.Tree(α = 0.05) 0.06 0.06 0.06* 0.05
Linear Treatment Effects; Linear Main Effects

 LASSO(α = 0.2) 0.45 0.55 0.76 0.92
 R.Tree(α = 0.2) 0.39 0.34 0.68 0.71
 C.Tree(α = 0.2) 0.44 0.39 0.74 0.79
 LASSO(α = 0.05) 0.20 0.30 0.48 0.77
 R.Tree(α = 0.05) 0.15 0.09 0.40 0.33
 C.Tree(α = 0.05) 0.19 0.15 0.48 0.47
Linear Treatment Effects; Nonlinear Main Effects

 LASSO(α = 0.2) 0.32 0.43 0.66 0.83
 R.Tree(α = 0.2) 0.22 0.25 0.46 0.59
 C.Tree(α = 0.2) 0.31 0.41 0.64 0.80
 LASSO(α = 0.05) 0.09 0.19 0.34 0.64
 R.Tree(α = 0.05) 0.06 0.08 0.18 0.29
 C.Tree(α = 0.05) 0.09 0.16 0.32 0.57
Nonlinear Treatment Effects; Linear Main Effects

 LASSO(α = 0.2) 0.57 0.71 0.57 0.77
 R.Tree(α = 0.2) 0.52 0.46 0.54 0.61
 C.Tree(α = 0.2) 0.55 0.42 0.56 0.60
 LASSO(α = 0.05) 0.30 0.44 0.31 0.52
 R.Tree(α = 0.05) 0.28 0.16 0.28 0.29
 C.Tree(α = 0.05) 0.27 0.15 0.30 0.29
Nonlinear Treatment Effects; Nonlinear Main Effects

 LASSO(α = 0.2) 0.39 0.65 0.46 0.68
 R.Tree(α = 0.2) 0.25 0.36 0.34 0.45
 C.Tree(α = 0.2) 0.38 0.59 0.45 0.64
 LASSO(α = 0.05) 0.12 0.36 0.19 0.41
 R.Tree(α = 0.05) 0.06 0.11 0.11 0.18
 C.Tree(α = 0.05) 0.12 0.28 0.19 0.35
* For simulations with homogeneous treatment effects, the 95% CI does not include α.

Abbreviations: RF: random forest; SL: super learner; R.Tree: regression tree; C.Tree: conditional inference tree

Table S5 shows the sensitivity and specificity of each combination of Step 1 and Step 2 methods for each scenario. When the conditional average treatment effect was linear, the controlled lasso had the highest sensitivity of all controlled methods. When modeling a nonlinear conditional average treatment effect with Type-I error control the lasso tended to have the highest sensitivity of all methods for a given error rate. The sensitivity was relatively constant regardless of the number of covariates for all combinations of methods. Nearly all methods that controlled the Type-I error rate demonstrated near perfect specificity for all scenarios. When the Type-I error rate was not controlled, the specificity ranged from 0.01 to 0.96 and was lowest when the lasso was used in Step 2.

Table S6 displays the estimated mean squared error for the subject-specific conditional average treatment effect for all combinations of Step 1 and Step 2 methods. In the absence of treatment effect heterogeneity, the methods that did not attempt to control the Type-I error rate had mean squared errors substantially higher than their counterparts that controlled the error rate. When the treatment effect was heterogeneous, methods that fixed the Type-I error rate at α = 0.2 tended to have the lowest mean squared error when compared to models without Type-I error control and models with α = 0.05, for all Step 1 methods. The mean squared error increased with the number of covariates, except for models controlling the Type-I error when the treatment effect was homogeneous, for which the mean squared error remained constant as the number of covariates increased. Across all 18 simulation scenarios, using super learner and the lasso to fit the Step 1 and Step 2 models, respectively, resulted in the smallest mean squared error of all method combinations that control the Type-I error in Step 2.

Application

Smoking remains the leading cause of preventable death in the United States. Currently, researchers are considering the impact of multiple regulatory interventions to reduce the negative health effects of cigarette smoking, including reducing the nicotine content of cigarettes1,2,28–31. Reducing the nicotine content of cigarettes would impact all smokers in the United States. While RCTs have investigated the benefit of such regulations on the U.S. smoking population, on average, it is also important to identify potential subgroups that receive less benefit from or are potentially harmed by such regulations to design additional targeted interventions to reduce smoking or minimize unintended consequences.

A recent RCT1 evaluated the impact of nicotine reduction in a randomized, double-blind trial that assigned subjects to one of three interventions: 1) immediate reduction in nicotine content, 2) gradual reduction in nicotine content, and 3) maintenance of standard tobacco cigarettes (i.e., the control condition) following a 2:2:1 allocation ratio. Subjects were provided with cigarettes with nicotine content matching their treatment assignment for a 20-week intervention period, and the impact of nicotine reduction was evaluated by comparing the change in average number of cigarettes smoked per day from baseline to the last four weeks of the intervention.

Our application focused on comparisons between the gradual nicotine reduction group and the immediate nicotine reduction group as well as between the immediate reduction group and the control. We used the same modeling approaches in Step 1 and Step 2 as in our simulation study using 40 covariates that included demographic information and baseline smoking characteristics (Table S7 summarizes the study population along these covariates). We recorded which covariates were identified as having differential treatment effects in the Step 2 model.

Table 3 displays the covariates included in each model with Type-I error control when comparing the immediate and gradual groups and the immediate and control groups. The number of covariates included in each Step 2 model when the Type-I error rate was not controlled is presented in Table S8.

Table 3.

Variables determined to contribute to treatment effect heterogeneity for a given comparison of treatments on change in cigarettes per day. Covariates which had nonzero effects in the Step 2 Virtual Twins model fit for subjects’ estimated individual treatment effects with a given Step 1 model are marked with a ✓. Variables with no estimated effect for all presented models are omitted.

	Immediate vs. Gradual	Immediate vs. Control
Step 1	Step 2	TNE	CEMA	CPD	NNAL	Total	Age	Total
RF LASSO(α = 0.2) 2 0
R.Tree(α = 0.2) 1 1
C.Tree(α = 0.2) 1 0
LASSO(α = 0.05) 2 0
R.Tree(α = 0.05) 1 0
C.Tree(α = 0.05) 1 0

SL LASSO(α = 0.2) 3 0
R.Tree(α = 0.2) 1 0
C.Tree(α = 0.2) 1 0
LASSO(α = 0.05) 2 0
R.Tree(α = 0.05) 1 0
C.Tree(α = 0.05) 0 0

Abbreviations: RF: random forest; SL: super learner; R.Tree: regression tree; C.Tree: conditional inference tree; TNE: total nicotine equivalents; CEMA: cyanoethyl mercapturic acid; CPD: cigarettes per day; NNAL: 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol

Immediate versus control

Models exploring the effect of immediate reduction compared to the control detected at most one covariate (age) when controlling the Type-I error at 20%. Traditional approaches detected as many as 27 covariates. We note that due to the study design, this model had fewer observations (n = 538) than when comparing gradual to immediate reduction (n = 723) and had less power to detect differential effects.

Immediate versus gradual

When modeling the effect of immediate versus gradual reduction using a random forest for Step 1 and a regression tree for Step 2 (as done in the original Virtual Twins paper), the covariates included depend on whether the model was tuned to control the Type-I error. When the error rate was not controlled, six covariates were found to modify the treatment effect. However, when controlling the error rate at either 20% or 5%, only total nicotine equivalents was found to modify the treatment effect. All other method combinations found at least one covariate in all but one case.

Given the combination's superior performance in simulation studies, we report the results found when a super learner model in Step 1 was paired with the lasso in Step 2. We observed statistically significant treatment effect heterogeneity at the 0.05 significance level, with total nicotine equivalents (nmol/ml) and cyanoethyl mercapturic acid/creatinine urine (nmol/mg) identified as likely treatment effect modifiers; the estimated conditional average treatment effect associated with immediate nicotine reduction was $\hat{g}(X_i) = 5.78 + 0.187X_{i1} + 0.009X_{i2}$, where $X_{i1}$ and $X_{i2}$ are centered and scaled (by the IQR) measures of total nicotine equivalents and cyanoethyl mercapturic acid, respectively. The results have important implications for tobacco regulatory science. While we observed significant treatment effect heterogeneity, the average treatment effect is 5.78 cigarettes per day. In contrast, the difference in the treatment effect associated with a difference equivalent to the interquartile range for total nicotine equivalents and cyanoethyl mercapturic acid is less than one cigarette per day, which implies that the heterogeneity is small relative to the average treatment effect and that all smokers are likely to benefit from the intervention.

Discussion

We developed a permutation procedure that selects a tuning parameter which simultaneously regularizes the treatment effect heterogeneity and controls the Type-I error rate for detecting treatment effect heterogeneity in the Virtual Twins framework. This method both tests for heterogeneity and fits an estimated model for the conditional average treatment effect if there is heterogeneity. Our simulation results indicate that this procedure can control the Type-I error under a variety of null scenarios (e.g., with both linear and nonlinear covariate main effects, different numbers of measured covariates, etc.) and can detect treatment effect heterogeneity when it is present. Application to data from a recent RCT shows that when the Type-I error is not controlled, models tend to include far too many covariates to have face validity or be useful for constructing policy. Conversely, when controlling the error rate, our approach is able to detect covariates that are likely to modify the treatment effect based on our biological understanding of the intervention.

While many methods such as GUIDE, STIMA, and interaction trees have been proposed to detect subgroups with heterogeneous treatment effects or model the treatment effect, most if not all require the selection of some penalty parameter a priori. Proper specification of this parameter can control the Type-I error rate to a desired level; however, there is little to no guidance on how to select this parameter, and existing guidance often focuses on the model’s mean squared error and not its Type-I error rate.

Our approach differs by offering explicit guidelines on how to select this parameter and control the overall Type-I error rate. Moreover, we note that our permutation procedure could be adjusted to aid parameter selection for GUIDE, STIMA, and interaction trees to facilitate Type-I error control.

Other methods control the Type-I error rate when testing for the existence of treatment by covariate interactions. Some only offer a test for treatment effect heterogeneity and do not offer a method for describing the treatment by covariate interaction32,33. Additional work developed sophisticated permutation procedures to obscure treatment by covariate interactions while maintaining all other effects34. Although our proposed permutation procedure correctly controls the Type-I error of interest, these alternative permutation strategies may be more efficient and are worth investigating within our framework. Additionally, others have leveraged these permutation procedures to identify an appropriate complexity parameter for regression trees modeling the conditional average treatment effect estimated through propensity score matching, such that the Type-I error rate is maintained and the conditional average treatment effect can be modeled if heterogeneity is detected35. While our proposed method shares the goal of tuning regression trees' complexity parameters, it is far more general and can support any Step 2 method with a single tuning parameter.

While our method showed strong performance on both simulated and real data, some drawbacks are worth noting. First, although our approach controlled the Type-I error rate regardless of the methods used in Steps 1 and 2, the power depends on how well the model was matched to the true data-generating process. For example, although using the lasso for Step 2 yielded the highest power in a majority of scenarios, the power of tree-based methods tended to be closer to the (optimal) power of methods using the lasso when the true treatment effect was a nonlinear rather than linear function of the covariate space. Additionally, we note that permutation-based tests are often not efficient and that there is space to develop more powerful tests that can maintain the Type-I error rate. Moreover, while our approach maintains the Type-I error rate when testing for treatment effect heterogeneity, it does not offer any probabilistic statements about the specific variables included in the model. That is, while the probability of including any covariates under the null hypothesis is controlled, the a priori probability of selecting a specific covariate when it does not contribute to effect heterogeneity is unknown. Future work could develop a hierarchical testing procedure to first assess whether there are heterogeneous effects via our proposed method and then, if that null hypothesis is rejected, devise a way to individually test the candidate covariates detected in the final model for treatment interactions while controlling the family-wise error rate. Additionally, the permutation process itself imposes moderate computational costs. Because the data must be permuted before the Step 1 model is fit (which is often a dense ensemble method that requires multiple fittings, such as a random forest), the model must be refit for each permutation, which can lead to nontrivial computational demands. Finally, while our proposed method is situated within the context of RCTs, future work exploring the method's performance when applied to observational studies and extending the method to address any limitations in that context would be beneficial.

Although traditional guidelines recommend identifying potential subgroups a priori and testing for interactions to control the family-wise Type-I error rate36, there is space for data-driven discovery. Our proposed approach allows researchers to control the Type-I error rate of an overall test for treatment effect heterogeneity and identify covariates and/or subgroups possibly associated with differential effects for future investigation. We note that this does not alleviate the need to account for conducting multiple hypothesis tests (e.g., also testing for the marginal treatment effect). However, our approach moves us towards a principled yet data-driven approach to discovery.

Supplementary Material

suppl_material

Acknowledgements

We would like to thank our collaborator, Dr. Dorothy Hatsukami, for providing access to the data used to illustrate our method.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by the National Cancer Institute (Award Numbers R01CA214825 and R01CA225190), the National Institute on Drug Abuse (Award Numbers R01DA046320 and U54-DA031659), and the National Center for Advancing Translational Sciences (Award Number UL1TR002494). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Food and Drug Administration.

Footnotes

Software and code

The R package tehtuner implements our proposed method. The package and code to replicate the simulation studies can be downloaded at https://github.com/jackmwolf/tehtuner.

Declaration of conflicting interests

The authors declare that there is no conflict of interest.

References

1. Hatsukami Dorothy K., Luo Xianghua, Jensen Joni A., al'Absi Mustafa, Allen Sharon S., Carmella Steven G., Chen Menglan, Cinciripini Paul M., Denlinger-Apte Rachel, Drobes David J., Koopmeiners Joseph S., Lane Tonya, Le Chap T., Leischow Scott, Luo Kai, McClernon F. Joseph, Murphy Sharon E., Paiano Viviana, Robinson Jason D., Severson Herbert, Sipe Christopher, Strasser Andrew A., Strayer Lori G., Tang Mei Kuen, Vandrey Ryan, Hecht Stephen S., Benowitz Neal L., and Donny Eric C. Effect of Immediate vs Gradual Reduction in Nicotine Content of Cigarettes on Biomarkers of Smoke Exposure: A Randomized Clinical Trial. JAMA, 320(9):880–891, September 2018.
2. Carroll Dana M., Lindgren Bruce R., Dermody Sarah S., Denlinger-Apte Rachel, Egbert Andrew, Cassidy Rachel N., Smith Tracy T., Pacek Lauren R., Allen Alicia M., Tidey Jennifer W., Parks Michael J., Koopmeiners Joseph S., Donny Eric C., and Hatsukami Dorothy K. Impact of nicotine reduction in cigarettes on smoking behavior and exposure: Are there differences by race/ethnicity, educational attainment, or gender? Drug and Alcohol Dependence, 225:108756, August 2021.
3. Su Xiaogang, Tsai Chih-ling, Wang Hansheng, Nickerson David M., Li Bogong, and Rosset Saharon. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 2009.
4. Athey Susan and Imbens Guido. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, July 2016.
5. Loh Wei-Yin, He Xu, and Man Michael. A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine, 34(11):1818–1833, May 2015.
6. Su Xiaogang, Peña Annette T., Liu Lei, and Levine Richard A. Random Forests of Interaction Trees for Estimating Individualized Treatment Effects in Randomized Trials. arXiv:1709.04862 [stat], September 2017.
7. Dusseldorp Elise, Conversano Claudio, and Van Os Bart Jan. Combining an Additive and Tree-Based Regression Model Simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19(3):514–530, 2010.
8. Nugent Ciara, Guo Wentian, Müller Peter, and Ji Yuan. Bayesian Approaches to Subgroup Analysis and Related Adaptive Clinical Trial Designs. JCO Precision Oncology, 3:PO.19.00003, 2019.
9. Sugasawa Shonosuke and Noma Hisashi. Efficient screening of predictive biomarkers for individual treatment selection. Biometrics, 77(1):249–257, 2021.
10. Sun Xin, Briel Matthias, Busse Jason W, You John J, Akl Elie A, Mejza Filip, Bala Malgorzata M, Bassler Dirk, Mertz Dominik, Diaz-Granados Natalia, Vandvik Per Olav, Malaga German, Srinathan Sadeesh K, Dahm Philipp, Johnston Bradley C, Alonso-Coello Pablo, Hassouneh Basil, Truong Jessica, Dattani Neil D, Walter Stephen D, Heels-Ansdell Diane, Bhatnagar Neera, Altman Douglas G, and Guyatt Gordon H. The influence of study characteristics on reporting of subgroup analyses in randomised controlled trials: systematic review. BMJ, 342, 2011.
11. Sun Xin, Briel Matthias, Busse Jason W, You John J, Akl Elie A, Mejza Filip, Bala Malgorzata M, Bassler Dirk, Mertz Dominik, Diaz-Granados Natalia, Vandvik Per Olav, Malaga German, Srinathan Sadeesh K, Dahm Philipp, Johnston Bradley C, Alonso-Coello Pablo, Hassouneh Basil, Walter Stephen D, Heels-Ansdell Diane, Bhatnagar Neera, Altman Douglas G, and Guyatt Gordon H. Credibility of claims of subgroup effects in randomised controlled trials: systematic review. BMJ, 344, 2012.
12. Kasenda Benjamin, Schandelmaier Stefan, Sun Xin, von Elm Erik, You John, Blümle Anette, Tomonaga Yuki, Saccilotto Ramon, Amstutz Alain, Bengough Theresa, Meerpohl Joerg J, Stegert Mihaela, Olu Kelechi K, Tikkinen Kari A O, Neumann Ignacio, Carrasco-Labra Alonso, Faulhaber Markus, Mulla Sohail M, Mertz Dominik, Akl Elie A, Bassler Dirk, Busse Jason W, Ferreira-González Ignacio, Lamontagne Francois, Nordmann Alain, Gloy Viktoria, Raatz Heike, Moja Lorenzo, Rosenthal Rachel, Ebrahim Shanil, Vandvik Per O, Johnston Bradley C, Walter Martin A, Burnand Bernard, Schwenkglenks Matthias, Hemkens Lars G, Bucher Heiner C, Guyatt Gordon H, and Briel Matthias. Subgroup analyses in randomised controlled trials: cohort study on trial protocols and journal publications. BMJ, 349, 2014.
13. Wallach Joshua D., Sullivan Patrick G., Trepanowski John F., Sainani Kristin L., Steyerberg Ewout W., and Ioannidis John P. A. Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Internal Medicine, 177(4):554–560, April 2017.
14. Raghavan Sridharan, Josey Kevin, Bahn Gideon, Reda Domenic, Basu Sanjay, Berkowitz Seth A., Emanuele Nicholas, Reaven Peter, and Ghosh Debashis. Generalizability of heterogeneous treatment effects based on causal forests applied to two randomized clinical trials of intensive glycemic control. Annals of Epidemiology, July 2021.
15. Foster Jared C., Taylor Jeremy M.G., and Ruberg Stephen J. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, October 2011.
16. van Hoorn Ralph, Tummers Marcia, Booth Andrew, Gerhardus Ansgar, Rehfuess Eva, Hind Daniel, Bossuyt Patrick M., Welch Vivian, Debray Thomas P. A., Underwood Martin, Cuijpers Pim, Kraemer Helena, van der Wilt Gert Jan, and Kievit Wietkse. The development of CHAMP: a checklist for the appraisal of moderators and predictors. BMC Medical Research Methodology, 17(1):173, December 2017.
17. Kent David M., Paulus Jessica K., van Klaveren David, D'Agostino Ralph, Goodman Steve, Hayward Rodney, Ioannidis John P.A., Patrick-Lake Bray, Morton Sally, Pencina Michael, Raman Gowri, Ross Joseph S., Selker Harry P., Varadhan Ravi, Vickers Andrew, Wong John B., and Steyerberg Ewout W. The Predictive Approaches to Treatment Effect Heterogeneity (PATH) Statement. Annals of Internal Medicine, 172(1):35–45, 2020.
18. Bul Kim C. M., Doove Lisa L., Franken Ingmar H. A., van der Oord Saskia, Kato Pamela M., and Maras Athanasios. A serious game for children with attention deficit hyperactivity disorder: Who benefits the most? PLoS ONE, 13, 2018.
19. Apolo AB, Ellerton JA, Infante JR, Agrawal M, Gordon MS, Aljumaily R, Gourdin T, Dirix L, Lee KW, Taylor MH, Schöffski P, Wang D, Ravaud A, Manitz J, Pennock G, Ruisi M, Gulley JL, and Patel MR. Avelumab as second-line therapy for metastatic, platinum-treated urothelial carcinoma in the phase Ib JAVELIN Solid Tumor study: 2-year updated efficacy and safety analysis. J Immunother Cancer, 8(2), October 2020.
20. Nagpal Chirag, Wei Dennis, Vinzamuri Bhanukiran, Shekhar Monica, Berger Sara E., Das Subhro, and Varshney Kush R. Interpretable subgroup discovery in treatment effect estimation with application to opioid prescribing guidelines. In Proceedings of the ACM Conference on Health, Inference, and Learning, CHIL '20, pages 19–29, New York, NY, USA, 2020. Association for Computing Machinery.
21. Breiman Leo. Random Forests. Machine Learning, 45(1):5–32, October 2001.
22. Deng Chuyu, Vock David M., Carroll Dana M., Boatman Jeffrey A., Hatsukami Dorothy K., Leng Ning, and Koopmeiners Joseph S. Practical Guidance on Modeling Choices for the Virtual Twins Method. arXiv:2111.08741 [stat], November 2021.
23. Tibshirani Robert. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
24. Friedman Jerome H. Multivariate Adaptive Regression Splines. The Annals of Statistics, 19(1):1–67, March 1991.
25. van der Laan Mark J., Polley Eric C., and Hubbard Alan E. Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1), September 2007.
26. Breiman Leo, Friedman Jerome H., Olshen Richard A., and Stone Charles J. Classification And Regression Trees. Routledge, Boca Raton, October 2017.
27. Hothorn Torsten, Hornik Kurt, and Zeileis Achim. Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3):651–674, September 2006.
28. Hatsukami Dorothy K., Kotlyar Michael, Hertsgaard Louise A., Zhang Yan, Carmella Steven G., Jensen Joni A., Allen Sharon S., Shields Peter G., Murphy Sharon E., Stepanov Irina, and Hecht Stephen S. Reduced nicotine content cigarettes: effects on toxicant exposure, dependence and cessation. Addiction (Abingdon, England), 105(2):343–355, February 2010.
29. Benowitz Neal L., Dains Katherine M., Hall Sharon M., Stewart Susan, Wilson Margaret, Dempsey Delia, and Jacob Peyton. Smoking behavior and exposure to tobacco toxicants during 6 months of smoking progressively reduced nicotine content cigarettes. Cancer Epidemiology, Biomarkers & Prevention, 21(5):761–769, May 2012.
30. Hatsukami Dorothy K., Hertsgaard Louise A., Vogel Rachel I., Jensen Joni A., Murphy Sharon E., Hecht Stephen S., Carmella Steven G., al'Absi Mustafa, Joseph Anne M., and Allen Sharon S. Reduced Nicotine Content Cigarettes and Nicotine Patch. Cancer Epidemiology and Prevention Biomarkers, 22(6):1015–1024, June 2013.
31. Benowitz Neal L., Nardone Natalie, Dains Katherine M., Hall Sharon M., Stewart Susan, Dempsey Delia, and Jacob Peyton. Effect of Reducing the Nicotine Content of Cigarettes on Cigarette Smoking Behavior and Tobacco Smoke Toxicant Exposure: Two Year Follow Up. Addiction (Abingdon, England), 110(10):1667–1675, October 2015.
32. Crump Richard K., Hotz V. Joseph, Imbens Guido W., and Mitnik Oscar A. Nonparametric Tests for Treatment Effect Heterogeneity. Review of Economics and Statistics, 90(3):389–405, August 2008.
33. Chang Chi, Jaki Thomas, Sadiq Muhammad Saad, Kuhlemeier Alena, Feaster Daniel, Cole Natalie, Lamont Andrea, Oberski Daniel, Desai Yasin, Van Horn M. Lee, and The Pooled Resource Open-Access ALS Clinical Trials Consortium. A permutation test for assessing the presence of individual differences in treatment effects. Statistical Methods in Medical Research, 30(11):2369–2381, 2021.
34. Foster Jared C., Nan Bin, Shen Lei, Kaciroti Niko, and Taylor Jeremy M. G. Permutation Testing for Treatment–Covariate Interactions and Subgroup Identification. Statistics in Biosciences, 8(1):77–98, June 2016.
35. Chen Jianshen and Keller Bryan. Heterogeneous Subgroup Identification in Observational Studies. Journal of Research on Educational Effectiveness, 12(3):578–596, July 2019.
36. ICH Expert Working Group. ICH harmonised tripartite guideline: Statistical principles for clinical trials E9. https://www.ich.org/page/efficacy-guidelines, 1998. Accessed: 2021-10-26.
