Abstract
We consider the problem of identifying a subgroup of patients who may have an enhanced treatment effect in a randomized clinical trial, and it is desirable that the subgroup be defined by a limited number of covariates. For this problem, the development of a standard, pre-determined strategy may help to avoid the well-known dangers of subgroup analysis. We present a method developed to find subgroups of enhanced treatment effect. This method, referred to as “Virtual Twins”, involves predicting response probabilities for treatment and control “twins” for each subject. The difference in these probabilities is then used as the outcome in a classification or regression tree, which can potentially include any set of the covariates. We define a measure Q(Â) to be the difference between the treatment effect in estimated subgroup  and the marginal treatment effect. We present several methods developed to obtain an estimate of Q(Â), including estimation of Q(Â) using estimated probabilities in the original data, using estimated probabilities in newly simulated data, two cross-validation-based approaches and a bootstrap-based bias corrected approach. Results of a simulation study indicate that the Virtual Twins method noticeably outperforms logistic regression with forward selection when a true subgroup of enhanced treatment effect exists. Generally, large sample sizes or strong enhanced treatment effects are needed for subgroup estimation. As an illustration, we apply the proposed methods to data from a randomized clinical trial.
Keywords: randomized clinical trials, subgroups, random forests, regression trees, tailored therapeutics
1. Introduction
Confirmatory randomized clinical trials are designed to provide definitive information about treatments, and frequently compare a standard treatment to a new treatment. The conclusions from such a study are applicable to the whole population that has been considered. However, with increasing use of targeted therapies and with increasing understanding of the mechanisms of action of new agents and of the human response to those agents, it is quite plausible that there are subgroups of patients for whom the new treatment is especially effective. Likewise, there could be subgroups of patients for whom the new treatment is not effective, or less effective than the standard therapy. There is a strong desire to find such subgroups if they exist [1]. From a statistical perspective, searching for subgroups is known to be a dangerous exercise, with the high possibility of finding false positives. There is a large literature on this topic, with many statisticians and clinical trialists writing about dangers of subgroup analyses [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. The general opinion is that if subgroups are going to be examined they should be defined before looking at the data, and that post-hoc mining of the data in an uncontrolled or undefined fashion is likely to lead to unreliable results. On the other hand, there are those who believe the biological rationale for subgroups is so strong that the statistical concerns about mining the data have been emphasized too much, and are a barrier to progress [12].
An alternative strategy to predefining the subgroups is to predefine the statistical approach that is going to be used to find subgroups [1]. Such an approach is reproducible and its statistical properties can be understood. The research we present in this paper is to describe several different strategies for finding subgroups. We believe the methodology described here would be particularly applicable for two situations: (i) those where a new treatment is shown overall to be slightly better than the standard therapy, but not sufficiently better to be widely adopted, and (ii) those situations in which the new treatment appears better, though not significantly, but where there may be a subgroup of patients for which there is a substantial benefit of the new treatment. The challenge in such situations is finding such a subgroup, and then demonstrating that the benefit is likely to be real for future patients.
The setting we consider is a two-group randomized clinical trial, with a binary outcome variable Y, treatment group indicator T and covariates X. The dimension of X is moderate, for example 8 to 100, and these covariates are measured pretreatment, and could be demographic, laboratory or questionnaire variables. The goal is to find a subgroup of patients, defined by a region (denoted A) of the covariate space of X, in which the treatment effect is substantially better than the average treatment effect, or better than some prespecified threshold. Since it is generally desirable to have a relatively simple way of defining the subgroup, we want A to depend on a small number of variables. For example, the region A could be X4 ≥ 2 and X7 ≤ 5.
The classical approach to identifying subgroups is nicely described by Kehl and Ulm [13], and involves the fitting of a model which includes interactions between treatment and the covariates. For example, for logistic regression, we could consider models of the general form
for which the main interest would be in the term w(X). One problem with such an approach is that only those factors or combinations of factors which are included as interactions in the model may potentially be identified as important in defining the subgroup of enhanced treatment effect. Additionally, even if the form of interactions between factors is assumed to be linear, the order of such terms is unknown [13]. Thus, this method is not very feasible when the dimension of X is even moderately large, because the number of potential interactions is massive, even when one only considers one and two-way interactions of T with the X’s. It is well known that large sample sizes are needed to find interactions in models, so, for similar reasons, we may expect large samples to be necessary to find and confirm the existence of subgroups and accurately define them. While subgroups and interactions in statistical models are very related topics, they are not identical. What one means by a subgroup with enhanced treatment effect depends on how one defines enhanced. What one means by an interaction in a model depends on the scale of the observations, because interactions on one scale can disappear when the data have been transformed to another scale.
Much of the statistical literature on model building and validation has some relevance to the problems of subgroup estimation and testing for subgroup effects. We will not review it here, except to mention that tests for interactions in clinical trials [14, 15, 16] appear promising, and that tree based methods of directly finding treatment-covariate interactions have been suggested [17, 18, 19].
In this paper we develop and compare some different methods for defining a subgroup which shows an enhanced treatment effect. A challenge in this setting is to give an accurate estimate of the enhanced treatment effect in the subgroup. The procedures we develop essentially mine the data in a defined way, so there is a considerable danger of overfitting. We will investigate a number of different schemes for obtaining an honest estimate of the magnitude of the treatment effect in the subgroup, including resampling schemes such as cross-validation [20].
2. Methods for Estimating Region A
2.1. Notation
The data consist of (Yi, Ti, X1i, …, Xpi), i = 1, …, n. Let nj = number of observations with Ti = j, j = 0, 1. We expect n0 and n1 to be close to n/2 in a 1-1 randomized trial. We will focus on the situation where Y is a binary outcome. The X’s could be continuous or categorical and they may be correlated with each other. There could be a marginal effect of the treatment on Y, and some of the X’s could be prognostic, in the sense that they are marginally associated with Y. The goal is to partition the covariate space into two regions A and Ac, with A being defined by a relatively small subset of the X’s. Define |Aj| to be the number of observations with Xi ∈ A and Ti = j, j = 0, 1.
2.2. Forward Logistic Regression
As a standard and simple method we consider forward selection in a logistic regression model. The terms that are considered for inclusion in the model are main effects for all X’s and T, and all X × T, X × X, and X × X × T interactions. The forward selection procedure starts with an intercept-only model and, at each step, the term which gives the smallest AIC is added to the model. The final model is used to calculate P̂1i = P(Yi = 1|Ti = 1, Xi) and P̂0i = P (Yi = 1|Ti = 0, Xi) for each person i. A new variable Zi = P̂1i − P̂0i is then created, and subjects are defined to be in group  if Zi is greater than some cutoff c, which we generally take to be either δ + 0.1 or δ + 0.05, where δ is an estimate of the treatment effect P(Yi = 1|Ti = 1) – P (Yi = 1|Ti = 0). In addition, although all X’s in the final model are needed to define Â, the X’s which are involved in first and second-order interactions with T are noted, as these are the X’s which are most important in defining Â. If no Zi is greater than c, then  is the null set. An alternative approach to defining the treatment effect for each subject is to define the difference on the logit scale, i.e. define Zi = logit(P̂1i) − logit(P̂0i), and subjects are defined to be in group  if Zi is greater than some cutoff. The form of Zi that is prefered will depend on the context. In this paper we will only consider differences on the probability scale.
2.3. Virtual Twins Method
This approach borrows concepts from counterfactual models, in which there are two possible outcomes for each person (one under each treatment assignment), only one of which can be observed, and it is the difference between the two outcomes that is important. We investigate two versions of Virtual Twins, VT(R) and VT(C), which have the same first step, but in the second step have either a regression procedure or a classification procedure.
2.3.1. Step 1. Apply Random Forests to the Data
A random forest [21] is an ensemble predictor based on multiple regression trees. For our purposes the random forest is simply a black box predictor which takes as input covariate values (Xi, Ti, XiI(Ti = 0) and XiI(Ti = 1) in our case), and gives as output an estimate of P(Yi = 1) for that set of covariate values. The inclusion of XiI(Ti = 0) and XiI(Ti = 1) as covariates is not essential, but in numerical work we found that their inclusion improved the properties of the method. Fitting of the random forest is done using the R function randomForest with all default settings except for the number of trees per forest, which we set at 1000. As with the logistic method, the random forest is used to predict P̂0i and P̂1i. If the actual treatment group for subject i is j, then P̂ji is obtained from the out-of-bag estimate from the random forest, while P̂(1−j)i is obtained by applying the random forest to that person’s data, with the treatment group switched. Once this is done, we define Zi = P̂1i − P̂0i, which can be regarded as an estimate of the treatment effect for subject i. A variation of this step would be to use two separate random forests, one for each treatment group, and predict for each subject using the forest from the other group. All of these strategies for obtaining Zi involve some form of extrapolation; that is, estimating the probability of an outcome in a treatment group that is different from the treatment they actually received.
2.3.2. Step 2. Estimate a Regression or Classification Tree
The purpose of this tree is to find a small number of X’s that are strongly associated with Z and hence can define A. We consider two alternative methods.
Virtual Twins (Regression)
In this method, denoted VT(R), we estimate a regression tree with Z as the response variable and covariates X. The regression tree is then used to predict values of Zi for each person. Then, subjects with predicted Zi greater than some threshold c are considered to be in Â. Thus,  is defined by the paths down the tree which lead to terminal nodes with predicted Z’s greater than c. We take c to be either δ + 0.1 or δ + 0.05. If none of the predicted Zi’s are greater than c then  is empty. We use the R function rpart with default settings, except that the minimal terminal node size is 20, and the complexity parameter is taken to be 0.02.
Virtual Twins (Classification)
Define a new binary variable Z*, as if Zi > c and if Zi ≤ c. The value of c for this method is generally the same as that used in VT(R), and will affect the size of Â. Since Zi is a difference in probabilities, we are essentially creating a classification variable immediately after the random forest stage. This Z* is used as the outcome in building the classification tree, which is then used to classify individuals as being in  or not, thus all X’s in the tree define Â. This method is denoted VT(C). Note that  is empty if the classification tree has no splits.
3. Properties of the Estimated Region Â
For any region A in the covariate space, define
(1) |
as the measure of the enhanced treatment effect in A compared to the average treatment effect. We define Q(A) to be zero if A is a null set. Let  be the estimated region, as determined by one of the methods described above. Since  is the region that would be recommended for use, it is important to understand how effective  will be in defining a region of enhanced treatment effect for future populations. Clearly large values of Q(Â) are desirable if  is to be useful.
3.1. Estimation of Q(Â)
Below we describe some approaches to obtaining an estimate Q̂(Â) of Q(Â), which will be evaluated in a simulation study. We consider six methods of estimating Q(Â). It is desirable that Q̂(Â) be as close as possible to Q(Â), rather than Q(A), as Q(Â) is the true measure of enhanced treatment effect for the estimated region Â, whereas Q(A) is the corresponding measure for the unknown true region A, which in general will not be the same as Â.
Method 1. Resubstitution Method
Estimate the four quantities P(Y = 1|T = 1, X ∈ Â), P(Y = 1|T = 0, X ∈ Â), P(Y = 1|T = 1) and P(Y = 1|T = 0)) from the observed proportions in the data, which are then substituted into equation 1 to give Q̂(Â). For this estimator, the same data that was used to construct  will be used to evaluate it. The methods of estimating  may be overfitting the data (i.e. modeling the noise), so a resubstituted estimate of Q(Â) is likely to be biased, especially in small samples and with many covariates.
Method 2. Simulate New Data (SND)
This method is a type of parametric Bootstrap approach. For both the logistic regression and virtual twins methods the first step gives estimates of P1i and P0i, and these estimates can be used to simulate new outcome data from Bernoulli distributions. The new data ( , Xi, Ti) will “look like” the original data in terms of marginal and conditional distributions, but be statistically independent of the original data. Specifically if Ti = 1 then and if Ti = 0 then . We then obtain the estimates of the four quantities P(Y = 1|T = 1, X ∈ Â), P(Y = 1|T = 0, X ∈ Â), P(Y = 1|T = 1) and P(Y = 1|T = 0)) from the empirical proportions in the simulated data Y*, from which we obtain Q̂ (Â). Alternatively we can avoid actually simulating new data by simply taking an appropriate average of the estimates of P1i and P0i to get Q̂ (Â):
Compared to the resubstitution estimator, we expect this method to have less bias, because it is not based explicitly on the original data. However, it may not completely eliminate the bias because any idiosyncracies in the observed data which are causing overfitting will still be present, but to a lesser degree in the estimates P̂1i and P̂0i.
Method 3. Cross-Validation of P̂1i and P̂0i
A modification of method 2 is to obtain P̂1i and P̂0i via cross-validation. In this method the specific data for subject i is not used to obtain P̂1i and P̂0i. Using 10-fold cross-validation, we apply the random forest or logistic regression approach to 9/10 of the data and use the resulting predictor to obtain estimates of P1i and P0i for the remaining 1/10 of the observations. This is repeated 10 times, then the simple averaging approach of method 2 is applied to give Q̂(Â).
Method 4. Full Cross-Validation
For all the methods we apply the whole process of defining  to 9/10 of the data, which gives a region Âk, and 1/10 is left out, which is used as an independent testing dataset. This is repeated 10 times. Each left out observation is then either in Âk or not. We count the number of observations with Y = 1 for T = 1 and X ∈ Âk, and similarly the number with Y = 1 for T = 0 and X ∈ Âk. We pool the counts across the 10 values of k to give final estimates of P(Y = 1|T = 1, X ∈ Â) and P(Y = 1|T = 0, X ∈ Â), which are then used in equation 1 to give Q̂ (Â). Note that in contrast to method 3 in which  is fixed, the estimate of  in method 4 does vary.
Methods 5 and 6. Bootstrap Bias Corrected
The Bootstrap is a method that can be used to evaluate the bias in an estimator. The original estimator is then adjusted by this estimated bias. In this method the original data will be bootstrapped (resampled with replacement) 20 times, and for each dataset a new estimated region will be obtained. Then the estimate of Q(Â) is given by (Q from original data applied to original Â) + (Q from original data applied to new Â) − (Q from new data applied to new Â). The justification for this adjusted estimate requires some notation. Let F = true unknown distribution of the data, F̂ = distribution of bootstrapped data, A = true region, ÂF = region estimated from the observed data and ÂF̂ = region estimated from bootstrapped data (new Â). It is necessary to consider three probability laws depending on whether Q is applied to future data, the current data or bootstrapped data. Specifically, Q(.) is governed by the probability law on the next dataset (i.e. the true F), Q̂F(.) is governed by the probability law on the observed data (i.e. the empirical distribution) and Q̂F̂(.) is governed by the probability law from the bootstrapped data (i.e. the empirical Bootstrap distribution).
The quantity of interest is Q(ÂF), which can be written as [Q(ÂF) − Q(A)] + Q(A) = R + S. Using the bootstrap we approximate R by [Q̂F(ÂF̂) − Q̂F(ÂF)]. To obtain S, we approximate [Q̂F(ÂF) – Q(A)] by [Q̂F̂(ÂF̂) – Q̂F(ÂF)], rearranging this to give an approximation to S of Q̂F(ÂF) – [Q̂F̂(ÂF̂) – Q̂F (ÂF)]. Then adding approximations of R and S gives Q(ÂF) = Q̂F(ÂF) + Q̂F(ÂF̂) − Q̂F̂(ÂF̂). Since the second term will tend to be smaller than the third term, the bias corrected estimate will likely be less optimistic than the original estimate Q̂F(ÂF). To implement this bootstrap corrected estimate we use 20 bootstrap samples and for the second and third terms the average of the 20 values of Q is used. This bias correction can be applied to any method we have for calculating Q. Method 5 consists of applying it to the Resubstitution method and Method 6 consists of applying it to the SND method.
3.2. Sampling Variability of Q̂ (Â)
It may be desirable to attach standard errors to Q̂(Â). We suggest the following scheme. Simulate new datasets using the estimates of P1i and P0i obtained from the random forest procedure. For new dataset j, find the new Âj, and calculate Q̂(Âj) using one of the methods above. The standard deviation of these quantities is an estimate of the standard error.
4. Simulations
4.1. Simulation Study Design
We generate data from a logistic model of the general form logit(P(Y = 1|T, X)) = α + βT + γh(X) + θTI(X ∈ A), where X’s are independent N(0, 1) and A is a known region in the covariate space defined by two X’s. There are a number of aspects of the design that may be important to consider, including sample size, dimension of X, number of X’s that determine A, size of A, values of the parameters, and correlation between the X’s, as these might influence the performance of the methods. We only consider some of these potential scenarios in this paper.
4.2. Criteria for Evaluation and Comparison of Region A Estimation Methods
We use a number of criteria to evaluate each method’s ability to identify the region A.
Finding Correct X’s
For VT(R) and VT(C), the total number of unique X’s in the tree is recorded. For the logit method, we record the total number of unique X-by-T interactions in the selected model. For the VT(R), VT(C) and logit methods, we record how often the methods find specific covariates, including some that are important and some that are not associated with the outcome. For the VT methods we record whether each covariate is in the tree, and for the logit method we record whether the one-way X-by-T interaction is in the final model. To determine whether or not the trees are finding the correct X’s, we record whether or not the two X’s that define the true region A both occur in the first 2 levels of the tree.
Closeness of  to the True A
This is measured using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and area under the ROC curve (AUC). Each estimation method gives an Â, and since we know the true A we can calculate sensitivity, specificity, PPV and NPV for each A estimation method and each value of the threshold c. Furthermore, the predicted differences in probabilities from the logit model and the VT(R) approaches can be used along with true A membership to calculate AUC by varying c. When θ ≠ 0 and  is empty the PPV is defined as zero. When θ = 0, A does not exist. In these cases, only specificity is meaningful and ideally should be close to 1.
Closeness of the Size of  to the Size of the True A
To evaluate the closeness of |Â| to |A|, we record |Â|. If  does not exist then |Â| is taken as zero.
Power
Another quantity of interest is the percentage of times our methods find a null  when θ ≠ 0, and when θ = 0. For VT(C), we consider the method to have found a null  if the final tree consists only of a single node, i.e. the method failed to find any X’s. For VT(R),  is null if either the final tree is a single node or if a tree exists but all predicted Zi’s are less than c. For the logit-based method, we consider  to be null if all Zi < c. Lastly, to quantify if there is enough data to determine whether individuals in true A have an enhanced treatment effect, we fit a logit model with the correct terms (i.e. only the terms used to generate the data), and calculate the power by determining how often the A-defining term (θ̂) is significantly different from zero.
4.3. Properties of Q̂(Â) as an Estimator of Q(Â)
For each simulated data set, we calculate Q̂(Â) for each A estimation procedure using the six methods as described above, in addition to calculating Q(A) and Q(Â). Because this is done for multiple data sets, we can estimate the variability of the various Q(A) estimates, and can quantify how far Q̂(Â) is from Q(A) and Q(Â). Thus, this allows us to estimate both the bias and the variability of Q̂(Â) and of Q̂(Â) – Q(Â) for each A estimation procedure. Because they do not lead to any different conclusions, we do not show the results for Q̂(Â) – Q(Â).
5. Simulation Results
We consider a base-case simulation design and several modifications of this base-case. In the base-case we simulate 1000 patient randomized trials, and the X’s are generated as independent Xj ~ N (0, 1), j = 1, …, 15. We consider logit models for data generation logit(P(Y = 1)) = −1 + 0.5X1 + 0.5X2 − 0.5X7 + 0.1T + 0.5X2X7 + θTI(X ∈ A), where θ determines the extent to which individuals in region A have an enhanced treatment effect. For most models, the true region A is X1 > 0 ∩ X2 < 0, which contains approximately 25 percent of the observations. Results are based on 100 simulated datasets. We consider a null case (true θ = 0) and a base-case (true θ = 0.9) chosen so that in most scenarios the correct logit model has approximately 90 % power in a test of θ = 0. Simulation results for the base-case are summarized in section 5.1 and the upper portions of Tables 1, 2 and 3, the null-case results are summarized in section 5.2 and the lower portions of Tables 1, 2 and 3, and results for other modifications are presented in section 5.3.
5.1. Base Case Results
From Table 1, we can see that the tree-based methods tend to find too many X’s, whereas the logit-based method tends to be too conservative, finding only about one X on average. Additionally, the tree-based methods appear to be better able to identify main effects and covariates determining region A than the logit-based method, though they also tend to identify non-important covariates (e.g. X3 and X4) as important slightly more often than the logit-based method. Though the tree-based methods can typically identify one of the covariates determining subregion A, they are less able to simultaneously identify both covariates which define region A as the two most important. Between the tree-based methods, VT(R) finds the correct X’s in the top two levels more frequently than VT(C). While VT(R) never fails to find a tree and VT(C)0.05 almost never fails, VT(C)0.1 fails to find a tree 11 % of the time. From Table 2, we can see that all tree-based methods except for VT(R)0.1 find null Â’s less frequently than the corresponding logit-based methods. Of the tree-based methods, VT(C) performs the best in this regard, and defining  by Zi > δ + 0.05 noticeably outperforms Zi > δ + 0.1. With one exception, the median predicted subgroup size tends to be closer to 25 % (i.e. 250) for the tree-based methods than for the corresponding logit-based method. Also, although possibly lower than desirable, the sensitivities and positive predictive values for the tree-based methods are noticeably better than those of the logit-based methods, while the specificities and negative predictive values for all methods are similar. The AUC is better for the tree-based methods than for the logit-based methods.
From Table 3, it is clear, since Q(A) > Q(Â), that all methods tend to identify subgroups which have less of a true enhanced treatment effect than the true subgroup A. The tree-based methods outperform the logit-based method, finding subgroups which have larger values of Q(Â). Also, VT(C) slightly outperforms VT(R), particularly when is used to define Â. With the exception of the full cross-validation-based estimates, all estimates (and standard deviations) are essentially identical between VT(R) and VT(C) when the threshold for defining  is δ + 0.05; however, when  is instead defined using threshold δ + 0.1, the cross-validation-based and bias-corrected estimates show similar levels of bias between VT(R) and VT(C), but the resubstitution and SND estimates are slightly less biased for VT(R). For all methods, the resubstitution, SND and cross-validation-based SND estimates tend to overestimate Q(Â), though the two SND methods are generally much closer to Q(Â) than the corresponding resubstitution estimates. The full cross-validation-based resubstitution estimates tend to greatly underestimate Q(Â), and are extremely variable. The two bootstrap bias-corrected estimates are the best, with the bias-corrected SND estimate being the closest to Q(Â) and having smaller variability than other estimates.
From the magnitude of the standard deviations in Table 3 for SND with bootstrap bias correction SND(BC), we can see that the value of Q̂(Â) would be considered greater than zero much more often for the θ = 0.9 case than for the null case. The standard errors of Q̂(Â) were also evaluated using the method outlined in Sampling Variability of Q̂(Â). We found that for VT(R) and the SND method of estimating Q(Â), the estimated standard errors were close to the standard deviations in Table 3 (results not shown), suggesting that these standard errors are reasonable measures of uncertainty to present along with the estimate Q̂(Â).
5.2. Null Case Results
From Table 1, we can see that the tree-based methods again tend to identify too many covariates as important, whereas the logit-based method finds almost zero. Moreover, none of the tree-based methods fails to find a tree more than 38 % of the time, whereas an intercept-only model is chosen 85 % of the time for the logit-based method. This could be due to the fact that logit-based method uses the same scale from which the data were generated, whereas the tree-based methods change to the probability scale. Between the tree-based methods, VT(C) finds empty trees more frequently than VT(R), and defining  using noticeably outperforms . It appears that, even in the absence of a region of enhanced treatment effect, the tree-based methods still identify main effects as important, while the logit-based method does not. Similar results can be seen in Table 2. The most conservative of the tree-based methods in this case (VT(R)0.1) still identifies a subgroup of enhanced treatment effect 35 % of the time, whereas the logit-based method identifies such a subgroup only about 15 % of the time. Between the tree-based methods, the classification methods tend to identify subgroups more frequently than the corresponding regression-based methods. The specificities for all methods were reasonably good, although again the logit-based method tended to outperform the tree-based methods. As expected, specificities for  defined by Zi > δ + 0.1 are better than those for Zi > δ + 0.05.
From Table 3, we can see that, as expected, subgroups identified by all methods have essentially no real enhanced treatment effect. The logit-based method has estimates (Q̂(Â)) which are generally much closer to zero than the corresponding tree-based methods. Between the tree-based methods, VT(R) appears to perform better for SND estimates, while the other estimates tend to be similar, or slightly better for VT(C). In addition, except for the SND estimates, defining  using Zi > δ + 0.05 tends to be similar or slightly better than Zi > δ + 0.1 for VT(R), whereas for VT(C) this relationship is reversed. Also, the resubstitution, SND and cross-validation-based SND methods tend to overestimate Q(Â), with the resubstitution method overestimating the most, while the full cross-validation-based resubstitution underestimates Q(Â). The bias-corrected estimates are again the closest to the true values of Q(Â); however, the bias-corrected estimates tend to somewhat overestimate Q(Â), particularly in the case of the tree-based methods.
5.3. Modifications to the Base Case
We also consider a number of modifications to the base-case:
30 covariates instead of 15.
Correlated covariates, in which variables in the three clusters {X1, X3, X7}, {X2, X4, X15}, and {all remaining covariates} have internal correlations of 0.7, but are uncorrelated with variables from the other clusters.
Subject-specific effects, using data generation models logit(P(Yi = 1)) = ai − 1 + 0.5X1 + 0.5X2 − 0.5X7 + 0.1T + 0.5X2X7 + θTI(X ∈ A) and logit(P(Yi = 1)) = ai − 1 + 0.5X1 + 0.5X2 − 0.5X7 + 0.1T + biT + 0.5X2X7 + θTI(X ∈ A), where ai and bi are both normally distributed with mean zero and variance 0.25.
Sample sizes of 400 and 2000.
True A redefined to be X1 ≥ −0.545 ∩ X2 ≤ 0.545, so that A is approximately 50 % of the data set.
θ = 1.5.
For all situations we present only the results for VT(R) with threshold δ + 0.05. This choice was made for illustrative purposes. The approximate power in a test of θ = 0 for the correct logit model ranged between 90 and 100 % for all situations except for the n = 400 case, in which the power was found to be approximately 60 %.
Results for these cases are compared to the base-case in the upper sections of Tables 4 and 5, with Table 5 being limited to the SND method for Q estimation. As expected, the average value of Q(Â) is noticeably smaller than Q(A) when n = 400 and more similar to Q(A) when n = 2000, indicating that the ability to find subgroups with good properties decreases with decreasing sample size. From Table 4 we can see that, with the exception of then = 400 case, which is noticeably worse than the others, and the n = 2000 and θ = 1.5 cases, which are noticeably better, the method’s ability to find the correct subgroup seems to be somewhat unaffected by moderate variations on the base-case. The NPV’s are quite good for all cases other than the Larger |A| case, and with the exception of the n = 400 case, all cases lead to very good AUC values. Furthermore, the method seems to find subgroups that are similar in size to the true subgroup for all cases, and rarely fails to find a subgroup of any sort. From the upper section of Table 5, we can see that, other than n = 400 and n = 2000 cases, the method appears to estimate Q(Â) with similar accuracy under all cases when the bias correction is made. Though reasonably good, the bias-corrected estimates for the n = 2000 and θ = 1.5 cases are negatively biased, which is somewhat counterintuitive, as one would expect increased sample size and increased signal to lead to more accurate estimates. The SND estimates for all cases other than θ = 1.5 show some positive bias, which is, for the most part, removed once we implement the bias correction. Increasing the number of covariates to 30 does not change the properties much, except for increasing the percentage of times that no subgroup is found.
Null cases (θ = 0) for each of these modifications are also considered, and comparisons of these to the null-case are given in the lower sections of Tables 4 and 5. We can see from Table 4 that for specificity, median |Â| and number of unique X’s the method seems somewhat insensitive to variation of the null-case; however, the correlated and n = 400 cases show substantially fewer instances of |Â| = 0 than the other cases. Specificities under all cases are reasonably good; however the method continues to find subgroups, even when no true subgroup exists, identifying a subgroup at least 72% of the time in all cases. From Table 5, we see that our method shows similar performance under all null cases, particularly when the bias correction is used. As expected, the estimated subgroups for all cases have, on average, no enhanced treatment effect. As we saw in the base-case, the SND estimates are positively biased; but, in this case, though the bias correction somewhat reduces this positive bias, some bias still remains. Although there is some positive bias, these estimates are relatively close to zero, so, although the method continues to identify subgroups when no true subgroup exists, it is unlikely such subgroups would be falsely identified as important.
6. Application to Clinical Trial Data
The example is taken from a clinical trial conducted by Eli Lilly, and since the specific information is still confidential, the problem and solution will be described in general terms. The data comes from a randomized, double-blind clinical trial in patients with a potentially fatal condition. Data from this study includes 1019 individuals, 517 of whom received the experimental treatment in addition to the standard of care; the remaining patients received placebo with the standard of care. The intervention is a drug that that is intended to improve survival, and as such, the agreed upon endpoint with the Food and Drug Administration was survival at 28 days post-randomization to treatment/placebo. We consider 44 covariates (X1 – X44), including demographic, laboratory, medical history and questionnaire data. Of these, 9 are binary, 14 are regarded as continuous, and 21 are categorical. The 21 categorical variables were ssubdivided using dummy variables, giving an overall total of 60 X’s. The overall treatment effect is 0.069 (SE = 0.028), indicating a modest overall survival benefit for the experimental treatment. For all methods, we define  using the threshold δ + 0.05 = 0.119.
Applying the forward logistic approach found 30 main effects, 77 X × X interactions, and 11 X × T or X × X × T interactions. The X’s which have the most significant interactions with T are X1, X2, X11, X18, X39 and X41, three of which are prognostic factors for survival and two others which have biological plausibility. The Resubstitution estimate of Q(Â) is 0.285 and the bias-corrected SND estimate is −0.042. For this method, the resulting estimated subgroup contains 290 individuals (151 treated). Though many main effects and interactions were included in the final logistic model, the bias-corrected estimates of Q(Â) are essentially zero, suggesting that no meaningful subgroup was found.
Applying VT(R) led to a tree in which  = {X : X8 < 58.22 and X44 = 1, or X8 ∈ [58.22, 180.9) and X12 ≥ 9, 938}, and similarly, VT(C) led to a tree in which  = {X : X8 < 59.19 and X44 = 1 and X1 ≥ 70.01, or X8 < 59.19 and X44 = 1 and X1 < 70.01 and X12 ≥ 3, 304}. These subgroups for VT(R) and VT(C) included 233 (126 treated) and 143 (77 treated) individuals respectively. Though our analyses focus on the threshold δ + 0.05 for defining Â, it should be noted that the Virtual Twins approaches failed to identify subgroups for the threshold δ + 0.1. The Resubstitution estimates of Q(Â) for VT(R) and VT(C) are 0.179 (SE=0.107) and 0.197 (SE=0.101) respectively, and the bias-corrected SND estimates are 0.047 (SE = 0.048) and 0.041 (SE = 0.039). The variables that appear to be potentially useful are X1, X8, X12 and X44, which include three laboratory biomarkers and one demographic variable. Three of these variables are related to the severity of the patient’s condition, and are also prognostic for the survival outcome. All four variables have some biological plausibility. However, though the estimates of Q(Â) for the Virtual Twins methods are larger than the corresponding estimates for the logistic method, they are still modest in size, and only slightly larger than their standard errors. This suggests that there is insufficient evidence from this trial that the subgroups found by the Virtual Twins methods have a significantly better outcome.
This example shows the benefits of the Virtual Twins method over the logistic method in three respects; the results are more easily interpreted, many fewer variables are identified as important, and  has better properties. Additionally, the computational time for the Virtual Twins method was substantially better than that for the logistic method.
7. Discussion
Clinical trial research often includes reports of subgroup analyses in order to explore differential treatment effects, often times as post hoc or exploratory analyses. With the advent of tailored therapeutics, much has been published on genetic or other biomarkers that may be predictive of patient outcomes or differential treatment effects, and thus may be useful to define subgroups. There are many dimensions to this problem (effect size in both the overall populations and in the subgroup of interest, subgroup size, study sample size, the number of covariates of interest, the number of covariates that define the subgroup, the correlation between covariates, etc.) that give it considerable complexity. While many have commented on the dangers of subgroup analyses, whether planned or unplanned, there has been little serious investigation of methodologies for proper identification of subgroups and assessment of their reliability, other than routine, conservative multiplicity adjustments on the number of treatment-by-subgroup interaction tests done.
In clinical drug development programs that are exploring an array of potential biomarkers, phase 2 trials usually include dozens or a few hundred patients at most. Our simulations of n = 400 patients indicate that single trials are unlikely to find the right biomarkers when the effect size is important but the subgroup size is modest. For satisfactory results, larger studies are needed, as would occur with phase 3 trials. Good results are within reach when the sample sizes are n = 2000, but such samples sizes are not always used in clinical drug development except for the largest event driven trials in a few disease indications such as cardiovascular disease or osteoporosis. In trials of n = 1000 patients, the method we propose demonstrates reasonably good properties for identification of meaningful clinical effect sizes. While in principle the methods we present could be viewed as providing definitive evidence for a subgroup, in practice the methods are more useful for giving leads and suggestions for future work, and better than what one could achieve by simply looking for interactions. In reality an actual new trial would be required for the results to be confirmed and accepted.
As we show, identifying subgroups of enhanced treatment effect is a challenging problem that would generally require large data sets. Verifying that the subgroups are real and are likely to be useful in future data is even more challenging. The Virtual Twins method appears to be a promising approach, and better than a simple alternative for identifying a subgroup. For the cases we consider, the identified subgroup had reasonable properties as measured by sensitivity, specificity, and magnitude of the enhanced treatment effect; however, the method is less good at identifying the correct covariates. One reassuring finding from the simulations is that increasing the number of potential covariates, making the covariates correlated, or introducing between-subject heterogeneity does not appear to have much of an effect on the properties of the estimated subgroup, though we doubt this finding would hold if there were hundreds of covariates instead of the tens we had. One drawback of the Virtual Twins method is that it has a tendency to identify a subgroup, even when there should not be one. One strategy to use to mitigate any consequence of this is to accept that an aggressive strategy may find subgroups too often, but then to accompany the subgroup with an estimate of how good it is, and an associated measure of uncertainty. To obtain an honest estimate of the enhanced treatment effect in the identified subgroup, we find that a bias-corrected bootstrap procedure gives good, but not perfect results.
In any real situation, there are likely to be a number of logistical and practical considerations. For example, in the methods we have treated all the X’s equally; however, in reality they will not be equivalent. For example, there may be an a priori biological rationale why some of the X’s are more likely to have an interaction with treatment in a model. Another issue is that some X’s may be cheaper or easier to measure than others. Thus it could be desirable to have the region  based on such X’s, provided this does not substantially harm the properties of Â. Another practical concern is the size of Â. Depending on the context it may be desirable to have a large  with a modest enhanced treatment effect or a much smaller  with a larger treatment effect.
In the simulations we report whether the methods find the “right” X’s, however in some contexts this may not be such an important issue. The X’s are likely to be correlated with each other, so it is quite plausible that a ”wrong” set of X’s could be quite effective at defining the subjects who do have an enhanced treatment effect. The very concept of a set of “right” X’s is itself too idealized. While for some treatments it may be possible to hypothesize an all-or-nothing situation where the treatment will only work if the person has a certain set of attributes, it may be difficult to develop technology that can accurately and reliably measure these attributes. On the other hand if a subgroup is to be based on a set of X’s, then to be accepted by the scientific community these X’s would have to have at least some biological plausibility.
We have chosen to summarize the properties of the estimated subgroup using a metric Q(Â), which is the treatment effect in the subgroup minus the marginal treatment effect. An alternative metric is the treatment effect in  minus the treatment effect in the complement of Â. From the perspective of the drug developer, neither metric is totally satisfactory because they don’t incorporate the size of Â. For a given enhanced treatment effect, a larger  is more desirable than a small Â. It is unlikely that a single summary measure can capture all the costs and benefits of the identified subgroup from the drug developer’s, or society’s point of view.
While we demonstrate that the Virtual Twins method appears promising, many variations could be contemplated. For example, in the first step the random forest could be replaced by other non-parametric regression methods, such as MARS. This could give smoother forms for the estimate of P(Y = 1|X, T) as a function of each continuous X, which could be plausible in many applications. In the second step a regression tree is appealing because of its simplicity and its ability to select a small number of features. Other feature selection methods could also be considered. For instance, Kehl and Ulm [13] developed a method for censored data in which the second step involves the use of stabilized bump hunting, rather than a regression tree. Variations on cross-validation and the bootstrap could be used to assess the properties of the estimated subgroup.
We consider the situation of a binary outcome variable, which gives rise to some issues, specifically whether one should consider treatment effects as measured on an absolute probability scale or on a logit scale. While we simulated data using models on a logit scale, we assessed the enhanced treatment effect on the probability scale. It is likely that the methodology we describe would have looked better if we had used the same scale for both. What scale one uses to assess enhanced treatment effect depends on the specific application. The reason to use the probability scale for assessing the properties of the region is because this would translate directly into a number of people who would benefit from the treatment. If the outcome were a continuous measure, then the virtual twins methodology would still be applicable, and for this situation there would be no reason to have any scale of measurement other than linear. We hypothesize that the methods described in this paper would perform better for continuous outcome data, not only because of the uniform definition of the scale, but also because there is inherently more information in continuous data than binary data.
A subtle concern with the goal of finding subgroups using data from a specific randomized trial is that the trial may have had exclusion criteria. For example, there may have been a strong suspicion that the treatment would not be effective for low values of a certain covariate, and hence people with such low values were excluded from the trial. Criteria such as these may hinder the ability to find a meaningful interaction, either with this covariate or with others, that would be present in a broader population.
We have formulated the goal as finding a subgroup of patients based on a small number of X’s that have an enhanced treatment effect. A related goal, which in the long run might be more useful for patients, is giving each patient their predicted probability of response under each possible treatment, i.e. giving them P̂1i and P̂0i.
Acknowledgments
The authors acknowledge the valuable suggestions of Yuefeng Lu. This research was partially supported by a grant from the Eli Lilly Corporation to the University of Michigan, and by National Institutes of Health Grant T32 CA083654.
Appendix A
Table 1.
Logit | VT(R) | VT(C)0.05 | VT(C)0.1 | |
---|---|---|---|---|
θ = 0.9: | ||||
Mean (# Unique X’s) | 1.13 | 3.62 | 3.94 | 3.72 |
SD (# Unique X’s) | 0.97 | 1.04 | 1.40 | 1.75 |
Prop. found X1 (int*) | 0.30 | 0.96 | 0.92 | 0.82 |
Prop. found X2 (int) | 0.42 | 0.68 | 0.67 | 0.64 |
Prop. found X7 (main) | 0.11 | 0.72 | 0.64 | 0.62 |
Prop. found X3 (absent) | 0.02 | 0.14 | 0.12 | 0.17 |
Prop. found X4 (absent) | 0.02 | 0.11 | 0.15 | 0.15 |
Prop. {X1, X2} in top 2 | NA | 0.51 | 0.39 | 0.32 |
Prop. found nothing | 0.29 | 0.00 | 0.01 | 0.11 |
θ = 0: | ||||
Mean (# Unique X’s) | 0.18 | 3.29 | 3.73 | 2.08 |
SD (# Unique X’s) | 0.48 | 1.00 | 1.83 | 1.89 |
Prop. found X1 (main) | 0.06 | 0.72 | 0.67 | 0.35 |
Prop. found X2 (main) | 0.03 | 0.68 | 0.51 | 0.32 |
Prop. found X7 (main) | 0.02 | 0.59 | 0.59 | 0.37 |
Prop. found X3 (absent) | 0.00 | 0.14 | 0.17 | 0.05 |
Prop. found X4 (absent) | 0.02 | 0.05 | 0.18 | 0.08 |
Prop. found nothing | 0.85 | 0.00 | 0.07 | 0.38 |
indicates whether the variable is in the true data generating model as only a main effect, an interaction with T, or absent. Subscripts on VT(C) indicate the constant added to average treatment effect for  definition.
Table 2.
Logit0.05 | Logit0.1 | VT(R)0.05 | VT(R)0.1 | VT(C)0.05 | VT(C)0.1 | |
---|---|---|---|---|---|---|
θ = 0.9: | ||||||
size of Â: | ||||||
Prop. Â null | 0.29 | 0.29 | 0.07 | 0.32 | 0.01 | 0.11 |
5th % | 0 | 0 | 0 | 0 | 99 | 0 |
50th % | 199 | 73 | 189 | 114 | 222 | 103 |
95th % | 327 | 194 | 397 | 252 | 333 | 186 |
Sensitivity | 0.34 | 0.16 | 0.47 | 0.28 | 0.49 | 0.28 |
Specificity | 0.89 | 0.96 | 0.89 | 0.95 | 0.87 | 0.96 |
PPV | 0.37 | 0.41 | 0.55 | 0.45 | 0.56 | 0.59 |
NPV | 0.81 | 0.78 | 0.84 | 0.81 | 0.84 | 0.80 |
AUC | 0.69 | 0.69 | 0.77 | 0.77 | - | - |
θ = 0: | ||||||
size of Â: | ||||||
Prop. Â null | 0.84 | 0.85 | 0.25 | 0.65 | 0.07 | 0.38 |
5th Pct. | 0 | 0 | 0 | 0 | 0 | 0 |
50th Pct. | 0 | 0 | 131 | 0 | 162 | 40 |
95th Pct. | 188 | 55 | 308 | 142 | 274 | 118 |
Specificity | 0.97 | 0.99 | 0.87 | 0.97 | 0.85 | 0.96 |
Table 3.
Logit0.05 | Logit0.1 | VT(R)0.05 | VT(R)0.1 | VT(C)0.05 | VT(C)0.1 | |
---|---|---|---|---|---|---|
θ = 0.9: | ||||||
Q(A) | 0.139 | 0.139 | 0.139 | 0.139 | 0.139 | 0.139 |
Q(Â) | 0.043 | 0.054 | 0.070 | 0.062 | 0.071 | 0.083 |
SD(Q(Â)) | 0.042 | 0.056 | 0.047 | 0.057 | 0.043 | 0.055 |
Q̂(Â): | ||||||
RS(Mean) | 0.080 | 0.125 | 0.164 | 0.155 | 0.164 | 0.192 |
(SD) | 0.063 | 0.127 | 0.078 | 0.122 | 0.062 | 0.106 |
SND(Mean) | 0.073 | 0.104 | 0.111 | 0.101 | 0.115 | 0.144 |
(SD) | 0.053 | 0.074 | 0.046 | 0.074 | 0.033 | 0.058 |
CV(Mean) | 0.057 | 0.081 | 0.106 | 0.095 | 0.104 | 0.106 |
(SD) | 0.045 | 0.063 | 0.045 | 0.070 | 0.041 | 0.079 |
FCV(Mean) | 0.023 | 0.005 | 0.037 | 0.028 | 0.052 | 0.027 |
(SD) | 0.088 | 0.163 | 0.110 | 0.164 | 0.091 | 0.182 |
RS(BC)(Mean) | 0.060 | 0.096 | 0.103 | 0.118 | 0.105 | 0.133 |
(SD) | 0.060 | 0.132 | 0.076 | 0.105 | 0.068 | 0.104 |
SND(BC)(Mean) | 0.045 | 0.068 | 0.072 | 0.076 | 0.067 | 0.092 |
(SD) | 0.043 | 0.060 | 0.044 | 0.061 | 0.041 | 0.049 |
θ = 0: | ||||||
Q(Â) | 0.000 | −0.001 | 0.010 | 0.009 | 0.008 | 0.012 |
SD(Q(Â)) | 0.006 | 0.016 | 0.023 | 0.023 | 0.025 | 0.024 |
Q̂(Â): | ||||||
RS(Mean) | 0.012 | 0.010 | 0.124 | 0.084 | 0.109 | 0.114 |
(SD) | 0.034 | 0.094 | 0.099 | 0.127 | 0.092 | 0.135 |
SND(Mean) | 0.014 | 0.019 | 0.074 | 0.051 | 0.089 | 0.090 |
(SD) | 0.033 | 0.049 | 0.054 | 0.072 | 0.040 | 0.074 |
CV(Mean) | 0.009 | 0.013 | 0.069 | 0.047 | 0.069 | 0.046 |
(SD) | 0.024 | 0.037 | 0.052 | 0.068 | 0.048 | 0.071 |
FCV(Mean) | −0.040 | −0.024 | −0.020 | −0.076 | −0.014 | −0.079 |
(SD) | 0.149 | 0.188 | 0.132 | 0.232 | 0.105 | 0.219 |
RS(BC)(Mean) | 0.004 | 0.002 | 0.068 | 0.062 | 0.043 | 0.068 |
(SD) | 0.033 | 0.096 | 0.085 | 0.103 | 0.097 | 0.119 |
SND(BC)(Mean) | 0.007 | 0.012 | 0.036 | 0.036 | 0.033 | 0.048 |
(SD) | 0.019 | 0.031 | 0.045 | 0.054 | 0.041 | 0.047 |
Table 4.
Sensitivity | Specificity | AUC | PPV | NPV | Proportion  null | Median |Â| | {X1, X2}∈ tree | # Unique X’s | |
---|---|---|---|---|---|---|---|---|---|
θ = 0.9: | |||||||||
Base Case | 0.47 | 0.89 | 0.77 | 0.55 | 0.84 | 0.07 | 189 | 0.51 | 3.62 |
p = 30 | 0.48 | 0.89 | 0.78 | 0.51 | 0.85 | 0.15 | 213 | 0.52 | 2.89 |
Correlated | 0.52 | 0.85 | 0.76 | 0.52 | 0.85 | 0.03 | 247 | 0.32 | 4.08 |
Sub-spec (ai) | 0.54 | 0.89 | 0.79 | 0.60 | 0.86 | 0.04 | 218 | 0.62 | 3.59 |
Sub-spec (ai, bi) | 0.50 | 0.89 | 0.76 | 0.57 | 0.85 | 0.05 | 216 | 0.48 | 3.61 |
n = 400 | 0.40 | 0.82 | 0.66 | 0.42 | 0.81 | 0.05 | 93 | 0.19 | 4.09 |
n = 2000 | 0.68 | 0.93 | 0.87 | 0.74 | 0.90 | 0.03 | 462 | 0.78 | 3.03 |
Larger |A| | 0.41 | 0.91 | 0.78 | 0.77 | 0.61 | 0.06 | 233 | 0.54 | 3.35 |
θ = 1.5 | 0.78 | 0.91 | 0.89 | 0.77 | 0.93 | 0.00 | 245 | 0.83 | 3.39 |
θ = 0: | |||||||||
Base Case | - | 0.87 | - | - | - | 0.25 | 131 | - | 3.29 |
p = 30 | - | 0.88 | - | - | - | 0.27 | 103 | - | 2.90 |
Correlated | - | 0.82 | - | - | - | 0.09 | 164 | - | 4.19 |
Sub-spec (ai) | - | 0.84 | - | - | - | 0.17 | 152 | - | 3.44 |
Sub-spec (ai, bi) | - | 0.86 | - | - | - | 0.28 | 131 | - | 3.25 |
n = 400 | - | 0.78 | - | - | - | 0.05 | 86 | - | 4.10 |
n = 2000 | - | 0.88 | - | - | - | 0.25 | 230 | - | 2.84 |
Table 5.
Q(A) | Q(Â) | Q̂(Â) | Q̂(Â)BC | |
---|---|---|---|---|
θ = 0.9: | ||||
Base Case | 0.139 | 0.070(0.047) | 0.111(0.046) | 0.072(0.044) |
p = 30 | 0.137 | 0.062(0.047) | 0.092(0.049) | 0.069(0.048) |
Correlated | 0.129 | 0.055(0.040) | 0.110(0.035) | 0.061(0.036) |
Sub-spec (ai) | 0.133 | 0.078(0.043) | 0.113(0.043) | 0.073(0.046) |
Sub-spec (ai, bi) | 0.126 | 0.072(0.041) | 0.106(0.043) | 0.064(0.048) |
n = 400 | 0.137 | 0.046(0.052) | 0.107(0.050) | 0.051(0.056) |
n = 2000 | 0.135 | 0.096(0.044) | 0.109(0.036) | 0.081(0.042) |
Larger |A| | 0.094 | 0.066(0.033) | 0.105(0.042) | 0.065(0.045) |
θ = 1.5 | 0.228 | 0.167(0.057) | 0.156(0.040) | 0.132(0.056) |
θ = 0: | ||||
Base Case | 0.000 | 0.010(0.023) | 0.074(0.054) | 0.036(0.045) |
p = 30 | 0.000 | 0.016(0.026) | 0.072(0.053) | 0.043(0.043) |
Correlated | 0.000 | 0.006(0.023) | 0.095(0.043) | 0.041(0.038) |
Sub-spec (ai) | 0.000 | 0.010(0.029) | 0.079(0.046) | 0.037(0.036) |
Sub-spec (ai, bi) | 0.000 | 0.011(0.028) | 0.063(0.049) | 0.025(0.037) |
n = 400 | 0.000 | 0.010(0.035) | 0.099(0.046) | 0.038(0.046) |
n = 2000 | 0.000 | 0.003(0.017) | 0.064(0.045) | 0.035(0.034) |
References
- 1.Ruberg SJ, Chen L, Wang Y. The mean doesn’t mean as much anymore: Finding sub-groups for tailored therapeutics. Clinical Trials. 2010;7:574–583. doi: 10.1177/1740774510369350. [DOI] [PubMed] [Google Scholar]
- 2.Assmann SF, Pocock SJ, Enos LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. The Lancet. 2000;355:1064–1069. doi: 10.1016/S0140-6736(00)02039-0. [DOI] [PubMed] [Google Scholar]
- 3.Brookes ST, Whitley E, Peters TJ, Mulheran PA, Egger M, Smith GD. Subgroup analyses in randomized controlled trials: quantifying the risks of false-positives and false-negatives. Health Technology Assessment. 2001;5(33):1–56. doi: 10.3310/hta5330. [DOI] [PubMed] [Google Scholar]
- 4.Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Oncol. 2004;57:229–236. doi: 10.1016/j.jclinepi.2003.08.009. [DOI] [PubMed] [Google Scholar]
- 5.Cui L, Hung HMJ, Wang SJ, Tsong Y. Issues related to subgroup analysis in clinical trials. J Biopharmaceutical Statistics. 2002;12(3):347–358. doi: 10.1081/bip-120014565. [DOI] [PubMed] [Google Scholar]
- 6.Lagakos SW. The challenge of subgroup analyses-Reporting without distorting. N Engl J Med. 2006;354(16):1667–1669. doi: 10.1056/NEJMp068070. [DOI] [PubMed] [Google Scholar]
- 7.Peto R, Collins R, Gray R. Large-scale randomized evidence: Large simple trials and overviews of trials. J Clin Epidemiol. 1995;48:23–40. doi: 10.1016/0895-4356(94)00150-o. [DOI] [PubMed] [Google Scholar]
- 8.Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Statistics in Medicine. 2002;21:2917–2930. doi: 10.1002/sim.1296. [DOI] [PubMed] [Google Scholar]
- 9.Rothwell PM. Subgroup analysis in randomized controlled trials: importance, indications and interpretation. Lancet. 2005;365:176–186. doi: 10.1016/S0140-6736(05)17709-5. [DOI] [PubMed] [Google Scholar]
- 10.Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and Interpretation of Treatment Effects in Subgroups of Patients in Randomized Clinical Trials. JAMA. 1991;266(1):93–98. [PubMed] [Google Scholar]
- 11.Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in Medicine - Reporting of subgroup analyses in clinical trials. New Eng J Med. 2007;357(21):2189–2194. doi: 10.1056/NEJMsr077003. [DOI] [PubMed] [Google Scholar]
- 12.Feinstein AR. The problem of cogent subgroups: a clinicostatistical tragedy. J Clin Epidemiol. 1998 Apr;51(4):297–9. doi: 10.1016/s0895-4356(98)00004-3. [DOI] [PubMed] [Google Scholar]
- 13.Kehl V, Ulm K. Responder identification in clinical trials with censored data. Comput Stat Data Anal. 2006 Mar;50(5):1338–1355. DOI= http://dx.doi.org/10.1016/j.csda.2004.11.015. [Google Scholar]
- 14.Dixon DO, Simon R. Bayesian subset analysis. Biometrics. 1991;47:871–881. [PubMed] [Google Scholar]
- 15.Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics. 1985;41:361–372. [PubMed] [Google Scholar]
- 16.Simon R. Bayesian subset analysis: application to studying treatment-by-gender interactions. Statistics in Medicine. 2002;21:2909–2916. doi: 10.1002/sim.1295. [DOI] [PubMed] [Google Scholar]
- 17.Negassa A, Ciampi A, Abrahamowicz M, Shapiro S, Boivin JR. Tree-structured subgroup analysis for censored survival data: Validation of computationally inexpensive model selection criteria. Statistics and Computing. 2005;15:231–239. [Google Scholar]
- 18.Su X, Tsai CL, Wang H, Nickerson DM, Bogong L. Subgroup analysis via recursive partitioning. J Machine Learning Research. 2009;10:141–158. [Google Scholar]
- 19.Su X, Zhou T, Yan X, Fan J, Yang S. Interaction trees with censored survival data. Intl J Biostatistics. 2008;4(1):article 2. doi: 10.2202/1557-4679.1071. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Freidlin B, Jiang W, Simon R. The Cross-Validated Adaptive Signature Design. Clin Cancer Res. 2010;16(2):691–698. doi: 10.1158/1078-0432.CCR-09-1357. [DOI] [PubMed] [Google Scholar]
- 21.Breiman L. Random Forests. Machine Learning. 2001;45(1):532. [Google Scholar]