Abstract
The causal mechanism relating a clinical outcome (efficacy or safety endpoint) to putative biomarkers, clinical baseline variables, and related predictors is usually unknown and must be deduced empirically from experimental data. Such relationships enable the development of tailored therapeutics and the implementation of a precision medicine strategy in clinical trials, helping to stratify patients in terms of disease progression, clinical response, treatment differentiation, and so on. These relationships often require complex modeling to develop prognostic and predictive signatures. For ease of interpretation and implementation in clinical practice, it is preferable to define a multivariate biomarker signature in terms of thresholds (cutoffs/cut points) on individual biomarkers. In this paper, we propose methods for developing such signatures in the context of continuous, binary, and time-to-event endpoints. Results from simulations and a case study illustration are also provided.
Keywords: subgroup identification, precision medicine, clinical trial, variable selection, cutoff estimation, predictive modeling, cross-validation, predictive significance
1. Introduction
Biomarkers are useful for several applications in pharmaceutical drug development. Examples include identifying patients who are more likely to respond to a drug, enrolling patients with the specific disease subtype of interest (e.g., Alzheimer’s versus frontotemporal or vascular dementia), determining which patients are more likely to progress in the disease over the next few months/years, assessing efficacy/safety early to enable quicker go/no-go decisions, obtaining an earlier or more accurate read on liver, renal, or other adverse reactions, and so on. The focus of this paper is on biomarkers that help identify patient subgroups that are more likely to have a desired efficacy or safety profile in response to a specific treatment of interest, as in personalized medicine applications.
For these and related applications, the discovery and evaluation of biomarkers with the appropriate performance characteristics is critical. Biomarkers can be of a wide variety such as genomic, proteomic, imaging, electrophysiological, cytological, histological, physiological, behavioral, clinical, and demographic. A biomarker signature is a combination of one or more of these markers that are measured at baseline and can predict an outcome of clinical interest via an empirical model or rule. Biomarker signatures may be derived from a variety of possible datasets depending on the intended goals; the number of subjects/samples may be on the order of tens to hundreds, and the number of markers considered may range from tens to hundreds, thousands, or even millions. Here, we propose some threshold-based methods to derive multivariate signatures of two main types:
Prognostic: identify a subgroup of patients that are more likely to experience an outcome of interest (efficacy, toxicity, disease progression, etc.).
Predictive: identify a subgroup of patients that respond better to a specific treatment, but not to other treatments.
Threshold-based signatures are often of most interest to clinicians from a standpoint of practical utility, owing to their ease of interpretation. In the statistical literature, there are many examples of such threshold-based methods that have been used in predictive modeling [1], for example, CART [2], MARS [3], and RuleFit [4]. More recently, such methods and their variants are being increasingly employed in clinical applications, especially in areas like biomarker signature discovery [5–8]. Meanwhile, there is an increasing need to consider personalized therapeutics in the pharmaceutical industry, where the goal is to identify patient subgroups that have the desired efficacy/safety profile. Recent methods developed to tackle this problem include PRIM [9], interaction trees [10,11], SIDES [12,13], Bayesian approaches [14], and others. In this paper, we propose additional algorithms that provide a clear binary stratification of patients via a simple signature rule that takes the form of thresholds on a set of biomarkers and other predictors. Such simple decision rules for patient subgroup selection are convenient in clinical practice. Most of the current examples of personalized therapeutics apply a simple binary decision rule based on one or a few biomarkers; for example, gefitinib (Iressa; AstraZeneca), a small-molecule epidermal growth factor receptor (EGFR) inhibitor, initially failed in the overall patient population (all comers) but later was observed to have favorable efficacy in patients with tumors harboring certain sensitizing mutations within EGFR [15,16]. Trastuzumab (Herceptin; Roche/Genentech), a monoclonal antibody directed against the HER2 receptor, is effective in patients with tumors that overexpress the target antigen or manifest amplification of the target gene [17,18].
Anti-EGFR antibody therapies cetuximab (Erbitux; ImClone Systems/Bristol-Myers Squibb/Merck Serono) and panitumumab (Vectibix; Amgen/Takeda Bio) in colorectal cancer are effective in patients with the wild-type version of the KRAS gene [19–22]. Crizotinib (Xalkori; Pfizer), an anaplastic lymphoma kinase (ALK) and c-ros oncogene 1 (ROS1) inhibitor, and ceritinib (Zykadia; Novartis), an ALK inhibitor, are known to be effective only in ALK-positive non-small cell lung carcinoma (NSCLC) patients [23,24].
In Section 2, we describe two methods for subgroup identification: (i) sequential-BATTing, a multivariate extension of the bootstrapping and aggregating of thresholds from trees (BATTing) and (ii) AIM-RULE, a multiplicative rules-based modification of the adaptive index model (AIM) [25]. The application of sequential-BATTing method has been described in McKeegan et al. [26], and here we provide the formal statistical description of this algorithm. We present these subgroup identification methods under a unified regression framework while incorporating other methods such as BATTing and AIM to search for the optimal stratification. We then describe a cross-validation (CV) procedure [9] to evaluate the performance of patient subgroup identification methods. In Section 3, we perform intensive simulations for comparing the performance of the proposed methods under different scenarios. Specifically, we compare our methods with the PRIM and SIDES methods. Real data illustration of these methods is presented in Section 4.
2. Methods
Consider a supervised learning problem with data (Xi, yi), i = 1, 2, …, n, where Xi is a p-dimensional vector of predictors and yi is the response/outcome variable for the ith patient. We assume that the observed data are independent and identically distributed copies of (X, y). We consider three major applications: linear regression for a continuous response, logistic regression for a binary response, and Cox regression for a time-to-event response, where yi = (Ti, δi), Ti is a right-censored survival time and δi is the censoring indicator. Denote the observed log likelihood or log partial likelihood by ℓ{η(·)}, where η(Xi) is a linear combination of the predictors. For example, η(·) represents the mean response in simple linear regression, the log odds ratio in logistic regression, and the log hazard ratio in the Cox proportional hazards model without intercept. We consider the following working model for the development of prognostic signatures (i.e., for identifying patient subgroups with favorable response, independent of the therapeutic),
η(Xi) = α + βω(Xi).   (1)
Similarly, we consider the following working model for predictive signatures (i.e., for identifying patient subgroups with favorable response to a specific therapeutic),
η(Xi) = α + γti + βtiω(Xi),   (2)
where t is the treatment indicator with 1 for treated and 0 for untreated subjects. In both models, ω(X) is the binary signature rule, with 1 and 0 for signature positive and negative subgroups, respectively. In this paper, we focus on multiplicative signature rules, that is,
ω(X) = I(s1X1≥s1c1) × I(s2X2≥s2c2) × ⋯ × I(smXm≥smcm),   (3)
where cj is the cutoff on the jth selected marker Xj, sj = ±1 indicates the direction of the binary cutoff for the selected marker, and m is the number of selected markers. Note that, in contrast, the AIM approach [25] fits an additive signature score as the sum of the individual binary rules, ωAIM(X) = I(s1X1≥s1c1) + ⋯ + I(smXm≥smcm).
We focus on multiplicative signature rules (3) because they are more interpretable in analyses of clinical data and do not require further thresholding of the signature score ω(X) to classify patients into signature positive and negative subgroups. In this section, we discuss several algorithms that derive the optimal subset of (Xj, cj, sj) triplets for constructing ω(X), with the objective of optimizing the statistical significance level for testing β = 0 in (1) or (2) based on the score test statistic:
S{ω(·)} = U{ω(·)}⊤V{ω(·)}U{ω(·)},   (4)
where U{ω(·)} is the score function evaluated under the null hypothesis β = 0 and V{ω(·)} is the corresponding inverse of the Fisher information matrix under that null hypothesis [25]. The specific form of this test statistic depends on the employed working model. For example,
U{ω(·)} = Σi = 1, …, n tiω(Xi){yi − expit(α̂ + γ̂ti)}, with expit(u) = exp(u)/{1 + exp(u)},

for estimating the predictive signature rule for binary responses, where α̂ and γ̂ are consistent estimators for α and γ, respectively, in the working model without the interaction term, and ω(·) is the current estimate of the rule.
Remark: We may use t−πt to replace the treatment indicator t in the working model and the corresponding algorithm such as AIM, where πt is the proportion of treated patients in the study. In this way, the main effect becomes orthogonal to this ‘centered treatment indicator’, and thus, the search for predictive signature based on the score test would not be confounded by the presence of prognostic effect. On the other hand, when the prognostic and predictive signatures do share common components, the current algorithm may be more sensitive in estimating such predictive signature rules.
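To make the form of (3) concrete, the following is a minimal Python sketch (the paper's own implementations are in R; the function name and toy data here are illustrative) that evaluates a multiplicative signature rule from a list of (marker index, cutoff, direction) triplets:

```python
import numpy as np

def signature_rule(X, triplets):
    """Multiplicative rule: omega(X) = prod_j I(s_j * X_j >= s_j * c_j).

    X        : (n, p) matrix of baseline markers
    triplets : list of (j, c_j, s_j) with s_j = +1 or -1
    Returns a 0/1 label per patient (1 = signature positive).
    """
    omega = np.ones(X.shape[0], dtype=bool)
    for j, c, s in triplets:
        omega &= (s * X[:, j] >= s * c)  # each factor is one binary threshold rule
    return omega.astype(int)

# toy example: positive means marker 0 >= 1.0 AND marker 1 <= 1.8
X = np.array([[1.2, -0.5],
              [0.3,  2.0],
              [2.1,  1.5]])
labels = signature_rule(X, [(0, 1.0, 1), (1, 1.8, -1)])  # -> [1, 0, 1]
```

Because the rule is a product of indicators, a patient is signature positive only if every selected marker clears its threshold in the stated direction, which is what makes (3) directly usable as a clinical inclusion rule.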
2.1. Bootstrapping and aggregating of thresholds from trees
Bootstrapping and aggregating of thresholds from trees (BATTing) is a univariate resampling tree-based approach for estimating robust thresholds on predictors. The motivation for BATTing is that a single tree built on the original dataset may be unstable, not robust against small perturbations in the data, and prone to overfitting, thus resulting in lower prediction (stratification) power. We note that the idea of BATTing is closely related to Breiman’s bagging method [27] for generating multiple versions of a predictor via bootstrapping and using these to obtain an aggregated predictor. We summarize the BATTing algorithm as follows:
BATTing procedure
Step 1 Draw B bootstrap datasets from the original dataset
Step 2 Build a stub on the outcome of interest with a single split on the predictor for each of these B datasets, such that the two terminal nodes will yield the best possible separation of the groups with respect to the objective function S{ω(·)} in (4)
Step 3 Examine the distribution/spread of the B single-stub thresholds and use a robust estimate (e.g., median) of this distribution as the BATTing threshold estimate. The direction s of the threshold is then determined by the predefined desired effect in the signature positive group (e.g., lower incidence rate for binary outcome and longer median survival time for the survival outcome).
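The three steps above can be sketched as follows. This illustrative Python version (the paper's implementations are in R, and all names here are ours) substitutes an absolute two-sample t statistic for the model-based objective S{ω(·)} and takes the median of the bootstrap thresholds, per Step 3:

```python
import numpy as np

def best_split(x, y):
    """Step 2: single-split stub; cutoff maximizing group separation (|t| surrogate)."""
    best_c, best_stat = None, -np.inf
    for c in np.unique(x)[1:]:          # skip the minimum so both nodes are nonempty
        g1, g0 = y[x >= c], y[x < c]
        if len(g1) < 2 or len(g0) < 2:
            continue
        se = np.sqrt(g1.var(ddof=1) / len(g1) + g0.var(ddof=1) / len(g0))
        stat = abs(g1.mean() - g0.mean()) / se if se > 0 else -np.inf
        if stat > best_stat:
            best_stat, best_c = stat, c
    return best_c

def batting(x, y, B=50, seed=0):
    """Steps 1-3: bootstrap B stub thresholds and return their median."""
    rng = np.random.default_rng(seed)
    cuts = []
    for _ in range(B):                  # Step 1: bootstrap resampling
        idx = rng.integers(0, len(x), len(x))
        c = best_split(x[idx], y[idx])  # Step 2: stub on each bootstrap dataset
        if c is not None:
            cuts.append(c)
    return float(np.median(cuts))       # Step 3: robust aggregate of the B thresholds
```

With a step change in the mean of y at x = 0, the aggregated threshold lands near 0 even at moderate sample sizes, which is the stabilization effect illustrated in Figures 1 and 2.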
We perform a simple simulation to demonstrate the advantage of BATTing in terms of the robustness of threshold estimation. We use the same simulation model as described in the simulation section for prognostic signatures, except that we include only one true predictor as the candidate predictor. Specifically, under the simulated model, we set α = 0, β = 0.2, and c1 = 0. We investigate the benefit of BATTing on threshold estimation under different scenarios (with different sample sizes and effect sizes). Figure 1 shows the distribution of BATTing threshold estimates from 500 simulation runs across different numbers of bootstrap samples for sample size = 100 and effect size = 0.2, with the true optimal cutoff being 0 (red dashed vertical line). It shows that BATTing helps reduce the influence of perturbations in the dataset and thus stabilizes the threshold estimate. Figure 2 shows the interquartile range of threshold estimates from 500 simulation runs across different effect sizes when n = 100 (left panel) and across different sample sizes when the effect size is 0.2 (right panel). These plots also demonstrate that the BATTing procedure reduces the variation and stabilizes the threshold estimate. To further evaluate the advantage of using the more ‘stable’ threshold in terms of the accuracy of identifying the patient subgroups of interest, we also estimated the improvement in subgroup identification accuracy of the BATTing procedure with 50 bootstraps versus a single stub without bootstrapping. The simulation shows a median accuracy of 76% using a stub and 85% using BATTing, which represents a 12% relative improvement. In our experience, a number of bootstrap samples ≥ 50 is adequate and recommended in practice. We also note that fewer bootstraps suffice as the sample size or the effect size increases.
Figure 1. Simulation comparison of BATTing threshold distributions (sample size = 100, effect size = 0.2; ‘n.boot’ refers to the number of bootstraps).
Figure 2. Interquartile ranges of the distribution of threshold estimates.
2.2. Sequential bootstrapping and aggregating of thresholds from trees
Sequential BATTing is designed to derive a binary signature rule of the form (3) in a stepwise manner by extending the BATTing procedure. The resulting signature rule is a product of predictor–threshold pairs. The details of the algorithm are described in the following steps:
Sequential BATTing procedure
Step 1 ω(0)(X) = 1, Λ0 = {1, …, p}, where p is the number of candidate predictors.
Step 2 For j = 1, …, m: for each k ∈ Λj − 1, find ck and sk via the BATTing procedure described earlier; then select h(j) ∈ Λj − 1 to maximize S{ω(j − 1)(X)I(shXh≥shch)} with respect to h. Update ω(j)(X) ← ω(j − 1)(X)I(sh(j)Xh(j)≥sh(j)ch(j)) and Λj ← Λj − 1\{h(j)}.
Step 3 The final signature rule is ω(X) = ω(j − 1)(X) if the likelihood ratio test statistic of ω(j)(X) versus ω(j − 1)(X) is not significant at a predefined level α.
Note that the p-values from the likelihood ratio test in the stopping criterion do not have the usual interpretation because of the multiplicity inherent in subgroup search, and the α cutoff in the stopping criterion serves as a tuning parameter. Nevertheless, the choice of α = 0.05 prevents premature termination of the algorithm and encourages inclusion of potentially informative markers.
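A compact sketch of the sequential search follows, again in illustrative Python: a two-sample |t| statistic stands in for S{ω(·)}, a minimum-gain check stands in for the likelihood ratio stopping rule, and a quantile grid replaces the full BATTing threshold search to keep the example short. All names and simplifications are ours, not the paper's.

```python
import numpy as np

def split_stat(omega, y):
    """Surrogate for S{omega(.)}: |t| statistic for y in omega==1 vs omega==0."""
    g1, g0 = y[omega == 1], y[omega == 0]
    if len(g1) < 2 or len(g0) < 2:
        return -np.inf
    se = np.sqrt(g1.var(ddof=1) / len(g1) + g0.var(ddof=1) / len(g0))
    return abs(g1.mean() - g0.mean()) / se if se > 0 else -np.inf

def best_threshold(x, y, omega):
    """Best (c, s) for one marker when multiplied into the current rule omega."""
    best = (-np.inf, None, None)
    for c in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        for s in (1, -1):
            stat = split_stat(omega * (s * x >= s * c), y)
            if stat > best[0]:
                best = (stat, c, s)
    return best

def seq_batting(X, y, min_gain=0.5):
    """Greedily add (marker, cutoff, direction) while the statistic keeps improving."""
    n, p = X.shape
    omega = np.ones(n, dtype=int)            # Step 1: everyone starts positive
    rule, remaining, cur = [], set(range(p)), 0.0
    while remaining:
        cand = {j: best_threshold(X[:, j], y, omega) for j in remaining}
        j = max(cand, key=lambda k: cand[k][0])
        stat, c, s = cand[j]
        if stat < cur + min_gain:            # Step 3: stand-in for the LRT stopping rule
            break
        omega = omega * (s * X[:, j] >= s * c)   # Step 2: multiply in the new factor
        rule.append((j, float(c), int(s)))
        remaining.discard(j)
        cur = stat
    return rule, omega
```

On data where the response depends on X1 and X2 jointly exceeding their thresholds, the greedy search picks those two markers first and then stops, returning a two-factor product rule.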
2.3. AIM and AIM-rule
The AIM [25] essentially uses the same model (1) for the prognostic signature and (2) for the predictive signature, but the signature rule ωAIM(X) is defined as the sum of binary scores, that is, ωAIM(X) = I(s1X1≥s1c1) + ⋯ + I(smXm≥smcm), where the m predictors are selected from a set of p candidates via a CV procedure. We modified the original AIM procedure for the purpose of subgroup identification.
AIM procedure:
Step 1 Begin with ωAIM(0)(X) = 0, Λ0 = {1, …, p}, where p is the number of candidate predictors.
Step 2 For j = 1, 2, …, m, update ωAIM(j)(X) ← ωAIM(j − 1)(X) + I(sh(j)Xh(j)≥sh(j)ch(j)), where the index h(j) ∈ Λj − 1 and (ch(j), sh(j)) are selected to maximize the score statistic of the updated score, and Λj ← Λj − 1\{h(j)}.
Step 3 BATTing is applied to the resulting AIM score ωAIM(X) = ωAIM(m)(X) to construct the final signature rule in the form of I(sAIMωAIM(X)≥cAIM) with estimated direction sAIM and cutoff cAIM.
Owing to the randomness of CV, the optimal number of predictors m may differ in each implementation for the same dataset. In order to stabilize the variable selection process (i.e., the estimation of m), we propose a Monte Carlo procedure that entails repeating the CV multiple times, estimating the optimal number of predictors m each time, and using the median of the ms derived from the cross-validation runs as the final optimal number of predictors. We performed simulations (not reported) demonstrating that the proposed Monte Carlo procedure stabilizes the estimation of m; these simulations also suggest 25 to 50 as an adequate number of Monte Carlo repetitions.
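The stabilization step is simple enough to state in code. In this minimal sketch, `cv_estimate` is a hypothetical callable standing in for one full CV run that returns its selected m (it is not part of the AIM package API):

```python
import numpy as np

def stabilized_m(cv_estimate, M=25):
    """Repeat the CV-based selection M times and take the median m."""
    ms = [cv_estimate() for _ in range(M)]   # one m per Monte Carlo CV repetition
    return int(np.median(ms))

# toy stand-in: a noisy selector that usually returns 2 but sometimes 3 or 4
draws = iter([3, 2, 2, 4, 2] * 5)
m_final = stabilized_m(lambda: next(draws), M=25)  # -> 2
```

The median is preferred over the mean here for the same reason as in BATTing: it is robust to occasional CV runs that select an unusually large or small model.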
Note that step 2 of the AIM algorithm implicitly ranks the selected predictors in terms of their contribution to the model. This ordering of predictor importance, however, is not reflected in the additive form of the AIM score ωAIM(X), because it assigns equal weights to all m predictors. Intuitively, the AIM procedure is efficient if all m predictors contribute roughly equally to the model; otherwise, the same AIM score may not imply the same effect for different patients. Here, we propose another variation of the AIM algorithm, called AIM-rule, that uses a multiplicative binary rule ω(k)(X) = I(s1X1≥s1c1) × ⋯ × I(skXk≥skck), k = 1, 2, …, m, as the final signature.
AIM-rule procedures:
Step 1 Construct the AIM score ωAIM(X) = I(s1X1≥s1c1) + ⋯ + I(smXm≥smcm). Without loss of generality, we assume that the relevant features enter the signature rule in the order X1, X2, …, Xm.
Step 2 Construct the final signature rule in the form of ω(X) = ω(k*)(X) = I(s1X1≥s1c1) × ⋯ × I(sk*Xk*≥sk*ck*), where the index k* is selected via BATTing on the basis of the ordered signature rules ω(k)(X) = I(s1X1≥s1c1) × ⋯ × I(skXk≥skck), k = 1, …, m.
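Given binary rules already ordered by their entry into the AIM score, the prefix-selection idea in Step 2 can be sketched as follows. This illustrative Python version (names ours) simply scores each nested product rule with a two-sample |t| statistic, standing in for the BATTing step on the ordered rules:

```python
import numpy as np

def tstat(omega, y):
    """Group-separation statistic for omega==1 vs omega==0 (surrogate objective)."""
    g1, g0 = y[omega == 1], y[omega == 0]
    if len(g1) < 2 or len(g0) < 2:
        return -np.inf
    se = np.sqrt(g1.var(ddof=1) / len(g1) + g0.var(ddof=1) / len(g0))
    return abs(g1.mean() - g0.mean()) / se if se > 0 else -np.inf

def aim_rule_prefix(rules, y):
    """rules: (n, m) 0/1 matrix, columns ordered by importance.
    Returns the prefix length k* whose product rule best separates y."""
    omega = np.ones(rules.shape[0], dtype=int)
    best_k, best_stat = 1, -np.inf
    for k in range(rules.shape[1]):
        omega = omega * rules[:, k]          # nested product rule using the first k+1 rules
        stat = tstat(omega, y)
        if stat > best_stat:
            best_stat, best_k = stat, k + 1
    return best_k
```

If the first two rules carry the signal and later rules are noise, the statistic peaks at k = 2: adding a noise rule only dilutes the positive group without improving the separation.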
2.4. Performance measures
On the basis of the signature rule derived from the aforementioned algorithms, patients are stratified into signature positive and signature negative groups. It should be noted that the resubstitution p-values for β associated with the final signature rules in (1) and (2), and the resubstitution summary measures determined from the same data used to derive the stratification signature, may be severely biased because the data have already been explored to derive the signature. Instead, we advocate determining the p-value of the signature via K-fold CV, which is referred to as predictive significance [9]. From this CV procedure, we can also estimate the effect size and related summary statistics, along with estimates of predictive/prognostic accuracy. For practical use, we recommend K = 5. We now describe the procedure for deriving the predictive significance and the associated summary measures. First, the dataset is randomly split into K subsets (folds). A signature rule is then derived from K − 1 folds using one of the algorithms. This signature rule is then applied to the left-out fold, resulting in the assignment of a signature positive or signature negative label to each patient in this fold. This procedure is repeated for each of the remaining folds, leaving them out one at a time, resulting in a signature positive or signature negative label for each patient in the entire dataset. These signature positive and negative groups are then analyzed, and a p-value for β is calculated; we refer to this p-value as the CV p-value. Owing to the variability in the random splitting of the dataset, this K-fold CV procedure is repeated multiple times (e.g., 100 times), and the median of the CV p-values across these CV iterations is used as an estimate of the predictive significance. Note that the CV p-value preserves the type-I error of falsely claiming a signature when there is no true signature, as demonstrated in the simulation section.
Therefore, it can be used to conclude that no signature is found if the CV p-value for the effect of interest is greater than a prespecified significance level (e.g., 0.05). In addition to p-values, we can use the same procedure to calculate the CV versions of relevant summary statistics (e.g., response rate, median survival time, restricted mean survival time [28], sensitivity, and specificity) and point estimates of the treatment effect in each subgroup (odds ratios, hazard ratios, etc.).
Note that this CV procedure evaluates the predictive performance only after aggregating the predictions from all the left-out folds, which is an important difference from the more traditional/common approaches that evaluate the predictive performance of each fold separately. The proposed approach is in the same spirit as the pre-validation scheme proposed in the literature [29], as well as the cross-validated Kaplan–Meier curves proposed in the literature [29–31]. The proposed CV procedure preserves the sample size of the original training set, which is particularly important for subgroup identification algorithms where we evaluate the type-I error (p-values) for testing β = 0 and also for more reliable estimation of summary statistics and point estimates; this is especially critical when the training dataset is not large, as is often the case in phase-I and phase-II clinical trials.
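The aggregate-then-test CV scheme can be sketched as follows. This is a hedged Python illustration: `derive_signature` is a hypothetical stand-in for any of the algorithms above (here, a median cut on marker 0), and a normal-approximation two-sample test stands in for the model-based test of β = 0.

```python
import numpy as np
from math import erf, sqrt

def two_sample_pvalue(y1, y0):
    """Two-sided normal-approximation p-value for a difference in means."""
    se = sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    z = abs(y1.mean() - y0.mean()) / se
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

def derive_signature(X, y):
    """Hypothetical stand-in for a signature-finding algorithm."""
    c = np.median(X[:, 0])                       # threshold fit on training folds only
    return lambda Xnew: (Xnew[:, 0] >= c).astype(int)

def cv_pvalue(X, y, K=5, R=20, seed=0):
    """Median CV p-value: label each left-out fold, pool ALL labels, then test once."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pvals = []
    for _ in range(R):                            # repeat the random K-fold split
        folds = rng.permutation(n) % K
        labels = np.empty(n, dtype=int)
        for k in range(K):
            test = folds == k
            rule = derive_signature(X[~test], y[~test])
            labels[test] = rule(X[test])          # signature label for the left-out fold
        pvals.append(two_sample_pvalue(y[labels == 1], y[labels == 0]))
    return float(np.median(pvals))                # estimate of predictive significance
```

The key point is that the p-value is computed once per CV replicate, after pooling the left-out labels over the whole dataset, rather than once per fold.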
The predictive significance and the related CV versions of summary statistics calculated in the aforementioned process help reduce the bias of the resubstitution performance assessment of the final signature and provide a realistic estimate of the predictive performance of the recovered signature. However, owing to the exploratory nature of this retrospective subgroup identification, follow-up randomized controlled trials need to be conducted to validate the predictive value of the final signature.
3. Simulations
Simulations were performed to evaluate the three proposed subgroup identification algorithms: sequential BATTing (Seq-BT hereinafter), AIM, and AIM-rule, along with two methods from the literature: PRIM [9] and SIDES [12,13]. All simulations were performed using the R programming language. The R program for the SIDES algorithm was downloaded from http://biopharmnet.com/subgroup-analysis-software/. The R scripts for our proposed Monte Carlo version of the AIM method and for the AIM-rule algorithm were based on the AIM package in R [25].
3.1. Development of predictive signature
For the development of the predictive signature, our simulation model (5) is a generalized version of the simulation model considered in the literature [12,13] in the following manner: (i) we allow each predictor to be continuous instead of dichotomized, because most predictors in practice (e.g., gene/protein expression biomarkers) are continuous with no natural thresholds, and finding a robust threshold is an important objective in biomarker signature development, as demonstrated in the simple simulation example in Section 2; (ii) we consider the number of candidate predictors to be 10 and 18 with different correlation structures (pairwise correlations between noise predictors and true predictors ranging from 0.1 to 0.5), total sample sizes ranging from n = 300 to n = 1000, and treatment effect sizes in the optimal subgroup ranging from low (0.2) through medium (0.5) to high (0.8); and (iii) we consider additional scenarios, such as when a predictor has both predictive and prognostic effects. For each simulation, a balanced design is assumed. The first two predictors, X1 and X2, are assumed to predict the treatment benefit, and signature positive patients are those with both of these markers above predefined thresholds, cj, j = 1, 2. A common positive prevalence rate, f0 = 0.50, is assumed for predictors X1 and X2, that is, P(Xij > cj) = f0, j = 1, 2.
Specifically, the response yi for patient i is generated using the following model,

| (5) |

where
ϵ ~ N(0, σ2) and (a, σ2) are selected constants to achieve the desired size of the treatment effect (hereafter called the predictive effect size), and b represents the prognostic effect of marker X2, so that X2 is both a predictive and a prognostic marker when b ≠ 0. Under the aforementioned setting, we note that E(yi|ti = 1) − E(yi|ti = 0) = 0 when b = 0, so there is no overall treatment effect in this case. The interaction plot in Figure 3 illustrates the response rates in each of the simulated subgroups across the treatment and control arms when the effect size is 0.5 and b = 0.
Figure 3. Simulation model illustration with effect size = E(Y|Trt, sig+) − E(Y|ctrl, sig+) = 0.5 and b = 0.
The predictor vector Xi is generated from a multivariate normal distribution with mean 0, var(Xij) = 2, Corr(Xi1, Xi2) = Corr(Xij, Xik) = 0.2 for j, k > 2, and Corr(Xi1, Xik) and Corr(Xi2, Xik), k > 2, are assigned values ranging from 0 to 0.5.
For each scenario, 500 simulation runs are carried out. In each run, (i) two datasets are generated: one for training and the other for testing; (ii) the subgroup identification methods are applied to the training dataset to derive the signature rules, and the predictive significance is then evaluated via 100 iterations of fivefold cross-validation (CV p-value); (iii) if the CV p-value of the treatment effect in the signature positive subgroup is greater than a prespecified significance level (e.g., 0.05), a null signature (i.e., ω(·) ≡ 0) is returned and all subjects are labeled as signature negative, concluding that no subgroup is found; otherwise, the derived signature is considered significant and is applied to the testing dataset. The following criteria [32] are used on the testing dataset to compare the performance of the different subgroup identification methods.
False positive rate: P(a significant subgroup is asserted | ω(·) ≡ 0), where P(·) refers to the probability.
Identification power (probability of asserting significant effects): P(a subgroup is identified from the training dataset and the treatment effect is significant for the estimated signature positive subgroup | ω(·) ≠ 0), that is, the probability of identifying a subgroup from the training dataset that is also validated in the testing dataset, when a true subgroup exists.
Selection accuracy: P(patients are correctly labeled).
Positive percentage agreement or sensitivity: P(patients are labeled as signature positive | patients are signature positive).
Negative percentage agreement or specificity: P(patients are labeled as signature negative | patients are signature negative).
Positive predictive value: P(patients are signature positive | patients are labeled as signature positive).
Negative predictive value: P(patients are signature negative | patients are labeled as signature negative).
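Criteria 3–7 are standard confusion-matrix quantities once each test-set patient carries both a true and an estimated signature label; a minimal sketch (names ours; assumes every denominator is nonzero):

```python
import numpy as np

def subgroup_metrics(true_pos, labeled_pos):
    """Criteria 3-7 from true vs estimated signature-positive labels."""
    t = np.asarray(true_pos, dtype=bool)
    p = np.asarray(labeled_pos, dtype=bool)
    tp, tn = (t & p).sum(), (~t & ~p).sum()
    fp, fn = (~t & p).sum(), (t & ~p).sum()
    return {
        "accuracy": (tp + tn) / t.size,      # criterion 3: selection accuracy
        "sensitivity": tp / (tp + fn),       # criterion 4: positive % agreement
        "specificity": tn / (tn + fp),       # criterion 5: negative % agreement
        "ppv": tp / (tp + fp),               # criterion 6: positive predictive value
        "npv": tn / (tn + fn),               # criterion 7: negative predictive value
    }

m = subgroup_metrics([1, 1, 0, 0], [1, 0, 0, 0])
# accuracy 0.75, sensitivity 0.5, specificity 1.0, ppv 1.0, npv 2/3
```

Criteria 1 and 2 (false positive rate and identification power) are instead computed across the 500 simulation runs, since they describe the behavior of the procedure rather than of a single signature.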
Table I shows the simulation results for the false positive rate and the power of identifying a significant predictive signature (probability of asserting a subgroup with a significant treatment effect) based on 500 simulation runs across different scenarios. For the null scenarios where no true subgroup exists (a = 0 for null predictive signatures), Table I shows that the false positive rates of asserting a significant subgroup are well controlled for all methods. For the scenarios with true subgroups, the ‘identification power’ is provided. The last column of the table corresponds to the power of asserting a significant treatment effect in the positive subgroup when the true subgroup label (oracle) is known. All methods are underpowered when the effect size and sample size are small; as the sample size increases, the identification power increases dramatically. When there is no prognostic effect (b = 0), AIM-rule has the largest power and AIM the second largest. For a given sample size and predictive effect size, when b > 0, the power of all methods decreases slightly as the prognostic effect increases. However, the power of AIM-rule is more negatively affected by increases in the prognostic effect than that of the other methods. Seq-BT has slightly better identification power than SIDES (practically similar), and PRIM has less power.
Table I.
False positive rate (predictive effect size = 0) and identification powers (predictive effect size>0) for predictive cases.
| Predictive effect size | n | a | b | Seq-BT | PRIM | AIM | AIM-rule | SIDES | Oracle |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 300 | 0.00 | 0.0 | 0.040 | 0.012 | 0.014 | 0.022 | 0.036 | 0.054 |
| 0.0 | 300 | 0.00 | 0.5 | 0.028 | 0.010 | 0.028 | 0.028 | 0.034 | 0.040 |
| 0.0 | 300 | 0.00 | 1.0 | 0.038 | 0.026 | 0.038 | 0.038 | 0.040 | 0.048 |
| 0.0 | 500 | 0.00 | 0.0 | 0.022 | 0.018 | 0.024 | 0.030 | 0.028 | 0.042 |
| 0.0 | 500 | 0.00 | 0.5 | 0.034 | 0.014 | 0.034 | 0.038 | 0.034 | 0.052 |
| 0.0 | 500 | 0.00 | 1.0 | 0.040 | 0.016 | 0.054 | 0.052 | 0.036 | 0.055 |
| 0.0 | 1000 | 0.00 | 0.0 | 0.034 | 0.018 | 0.024 | 0.016 | 0.042 | 0.063 |
| 0.0 | 1000 | 0.00 | 0.5 | 0.032 | 0.014 | 0.030 | 0.032 | 0.038 | 0.054 |
| 0.0 | 1000 | 0.00 | 1.0 | 0.032 | 0.016 | 0.032 | 0.034 | 0.036 | 0.043 |
| 0.5 | 300 | 0.85 | 0.0 | 0.252 | 0.148 | 0.360 | 0.462 | 0.204 | 0.878 |
| 0.5 | 300 | 0.85 | 0.5 | 0.236 | 0.118 | 0.378 | 0.438 | 0.172 | 0.904 |
| 0.5 | 300 | 0.85 | 1.0 | 0.204 | 0.106 | 0.368 | 0.358 | 0.174 | 0.857 |
| 0.5 | 500 | 0.85 | 0.0 | 0.642 | 0.472 | 0.740 | 0.868 | 0.544 | 0.981 |
| 0.5 | 500 | 0.85 | 0.5 | 0.618 | 0.436 | 0.784 | 0.844 | 0.532 | 0.984 |
| 0.5 | 500 | 0.85 | 1.0 | 0.588 | 0.374 | 0.732 | 0.710 | 0.522 | 0.970 |
| 0.5 | 1000 | 0.85 | 0.0 | 0.990 | 0.752 | 0.946 | 0.986 | 0.966 | 1.000 |
| 0.5 | 1000 | 0.85 | 0.5 | 0.980 | 0.936 | 0.974 | 0.996 | 0.966 | 0.999 |
| 0.5 | 1000 | 0.85 | 1.0 | 0.988 | 0.922 | 0.958 | 0.954 | 0.964 | 0.999 |
| 0.8 | 300 | 1.37 | 0.0 | 0.878 | 0.732 | 0.922 | 0.972 | 0.844 | 0.999 |
| 0.8 | 300 | 1.37 | 0.5 | 0.874 | 0.728 | 0.914 | 0.934 | 0.824 | 0.998 |
| 0.8 | 300 | 1.37 | 1.0 | 0.864 | 0.632 | 0.840 | 0.804 | 0.808 | 0.998 |
| 0.8 | 500 | 1.37 | 0.0 | 0.998 | 0.990 | 0.978 | 0.998 | 0.994 | 1.000 |
| 0.8 | 500 | 1.37 | 0.5 | 0.998 | 0.976 | 0.976 | 1.000 | 0.996 | 1.000 |
| 0.8 | 500 | 1.37 | 1.0 | 0.990 | 0.956 | 0.980 | 0.962 | 0.992 | 0.999 |
| 0.8 | 1000 | 1.37 | 0.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.8 | 1000 | 1.37 | 0.5 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.8 | 1000 | 1.37 | 1.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing; SIDES, Subgroup Identification based on Differential Effect Search.
Figure 4 shows box plots of the selection accuracy defined earlier for the predictive case. The selection accuracy increases as the sample size and effect size increase. Similar to what we observed for identification power in Table I, in the predictive case with no prognostic effect (b = 0), AIM has the best accuracy in correctly identifying subjects, sequential BATTing and SIDES have comparable performance, and PRIM has the least accuracy. We also note that, similar to the power, for a given sample size and predictive effect size, when b > 0, the accuracy of all methods decreases slightly as the prognostic effect increases. Simulation results for criteria 4 to 7 show patterns similar to those in Figure 4.
Figure 4. Simulation results of selection accuracy for predictive signatures. AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing; SIDES, Subgroup Identification based on Differential Effect Search.
3.2. Development of prognostic signature
For the development of the prognostic signature, our simulation model is as follows:

yi = d·I(Xi1 > c1)·I(Xi2 > c2) + ϵi,   (6)

where
ϵ ~ N(0, σ2), and (d, σ2) are predefined constants to achieve the desired size of the subgroup difference in the outcome (hereafter called the prognostic effect size). As defined in (6), signature positive patients are those with both of the markers X1 and X2 above predefined thresholds, cj, j = 1, 2. A common positive prevalence rate f0 = 0.50 is assumed for predictors X1 and X2, that is, P(Xij > cj) = f0, j = 1, 2. Similar to the predictive signature simulation, in addition to these two predictors, we consider the number of candidate predictors to be 10 and 18 with different correlation structures (pairwise correlations between noise predictors and true predictors ranging from 0.1 to 0.5), total sample sizes ranging from n = 300 to n = 1000, and the prognostic effect size ranging from low (0.2) through medium (0.5) to high (0.8). The predictor Xi is generated in the same way as in the simulation of predictive signature development.
For each scenario, 500 simulation runs are carried out. In each run, (i) two datasets are generated: one for training and the other for testing; (ii) the subgroup identification methods are applied to the training dataset to derive the signature rules, and the predictive significance is then evaluated via 100 iterations of fivefold cross-validation (CV p-value); (iii) if the CV p-value for the comparison of the signature positive versus negative subgroups is greater than a prespecified significance level (e.g., 0.05), a null signature (i.e., ω(·) ≡ 0) is returned and all subjects are labeled as signature negative, concluding that no subgroup is found; otherwise, the derived signature is considered significant and is applied to the testing dataset. The same criteria are used on the testing dataset to compare the performance of the different methods, except that criterion (2), identification power, is defined as P(a subgroup is identified from the training dataset and the outcome is significantly different for the estimated signature positive versus negative subgroups | ω(·) ≠ 0). We note that SIDES cannot handle the prognostic case and is thus not included in the comparison.
Table II shows the simulation results for the false positive rate and the power of identifying a significant prognostic signature (probability of asserting two subgroups with a significant difference) based on 500 simulation runs across different scenarios. For the null scenarios where no true subgroup exists (d = 0 for the null prognostic signatures), Table II shows that the false positive rates of asserting a significant signature are well controlled for all methods. For the scenarios with true subgroups, the 'identification power' is provided. The last column of the table corresponds to the power of asserting a significant subgroup difference when the true subgroup label (oracle) is known. All methods are underpowered when the effect size and sample size are small; as the sample size increases, the identification power increases dramatically. AIM-rule has the largest power, followed by AIM and Seq-BT.
Table II.
False positive rate (prognostic effect size = 0) and identification power (prognostic effect size > 0) for prognostic cases.
| Prognostic effect size | n | d | Seq-BT | PRIM | AIM | AIM-rule | Oracle |
|---|---|---|---|---|---|---|---|
| 0.0 | 300 | 0.00 | 0.040 | 0.008 | 0.032 | 0.048 | 0.046 |
| 0.0 | 500 | 0.00 | 0.038 | 0.002 | 0.034 | 0.048 | 0.054 |
| 0.0 | 1000 | 0.00 | 0.038 | 0.010 | 0.026 | 0.044 | 0.051 |
| 0.2 | 100 | 0.34 | 0.038 | 0.010 | 0.026 | 0.032 | 0.529 |
| 0.2 | 300 | 0.34 | 0.290 | 0.146 | 0.250 | 0.280 | 0.954 |
| 0.2 | 500 | 0.34 | 0.628 | 0.534 | 0.620 | 0.654 | 0.996 |
| 0.2 | 1000 | 0.34 | 0.964 | 0.972 | 0.988 | 0.994 | 0.999 |
| 0.5 | 100 | 0.85 | 0.684 | 0.598 | 0.760 | 0.766 | 0.999 |
| 0.5 | 300 | 0.85 | 0.990 | 0.998 | 1.000 | 1.000 | 1.000 |
| 0.5 | 500 | 0.85 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.5 | 1000 | 0.85 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing.
Figure 5 shows box plots of selection accuracy, as defined earlier, for the prognostic case. The selection accuracy increases as the sample size and prognostic effect size increase. Among all methods, AIM-rule has the best accuracy in correctly labeling subjects, AIM and Seq-BT have comparable performance, and PRIM has the least accuracy.
Figure 5.

Simulation result of selection accuracy for prognostic signatures. AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing.
In conclusion, our proposed methods demonstrate good performance in identifying subgroups of interest across the various scenarios considered in our simulation design. As remarked in the methods section, when prognostic and predictive signatures share common predictors, the current 0–1 parametrization of the treatment coding t is preferred; when there is little or no common component between the prognostic and predictive signatures, the new parametrization t−πt is preferred. An additional simulation study (not reported) confirmed this observation.
4. Clinical trial example
In this section, we apply our proposed methods to a clinical trial example for developing predictive signatures. This simulated dataset is based on a phase III clinical trial [12,13] downloaded from the SIDES software website (http://multxpert.com). The trial compared a novel treatment with the standard of care (control) in patients with severe sepsis. The clinical database contains 470 patients randomized to the treatment arm (n = 317) or control arm (n = 153). The outcome of interest is a binary endpoint indicating subjects' death after 28 days of treatment. Available markers include demographic and clinical covariates, namely age; time from first sepsis-related organ failure to the start of drug administration (hours); baseline sepsis-related organ failure assessment score; number of baseline organ failures; pre-infusion Acute Physiology and Chronic Health Evaluation II (APACHE II) score; baseline Glasgow coma scale score; and baseline activity of daily living score; as well as laboratory markers such as baseline local platelets, creatinine, serum IL-6 concentration, and local bilirubin. The trial's overall outcome was not statistically significant (one-tailed p-value = 0.08), with mortality rates of 40.7% and 34% in the treatment and control arms, respectively. However, a true subgroup, 'pre-infusion APACHE II score ≤ 26 and age ≤ 49.8', was simulated into the data and is known to have a significant treatment effect. For this analysis, we randomly split the dataset into equal-sized training and validation (test) sets. On the training set, we applied the proposed algorithms to derive predictive signatures and applied the CV procedure to estimate the predictive significance of each signature. The final signature was selected on the basis of this predictive significance estimate from 100 iterations of fivefold cross-validation, and its performance was then validated on the hold-out validation sample.
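For a binary endpoint such as 28-day mortality, checking a candidate subgroup rule amounts to testing the treatment effect within the rule-defined subgroup. The sketch below applies the simulated true subgroup rule and uses a one-sided Fisher exact test; the array names are illustrative placeholders, not the trial's actual field names, and `survived` is assumed to be a boolean array.

```python
import numpy as np
from scipy.stats import fisher_exact

def subgroup_effect_pvalue(apache, age, trt, survived):
    """One-sided Fisher exact test of treatment vs control survival within the
    subgroup 'APACHE II <= 26 and age <= 49.8' (illustrative sketch)."""
    sig = (apache <= 26) & (age <= 49.8)
    t = sig & (trt == 1)          # signature-positive, treatment arm
    c = sig & (trt == 0)          # signature-positive, control arm
    table = [
        [int(survived[t].sum()), int((~survived[t]).sum())],
        [int(survived[c].sum()), int((~survived[c]).sum())],
    ]
    return fisher_exact(table, alternative="greater")[1]
```

The same construction, applied to each derived signature, produces the within-subgroup treatment-difference p-values reported below.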
Among the algorithms, AIM-rule, Seq-BT, and PRIM returned a simple rule of 'pre-infusion APACHE II score ≤ 27' for defining the signature-positive subgroup. AIM defined the signature-positive subgroup as patients who meet at least two of three thresholds: (i) pre-infusion APACHE II score < 27; (ii) age < 54; (iii) local bilirubin > 0.8. The signature rule for the positive subgroup derived by SIDES is 'creatinine ≤ 1.1 and baseline Glasgow coma scale score > 11'. Table III shows the p-values for comparing survival rates between treatment and control in the signature-positive and signature-negative groups; between the signature-positive and signature-negative groups within the treatment arm and within the control arm; and for the treatment-by-subgroup interaction, across the training set, the CV estimate (median of 100 CV iterations), and the validation set. The predictive significance from the cross-validated p-value suggested that the signature from Seq-BT has the most promising performance (CV p-value = 0.002 for the treatment-by-subgroup interaction), and this result is confirmed in the validation dataset. Table IV shows detailed statistics for each signature from the training set, the bias-adjusted cross-validation results from the training set, and the results from the validation set. To visualize and compare the performance of the derived signatures, Figure 6 shows the interaction plots for the signatures derived by each method, summarizing the survival rate and its confidence interval in each subgroup in both the training and validation datasets. The interaction plots show that the performance of Seq-BT, AIM, AIM-rule, and PRIM is consistent across the training and validation datasets, whereas the SIDES signature is less significant in the training set and becomes non-significant in the validation dataset. This observation is also consistent with the predictive significance results, where the SIDES signature appeared to suffer from overfitting.
The results from this exercise are also consistent with the simulation results in that our proposed methods are potentially more powerful when the sample size is not large and the effect size is weak to moderate.
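The AIM rule described above is a voting rule: a patient is signature-positive when at least two of the three component thresholds hold. A minimal sketch, using the thresholds reported in the text (the function itself and its argument names are illustrative):

```python
import numpy as np

def aim_vote(apache, age, bili, k=2):
    """AIM-style index rule from the sepsis example: signature-positive when
    at least k of the three component thresholds are satisfied."""
    votes = (
        (apache < 27).astype(int)   # pre-infusion APACHE II score < 27
        + (age < 54).astype(int)    # age < 54
        + (bili > 0.8).astype(int)  # local bilirubin > 0.8
    )
    return votes >= k
```

Such additive index rules remain interpretable while allowing a patient to miss one component threshold and still be classified as signature-positive.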
Table III.
Subgroup performance: one-tailed p-values for sepsis trial example.
| Dataset | Method | Treatment difference (Sig+) | Treatment difference (Sig−) | Group difference (Treatment) | Group difference (Control) | Interaction |
|---|---|---|---|---|---|---|
| Training dataset | Seq-BT | 0.126 | 0.000 | 0.000 | 0.692 | 0.000 |
| | PRIM | 0.126 | 0.000 | 0.000 | 0.692 | 0.000 |
| | AIM | 0.019 | 0.001 | 0.000 | 0.462 | 0.000 |
| | AIM-rule | 0.126 | 0.000 | 0.000 | 0.692 | 0.000 |
| | SIDES | 0.018 | 0.041 | 0.002 | 0.304 | 0.003 |
| Cross-validation (median of 100 iterations) | Seq-BT | 0.105 | 0.005 | 0.000 | 0.816 | 0.002 |
| | PRIM | 0.597 | 0.097 | 0.000 | 0.472 | 0.137 |
| | AIM | 0.267 | 0.007 | 0.000 | 0.560 | 0.006 |
| | AIM-rule | 0.495 | 0.011 | 0.000 | 0.677 | 0.018 |
| | SIDES | 0.313 | 0.209 | 0.003 | 0.059 | 0.641 |
| Validation dataset | Seq-BT | 0.173 | 0.001 | 0.000 | 0.957 | 0.000 |
| | PRIM | 0.173 | 0.001 | 0.000 | 0.957 | 0.000 |
| | AIM | 0.952 | 0.050 | 0.000 | 0.150 | 0.164 |
| | AIM-rule | 0.173 | 0.001 | 0.000 | 0.957 | 0.000 |
| | SIDES | 0.885 | 0.299 | 0.871 | 0.599 | 0.742 |
AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing; SIDES, Subgroup Identification based on Differential Effect Search.
Table IV.
Subgroup statistics for sepsis trial example.
| Method | Subgroup | Training n | Training survival rate | CV n (median of 100 iterations) | CV survival rate | Testing n | Testing survival rate |
|---|---|---|---|---|---|---|---|
| Seq-BT | Sig+.trt | 106 | 0.764 | 84 | 0.786 | 106 | 0.764 |
| | Sig+.ctrl | 51 | 0.647 | 43 | 0.651 | 50 | 0.66 |
| | Sig−.trt | 52 | 0.25 | 74 | 0.378 | 53 | 0.245 |
| | Sig−.ctrl | 26 | 0.692 | 34 | 0.676 | 26 | 0.654 |
| PRIM | Sig+.trt | 106 | 0.764 | 72 | 0.75 | 106 | 0.764 |
| | Sig+.ctrl | 51 | 0.647 | 37 | 0.702 | 50 | 0.66 |
| | Sig−.trt | 52 | 0.25 | 86 | 0.465 | 53 | 0.245 |
| | Sig−.ctrl | 26 | 0.692 | 40 | 0.625 | 26 | 0.654 |
| AIM | Sig+.trt | 86 | 0.872 | 94 | 0.777 | 93 | 0.742 |
| | Sig+.ctrl | 43 | 0.698 | 45 | 0.689 | 38 | 0.737 |
| | Sig−.trt | 72 | 0.264 | 64 | 0.328 | 66 | 0.379 |
| | Sig−.ctrl | 34 | 0.618 | 32 | 0.625 | 38 | 0.579 |
| AIM-rule | Sig+.trt | 106 | 0.764 | 99 | 0.737 | 106 | 0.764 |
| | Sig+.ctrl | 51 | 0.647 | 44 | 0.682 | 50 | 0.66 |
| | Sig−.trt | 52 | 0.25 | 59 | 0.356 | 53 | 0.245 |
| | Sig−.ctrl | 26 | 0.692 | 33 | 0.636 | 26 | 0.654 |
| SIDES | Sig+.trt | 34 | 0.824 | 57 | 0.754 | 31 | 0.581 |
| | Sig+.ctrl | 26 | 0.538 | 16 | 0.875 | 18 | 0.611 |
| | Sig−.trt | 124 | 0.532 | 101 | 0.505 | 128 | 0.594 |
| | Sig−.ctrl | 51 | 0.725 | 61 | 0.607 | 58 | 0.672 |
AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing; SIDES, Subgroup Identification based on Differential Effect Search.
Figure 6.

Interaction plots for predictive signatures from the sepsis data (half training/half testing). AIM, adaptive index model; PRIM, patient rule induction method; Seq-BT, sequential BATTing; SIDES, Subgroup Identification based on Differential Effect Search.
5. Discussion
In this paper, we addressed the topic of retrospective exploratory subgroup analysis. We proposed subgroup identification methods for developing threshold-based multivariate biomarker signatures via resampled tree-based algorithms and Monte Carlo variations of the adaptive indexing method. Variable selection is automatically built into these algorithms.
For high-dimensional data, pre-filtering techniques can be adopted before applying the proposed methods to reduce computational time. For this purpose, the L1-penalized (Lasso) method is an immediate solution for the prognostic case, and the modified covariate method proposed by Tian et al. [33] can be used for the predictive case.
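A minimal sketch of such Lasso pre-filtering for the prognostic case is shown below, using scikit-learn's cross-validated Lasso; the cap `max_keep` and the ranking by absolute coefficient are illustrative choices, not part of the paper's method.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_prefilter(X, y, max_keep=20):
    """Pre-filter candidate predictors for the prognostic case: keep the
    predictors with nonzero coefficients from a cross-validated L1-penalized
    fit, ranked by |coefficient| and capped at max_keep (a sketch)."""
    fit = LassoCV(cv=5, random_state=0).fit(X, y)
    nonzero = np.flatnonzero(fit.coef_)
    ranked = nonzero[np.argsort(-np.abs(fit.coef_[nonzero]))]
    return ranked[:max_keep].tolist()
```

The surviving column indices would then be passed to the subgroup identification algorithms in place of the full predictor set.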
Our subgroup identification algorithms can handle a variety of endpoints (binary, continuous, and time to event) and are unified under the regression framework (linear regression, logistic regression, proportional hazards regression, etc.). In addition to continuous and ordinal covariates, all algorithms can also handle nominal covariates by incorporating them in the signature as I(X ∈ L), where L is a nontrivial union of levels of X.
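The indicator I(X ∈ L) for a nominal covariate is straightforward to encode; a one-line sketch (function and argument names are illustrative):

```python
import numpy as np

def nominal_indicator(x, levels):
    """Signature component I(X in L) for a nominal covariate, where `levels`
    is a nontrivial union of levels of X, e.g. {'A', 'C'}."""
    return np.isin(x, list(levels))
```

This turns a categorical covariate into a binary component that the threshold-based algorithms can treat like any other signature term.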
We may consider other stopping criteria for Seq-BT; for example, using a constant threshold Cp for the proportional improvement in the p-value of β (i.e., the algorithm stops if the p-value of β after adding a marker is not smaller than Cp times the p-value of β without that marker). The threshold Cp could be manually assigned or derived as a tuning parameter from CV.
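This alternative stopping rule reduces to a one-line check; a sketch (the function name is illustrative, and the default Cp value is arbitrary):

```python
def seqbt_should_stop(p_before, p_after, cp=0.5):
    """Alternative Seq-BT stopping rule: stop adding the candidate marker
    unless the p-value of beta improves by at least the factor cp,
    i.e. continue only if p_after < cp * p_before."""
    return not (p_after < cp * p_before)
```

For example, with cp = 0.5, a drop from p = 0.04 to p = 0.03 is not enough to continue, whereas a drop to p = 0.01 is.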
For each method, an internal CV procedure is recommended for evaluating the predictive significance and performance of the signatures [9]. It is of note that the cross-validation-based signature performance estimate would be of more practical use and more likely to be replicated in a future prospective trial enriching the patient population based on such a signature. This procedure is similar to the pre-validation scheme [29], which attempts to quantify the significance of learned predictors and facilitate a fair comparison between a learned predictor and predefined covariates. It is worth noting that in this CV procedure, the predicted subgroup labels across folds may not be truly independent, which may affect the estimation of the p-value between the signature-positive and signature-negative groups. We plan to investigate the effect of such dependency on the predictive significance and compare it against permutation-based methods and other types of CV [31].
In the context of retrospective exploratory analysis, we recommend that a variety of methods be considered for patient subgroup identification. As an independent validation/test dataset is often not readily available at this stage, the performance of the signatures derived from different algorithms should be evaluated and reported with careful application of the CV approach described in this paper or related methods. Once an independent validation/test dataset becomes available, the performance of these signatures can be evaluated further. Further research is needed to improve the proposed methods.
We have uploaded the R package 'SubgrpID' to CRAN for implementing the proposed algorithms, along with the assessment of predictive significance, associated effect sizes, and related summary measures via multiple iterations of CV.
Acknowledgements
We thank Robert J. Tibshirani for his valuable input during this research.
Funding and declaration of conflict of interest
The design, study conduct, analysis, and financial support of the analysis were provided by AbbVie. AbbVie participated in the interpretation of data, writing, review, and approval of the content. Xin Huang, Yan Sun, Saptarshi Chatterjee, and Viswanath Devanarayan are employees of AbbVie, Inc. Paul Trow is a former employee of AbbVie, Inc. Arunava Chakravartty is an employee of Novartis Oncology, India. Lu Tian is an employee of Stanford University.
References
- 1.Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edn). Springer, 2009. [Google Scholar]
- 2.Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees (1st edn). Chapman and Hall/CRC, 1984. [Google Scholar]
- 3.Friedman JH. Multivariate adaptive regression splines. The Annals of Statistics 1991; 19:1–67. [Google Scholar]
- 4.Friedman JH, Popescu BE. Predictive learning via rule ensembles. Ann. Appl. Stat 2008; 2:916–954. [Google Scholar]
- 5.Liu X, Minin V, Huang Y, Seligson DB, Horvath S. Statistical methods for analyzing tissue microarray data. Journal of Biopharmaceutical Statistics 2004; 14:671–685. [DOI] [PubMed] [Google Scholar]
- 6.Nannings B, Abu-Hanna A, de Jonge E. Applying PRIM (patient rule induction method) and logistic regression for selecting high-risk subgroups in very elderly ICU patients. International Journal of Medical Informatics 2008; 77:272–279. [DOI] [PubMed] [Google Scholar]
- 7.Nannings B, Bosman R-J, Abu-Hanna A. A subgroup discovery approach for scrutinizing blood glucose management guidelines by the identification of hyperglycemia determinants in ICU patients. Methods of Information in Medicine 2008; 47:480–488. [DOI] [PubMed] [Google Scholar]
- 8.Ramalingam SS, Shtivelband M, Soo RA, Barrios CH, Makhson A, Segalla JGM, Pittman KB, Kolman P, Pereira JR, Srkalovic G, Belani CP, Axelrod R, Owonikoko TK, Qin Q, Qian J, McKeegan EM, Devanarayan V, McKee MD, Ricker JL, Carlson DM, Gorbunova VA. Randomized phase II study of carboplatin and paclitaxel with either linifanib or placebo for advanced nonsquamous non-small-cell lung cancer. Journal of Clinical Oncology 2015; 33(5):433–441. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Chen G, Zhong H, Belousov A, Devanarayan V. A PRIM approach to predictive-signature development for patient stratification. Statistics in Medicine 2015; 34:317–342. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Su X, Tsai C-L, Wang H, Nickerson DM, Li B. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research 2009; 10:141–158. [Google Scholar]
- 11.Su X, Zhou T, Yan X, Fan J, Yang S. Interaction trees with censored survival data. The International Journal of Biostatistics 2008; 4(1):2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lipkovich I, Dmitrienko A. Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. Journal of Biopharmaceutical Statistics 2014; 24:130–153. [DOI] [PubMed] [Google Scholar]
- 13.Lipkovich I, Dmitrienko A, Denne J, Enas G. Subgroup identification based on differential effect search – a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine 2011; 30:2601–2621. [DOI] [PubMed] [Google Scholar]
- 14.Berger JO, Wang X, Shen L. A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 2014; 24:110–129. [DOI] [PubMed] [Google Scholar]
- 15.Paez JG, Jänne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science (New York, N.Y.) 2004; 304:1497–1500. [DOI] [PubMed] [Google Scholar]
- 16.Pao W, Miller V, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L, Mardis E, Kupfer D, Wilson R, Kris M, Varmus H. EGF receptor gene mutations are common in lung cancers from “never smokers” and are associated with sensitivity of tumors to gefitinib and erlotinib. Proceedings of the National Academy of Sciences of the United States of America 2004; 101:13306–13311. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Cobleigh MA, Vogel CL, Tripathy D, Robert NJ, Scholl S, Fehrenbacher L, Wolter JM, Paton V, Shak S, Lieberman G, Slamon DJ. Multinational study of the efficacy and safety of humanized anti-HER2 monoclonal antibody in women who have HER2-overexpressing metastatic breast cancer that has progressed after chemotherapy for metastatic disease. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology 1999; 17:2639–2648. [DOI] [PubMed] [Google Scholar]
- 18.Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, Fleming T, Eiermann W, Wolter J, Pegram M, Baselga J, Norton L. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. The New England Journal of Medicine 2001; 344:783–792. [DOI] [PubMed] [Google Scholar]
- 19.Amado RG, Wolf M, Peeters M, Van Cutsem E, Siena S, Freeman DJ, Juan T, Sikorski R, Suggs S, Radinsky R, Patterson SD, Chang DD. Wild-type KRAS is required for panitumumab efficacy in patients with metastatic colorectal cancer. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology 2008; 26:1626–1634. [DOI] [PubMed] [Google Scholar]
- 20.Bokemeyer C, Bondarenko I, Hartmann JT, Fd B, Schuch G, Zubel A, Celik I, Schlichting M, Koralewski P. Efficacy according to biomarker status of cetuximab plus FOLFOX-4 as first-line treatment for metastatic colorectal cancer: the OPUS study. Annals of Oncology 2011; 22:1535–1546. [DOI] [PubMed] [Google Scholar]
- 21.Lièvre A, Bachet J-B, Le Corre D, Boige V, Landi B, Emile J-F, Côté J-F, Tomasic G, Penna C, Ducreux M, Rougier P, Penault-Llorca F, Laurent-Puig P. KRAS mutation status is predictive of response to cetuximab therapy in colorectal cancer. Cancer Research 2006; 66:3992–3995. [DOI] [PubMed] [Google Scholar]
- 22.Van Cutsem E, Lang I, D’haens G, Moiseyenko V, Zaluski J, Folprecht G, Tejpar S, Kisker O, Stroh C, Rougier P. KRAS status and efficacy in the first-line treatment of patients with metastatic colorectal cancer (mCRC) treated with FOLFIRI with or without cetuximab: the CRYSTAL experience. ASCO Meeting Abstracts 2008; 26:2. [Google Scholar]
- 23.Forde PM, Rudin CM. Crizotinib in the treatment of non-small-cell lung cancer. Expert Opinion on Pharmacotherapy 2012; 13:1195–1201. [DOI] [PubMed] [Google Scholar]
- 24.Shaw AT, Kim D-W, Mehra R, Tan DSW, Felip E, Chow LQM, Camidge DR, Vansteenkiste J, Sharma S, De Pas T, Riely GJ, Solomon BJ, Wolf J, Thomas M, Schuler M, Liu G, Santoro A, Lau YY, Goldwasser M, Boral AL, Engelman JA. Ceritinib in ALK-rearranged non–small-cell lung cancer. New England Journal of Medicine 2014; 370:1189–1197. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Tian L, Tibshirani R. Adaptive index models for marker-based risk stratification. Biostatistics 2011; 12:68–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.McKeegan EM, Ansell PJ, Davis G, Chan S, Chandran RK, Gawel SH, Dowell BL, Bhathena A, Chakravartty A, McKee MD, Ricker JL, Carlson DM, Ramalingam SS, Devanarayan V. Plasma biomarker signature associated with improved survival in advanced non-small cell lung cancer patients on linifanib. Lung Cancer 2015; 90:296–301. [DOI] [PubMed] [Google Scholar]
- 27.Breiman L. Bagging predictors. Machine Learning 1996; 24:123–140. [Google Scholar]
- 28.Tian L, Zhao L, Wei LJ. Predicting the restricted mean event time with the subject’s baseline covariates in survival analysis. Biostatistics (Oxford, England) 2014; 15:222–233. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Tibshirani R, Efron B. Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology 2002; 1(1):1–8. [DOI] [PubMed] [Google Scholar]
- 30.Simon RM. Genomic Clinical Trials and Predictive Medicine (1st edn). Cambridge University Press, 2013. [Google Scholar]
- 31.Simon RM, Subramanian J, Li M-C, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Briefings in Bioinformatics 2011; 12:203–214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Li M, Yu T, Hu Y-F. The impact of companion diagnostic device measurement performance on clinical validation of personalized medicine. Statistics in Medicine 2015; 34:2222–2234. [DOI] [PubMed] [Google Scholar]
- 33.Tian L, Alizadeh AA, Gentles AJ, Tibshirani R. A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 2014; 109:1517–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]
