Summary
A potential avenue to improve healthcare efficiency is to effectively tailor individualized treatment strategies by incorporating patient-level predictor information such as environmental exposure, biological, and genetic marker measurements. Many useful statistical methods for deriving individualized treatment rules (ITR) have become available in recent years. Prior to adopting any ITR in clinical practice, it is crucial to evaluate its value in improving patient outcomes. Existing methods for quantifying such values mainly consider either a single marker or semi-parametric methods that are subject to bias under model misspecification. In this paper, we consider a general setting with multiple markers and propose a two-step robust method to derive ITRs and evaluate their values. We also propose procedures for comparing different ITRs, which can be used to quantify the incremental value of new markers in improving treatment selection. While working models are used in step I to approximate optimal ITRs, we add a layer of calibration to guard against model misspecification and further assess the value of the ITR non-parametrically, which ensures the validity of the inference. To account for the sampling variability of the estimated rules and their corresponding values, we propose a resampling procedure to provide valid confidence intervals for the value functions as well as for the incremental value of new markers for treatment selection. Our proposals are examined through extensive simulation studies and illustrated with data from a clinical trial that studies the effects of two drug combinations on HIV-1 infected patients.
Keywords: Biomarker-analysis Design, Counterfactual Outcome, Personalized Medicine, Perturbation-resampling, Predictive Biomarkers, Subgroup Analysis
1. Introduction
The standard analysis of a randomized clinical trial evaluates the treatment effect based on the overall treatment difference in the entire study population. However, such an overall assessment may not be adequate when a new treatment benefits patients differentially depending on their characteristics. A treatment deemed effective on average is not guaranteed to benefit all future patients. Conversely, a negative finding on the average treatment effect does not imply that the new treatment is entirely futile, since the effectiveness of a treatment on a small subgroup of patients may be hidden by its inactivity in a large (heterogeneous) patient population (Rothwell, 1995; Rothwell et al., 2005; Kent and Hayward, 2007). When the treatment effect varies across subpopulations, it is desirable to develop individualized treatment rules (ITR) based on individual patients' baseline characteristics. Assigning treatments to achieve optimal patient outcomes may substantially improve healthcare efficiency (Baker et al., 2012).
Statistical methods for developing optimal ITRs have received much attention in recent years. Traditional methods based on ad hoc subgroup analyses or searches for marker-treatment interactions, while useful, may not be efficient or valid due to the curse of dimensionality and multiple comparisons. More systematic approaches to deriving ITRs have recently been proposed. With a single baseline marker, semi- and non-parametric procedures have been proposed to identify a subgroup of patients who would benefit from the new treatment (e.g., Song and Pepe, 2004; Bonetti and Gelber, 2000, 2004). With multiple baseline markers, a wide range of procedures have been proposed to derive ITRs that combine information across all markers (e.g., Qian and Murphy, 2011; Imai and Strauss, 2011; Foster et al., 2011; Cai et al., 2011; Zhao et al., 2012; Zhang et al., 2012a; Zhao et al., 2013).
As strategies for deriving ITRs become increasingly available, it is important to examine the net benefit of assigning treatment according to an ITR prior to recommending its widespread use. Most current research focuses on developing ITRs, with relatively little attention given to making robust inference about such estimated ITRs and their value in improving population outcomes. Although a few methods have been proposed to quantify such values, these methods consider either a single marker or semi-parametric methods that are subject to bias under model misspecification (e.g., Song and Pepe, 2004; Song and Zhou, 2009; Janes et al., 2011; Huang et al., 2012). Zhang et al. (2012a) propose a robust approach to overcome model misspecification by restricting the ITR to a parametric class and estimating the ITR parameters by maximizing an empirical value function associated with the ITR. However, direct maximization of the non-smooth empirical value function can suffer from substantial variability in the estimated ITR parameters. As we show in Section 3.2 and Web Appendix B, even for a univariate X with ITR given by I(X ≥ c), direct maximization gives an estimate of c with a slow, cube-root (n^{1/3}) convergence rate. When there are multiple markers, direct maximization of an empirical value function with respect to all unknown parameters involved in the ITR, such as that proposed in Zhang et al. (2012b), could be computationally prohibitive and unstable. Here, we consider a general setting with multiple markers and adopt a two-step method to derive a class of ITRs and make inference about their values. We also propose procedures for comparing different ITRs, which can be used to quantify the incremental value (IncV) of new markers in improving treatment selection. Such IncV assessment is particularly important if a marker used in the ITR is expensive and/or invasive.
The remainder of this paper is organized as follows. In Section 2, we describe the general framework for quantifying the value of ITRs and for deriving ITRs that attain maximal values. We also provide some simple results demonstrating that a two-step procedure can lead to an ITR that is optimal (i) among all ITRs based on a set of predictors X when the fitted models in the first step are nearly correct; and (ii) within a smaller class of ITRs when the models are misspecified. In Section 3, we provide the estimation and inference procedures for the proposed two-step ITR as well as its value function. In Section 4, we evaluate the finite-sample performance of our proposed methods through a series of simulation studies. In Section 5, we apply the proposed method to a dataset from a clinical trial (ACTG 320) conducted by the AIDS Clinical Trials Group as a further illustration. In Section 6, we provide some concluding remarks and further discussion.
2. Quantifying the Value of ITR and Optimizing ITR
2.1 Notations and Settings
Let Y be the response variable and Y(j) denote the potential outcome (Rubin, 1974; Rubin, 1978; Holland, 1986; Robins, 1986) of a patient if assigned to treatment G = j, where j = 1 refers to the experimental treatment and j = 0 to the standard treatment. The potential outcome (also referred to as the counterfactual outcome) Y(j) is defined as the value of the outcome had the treatment been set to G = j by external intervention. Y and the Y(j) are related via the consistency assumption Y = GY(1) + (1 − G)Y(0). We assume the standard stable unit treatment value assumption (SUTVA), i.e., each subject's potential response to a treatment does not depend on the treatment assignment mechanism, the treatments received by other patients, or their potential responses to treatments (Rubin, 1980, 1986). Without loss of generality, we assume that a larger value of Y is more beneficial.
Let 𝒟(X) denote a binary ITR as a function of a baseline covariate vector X, with 𝒟(X) = j indicating assignment to treatment j. Our goal is to identify an optimal 𝒟(·) that maximizes patients' outcomes. When the treatment selection is optimized for all patients, the resulting population average outcome is also optimized. Thus, an optimal ITR is also expected to maximize a population average value function. A sensible choice of the value function is the cost-adjusted population average outcome associated with 𝒟:

$$V(\mathcal{D}) = E\big[Y^{(1)}\mathcal{D}(\mathbf{X}) + Y^{(0)}\{1 - \mathcal{D}(\mathbf{X})\}\big] - \xi\, E\{\mathcal{D}(\mathbf{X})\},$$

where ξ is a pre-specified incremental financial and/or medical cost associated with taking the new treatment as compared to the standard treatment. It is not difficult to see that the optimal 𝒟 maximizing V(𝒟) is the Bayes rule

$$\mathcal{D}^{\circ}(\mathbf{X}) = I\{D(\mathbf{X}) \ge \xi\}, \tag{2.1}$$

where I(·) is the indicator function and D(X) = E(Y(1) ∣ X) − E(Y(0) ∣ X). With a given dataset, the optimal ITR can be approximated by estimating the conditional treatment effect function D(X) or by direct optimization of V(𝒟) within a class of rules 𝒟.
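To see why (2.1) is optimal, note that by iterated expectations the value function separates pointwise in X:

$$V(\mathcal{D}) = E\{Y^{(0)}\} + E\big[\{D(\mathbf{X}) - \xi\}\,\mathcal{D}(\mathbf{X})\big],$$

so for each x the integrand is maximized by setting 𝒟(x) = 1 exactly when D(x) ≥ ξ, which is the rule in (2.1).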
Throughout, we use the notation D(β, X) to denote a model-based approximation to the true conditional mean difference D(X) with model parameter value β; we let β̂ denote its estimate from the data and β̄ the corresponding limiting value. In addition, we let Δ̄(s) = E{Y(1) − Y(0) ∣ D(β̄, X) = s} denote the calibrated treatment difference given the score value s, with Δ̂(·) its estimate. We next describe the pros and cons of various approaches.
2.2 Various Approaches to Approximating the Optimal ITR 𝒟°
When X = X is univariate, we can approximate 𝒟° by estimating D(X) via kernel smoothing. Note that if D(X) is an increasing function of X, the optimal rule 𝒟°(X) must take the form I(X ≥ c°), where c° = argmax_c V{I(X ≥ c)} and D(c°) = ξ. Evaluating the utility of a single marker based on such threshold rules has been considered previously in Song and Pepe (2004) and Song and Zhou (2009). However, when D(X) is not monotone in X, the optimal ITR may not take the form I(X ≥ c), and V{I(X ≥ c)} < V(𝒟°) for any c if there exist x1 > x2 such that D(x1) = D(x2) = ξ and P(x1 > X > x2) > 0.
With multivariate X, using fully non-parametric methods and incorporating non-linear functional spaces to approximate 𝒟° (Foster et al., 2011; Zhao et al., 2012) could be extremely valuable, especially when D(X) takes a complex form. However, these methods are subject to the curse of dimensionality and pose challenges for making inference about the resulting ITR and its associated value function. On the other hand, if D(X) is estimated by imposing parametric or semi-parametric models on E(Y(j) ∣ X), the plug-in estimate of 𝒟° may lead to a much lower population average outcome than that of the true 𝒟° (Qian and Murphy, 2011). One may reduce model misspecification by including non-linear bases and selecting important variables via regularized estimation (Qian and Murphy, 2011; Imai and Strauss, 2011). However, it remains challenging to efficiently choose non-linear basis functions to achieve an optimal bias–variance trade-off.
We seek to overcome model misspecification by following a two-step principle previously proposed in Cai et al. (2011) and Zhao et al. (2013): (I) obtain a parametric or semi-parametric model-based estimate of D(X), denoted by D(β̂, X), where β̂ contains the estimated model parameters; and (II) non-parametrically estimate treatment effect parameters to account for possible model misspecification in step I. Both of these existing methods, while related to using working models for treatment selection, do not address the question of how to derive an optimal ITR or how to make inference about its associated value, which is the focus of this paper. Cai et al. (2011) use D(β̂, X) as a univariate score to create subgroups and provide inference procedures for E{Y(1) − Y(0) ∣ X ∈ Ωs}, the average treatment difference over a subgroup Ωs. Zhao et al. (2013) propose a non-parametric estimator of the average treatment difference among patients whose scores exceed c, for a range of c, in step II, and hence relate only to ITRs of the form I{D(β̂, X) ≥ c}. Although the methods given in Zhao et al. (2013) can be used to compare the potential of two scores in guiding treatment selection, their measures do not have a clear clinical interpretation and cannot be used to quantify the performance of a single score.
In this paper, we propose a two-step approach to construct a calibrated ITR,

$$\bar{\mathcal{D}}(\mathbf{X}) = I\big[\bar{\Delta}\{D(\bar{\beta}, \mathbf{X})\} \ge \xi\big],$$

and to evaluate its value V(𝒟̄). It follows from (2.1) that 𝒟̄ is the optimal ITR based on the univariate score S̄ = D(β̄, X). We next detail some pros and cons of using 𝒟̄ for treatment selection under various conditions.
When the working models in the first step are nearly correct, in the sense that D(X) is an increasing function of S̄ = D(β̄, X), then Δ̄(S̄) = D(X) and 𝒟̄ = 𝒟°. Hence, the two-step procedure leads to the optimal ITR with V(𝒟̄) = V(𝒟°).

Under more severe model misspecification with 𝒟̄ ≠ 𝒟°, 𝒟̄ will be sub-optimal relative to 𝒟° with V(𝒟̄) < V(𝒟°). However, when 𝒟̄ and 𝒟° are replaced by their respective estimates obtained in finite samples, applying the estimated 𝒟° to a future population may not yield a value function higher than that of the estimated 𝒟̄ if the sampling variability of the former is much larger than that of the latter.

When the model misspecification is not severe, in the sense that Δ̄(s) is increasing in s, 𝒟̄ takes the threshold form I(S̄ ≥ c°), where c° = argmax_c V{I(S̄ ≥ c)} satisfies Δ̄(c°) = ξ.

When Δ̄(·) is not monotone and there exist d1 > d2 such that Δ̄(d1) = Δ̄(d2) = ξ and P(d1 > S̄ > d2) > 0, we have V(𝒟̄) > max_c V{I(S̄ ≥ c)}. In other words, under certain model misspecifications, such as missing a quadratic effect, assigning treatment according to I(S̄ ≥ c) as in Zhao et al. (2013) will be sub-optimal compared to 𝒟̄.
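As a concrete (hypothetical) instance of this last case with ξ = 0, suppose the working model omits a quadratic effect, so that

$$D(X) = 1 - X^2, \qquad \bar{S} = \bar{\beta}_0 + \bar{\beta}_1 X \ (\bar{\beta}_1 > 0), \qquad \bar{\Delta}(s) = 1 - \{(s - \bar{\beta}_0)/\bar{\beta}_1\}^2.$$

Here Δ̄ is non-monotone, 𝒟̄ treats exactly the patients with |X| ≤ 1, and any threshold rule I(S̄ ≥ c) must either treat some patients with X > 1 or X < −1 (for whom D(X) < 0) or withhold treatment from some patients with |X| ≤ 1; its value is therefore strictly lower whenever X has positive mass in those regions.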
These findings suggest the benefit of approximating 𝒟° with 𝒟̄ and motivate us to construct a robust estimate of the ITR as follows (a code sketch is given after the list):

(i) posit working models to approximate D(X) via a model-based score Ŝ = D(β̂, X);

(ii) non-parametrically estimate Δ̄(·) by Δ̂(·) using the observed responses along with the fitted scores Ŝi;

(iii) for a future patient with X = x, determine the treatment assignment using 𝒟̂(x) = I[Δ̂{D(β̂, x)} ≥ ξ], which is an estimator of 𝒟̄(x);

(iv) evaluate the performance of 𝒟̂ based on the observed data by estimating V(𝒟̄).
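To make steps (i)–(iv) concrete, below is a minimal R sketch of the two-step construction for a continuous outcome with linear working models. The data frame `dat`, its column names, and the Gaussian-kernel bandwidth rule are illustrative assumptions, not the paper's exact implementation.

```r
## Minimal sketch of the two-step calibrated ITR (continuous Y, linear
## working models). `dat` holds the outcome Y, arm indicator G (0/1) and
## covariates named in `xvars`; all names and the bandwidth are illustrative.
two_step_itr <- function(dat, xvars, xi = 0, h = NULL) {
  f <- reformulate(xvars, response = "Y")
  fit1 <- lm(f, data = dat[dat$G == 1, ])   # step (i): arm-specific working models
  fit0 <- lm(f, data = dat[dat$G == 0, ])
  S <- predict(fit1, newdata = dat) - predict(fit0, newdata = dat)  # score D(beta.hat, X)
  if (is.null(h)) h <- 1.06 * sd(S) * length(S)^(-1/5)  # rule-of-thumb bandwidth

  ## step (ii): kernel-calibrated treatment difference Delta.hat(s)
  Delta <- function(s) {
    k <- dnorm((S - s) / h)
    sum(k * dat$G * dat$Y) / sum(k * dat$G) -
      sum(k * (1 - dat$G) * dat$Y) / sum(k * (1 - dat$G))
  }

  ## step (iii): calibrated rule I{Delta.hat(S.hat) >= xi} for each subject
  rule <- as.numeric(vapply(S, Delta, numeric(1)) >= xi)

  ## step (iv): plug-in inverse-probability-weighted cost-adjusted value
  pi1 <- mean(dat$G)
  V <- mean(dat$G * dat$Y * rule / pi1 +
              (1 - dat$G) * dat$Y * (1 - rule) / (1 - pi1) - xi * rule)
  list(fit1 = fit1, fit0 = fit0, score = S, Delta = Delta,
       rule = rule, value = V)
}
```

For a future patient with covariates x, the assignment is then I[Δ̂{D(β̂, x)} ≥ ξ], computed from the returned fits and `Delta()`.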
We next detail our proposed estimators for 𝒟̄ and its associated value V(𝒟̄), along with their asymptotic properties.
3. Estimation and Inference Procedures for 𝒟̄ and V(𝒟̄)
We assume that the data available for analysis are from a randomized clinical trial with study participants randomly assigned to either a standard treatment (G = 0) or an experimental treatment (G = 1). Our data consist of n random vectors {(Yi, Gi, Xi⊤)⊤, i = 1, …, n}, where we assume that {(Yi, Xi) : Gi = j} are nj independent and identically distributed random vectors, for j = 0, 1. Furthermore, we assume that the ratio n1/n converges to a constant π1 ∈ (0, 1) as n → ∞.
3.1 Estimation of Δ̄(·) and V(𝒟̄)
We first approximate D(X) by imposing parametric or semi-parametric working models E(Yi ∣ Xi = X, Gi = j) = μj(βj; X), with βj the unknown model parameter for group j and μj a known link function. Without loss of generality, we suppose that βj is estimated by β̂j, the solution to Ûj(β) = 0, where

$$\widehat{U}_j(\beta) = n_j^{-1} \sum_{i:\,G_i = j} U_j(\beta; Y_i, \mathbf{X}_i) - \lambda_{jn}\,\beta, \tag{3.1}$$

Uj(β; ·, ·) is the estimating function relating the observed data to the parameters of interest, and λjn is a tuning parameter chosen to ensure stable fitting when the dimension of β is not too small compared to the sample size. For example, one may let Uj(β; Y, X) = ψ(X){Y − μj(β; X)} under generalized linear models, where ψ(·) is a prespecified finite-dimensional vector of potentially non-linear transformation functions, which allows one to capture non-linear effects. We choose the penalty λjn such that n^{1/2}(β̂j − β̄j) still converges to a zero-mean normal random vector, to enable easy inference. Then, the model-based estimate of D(X) can be obtained as D(β̂, X) = μ1(β̂1; X) − μ0(β̂0; X), where β̂ = (β̂1⊤, β̂0⊤)⊤ is the vector of estimated model parameters.
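When ψ(·) includes many basis terms, one concrete way to realize a ridge-type penalty λjn in (3.1) is glmnet with alpha = 0; this sketch, including the quadratic basis, is an illustrative assumption rather than the paper's implementation.

```r
## Illustrative ridge-penalized step-I fit on a basis expansion psi(X).
## X is assumed to be a numeric covariate matrix.
library(glmnet)
psi <- function(X) cbind(X, X^2)                 # example basis: linear + quadratic
fit_arm <- function(X, Y, lambda) {
  glmnet(psi(X), Y, alpha = 0, lambda = lambda)  # alpha = 0 gives the ridge penalty
}
## model-based score for new covariates: D(beta.hat, x) = mu1.hat(x) - mu0.hat(x)
score_fun <- function(fit1, fit0, Xnew) {
  as.numeric(predict(fit1, newx = psi(Xnew)) - predict(fit0, newx = psi(Xnew)))
}
```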
Next, we propose to estimate Δ̄(s) and V(𝒟̄) non-parametrically by Δ̂(s) and V̂(𝒟̂), respectively, where

$$\widehat{\Delta}(s) = \frac{\sum_{i:\,G_i=1} K_h(\widehat{S}_i - s)\,Y_i}{\sum_{i:\,G_i=1} K_h(\widehat{S}_i - s)} - \frac{\sum_{i:\,G_i=0} K_h(\widehat{S}_i - s)\,Y_i}{\sum_{i:\,G_i=0} K_h(\widehat{S}_i - s)}, \qquad \widehat{S}_i = D(\hat{\beta}, \mathbf{X}_i), \tag{3.2}$$

is a non-parametric smoothed estimator of Δ̄(s), Kh(x) = K(x/h)/h, K(·) is a smooth symmetric density function, and h → 0 as n → ∞.
We show in Web Appendix A that, under mild regularity conditions, Δ̂(·) and V̂(𝒟̂) converge in probability to Δ̄(·) and V(𝒟̄), respectively. Furthermore, we show that n^{1/2}{V̂(𝒟̂) − V(𝒟̄)} converges in distribution to a zero-mean normal random variable with variance σ̄², where σ̄² is defined in (A.1) of Web Appendix A and 𝒟̂ is the estimated treatment assignment rule.

This indicates that, to first order, the estimation of β in the non-parametric calibration step does not contribute additional variability to the estimation of V(𝒟̄). Alternative choices of the calibration estimator Δ̂, such as a local likelihood estimator, are also valid provided that they converge to Δ̄(·) at a suitable rate.
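For concreteness, a natural plug-in form of the value estimator under randomization, consistent with the sketch in Section 2.2 (an assumed form for illustration, not the paper's verbatim expression), is

$$\widehat{V}(\widehat{\mathcal{D}}) = n^{-1}\sum_{i=1}^{n}\left\{\frac{G_i\,\widehat{\mathcal{D}}_i}{\hat{\pi}_1} + \frac{(1-G_i)(1-\widehat{\mathcal{D}}_i)}{1-\hat{\pi}_1}\right\}Y_i \;-\; \xi\,n^{-1}\sum_{i=1}^{n}\widehat{\mathcal{D}}_i,$$

where 𝒟̂i = I{Δ̂(Ŝi) ≥ ξ} and π̂1 = n1/n.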
3.2 Estimation of the Threshold Under Monotonicity
Assume that Δ̄(·) is monotone, that Δ̄(c) = ξ has a unique solution c°, and that the product of the derivative of Δ̄ at c° and the density of S̄ at c° is non-zero. The optimal ITR based on S̄ then takes the form I(S̄ ≥ c°), where c° = argmax_c V{I(S̄ ≥ c)} is an interior point of the support of S̄. Hence, one may also approximate c° via direct maximization as ĉ = argmax_c V̂(c) for c in the support of Ŝ, where V̂(c) denotes the plug-in value estimator given above evaluated at the rule I(Ŝ ≥ c).
We show, in Web Appendix B, that ĉ is a consistent estimator of c°. However, due to the non-smoothness of the empirical value function V̂(c), we have ĉ − c° = Op(n^{−1/3}), suggesting that direct maximization results in an estimator with a much slower convergence rate and hence large variability in ĉ. Furthermore, following Theorem 1.1 of Kim and Pollard (1990), n^{1/3}(ĉ − c°) converges in distribution to argmin_t 𝔾(t), where 𝔾(·) is a Gaussian process. However, argmin_t 𝔾(t) does not generally have an explicit form.
Since c° is also the solution to Δ̄(c) = ξ, we propose to estimate c° by c̃, the solution to Δ̂(c) = ξ. As shown in Web Appendix B, c̃ is consistent for c° and (nh)^{1/2}(c̃ − c°) converges in distribution to a normal distribution when h = O(n−v) with v ∈ [1/5, 1/2). In addition, the value estimators based on c̃ and on the calibrated rule 𝒟̂ are asymptotically equivalent. This suggests that, under the monotonicity assumption, the variability of c̃ is ignorable at the first order when making inference about the value function, and the estimator based on I(Ŝ ≥ c̃) is asymptotically equivalent to that obtained by first estimating Δ̄(·) via smoothing.
Finally, using similar arguments to those given in Web Appendix B, one may also show that the value estimator based on ĉ is consistent. However, in finite samples, the large variability in ĉ leads to V̂(ĉ) having higher variability than that of V̂(c̃). We further illustrate these points in the simulation section and in the sketch below.
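Under monotonicity, the two threshold estimators can be contrasted in a few lines of R, reusing `S`, `Delta`, `dat`, and `xi` from the sketch in Section 2.2; the root-finding interval is an illustrative choice and assumes Δ̂(c) − ξ changes sign over it.

```r
## (a) proposed smoothed estimator: solve Delta.hat(c) = xi
c_tilde <- uniroot(function(c) Delta(c) - xi,
                   interval = quantile(S, c(0.05, 0.95)))$root

## (b) direct maximization of the non-smooth empirical value function,
##     shown for comparison only (slower, cube-root-type convergence)
V_hat <- function(c) {
  rule <- as.numeric(S >= c)
  mean(dat$G * dat$Y * rule / mean(dat$G) +
         (1 - dat$G) * dat$Y * (1 - rule) / (1 - mean(dat$G)) - xi * rule)
}
c_hat <- sort(S)[which.max(vapply(sort(S), V_hat, numeric(1)))]
```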
3.3 Bias Correction and Interval Estimation
In practice, for a small or moderate sample size n, the value of the marker-guided ITR could be substantially over-estimated due to overfitting (Zhao et al., 2013). To correct for this bias, we use the standard cross-validation (CV) technique. Specifically, we first randomly split the data into K disjoint subsets of about equal size, labeled 𝒥k, k = 1, …, K. For each k, we use all the observations not in 𝒥k to obtain the estimates of βj via (3.1), and compute the corresponding score and calibrated estimate Δ̂. Then, we use the observations in 𝒥k to obtain the value estimate V̂(k). Finally, we obtain the CV estimator for V(𝒟̄) as the average V̂cv = K^{-1} Σk V̂(k).
The variance of the value estimator involves unknown density-like functions, which makes it difficult to estimate directly, especially when the number of covariates in the model is not small. To circumvent this issue, we use a perturbation-resampling technique to obtain a good approximation to the distribution of our proposed estimators. Specifically, let {Wi, i = 1, …, n} be n independent and identically distributed random variables generated from a known distribution with mean 1 and variance 1. The perturbed version of the value estimator is obtained by repeating every estimation step with each subject's contribution weighted by Wi: in particular, the perturbed β̂j* is the solution to the correspondingly weighted version of (3.1), and the perturbed Δ̂* and V̂* are computed from the weighted kernel and value estimators.
Using arguments similar to those in Park and Wei (2003) and Cai et al. (2005), we can show that the distribution of the value estimator around its limit can be approximated by the conditional distribution of the perturbed estimates around the original estimate given the data. Therefore, a 100(1 − α)% confidence interval (CI) for the value can be constructed either as V̂ ± z1−α/2 σ̂* or from the empirical percentiles of the perturbed estimates, where σ̂* is the standard error (SE) of V̂* and zq is the qth percentile of the standard normal distribution. When h = O(n−v) with v ∈ (1/5, 1/2), the resampling method can also be used to obtain CIs for c° based on c̃. However, we acknowledge that it is unclear whether the proposed resampling procedures or the standard bootstrap can be used to approximate the distribution of ĉ, due to its non-regular limiting distribution. Our numerical results suggest that the perturbation works reasonably well in finite samples.
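A sketch of the resampling step in R: every estimation stage is repeated with Exp(1) weights (mean 1, variance 1), and the SE of the value estimator is read off the perturbed replicates. Weighting the least-squares fits and the kernel sums in this way is our rendering of the perturbed estimating equations, not the paper's verbatim code.

```r
## Perturbation-resampling for the SE/CI of the value estimator.
perturb_value <- function(dat, xvars, xi = 0, B = 500) {
  n <- nrow(dat)
  f <- reformulate(xvars, response = "Y")
  replicate(B, {
    w <- rexp(n)                                  # Exp(1) weights: mean 1, variance 1
    fit1 <- lm(f, data = dat[dat$G == 1, ], weights = w[dat$G == 1])
    fit0 <- lm(f, data = dat[dat$G == 0, ], weights = w[dat$G == 0])
    S <- predict(fit1, newdata = dat) - predict(fit0, newdata = dat)
    h <- 1.06 * sd(S) * n^(-1/5)
    D <- vapply(S, function(s) {                  # weighted kernel calibration
      k <- dnorm((S - s) / h) * w
      sum(k * dat$G * dat$Y) / sum(k * dat$G) -
        sum(k * (1 - dat$G) * dat$Y) / sum(k * (1 - dat$G))
    }, numeric(1))
    rule <- as.numeric(D >= xi)
    pi1 <- weighted.mean(dat$G, w)
    mean(w * (dat$G * dat$Y * rule / pi1 +
                (1 - dat$G) * dat$Y * (1 - rule) / (1 - pi1) - xi * rule))
  })
}
## Vstar <- perturb_value(dat, xvars); CI: V.hat +/- qnorm(0.975) * sd(Vstar)
```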
3.4 Incremental Value (IncV) of Markers in Improving ITR
When a set of new markers Xnew is available to further improve the ITR, it is important to assess their IncV, especially if these markers are costly or invasive to measure. Using the population average outcome, we may quantify the IncV of the new markers by comparing the ITRs constructed with Xold and Xupdate = (Xold⊤, Xnew⊤)⊤, denoted respectively by 𝒟̄old and 𝒟̄new, with respect to their value functions, i.e., the cost-adjusted population average outcomes. Specifically, the IncV of Xnew can be quantified as IncV = V(𝒟̄new) − V(𝒟̄old). Based on the inference procedures described above, we obtain a plug-in estimate of the IncV as ÎncV = V̂(𝒟̂new) − V̂(𝒟̂old).
The CI for the IncV can be constructed via the perturbation-resampling approach as well. For each set of perturbation random variables {Wi, i = 1, …, n}, we follow the perturbation procedures described in Section 3.3 to obtain the perturbed counterparts of the two value estimates, (V̂new*, V̂old*). Then the perturbed counterpart of ÎncV can be obtained as ÎncV* = V̂new* − V̂old*. The empirical distribution of ÎncV* − ÎncV conditional on the data can be used to approximate the distribution of ÎncV − IncV and to construct the CI accordingly.
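A corresponding sketch for the IncV interval, reusing `perturb_value()` above; `xvars_old` and `xvars_new` are hypothetical covariate lists, and resetting the seed is simply a convenient way to feed identical Exp(1) weights Wi to both rules within each replicate.

```r
## Perturbed IncV: the same weights must be used for both rules per replicate.
set.seed(320); V_old_star <- perturb_value(dat, xvars_old, xi = 0)
set.seed(320); V_new_star <- perturb_value(dat, xvars_new, xi = 0)
IncV_star <- V_new_star - V_old_star
se_IncV <- sd(IncV_star)          # CI: IncV.hat +/- qnorm(0.975) * se_IncV
```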
4. Simulation Studies
To evaluate the performance of the proposed method under practical settings, we conducted extensive simulation studies to examine the finite-sample properties of the proposed point and interval estimators. In addition, we compared the performance of the proposed procedures to that of Qian and Murphy (2011) and Zhang et al. (2012a) with respect to the achievable value function. Throughout the simulation studies, we let n0 = n1 = 250 and set ξ = 0 for simplicity. All CIs were obtained using 500 perturbed samples with Wi ~ Exp(1). All results are summarized based on 1000 simulated datasets.
4.1 Evaluation of the Proposed Method
We first evaluated the proposed point and interval estimation procedures for the value functions. We generated a covariate vector Xi = (X1i, …, X5i)⊤ and, given Xi, generated two types of outcomes: (1) a continuous outcome from the linear model

$$Y_i^{(j)} = \beta_j^\top (1, X_{1i}, \ldots, X_{5i})^\top + \epsilon_i, \qquad j = 0, 1, \tag{4.1}$$

where ϵi ~ N(0, 1), β1 = (1, 0.5, 1, 1.5, 1, 1)⊤ and β0 = (6, 2.5, −3, −1, 1.5, −2)⊤; and (2) a binary outcome from the logistic regression model

$$P\{Y_i^{(j)} = 1 \mid \mathbf{X}_i\} = g\{\beta_j^\top (1, X_{1i}, \ldots, X_{5i})^\top\}, \qquad g(x) = e^x/(1 + e^x), \tag{4.2}$$

where β1 = (0.1, 0., 0.1, 0.15, 0.2, 0.1)⊤ and β0 = (0.6, 0.25, −0.2, −0.1, 0.15, −0.2)⊤.
To derive the ITRs, we fitted working models with either linear regression (for continuous Y) or logistic regression (for binary Y) based on four different sets of covariates, (M1)–(M4), where (M1) is the full model and (M2) an overfitted model. For the overfitted model (M2), we generated X6 ~ Bernoulli(0.7) and (X7, …, X20)⊤ ~ N(μ, 8 · I14×14 + 8), where μ = (3, 1, 2, −3, 3, 1, 2, 1.5, 2.5, 1, 0.5, 2, 1, 1)⊤ and I14×14 denotes the 14 × 14 identity matrix. To obtain the model-based estimates of D(X), we fitted simple linear and logistic regression models with the penalty parameter λjn set to 0, since the model sizes are reasonably small. It can be seen that Δ̄(·) is an increasing function under all four working models, and hence the calibrated rule is asymptotically equivalent to the threshold rules based on ĉ and c̃.
The results for estimating the value function via the three proposed estimators, together with a 2-fold CV procedure, are shown in Table 1. For all estimators, the biases are negligible and the estimated SEs are close to the empirical SEs. The 95% CIs have empirical coverage levels close to the nominal level. As expected, the CV procedure generally provides estimators with lower bias compared to the apparent estimates. Since two of the estimators are equivalent under monotonicity, their results are almost identical to each other.
Table 1. Simulation results for estimating the value function: true value (True), average of the estimated SEs (ASE), empirical bias (Bias), empirical SE (ESE), and empirical coverage of the 95% CIs (95%-CP), for the apparent (App) and 2-fold CV estimates. Within each model, the three rows correspond to the three value estimators compared in the text.

(a) Continuous outcome

| Model | True | ASE | Bias (App) | ESE (App) | 95%-CP (App) | Bias (CV) | ESE (CV) | 95%-CP (CV) |
|---|---|---|---|---|---|---|---|---|
| M1 | 9.510 | 0.629 | 0.109 | 0.609 | 0.952 | 0.073 | 0.602 | 0.955 |
| | | 0.621 | 0.047 | 0.605 | 0.946 | 0.026 | 0.602 | 0.947 |
| | | 0.621 | 0.047 | 0.605 | 0.946 | 0.026 | 0.602 | 0.947 |
| M2 | 9.510 | 0.628 | 0.108 | 0.610 | 0.949 | 0.076 | 0.605 | 0.954 |
| | | 0.621 | 0.048 | 0.603 | 0.946 | 0.022 | 0.600 | 0.946 |
| | | 0.621 | 0.048 | 0.603 | 0.946 | 0.022 | 0.600 | 0.946 |
| M3 | 8.842 | 0.632 | 0.124 | 0.629 | 0.948 | 0.089 | 0.623 | 0.944 |
| | | 0.621 | 0.044 | 0.620 | 0.945 | −0.020 | 0.616 | 0.945 |
| | | 0.621 | 0.044 | 0.620 | 0.945 | −0.020 | 0.616 | 0.945 |
| M4 | 8.834 | 0.639 | 0.212 | 0.633 | 0.962 | 0.146 | 0.630 | 0.956 |
| | | 0.637 | 0.137 | 0.627 | 0.955 | 0.107 | 0.623 | 0.944 |
| | | 0.630 | 0.074 | 0.621 | 0.951 | 0.057 | 0.613 | 0.944 |

(b) Binary outcome

| Model | True | ASE | Bias (App) | ESE (App) | 95%-CP (App) | Bias (CV) | ESE (CV) | 95%-CP (CV) |
|---|---|---|---|---|---|---|---|---|
| M1 | 0.766 | 0.042 | 0.028 | 0.037 | 0.927 | 0.016 | 0.037 | 0.935 |
| | | 0.039 | 0.018 | 0.036 | 0.948 | 0.008 | 0.036 | 0.963 |
| | | 0.039 | 0.019 | 0.035 | 0.946 | 0.007 | 0.036 | 0.960 |
| M2 | 0.766 | 0.043 | 0.042 | 0.037 | 0.905 | 0.010 | 0.037 | 0.972 |
| | | 0.041 | 0.033 | 0.035 | 0.920 | −0.005 | 0.036 | 0.974 |
| | | 0.040 | 0.034 | 0.035 | 0.921 | −0.004 | 0.036 | 0.976 |
| M3 | 0.754 | 0.041 | 0.045 | 0.036 | 0.902 | 0.029 | 0.037 | 0.933 |
| | | 0.039 | 0.029 | 0.035 | 0.946 | 0.019 | 0.035 | 0.949 |
| | | 0.038 | 0.030 | 0.035 | 0.940 | 0.020 | 0.035 | 0.943 |
| M4 | 0.749 | 0.048 | 0.039 | 0.042 | 0.900 | 0.031 | 0.041 | 0.915 |
| | | 0.042 | 0.032 | 0.041 | 0.915 | 0.024 | 0.040 | 0.924 |
| | | 0.041 | 0.019 | 0.040 | 0.924 | 0.015 | 0.039 | 0.940 |
(b) Binary outcome | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
Apparent |
2-fold CV |
||||||||
Model | Method | True | ASE | Bias | ESE | 95%-CP | Bias | ESE | 95%-CP |
0.042 | 0.028 | 0.037 | 0.927 | 0.016 | 0.037 | 0.935 | |||
M 1 | 0.766 | 0.039 | 0.018 | 0.036 | 0.948 | 0.008 | 0.036 | 0.963 | |
0.039 | 0.019 | 0.035 | 0.946 | 0.007 | 0.036 | 0.960 | |||
0.043 | 0.042 | 0.037 | 0.905 | 0.010 | 0.037 | 0.972 | |||
M 2 | 0.766 | 0.041 | 0.033 | 0.035 | 0.920 | −0.005 | 0.036 | 0.974 | |
0.040 | 0.034 | 0.035 | 0.921 | −0.004 | 0.036 | 0.976 | |||
0.041 | 0.045 | 0.036 | 0.902 | 0.029 | 0.037 | 0.933 | |||
M 3 | 0.754 | 0.039 | 0.029 | 0.035 | 0.946 | 0.019 | 0.035 | 0.949 | |
0.038 | 0.030 | 0.035 | 0.940 | 0.020 | 0.035 | 0.943 | |||
0.048 | 0.039 | 0.042 | 0.900 | 0.031 | 0.041 | 0.915 | |||
M 4 | 0.749 | 0.042 | 0.032 | 0.041 | 0.915 | 0.024 | 0.040 | 0.924 | |
0.041 | 0.019 | 0.040 | 0.924 | 0.015 | 0.039 | 0.940 |
Although V̂(ĉ) and V̂(c̃) are also asymptotically equivalent, we note that V̂(ĉ) tends to have slightly larger bias and variation in finite samples. This could be in part due to the larger variability in estimating the corresponding optimal threshold values, as shown in Table 2. The bias and variance of ĉ are substantially larger than those of c̃. The efficiency of ĉ relative to c̃ is only about 66–75% for most of the settings we considered. This confirms the disadvantage of estimating c° by directly maximizing the non-smooth empirical objective function V̂(c).
Table 2. Simulation results for estimating the optimal threshold c° for the continuous and binary outcomes: true value (True), empirical bias (Bias), empirical SE (ESE), average of the estimated SEs (ASE), and empirical coverage of the 95% CIs (95%-CP). Within each model, the first row corresponds to the direct-maximization estimator ĉ and the second to the smoothed estimator c̃.

| Model | Method | True (Cont.) | Bias | ESE | ASE | 95%-CP | True (Bin.) | Bias | ESE | ASE | 95%-CP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | ĉ | 0.00 | −0.178 | 1.667 | 1.753 | 0.960 | 0.00 | −0.012 | 0.115 | 0.121 | 0.961 |
| | c̃ | | 0.036 | 1.358 | 1.407 | 0.967 | | −0.001 | 0.100 | 0.106 | 0.971 |
| M2 | ĉ | 0.00 | −0.201 | 1.647 | 1.757 | 0.964 | 0.00 | −0.021 | 0.118 | 0.128 | 0.958 |
| | c̃ | −0.0 | −0.007 | 1.355 | 1.407 | 0.972 | | 0.004 | 0.112 | 0.118 | 0.972 |
| M3 | ĉ | −0.019 | −0.212 | 1.764 | 1.851 | 0.967 | −0.023 | −0.022 | 0.109 | 0.114 | 0.954 |
| | c̃ | | 0.101 | 1.480 | 1.550 | 0.965 | | −0.008 | 0.094 | 0.100 | 0.964 |
| M4 | ĉ | −0.121 | 0.190 | 1.693 | 1.853 | 0.979 | 0.017 | −0.030 | 0.103 | 0.105 | 0.951 |
| | c̃ | | 0.071 | 1.491 | 1.547 | 0.946 | | −0.018 | 0.099 | 0.102 | 0.964 |
We also evaluated the finite-sample performance of the inference procedure for the IncV of new markers. In Table 3, we present results on the estimated IncV of X1 and X3 in improving the value function, comparing the ITRs derived from the full model (M1) to those from model (M3), which uses no information on X1 and X3. Our proposed procedures give minimally biased estimates of the IncV, and the resampling procedures provide good estimates of the SEs and CIs. Since the model sizes are small, both the apparent estimates and the CV estimates lead to reasonable interval estimates with proper coverage levels. For comparison, we also provide the results based on the direct-maximization estimator ĉ, which likewise leads to consistent estimates of the IncV since Δ̄(·) is monotone for both models. Similar to the results shown earlier, the direct maximization of V̂(c) results in larger variability in ĉ, which subsequently leads to higher variability in estimating the IncV in finite samples.
Table 3. Simulation results for estimating the IncV of X1 and X3 (ITRs from (M1) versus (M3)): true IncV (True), ASE, and the Bias, ESE and 95%-CP of the apparent (App) and 2-fold CV estimates. Within each outcome type, the first row is based on the direct-maximization threshold and the second on the proposed estimator.

| Response | True | ASE | Bias (App) | ESE (App) | 95%-CP (App) | Bias (CV) | ESE (CV) | 95%-CP (CV) |
|---|---|---|---|---|---|---|---|---|
| Continuous | 0.663 | 0.215 | 0.025 | 0.217 | 0.946 | −0.019 | 0.218 | 0.948 |
| | | 0.187 | 0.013 | 0.183 | 0.948 | −0.007 | 0.184 | 0.952 |
| Binary | 0.011 | 0.026 | 0.005 | 0.024 | 0.973 | 0.002 | 0.024 | 0.966 |
| | | 0.021 | 0.004 | 0.020 | 0.966 | 0.001 | 0.019 | 0.957 |
4.2 Comparisons to Existing Methods
Additional simulation studies were conducted to compare our calibration method with those proposed in Qian and Murphy (2011) (QM) and Zhang et al. (2012a) (Zhang). The QM method employs model-based ITRs but guards against model mis-specification by including non-linear basis functions, and obtains stable parameter estimates by imposing L1 penalization. In our simulations, we included all linear effects and two-way interactions for their method. To apply the Zhang method, we set the propensity score to 0.5 and searched for the optimal ITR using linear regression followed by a CART procedure. The complexity parameter was first set at 0.001 to build a large tree, which was then pruned via 10-fold CV (see the sketch below).
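Our rendering of that tree-fitting recipe in R; the construction of the classification dataset `cart_dat`, its `label`, and the case weights `cart_wts` follows the classification perspective of Zhang et al. (2012a) and is not shown here.

```r
## Grow a large classification tree with cp = 0.001, then prune at the
## complexity value minimizing the 10-fold cross-validated error.
library(rpart)
big_tree <- rpart(label ~ ., data = cart_dat, weights = cart_wts,
                  method = "class",
                  control = rpart.control(cp = 0.001, xval = 10))
cp_best <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = cp_best)
```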
To compare the performance of these methods, we generated X20×1 ~ N(0, 2.4I20×20 + 1.6) and a continuous Y that depends non-linearly on the first 5 covariates through

$$Y_i^{(j)} = \beta_j^\top \mathbf{Z}_i + \epsilon_i, \tag{4.3}$$

where Zi = (1, …, sin(X4i))⊤ is an 8-dimensional basis vector of non-linear transformations of (X1i, …, X5i), β0 = (12, 5, −6, −2, 3, −4, 7, −2)⊤ and β1 = (2, 1, 2, 3, 2, 2, 3, 3)⊤. To construct the ITRs, we investigated four sets of working linear models: the true specification, an overfitted specification, and two mis-specified specifications.
Under the two mis-specified specifications, Δ̄(·) is non-monotone. Since our proposal also allows non-linear transformations, we derived the calibrated rule using both (i) linear working models (CalibratedL) and (ii) linear + two-way interaction working models with a ridge penalty (CalibratedL+I). For comparison, we also considered the model-based rules that assign patients according to I{D(β̂, X) ≥ ξ}, with linear effects (ModelL) and with linear + interaction effects (ModelL+I). The ridge penalty parameters for ModelL+I and CalibratedL+I were chosen via CV. For all scenarios, we estimated the ITRs with training sets of sizes n0 = n1 = 250 and evaluated their corresponding value functions on independent validation datasets with n0 = n1 = 10^5. We used large validation sets to ensure that the variability observed in the empirical value function reflects mainly the variability due to estimating the ITRs. Table 4 summarizes the mean and empirical SE (ESE) of the empirical value functions for (1) ModelL; (2) CalibratedL; (3) ModelL+I; (4) CalibratedL+I; (5) QM; and (6) Zhang.
Table 4. Mean and empirical SE (ESE) of the value functions achieved on large independent validation data by the six methods, under the four working-model scenarios.

| Scenario | | ModelL | CalibratedL | ModelL+I | CalibratedL+I | QM | Zhang |
|---|---|---|---|---|---|---|---|
| True | Mean | 42.39 | 42.39 | 42.39 | 42.39 | 42.38 | 42.06 |
| | ESE | 0.13 | 0.13 | 0.13 | 0.13 | 0.13 | 0.15 |
| Overfitted | Mean | 42.39 | 42.39 | 40.90 | 42.01 | 42.35 | 42.04 |
| | ESE | 0.13 | 0.13 | 0.51 | 0.16 | 0.14 | 0.16 |
| Mis-specified (a) | Mean | 39.65 | 41.95 | 40.47 | 41.99 | 40.61 | 41.75 |
| | ESE | 0.95 | 0.17 | 0.39 | 0.15 | 0.43 | 0.72 |
| Mis-specified (b) | Mean | 39.61 | 41.89 | 40.42 | 41.92 | 40.58 | 41.72 |
| | ESE | 1.01 | 0.17 | 0.52 | 0.15 | 0.57 | 0.70 |
Under the true working-model specification with 5 covariates, all methods perform very similarly. Our calibrated method performs almost identically to the model-based method, suggesting that little price is paid for the additional calibration. The QM method also did not pay much of a price for including the non-informative two-way interactions in this case. With the overfitted specification, except for ModelL+I, all methods achieve a similar level of value function with slight differences in variability. The QM method performs quite well considering that about 200 covariates are included in the fitting, of which only 5 are informative. This is not too surprising since L1 penalization is expected to work well when the signals are strong and sparse, as in the present case. Both the QM method and CalibratedL achieved value functions almost identical to those obtained under the true specification. CalibratedL+I and the Zhang method resulted in slightly higher variability compared to the QM method and CalibratedL. ModelL+I has a slightly lower value function with larger variability, suggesting instability in ridge-penalized model fitting. However, the calibration appears to reduce this instability, with CalibratedL+I much more stable than ModelL+I. In general, the price paid for overfitting under correct model specification seems to be relatively low. This is in part because all methods build on correctly specified models, so the estimated regression parameters maximize the value function asymptotically; as a result, the variability due to estimating the parameters in the ITR contributes only at the second order, similar to the arguments in Zhao et al. (2012).
With the mis-specified working models, our proposed calibration method performs better than all competing methods with respect to the achievable value function and/or its variability. For example, under the first mis-specified specification, the average value function was 41.95 and 41.99 for CalibratedL and CalibratedL+I, respectively, 40.61 for the QM method, and 41.75 for the Zhang method. Since the true underlying effects are quite non-linear, the model-based method with linear effects performs poorly in this setting, with a value function of only 39.65. There is a slight increase in the achievable value function for our calibration method when the non-linear bases are included in the working models. Under mis-specified models, the calibration method also tends to produce more stable ITRs than the QM and Zhang methods: for example, the ESE was 0.17 and 0.15 for CalibratedL and CalibratedL+I, respectively, 0.43 for the QM method, and 0.72 for the Zhang method.
5. Example: Application to an HIV/AIDS Clinical Trial
In this section, we use a randomized clinical trial from the AIDS Clinical Trials Group Protocol 320 (ACTG 320) to illustrate our proposed methods. A total of 1156 zidovudine-experienced patients with advanced human immunodeficiency virus type 1 (HIV-1) infection were enrolled in ACTG 320 and assigned to either a 2-drug combination of zidovudine and lamivudine (treatment 0) or a 3-drug combination of zidovudine, lamivudine and indinavir (treatment 1). The objective of this study was to assess the added value of a protease inhibitor (indinavir) over the dual nucleoside reverse-transcriptase inhibitors (zidovudine and lamivudine) (Hammer et al., 1997). The overall effect of the 3-drug combination on various study endpoints was so significant that the study was terminated early by the Data Safety Monitoring Board. However, since the benefit of the 3-drug combination therapy over the 2-drug alternative potentially differs across patients, it is of interest to identify subgroups of patients who can be managed almost as well with the less potent 2-drug therapy.
For our analyses, the potential baseline predictors for constructing ITRs consist of age (years), CD40, logCD40, and log10RNA0, where CD4k and RNAk denote, respectively, the CD4 cell count (cells/mm3) and the RNA measure (copies/ml) at week k. We restricted our study to the 856 subjects who had complete information on these variables and on the outcome of interest, 427 of whom were in the 3-drug combination group. Because it is relatively expensive to measure RNA, particularly in resource-limited settings, it is of interest to examine whether Xnew = log10RNA0 is useful for improving the ITRs. Thus, we compared the optimal ITRs based on working models with the following two sets of predictors: (Mold): Xold = (1, Age, CD40, logCD40)⊤ and (Mupdate): Xupdate = (Xold⊤, log10RNA0)⊤.
To quantify the effectiveness of the therapy, we considered the immune response Y, defined as the change in logCD4 from baseline to week 24. Table 5 summarizes the estimated regression coefficients of the fitted linear models under (Mupdate) and (Mold). To construct optimal ITRs under these two models, we let ξ = 0.277, which is about the within-subject variability of logCD4 counts (Hughes et al., 1994), indicating that the 3-drug therapy is preferred only if the gain in immune response exceeds ξ. All estimators of the value functions are based on 500 repeated two-fold CVs, and the SEs are based on 500 perturbations.
Table 5. Estimated regression coefficients (with SEs and p-values) from the treatment-specific linear working models under (Mupdate) and (Mold).

| Model | Treatment | | Intercept | Age | CD40 | log(CD40) | log10(RNA0) |
|---|---|---|---|---|---|---|---|
| Mupdate | 2-drug | Estimate | 1.065 | 0.002 | 0.004 | −0.338 | 0.011 |
| | | Std. Error | 0.350 | 0.004 | 0.001 | 0.057 | 0.049 |
| | | p-value | 0.002 | 0.568 | < 0.001 | < 0.001 | 0.829 |
| | 3-drug | Estimate | 1.317 | 0.009 | −0.001 | −0.328 | 0.128 |
| | | Std. Error | 0.348 | 0.004 | 0.001 | 0.062 | 0.051 |
| | | p-value | < 0.001 | 0.023 | 0.199 | < 0.001 | 0.013 |
| Mold | 2-drug | Estimate | 1.124 | 0.002 | 0.004 | −0.338 | |
| | | Std. Error | 0.218 | 0.004 | 0.001 | 0.057 | |
| | | p-value | < 0.001 | 0.584 | < 0.001 | < 0.001 | |
| | 3-drug | Estimate | 1.984 | 0.008 | −0.002 | −0.318 | |
| | | Std. Error | 0.225 | 0.004 | 0.001 | 0.062 | |
| | | p-value | < 0.001 | 0.035 | 0.066 | < 0.001 | |
Under (Mupdate), the estimated maximum cost-adjusted value function is 0.610 with a 95% CI of (0.526, 0.692), and the corresponding optimal threshold estimate is 0.236 with a 95% CI of (−0.040, 0.512). Under (Mold), i.e., without the RNA information, the estimated value is 0.606 with a 95% CI of (0.522, 0.690) and the threshold estimate is 0.231 with a 95% CI of (−0.004, 0.466). Finally, the estimated IncV is 0.004 with a 95% CI of (−0.035, 0.043). This implies that, despite the highly significant difference between the effects of RNA on the outcome in the two treatment groups (as shown in Table 5), including the RNA information in the ITR does not improve the value function; RNA is therefore not needed to better assign future patients to a specific treatment. Including two-way interactions of all variables under (Mupdate) and (Mold), our calibrated method yields ITRs with estimated values of 0.614 and 0.609, respectively, similar to their linear-effect counterparts. We also applied the QM and Zhang methods to the ACTG 320 data and observed slightly lower value functions than those from our methods: the value functions associated with (Mupdate) and (Mold) are, respectively, 0.594 and 0.585 based on the Zhang method, and 0.578 and 0.569 based on the QM method.
It is also of interest to compare the estimated ITR to the simple rules that assign all patients to the 3-drug combination or all patients to the 2-drug combination. The value functions of these rules are estimated as 0.562 (all treated with 3 drugs) and 0.177 (all treated with 2 drugs). Compared to assigning all patients to the 3-drug combination, the estimated ITR leads to an increased value of 0.048 with a 95% CI of (0.003, 0.093), suggesting an improvement from adopting the ITR.
6. Remarks
In this paper, we have proposed a robust procedure that uses multiple baseline covariates to develop and evaluate ITRs. While the procedure builds upon an existing two-step framework, this paper provides additional insights into how to develop an optimal ITR based on working models and how to evaluate such ITRs. Fitting the data with semi-parametric models in step I, combined with the non-parametric calibration in step II, provides robust estimates of ITRs and valid estimates of their associated cost-adjusted population average outcomes. Our proposed ITRs are optimal across all rules based on the given set of covariates when the fitted working models are correct or nearly correct, as discussed in Section 2, and optimal among all rules based on the estimated score when the fitted models are mis-specified. To account for the variability in estimating the various parameters, we proposed perturbation-resampling procedures for assessing the variability of the estimators.
While in traditional statistical methods model misspecification may hinder predictions and lead to inaccurate assignment rules, the methods developed in this paper rely on a layer of calibration in step II to guard against model misspecification. We provide justifications for when the proposed procedures lead to ITRs that are globally optimal, under correct or nearly correct model specification, and optimal within a class of ITRs under model misspecification. Under model misspecification, the proposed ITR could be suboptimal compared to the globally optimal ITR. Hence, it is crucial to provide working models that can approximate the true model reasonably well. Incorporating non-linear effects through basis function specification could be helpful, and one may use the proposed inference procedure for comparing value functions as a tool for selecting important basis functions.
Through theoretical and simulation studies, we also demonstrated that direct optimization of the empirical value function could lead to rather unstable ITRs, with high variability and a slow convergence rate in estimating the optimal threshold values, even when the underlying conditional treatment difference function Δ̄(c) is monotone in c.
When new biomarkers are introduced to assist in treatment selection, it is important to evaluate their value in improving population average outcomes. A variable highly differentially associated with Y(1) and Y(0) is not necessarily important for improving ITRs. This is somewhat similar to the phenomenon observed in the risk prediction literature: a variable highly significant in regression modeling may not yield a large improvement in prediction. However, the IncV with respect to improving ITRs is more subtle than in the typical prediction setting. Taking an extreme case with a single marker X and cost ξ = 0, suppose D(X) is strictly increasing and D(X) ≫ 0 for all X. Then X will obviously be selected as important for predicting treatment response by any variable selection procedure. On the other hand, X would be deemed unimportant for treatment selection, since all subjects should be assigned to treatment 1, and having the information on X changes neither the treatment assignment nor the corresponding value function. Thus, if one is interested in measuring the importance of markers in guiding treatment selection, it would be valuable to directly assess the IncV with respect to the value function, as proposed in this paper. When the dimension of the new markers is not small, it is crucial to employ cross-validation to correct for the overfitting bias, as suggested by Zhao et al. (2013). Procedures for efficiently selecting informative markers warrant further research.
Acknowledgement
The work is partially supported by grants R01-GM079330, R01-HL089778, R01-AI052817, and T32-NS048005 awarded by the National Institutes of Health.
Footnotes
Supplementary Material Web Appendices referenced in Sections 1 and 3 as well as the R code used to implement our simulations are available with this paper at the Biometrics website on Wiley Online Library.
References
- Baker S, Kramer B, Sargent D, Bonetti M. Biomarkers, subgroup evaluation, and clinical trial design. Discovery Medicine. 2012;13:187–192.
- Bonetti M, Gelber R. A graphical method to assess treatment–covariate interactions using the Cox model on subsets of the data. Statistics in Medicine. 2000;19:2595–2609.
- Bonetti M, Gelber R. Patterns of treatment effects in subsets of patients in clinical trials. Biostatistics. 2004;5:465–481.
- Cai T, Tian L, Wei L. Semiparametric Box–Cox power transformation models for censored survival observations. Biometrika. 2005;92:619–632.
- Cai T, Tian L, Wong P, Wei L. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics. 2011;12:270–282.
- Foster J, Taylor J, Ruberg S. Subgroup identification from randomized clinical trial data. Statistics in Medicine. 2011;30:2867–2880.
- Hammer S, Squires K, Hughes M, Grimes J, Demeter L, Currier J, Eron J Jr, Feinberg J, Balfour H Jr, Deyton L, et al. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. New England Journal of Medicine. 1997;337:725–733.
- Holland PW. Statistics and causal inference. Journal of the American Statistical Association. 1986;81:945–960.
- Huang Y, Gilbert PB, Janes H. Assessing treatment-selection markers using a potential outcomes framework. Biometrics. 2012;68:687–696.
- Hughes MD, Stein DS, Gundacker HM, Valentine FT, Phair JP, Volberding PA. Within-subject variation in CD4 lymphocyte count in asymptomatic human immunodeficiency virus infection: implications for patient monitoring. Journal of Infectious Diseases. 1994;169:28–36.
- Imai K, Strauss A. Estimation of heterogeneous treatment effects from randomized experiments, with application to the optimal planning of the get-out-the-vote campaign. Political Analysis. 2011;19:1–19.
- Janes H, Pepe M, Bossuyt P, Barlow W. Measuring the performance of markers for guiding treatment decisions. Annals of Internal Medicine. 2011;154:253–259.
- Kent DM, Hayward RA. Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification. Journal of the American Medical Association. 2007;298:1209–1212.
- Kim J, Pollard D. Cube root asymptotics. The Annals of Statistics. 1990;18:191–219.
- Park Y, Wei L. Estimating subject-specific survival functions under the accelerated failure time model. Biometrika. 2003;90:717–723.
- Qian M, Murphy S. Performance guarantees for individualized treatment rules. Annals of Statistics. 2011;39:1180–1210.
- Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512.
- Rothwell P. Can overall results of clinical trials be applied to all patients? The Lancet. 1995;345:1616–1619.
- Rothwell P, Mehta Z, Howard S, Gutnikov S, Warlow C. From subgroups to individuals: general principles and the example of carotid endarterectomy. The Lancet. 2005;365:256–265.
- Rubin D. Randomization analysis of experimental data: the Fisher randomization test comment. Journal of the American Statistical Association. 1980;75:591–593.
- Rubin D. Statistics and causal inference: comment: which ifs have causal answers. Journal of the American Statistical Association. 1986;81:961–962.
- Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66:688.
- Rubin DB. Bayesian inference for causal effects: the role of randomization. The Annals of Statistics. 1978;6:34–58.
- Song X, Pepe M. Evaluating markers for selecting a patient's treatment. Biometrics. 2004;60:874–883.
- Song X, Zhou X-H. Evaluating markers for treatment selection based on survival time. UW Biostatistics Working Paper Series, Working Paper 349. 2009:1–27.
- Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber E. Estimating optimal treatment regimes from a classification perspective. Stat. 2012a;1:103–114.
- Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012b;68:1010–1018.
- Zhao L, Tian L, Cai T, Claggett B, Wei L. Effectively selecting a target population for a future comparative study. Journal of the American Statistical Association. 2013;108:527–539.
- Zhao Y, Zeng D, Rush A, Kosorok M. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association. 2012;107:1106–1118.