Abstract
Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many complex diseases, such as cancer, treatment decisions need to be tailored over time according to patients’ responses to previous treatments. Such an adaptive strategy is referred to as a dynamic treatment regime. A major challenge in deriving an optimal dynamic treatment regime arises when an extraordinarily large number of prognostic factors, such as patients’ genetic information, demographic characteristics, medical history and clinical measurements over time, are available, but not all of them are necessary for making treatment decisions. This makes variable selection an emerging need in precision medicine.
In this paper, we propose a penalized multi-stage A-learning for deriving the optimal dynamic treatment regime when the number of covariates is of the non-polynomial (NP) order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector, which directly penalizes the A-learning estimating equations. Oracle inequalities of the proposed estimators for the parameters in the optimal dynamic treatment regime, and error bounds on the difference between the value functions of the estimated optimal dynamic treatment regime and the true optimal dynamic treatment regime, are established. Empirical performance of the proposed approach is evaluated by simulations and illustrated with an application to data from the STAR*D study.
Keywords and phrases: A-learning, Dantzig selector, NP-dimensionality, Model misspecification, Optimal dynamic treatment regime, Oracle inequality
1. Introduction
Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many chronic diseases, such as cancer, cardiovascular disease and diabetes, treatment decisions need to be tailored over time according to patients’ responses to previous treatments. Such an adaptive treatment strategy is referred to as a dynamic treatment regime. Formally speaking, a dynamic treatment regime is a sequence of decision rules dictating how the treatment is tailored over time to an individual’s status. The optimal dynamic treatment regime is defined as the one that yields the most favorable outcome on average.
Various methods have been proposed to estimate the optimal dynamic treatment regime, including Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010) and A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003). Both Q-learning and A-learning rely on a backward induction algorithm to find the optimal dynamic treatment regime; however, Q-learning models the conditional mean of the outcome given predictors and treatment, while A-learning directly models the contrast function that is sufficient for treatment decisions. In particular, A-learning has the so-called doubly robust property: when either the baseline mean function or the propensity score model is correctly specified, the resulting A-learning estimator of the contrast function is consistent.
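To make the double robustness concrete, the following one-line calculation (a standard argument, written here for a single-stage analogue of the models used later; h0 and π0 denote the true baseline and propensity functions, h and π the posited ones) shows why the A-learning estimating function has mean zero as soon as one of the two nuisance models is correct:

```latex
% Double robustness of the A-learning estimating function (sketch).
% Truth: Y = h_0(X) + A C(X) + e with E(e | X, A) = 0; posited nuisances h, pi.
\begin{align*}
E\bigl[X\{A - \pi(X)\}\{Y - h(X) - A\,C(X)\}\bigr]
  &= E\bigl[X\{A - \pi(X)\}\{h_0(X) - h(X)\}\bigr] \\
  &= E\bigl[X\{\pi_0(X) - \pi(X)\}\{h_0(X) - h(X)\}\bigr],
\end{align*}
% which vanishes when pi = pi_0 (propensity correct) or h = h_0 (baseline
% correct): the bias is driven by the product of the two model errors.
```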
With the fast development of new technology, it has become possible to gather an extraordinarily large number of prognostic factors for each individual, such as a patient’s genetic information, demographic characteristics, medical history and clinical measurements over time. For such big data, it is important to make effective use of the information that is relevant to optimal individualized treatment decisions, which makes variable selection an emerging need for implementing precision medicine. In addition, variable selection is an essential tool for making inference in problems where the number of covariates is comparable to or much larger than the sample size. There have been extensive developments of penalized regression methods for variable selection in prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the Dantzig selector (Candès and Tao, 2007), to name a few. In contrast to most penalized regression methods, which add a penalty term to an objective function, the Dantzig selector operates directly on estimating equations.
Although there is a large amount of work on developing variable selection methods for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime. When the number of covariates is fixed, a penalized least squares regression framework was introduced that establishes the oracle property of the estimator and is robust against misspecification of the conditional mean function; this result was later extended to the setting allowing NP-dimensionality of covariates. However, all these works only consider studies with a single treatment decision. When moving to multiple-stage studies, the asymptotic properties of the estimated optimal dynamic treatment regime are much harder to derive, since one needs to handle model misspecification of the contrast functions in the presence of NP-dimensionality of covariates. Moreover, these methods are not doubly robust.
In this paper, we propose a penalized A-learning method for deriving the optimal dynamic treatment regime when the number of covariates is of the NP order of the sample size. To preserve the doubly robust property of the A-learning method, we adopt the Dantzig selector (Candès and Tao, 2007), which directly penalizes the A-learning estimating equations. The technical challenges and advances of the proposed estimators are described as follows.
First, to prove the theoretical properties of the Dantzig estimator in the linear regression setting, the uniform uncertainty principle (UUP, Candès and Tao, 2007) or the restricted eigenvalue condition (RE, Bickel, Ritov and Tsybakov, 2009) is required on the Gram matrix XT X, where X stands for the design matrix. The UUP condition essentially requires that every principal submatrix with the number of rows or columns less than some specified s behaves like an orthonormal system. The RE condition is the weakest, and hence the most general, condition in the literature to ensure the theoretical properties of the Lasso and Dantzig estimators. A close connection between these two conditions is discussed in Bickel, Ritov and Tsybakov (2009). In the random design case, Candès and Tao (2007) studied the UUP condition for Gaussian, Bernoulli and Fourier ensembles. Mendelson, Pajor and Tomczak-Jaegermann (2007, 2008) obtained a similar result for a more general class of design matrices, the isotropic subgaussian matrices, based on empirical process results. These results were further extended by Zhou (2009), where the UUP and RE conditions are developed for subgaussian ensembles with a correlated covariance structure. In the proposed penalized A-learning method, however, such conditions are required on matrices involving estimates, such as
(1.1) |
where A = (A1, …, An)T denotes the vector of treatments received by n subjects, denotes the corresponding estimated propensity scores and ◦ denotes the componentwise product operator. The presence of in (1.1) adds extraordinary difficulties in establishing theoretical properties of such a random matrix. We establish the UUP and RE conditions under a proper convergence rate of , which provides a new theoretical framework for studying random matrices that involve estimates of unknown parameters.
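As a concrete illustration of the difficulty, the following sketch (in Python, with an assumed form of the matrix in (1.1), namely the Gram matrix of the rows of X rescaled by the residuals A − π̂; all names and numbers below are hypothetical) probes the smallest restricted eigenvalues of such a matrix when the propensity scores are estimated rather than known:

```python
import numpy as np

# Illustration of a (1.1)-type matrix (assumed form): the Gram matrix of the
# rows of X reweighted by the treatment residual A - pi_hat(X). We probe its
# smallest eigenvalue on random small coordinate subsets, a crude UUP-style
# diagnostic.
rng = np.random.default_rng(1)
n, p, s = 200, 500, 5
X = rng.normal(size=(n, p))
alpha = np.zeros(p)
alpha[:3] = 0.5                                    # sparse propensity model
pi_true = 1.0 / (1.0 + np.exp(-X @ alpha))
A = rng.binomial(1, pi_true)
# an "estimated" propensity: perturb alpha on its support at a root-n scale
alpha_hat = alpha + (alpha != 0) * rng.normal(scale=n ** -0.5, size=p)
pi_hat = 1.0 / (1.0 + np.exp(-X @ alpha_hat))
# G = n^{-1} {(A - pi_hat) o X}^T {(A - pi_hat) o X}
G = (X * ((A - pi_hat) ** 2)[:, None]).T @ X / n
eigs = [np.linalg.eigvalsh(G[np.ix_(idx, idx)]).min()
        for idx in (rng.choice(p, s, replace=False) for _ in range(50))]
print(f"smallest restricted eigenvalue over 50 subsets: {min(eigs):.3f}")
```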
Second, in the proposed penalized A-learning method, we need to estimate the baseline mean function and the propensity score model with NP-dimensionality of covariates. We adopt penalized regressions with folded-concave penalties, for example, a linear regression for the baseline mean function and a logistic regression for the propensity score model, both with the SCAD penalty. Several difficulties need to be addressed to derive the theoretical properties of the resulting penalized estimators. First, to our knowledge, penalized regressions with folded-concave penalties have seldom been studied in a random design setting. A major difficulty in adapting the existing results for the fixed design case to the random design case is controlling the maximum eigenvalues of some random matrices,
where λmax[K] denotes the maximum eigenvalue of a matrix K, M is a given subset of [1, ⋯, p], Xj denotes the jth column of a matrix X, and XM the submatrix formed by the columns in M. Such a problem is not standard since the matrix does not possess a subexponential tail. We derive some concentration inequalities for such random matrices and for summations of subexponential and subgaussian random variables. Based on these results, we establish the weak oracle properties (Lv and Fan, 2009), i.e., sign consistency and the L∞ convergence rate of the estimators under subgaussian ensembles, which is one of our major technical contributions. Moreover, the posited models for the baseline mean function or the propensity score may be misspecified. Therefore, the derivation of the asymptotic properties needs to take into account model misspecification with NP-dimensionality of covariates, which is challenging.
Third, a challenge in extending the results for a single treatment decision to sequential treatment decisions is that the contrast functions are likely to be misspecified in backward induction algorithms such as A-learning. This, together with the NP-dimensionality of covariates, makes it extremely hard to study the theoretical properties of the value function under the estimated optimal dynamic treatment regime. We overcome this difficulty by first defining population-level least favorable parameters in the misspecified contrast functions. We then derive error bounds for the corresponding estimates under model misspecification, which in turn lead to an error bound for the difference between the value functions of the estimated optimal dynamic treatment regime and the underlying true optimal dynamic treatment regime.
The remainder of the paper is organized as follows. We introduce the proposed penalized A-learning method in Section 2. Some implementation issues are addressed in Section 3, followed by simulation results in Section 4. We apply our method to data from the STAR*D study in Section 5. Section 6 studies the error bounds of the penalized A-learning estimator and the difference between the value functions of the estimated optimal regime and the true optimal regime at the second stage. Section 7 characterizes such results for the estimates at the first stage. Section 8 presents the weak oracle properties of the penalized estimators in the propensity score and baseline mean models under a random design setting. Section 9 discusses the UUP and RE conditions in the context of A-learning. All technical conditions, lemmas and proofs are given in the Appendix.
2. Penalized A-Learning
For simplicity of presentation, we only consider a two-stage study where binary treatment decisions are made at time points t1 and t2. The data of a subject can be summarized as
(2.1) O = (S(1), A(1), S(2), A(2), Y),

where S(1) denotes the covariates collected prior to t1, A(1) ∈ {0, 1} is the treatment received at time t1, S(2) denotes the intermediate covariates collected between time points t1 and t2, A(2) ∈ {0, 1} is the treatment received at time t2, and Y is the final outcome of interest. It is assumed that a larger value of Y stands for a better clinical outcome. Denote by Y★(a1, a2) the potential outcome of a patient if he/she were given a1 as the first treatment and a2 as the second. If a patient follows a given regime (d1, d2), we can write the potential outcome
where I(·) denotes the indicator function. Our goal is to find a dynamic treatment regime to maximize the mean potential outcome. Throughout the paper, we make the commonly used assumptions for studying dynamic treatment regimes: stable unit treatment value assumption and sequential randomization assumption (Murphy, 2003).
The observed data from n subjects can be summarized as

Oi = (Si(1), Ai(1), Si(2), Ai(2), Yi), i = 1, …, n,

which are assumed to be independently and identically distributed copies of O. We assume the following semiparametric regression model for Y:
(2.2) Yi = h(2)(Xi) + Ai(2)C(2)(Xi) + ei, i = 1, …, n,

where Xi is the vector of covariates for the ith patient, h(2)(·) is an unspecified baseline mean function, C(2)(·) is the contrast function, and ei is an independent error with mean 0. The design matrix is denoted as X = (X1, …, Xn)T.
Define
At the first stage, we consider the following conditional mean model for V(2):
(2.3) E{V(2) | S(1), A(1)} = h(1)(S(1)) + A(1)C(1)(S(1)),
where h(1)(·) and C(1)(·) are functions of the baseline covariates. To simplify the notation, we use a shorthand Si for and let S = (S1, …, Sn)T, the design matrix at the baseline.
It can be shown that the optimal dynamic treatment regime is given by (d1opt, d2opt), where

(2.4) d1opt(s) = I{C(1)(s) > 0} and d2opt(x) = I{C(2)(x) > 0}.
To estimate d1opt and d2opt, we posit the following models for C(1)(·), C(2)(·), h(1)(·), h(2)(·), π(1)(·), and π(2)(·):

(2.5) C(1)(s; β1) = sTβ1 and C(2)(x; β2) = xTβ2,

(2.6) h(1)(s; θ1) = sTθ1 and h(2)(x; θ2) = xTθ2,

(2.7) π(1)(s; α1) = exp(sTα1)/{1 + exp(sTα1)},

and

π(2)(x; α2) = exp(xTα2)/{1 + exp(xTα2)}.
Models in (2.5)–(2.7) may be misspecified; however, we require that either h(j) or π(j) be correct for j = 1, 2. For simplicity, we require C(2) to be correctly specified. The general case when C(2) is misspecified can be discussed similarly. We use backward induction to estimate the optimal dynamic treatment regime. At the second decision point, we first estimate the parameters in the posited propensity score and baseline mean models using penalized regressions. Specifically, define
and
where , , and belong to the class of folded-concave penalty functions (Lv and Fan, 2009), such as SCAD (Fan and Li, 2001), and , the associated regularization parameters.
Next, we estimate β2 in (2.2) using the Dantzig selector based on the A-learning estimating function (Murphy, 2003), defined by
(2.8) |
where
and the regularization parameter.
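Because the A-learning estimating function is linear in β2, the resulting Dantzig-type problem can be solved by linear programming, as noted in Section 3. The sketch below illustrates this under the assumption that the stage-2 estimating function can be written as c − Mβ, with c and M as in the comments; it is a minimal sketch, not the paper's exact formulation of (2.8):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_alearn(X, A, Y, pi_hat, h_hat, lam):
    """Solve min ||beta||_1 s.t. ||c - M beta||_inf <= lam by linear
    programming, where c - M beta is the stage-2 A-learning estimating
    function written in linear-in-beta form (a sketch)."""
    n, p = X.shape
    w = A - pi_hat                           # treatment residual A_i - pi_hat_i
    c = X.T @ (w * (Y - h_hat)) / n          # n^{-1} sum_i X_i w_i (Y_i - h_i)
    M = (X * (w * A)[:, None]).T @ X / n     # n^{-1} sum_i X_i w_i A_i X_i^T
    # split beta = u - v with u, v >= 0 and minimize sum(u) + sum(v)
    obj = np.ones(2 * p)
    # |c - M beta|_inf <= lam  <=>  M beta <= c + lam and -M beta <= lam - c
    A_ub = np.vstack([np.hstack([M, -M]), np.hstack([-M, M])])
    b_ub = np.concatenate([c + lam, lam - c])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
```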
To estimate the regime at the first decision point, we define the pseudo-outcome using the advantage function (Murphy, 2003) by
(2.9) V̂i(2) = Yi + max(XiTβ̂2, 0) − Ai(2)XiTβ̂2, i = 1, …, n.
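In code, this advantage-based pseudo-outcome can be computed as below (a minimal sketch assuming the form in (2.9)):

```python
import numpy as np

def pseudo_outcome(Y, A2, X, beta2_hat):
    """Stage-1 pseudo-outcome via the advantage function: Y plus the
    regret max(x'b, 0) - A2 * x'b, i.e. the outcome the patient would
    have obtained under the estimated optimal stage-2 rule."""
    C2 = X @ beta2_hat                    # estimated stage-2 contrast x'b
    return Y + np.maximum(C2, 0.0) - A2 * C2
```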
Similarly, define
and
where , and and are folded-concave penalty functions. Then, we estimate β1 in (2.3) by
(2.10) |
where
The estimated optimal dynamic treatment regime is given by

(2.11) d̂1(s) = I(sTβ̂1 > 0) and d̂2(x) = I(xTβ̂2 > 0).
3. Some Implementation Issues
When the tuning parameters in the optimization problems (2.8) and (2.10) are fixed, the Dantzig selector can be solved by a standard linear programming algorithm. One issue in implementing the Dantzig selector is the choice of the tuning parameters. We use a BIC-type criterion for selecting them. For the Dantzig selector (2.8), the tuning parameter is chosen as the minimizer of
(3.1) |
where , and d(λ) is the number of nonzero components in . A similar BIC criterion was proposed by Chen and Chen (2008). We use a similar criterion for choosing .
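A sketch of the resulting tuning loop is given below. Since the displayed form of (3.1) is not reproduced above, the goodness-of-fit term is an assumed BIC-style choice; only the role of the model-size term d(λ) is taken from the text:

```python
import numpy as np

def select_lambda_bic(X, A, Y, pi_hat, h_hat, lam_grid):
    """Tuning-parameter selection by a BIC-type criterion (a sketch; the
    exact form of (3.1) is assumed, with n*log of the mean squared
    estimating-function residual plus a log(n)*d(lambda) penalty)."""
    n = len(Y)
    best_lam, best_beta, best_bic = None, None, np.inf
    for lam in lam_grid:
        beta = dantzig_alearn(X, A, Y, pi_hat, h_hat, lam)
        resid = (A - pi_hat) * (Y - h_hat - A * (X @ beta))
        d = int(np.sum(np.abs(beta) > 1e-8))   # d(lambda): nonzero components
        bic = n * np.log(np.mean(resid ** 2)) + np.log(n) * d
        if bic < best_bic:
            best_lam, best_beta, best_bic = lam, beta, bic
    return best_lam, best_beta
```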
It has been observed that Dantzig estimators may underestimate the true values of the parameters due to shrinkage (Candès and Tao, 2007). Therefore, we use a two-step procedure in practice, which is referred to as the Gauss-Dantzig selector in Candès and Tao (2007). Specifically, in the first step, we apply the proposed penalized A-learning to select important variables for making an optimal decision, i.e., those variables with nonzero estimated coefficients. Then, in the second step, their coefficients are re-calculated by solving the unpenalized A-learning estimating equations with the selected variables only.
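A sketch of the second (refitting) step, assuming the same linear-in-β estimating function as in the earlier sketch:

```python
import numpy as np

def gauss_dantzig_refit(X, A, Y, pi_hat, h_hat, beta_dantzig):
    """Second step of the Gauss-Dantzig procedure: keep the variables
    selected by the penalized fit and solve the unpenalized A-learning
    estimating equation c - M beta = 0 restricted to those variables."""
    sel = np.flatnonzero(np.abs(beta_dantzig) > 1e-8)
    Xs = X[:, sel]
    w = A - pi_hat
    n = len(Y)
    c = Xs.T @ (w * (Y - h_hat)) / n
    M = (Xs * (w * A)[:, None]).T @ Xs / n
    beta = np.zeros(X.shape[1])
    beta[sel] = np.linalg.solve(M, c)      # exact root on the selected set
    return beta
```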
4. Simulation Studies
4.1. Settings
To evaluate the numerical performance of the proposed penalized A-learning method, we consider simulation studies with two treatment decision points, based on the following model:
(4.1) |
where A(j), j = 1, 2, is the treatment given at the jth stage, S(j), j = 1, 2, denotes the covariate information collected before the jth treatment is given, and Y is the final response of interest. The random error ε follows a normal distribution with mean 0 and variance 0.25. Here, the baseline covariates follow a multivariate normal distribution with mean 0 and covariance Iq. In addition, the intermediate covariate S(2) is a scalar and is generated as , where e follows a normal distribution with mean 0 and variance 0.25.
We set β0 = 0. Based on model (4.1), the optimal treatment regime at stage 2 is . Following this optimal treatment regime at stage 2, the Q-function at stage 1 is given by
where and a+ = (|a| + a)/2. Therefore, the contrast function is C(S(1)) = Q1(S(1), 1) − Q1(S(1), 0), and thus the optimal treatment regime at stage 1 is I{C(S(1)) > 0}.
To evaluate the double robustness of the proposed method, we consider a variety of scenarios with correctly specified and misspecified baseline mean and/or propensity score models. At stage 2, a linear model with covariates S(1), S(2) and A(1) is fitted for the baseline mean function, while the true baseline mean function is . We choose β1 = 0q, for which the baseline mean function is correctly specified, and β1 = (04, 1, −1, 0q−6)T, for which it is misspecified. At stage 1, a linear model with covariates S(1) is fitted for the baseline mean function, which is always misspecified. Logistic models are used for estimating the propensity scores; they are correctly specified for the constant model but misspecified for the probit model. The following four settings are considered:
Setting 1: β1 = 0q, P (A(2) = 1) = 0.5;
Setting 2: β1 = (04, 1, −1, 0q−6)T, P(A(2) = 1) = 0.5;
Setting 3: β1 = 0q, P (A(2) = 1) = Pr(N(0, 1) ≤ ST γ);
Setting 4: β1 = (04, 1, −1, 0q−6)T, P (A(2) = 1) = Pr(N(0, 1) ≤ ST γ),
where S = ((S(1))T, S(2))T and N(0, 1) denotes a standard normal random variable. For the other parameters, we choose P(A(1) = 1) = 0.5, β2 = (0, 0, 1, −1, 0q−4)T, σ1 = σ2 = 0.5, d = (d0, d1, d2, d3)T = (0, 1, 1, 1)T, and γ = (0q−2, 1, −1, 1)T. Table 1 summarizes the model misspecification status of the baseline mean and propensity score models and the associated important variables under the different settings. In the next section, we present simulation results for the four settings with q = 1000 and sample sizes n = 150 and n = 300, over 500 replications.
Table 1.

| Setting | Stage | Baseline | Propensity Score | Important Variables |
|---|---|---|---|---|
| Setting 1 | Stage 2 | right | right |  |
|  | Stage 1 | wrong | right |  |
| Setting 2 | Stage 2 | wrong | right |  |
|  | Stage 1 | wrong | right |  |
| Setting 3 | Stage 2 | right | wrong |  |
|  | Stage 1 | wrong | right |  |
| Setting 4 | Stage 2 | wrong | wrong |  |
|  | Stage 1 | wrong | right |  |
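For concreteness, a skeleton of the data generation for these settings is sketched below. Since the displayed forms of (4.1) and of the S(2) mechanism are elided above, gen_s2 is a hypothetical placeholder; only the dimensions, error variances, treatment probabilities and γ are taken from the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
q, n = 1000, 300
S1 = rng.normal(size=(n, q))                 # baseline covariates ~ N(0, I_q)
A1 = rng.binomial(1, 0.5, size=n)            # P(A(1) = 1) = 0.5

def gen_s2(S1, A1):
    """Hypothetical placeholder: the displayed S(2) mechanism is elided in
    the text; only the N(0, 0.25) error is taken from the source."""
    e = rng.normal(scale=0.5, size=len(A1))
    return A1 + e                            # placeholder linear form

S2 = gen_s2(S1, A1)
S = np.column_stack([S1, S2])                # S = ((S(1))^T, S(2))^T
gamma = np.concatenate([np.zeros(q - 2), [1.0, -1.0, 1.0]])  # per the text
A2_rand = rng.binomial(1, 0.5, size=n)               # Settings 1 and 2
A2_probit = rng.binomial(1, norm.cdf(S @ gamma))     # Settings 3 and 4
```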
4.2. Competing methods
We further compare our method with outcome weighted learning (OWL, Zhao et al., 2012), a robust method that estimates the individualized treatment rule by directly maximizing the estimated value function. Zhao et al. (2015) further introduced backward outcome weighted learning (BOWL) and simultaneous outcome weighted learning (SOWL) to extend these methods to multi-stage studies. Here, we consider a doubly robust version of BOWL (DR-BOWL) for comparison. For a single-stage study, the developed DR-BOWL method is similar to the residual weighted learning method (Zhou et al., 2015).
Specifically, we first estimate the propensity score and baseline as in Section 2. We consider the linear decision rule I(xTβ20 > 0) and estimate β20 by minimizing the following loss function:
The penalty term in the original OWL is . We replace it with the L1 norm here to simultaneously select variables. We then construct the pseudo-outcome using the augmented inverse propensity score estimator (AIPWE, Zhang et al., 2012),
where , and is an estimate of . Here, we fit a linear model for Mean(Y | A = 1, X) and use nonconcave penalized regression with the SCAD penalty to obtain . Denote by and the estimated propensity score and baseline models at the first stage. We consider a linear treatment regime of the form and estimate by
Tuning parameters and are obtained by minimizing a value-based BIC criterion.
4.3. Results
Table 2 summarizes the variable selection results for optimal treatment decisions and the empirical performance of the estimated optimal treatment regime relative to the true optimal regime, using our penalized A-learning method (denoted by PAL) and DR-BOWL, respectively. Specifically, it reports the false negative (FN) rate (the percentage of important variables that are missed), the false positive (FP) rate (the percentage of unimportant variables that are selected), the ratio of value functions (denoted by VR), calculated as the value function of the estimated optimal treatment regime divided by that of the true optimal regime, and the error rate (ER) of the estimated optimal treatment regimes for treatment decision making, at both stages. Here, the ER at stage 2 is calculated as the mean of and at stage 1 as the mean of . The value function of a given treatment regime is calculated using Monte Carlo simulations based on 10,000 replications. The VR at stage 2 (denoted by VR*) compares the estimated optimal treatment regime at stage 2, combined with a randomly assigned treatment at stage 1 as in the simulated data, with the true optimal dynamic treatment regime for both stages. The VR at stage 1 compares the estimated optimal dynamic treatment regime with the true optimal dynamic treatment regime for both stages.
Table 2.
| n | Method | Stage 2: FN | FP | VR* | ER | Stage 1: FN | FP | VR | ER |
|---|---|---|---|---|---|---|---|---|---|
| Setting 1 |  |  |  |  |  |  |  |  |  |
| 150 | PAL | 12.6 | 0.1 | 64.7 | 6.1 | 63.8 | 0.1 | 98.3 | 7.0 |
|  | DR-BOWL | 85.7 | 0.1 | 39.0 | 34.7 | 99.5 | 0.1 | 39.1 | 48.3 |
| 300 | PAL | 1.1 | 0.1 | 65.4 | 2.6 | 41.9 | 0.1 | 99.7 | 6.2 |
|  | DR-BOWL | 78.1 | 0.1 | 49.2 | 27.5 | 98.0 | 0.2 | 50.2 | 48.3 |
| Setting 2 |  |  |  |  |  |  |  |  |  |
| 150 | PAL | 25.9 | 0.1 | 57.8 | 10.4 | 56.2 | 0.2 | 90.8 | 15.7 |
|  | DR-BOWL | 86.3 | 0.1 | 35.1 | 35.6 | 99.0 | 0.2 | 35.9 | 47.2 |
| 300 | PAL | 11.0 | 0.1 | 59.6 | 6.2 | 32.5 | <0.05 | 97.9 | 8.0 |
|  | DR-BOWL | 79.8 | 0.1 | 42.4 | 29.9 | 97.2 | 0.2 | 44.6 | 47.1 |
| Setting 3 |  |  |  |  |  |  |  |  |  |
| 150 | PAL | 33.7 | 0.3 | 59.9 | 13.5 | 64.5 | 0.1 | 93.0 | 9.1 |
|  | DR-BOWL | 18.8 | 1.3 | 60.2 | 7.5 | 72.3 | 0.5 | 92.4 | 24.4 |
| 300 | PAL | 12.3 | 0.3 | 64.2 | 7.2 | 52.7 | <0.05 | 98.3 | 6.9 |
|  | DR-BOWL | 74.9 | 0.2 | 55.3 | 23.2 | 97.8 | <0.01 | 56.4 | 48.4 |
| Setting 4 |  |  |  |  |  |  |  |  |  |
| 150 | PAL | 55.7 | 0.2 | 48.2 | 22.4 | 62.2 | 0.1 | 79.4 | 17.7 |
|  | DR-BOWL | 75.0 | 0.1 | 51.0 | 23.4 | 99.0 | <0.01 | 51.7 | 47.2 |
| 300 | PAL | 26.4 | 0.3 | 56.2 | 13.2 | 36.4 | <0.05 | 94.3 | 8.4 |
|  | DR-BOWL | 74.9 | 0.2 | 50.9 | 23.1 | 97.4 | <0.01 | 52.8 | 47.0 |
FN: proportion of related variables with zero coefficients
FP: proportion of unrelated variables with nonzero coefficients
VR: value ratio between estimated and true treatment regimes
ER: error rate of estimated treatment regimes
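In code, the FN, FP and ER metrics above can be computed as follows (a sketch; the ER definition is our reading of the text, comparing the estimated and true optimal linear rules subject by subject):

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """FN/FP rates as defined in Section 4.3: FN is the percentage of
    important variables missed, FP the percentage of unimportant
    variables selected."""
    sel = np.abs(beta_hat) > 1e-8
    imp = np.abs(beta_true) > 0
    fn = 100.0 * np.mean(~sel[imp])
    fp = 100.0 * np.mean(sel[~imp])
    return fn, fp

def error_rate(X, beta_hat, beta_true):
    """ER (assumed definition): proportion of subjects for whom the
    estimated linear rule disagrees with the true optimal rule."""
    return np.mean((X @ beta_hat > 0) != (X @ beta_true > 0))
```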
The DR-BOWL method fails in all settings. Take Setting 1 with n = 300 as an example: FN = 78.1% at the second stage, where the baseline, propensity score and contrast functions are all correctly specified, so DR-BOWL missed approximately three quarters of the important variables. Moreover, VR = 50.2%, indicating the poor performance of the estimated treatment rules.
On the other hand, the overall performance of our penalized A-learning method is good. We make the following observations. First, the FN rates are much higher than the FP rates. This suggests that the Dantzig selector tends to give conservative variable selection results, which is commonly seen in the literature. Second, the variable selection results and the error rates of the estimated optimal treatment regime at stage 2 are generally much better than those at stage 1, which is expected since the optimal linear treatment decision rule is correctly specified at stage 2 but not at stage 1. At stage 1, for n = 150, over 55% of the important variables are not selected in all four settings. Third, our method requires correct specification of either the propensity score or the baseline model, especially when the sample size is small. This is implied by comparing the results in Setting 4 with the other three settings. For example, when n = 150, the false negative rate at the second stage reaches 55.7%, which is much higher than the FN rates in the other three settings. Moreover, our estimator is very efficient in Setting 1, where both models are correctly specified. Even when n = 150, the ratio of the value functions reaches 98.3%, and all error rates are around 6–7%. These results are even comparable with those under Settings 2 and 3 when n = 300. Lastly, the estimation and variable selection performance of the estimated optimal dynamic treatment regimes improves as the sample size increases. In particular, in Settings 1–3 with n = 300, the VRs are all at least 97.9% and the ERs are all below 8%, which implies that the estimated optimal treatment regimes nearly maximize the value functions.
4.4. Nonregularity
As suggested by one of the referees, we further examine our method under settings with different degrees of nonregularity. Specifically, we consider a setting where all covariates in S(1) are independent Rademacher random variables. We set S(2) to be another Rademacher random variable independent of S(1) and A(1).
Denote A(1)* = 2A(1) − 1. The response Y is generated as follows:
(4.2) |
where ε ~ N(0, 0.25).
For each stage, we fit linear models for the baseline and contrast functions, and a logistic regression model for the propensity score. The parameter β in (4.2) determines the baseline function at the second stage. Similar to the regular case discussed in Section 4.1, we consider four settings here:
Setting 1: β = 0q, P (A(2) = 1) = 0.5;
Setting 2: β = (04, 1, −1, 0q−6)T, P(A(2) = 1) = 0.5;
Setting 3: β = 0q, P(A(2) = 1) = Pr(N(0, 1) ≤ STγ);
Setting 4: β = (04, 1, −1, 0q−6)T, P(A(2) = 1) = Pr(N(0, 1) ≤ STγ),
where S = ((S(1))T, S(2))T and γ = (0q−2, 1, −1, 1)T.
Parameters δ1 and δ2 in (4.2) control the degree of nonregularity at the second stage. We consider three choices of δ1 and δ2. Setting δ1 = 1 and δ2 = 1, we obtain

Setting δ1 = 1.1 and δ2 = 1.1, we have

Setting δ1 = 1 and δ2 = 1.1, we have
With some calculation, we can show that the Q-function at the first stage takes the following form:

Hence, the contrast function is correctly specified at the first stage. When δ1 = 1 and δ2 = 1, or δ1 = 1.1 and δ2 = 1.1, we have f1 = f2 = 1. When δ1 = 1 and δ2 = 1.1, we have f1 = f2 = 0.95. Information about model specification and the important variables in the contrast function is given in Table 3.
Table 3.

| Setting | Stage | Baseline | Propensity Score | Important Variables |
|---|---|---|---|---|
| Setting 1 | Stage 2 | right | right |  |
|  | Stage 1 | right | right |  |
| Setting 2 | Stage 2 | wrong | right |  |
|  | Stage 1 | right | right |  |
| Setting 3 | Stage 2 | right | wrong |  |
|  | Stage 1 | right | right |  |
| Setting 4 | Stage 2 | wrong | wrong |  |
|  | Stage 1 | right | right |  |
We also consider two choices of sample size, n = 150 and n = 300. This gives a total of 24 scenarios. For each scenario, we report FN, FP, VR and ER as in Section 4.3. The ERs for the first and second stages are calculated as
and
Compared to the definitions in Section 4.3, the error rates here are calculated with respect to patients with nonzero contrast functions. These definitions are more meaningful because, for patients whose contrast function equals zero, both treatments are optimal. We run 200 replications. Results are reported in Table 4.
Table 4.

| n | Nonregularity | Stage 2: FN | FP | VR* | ER | Stage 1: FN | FP | VR | ER |
|---|---|---|---|---|---|---|---|---|---|
| Setting 1 |  |  |  |  |  |  |  |  |  |
| 150 | δ1 = 1, δ2 = 1 | 0 | <0.01 | 53.6 | 0 | 4.0 | 0.3 | 93.9 | 5.0 |
|  | δ1 = 1.1, δ2 = 1.1 | 0 | <0.01 | 53.5 | 0.1 | 0.5 | 0.3 | 95.1 | 3.2 |
|  | δ1 = 1, δ2 = 1.1 | 0 | 0.01 | 52.9 | 5.0 | 1.0 | 0.3 | 93.8 | 4.2 |
| 300 | δ1 = 1, δ2 = 1 | 0 | <0.01 | 53.3 | 0 | 0 | 0.4 | 97.3 | 0.9 |
|  | δ1 = 1.1, δ2 = 1.1 | 0 | <0.01 | 53.9 | 0 | 0 | 0.3 | 97.9 | 0.9 |
|  | δ1 = 1, δ2 = 1.1 | 0 | <0.01 | 52.8 | 2.0 | 0 | 0.3 | 97.0 | 0.9 |
| Setting 2 |  |  |  |  |  |  |  |  |  |
| 150 | δ1 = 1, δ2 = 1 | 0 | <0.05 | 46.1 | 0 | 14.7 | 0.3 | 90.7 | 5.6 |
|  | δ1 = 1.1, δ2 = 1.1 | 0 | <0.05 | 45.9 | 2.0 | 14 | 0.3 | 89.5 | 5.7 |
|  | δ1 = 1, δ2 = 1.1 | 0 | <0.05 | 44.4 | 11.6 | 9.7 | 0.3 | 89.7 | 11.5 |
| 300 | δ1 = 1, δ2 = 1 | 0 | <0.01 | 45.8 | 0 | 0 | 0.2 | 97.1 | 0.4 |
|  | δ1 = 1.1, δ2 = 1.1 | 0 | <0.01 | 45.6 | 0.5 | 0 | 0.2 | 96.4 | 0.4 |
|  | δ1 = 1, δ2 = 1.1 | 0 | 0.01 | 45.1 | 6.8 | 0 | 0.2 | 98.1 | 7.4 |
| Setting 3 |  |  |  |  |  |  |  |  |  |
| 150 | δ1 = 1, δ2 = 1 | 5.7 | 0.6 | 45.0 | 2.9 | 19.0 | 0.2 | 85.6 | 4.1 |
|  | δ1 = 1.1, δ2 = 1.1 | 8.2 | 0.5 | 45.1 | 6.6 | 17.0 | 0.2 | 87.3 | 3.4 |
|  | δ1 = 1, δ2 = 1.1 | 6.3 | 0.6 | 44.6 | 14.6 | 18.5 | 0.2 | 85.8 | 4.1 |
| 300 | δ1 = 1, δ2 = 1 | 0 | 0.1 | 53.1 | 0 | 0 | 0.3 | 97.1 | 1.0 |
|  | δ1 = 1.1, δ2 = 1.1 | 0 | 0.1 | 53.8 | 1.4 | 0 | 0.3 | 98.0 | 0.6 |
|  | δ1 = 1, δ2 = 1.1 | 0 | 0.1 | 52.9 | 8.4 | 0 | 0.3 | 97.9 | 0.8 |
| Setting 4 |  |  |  |  |  |  |  |  |  |
| 150 | δ1 = 1, δ2 = 1 | 20.7 | 0.5 | 25.4 | 8.6 | 52 | 0.2 | 66.6 | 14.0 |
|  | δ1 = 1.1, δ2 = 1.1 | 20.8 | 0.5 | 25.3 | 12.4 | 54.2 | 0.2 | 62.5 | 14.9 |
|  | δ1 = 1, δ2 = 1.1 | 21.5 | 0.6 | 23.7 | 22.6 | 51.7 | 0.2 | 61.7 | 22.1 |
| 300 | δ1 = 1, δ2 = 1 | 0.3 | 0.2 | 44.9 | 0.2 | 3.3 | 0.2 | 95.8 | 0.7 |
|  | δ1 = 1.1, δ2 = 1.1 | 0 | 0.2 | 44.8 | 3.9 | 0.2 | 0.2 | 97.5 | 0.4 |
|  | δ1 = 1, δ2 = 1.1 | 0 | 0.2 | 43.8 | 13.2 | 0.3 | 0.2 | 97.1 | 8.3 |
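The restricted error rate used in this subsection can be computed as below (a sketch; C_true denotes the true contrast values, available from the simulation truth):

```python
import numpy as np

def error_rate_nonregular(X, beta_hat, C_true):
    """ER restricted to patients with nonzero true contrast (Section 4.4);
    patients with C_true = 0 are excluded since both treatments are
    optimal for them."""
    mask = np.abs(C_true) > 1e-12
    return np.mean((X[mask] @ beta_hat > 0) != (C_true[mask] > 0))
```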
Within each setting, most results are similar across the different choices of δ1 and δ2, suggesting that the nonregularity issue does not have a large impact on the variable selection results. Apart from the results in Setting 4, the false negative and false positive rates are all very small. When the sample size increases to 300, the false negatives in most scenarios are exactly 0, while the false positives in all settings are below 0.4%, demonstrating the excellent variable selection performance of our method. In Settings 1–3, most error rates are below 7% and the ratios of value functions are all above 85%, indicating that our estimated optimal treatment regimes are very close to the truth in these scenarios.
5. Application to STAR*D Study
We applied the proposed method to a dataset from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, which was conducted to compare different treatments for patients with major depressive disorder (MDD). There were 4041 participants (age 18–75) with nonpsychotic MDD enrolled in this study. At the first level, all participants were treated with citalopram (CIT) for up to 14 weeks. Subsequently, three more levels of treatments were provided for participants without a satisfactory response to CIT. At each level, participants were randomly assigned to treatment options acceptable to them. At Level 2, participants were eligible for seven treatment options: sertraline (SER), venlafaxine (VEN), bupropion (BUP), cognitive therapy (CT), and augmenting CIT with bupropion (CIT+BUP), buspirone (CIT+BUS) or cognitive therapy (CIT+CT). Participants without a satisfactory response to CT proceeded to Level 2A for additional medication treatments. All participants who did not respond satisfactorily at Level 2 or 2A were eligible for four treatments at Level 3: medication switch to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Participants without a satisfactory response at Level 3 were re-randomized at Level 4 to either tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN). See Fava et al. (2003) and Rush et al. (2004) for more details of the STAR*D study. One goal of the study was to determine which treatment strategies, in what order or sequence, provide the optimal treatment effect.
As an illustration, we focused on the subset of participants who were given treatment BUP or SER at Level 2 and did not achieve a satisfactory response, and then were randomized to treatment MIRT or NTP at Level 3. For this study, we considered 381 covariates collected at baseline and intermediate levels as possibly relevant predictors. For the treatment regime at Level 3, all 381 covariates as well as the assigned treatment at Level 2 were considered as possible predictors for making the optimal treatment decision. For the treatment regime at Level 2, the 305 covariates collected before the Level 2 treatment was given were considered. The negative of the 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) score was used as the final response, which measures the symptomatic status of depression. There were 73 participants with complete records in the subset of interest. Among these participants, 36 were treated with BUP and 37 with SER at Level 2, and 33 were treated with NTP and 40 with MIRT at Level 3.
The selection and estimation results are summarized as follows. At Level 3, our method selected two covariates: “age” in baseline demographics (AGE) and the suicide risk of the patient (SUICD). The estimated optimal treatment regime is I(1.459 − 0.091 × AGE + 0.158 × SUICD ≥ 0), where 1 represents treatment NTP and 0 represents treatment MIRT. This optimal treatment regime assigns 27 participants to NTP and the remaining 46 participants to MIRT. At Level 2, our method also selected two covariates: age and the “QIDS-C percent improvement” in the clinic visit clinical record form at Level 1 (QCIMP). The estimated optimal treatment regime is I(−8.600 + 0.145 × AGE + 0.125 × QCIMP ≥ 0), where 1 stands for treatment BUP and 0 stands for treatment SER. This optimal treatment regime assigns 37 participants to BUP and the remaining 36 participants to SER.
To further examine the estimated optimal dynamic treatment regime, we compare its estimated value function with the values under the four non-dynamic treatment regimes BUP+NTP, BUP+MIRT, SER+NTP and SER+MIRT. For a given dynamic treatment regime d = (d(1), d(2)), we evaluate its average value function using the AIPWE (Zhang et al., 2013),
where , , , , and denotes the assigned treatment for the ith patient according to d(2) and d(1). Based on this formula, we report the estimated value functions of the four non-dynamic treatment regimes in Table 5.
Table 5.

| Treatment Regime | Estimated Value |
|---|---|
| Estimated optimal regime | −9.02 |
| BUP + NTP | −12.86 |
| BUP + MIRT | −12.57 |
| SER + NTP | −12.57 |
| SER + MIRT | −12.28 |
Estimating the value of the optimal treatment regime is well known to be a non-regular problem when there is a nonzero probability that the contrast function (at either the second or the first stage) is equal to zero. To evaluate the value function under our estimated optimal treatment regime, we consider the online estimator proposed by Luedtke and van der Laan (2016). Specifically, for i = ln + 1, ln + 2, …, n, we obtain the estimated optimal dynamic treatment regime and its associated parameters , , propensity score functions , , and baseline functions ĥ(i)(2), ĥ(i)(1) based on data from patients 1 to i − 1, using penalized A-learning. Then we evaluate the value of on the ith patient using the AIPWE (Zhang et al., 2013),
The variance of conditional on data from patients 1 to i − 1 is evaluated by
where is the estimated value of on the jth patient.
The final estimator is given by
with the estimated standard error
Since the sample size of our dataset is small, we choose ln ≈ 2n/3, i.e., ln = 49. The estimated value is −9.02 with an estimated standard error . From Table 5, we can see that the value under our estimated treatment regime is much larger than those under the four non-dynamic treatment regimes.
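The structure of this online procedure can be sketched as follows. Here fit_regime and evaluate_aipwe are hypothetical placeholders for the penalized A-learning fit and the AIPWE evaluation above, and the inverse-standard-deviation weighting and the form of the standard error are our reading of Luedtke and van der Laan (2016):

```python
import numpy as np

def online_value_estimate(data, l_n, fit_regime, evaluate_aipwe):
    """Sketch of the one-step online value estimator: for each i > l_n, fit
    the regime on patients 1..i-1 and evaluate it on patient i alone, then
    combine the evaluations with inverse-sd weights (assumed weighting)."""
    n = len(data)
    vals, sds = [], []
    for i in range(l_n, n):                    # 0-indexed: patients l_n+1..n
        regime = fit_regime(data[:i])          # penalized A-learning on 1..i-1
        v_i, sd_i = evaluate_aipwe(regime, data[i], data[:i])
        vals.append(v_i)
        sds.append(sd_i)
    w = 1.0 / np.asarray(sds)                  # inverse-sd weights
    psi = float(np.sum(w * np.asarray(vals)) / np.sum(w))
    se = float(np.sqrt(len(vals)) / np.sum(w)) # assumed form of the SE
    return psi, se
```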
6. Oracle inequalities for and the value function of the estimated regime at the second stage
We first introduce some notation. For an arbitrary matrix Φ ∈ ℝM×M and an arbitrary vector ϕ ∈ ℝM, the superscript Φj is used to denote the jth column of Φ, ϕj the jth element of ϕ, while the subscript Φi denotes the ith row of Φ. For subsets J, J′ ⊂ {1, …, M}, let |J| be the cardinality of J and Jc the complement of J. We denote by ϕJ the vector in ℝ|J| that has the same coordinates as ϕ on J, ΦJ the submatrix formed by the columns in J, and the submatrix formed by the rows in J and columns in J′. The support of ϕ is defined by supp(ϕ) = {j ∈ {1, …, M} : ϕj ≠ 0}. Let ‖ϕ‖p be the Lp norm of ϕ and ‖Φ‖p the operator norm induced by the vector p-norm. If Φ is positive semidefinite, define
Let ‖Y‖ψp be the Orlicz norm of a random variable Y, defined as

‖Y‖ψp = inf{C > 0 : E exp(|Y|p/Cp) ≤ 2},

for some p ≥ 1. For any two positive sequences {an} and {bn}, an ≫ bn means limn bn/an = 0. Throughout this paper, we use c0 and similar symbols to denote universal constants, whose values may change from place to place.
6.1. Oracle inequality for
Recall C(2)(x) = xTβ2, according to our assumption. Let β2,0 denote the true value of β2. Define for some 0 ≤ l6 < 1, the nonsparsity size of β2,0, the support of β2,0. We allow the number of covariates p to grow exponentially fast with respect to the sample size n, i.e., for some 0 < a2 < 1. To deal with such NP-dimensionality, following Zhou (2009), we assume

(6.1) X = UΣ1/2,

where U = (U1, …, Un)T and U1, …, Un are i.i.d. copies of a p-dimensional isotropic random vector U0. More specifically, we require that for any vector a ∈ ℝp,

(6.2) E(aTU0)2 = ‖a‖22 and ‖aTU0‖ψ2 ≤ ω‖a‖2,

for some isotropic constant ω.
Remark 6.1
The definition of an isotropic random vector was first introduced by Milman and Pajor (2003). Vectors of independent normal and independent Rademacher random variables are the two most important examples of isotropic random vectors. More generally, the coordinates of an isotropic random vector need not be independent. They can be distributed uniformly on various convex and symmetric bodies, for example, an appropriate multiple of the unit ball in ℝp equipped with the LK-norm for any 1 ≤ K ≤ ∞. For these distributions, we denote by ωK their isotropic constants. It is further shown in Milman and Pajor (1989) that the ωK are uniformly bounded for K ≥ 1. However, it remains unknown whether the isotropic property holds for all uniform distributions on arbitrary symmetric convex bodies with Lebesgue measure 1.
Remark 6.2
The isotropic formulation requires the coordinates of U0 to be uncorrelated, and hence does not allow for correlated Bernoulli variables. However, according to our definition X = UΣ1/2, different covariates in the design matrix X can be correlated when Σij ≠ 0. Such a formulation allows us to impose conditions on the tail of U0 and on the covariance matrix Σ separately.
Since the A-learning estimating equation involves the plug-in estimators and , we need some conditions on these two estimators to establish oracle inequalities for . More precisely, we assume that and converge to some and , respectively. When the propensity score model π(2) and the baseline model h(2) are correctly specified, and represent the true coefficients in these two models. When the models are misspecified, and correspond to the population-level least favorable parameters. Denote and the support of and , respectively. Let and , the number of nonzero elements. We assume and for some 0 ≤ l4, l5 < 1/2.
Condition 1
Assume that there exist some positive constants and , such that with probability at least ,
(6.3) |
(6.4) |
Moreover, assume and , where and .
Remark 6.3
Condition 1 assumes the weak oracle properties of and , i.e., selection consistency and consistency under L∞ norm. The weak oracle properties of and are established in Theorems 8.1 and 8.2 of Section 8, respectively.
Define
and .
Condition 2
Assume that matrices D(2), C(2) and Σ satisfy
Define and with . For any positive semidefinite matrix Ψ ∈ ℝp×p, integer s and positive number c, define function K(s, c, Ψ) as follows,
The following condition ensures that the RE condition holds for the matrix .
Condition 3
Assume that for any 0 < θs < 1 and sufficiently large n, we have
(6.5) |
where denotes the set of vectors α2 that satisfy the weak oracle property (6.3).
Remark 6.4
It is tedious to verify (6.5) directly, due to the plug-in estimator . The key to proving such a result is that the estimator in should be sparse. That is the reason we use penalized regression with a folded-concave penalty to obtain , since it ensures selection consistency of the estimator. We provide general results characterizing the UUP and RE conditions for the random matrix in Lemmas 9.1 and 9.2 of Section 9.
To establish the oracle inequality for , we first provide an upper bound for
which is given in the following Lemma.
Lemma 6.1
Assume that Conditions 1 and 2 hold, , , a2 + l4 < 1, and that either π(2) or h(2) is correctly specified. Then, for sufficiently large n, there exists some constant c(2) such that, with probability at least ,
where
, and .
Remark 6.5
Recall that , for some 0 ≤ a2, l4 < 1. The condition a2 + l4 < 1 implies .
Remark 6.6
Here, E1 describes how the curse of dimensionality takes effect, E2 is due to estimation errors of and , E3 and E4 are due to model misspecification. Since we assume that at least one of h(2) and π(2) is correctly specified, either E3 or E4 is zero.
Theorem 6.1
Assume that conditions in Lemma 6.1 and Condition 3 hold, and where the constant c(2) is defined in Lemma 6.1. Then, for some fixed 0 < θs < 1 and sufficiently large n, the following two inequalities hold with probability at least for some constant :
(6.6) |
(6.7) |
Moreover, we have .
From (6.6), it is immediate to see that as long as
(6.8) |
which implies the doubly robust property of . We provide a sufficient condition for (6.8) in the following Corollary.
Corollary 6.1 (Double robustness of )
Assume that conditions in Theorem 6.1 and the following conditions hold:
(6.9) |
(6.10) |
(6.11) |
If either the baseline h(2) or the propensity score model π(2) is correctly specified, then .
Remark 6.7
Condition (6.9) imposes a constraint between the sparsity of the population parameters and the convergence rates of and . When , it requires and to be consistent under the L2 norm. Condition (6.10) automatically holds for the SCAD penalty when and .
6.2. Oracle inequality for the value function of the estimated regime at the second stage
Now we establish the error bound for the difference between the mean responses (i.e., the value functions) of the estimated optimal regime at the second stage and the true optimal one, for an individual with covariates X0. Here, X0 is also assumed to have the form Σ1/2U with Σ and U defined in (6.1), and to be independent of Xi, i = 1, …, n. In addition, the regime at the first stage is taken to be the treatment actually received at the first stage.
Under the assumptions of SUTVA and no unmeasured confounders, the difference of the corresponding value functions is given by
(6.12) |
Since (6.12) is nonnegative, it suffices to provide an upper bound. Here, we impose the following condition.
Condition 4
The probability density function g(2)(·) of exists and is bounded.
Condition 4 is a mild condition on the true optimal decision function, which holds in most cases when at least one of the important covariates (i.e., those whose corresponding components of β2,0 are nonzero) is continuous.
Theorem 6.2
Assume that conditions in Theorem 6.1 and Condition 4 hold. Assume . Then, for fixed 0 < θs < 1 and sufficiently large n,
Remark 6.8
The error bound for the difference of the value functions follows from the error bound on and Condition 4. Since the first term in the upper bound is small, the difference of the value functions is mainly characterized by the second term, which is of the order .
7. Error bounds for and the value function of the estimated dynamic treatment regime
7.1. Misspecified contrast function
In the context of A-learning, a major challenge arising in multi-stage studies is that the contrast functions are likely to be misspecified in the backward induction. In order to study the finite-sample bounds of , we first need to define the least favorable parameters under misspecification of the contrast function.
Recall that C(1)(Si) is the true contrast function for the ith patient, which can be a very complex function of Si due to the backward induction. For notational convenience, we use the shorthand C(s) for C(1)(s). We posit a linear model for C(·), which is often misspecified. When either the propensity score model π(1) or the baseline mean function h(1) is correctly specified, the associated least favorable parameters are defined as follows:
(7.1) |
where
and κ0 is a nonnegative constant. Define
By simple algebra, we can show , where , describing the degree of misspecification of the contrast function. Define for some 0 ≤ l3 < 1/2, where .
7.2. Error bound for
Assume that for some 0 < a1 < 1, and that S1, …, Sn are i.i.d. copies of S0 satisfying
(7.2) |
where Ψ ∈ ℝq×q is some positive definite matrix with Ψjj = 1 for j = 1, …, q, and V0 is a q-dimensional isotropic random vector with isotropic constant ζ. As in the second stage, we first give conditions on and . Assume that these two estimators converge to some and , respectively, under possible model misspecification. Denote , , , and for some 0 ≤ l1, l2 < 1/2.
Condition 5
Assume that there exist some positive constants and such that, with probability at least , the following hold:
(7.3) |
(7.4) |
Moreover, assume and , where and .
Condition 6
Assume that D(1), C(1) and Ψ satisfy
where
and .
Since both the propensity score model and the contrast function at the first stage can be misspecified, we need the following condition to control their effect on estimation of .
Condition 7
Assume that
(7.5) |
where and
Remark 7.1
It is immediate to see τ0 = 0 when either the contrast function or the propensity score model is correctly specified.
When going back to the first stage, the error bound of is directly affected by that of . This is because the estimated response at the first stage is constructed from via the advantage function. To simplify the presentation, we introduce the following condition.
Condition 8
Assume that with probability at least , there exists some constant μ1 > 0 such that
(7.6) |
and .
A more explicit form of the error bound for (7.6) is given in Theorem 6.1. In the next Lemma, we provide an upper bound for the term:
(7.7) |
Lemma 7.1
Assume that Conditions 5–8 and those in Theorem 6.1 hold, , , a1 + l1 < 1, , and either π(1) or h(1) is correctly specified. Then, for sufficiently large n, with probability at least , (7.7) can be bounded from above by c(1)(E5 + E6 + E7 + E8 + E9 + E10) for some constant c(1) > 0, where
Remark 7.2
The terms E5–E8 have interpretations similar to those of E1–E4 in Lemma 6.1, respectively. The additional term E10 is due to the error bound of in the backward induction, while E9 is due to the misspecification of the contrast function.
Define and with . As in stage 2, we need the following condition to ensure the RE condition for the matrix .
Condition 9
Assume that for any 0 < θs < 1 and sufficiently large n, we have
(7.8) |
where denotes the set of vectors α1 that satisfies the weak oracle property (7.3).
Theorem 7.1
Assume that Condition 9 and those conditions in Lemma 7.1 hold, and . The constant c(1) is defined in Lemma 7.1. Then, there exists a constant c8, such that for sufficiently large n and some fixed 0 < θs < 1, with probability at least 1 − c8/(n + p + q), the error bounds for are given by
(7.9) |
(7.10) |
7.3. Error bound for the value function of the estimated dynamic treatment regime
Under the SUTVA and sequential randomization assumptions, the value function of a given dynamic treatment regime (d1(S0), d2(X0)) is given by
where S0 and X0 denote the baseline covariates and the covariates for the second stage, respectively. Then, the difference between the value functions under the estimated optimal dynamic treatment regime (2.11) and the true optimal regime is given by
Similar to Condition 4, we impose the following condition.
Condition 10
Assume that the probability density function g(1)(·) of exists and is bounded.
Theorem 7.2
Assume that conditions in Theorem 7.1 and Condition 10 hold. Assume , . Then, for some fixed 0 < θs < 1 and sufficiently large n,
Remark 7.3
Theorem 7.2 suggests that the upper bound for the difference of the value functions comes from three major components: the misspecification of the contrast function, described by , and the estimation errors of and .
8. Weak oracle properties of ’s and ’s
In order to prove the error bounds of , and the value functions of the estimated treatment regimes presented in Sections 6 and 7, we need to establish the weak oracle properties of and (j = 1, 2) in the posited models for the propensity score and baseline mean functions. Here, we prove the results based on a posited logistic regression model for the propensity score and a linear model for the baseline mean function under a random design setting. However, these results can be extended to generalized linear models (McCullagh and Nelder, 1989).
8.1. Weak oracle properties of and
We assume that and converge to some population parameters and , respectively. Under Conditions B1–B6 given in the Supplementary Appendix, we establish the weak oracle properties of and in the following two Theorems. Recall that for some 0 ≤ l4 < 1/2.
Theorem 8.1
Assume that Conditions B.1–B.3 hold, l4 + a2 < 1 and . Then, for sufficiently large n, there exist some constants such that, with probability at least ,
Theorem 8.2
Assume that Conditions B.4–B.6 hold, , and , where ei is defined in (2.2). Then, there exist some constants , such that with probability at least ,
Remark 8.1
Theorem 1 in Shi, Song and Lu (2015) established weak oracle results of the penalized estimators in a fixed design setting, mainly for technical convenience; its proofs can be obtained using arguments similar to those in Fan and Lv (2011). In this paper, we focus on a random design setting, which is more realistic in medical studies. To the best of our knowledge, the weak oracle properties of penalized estimators have not been studied in a random design setting with NP-dimensionality. The major difficulty lies in developing random matrix results, such as controlling the maximum eigenvalues of certain random matrices. Such results are established in Theorems 8.1 and 8.2.
Remark 8.2
The condition l4 + a2 < 1 ensures that for large n
(8.1) |
with probability approaching 1. A major technical difficulty in deriving (8.1) is that the matrix does not have a subexponential tail (see Definition G.2 in the Supplementary Appendix). When , we can bound from above by with probability at least 1 − 2/n, which ensures the subexponential tail of the truncated matrix. Lemma B.2 in the Supplementary Appendix proves such a result in a more general case.
8.2. Weak oracle properties of and
The weak oracle properties of can be derived similarly to those of . However, unlike the results for , the weak oracle properties of depend on even when the baseline mean function h(1) is correctly specified. This is because the estimated response is obtained based on . A necessary condition to ensure is that , which is established in Corollary 6.1.
Theorem 8.3
Assume that Condition 8 and Conditions B.7–B.12 in the Supplementary Appendix hold. Further assume that , , , a1 + l1 < 1, and . Then, for sufficiently large n, there exist some and such that, with probability at least , the estimators and satisfy
9. Uniform uncertainty principle and restricted eigenvalue conditions in A-learning
In this section, we establish the UUP and RE conditions in the context of A-learning. In our setting, these two conditions are needed on random matrices and .
For brevity, we only study the UUP and RE conditions for the random matrix . Those for can be similarly derived. Recall that refers to the support of , , and . We assume that the weak oracle properties of are achieved such that with probability at least ,
(9.1) |
for some . The following Lemma establishes the UUP condition for .
Lemma 9.1
Assume the convergence rate of satisfies
and the sample size satisfies
(9.2) |
Then for any 0 < θ < 1, with probability at least , we have
(9.3) |
for any y ∈ ℝp and .
Remark 9.1
In our setting, if the following regularity conditions hold
the requirement on the sample size (9.2) reduces to since .
Remark 9.2
The second term on the right-hand side of (9.3) represents the difference between and , where is defined as the expectation of the truncated random matrix
(9.4) |
This term will vanish as n → ∞. The third term represents the estimation error of . When and , (9.3) proves the UUP condition for .
Remark 9.3
A key assumption in Lemma 9.1 is the sparsity of , which is needed to bound the infinity norm in the indicator function of (9.4). This extra requirement comes from the involvement of the estimated propensity scores in , which adds significant difficulties in proving Lemma 9.1.
After some algebra, the RE condition for follows similarly from Lemma 9.1, which is presented below.
Lemma 9.2
For any integer c0, assume that , and the sample size satisfies
(9.5) |
Then, for any 0 < θ < 1 and sufficiently large n, with probability at least , we have
Remark 9.4
The sample size requirement (9.5) is stronger than (9.2). To see this, for any positive semidefinite matrix Ψ, and positive integers s and c0, we have
10. Discussion
10.1. Post selection inference
As pointed out by one of the referees, the main goal of constructing optimal DTRs is to find treatments that are significantly superior to other treatment options. This requires addressing a post-selection inference issue, i.e., the problem of making inference about the estimated optimal value function (or the difference between the estimated value and the value function under other treatment options). In the fixed-dimension setting, we can use either the empirical average of the advantage function (Murphy, 2003) or augmented inverse propensity score type estimates (AIPWE, Zhang et al., 2012) to estimate the optimal value function. Both types of estimators are asymptotically normally distributed. However, inference based on the advantage function may not be valid in high dimensions. This is because when the number of predictors is large, the parameter estimates in the contrast function may not have the oracle property (i.e., model selection consistency and asymptotic normality).
For a single-stage study, assume a linear interaction form XTβ0 for the contrast function. Under certain conditions, we can show that the AIPWE is asymptotically normal even under NP-dimensionality if (i) , (ii) with probability going to 1, for some constant c0, where Mβ is the support of β0. For our penalized A-learning estimator, Assumption (i) can be achieved under certain conditions on the dimension of covariates, the sample size, and the sparsity of the parameters in the contrast, baseline and propensity score functions. Assumption (ii) is typically satisfied for Lasso, Dantzig and folded-concave type estimators. Similar to Theorem 6.1, we can show that our estimator satisfies with probability going to 1. The asymptotic normality of the AIPWE therefore follows. The standard error of the value estimator can be obtained similarly as in Zhang et al. (2012). Alternatively, we can use the one-step online estimator of Luedtke and van der Laan (2016). However, the asymptotic variance will be larger, since it does not use all the data to construct the estimator. In summary, it is important and interesting to develop statistical inference for the estimated value function under the obtained optimal treatment regime in high dimensions, but this is beyond the scope of the current paper.
10.2. Tuning parameter selection
The Bayesian information criterion (BIC) is used to tune the penalty functions. BIC has been widely used for selecting the tuning parameter in model selection when the goal is prediction. In high-dimensional regressions, Chen and Chen (2008) proposed an extended BIC for model selection and showed that their BIC is consistent when the number of predictors grows polynomially in the sample size. Fan and Tang (2013) proposed a similar criterion and showed its consistency when the number of predictors is of non-polynomial order of the sample size. When the goal is to select treatment effect modifiers, Lu et al. (2011) also used a BIC-type criterion, which showed good empirical performance. This motivated us to use a similar BIC-type criterion for selecting the tuning parameter in our method. Our simulations demonstrate that the proposed BIC-type criterion works well empirically. We conjecture that, following arguments similar to the proofs of Theorem 1 in Chen and Chen (2008) and Theorem 3 in Fan and Tang (2013), our BIC-type criterion can be shown to be consistent for selecting important variables in the contrast function. This is another interesting topic that needs further investigation.
10.3. Extensions to multiple stages and general models
In this paper, we mainly focus on a two-stage study. An extension of the results to three-stage studies is provided in the supplementary article. Establishing these results raises additional challenges, since potential model misspecification of the contrast functions in the previous two stages can accumulate, and stronger assumptions are needed to guarantee consistency of the parameter estimates. Readers may refer to the supplementary article for details.

For technical convenience, we assume a linear interaction form for the contrast function at the last stage. More general results, allowing the contrast function to be misspecified, can be derived similarly to the three-stage studies discussed in the supplementary article.
Supplementary Material
Acknowledgments
We thank the editor, the AE and two referees for providing helpful suggestions that significantly improved the quality of the paper. The STAR*D data are provided by National Institute of Mental Health. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.
Contributor Information
Chengchun Shi, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.
Alin Fan, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.
Rui Song, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.
Wenbin Lu, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.
References
- Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann Statist. 2009;37:1705–1732.
- Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Statist. 2007;35:2313–2351.
- Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19:317–343.
- Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95:759–771.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
- Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, et al. Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Psychiatric Clinics of North America. 2003;26:457–494.
- Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann Statist. 2016;44:713–742.
- Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Statist. 2009;37:3498–3528.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Monographs on Statistics and Applied Probability. London: Chapman & Hall; 1989.
- Mendelson S, Pajor A, Tomczak-Jaegermann N. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom Funct Anal. 2007;17:1248–1282.
- Mendelson S, Pajor A, Tomczak-Jaegermann N. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constr Approx. 2008;28:277–289.
- Milman VD, Pajor A. Isotropic position and inertia ellipsoids and zonoids of the unit ball of a normed n-dimensional space. In: Geometric Aspects of Functional Analysis. Springer; 1989. pp. 64–104.
- Milman V, Pajor A. Regularization of star bodies by random hyperplane cut off. Studia Math. 2003;159:247–261.
- Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65:331–366.
- Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Statist. 2011;39:1180–1210.
- Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.
- Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM, et al. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials. 2004;25:119–142.
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58:267–288.
- Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
- Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018.
- Zhang B, Tsiatis AA, Laber EB, Davidian M. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika. 2013;100:681–694.
- Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107:1106–1118.
- Zhao YQ, Zeng D, Laber EB, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. J Amer Statist Assoc. 2015;110:583–598.
- Zhou S. Restricted eigenvalue conditions on subgaussian random matrices. 2009. arXiv:0912.4045.
- Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR. Residual weighted learning for estimating individualized treatment rules. J Amer Statist Assoc. 2015; in press.