Author manuscript; available in PMC: 2019 Jun 1.
Published in final edited form as: Ann Stat. 2018 May 3;46(3):925–957. doi: 10.1214/17-AOS1570

HIGH-DIMENSIONAL A-LEARNING FOR OPTIMAL DYNAMIC TREATMENT REGIMES

Chengchun Shi 1, Ailin Fan 2, Rui Song 3, Wenbin Lu 4
PMCID: PMC5966293  NIHMSID: NIHMS915736  PMID: 29805186

Abstract

Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many complex diseases, such as cancer, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive strategy is referred to as a dynamic treatment regime. A major challenge in deriving an optimal dynamic treatment regime arises when an extraordinarily large number of prognostic factors, such as a patient's genetic information, demographic characteristics, medical history and clinical measurements over time, are available, but not all of them are necessary for making treatment decisions. This makes variable selection an emerging need in precision medicine.

In this paper, we propose a penalized multi-stage A-learning method for deriving the optimal dynamic treatment regime when the number of covariates is of the non-polynomial (NP) order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector, which directly penalizes the A-learning estimating equations. Oracle inequalities for the proposed estimators of the parameters in the optimal dynamic treatment regime, and error bounds on the difference between the value functions of the estimated optimal dynamic treatment regime and the true optimal dynamic treatment regime, are established. The empirical performance of the proposed approach is evaluated by simulations and illustrated with an application to data from the STAR*D study.

Keywords and phrases: A-learning, Dantzig selector, NP-dimensionality, Model misspecification, Optimal dynamic treatment regime, Oracle inequality

1. Introduction

Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many chronic diseases, such as cancer, cardiovascular disease and diabetes, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive treatment strategy is referred to as a dynamic treatment regime. Formally speaking, a dynamic treatment regime is a sequence of decision rules dictating how the treatment is to be tailored over time to an individual's status. The optimal dynamic treatment regime is defined as the one that yields the most favorable outcome on average.

Various methods have been proposed to estimate the optimal dynamic treatment regime, including Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010) and A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003). Both Q-learning and A-learning rely on a backward induction algorithm to find the optimal dynamic treatment regime; however, Q-learning models the conditional mean of the outcome given predictors and treatment, while A-learning directly models the contrast function that is sufficient for treatment decisions. In particular, A-learning has the so-called double robustness property: when either the baseline mean function or the propensity score model is correctly specified, the resulting A-learning estimator of the contrast function is consistent.

With the fast development of new technology, it has become possible to gather an extraordinarily large number of prognostic factors for each individual, such as a patient's genetic information, demographic characteristics, medical history and clinical measurements over time. For such big data, it is important to make effective use of the information that is relevant to optimal individualized treatment decisions, which makes variable selection an emerging need for implementing precision medicine. In addition, variable selection is an essential tool for making inference in problems where the number of covariates is comparable to or much larger than the sample size. There have been extensive developments of penalized regression methods for variable selection in prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the Dantzig selector (Candès and Tao, 2007), to name a few. In contrast to most penalized regression methods, which add a penalty term to an objective function, the Dantzig selector operates directly on estimating equations.

Although there is a large amount of work on developing variable selection methods for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime. When the number of covariates is fixed, Lu, Zhang and Zeng (2013) introduced a penalized least squares regression framework and established the oracle property of the estimator, which is robust against misspecification of the conditional mean function. Shi, Song and Lu (2015) extended this result to the setting allowing NP-dimensionality of covariates. However, all these works consider only studies with a single treatment decision. When moving to multiple-stage studies, the asymptotic properties of the estimated optimal dynamic treatment regime are much harder to derive, since one needs to handle model misspecification of the contrast functions in the presence of NP-dimensionality of covariates. Moreover, these methods are not doubly robust.

In this paper, we propose a penalized A-learning method for deriving the optimal dynamic treatment regime when the number of covariates is of NP-order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector (Candès and Tao, 2007), which directly penalizes the A-learning estimating equations. The technical challenges and advances of the proposed estimators are described as follows.

First, to prove the theoretical properties of the Dantzig estimator in the linear regression setting, the uniform uncertainty principle (UUP, Candès and Tao, 2007) or the restricted eigenvalue condition (RE, Bickel, Ritov and Tsybakov, 2009) is required on the Gram matrix $X^TX$, where X stands for the design matrix. The UUP condition essentially requires that every principal submatrix with the number of rows or columns less than some specified s behaves like an orthonormal system. The RE condition is the weakest, and hence the most general, condition in the literature ensuring the theoretical properties of the Lasso and Dantzig estimators. A close connection between these two conditions is discussed in Bickel, Ritov and Tsybakov (2009). In the random design case, Candès and Tao (2007) studied the UUP condition for Gaussian, Bernoulli and Fourier ensembles. Mendelson, Pajor and Tomczak-Jaegermann (2007, 2008) obtained a similar result for a more general class of design matrices, the isotropic subgaussian matrices, based on some empirical process results. These results were further extended by Zhou (2009), where the UUP and RE conditions are developed for subgaussian ensembles with a correlated covariance structure. In the proposed penalized A-learning method, however, such conditions are required on matrices involving estimates, such as

\[
X^T \mathrm{diag}\{A \circ (1 - \hat{\pi})\} X, \tag{1.1}
\]

where $A = (A_1, \ldots, A_n)^T$ denotes the vector of treatments received by the n subjects, $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_n)^T$ denotes the corresponding estimated propensity scores, and ∘ denotes the componentwise product operator. The presence of $\hat{\pi}$ in (1.1) adds extraordinary difficulties in establishing theoretical properties of such a random matrix. We establish the UUP and RE conditions under a proper convergence rate of $\hat{\pi}$, which provides a new theoretical framework for studying random matrices that involve estimates of unknown parameters.

Second, in the proposed penalized A-learning method, we need to estimate the baseline mean function and the propensity score model with NP-dimensionality of covariates. We adopt penalized regressions with folded-concave penalties such as SCAD: a linear regression for the baseline mean function and a logistic regression for the propensity score model. Several difficulties need to be addressed in deriving the theoretical properties of the resulting penalized estimators. First, to our knowledge, penalized regressions with folded-concave penalties have seldom been studied in a random design setting. A major difficulty in adapting the existing results for the fixed design case to the random design case is controlling the maximum eigenvalues of certain random matrices,

\[
\max_j \lambda_{\max}\left[X_M^T \mathrm{diag}(|X^j|) X_M\right],
\]

where $\lambda_{\max}[K]$ denotes the maximum eigenvalue of a matrix K, M is a given subset of $\{1, \ldots, p\}$, $X^j$ denotes the jth column of the matrix X, and $X_M$ the submatrix formed by the columns in M. Such a problem is not standard since the matrix $X_M^T\mathrm{diag}(|X^j|)X_M$ does not possess a subexponential tail. We derive concentration inequalities for such random matrices and for sums of subexponential and subgaussian random variables. Based on these results, we establish the weak oracle properties (Lv and Fan, 2009), i.e., sign consistency and the $L_\infty$ convergence rate of the estimators, under subgaussian ensembles, which is one of our major technical contributions. Moreover, the posited models for the baseline mean function or the propensity score may be misspecified. Therefore, the derivation of the asymptotic properties needs to take into account model misspecification with NP-dimensionality of covariates, which is challenging.

Third, a challenge in extending the results for a single treatment decision to sequential treatment decisions is that the contrast functions are likely to be misspecified in a backward induction algorithm such as A-learning. This, together with the NP-dimensionality of covariates, makes it extremely hard to study theoretical properties of the value function under the estimated optimal dynamic treatment regime. We overcome this difficulty by first defining population-level least favorable parameters in the misspecified contrast functions. Moreover, we derive error bounds for the corresponding estimates under model misspecification, which in turn lead to an error bound for the difference between the value functions of the estimated optimal dynamic treatment regime and the underlying true optimal dynamic treatment regime.

The remainder of the paper is organized as follows. We introduce the proposed penalized A-learning method in Section 2. Some implementation issues are addressed in Section 3, followed by simulation results in Section 4. We apply our method to data from the STAR*D study in Section 5. Section 6 studies the error bounds of the penalized A-learning estimator and the difference between the value functions of the estimated optimal regime and the true optimal regime at the second stage. Section 7 establishes such results for the estimates at the first stage. Section 8 presents the weak oracle properties of the penalized estimators in the propensity score and baseline mean models under a random design setting. Section 9 discusses the UUP and RE conditions in the context of A-learning. All technical conditions, lemmas and proofs are given in the Appendix.

2. Penalized A-Learning

For simplicity of presentation, we only consider a two-stage study where binary treatment decisions are made at time points t1 and t2. The data of a subject can be summarized as

\[
O = (S^{(1)}, A^{(1)}, S^{(2)}, A^{(2)}, Y), \tag{2.1}
\]

where $S^{(1)}$ denotes the covariates collected prior to $t_1$, $A^{(1)} \in \{0, 1\}$ is the treatment received at time $t_1$, $S^{(2)}$ denotes the intermediate covariates collected between time points $t_1$ and $t_2$, $A^{(2)} \in \{0, 1\}$ is the treatment received at time $t_2$, and Y is the final outcome of interest. It is assumed that a larger value of Y stands for a better clinical outcome. Denote by $Y(a_1, a_2)$ the potential outcome of a patient if he/she were given $a_1$ as the first treatment and $a_2$ as the second. If a patient follows a given regime $(d_1, d_2)$, we can write the potential outcome

\[
Y(d_1, d_2) = \sum_{a_1 \in \{0,1\}}\sum_{a_2 \in \{0,1\}} Y(a_1, a_2)\, I(d_1 = a_1, d_2 = a_2),
\]

where I(·) denotes the indicator function. Our goal is to find a dynamic treatment regime to maximize the mean potential outcome. Throughout the paper, we make the commonly used assumptions for studying dynamic treatment regimes: stable unit treatment value assumption and sequential randomization assumption (Murphy, 2003).

The observed data from n subjects can be summarized as

\[
O_i = (S_i^{(1)}, A_i^{(1)}, S_i^{(2)}, A_i^{(2)}, Y_i), \qquad i = 1, \ldots, n,
\]

which are assumed to be independently and identically distributed copies of O. We assume the following semiparametric regression model for Y:

\[
Y_i = h^{(2)}(X_i) + A_i^{(2)} C^{(2)}(X_i) + e_i, \tag{2.2}
\]

where $X_i = ((S_i^{(1)})^T, A_i^{(1)}, (S_i^{(2)})^T)^T$ is the vector of covariates for the ith patient, $h^{(2)}(\cdot)$ is an unspecified baseline mean function, $C^{(2)}(\cdot)$ the contrast function, and $e_i$ is an independent error with mean 0. The design matrix is denoted by $X = (X_1, \ldots, X_n)^T$.

Define

\[
V_i = \max_{A_i^{(2)}} E(Y_i \mid S_i^{(1)}, A_i^{(1)}, S_i^{(2)}, A_i^{(2)}) = h^{(2)}(X_i) + C^{(2)}(X_i)\, I(C^{(2)}(X_i) > 0).
\]

At the first stage, we consider the following conditional mean model for $V_i$:

\[
E(V_i \mid S_i^{(1)}, A_i^{(1)}) = h^{(1)}(S_i^{(1)}) + A_i^{(1)} C^{(1)}(S_i^{(1)}), \tag{2.3}
\]

where $h^{(1)}(\cdot)$ and $C^{(1)}(\cdot)$ are functions of the baseline covariates. To simplify the notation, we use the shorthand $S_i$ for $S_i^{(1)}$ and let $S = (S_1, \ldots, S_n)^T$ be the design matrix at baseline.

It can be shown that the optimal dynamic treatment regime is given by $d^{opt} = (d_1^{opt}, d_2^{opt})$, where

\[
d_1^{opt}(S_i) = I\{C^{(1)}(S_i) > 0\} \quad\text{and}\quad d_2^{opt}(X_i) = I\{C^{(2)}(X_i) > 0\}. \tag{2.4}
\]

To estimate $d_1^{opt}$ and $d_2^{opt}$, we posit the following models for $C^{(1)}(\cdot)$, $C^{(2)}(\cdot)$, $h^{(1)}(\cdot)$, $h^{(2)}(\cdot)$, $\pi^{(1)}(\cdot)$ and $\pi^{(2)}(\cdot)$:

\[
\pi^{(1)}(s, \alpha_1) = \exp(s^T\alpha_1)/\{1 + \exp(s^T\alpha_1)\}, \tag{2.5}
\]
\[
\pi^{(2)}(x, \alpha_2) = \exp(x^T\alpha_2)/\{1 + \exp(x^T\alpha_2)\}, \tag{2.6}
\]
\[
h^{(1)}(s) = s^T\theta_1, \quad h^{(2)}(x) = x^T\theta_2, \quad C^{(1)}(s) = s^T\beta_1, \quad C^{(2)}(x) = x^T\beta_2, \tag{2.7}
\]

and

\[
\pi^{(1)}(s) = \Pr(A_i^{(1)} = 1 \mid S_i = s) \quad\text{and}\quad \pi^{(2)}(x) = \Pr(A_i^{(2)} = 1 \mid X_i = x).
\]

The models in (2.5)–(2.7) can be misspecified; however, we require that either $h^{(j)}$ or $\pi^{(j)}$ is correct for j = 1, 2. For simplicity, we require $C^{(2)}$ to be correctly specified; the general case where $C^{(2)}$ is misspecified can be discussed similarly. We use backward induction to estimate the optimal dynamic treatment regime. At the second decision point, we first estimate the parameters in the posited propensity score and baseline mean models using penalized regressions. Specifically, define

\[
\hat{\alpha}_2 = \operatorname*{arg\,min}_{\alpha_2 \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n\left[\log\{1 + \exp(X_i^T\alpha_2)\} - A_i^{(2)}X_i^T\alpha_2\right] + \sum_{j=1}^p \rho_1^{(2)}(|\alpha_{2j}|, \lambda_{1n}^{(2)}),
\]

and

\[
\hat{\theta}_2 = \operatorname*{arg\,min}_{\theta_2 \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n(1 - A_i^{(2)})(Y_i - X_i^T\theta_2)^2 + \sum_{j=1}^p \rho_2^{(2)}(|\theta_{2j}|, \lambda_{2n}^{(2)}),
\]

where $\alpha_2 = (\alpha_{21}, \ldots, \alpha_{2p})^T$, $\theta_2 = (\theta_{21}, \ldots, \theta_{2p})^T$, $\rho_1^{(2)}$ and $\rho_2^{(2)}$ belong to the class of folded-concave penalty functions (Lv and Fan, 2009), such as SCAD (Fan and Li, 2001), and $\lambda_{1n}^{(2)}$ and $\lambda_{2n}^{(2)}$ are the associated regularization parameters.

Next, we estimate $\beta_2$ in (2.2) using the Dantzig selector based on the A-learning estimating function (Murphy, 2003), defined by

\[
\hat{\beta}_2 = \operatorname*{arg\,min}_{\beta_2 \in \Lambda^{(2)}} \|\beta_2\|_1, \tag{2.8}
\]

where

\[
\Lambda^{(2)} = \left\{\beta_2 \in \mathbb{R}^p : \left\|\frac{1}{n}X^T\mathrm{diag}(A^{(2)} - \hat{\pi}^{(2)})\{Y - X\hat{\theta}_2 - A^{(2)} \circ (X\beta_2)\}\right\|_\infty \le \lambda_{3n}^{(2)}\right\},
\]
\[
Y = (Y_1, \ldots, Y_n)^T, \quad A^{(2)} = (A_1^{(2)}, \ldots, A_n^{(2)})^T, \quad \hat{\pi}^{(2)} = (\pi^{(2)}(X_1, \hat{\alpha}_2), \ldots, \pi^{(2)}(X_n, \hat{\alpha}_2))^T,
\]

and $\lambda_{3n}^{(2)}$ is the regularization parameter.
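For concreteness, (2.8) is a linear program in $\beta_2$ after the standard splitting $\beta_2 = u - v$ with $u, v \ge 0$. The following minimal Python sketch (the function name and interface are ours, not from the paper) solves the stage-2 Dantzig selector with scipy.optimize.linprog; the same routine applies at the first stage with $(S, A^{(1)}, \hat{V}, \hat{\theta}_1, \hat{\pi}^{(1)})$.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_alearning(X, A, Y, theta_hat, pi_hat, lam):
    """Sketch of the Dantzig selector (2.8): minimize ||beta||_1 subject to
    ||(1/n) X^T diag(A - pi_hat){Y - X theta_hat - A o (X beta)}||_inf <= lam."""
    n, p = X.shape
    Z = X * (A - pi_hat)[:, None]        # rows X_i scaled by A_i - pi_hat_i
    b = Z.T @ (Y - X @ theta_hat) / n    # constant part of the estimating function
    M = Z.T @ (X * A[:, None]) / n       # linear part acting on beta
    # beta = u - v with u, v >= 0; the constraint |b - M beta| <= lam is linear
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([M, -M]), np.hstack([-M, M])])
    b_ub = np.concatenate([lam + b, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p),
                  method="highs")
    return res.x[:p] - res.x[p:]
```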

To estimate the regime at the first decision point, we define the pseudo-outcome $\hat{V}_i$ using the advantage function (Murphy, 2003) by

\[
\hat{V}_i = Y_i + X_i^T\hat{\beta}_2\{I(X_i^T\hat{\beta}_2 > 0) - A_i^{(2)}\}. \tag{2.9}
\]

Similarly, define

\[
\hat{\alpha}_1 = \operatorname*{arg\,min}_{\alpha_1 \in \mathbb{R}^q} \frac{1}{n}\sum_{i=1}^n\left[\log\{1 + \exp(S_i^T\alpha_1)\} - A_i^{(1)}S_i^T\alpha_1\right] + \sum_{j=1}^q \rho_1^{(1)}(|\alpha_{1j}|, \lambda_{1n}^{(1)}),
\]

and

\[
\hat{\theta}_1 = \operatorname*{arg\,min}_{\theta_1 \in \mathbb{R}^q} \frac{1}{n}\sum_{i=1}^n(1 - A_i^{(1)})(\hat{V}_i - S_i^T\theta_1)^2 + \sum_{j=1}^q \rho_2^{(1)}(|\theta_{1j}|, \lambda_{2n}^{(1)}),
\]

where $\alpha_1 = (\alpha_{11}, \ldots, \alpha_{1q})^T$, $\theta_1 = (\theta_{11}, \ldots, \theta_{1q})^T$, and $\rho_1^{(1)}$ and $\rho_2^{(1)}$ are folded-concave penalty functions. Then, we estimate $\beta_1$ in (2.3) by

\[
\hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_1 \in \Lambda^{(1)}} \|\beta_1\|_1, \tag{2.10}
\]

where

\[
\Lambda^{(1)} = \left\{\beta_1 \in \mathbb{R}^q : \left\|\frac{1}{n}S^T\mathrm{diag}(A^{(1)} - \hat{\pi}^{(1)})\{\hat{V} - S\hat{\theta}_1 - A^{(1)} \circ (S\beta_1)\}\right\|_\infty \le \lambda_{3n}^{(1)}\right\},
\]
\[
\hat{V} = (\hat{V}_1, \ldots, \hat{V}_n)^T, \quad A^{(1)} = (A_1^{(1)}, \ldots, A_n^{(1)})^T \quad\text{and}\quad \hat{\pi}^{(1)} = (\pi^{(1)}(S_1, \hat{\alpha}_1), \ldots, \pi^{(1)}(S_n, \hat{\alpha}_1))^T.
\]

The estimated optimal dynamic treatment regime is given by

\[
\hat{d}_1(S_i) = I(\hat{\beta}_1^TS_i > 0) \quad\text{and}\quad \hat{d}_2(X_i) = I(\hat{\beta}_2^TX_i > 0). \tag{2.11}
\]
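Putting the pieces together, the backward induction of this section can be sketched as below. Purely for brevity, L1-penalized logistic and linear fits stand in for the folded-concave (SCAD) penalized regressions of the paper, and dantzig_alearning is the LP sketch given above; this is an illustrative outline, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def fit_stage(Z, A, R, lam):
    """One backward-induction step: propensity model, baseline fit on controls,
    then the Dantzig A-learning step. L1 penalties are stand-ins for SCAD."""
    ps = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(Z, A)
    pi_hat = ps.predict_proba(Z)[:, 1]
    theta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(Z[A == 0], R[A == 0]).coef_
    return dantzig_alearning(Z, A, R, theta_hat, pi_hat, lam), pi_hat, theta_hat

def penalized_alearning(S, A1, X, A2, Y, lam2, lam1):
    beta2, _, _ = fit_stage(X, A2, Y, lam2)            # stage 2
    xb2 = X @ beta2
    V = Y + xb2 * ((xb2 > 0).astype(float) - A2)       # pseudo-outcome (2.9)
    beta1, _, _ = fit_stage(S, A1, V, lam1)            # stage 1
    return beta1, beta2                                 # decision rules (2.11)
```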

3. Some Implementation Issues

When the tuning parameters in the optimization problems (2.8) and (2.10) are fixed, the Dantzig selector can be solved by a standard linear programming algorithm. One issue in implementing the Dantzig selector is the choice of the tuning parameters. We use a BIC criterion for selecting them. For the Dantzig selector (2.8), $\lambda_{3n}^{(2)}$ is chosen as the minimizer of

\[
\mathrm{BIC}(\lambda) = n\log\{\mathrm{RSS}(\lambda)/n\} + d(\lambda)\{\log(n) + \log(p+1)\}, \tag{3.1}
\]

where $\mathrm{RSS}(\lambda) = \sum_{i=1}^n\left[\{A_i^{(2)} - \pi^{(2)}(X_i, \hat{\alpha}_2)\}(Y_i - X_i^T\hat{\theta}_2 - A_i^{(2)}X_i^T\hat{\beta}_2)\right]^2$ and $d(\lambda)$ is the number of nonzero components of $\hat{\beta}_2$. A similar BIC criterion was proposed by Chen and Chen (2008). We use an analogous criterion for choosing $\lambda_{3n}^{(1)}$.
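In practice, (3.1) can simply be evaluated over a grid of candidate tuning values; a minimal sketch (names are ours):

```python
import numpy as np

def select_lambda_bic(lams, fit_fn, X, A2, Y, pi_hat, theta_hat):
    """Pick lambda_{3n}^{(2)} minimizing BIC (3.1); fit_fn(lam) should return
    the Dantzig estimate beta_hat for that tuning value."""
    n, p = X.shape
    best_bic, best_lam, best_beta = np.inf, None, None
    for lam in lams:
        beta = fit_fn(lam)
        resid = (A2 - pi_hat) * (Y - X @ theta_hat - A2 * (X @ beta))
        bic = n * np.log(np.sum(resid ** 2) / n) \
              + np.count_nonzero(beta) * (np.log(n) + np.log(p + 1))
        if bic < best_bic:
            best_bic, best_lam, best_beta = bic, lam, beta
    return best_lam, best_beta
```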

It has been observed that the Dantzig estimators may underestimate the true values of the parameters due to shrinkage (Candès and Tao, 2007). Therefore, we use a two-step procedure for practical implementation, referred to as the Gauss-Dantzig selector in Candès and Tao (2007). Specifically, in the first step, we apply the proposed penalized A-learning to select the variables important for making an optimal decision, i.e., those variables with nonzero estimated coefficients. Then, in the second step, their coefficients are re-estimated by solving the unpenalized A-learning estimating equations with the important variables only.
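The refitting step amounts to solving the unpenalized A-learning estimating equation restricted to the selected support, which is a small linear system; a sketch under our notation:

```python
import numpy as np

def gauss_dantzig_refit(X, A2, Y, pi_hat, theta_hat, beta_dantzig):
    """Second step of the Gauss-Dantzig selector: re-solve the unpenalized
    A-learning estimating equation on the support selected in step one."""
    sel = np.flatnonzero(beta_dantzig)
    Zs = X[:, sel] * (A2 - pi_hat)[:, None]
    M = Zs.T @ (X[:, sel] * A2[:, None])     # |sel| x |sel| linear system
    rhs = Zs.T @ (Y - X @ theta_hat)
    beta = np.zeros_like(beta_dantzig)
    beta[sel] = np.linalg.solve(M, rhs)
    return beta
```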

4. Simulation Studies

4.1. Settings

To evaluate the numerical performance of the proposed penalized A-learning method, we consider simulation studies with two treatment decision points, based on the following model:

\[
Y = A^{(1)}A^{(2)} + A^{(2)}(\beta_2^TS^{(1)} + S^{(2)} + \beta_0) + A^{(1)}(\beta_1^TS^{(1)}) + \varepsilon, \tag{4.1}
\]

where $A^{(j)}$, j = 1, 2, is the treatment given at the jth stage, $S^{(j)}$, j = 1, 2, denotes the covariate information collected before the jth treatment is given, and Y is the final response of interest. The random error ε follows a normal distribution with mean 0 and variance 0.25. Here, the covariates $S^{(1)} = (S_1^{(1)}, \ldots, S_q^{(1)})^T$ follow a multivariate normal distribution with mean 0 and variance $I_q$. In addition, the intermediate covariate $S^{(2)}$ is a scalar, generated as $S^{(2)} = S_1^{(1)} + A^{(1)} + A^{(1)}S_1^{(1)} + e$, where e follows a normal distribution with mean 0 and variance 0.25.
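A minimal generator for model (4.1) under the constant-propensity settings (a sketch; the function name is ours):

```python
import numpy as np

def simulate_41(n, q, beta1, beta2, beta0=0.0, seed=None):
    """Draw one dataset from model (4.1) with P(A^(1)=1) = P(A^(2)=1) = 0.5."""
    rng = np.random.default_rng(seed)
    S1 = rng.standard_normal((n, q))                   # S^(1) ~ N(0, I_q)
    A1 = rng.binomial(1, 0.5, n)
    S2 = S1[:, 0] + A1 + A1 * S1[:, 0] + rng.normal(0, 0.5, n)
    A2 = rng.binomial(1, 0.5, n)
    eps = rng.normal(0, 0.5, n)                        # variance 0.25
    Y = A1 * A2 + A2 * (S1 @ beta2 + S2 + beta0) + A1 * (S1 @ beta1) + eps
    return S1, A1, S2, A2, Y
```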

We set $\beta_0 = 0$. Based on model (4.1), the optimal treatment regime at stage 2 is $I(A^{(1)} + \beta_2^TS^{(1)} + S^{(2)} > 0)$. Following this optimal treatment regime at stage 2, the Q-function at stage 1 is given by

\[
Q_1(S^{(1)}, A^{(1)}) = E\{(A^{(1)} + \beta_2^TS^{(1)} + S^{(2)})_+ \mid S^{(1)}, A^{(1)}\} + A^{(1)}(\beta_1^TS^{(1)}) = \frac{1}{\sqrt{8\pi}}\exp(-2\mu^2) + \mu\{1 - \Phi(-2\mu)\} + A^{(1)}(\beta_1^TS^{(1)}),
\]

where $\mu = A^{(1)} + \beta_2^TS^{(1)} + S_1^{(1)} + A^{(1)} + A^{(1)}S_1^{(1)}$ and $a_+ = (|a| + a)/2$. Therefore, the contrast function is $C(S^{(1)}) = Q_1(S^{(1)}, 1) - Q_1(S^{(1)}, 0)$, and the optimal treatment regime at stage 1 is $I\{C(S^{(1)}) > 0\}$.

To evaluate the double robustness of the proposed method, we consider a variety of scenarios with correctly specified and misspecified baseline mean and/or propensity score models. At stage 2, a linear model with covariates $S^{(1)}$, $S^{(2)}$ and $A^{(1)}$ is fitted for the baseline mean function, while the true baseline mean function is $h^{(2)}(X) = A^{(1)}(\beta_1^TS^{(1)})$. We choose $\beta_1 = 0_q$, for which the baseline mean function is correctly specified, and $\beta_1 = (0_4, 1, -1, 0_{q-6})^T$, for which it is misspecified. At stage 1, a linear model with covariates $S^{(1)}$ is fitted for the baseline mean function, which is always misspecified. Logistic models are used for estimating the propensity scores; they are correctly specified for the constant model but misspecified for the probit model. The following four settings are considered:

  • Setting 1: β1 = 0q, P (A(2) = 1) = 0.5;

  • Setting 2: β1 = (04, 1, −1, 0q−6)T, P(A(2) = 1) = 0.5;

  • Setting 3: β1 = 0q, P (A(2) = 1) = Pr(N(0, 1) ≤ ST γ);

  • Setting 4: β1 = (04, 1, −1, 0q−6)T, P (A(2) = 1) = Pr(N(0, 1) ≤ ST γ),

where $S = ((S^{(1)})^T, S^{(2)})^T$ and N(0, 1) is a standard normal random variable. For the other parameters, we choose $P(A^{(1)} = 1) = 0.5$, $\beta_2 = (0, 0, 1, -1, 0_{q-4})^T$, $\sigma_1 = \sigma_2 = 0.5$, $d = (d_0, d_1, d_2, d_3)^T = (0, 1, 1, 1)^T$, and $\gamma = (0_{q-2}, 1, -1, 1)^T$. Table 1 summarizes the model misspecification information for the baseline mean and propensity score models and the associated important variables under the different settings. In the next section, we show simulation results for the four settings with q = 1000 and sample sizes n = 150 and 300 over 500 replications.

Table 1.

Simulation Settings

            Stage    Baseline  Propensity Score  Important Variables
Setting 1   Stage 2  right     right             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)})$
Setting 2   Stage 2  wrong     right             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)}, S_5^{(1)}, S_6^{(1)})$
Setting 3   Stage 2  right     wrong             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)})$
Setting 4   Stage 2  wrong     wrong             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)}, S_5^{(1)}, S_6^{(1)})$

4.2. Competing methods

We further compare our method with outcome weighted learning (OWL, Zhao et al., 2012), a robust method that estimates an individualized treatment rule by directly maximizing the estimated value function. Zhao et al. (2015) further introduced backward outcome weighted learning (BOWL) and simultaneous outcome weighted learning (SOWL) to extend these methods to multiple-stage studies. Here, we consider a doubly robust version of BOWL (DR-BOWL) for comparison. For a single-stage study, the resulting DR-BOWL method is similar to the residual weighted learning method (Zhou et al., 2015).

Specifically, we first estimate the propensity score $\hat{\pi}^{(2)} = (\hat{\pi}_1^{(2)}, \ldots, \hat{\pi}_n^{(2)})^T$ and the baseline $\hat{h}^{(2)} = X\hat{\theta}_2 = (\hat{h}_1^{(2)}, \ldots, \hat{h}_n^{(2)})^T$ as in Section 2. We consider the linear decision rule $I(x^T\beta_2 > 0)$ and estimate $\beta_2$ by minimizing the following loss function:

\[
\tilde{\beta}_2 = \operatorname*{arg\,min}_{\beta_2} \frac{1}{n}\sum_i \frac{(Y_i - \hat{h}_i^{(2)})\{1 - (2A_i^{(2)} - 1)X_i^T\beta_2\}_+}{A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})} + \lambda_{3n}^{(2)}\|\beta_2\|_1.
\]

The penalty term in the original OWL is $\lambda_{3n}^{(2)}\|\beta_2\|_2^2$; we replace it with the $L_1$ norm here to simultaneously select variables. Then we construct the pseudo-outcome $\hat{V}_i$ using the augmented inverse propensity weighted estimator (AIPWE; Zhang et al., 2012),

\[
\hat{V}_i = \frac{A_i^{(2)}d_2(X_i) + (1 - A_i^{(2)})\{1 - d_2(X_i)\}}{A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})}Y_i - \left(\frac{A_i^{(2)}d_2(X_i) + (1 - A_i^{(2)})\{1 - d_2(X_i)\}}{A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})} - 1\right)\left[\hat{h}_i^{(2)}\{1 - d_2(X_i)\} + \hat{\Phi}_i^{(2)}d_2(X_i)\right],
\]

where $d_2(X_i) = I(X_i^T\tilde{\beta}_2 > 0)$ and $\hat{\Phi}_i^{(2)}$ is an estimate of $\Phi_i^{(2)} = E(Y \mid A^{(2)} = 1, X = X_i)$. Here, we fit a linear model for $E(Y \mid A^{(2)} = 1, X)$ and use nonconcave penalized regression with the SCAD penalty to obtain $\hat{\Phi}_i^{(2)}$. Denoting by $\hat{\pi}^{(1)} = (\hat{\pi}_1^{(1)}, \ldots, \hat{\pi}_n^{(1)})^T$ and $\hat{h}^{(1)} = S\hat{\theta}_1 = (\hat{h}_1^{(1)}, \ldots, \hat{h}_n^{(1)})^T$ the estimated propensity score and baseline at the first stage, we consider a linear treatment regime of the form $I(s^T\beta_1 > 0)$ and estimate $\beta_1$ by

\[
\tilde{\beta}_1 = \operatorname*{arg\,min}_{\beta_1} \frac{1}{n}\sum_i \frac{(\hat{V}_i - \hat{h}_i^{(1)})\{1 - (2A_i^{(1)} - 1)S_i^T\beta_1\}_+}{A_i^{(1)}\hat{\pi}_i^{(1)} + (1 - A_i^{(1)})(1 - \hat{\pi}_i^{(1)})} + \lambda_{3n}^{(1)}\|\beta_1\|_1.
\]

Tuning parameters λ3n(2) and λ3n(1) are obtained by minimizing a value-based BIC criterion.
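Both displays above are inverse-probability-weighted hinge losses with an $L_1$ penalty. A crude single-stage sketch follows; a generic derivative-free optimizer stands in for the specialized solvers one would use in practice, and all names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def dr_bowl_stage(Z, A, resid, pi_hat, lam, beta_init=None):
    """Minimize the weighted hinge loss of DR-BOWL for one stage:
    resid = outcome minus fitted baseline; weights are inverse propensities."""
    w = resid / (A * pi_hat + (1 - A) * (1 - pi_hat))
    sgn = 2 * A - 1
    def loss(beta):
        return (np.mean(w * np.maximum(1 - sgn * (Z @ beta), 0.0))
                + lam * np.abs(beta).sum())
    x0 = np.zeros(Z.shape[1]) if beta_init is None else beta_init
    return minimize(loss, x0, method="Powell").x    # illustration only
```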

4.3. Results

Table 2 summarizes the variable selection results for optimal treatment decisions and the empirical performance of the estimated optimal treatment regime compared with the true optimal regime, using our penalized A-learning method (denoted by PAL) and DR-BOWL, respectively. Specifically, it reports the false negative (FN) rate (the percentage of important variables that are missed), the false positive (FP) rate (the percentage of unimportant variables that are selected), the ratio of value functions (denoted by VR), calculated as the value function of the estimated optimal treatment regime divided by that of the true optimal regime, and the error rate (ER) of the estimated optimal treatment regimes for treatment decision making, at both stages. Here, the ER at stage 2 is calculated as the mean of $n^{-1}\sum_{i=1}^n |I(\hat{\beta}_2^TX_i > 0) - I(\beta_{2,0}^TX_i > 0)|$ and at stage 1 as the mean of $n^{-1}\sum_{i=1}^n |I(\hat{\beta}_1^TS_i > 0) - I(C(S_i) > 0)|$. The value function of a given treatment regime is calculated using Monte Carlo simulations based on 10,000 replications. The VR at stage 2 (denoted by VR*) compares the regime that follows the estimated optimal rule at stage 2 after a randomly assigned treatment at stage 1, as in the simulated data, with the true optimal dynamic treatment regime for both stages. The VR at stage 1 compares the estimated optimal dynamic treatment regime with the true optimal dynamic treatment regime for both stages.

Table 2.

Variable Selection Simulation Results (%).

n method Stage 2 Stage 1
FN FP VR* ER FN FP VR ER
Setting 1
150 PAL 12.6 0.1 64.7 6.1 63.8 0.1 98.3 7.0
DR-BOWL 85.7 0.1 39.0 34.7 99.5 0.1 39.1 48.3
300 PAL 1.1 0.1 65.4 2.6 41.9 0.1 99.7 6.2
DR-BOWL 78.1 0.1 49.2 27.5 98.0 0.2 50.2 48.3

Setting 2
150 PAL 25.9 0.1 57.8 10.4 56.2 0.2 90.8 15.7
DR-BOWL 86.3 0.1 35.1 35.6 99.0 0.2 35.9 47.2
300 PAL 11.0 0.1 59.6 6.2 32.5 <0.05 97.9 8.0
DR-BOWL 79.8 0.1 42.4 29.9 97.2 0.2 44.6 47.1

Setting 3
150 PAL 33.7 0.3 59.9 13.5 64.5 0.1 93.0 9.1
DR-BOWL 18.8 1.3 60.2 7.5 72.3 0.5 92.4 24.4
300 PAL 12.3 0.3 64.2 7.2 52.7 <0.05 98.3 6.9
DR-BOWL 74.9 0.2 55.3 23.2 97.8 <0.01 56.4 48.4

Setting 4
150 PAL 55.7 0.2 48.2 22.4 62.2 0.1 79.4 17.7
DR-BOWL 75.0 0.1 51.0 23.4 99.0 <0.01 51.7 47.2
300 PAL 26.4 0.3 56.2 13.2 36.4 <0.05 94.3 8.4
DR-BOWL 74.9 0.2 50.9 23.1 97.4 <0.01 52.8 47.0

FN: proportion of related variables with zero coefficients

FP: proportion of unrelated variables with nonzero coefficients

VR: value ratio between estimated and true treatment regimes

ER: error rate of estimated treatment regimes

The DR-BOWL method fails in all settings. Take Setting 1 with n = 300 as an example: FN = 78.1% at the second stage, where the baseline, propensity score and contrast functions are all correctly specified, so it misses approximately three quarters of the important variables. Moreover, VR = 50.2%, indicating the poor performance of the estimated treatment rules.

On the other hand, the overall performance of our penalized A-learning method is good. We make the following observations. First, the FN rates are much higher than the FP rates. This suggests that the Dantzig selector tends to produce conservative variable selection results, which is commonly seen in the literature. Second, the variable selection results and the error rates of the estimated optimal treatment regime at stage 2 are generally much better than those at stage 1, which is expected since the optimal linear treatment decision rule is correctly specified at stage 2 but not at stage 1. At stage 1, for n = 150, over 55% of the important variables are not selected in all four settings. Third, our method requires correct specification of either the propensity score or the baseline model, especially when the sample size is small. This is implied by comparing the results in Setting 4 with the other three settings. For example, when n = 150, the false negative rate at the second stage reaches 55.7%, which is much higher than the FN rates in the other three settings. Besides, our estimator is very efficient in Setting 1, where both models are correctly specified. Even when n = 150, the ratio of the value functions reaches 98.3%, and all error rates are around 6–7%. These results are even comparable with those under Settings 2 and 3 when n = 300. Lastly, the estimation and variable selection performance of the estimated optimal dynamic treatment regimes improves as the sample size increases. In particular, in Settings 1–3 with n = 300, the VRs are all at least 97.9% and the ERs are all at most 8%, which implies that the estimated optimal treatment regimes nearly maximize the value functions.

4.4. Nonregularity

As suggested by one of the referees, we further examine our method under settings with different degrees of nonregularity. Specifically, we consider the setting where all covariates in $S^{(1)}$ are independent Rademacher random variables. We set $S^{(2)}$ to be another Rademacher random variable, independent of $S^{(1)}$ and $A^{(1)}$.

Denote $A^{(1)*} = 2A^{(1)} - 1$. The response Y is generated as follows:

\[
Y = 2A^{(2)}(A^{(1)*} + \delta_1S_1^{(1)} + S^{(2)} - \delta_2) + A^{(1)}(\beta^TS^{(1)}) + \varepsilon, \tag{4.2}
\]

where ε ~ N(0, 0.25).

For each stage, we fit linear models for the baseline and contrast functions and a logistic regression model for the propensity score. The parameter β in (4.2) determines the baseline function at the second stage. Similar to the regular case discussed in Section 4.1, we again consider four settings:

  • Setting 1: β = 0q, P (A(2) = 1) = 0.5;

  • Setting 2: β = (04, 1, −1, 0q−6)T, P(A(2) = 1) = 0.5;

  • Setting 3: β = 0q, P(A(2) = 1) = Pr(N(0, 1) ≤ STγ);

  • Setting 4: β = (04, 1, −1, 0q−6)T, P(A(2) = 1) = Pr(N(0, 1) ≤ STγ),

where S = ((S(1))T, S(2))T and γ = (0q−2, 1, −1, 1)T.

Parameters δ1 and δ2 in (4.2) control the degree of nonregularity at the second stage. We consider three choices of (δ1, δ2). Setting δ1 = δ2 = 1, we obtain

\[
\Pr(C^{(2)}(X) = 0) = \Pr(A^{(1)*} + S_1^{(1)} + S^{(2)} = 1) = 0.375.
\]

Setting δ1 = δ2 = 1.1, we have

\[
\Pr(C^{(2)}(X) = 0) = \Pr(A^{(1)*} + S^{(2)} = 0,\ S_1^{(1)} = 1) = 0.25.
\]

Setting δ1 = 1, δ2 = 1.1, we have

\[
\Pr(C^{(2)}(X) = 0) = 0.
\]

With some calculation, we can show that the Q-function at the first stage takes the following form:

\[
Q(S^{(1)}, A^{(1)}) = A^{(1)}(\beta^TS^{(1)} + f_1S_1^{(1)} + f_2).
\]

Hence, the contrast function is correctly specified at the first stage. When δ1 = δ2 = 1 or δ1 = δ2 = 1.1, we have f1 = f2 = 1. When δ1 = 1, δ2 = 1.1, we have f1 = f2 = 0.95. Information about the model specification and the important variables in the contrast functions is given in Table 3.

Table 3.

Simulation for Nonregular Settings

            Stage    Baseline  Propensity Score  Important Variables
Setting 1   Stage 2  right     right             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)})$
Setting 2   Stage 2  wrong     right             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)}, S_5^{(1)}, S_6^{(1)})$
Setting 3   Stage 2  right     wrong             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)})$
Setting 4   Stage 2  wrong     wrong             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)}, S_5^{(1)}, S_6^{(1)})$

We also consider two sample sizes, n = 150 and n = 300. This gives a total of 24 scenarios. For each scenario, we report FN, FP, VR and ER as in Section 4.3. The ERs for the first and second stages are calculated as

\[
\left\{\frac{1}{n}\sum_{i=1}^n\left|I(\hat{\beta}_1^TS_i > 0) - I(C(S_i) > 0)\right|I(C(S_i) \neq 0)\right\}\Big/\left\{\frac{1}{n}\sum_{i=1}^nI(C(S_i) \neq 0)\right\}
\]

and

\[
\left\{\frac{1}{n}\sum_{i=1}^n\left|I(\hat{\beta}_2^TX_i > 0) - I(\beta_{2,0}^TX_i > 0)\right|I(\beta_{2,0}^TX_i \neq 0)\right\}\Big/\left\{\frac{1}{n}\sum_{i=1}^nI(\beta_{2,0}^TX_i \neq 0)\right\}.
\]

Compared to the definitions in Section 4.3, the error rates here are computed only over patients with nonzero contrast functions. Such definitions are more meaningful because, for patients with zero contrast, both treatments are optimal. We simulate 200 replications. The results are reported in Table 4.
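Computing these restricted error rates is immediate; a small sketch:

```python
import numpy as np

def restricted_error_rate(decision, contrast):
    """Error rate over patients with nonzero contrast, as in the displays
    above: decision = I(x^T beta_hat > 0), contrast = true contrast values."""
    nz = contrast != 0
    optimal = (contrast > 0).astype(float)
    return np.mean(np.abs(decision[nz].astype(float) - optimal[nz]))
```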

Table 4.

Variable Selection Simulation Results for Non-regular Settings (%).

n Nonregularity Stage 2 Stage 1
FN FP VR* ER FN FP VR ER
Setting 1
150 δ1 = 1, δ2 = 1 0 < 0.01 53.6 0 4.0 0.3 93.9 5.0
δ1 = 1.1, δ2 = 1.1 0 < 0.01 53.5 0.1 0.5 0.3 95.1 3.2
δ1 = 1, δ2 = 1.1 0 0.01 52.9 5.0 1.0 0.3 93.8 4.2
300 δ1 = 1, δ2 = 1 0 < 0.01 53.3 0 0 0.4 97.3 0.9
δ1 = 1.1, δ2 = 1.1 0 < 0.01 53.9 0 0 0.3 97.9 0.9
δ1 = 1, δ2 = 1.1 0 < 0.01 52.8 2.0 0 0.3 97.0 0.9

Setting 2
150 δ1 = 1, δ2 = 1 0 < 0.05 46.1 0 14.7 0.3 90.7 5.6
δ1 = 1.1, δ2 = 1.1 0 < 0.05 45.9 2.0 14 0.3 89.5 5.7
δ1 = 1, δ2 = 1.1 0 < 0.05 44.4 11.6 9.7 0.3 89.7 11.5
300 δ1 = 1, δ2 = 1 0 < 0.01 45.8 0 0 0.2 97.1 0.4
δ1 = 1.1, δ2 = 1.1 0 < 0.01 45.6 0.5 0 0.2 96.4 0.4
δ1 = 1, δ2 = 1.1 0 0.01 45.1 6.8 0 0.2 98.1 7.4

Setting 3
150 δ1 = 1, δ2 = 1 5.7 0.6 45.0 2.9 19.0 0.2 85.6 4.1
δ1 = 1.1, δ2 = 1.1 8.2 0.5 45.1 6.6 17.0 0.2 87.3 3.4
δ1 = 1, δ2 = 1.1 6.3 0.6 44.6 14.6 18.5 0.2 85.8 4.1
300 δ1 = 1, δ2 = 1 0 0.1 53.1 0 0 0.3 97.1 1.0
δ1 = 1.1, δ2 = 1.1 0 0.1 53.8 1.4 0 0.3 98.0 0.6
δ1 = 1, δ2 = 1.1 0 0.1 52.9 8.4 0 0.3 97.9 0.8

Setting 4
150 δ1 = 1, δ2 = 1 20.7 0.5 25.4 8.6 52 0.2 66.6 14.0
δ1 = 1.1, δ2 = 1.1 20.8 0.5 25.3 12.4 54.2 0.2 62.5 14.9
δ1 = 1, δ2 = 1.1 21.5 0.6 23.7 22.6 51.7 0.2 61.7 22.1
300 δ1 = 1, δ2 = 1 0.3 0.2 44.9 0.2 3.3 0.2 95.8 0.7
δ1 = 1.1, δ2 = 1.1 0 0.2 44.8 3.9 0.2 0.2 97.5 0.4
δ1 = 1, δ2 = 1.1 0 0.2 43.8 13.2 0.3 0.2 97.1 8.3

Within each setting, most results are similar across the different choices of δ1 and δ2. This suggests that the nonregularity issues do not have a big impact on the variable selection results. Apart from the results in Setting 4, the false negative and false positive rates are all very small. When the sample size increases to 300, the false negative rates in most scenarios are exactly equal to 0, while the false positive rates in all settings are below 0.4%, demonstrating the excellent variable selection performance of our method. In Settings 1–3, most error rates are below 7%, while the ratios of value functions are all above 85%, indicating that our estimated optimal treatment regimes are very close to the truth in these scenarios.

5. Application to STAR*D Study

We applied the proposed method to a dataset from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, which was conducted to compare different treatments for patients with major depressive disorder (MDD). There were 4041 participants (age 18–75) with nonpsychotic MDD enrolled in this study. At the first level, all participants were treated with citalopram (CIT) for up to 14 weeks. Subsequently, three more levels of treatments were provided for participants without a satisfactory response to CIT. At each level, participants were randomly assigned to the treatment options acceptable to them. At Level 2, participants were eligible for seven treatment options: sertraline (SER), venlafaxine (VEN), bupropion (BUP), cognitive therapy (CT), and augmenting CIT with bupropion (CIT+BUP), buspirone (CIT+BUS) or cognitive therapy (CIT+CT). Participants without a satisfactory response to CT proceeded to Level 2A for additional medication treatments. All participants who did not respond satisfactorily at Level 2 or 2A were eligible for four treatments at Level 3: medication switch to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Participants without a satisfactory response at Level 3 were re-randomized at Level 4 to either tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN). See Fava et al. (2003) and Rush et al. (2004) for more details of the STAR*D study. One goal of the study is to determine which treatment strategies, in what order or sequence, provide the optimal treatment effect.

As an illustration, we focused on the subset of participants who were given treatment BUP or SER at Level 2, did not receive a satisfactory response, and were then randomized to treatment MIRT or NTP at Level 3. For this study, we considered 381 covariates collected at baseline and intermediate levels as possibly relevant predictors. For the treatment regime at Level 3, all 381 covariates as well as the assigned treatment at Level 2 were considered as possible predictors for making the optimal treatment decision. For the treatment regime at Level 2, the 305 covariates collected before the Level 2 treatment was given were considered. The negative of the 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) score, a measurement of the symptomatic status of depression, was used as the final response. There were 73 participants with complete records in the subset of data we are interested in. Among these participants, 36 were treated with BUP and 37 with SER at Level 2, and 33 were treated with NTP and 40 with MIRT at Level 3.

The selection and estimation results are summarized as follows. At Level 3, our method selected two covariates: "age" in the baseline demographics (AGE) and the suicide risk of the patient (SUICD). The estimated optimal treatment regime is I(1.459 − 0.091 × AGE + 0.158 × SUICD ≥ 0), where 1 represents treatment NTP and 0 represents treatment MIRT. This optimal treatment regime assigns 27 participants to NTP and the remaining 46 participants to MIRT. At Level 2, our method also selected two covariates: age and the "QIDS-C percent improvement" in the clinic visit clinical record form at Level 1 (QCIMP). The estimated optimal treatment regime is I(−8.600 + 0.145 × AGE + 0.125 × QCIMP ≥ 0), where 1 stands for treatment BUP and 0 stands for treatment SER. This optimal treatment regime assigns 37 participants to BUP and the remaining 36 participants to SER.

To further examine the estimated optimal dynamic treatment regime, we compare its estimated value function with the values of the four non-dynamic treatment regimes BUP+NTP, BUP+MIRT, SER+NTP and SER+MIRT. For a given dynamic treatment regime $d = (d^{(1)}, d^{(2)})$, we evaluate its value function using AIPWE (Zhang et al., 2013),

\[
\frac{1}{n}\sum_{i=1}^n \frac{d_{A_i}^{(1)}}{\hat{\pi}_{A_i}^{(1)}}\left(\frac{d_{A_i}^{(2)}}{\hat{\pi}_{A_i}^{(2)}}Y_i - \frac{d_{A_i}^{(2)} - \hat{\pi}_{A_i}^{(2)}}{\hat{\pi}_{A_i}^{(2)}}\left[d_i^{(2)}\{\hat{h}_i^{(2)} + X_i^T\hat{\beta}_2\} + (1 - d_i^{(2)})\hat{h}_i^{(2)}\right]\right) - \frac{1}{n}\sum_{i=1}^n \frac{d_{A_i}^{(1)} - \hat{\pi}_{A_i}^{(1)}}{\hat{\pi}_{A_i}^{(1)}}\left[d_i^{(1)}\{\hat{h}_i^{(1)} + S_i^T\hat{\beta}_1\} + (1 - d_i^{(1)})\hat{h}_i^{(1)}\right],
\]

where $d_{A_i}^{(2)} = A_i^{(2)}d_i^{(2)} + (1 - A_i^{(2)})(1 - d_i^{(2)})$, $d_{A_i}^{(1)} = A_i^{(1)}d_i^{(1)} + (1 - A_i^{(1)})(1 - d_i^{(1)})$, $\hat{\pi}_{A_i}^{(2)} = A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})$, $\hat{\pi}_{A_i}^{(1)} = A_i^{(1)}\hat{\pi}_i^{(1)} + (1 - A_i^{(1)})(1 - \hat{\pi}_i^{(1)})$, and $d_i^{(2)}$ and $d_i^{(1)}$ are the treatments assigned to the ith patient according to $d^{(2)}$ and $d^{(1)}$. Based on this formula, we report the estimated value functions of the four non-dynamic treatment regimes in Table 5.
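Under our reconstruction of the display above, the AIPWE value of a regime $(d^{(1)}, d^{(2)})$ can be computed as follows (a sketch; argument names are ours):

```python
import numpy as np

def aipwe_value(d1, d2, A1, A2, Y, pi1, pi2, h1, h2, xb2, sb1):
    """AIPWE value estimate of the regime (d1, d2); xb2 = X beta2_hat and
    sb1 = S beta1_hat are the fitted contrasts."""
    dA2 = A2 * d2 + (1 - A2) * (1 - d2)           # regime agreement, stage 2
    dA1 = A1 * d1 + (1 - A1) * (1 - d1)
    piA2 = A2 * pi2 + (1 - A2) * (1 - pi2)
    piA1 = A1 * pi1 + (1 - A1) * (1 - pi1)
    m2 = d2 * (h2 + xb2) + (1 - d2) * h2           # model-based stage-2 mean
    m1 = d1 * (h1 + sb1) + (1 - d1) * h1
    stage2 = (dA2 / piA2) * Y - ((dA2 - piA2) / piA2) * m2
    return np.mean((dA1 / piA1) * stage2 - ((dA1 - piA1) / piA1) * m1)
```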

Table 5.

Estimated Values of Different Treatment Regimes

Treatment Regime Estimated Value
estimated optimal regime −9.02
BUP + NTP −12.86
BUP + MIRT −12.57
SER + NTP −12.57
SER + MIRT −12.28

Estimating the value of the optimal treatment regime is well known to be a nonregular problem when there is a nonzero probability that the contrast function (at either the second or the first stage) equals zero. To evaluate the value function under our estimated optimal treatment regime, we consider the online estimator proposed by Luedtke and van der Laan (2016). Specifically, for $i = l_n + 1, l_n + 2, \ldots, n$, we obtain the estimated optimal dynamic treatment regime $\hat{d}^{opt(i)} = (\hat{d}^{opt(i)(1)}, \hat{d}^{opt(i)(2)})$ and its associated parameters $\hat{\beta}_2^{(i)}$, $\hat{\beta}_1^{(i)}$, propensity score functions $\hat{\pi}^{(i)(2)}$, $\hat{\pi}^{(i)(1)}$ and baseline functions $\hat{h}^{(i)(2)}$, $\hat{h}^{(i)(1)}$ based on the data from patients 1 to i − 1, using penalized A-learning. Then we evaluate the value of $\hat{d}^{opt(i)}$ on the ith patient using AIPWE (Zhang et al., 2013),

\[
\hat{V}_i(i) = \frac{\hat{d}_{A_i}^{opt(i)(1)}}{\hat{\pi}_{A_i}^{(i)(1)}}\left(\frac{\hat{d}_{A_i}^{opt(i)(2)}}{\hat{\pi}_{A_i}^{(i)(2)}}Y_i - \frac{\hat{d}_{A_i}^{opt(i)(2)} - \hat{\pi}_{A_i}^{(i)(2)}}{\hat{\pi}_{A_i}^{(i)(2)}}\left[\hat{d}_i^{opt(i)(2)}\{\hat{h}_i^{(i)(2)} + X_i^T\hat{\beta}_2^{(i)}\} + (1 - \hat{d}_i^{opt(i)(2)})\hat{h}_i^{(i)(2)}\right]\right) - \frac{\hat{d}_{A_i}^{opt(i)(1)} - \hat{\pi}_{A_i}^{(i)(1)}}{\hat{\pi}_{A_i}^{(i)(1)}}\left[\hat{d}_i^{opt(i)(1)}\{\hat{h}_i^{(i)(1)} + S_i^T\hat{\beta}_1^{(i)}\} + (1 - \hat{d}_i^{opt(i)(1)})\hat{h}_i^{(i)(1)}\right].
\]

The variance of $\hat{V}_i(i)$, conditional on the data from patients 1 to i − 1, is estimated by

\[
\sigma_i^2 = \frac{1}{i-1}\sum_{j=1}^{i-1}\hat{V}_i^2(j) - \left(\frac{1}{i-1}\sum_{j=1}^{i-1}\hat{V}_i(j)\right)^2,
\]

where $\hat{V}_i(j)$ is the estimated value of $\hat{d}^{opt(i)}$ on the jth patient.

The final estimator is given by

\[
\hat{V} = \frac{\sum_{j=l_n+1}^n \sigma_j^{-1}\hat{V}_j(j)}{\sum_{j=l_n+1}^n \sigma_j^{-1}},
\]

with the estimated standard error

\[
\hat{\sigma} = \frac{\sqrt{n - l_n}}{\sum_{j=l_n+1}^n \sigma_j^{-1}}.
\]

Since the sample size of our dataset is small, we choose $l_n \approx 2n/3$, i.e., $l_n = 49$. The estimated value $\hat{V}$ equals −9.02, with an estimated standard error $\hat{\sigma} = 1.66$. From Table 5, we can see that the value under our estimated treatment regime is much larger than those under the four non-dynamic treatment regimes.
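The final weighted combination and its standard error are simple to compute once the per-patient values $\hat{V}_j(j)$ and scales $\sigma_j$ are available; a sketch:

```python
import numpy as np

def online_value(v, sigma, l_n):
    """Inverse-sd weighted combination of the online value estimates, with the
    plug-in standard error sqrt(n - l_n) / sum_j sigma_j^{-1}."""
    v = np.asarray(v, float)[l_n:]          # entries l_n+1, ..., n
    w = 1.0 / np.asarray(sigma, float)[l_n:]
    return np.sum(w * v) / np.sum(w), np.sqrt(len(v)) / np.sum(w)
```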

6. Oracle inequalities for $\hat{\beta}_2$ and the value function of the estimated regime at the second stage

We first introduce some notation. For an arbitrary matrix $\Phi \in \mathbb{R}^{M \times M}$ and an arbitrary vector $\phi \in \mathbb{R}^M$, the superscript $\Phi^j$ denotes the jth column of Φ and $\phi_j$ the jth element of ϕ, while the subscript $\Phi_i$ denotes the ith row of Φ. For subsets $J, J' \subseteq \{1, \ldots, M\}$, let |J| be the cardinality of J and $J^c$ the complement of J. We denote by $\phi_J$ the vector in $\mathbb{R}^{|J|}$ that has the same coordinates as ϕ on J, by $\Phi_J$ the submatrix formed by the columns in J, and by $\Phi_{JJ'}$ the submatrix formed by the rows in J and columns in J′. The support of ϕ is defined by $\mathrm{supp}(\phi) = \{j \in \{1, \ldots, M\} : \phi_j \neq 0\}$. Let $\|\phi\|_p$ be the $L_p$ norm of ϕ and $\|\Phi\|_p$ the operator norm induced by the vector p-norm. If Φ is positive semidefinite, define

\[
\rho_{\min}^s(\Phi) = \min_{\|y\|_2 = 1,\, |\mathrm{supp}(y)| \le s}\|\Phi^{1/2}y\|_2^2 \quad\text{and}\quad \rho_{\max}^s(\Phi) = \max_{\|y\|_2 = 1,\, |\mathrm{supp}(y)| \le s}\|\Phi^{1/2}y\|_2^2.
\]

Let $\|Y\|_{\psi_p}$ denote the Orlicz norm of a random variable Y, defined as

\[
\|Y\|_{\psi_p} \equiv \inf\{u > 0 : E\exp(|Y|^p/u^p) \le 2\},
\]

for some p ≥ 1. For any two positive sequences $\{a_n\}$ and $\{b_n\}$, $a_n \gg b_n$ means $\lim_n b_n/a_n = 0$. Throughout this paper, we use $c_0$ and $\bar{c}$ to denote universal constants whose values may change from place to place.

6.1. Oracle inequality for $\hat{\beta}_2$

Recall that $C^{(2)}(x) = x^T\beta_2$ according to our assumption. Let $\beta_{2,0}$ denote the true value of $\beta_2$, $M_{\beta_2}$ the support of $\beta_{2,0}$, and $s_{\beta_2} = |M_{\beta_2}| = O(n^{l_6})$, for some $0 \le l_6 < 1$, the nonsparsity size of $\beta_{2,0}$. We allow the number of covariates p to grow exponentially fast with the sample size n, i.e., $\log p = O(n^{a_2})$ for some $0 < a_2 < 1$. To deal with such NP-dimensionality, following Zhou (2009), we assume

\[
X = U\Sigma^{1/2}, \qquad \Sigma_{jj} = 1, \quad j = 1, \ldots, p, \tag{6.1}
\]

where $U = (U_1^T, \ldots, U_n^T)^T$ and $U_1, \ldots, U_n$ are i.i.d. copies of a p-dimensional isotropic random vector $U_0$. More specifically, we require that, for any vector $a \in \mathbb{R}^p$,

\[
E(a^TU_0)^2 = a^Ta \quad\text{and}\quad \|a^TU_0\|_{\psi_2} \le \omega\|a\|_2, \tag{6.2}
\]

for some isotropic constant ω.

Remark 6.1

The notion of an isotropic random vector was introduced by Milman and Pajor (2003). Independent normal and independent Rademacher random variables are the two most important examples of isotropic random vectors. More generally, the coordinates of an isotropic random vector need not be independent. They can be distributed uniformly on various convex and symmetric bodies, for example, an appropriate multiple of the unit ball in $\mathbb{R}^p$ equipped with the $L_K$-norm for any $1 \le K \le \infty$. For these distributions, we denote by $\omega_K$ their isotropic constants. It is further shown in Milman and Pajor (1989) that the $\omega_K$ are uniformly bounded for K ≥ 1. However, it remains unknown whether the isotropic property holds for all uniform distributions on arbitrary symmetric convex bodies with Lebesgue measure 1.

Remark 6.2

The isotropic formulation requires the covariates in $U_0$ to be uncorrelated, and hence does not allow for correlated Bernoullis. However, according to our definition $X = U\Sigma^{1/2}$, different covariates in the design matrix X can be correlated when $\Sigma_{ij} \neq 0$. Such a formulation allows us to impose conditions on the tail of $U_0$ and on the covariance matrix Σ separately.
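For intuition, a design obeying (6.1) is easy to simulate: draw independent Rademacher (or standard normal) rows and correlate them through a square root of Σ (a sketch; the Cholesky factor stands in for $\Sigma^{1/2}$, since any square root yields the same covariance):

```python
import numpy as np

def isotropic_design(n, p, Sigma, seed=None):
    """Generate X = U Sigma^{1/2} as in (6.1), with i.i.d. Rademacher rows of U
    (an isotropic subgaussian ensemble)."""
    rng = np.random.default_rng(seed)
    U = rng.choice([-1.0, 1.0], size=(n, p))
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    return U @ L.T
```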

Since the A-learning estimating equation involves the plug-in estimators $\hat{\alpha}_2$ and $\hat{\theta}_2$, we need some conditions on these two estimators to establish oracle inequalities for $\hat{\beta}_2$. More precisely, we assume that $\hat{\alpha}_2$ and $\hat{\theta}_2$ converge to some $\alpha_2^*$ and $\theta_2^*$, respectively. When the propensity score model $\pi^{(2)}$ and the baseline model $h^{(2)}$ are correctly specified, $\alpha_2^*$ and $\theta_2^*$ represent the true coefficients in these two models. When the models are misspecified, $\alpha_2^*$ and $\theta_2^*$ correspond to the population-level least favorable parameters. Denote by $M_{\alpha_2}$ and $M_{\theta_2}$ the supports of $\alpha_2^*$ and $\theta_2^*$, respectively, and let $s_{\alpha_2} = |M_{\alpha_2}|$ and $s_{\theta_2} = |M_{\theta_2}|$ be the numbers of nonzero elements. We assume $s_{\alpha_2} = O(n^{l_4})$ and $s_{\theta_2} = O(n^{l_5})$ for some $0 \le l_4, l_5 < 1/2$.

Condition 1

Assume that there exist positive constants $\gamma_{\alpha_2}$ and $\gamma_{\theta_2}$ such that, with probability at least $1 - \bar{c}/(n+p)$,

\[
\hat{\alpha}_{2,M_{\alpha_2}^c} = 0, \qquad \|\hat{\alpha}_{2,M_{\alpha_2}} - \alpha^*_{2,M_{\alpha_2}}\|_\infty = O(n^{-\gamma_{\alpha_2}}\log n), \tag{6.3}
\]
\[
\hat{\theta}_{2,M_{\theta_2}^c} = 0, \qquad \|\hat{\theta}_{2,M_{\theta_2}} - \theta^*_{2,M_{\theta_2}}\|_\infty = O(n^{-\gamma_{\theta_2}}\log n). \tag{6.4}
\]

Moreover, assume $d_n^{\alpha_2} \gg n^{-\gamma_{\alpha_2}}\log n$ and $d_n^{\theta_2} \gg n^{-\gamma_{\theta_2}}\log n$, where $d_n^{\alpha_2} = \min_{j \in M_{\alpha_2}}|\alpha^*_{2j}|/2$ and $d_n^{\theta_2} = \min_{j \in M_{\theta_2}}|\theta^*_{2j}|/2$.

Remark 6.3

Condition 1 assumes the weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$, i.e., selection consistency and consistency under the $L_\infty$ norm. The weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$ are established in Theorems 8.1 and 8.2 of Section 8, respectively.

Define

\[
C^{(2)} = E\{X_i\pi_i^{(2)*}(1 - \pi_i^{(2)*})X_i^T\}, \qquad D^{(2)} = E\{X_iX_i^T(1 - A_i^{(2)})\},
\]

where $\pi_i^{(2)*} \equiv \pi^{(2)}(X_i, \alpha_2^*)$.

Condition 2

Assume that the matrices $D^{(2)}$, $C^{(2)}$ and Σ satisfy

\[
\lambda_{\max}(\Sigma_{M_{\alpha_2}M_{\alpha_2}}) = O(1), \quad \lambda_{\max}(\Sigma_{M_{\theta_2}M_{\theta_2}}) = O(1), \quad \liminf_n \lambda_{\min}(D^{(2)}_{M_{\theta_2}M_{\theta_2}}) > 0, \quad \liminf_n \lambda_{\min}(C^{(2)}_{M_{\alpha_2}M_{\alpha_2}}) > 0.
\]

Define $\Omega^{(2)}(\alpha_2) = E[X_iX_i^TA_i^{(2)}\{1 - \pi^{(2)}(X_i, \alpha_2)\}]$ and $\Omega_n^{(2)} = n^{-1}\sum_i X_iX_i^TA_i^{(2)}(1 - \hat{\pi}_i^{(2)})$ with $\hat{\pi}_i^{(2)} = \pi^{(2)}(X_i, \hat{\alpha}_2)$. For any positive semidefinite matrix $\Psi \in \mathbb{R}^{p \times p}$, integer s and positive number c, define the function $K(s, c, \Psi)$ as follows:

\[
K(s, c, \Psi) = \min_{\substack{J \subseteq \{1, \ldots, p\} \\ |J| \le s}}\ \min_{\substack{y \neq 0 \\ \|y_{J^c}\|_1 \le c\|y_J\|_1}} \frac{\|\Psi^{1/2}y\|_2}{\|y_J\|_2} > 0.
\]
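Evaluating $K(s, c, \Psi)$ exactly is combinatorial. For small p, one can approximate it by enumerating supports and randomly sampling directions in the cone $\|y_{J^c}\|_1 \le c\|y_J\|_1$; the sampled minimum is an upper bound on the true value. A crude Monte Carlo sketch, for illustration only:

```python
import itertools
import numpy as np

def K_approx(s, c, Psi, n_dirs=500, seed=0):
    """Monte Carlo upper bound on K(s, c, Psi) for small p: sample directions y
    obeying ||y_{J^c}||_1 <= c ||y_J||_1 over all supports J of size s."""
    rng = np.random.default_rng(seed)
    p = Psi.shape[0]
    root = np.linalg.cholesky(Psi + 1e-10 * np.eye(p))  # ||Psi^{1/2} y||_2 = ||root^T y||_2
    best = np.inf
    for J in itertools.combinations(range(p), s):
        J = list(J)
        for _ in range(n_dirs):
            y = np.zeros(p)
            y[J] = rng.standard_normal(s)
            off = rng.standard_normal(p)
            off[J] = 0.0
            if np.abs(off).sum() > 0:                   # respect the cone constraint
                off *= rng.uniform() * c * np.abs(y[J]).sum() / np.abs(off).sum()
            best = min(best, np.linalg.norm(root.T @ (y + off)) / np.linalg.norm(y[J]))
    return best
```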

The following condition ensures that the RE condition holds for the matrix $\Omega_n^{(2)}$.

Condition 3

Assume that, for any $0 < \theta_s < 1$ and sufficiently large n, we have

\[
K(s_{\beta_2}, 1, \Omega_n^{(2)}) > (1 - \theta_s)\inf_{\alpha_2 \in H_{\alpha_2}}K(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2)) > 0, \tag{6.5}
\]

where $H_{\alpha_2}$ denotes the set of vectors $\alpha_2$ that satisfy the weak oracle property (6.3).

Remark 6.4

It is tedious to verify (6.5) due to the plug-in estimator $\hat{\pi}_i^{(2)}$. The key to proving such a result is that the estimator $\hat{\alpha}_2$ in $\hat{\pi}_i^{(2)}$ should be sparse. That is the reason we use penalized regression with a folded-concave penalty to obtain $\hat{\alpha}_2$: it ensures selection consistency of the estimator. We provide a general result characterizing the UUP and RE conditions for the random matrix $\Omega_n^{(2)}$ in Lemmas 9.1 and 9.2 of Section 9.

To establish the oracle inequality for $\hat{\beta}_2$, we first provide an upper bound for

\[
\left\|\frac{1}{n}X^T\mathrm{diag}(A^{(2)} - \hat{\pi}^{(2)})(Y - X\hat{\theta}_2 - A^{(2)} \circ X\beta_{2,0})\right\|_\infty,
\]

which is given in the following Lemma.

Lemma 6.1

Assume that Conditions 1 and 2 hold, $\|h^{(2)}(X_i) - X_i^T\theta_2^*\|_{\psi_1} < \infty$, $\|e_i\|_{\psi_2} < \infty$, $a_2 + l_4 < 1$, and that either $\pi^{(2)}$ or $h^{(2)}$ is correctly specified. Then, for sufficiently large n, there exists some constant $c^{(2)}$ such that, with probability at least $1 - \bar{c}/(n+p)$,

\[
\left\|\frac{1}{n}X^T\mathrm{diag}(A^{(2)} - \hat{\pi}^{(2)})(Y - X\hat{\theta}_2 - A^{(2)} \circ X\beta_{2,0})\right\|_\infty \le c^{(2)}(E_1 + E_2 + E_3 + E_4),
\]

where

\[
E_1 = \sqrt{\log p/n}, \qquad E_2 = s_{\alpha_2}n^{-2\gamma_{\alpha_2}}\log^2 n + s_{\theta_2}n^{-2\gamma_{\theta_2}}\log^2 n,
\]
\[
E_3 = \sigma_3\left\{\sqrt{s_{\alpha_2}\log n/n} + s_{\alpha_2}\lambda_{1n}^{(2)}\rho_1^{(2)\prime}(d_n^{\alpha_2})\right\},
\]
\[
E_4 = \sigma_4\left\{\sqrt{s_{\theta_2}\log n/n} + s_{\theta_2}\lambda_{2n}^{(2)}\rho_2^{(2)\prime}(d_n^{\theta_2})\right\},
\]

$\sigma_3^2 = E\{h^{(2)}(X_i) - X_i^T\theta_2^*\}^2$, and $\sigma_4^2 = E\{\pi^{(2)}(X_i) - \pi_i^{(2)*}\}^2$.

Remark 6.5

Recall that $\log p = O(n^{a_2})$ and $s_{\alpha_2} = O(n^{l_4})$ for some $0 \le a_2, l_4 < 1$. The condition $a_2 + l_4 < 1$ implies $n \gg s_{\alpha_2}\log p$.

Remark 6.6

Here, $E_1$ describes how the curse of dimensionality takes effect, $E_2$ is due to the estimation errors of $\hat{\alpha}_2$ and $\hat{\theta}_2$, and $E_3$ and $E_4$ are due to model misspecification. Since we assume that at least one of $h^{(2)}$ and $\pi^{(2)}$ is correctly specified, either $E_3$ or $E_4$ is zero.

Theorem 6.1

Assume that the conditions in Lemma 6.1 and Condition 3 hold, and that $\lambda_{3n}^{(2)} \ge c^{(2)}(E_1 + E_2 + E_3 + E_4)$, where the constant $c^{(2)}$ is defined in Lemma 6.1. Then, for some fixed $0 < \theta_s < 1$ and sufficiently large n, the following two inequalities hold with probability at least $1 - \bar{c}/(n+p)$ for some constant $\bar{c} > 0$:

\[
\|\hat{\beta}_2 - \beta_{2,0}\|_2 \le \frac{12\lambda_{3n}^{(2)}\sqrt{s_{\beta_2}}}{(1 - \theta_s)^2\inf_{\alpha_2 \in H_{\alpha_2}}K^2(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))}, \tag{6.6}
\]
\[
\|\hat{\beta}_2 - \beta_{2,0}\|_1 \le \frac{8\lambda_{3n}^{(2)}s_{\beta_2}}{(1 - \theta_s)^2\inf_{\alpha_2 \in H_{\alpha_2}}K^2(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))}. \tag{6.7}
\]

Moreover, we have $\|\hat{\beta}_{2,M_{\beta_2}^c}\|_1 \le \|\hat{\beta}_{2,M_{\beta_2}} - \beta_{2,0,M_{\beta_2}}\|_1$.

From (6.6), it is immediate to see that $\|\hat{\beta}_2 - \beta_{2,0}\|_2 \xrightarrow{P} 0$ as long as

\[
\frac{\sqrt{s_{\beta_2}}\,(E_1 + E_2 + E_3 + E_4)}{\inf_{\alpha_2 \in H_{\alpha_2}}K^2(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))} \to 0, \tag{6.8}
\]

which implies the double robustness of $\hat{\beta}_2$. We provide a sufficient condition for (6.8) in the following Corollary.

Corollary 6.1 (Double robustness of $\hat{\beta}_2$)

Assume that conditions in Theorem 6.1 and the following conditions hold:

\[
l_6 < \min(4\gamma_{\theta_2} - 2l_5,\ 4\gamma_{\alpha_2} - 2l_4), \tag{6.9}
\]
\[
\lambda_{2n}^{(2)}\rho_2^{(2)\prime}(d_n^{\theta_2}) = O(n^{-1/2}) \quad\text{and}\quad \lambda_{1n}^{(2)}\rho_1^{(2)\prime}(d_n^{\alpha_2}) = O(n^{-1/2}), \tag{6.10}
\]
\[
\liminf_n \inf_{\alpha_2 \in H_{\alpha_2}}K(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2)) > 0. \tag{6.11}
\]

If either the baseline mean function $h^{(2)}$ or the propensity score model $\pi^{(2)}$ is correctly specified, then $\|\hat{\beta}_2 - \beta_{2,0}\|_2 \xrightarrow{P} 0$.

Remark 6.7

Condition (6.9) imposes a constraint between the sparsity of the population parameters and the convergence rates of $\hat{\alpha}_2$ and $\hat{\theta}_2$. When $s_{\beta_2} = O(1)$, it requires $\hat{\alpha}_2$ and $\hat{\theta}_2$ to be consistent under the $L_2$ norm. Condition (6.10) automatically holds for the SCAD penalty function when $d_n^{\theta_2} \gg \lambda_{2n}^{(2)}$ and $d_n^{\alpha_2} \gg \lambda_{1n}^{(2)}$.

6.2. Oracle inequality for the value function of the estimated regime at the second stage

Now we establish the error bound for the difference between the mean responses (i.e., the value functions) of the estimated optimal regime at the second stage, $\hat{d}_2(X_0) = I(X_0^T\hat{\beta}_2 > 0)$, and the true optimal one, $d_2^{opt}(X_0) = I(X_0^T\beta_{2,0} > 0)$, for an individual with covariates $X_0$. Here, $X_0$ is also assumed to have the form $\Sigma^{1/2}U_0$, with Σ and $U_0$ defined in (6.1), independent of $X_i$, i = 1, …, n. In addition, the regime at the first stage is taken to be the same as the actually received treatment $A_0^{(1)}$ at the first stage.

Under the assumptions of SUTVA and no unmeasured confounders, the difference of the corresponding value functions is given by

\[
E\{Y_0(A_0^{(1)}, d_2^{opt})\} - E\{Y_0(A_0^{(1)}, \hat{d}_2)\} = E[X_0^T\beta_{2,0}\{I(X_0^T\beta_{2,0} > 0) - I(X_0^T\hat{\beta}_2 > 0)\}]. \tag{6.12}
\]

Since (6.12) is nonnegative, it suffices to provide an upper bound. Here, we impose the following condition.

Condition 4

The probability density function $g^{(2)}(\cdot)$ of $X_0^T\beta_{2,0}$ exists and is bounded.

Condition 4 is a mild condition on the true optimal decision function, which holds in most cases when at least one of the important covariates (those whose corresponding components of $\beta_{2,0}$ are nonzero) is continuous.

Theorem 6.2

Assume that the conditions in Theorem 6.1 and Condition 4 hold, and that $E(X_0^T\beta_{2,0})^2 = O(1)$. Then, for fixed $0 < \theta_s < 1$ and sufficiently large n,

\[
E[X_0^T\beta_{2,0}\{I(X_0^T\beta_{2,0} > 0) - I(X_0^T\hat{\beta}_2 > 0)\}] \le \frac{\bar{c}\omega}{n} + \frac{c_0\omega^2\rho_{\max}^{s_{\beta_2}}(\Sigma)(\lambda_{3n}^{(2)})^2 s_{\beta_2}\log^2 n}{(1 - \theta_s)^4\inf_{\alpha_2 \in H_{\alpha_2}}K^4(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))}.
\]

Remark 6.8

The error bound for the difference of the value functions follows from the error bound on $\hat{\beta}_2$ and Condition 4. Since the first term in the upper bound is small, the difference of the value functions is mainly characterized by the second term, which is of the order $O(\rho_{\max}^{s_{\beta_2}}(\Sigma)\|\hat{\beta}_2 - \beta_{2,0}\|_2^2\log^2 n)$.

7. Error bounds for $\hat{\beta}_1$ and the value function of the estimated dynamic treatment regime

7.1. Misspecified contrast function

In the context of A-learning, a major challenge arising in multi-stage studies is that the contrast functions are likely to be misspecified in the backward induction. In order to study the finite-sample bounds for $\hat{\beta}_1$, we first need to define the least favorable parameters under misspecification of the contrast function.

Recall that $C^{(1)}(S_i)$ is the true contrast function for the ith patient, which can be a very complex function of $S_i$ due to the backward induction. For notational convenience, we use the shorthand $C(s)$ for $C^{(1)}(s)$. We posit a linear model $S_i^T\beta_1$ for $C(\cdot)$, which is often misspecified. When either the propensity score model $\pi^{(1)}$ or the baseline mean function $h^{(1)}$ is correctly specified, the associated least favorable parameter $\beta_1^*$ is defined as follows:

\[
\beta_1^* = \operatorname*{arg\,min}_{\beta_1 \in \Lambda^*} \|\beta_1\|_1, \tag{7.1}
\]

where

\[
\Lambda^* = \left\{\beta_1 \in \mathbb{R}^q : \left\|E[S_iA_i^{(1)}(1 - \pi_i^{(1)*})\{C(S_i) - S_i^T\beta_1\}]\right\|_\infty \le \kappa_0\right\},
\]

$\pi_i^{(1)*} = \pi^{(1)}(S_i, \alpha_1^*)$, and $\kappa_0$ is a nonnegative constant. Define

\[
\kappa_0^* = \left\|E[S_iA_i^{(1)}(1 - \pi_i^{(1)*})\{C(S_i) - S_i^T\beta_1^*\}]\right\|_\infty.
\]

By simple algebra, we can show $\kappa_0^* \le \min\{\kappa_0, O(\sigma_0)\}$, where $\sigma_0^2 = E[\{C(S_i) - S_i^T\beta_1^*\}^2]$ describes the degree of misspecification of the contrast function. Define $s_{\beta_1} = |M_{\beta_1}| = O(n^{l_3})$ for some $0 \le l_3 < 1/2$, where $M_{\beta_1} = \mathrm{supp}(\beta_1^*)$.

7.2. Error bound for $\hat{\beta}_1$

Assume that $\log q = O(n^{a_1})$ for some $0 < a_1 < 1$ and that $S_1, \ldots, S_n$ are i.i.d. copies of $S_0$ such that

\[
S_0 \overset{d}{=} \Psi^{1/2}V_0, \tag{7.2}
\]

where $\Psi \in \mathbb{R}^{q \times q}$ is a positive definite matrix with $\Psi_{jj} = 1$ for j = 1, …, q, and $V_0$ is a q-dimensional isotropic random vector with isotropic constant ζ. As in the second stage, we first give conditions on $\hat{\alpha}_1$ and $\hat{\theta}_1$. Assume that these two estimators converge to some $\alpha_1^*$ and $\theta_1^*$, respectively, under possible model misspecification. Denote $M_{\alpha_1} = \mathrm{supp}(\alpha_1^*)$, $M_{\theta_1} = \mathrm{supp}(\theta_1^*)$, $s_{\alpha_1} = |M_{\alpha_1}| = O(n^{l_1})$, and $s_{\theta_1} = |M_{\theta_1}| = O(n^{l_2})$ for some $0 \le l_1, l_2 < 1/2$.

Condition 5

Assume that there exist positive constants $\gamma_{\alpha_1}$ and $\gamma_{\theta_1}$ such that, with probability at least $1 - \bar{c}/(n+p+q)$, the following holds:

\[
\hat{\alpha}_{1,M_{\alpha_1}^c} = 0, \qquad \|\hat{\alpha}_{1,M_{\alpha_1}} - \alpha^*_{1,M_{\alpha_1}}\|_\infty = O(n^{-\gamma_{\alpha_1}}\log n), \tag{7.3}
\]
\[
\hat{\theta}_{1,M_{\theta_1}^c} = 0, \qquad \|\hat{\theta}_{1,M_{\theta_1}} - \theta^*_{1,M_{\theta_1}}\|_\infty = O(n^{-\gamma_{\theta_1}}\log n). \tag{7.4}
\]

Moreover, assume $d_n^{\alpha_1} \gg n^{-\gamma_{\alpha_1}}\log n$ and $d_n^{\theta_1} \gg n^{-\gamma_{\theta_1}}\log n$, where $d_n^{\alpha_1} = \min_{j \in M_{\alpha_1}}|\alpha^*_{1j}|/2$ and $d_n^{\theta_1} = \min_{j \in M_{\theta_1}}|\theta^*_{1j}|/2$.

Condition 6

Assume that $D^{(1)}$, $C^{(1)}$ and Ψ satisfy

\[
\lambda_{\max}(\Psi_{M_{\alpha_1}M_{\alpha_1}}) = O(1), \quad \lambda_{\max}(\Psi_{M_{\theta_1}M_{\theta_1}}) = O(1), \quad \liminf_n \lambda_{\min}(D^{(1)}_{M_{\theta_1}M_{\theta_1}}) > 0, \quad \liminf_n \lambda_{\min}(C^{(1)}_{M_{\alpha_1}M_{\alpha_1}}) > 0,
\]

where

\[
D^{(1)} = E\{S_iS_i^T(1 - A_i^{(1)})\}, \qquad C^{(1)} = E\{S_iS_i^T\pi_i^{(1)*}(1 - \pi_i^{(1)*})\},
\]

and $\pi_i^{(1)*} = \pi^{(1)}(S_i, \alpha_1^*)$.

Since both the propensity score model and the contrast function at the first stage can be misspecified, we need the following condition to control their effect on the estimation of $\beta_1^*$.

Condition 7

Assume that

\[
\tau_0 \equiv \left\|F_{M_{\alpha_1}}[C^{(1)}_{M_{\alpha_1}M_{\alpha_1}}]^{-1}b^{(1)}_{M_{\alpha_1}}\right\|_\infty < \infty, \tag{7.5}
\]

where $b^{(1)} = E\{S_i(A_i^{(1)} - \pi_i^{(1)*})\}$ and

\[
F = E[S_iA_i^{(1)}\pi_i^{(1)*}(1 - \pi_i^{(1)*})\{C(S_i) - S_i^T\beta_1^*\}S_i^T].
\]

Remark 7.1

It is immediate to see that $\tau_0 = 0$ when either the contrast function or the propensity score model is correctly specified.

When going back to the first stage, the error bound for $\hat{\beta}_1$ is directly affected by that for $\hat{\beta}_2$. This is because the estimated response $\hat{V}_i$ at the first stage is obtained from $\hat{\beta}_2$ via the advantage function. To simplify the presentation, we introduce the following condition.

Condition 8

Assume that, with probability at least $1 - \bar{c}/(n+p)$, there exists some constant $\mu_1 > 0$ such that

\[
\rho_{\max}^{s_{\beta_2}}(\Sigma)\,\|\hat{\beta}_2 - \beta_{2,0}\|_2 = O(n^{-\mu_1}\log n), \tag{7.6}
\]

and $\|\hat{\beta}_{2,M_{\beta_2}^c}\|_1 \le \|\hat{\beta}_{2,M_{\beta_2}} - \beta_{2,0,M_{\beta_2}}\|_1$.

A more explicit form of the error bound in (7.6) is given in Theorem 6.1. In the next lemma, we provide an upper bound for the term

\[
\left\|S^T\mathrm{diag}(A^{(1)} - \hat{\pi}^{(1)})(\hat{V} - S\hat{\theta}_1 - A^{(1)} \circ S\beta_1^*)/n\right\|_\infty. \tag{7.7}
\]

Lemma 7.1

Assume that Conditions 5–8 and those in Theorem 6.1 hold, $\|C(S_i) - S_i^T\beta_1^*\|_{\psi_1} < \infty$, $\|V_i - E(V_i \mid S_i, A_i^{(1)})\|_{\psi_2} < \infty$, $a_1 + l_1 < 1$, $n \gg s_{\beta_2}\log p\,\{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2/\rho_{\min}^{s_{\beta_2}}(\Sigma)$, and that either $\pi^{(1)}$ or $h^{(1)}$ is correctly specified. Then, for sufficiently large n, with probability at least $1 - \bar{c}/(n+p+q)$, (7.7) can be bounded from above by $c^{(1)}(E_5 + E_6 + E_7 + E_8 + E_9 + E_{10})$ for some constant $c^{(1)} > 0$, where

\[
E_5 = \sqrt{\log q/n}\,\log^2 n, \qquad E_6 = s_{\alpha_1}n^{-2\gamma_{\alpha_1}}\log^2 n + s_{\theta_1}n^{-2\gamma_{\theta_1}}\log^2 n,
\]
\[
E_7 = \sigma_1\left\{\sqrt{s_{\alpha_1}\log n/n} + s_{\alpha_1}\lambda_{1n}^{(1)}\rho_1^{(1)\prime}(d_n^{\alpha_1})\right\},
\]
\[
E_8 = \sigma_2\left\{\sqrt{s_{\theta_1}\log n/n} + s_{\theta_1}\lambda_{2n}^{(1)}\rho_2^{(1)\prime}(d_n^{\theta_1})\right\},
\]
\[
E_9 = \sigma_0\left\{\sqrt{s_{\alpha_1}\log n/n} + s_{\alpha_1}\lambda_{1n}^{(1)}\rho_1^{(1)\prime}(d_n^{\alpha_1}) + \tau_0 + \kappa_0^*\right\},
\]
\[
E_{10} = n^{-\mu_1}\log n, \qquad \sigma_0^2 = E\{C(S_i) - S_i^T\beta_1^*\}^2, \qquad \sigma_1^2 = E\{h^{(1)}(S_i) - S_i^T\theta_1^*\}^2, \qquad \sigma_2^2 = E\{\pi_i^{(1)*} - \pi^{(1)}(S_i)\}^2.
\]

Remark 7.2

The terms $E_5$–$E_8$ have similar interpretations to $E_1$–$E_4$ in Lemma 6.1, respectively. The additional term $E_{10}$ is due to the error bound for $\hat{\beta}_2$ in the backward induction, while $E_9$ is due to the misspecification of the contrast function.

Define $\Omega^{(1)}(\alpha_1) = E[S_iS_i^TA_i^{(1)}\{1 - \pi^{(1)}(S_i, \alpha_1)\}]$ and $\Omega_n^{(1)} = n^{-1}\sum_i S_iS_i^TA_i^{(1)}(1 - \hat{\pi}_i^{(1)})$ with $\hat{\pi}_i^{(1)} = \pi^{(1)}(S_i, \hat{\alpha}_1)$. As in stage 2, we need the following condition to ensure the RE condition for the matrix $\Omega_n^{(1)}$.

Condition 9

Assume that, for any $0 < \theta_s < 1$ and sufficiently large n, we have

\[
K(s_{\beta_1}, 1, \Omega_n^{(1)}) > (1 - \theta_s)\inf_{\alpha_1 \in H_{\alpha_1}}K(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1)) > 0, \tag{7.8}
\]

where $H_{\alpha_1}$ denotes the set of vectors $\alpha_1$ that satisfy the weak oracle property (7.3).

Theorem 7.1

Assume that Condition 9 and the conditions in Lemma 7.1 hold, and that $\lambda_{3n}^{(1)} \ge c^{(1)}\sum_{k=5}^{10}E_k$, where the constant $c^{(1)}$ is defined in Lemma 7.1. Then there exists a constant $c_8$ such that, for sufficiently large n and some fixed $0 < \theta_s < 1$, with probability at least $1 - c_8/(n+p+q)$, the error bounds for $\hat{\beta}_1$ are given by

\[
\|\hat{\beta}_1 - \beta_1^*\|_2 \le \frac{12\lambda_{3n}^{(1)}\sqrt{s_{\beta_1}}}{(1 - \theta_s)^2\inf_{\alpha_1 \in H_{\alpha_1}}K^2(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1))}, \tag{7.9}
\]
\[
\|\hat{\beta}_1 - \beta_1^*\|_1 \le \frac{8\lambda_{3n}^{(1)}s_{\beta_1}}{(1 - \theta_s)^2\inf_{\alpha_1 \in H_{\alpha_1}}K^2(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1))}. \tag{7.10}
\]

7.3. Error bound for the value function of the estimated dynamic treatment regime

Under the SUTVA and sequential randomization assumptions, the value function of a given dynamic treatment regime $(d_1(S_0), d_2(X_0))$ is given by

\[
E\{Y_0(d_1, d_2)\} = E[h^{(2)}(X_0) + (\beta_{2,0}^TX_0)d_2(X_0) + C(S_0)\{d_1(S_0) - A_0^{(1)}\}],
\]

where $S_0$ and $X_0$ denote the baseline covariates and the covariates for the second stage, respectively. Then, the difference between the value functions under the true optimal regime $(d_1^{opt}, d_2^{opt})$ and the estimated optimal dynamic treatment regime (2.11) is given by

\[
E\{Y_0(d_1^{opt}, d_2^{opt})\} - E\{Y_0(\hat{d}_1, \hat{d}_2)\} = E[C(S_0)\{I(C(S_0) > 0) - I(S_0^T\hat{\beta}_1 > 0)\}] + E[X_0^T\beta_{2,0}\{I(X_0^T\beta_{2,0} > 0) - I(X_0^T\hat{\beta}_2 > 0)\}].
\]

Similar to Condition 4, we impose the following condition.

Condition 10

Assume that the probability density function $g^{(1)}(\cdot)$ of $S_0^T\beta_1^*$ exists and is bounded.

Theorem 7.2

Assume that the conditions in Theorem 7.1 and Condition 10 hold, and that $E(X_0^T\beta_{2,0})^2 = O(1)$ and $E(S_0^T\beta_1^*)^2 = O(1)$. Then, for some fixed $0 < \theta_s < 1$ and sufficiently large n,

\[
0 \le E\{Y_0(d_1^{opt}, d_2^{opt})\} - E\{Y_0(\hat{d}_1, \hat{d}_2)\} \le \frac{\bar{c}(\omega + \zeta)}{n} + c_0\sigma_0^{4/3} + \frac{c_0\omega^2\rho_{\max}^{s_{\beta_2}}(\Sigma)(\lambda_{3n}^{(2)})^2 s_{\beta_2}\log^2 n}{(1 - \theta_s)^4\inf_{\alpha_2 \in H_{\alpha_2}}K^4(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))} + \frac{c_0\zeta^2\rho_{\max}^{s_{\beta_1}}(\Psi)(\lambda_{3n}^{(1)})^2 s_{\beta_1}\log^2 n}{(1 - \theta_s)^4\inf_{\alpha_1 \in H_{\alpha_1}}K^4(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1))}.
\]

Remark 7.3

Theorem 7.2 suggests that the upper bound for the difference of the value functions comes from three major components: the misspecification of the contrast function, described by $\sigma_0^2$, and the estimation errors of $\hat{\beta}_2$ and $\hat{\beta}_1$.

8. Weak oracle properties of the $\hat{\alpha}_j$'s and $\hat{\theta}_j$'s

In order to prove the error bounds for $\hat{\beta}_1$, $\hat{\beta}_2$ and the value functions of the estimated treatment regimes presented in Sections 6 and 7, we need to establish the weak oracle properties of $\hat{\alpha}_j$ and $\hat{\theta}_j$ (j = 1, 2) in the posited models for the propensity score and baseline mean functions. Here, we prove the results based on a posited logistic regression model for the propensity score and a linear model for the baseline mean function under a random design setting. However, these results can be extended to generalized linear models (McCullagh and Nelder, 1989).

8.1. Weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$

We assume that $\hat{\alpha}_2$ and $\hat{\theta}_2$ converge to some population parameters $\alpha_2^*$ and $\theta_2^*$, respectively. Under Conditions B.1–B.6 given in the Supplementary Appendix, we establish the weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$ in the following two theorems. Recall that $s_{\alpha_2} = |M_{\alpha_2}| = O(n^{l_4})$ for some $0 \le l_4 < 1/2$.

Theorem 8.1

Assume that Conditions B.1–B.3 hold, $l_4 + a_2 < 1$ and $\lambda_{\max}(\Sigma_{M_{\alpha_2}M_{\alpha_2}}) = O(1)$. Then, for sufficiently large n, there exists some constant $\gamma_{\alpha_2} > 0$ such that, with probability at least $1 - \bar{c}/(n+p)$,

  1. $\hat{\alpha}_{2,M_{\alpha_2}^c} = 0$;

  2. $\|\hat{\alpha}_{2,M_{\alpha_2}} - \alpha^*_{2,M_{\alpha_2}}\|_\infty = O(n^{-\gamma_{\alpha_2}}\log n)$.

Theorem 8.2

Assume that Conditions B.4–B.6 hold, $\lambda_{\max}(\Sigma_{M_{\theta_2}M_{\theta_2}}) = O(1)$, and $\|e_i\|_{\psi_2} < \infty$, where $e_i$ is defined in (2.2). Then there exists some constant $\gamma_{\theta_2} > 0$ such that, with probability at least $1 - \bar{c}/(n+p)$,

  1. $\hat{\theta}_{2,M_{\theta_2}^c} = 0$;

  2. $\|\hat{\theta}_{2,M_{\theta_2}} - \theta^*_{2,M_{\theta_2}}\|_\infty = O(n^{-\gamma_{\theta_2}}\log n)$.

Remark 8.1

Theorem 1 in Shi, Song and Lu (2015) established weak oracle results for the penalized estimators in a fixed design setting, mainly for technical convenience; its proofs can be obtained using similar arguments as in Fan and Lv (2011). In this paper, we focus on a random design setting, which is more realistic in medical studies. To the best of our knowledge, the weak oracle properties of penalized estimators have not been studied in a random design setting with NP-dimensionality. The major difficulty lies in developing some random matrix theory, such as controlling the maximum eigenvalues of certain random matrices. Such results are established in Theorems 8.1 and 8.2.

Remark 8.2

The condition $l_4 + a_2 < 1$ ensures that, for large $n$,

$$\max_{1 \le j \le p} \lambda_{\max}\big[(X_{M_{\alpha_2}})^T \mathrm{diag}(|X_j|)\, X_{M_{\alpha_2}}\big] = O(n), \tag{8.1}$$

with probability approaching 1. A major technical difficulty in deriving (8.1) is that the matrix $(X_{M_{\alpha_2}})^T \mathrm{diag}(|X_j|)\, X_{M_{\alpha_2}}$ does not have a subexponential tail (see Definition G.2 in the Supplementary Appendix). When $s_{\alpha_2} \le n$, we can bound $\max_{j \in M_{\alpha_2}} |X_{ij}|$ from above by $2\omega\sqrt{\log n}$ with probability at least $1 - 2/n$, which ensures the subexponential tail of the truncated matrix. Lemma B.2 in the Supplementary Appendix proves such a result in a more general setting.
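As a quick numerical illustration of this truncation device (ours; a Gaussian design stands in for the sub-Gaussian assumption, with $\omega$ playing the role of the scale parameter), entries on the support rarely exceed $2\omega\sqrt{\log n}$:

```python
# Illustration of the truncation bound in Remark 8.2 under a Gaussian design.
import numpy as np

rng = np.random.default_rng(2)
n, s, omega = 1000, 20, 1.0                 # omega: sub-Gaussian scale parameter
X = rng.normal(scale=omega, size=(n, s))    # n observations on the support set
thresh = 2 * omega * np.sqrt(np.log(n))
frac = np.mean(np.max(np.abs(X), axis=1) <= thresh)
print(f"fraction of rows with max |X_ij| <= 2*omega*sqrt(log n): {frac:.4f}")
```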

8.2. Weak oracle properties of $\hat\alpha_1$ and $\hat\theta_1$

The weak oracle properties of $\hat\alpha_1$ can be derived similarly to those of $\hat\alpha_2$. However, unlike the results for $\hat\theta_2$, the weak oracle properties of $\hat\theta_1$ depend on $\hat\beta_2$ even when the baseline mean function $h^{(1)}$ is correctly specified. This is because the estimated response $\hat V_i$ is obtained based on $\hat\beta_2$. A necessary condition to ensure $\hat\theta_1 \xrightarrow{P} \theta_1$ is that $\|\hat\beta_2 - \beta_{2,0}\|_2 \xrightarrow{P} 0$, which is established in Corollary 6.1.
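For concreteness, one standard A-learning construction of the estimated response, stated here as an assumption rather than a restatement of our earlier definition, takes $\hat V_i = Y_i + \{I(X_i^T \hat\beta_2 > 0) - A_i^{(2)}\}\, X_i^T \hat\beta_2$; the sketch below makes the dependence of $\hat V_i$ (and hence $\hat\theta_1$) on $\hat\beta_2$ explicit:

```python
# Sketch of the stage-1 pseudo-outcome: errors in beta2_hat enter V_hat
# directly, which is why consistency of beta2_hat (Corollary 6.1) is needed
# for the weak oracle property of theta1_hat.
import numpy as np

def pseudo_outcome(Y, A2, X2, beta2_hat):
    """V_i = Y_i + {I(X_i^T beta2_hat > 0) - A_i^(2)} * (X_i^T beta2_hat)."""
    blip = X2 @ beta2_hat                           # estimated stage-2 contrast
    return Y + ((blip > 0).astype(float) - A2) * blip
```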

Theorem 8.3

Assume that Condition 8 and Conditions B.7–B.12 in the Supplementary Appendix hold. Further assume that $\lambda_{\max}(\Psi_{M_{\alpha_1} M_{\alpha_1}}) = O(1)$, $\lambda_{\max}(\Psi_{M_{\theta_1} M_{\theta_1}}) = O(1)$, $n \gg s_{\beta_2} \log p\, \{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2 / \rho_{\min}^{s_{\beta_2}}(\Sigma)$, $a_1 + l_1 < 1$, $\|e_i\|_{\psi_2} < \infty$ and $\|V_i - E(V_i \mid S_i^{(1)}, A_i^{(1)})\|_{\psi_2} < \infty$. Then, for sufficiently large $n$, there exist some $\gamma_{\alpha_1} > 0$ and $\gamma_{\theta_1} > 0$ such that, with probability at least $1 - \bar c/(n + q + p)$, the estimators $\hat\alpha_1$ and $\hat\theta_1$ satisfy

  1. $\hat\alpha_{1, M_{\alpha_1}^c} = 0$ and $\hat\theta_{1, M_{\theta_1}^c} = 0$;

  2. $\|\hat\alpha_{1, M_{\alpha_1}} - \alpha_{1, M_{\alpha_1}}\|_\infty = O(n^{-\gamma_{\alpha_1}} \log n)$ and $\|\hat\theta_{1, M_{\theta_1}} - \theta_{1, M_{\theta_1}}\|_\infty = O(n^{-\gamma_{\theta_1}} \log n)$.

9. Uniform uncertainty principle and restricted eigenvalue conditions in A-learning

In this section, we establish the UUP and RE conditions in the context of A-learning. In our setting, these two conditions are needed for the random matrices $\Omega_n^{(2)}$ and $\Omega_n^{(1)}$.

For brevity, we only study the UUP and RE conditions for the random matrix $\Omega_n^{(2)}$; those for $\Omega_n^{(1)}$ can be derived similarly. Recall that $M_{\alpha_2}$ denotes the support of $\alpha_2$, $M_{\beta_2} = \mathrm{supp}(\beta_{2,0})$, and $s_{\beta_2} = |M_{\beta_2}|$. We assume that the weak oracle properties of $\hat\alpha_2$ hold, so that with probability at least $1 - \bar c/(n + p)$,

$$\hat\alpha_{2, M_{\alpha_2}^c} = 0 \quad\text{and}\quad \|\hat\alpha_2 - \alpha_2\|_\infty = O(n^{-\gamma_{\alpha_2}} \log n), \tag{9.1}$$

for some $\gamma_{\alpha_2} > 0$. The following lemma establishes the UUP condition for $\Omega_n^{(2)}$.

Lemma 9.1

Assume that the convergence rate of $\hat\alpha_2$ satisfies

$$\|\hat\alpha_2 - \alpha_2\|_2 = O\big(\sqrt{s_{\alpha_2}}\, n^{-\gamma_{\alpha_2}} \log n\big) = O(1),$$

and that the sample size satisfies

$$n \gg \frac{\{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2 \big(s_{\beta_2} \log p + s_{\alpha_2}^2\big)}{\inf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} \rho_{\min}^{s_{\beta_2}}\big(\Omega^{(2)}(\alpha_2)\big)}. \tag{9.2}$$

Then, for any $0 < \theta < 1$, with probability at least $1 - \bar c/(n + p)$, we have

$$\left|\frac{1}{n}\, y^T \Omega_n^{(2)} y - y^T \Omega^{(2)} y\right| \le 2\left\{\theta + \frac{4\omega^2}{n} + 2\omega^2\, \|\hat\alpha_2 - \alpha_2\|_2\, \lambda_{\max}(\Sigma_{M_{\alpha_2} M_{\alpha_2}})\right\} \rho_{\max}^{s_{\beta_2}}(\Sigma)\, \|y\|_2^2, \tag{9.3}$$

for any $y \in \mathbb{R}^p$ with $|\mathrm{supp}(y)| \le s_{\beta_2}$.

Remark 9.1

In our setting, if the following regularity conditions hold,

$$\liminf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} \rho_{\min}^{s_{\alpha_2}}\big(\Omega^{(2)}(\alpha_2)\big) > 0 \quad\text{and}\quad \rho_{\max}^{s_{\beta_2}}(\Sigma) = O(1),$$

then the sample size requirement (9.2) reduces to $n \gg s_{\beta_2} \log p$, since $s_{\alpha_2}^2 = O(n^{2 l_4}) \ll n$.

Remark 9.2

The second term on the right-hand side of (9.3) represents the difference between $y^T \bar\Omega_n^{(2)} y$ and $y^T \Omega^{(2)} y$, where $\bar\Omega_n^{(2)}$ is defined as the expectation of the truncated random matrix

$$\frac{1}{n} \sum_i X_i X_i^T A_i^{(2)} \big\{1 - \pi^{(2)}(X_i, \hat\alpha_2)\big\}\, I\big(\|X_{i, M_{\alpha_2}}\|_\infty \le 2\omega\sqrt{\log n}\big). \tag{9.4}$$

This term vanishes as $n \to \infty$. The third term represents the estimation error of $\hat\alpha_2$. When $\rho_{\max}^{s_{\beta_2}}(\Sigma) < \infty$ and $\lambda_{\max}(\Sigma_{M_{\alpha_2} M_{\alpha_2}})\, \|\hat\alpha_2 - \alpha_2\|_2 \to 0$, (9.3) establishes the UUP condition for $\Omega_n^{(2)}$.
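The sketch below (ours; names are illustrative) assembles the random matrix in (9.4) under a posited logistic propensity model, so that its expectation corresponds to $\bar\Omega_n^{(2)}$:

```python
# Sketch of the truncated random matrix in (9.4); its expectation is the
# matrix Omega_bar_n^(2) discussed in Remark 9.2.
import numpy as np

def truncated_matrix(X, A2, alpha2_hat, M_alpha2, omega):
    """(1/n) sum_i X_i X_i^T A_i^(2) {1 - pi^(2)(X_i, alpha2_hat)}
       * I(||X_{i, M_alpha2}||_inf <= 2 * omega * sqrt(log n))."""
    n = X.shape[0]
    pi2 = 1.0 / (1.0 + np.exp(-X[:, M_alpha2] @ alpha2_hat[M_alpha2]))
    keep = np.max(np.abs(X[:, M_alpha2]), axis=1) <= 2 * omega * np.sqrt(np.log(n))
    w = A2 * (1.0 - pi2) * keep              # per-observation scalar weights
    return (X.T * w) @ X / n                 # (1/n) sum_i w_i X_i X_i^T
```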

Remark 9.3

A key assumption in Lemma 9.1 is the sparsity of $\alpha_2$, which is needed to bound the infinity norm inside the indicator function in (9.4). This extra requirement stems from the involvement of the estimated propensity scores in $\Omega_n^{(2)}$, which adds significant difficulty to the proof of Lemma 9.1.

The RE condition for $\Omega_n^{(2)}$ follows from Lemma 9.1 after some algebra; it is presented below.

Lemma 9.2

For any integer $c_0$, assume that $\|\hat\alpha_2 - \alpha_2\|_2 = O(1)$ and that the sample size satisfies

$$n \gg \frac{\{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2 \big(s_{\beta_2} \log p + s_{\alpha_2}^2\big)}{\inf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} K^2\big(s_{\beta_2}, c_0, \Omega^{(2)}(\alpha_2)\big)}. \tag{9.5}$$

Then, for any $0 < \theta < 1$ and sufficiently large $n$, with probability at least $1 - \bar c/(n + p)$, we have

$$K\big(s_{\beta_2}, c_0, \Omega_n^{(2)}\big) > (1 - \theta) \inf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} K\big(s_{\beta_2}, c_0, \Omega^{(2)}(\alpha_2)\big).$$

Remark 9.4

The sample size requirement (9.5) is stronger than (9.2). To see this, note that for any positive semidefinite matrix $\Psi$ and any positive integers $s$ and $c_0$, we have

$$K^2(s, c_0, \Psi) \le K^2(s, 0, \Psi) = \rho_{\min}^{s}(\Psi).$$
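Computing $K(s, c_0, \Psi)$ exactly is combinatorial. The sketch below gives a crude Monte Carlo upper bound by random search over the cone in a Bickel–Ritov–Tsybakov-style definition of the restricted eigenvalue; it is an illustration under our reading of that definition, not a computation used in the paper:

```python
# Crude random-search upper bound on the restricted eigenvalue K(s, c0, Psi):
# minimize sqrt(y^T Psi y) / ||y_S||_2 over y with ||y_{S^c}||_1 <= c0 ||y_S||_1.
import numpy as np

def re_upper_bound(Psi, s, c0, n_draws=20_000, seed=0):
    rng = np.random.default_rng(seed)
    p = Psi.shape[0]
    best = np.inf
    for _ in range(n_draws):
        S = rng.choice(p, size=s, replace=False)      # candidate support set
        y = np.zeros(p)
        y[S] = rng.normal(size=s)
        off = rng.normal(size=p)
        off[S] = 0.0
        # rescale off-support part so that ||y_{S^c}||_1 = c0 * ||y_S||_1
        off *= c0 * np.abs(y[S]).sum() / max(np.abs(off).sum(), 1e-12)
        y += off
        best = min(best, np.sqrt(max(y @ Psi @ y, 0.0)) / np.linalg.norm(y[S]))
    return best

# Example: for Psi = I, the bound should be close to 1 from above.
print(re_upper_bound(np.eye(50), s=5, c0=1, n_draws=2_000))
```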

10. Discussion

10.1. Post-selection inference

As pointed out by one of the referees, the main goal of constructing optimal DTRs is to find treatments that are significantly superior to other treatment options. This requires addressing a post-selection inference issue, i.e., the problem of conducting inference on the estimated optimal value function (or on the difference between the estimated value and the value function under other treatment options). In the fixed-dimension setting, we can use either the empirical average of the advantage function (Murphy, 2003) or the augmented inverse propensity score weighted estimator (AIPWE; Zhang et al., 2012) to estimate the optimal value function. Both types of estimators are asymptotically normally distributed. However, the inference based on the advantage function may not be valid in high dimensions, because when the number of predictors is large, the parameter estimates in the contrast function may not have the oracle property (i.e., model selection consistency and asymptotic normality).

For a single-stage study, assume a linear interaction form $X^T \beta_0$ for the contrast function. Under certain conditions, we can show that the AIPWE is asymptotically normal even under NP-dimensionality if (i) $\|\hat\beta - \beta_0\|_2 = o_p(n^{-1/4})$, and (ii) with probability going to 1, $\|\hat\beta_{M_\beta^c} - \beta_{0, M_\beta^c}\|_1 \le c_0 \|\hat\beta_{M_\beta} - \beta_{0, M_\beta}\|_1$ for some constant $c_0$, where $M_\beta$ is the support of $\beta_0$. For our penalized A-learning estimator, Assumption (i) can be achieved under certain conditions on the dimension of the covariates, the sample size, and the sparsity of the parameters in the contrast, baseline and propensity score functions. Assumption (ii) is typically satisfied for Lasso, Dantzig and folded-concave type estimators. Similar to Theorem 6.1, we can show that our estimator satisfies $\|\hat\beta_{M_\beta^c} - \beta_{0, M_\beta^c}\|_1 \le \|\hat\beta_{M_\beta} - \beta_{0, M_\beta}\|_1$ with probability going to 1; the asymptotic normality of the AIPWE therefore follows. A standard error for the value estimator can be obtained as in Zhang et al. (2012). Alternatively, one can use the one-step online estimator of Luedtke and van der Laan (2016), although its asymptotic variance will be larger since it does not use all the data to construct the estimator. In summary, it is important and interesting to develop statistical inference for the estimated value function under the obtained optimal treatment regime in high dimensions, but this is beyond the scope of the current paper.
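As a concrete single-stage illustration of the AIPWE discussed above, the following sketch computes the value estimate and a plug-in standard error; the estimator form follows Zhang et al. (2012), while the function names and the pre-fitted nuisance models m_hat and pi_hat are assumptions of the sketch:

```python
# Hedged sketch of the single-stage AIPW value estimator and its plug-in SE.
import numpy as np

def aipwe_value(Y, A, X, rule, m_hat, pi_hat):
    """AIPW estimate of E{Y*(d)} for a binary-treatment rule d(x) in {0, 1}.

    rule:   function mapping X to recommended treatments in {0, 1}
    m_hat:  fitted outcome model, m_hat(X, a) approximates E(Y | X, A = a)
    pi_hat: fitted propensity model, pi_hat(X) approximates P(A = 1 | X)
    """
    d = rule(X)
    p_d = np.where(d == 1, pi_hat(X), 1.0 - pi_hat(X))  # P(A = d(X) | X)
    follows = (A == d).astype(float)                     # followed the rule?
    m_d = m_hat(X, d)
    phi = follows / p_d * (Y - m_d) + m_d                # AIPW pseudo-observations
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(Y))
```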

10.2. Tuning parameter selection

A Bayesian information criterion (BIC) is used to tune the penalty functions. BIC has been widely used for selecting the tuning parameter in model selection when the goal is prediction. In high-dimensional regression, Chen and Chen (2008) proposed an extended BIC for model selection and showed that it is consistent when the number of predictors grows polynomially in the sample size. Fan and Tang (2013) proposed a similar criterion and showed its consistency when the number of predictors is of a non-polynomial order of the sample size. When the goal is to select treatment effect modifiers, Lu et al. (2011) also used a BIC-type criterion, which showed good empirical performance. This motivated us to use a similar BIC-type criterion for selecting the tuning parameter in our method. Our simulations demonstrate that the proposed BIC-type criterion works well empirically. We conjecture that, following arguments similar to the proofs of Theorem 1 in Chen and Chen (2008) and Theorem 3 in Fan and Tang (2013), our BIC-type criterion can be shown to be consistent for selecting the important variables in the contrast function. This is another interesting topic that warrants further investigation.
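A minimal sketch of such a BIC-type criterion is given below; the exact penalty is an assumption of the sketch (the $\log p$ term mimics the extended-BIC correction of Chen and Chen, 2008), and fit_at is a hypothetical routine returning the residual sum of squares and the number of selected variables at a given tuning value:

```python
# Sketch of a BIC-type tuning criterion over a grid of penalty levels.
import numpy as np

def bic_type(rss, df, n, p):
    """Smaller is better; df is the number of selected variables.
    The extra df * log(p) term follows the extended-BIC idea for large p."""
    return n * np.log(rss / n) + df * (np.log(n) + 2.0 * np.log(p))

# Hypothetical usage, with fit_at(lam) -> (rss, df) from the penalized fit:
# best_lam = min(lambda_grid, key=lambda lam: bic_type(*fit_at(lam), n, p))
```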

10.3. Extensions to multiple stages and general models

In this paper, we mainly focus on a two-stage study. Extensions of the results to three-stage studies are provided in the supplementary article. Establishing these results raises additional challenges, since potential misspecification of the contrast functions in the previous two stages can accumulate, and stronger assumptions are needed to guarantee consistency of the parameter estimates. Readers may refer to the supplementary article for details.

For technical convenience, we assume a linear interaction form for the contrast function at the last stage. More general results for a misspecified contrast function can be derived similarly to the three-stage case discussed in the supplementary article.

Supplementary Material

Supplement

Acknowledgments

We thank the editor, the AE and two referees for providing helpful suggestions that significantly improved the quality of the paper. The STAR*D data are provided by National Institute of Mental Health. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.

Contributor Information

Chengchun Shi, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

Alin Fan, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

Rui Song, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

Wenbin Lu, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

References

  1. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann Statist. 2009;37:1705–1732.
  2. Candès E, Tao T. Rejoinder: "The Dantzig selector: statistical estimation when p is much larger than n". Ann Statist. 2007;35:2392–2404.
  3. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19:317–343. doi: 10.1177/0962280209105013.
  4. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95:759–771.
  5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  6. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, et al. Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Psychiatric Clinics of North America. 2003;26:457–494. doi: 10.1016/s0193-953x(02)00107-7.
  7. Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann Statist. 2016;44:713–742. doi: 10.1214/15-AOS1384.
  8. Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Statist. 2009;37:3498–3528.
  9. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Monographs on Statistics and Applied Probability. London: Chapman & Hall; 1989.
  10. Mendelson S, Pajor A, Tomczak-Jaegermann N. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom Funct Anal. 2007;17:1248–1282.
  11. Mendelson S, Pajor A, Tomczak-Jaegermann N. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constr Approx. 2008;28:277–289.
  12. Milman VD, Pajor A. Isotropic position and inertia ellipsoids and zonoids of the unit ball of a normed n-dimensional space. In: Geometric Aspects of Functional Analysis. Springer; 1989. pp. 64–104.
  13. Milman V, Pajor A. Regularization of star bodies by random hyperplane cut off. Studia Math. 2003;159:247–261.
  14. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65:331–366.
  15. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Statist. 2011;39:1180–1210. doi: 10.1214/10-AOS864.
  16. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011.
  17. Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM, et al. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials. 2004;25:119–142. doi: 10.1016/s0197-2456(03)00112-0.
  18. Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc Ser B Stat Methodol. 2011;73:273–282.
  19. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
  20. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
  21. Zhang B, Tsiatis AA, Laber EB, Davidian M. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika. 2013;100:681–694. doi: 10.1093/biomet/ast014.
  22. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107:1106–1118. doi: 10.1080/01621459.2012.695674.
  23. Zhao YQ, Zeng D, Laber EB, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. J Amer Statist Assoc. 2015;110:583–598. doi: 10.1080/01621459.2014.937488.
  24. Zhou S. Restricted eigenvalue conditions on subgaussian random matrices. 2009. arXiv:0912.4045.
  25. Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR. Residual weighted learning for estimating individualized treatment rules. J Amer Statist Assoc. 2015; just accepted.
