Author manuscript; available in PMC: 2019 Jun 1.
Published in final edited form as: Ann Stat. 2018 May 3;46(3):925–957. doi: 10.1214/17-AOS1570

HIGH-DIMENSIONAL A-LEARNING FOR OPTIMAL DYNAMIC TREATMENT REGIMES

Chengchun Shi 1, Ailin Fan 2, Rui Song 3, Wenbin Lu 4
PMCID: PMC5966293  NIHMSID: NIHMS915736  PMID: 29805186

Abstract

Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many complex diseases, such as cancer, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive strategy is referred to as a dynamic treatment regime. A major challenge in deriving an optimal dynamic treatment regime arises when an extraordinarily large number of prognostic factors, such as a patient's genetic information, demographic characteristics, medical history and clinical measurements over time, are available, but not all of them are necessary for making treatment decisions. This makes variable selection an emerging need in precision medicine.

In this paper, we propose a penalized multi-stage A-learning method for deriving the optimal dynamic treatment regime when the number of covariates is of the non-polynomial (NP) order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector, which directly penalizes the A-learning estimating equations. Oracle inequalities for the proposed estimators of the parameters in the optimal dynamic treatment regime, and error bounds on the difference between the value functions of the estimated optimal dynamic treatment regime and the true optimal dynamic treatment regime, are established. The empirical performance of the proposed approach is evaluated by simulations and illustrated with an application to data from the STAR*D study.

Keywords and phrases: A-learning, Dantzig selector, NP-dimensionality, Model misspecification, Optimal dynamic treatment regime, Oracle inequality

1. Introduction

Precision medicine is a medical paradigm that focuses on finding the most effective treatment decision based on individual patient information. For many chronic diseases, such as cancer, cardiovascular disease and diabetes, treatment decisions need to be tailored over time according to patients' responses to previous treatments. Such an adaptive treatment strategy is referred to as a dynamic treatment regime. Formally speaking, a dynamic treatment regime is a sequence of decision rules dictating how the treatment is to be tailored over time to an individual's status. The optimal dynamic treatment regime is defined as the one that yields the most favorable outcome on average.

Various methods have been proposed to estimate the optimal dynamic treatment regime, including Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010) and A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003). Both Q-learning and A-learning rely on a backward induction algorithm to find the optimal dynamic treatment regime; however, Q-learning models the conditional mean of the outcome given predictors and treatment, while A-learning directly models the contrast function that is sufficient for treatment decisions. In particular, A-learning has the so-called double robustness property: when either the baseline mean function or the propensity score model is correctly specified, the resulting A-learning estimator of the contrast function is consistent.

With the fast development of new technology, it has become possible to gather an extraordinarily large number of prognostic factors for each individual, such as a patient's genetic information, demographic characteristics, medical history and clinical measurements over time. For such big data, it is important to make effective use of the information that is relevant to optimal individualized treatment decisions, which makes variable selection an emerging need for implementing precision medicine. In addition, variable selection is an essential tool for making inference in problems where the number of covariates is comparable to or much larger than the sample size. There have been extensive developments of penalized regression methods for variable selection in prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the Dantzig selector (Candès and Tao, 2007), to name a few. In contrast to most penalized regression methods, which add a penalty term to an objective function, the Dantzig selector operates directly on estimating equations.

Although there is a large amount of work on developing variable selection methods for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime. When the number of covariates is fixed, Lu, Zhang and Zeng (2013) introduced a penalized least squares regression framework and established the oracle property of the estimator, which is robust against misspecification of the conditional mean function. Shi, Song and Lu (2015) extended this result to the setting allowing NP-dimensionality of covariates. However, all these works consider only studies with a single treatment decision. When moving to multiple-stage studies, the asymptotic properties of the estimated optimal dynamic treatment regime are much harder to derive, since one needs to handle model misspecification of the contrast functions in the presence of NP-dimensionality of covariates. Moreover, these methods are not doubly robust.

In this paper, we propose a penalized A-learning method for deriving the optimal dynamic treatment regime when the number of covariates is of NP-order of the sample size. To preserve the double robustness property of the A-learning method, we adopt the Dantzig selector (Candès and Tao, 2007), which directly penalizes the A-learning estimating equations. The technical challenges and advances of the proposed estimators are described as follows.

First, to prove the theoretical properties of the Dantzig estimator in the linear regression setting, the uniform uncertainty principle (UUP, Candès and Tao, 2007) or the restricted eigenvalue condition (RE, Bickel, Ritov and Tsybakov, 2009) is required on the Gram matrix $X^TX$, where X stands for the design matrix. The UUP condition essentially requires that every principal submatrix with the number of rows or columns less than some specified s behaves like an orthonormal system. The RE condition is the weakest, and hence the most general, condition in the literature ensuring the theoretical properties of the Lasso and Dantzig estimators. A close connection between these two conditions is discussed in Bickel, Ritov and Tsybakov (2009). In the random design case, Candès and Tao (2007) studied the UUP condition for Gaussian, Bernoulli and Fourier ensembles. Mendelson, Pajor and Tomczak-Jaegermann (2007, 2008) obtained a similar result for a more general class of design matrices, the isotropic subgaussian matrices, based on some empirical process results. These results were further extended by Zhou (2009), where the UUP and RE conditions are developed for subgaussian ensembles with a correlated covariance structure. In the proposed penalized A-learning method, however, such conditions are required on matrices involving estimates, such as

\[
X^T \mathrm{diag}\{A \circ (1 - \hat{\pi})\} X, \tag{1.1}
\]

where $A = (A_1, \ldots, A_n)^T$ denotes the vector of treatments received by the n subjects, $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_n)^T$ denotes the corresponding estimated propensity scores, and ∘ denotes the componentwise product operator. The presence of $\hat{\pi}$ in (1.1) adds extraordinary difficulties in establishing theoretical properties of such a random matrix. We establish the UUP and RE conditions under a proper convergence rate of $\hat{\pi}$, which provides a new theoretical framework for studying random matrices that involve estimates of unknown parameters.

Second, in the proposed penalized A-learning method, we need to estimate the baseline mean function and the propensity score model with NP-dimensionality of covariates. We adopt penalized regressions with folded-concave penalties such as SCAD: a linear regression for the baseline mean function and a logistic regression for the propensity score model. Several difficulties need to be addressed in deriving the theoretical properties of the resulting penalized estimators. First, to our knowledge, penalized regressions with folded-concave penalties have seldom been studied in a random design setting. A major difficulty in adapting the existing results for the fixed design case to the random design case is controlling the maximum eigenvalues of certain random matrices,

\[
\max_j \lambda_{\max}\left[X_M^T \mathrm{diag}(|X^j|) X_M\right],
\]

where $\lambda_{\max}[K]$ denotes the maximum eigenvalue of a matrix K, M is a given subset of $\{1, \ldots, p\}$, $X^j$ denotes the jth column of the matrix X, and $X_M$ the submatrix formed by the columns in M. Such a problem is not standard since the matrix $X_M^T\mathrm{diag}(|X^j|)X_M$ does not possess a subexponential tail. We derive concentration inequalities for such random matrices and for sums of subexponential and subgaussian random variables. Based on these results, we establish the weak oracle properties (Lv and Fan, 2009), i.e., sign consistency and the $L_\infty$ convergence rate of the estimators, under subgaussian ensembles, which is one of our major technical contributions. Moreover, the posited models for the baseline mean function or the propensity score may be misspecified. Therefore, the derivation of the asymptotic properties needs to take into account model misspecification with NP-dimensionality of covariates, which is challenging.

Third, a challenge in extending the results for a single treatment decision to sequential treatment decisions is that the contrast functions are likely to be misspecified in a backward induction algorithm such as A-learning. This, together with the NP-dimensionality of covariates, makes it extremely hard to study theoretical properties of the value function under the estimated optimal dynamic treatment regime. We overcome this difficulty by first defining population-level least favorable parameters in the misspecified contrast functions. Moreover, we derive error bounds for the corresponding estimates under model misspecification, which in turn lead to an error bound for the difference between the value functions of the estimated optimal dynamic treatment regime and the underlying true optimal dynamic treatment regime.

The remainder of the paper is organized as follows. We introduce the proposed penalized A-learning method in Section 2. Some implementation issues are addressed in Section 3, followed by simulation results in Section 4. We apply our method to data from the STAR*D study in Section 5. Section 6 studies the error bounds of the penalized A-learning estimator and the difference between the value functions of the estimated optimal regime and the true optimal regime at the second stage. Section 7 establishes such results for the estimates at the first stage. Section 8 presents the weak oracle properties of the penalized estimators in the propensity score and baseline mean models under a random design setting. Section 9 discusses the UUP and RE conditions in the context of A-learning. All technical conditions, lemmas and proofs are given in the Appendix.

2. Penalized A-Learning

For simplicity of presentation, we only consider a two-stage study where binary treatment decisions are made at time points t1 and t2. The data of a subject can be summarized as

\[
O = (S^{(1)}, A^{(1)}, S^{(2)}, A^{(2)}, Y), \tag{2.1}
\]

where $S^{(1)}$ denotes the covariates collected prior to $t_1$, $A^{(1)} \in \{0, 1\}$ is the treatment received at time $t_1$, $S^{(2)}$ denotes the intermediate covariates collected between time points $t_1$ and $t_2$, $A^{(2)} \in \{0, 1\}$ is the treatment received at time $t_2$, and Y is the final outcome of interest. It is assumed that a larger value of Y stands for a better clinical outcome. Denote by $Y(a_1, a_2)$ the potential outcome of a patient if he/she were given $a_1$ as the first treatment and $a_2$ as the second. If a patient follows a given regime $(d_1, d_2)$, we can write the potential outcome

\[
Y(d_1, d_2) = \sum_{a_1 \in \{0,1\}}\sum_{a_2 \in \{0,1\}} Y(a_1, a_2)\, I(d_1 = a_1, d_2 = a_2),
\]

where I(·) denotes the indicator function. Our goal is to find a dynamic treatment regime to maximize the mean potential outcome. Throughout the paper, we make the commonly used assumptions for studying dynamic treatment regimes: stable unit treatment value assumption and sequential randomization assumption (Murphy, 2003).

The observed data from n subjects can be summarized as

\[
O_i = (S_i^{(1)}, A_i^{(1)}, S_i^{(2)}, A_i^{(2)}, Y_i), \qquad i = 1, \ldots, n,
\]

which are assumed to be independently and identically distributed copies of O. We assume the following semiparametric regression model for Y:

\[
Y_i = h^{(2)}(X_i) + A_i^{(2)} C^{(2)}(X_i) + e_i, \tag{2.2}
\]

where $X_i = ((S_i^{(1)})^T, A_i^{(1)}, (S_i^{(2)})^T)^T$ is the vector of covariates for the ith patient, $h^{(2)}(\cdot)$ is an unspecified baseline mean function, $C^{(2)}(\cdot)$ the contrast function, and $e_i$ is an independent error with mean 0. The design matrix is denoted by $X = (X_1, \ldots, X_n)^T$.

Define

\[
V_i = \max_{A_i^{(2)}} E(Y_i \mid S_i^{(1)}, A_i^{(1)}, S_i^{(2)}, A_i^{(2)}) = h^{(2)}(X_i) + C^{(2)}(X_i)\, I(C^{(2)}(X_i) > 0).
\]

At the first stage, we consider the following conditional mean model for $V_i$:

\[
E(V_i \mid S_i^{(1)}, A_i^{(1)}) = h^{(1)}(S_i^{(1)}) + A_i^{(1)} C^{(1)}(S_i^{(1)}), \tag{2.3}
\]

where $h^{(1)}(\cdot)$ and $C^{(1)}(\cdot)$ are functions of the baseline covariates. To simplify the notation, we use the shorthand $S_i$ for $S_i^{(1)}$ and let $S = (S_1, \ldots, S_n)^T$ be the design matrix at baseline.

It can be shown that the optimal dynamic treatment regime is given by $d^{opt} = (d_1^{opt}, d_2^{opt})$, where

\[
d_1^{opt}(S_i) = I\{C^{(1)}(S_i) > 0\} \quad\text{and}\quad d_2^{opt}(X_i) = I\{C^{(2)}(X_i) > 0\}. \tag{2.4}
\]

To estimate $d_1^{opt}$ and $d_2^{opt}$, we posit the following models for $C^{(1)}(\cdot)$, $C^{(2)}(\cdot)$, $h^{(1)}(\cdot)$, $h^{(2)}(\cdot)$, $\pi^{(1)}(\cdot)$ and $\pi^{(2)}(\cdot)$:

\[
\pi^{(1)}(s, \alpha_1) = \exp(s^T\alpha_1)/\{1 + \exp(s^T\alpha_1)\}, \tag{2.5}
\]
\[
\pi^{(2)}(x, \alpha_2) = \exp(x^T\alpha_2)/\{1 + \exp(x^T\alpha_2)\}, \tag{2.6}
\]
\[
h^{(1)}(s) = s^T\theta_1, \quad h^{(2)}(x) = x^T\theta_2, \quad C^{(1)}(s) = s^T\beta_1, \quad C^{(2)}(x) = x^T\beta_2, \tag{2.7}
\]

and

\[
\pi^{(1)}(s) = \Pr(A_i^{(1)} = 1 \mid S_i = s) \quad\text{and}\quad \pi^{(2)}(x) = \Pr(A_i^{(2)} = 1 \mid X_i = x).
\]

The models in (2.5)–(2.7) can be misspecified; however, we require that either $h^{(j)}$ or $\pi^{(j)}$ is correct for j = 1, 2. For simplicity, we require $C^{(2)}$ to be correctly specified; the general case where $C^{(2)}$ is misspecified can be discussed similarly. We use backward induction to estimate the optimal dynamic treatment regime. At the second decision point, we first estimate the parameters in the posited propensity score and baseline mean models using penalized regressions. Specifically, define

\[
\hat{\alpha}_2 = \operatorname*{arg\,min}_{\alpha_2 \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n\left[\log\{1 + \exp(X_i^T\alpha_2)\} - A_i^{(2)}X_i^T\alpha_2\right] + \sum_{j=1}^p \rho_1^{(2)}(|\alpha_{2j}|, \lambda_{1n}^{(2)}),
\]

and

\[
\hat{\theta}_2 = \operatorname*{arg\,min}_{\theta_2 \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n(1 - A_i^{(2)})(Y_i - X_i^T\theta_2)^2 + \sum_{j=1}^p \rho_2^{(2)}(|\theta_{2j}|, \lambda_{2n}^{(2)}),
\]

where $\alpha_2 = (\alpha_{21}, \ldots, \alpha_{2p})^T$, $\theta_2 = (\theta_{21}, \ldots, \theta_{2p})^T$, $\rho_1^{(2)}$ and $\rho_2^{(2)}$ belong to the class of folded-concave penalty functions (Lv and Fan, 2009), such as SCAD (Fan and Li, 2001), and $\lambda_{1n}^{(2)}$ and $\lambda_{2n}^{(2)}$ are the associated regularization parameters.

Next, we estimate $\beta_2$ in (2.2) using the Dantzig selector based on the A-learning estimating function (Murphy, 2003), defined by

\[
\hat{\beta}_2 = \operatorname*{arg\,min}_{\beta_2 \in \Lambda^{(2)}} \|\beta_2\|_1, \tag{2.8}
\]

where

\[
\Lambda^{(2)} = \left\{\beta_2 \in \mathbb{R}^p : \left\|\frac{1}{n}X^T\mathrm{diag}(A^{(2)} - \hat{\pi}^{(2)})\{Y - X\hat{\theta}_2 - A^{(2)} \circ (X\beta_2)\}\right\|_\infty \le \lambda_{3n}^{(2)}\right\},
\]
\[
Y = (Y_1, \ldots, Y_n)^T, \quad A^{(2)} = (A_1^{(2)}, \ldots, A_n^{(2)})^T, \quad \hat{\pi}^{(2)} = (\pi^{(2)}(X_1, \hat{\alpha}_2), \ldots, \pi^{(2)}(X_n, \hat{\alpha}_2))^T,
\]

and $\lambda_{3n}^{(2)}$ is the regularization parameter.
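For concreteness, (2.8) is a linear program in $\beta_2$ after the standard splitting $\beta_2 = u - v$ with $u, v \ge 0$. The following minimal Python sketch (the function name and interface are ours, not from the paper) solves the stage-2 Dantzig selector with scipy.optimize.linprog; the same routine applies at the first stage with $(S, A^{(1)}, \hat{V}, \hat{\theta}_1, \hat{\pi}^{(1)})$.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_alearning(X, A, Y, theta_hat, pi_hat, lam):
    """Sketch of the Dantzig selector (2.8): minimize ||beta||_1 subject to
    ||(1/n) X^T diag(A - pi_hat){Y - X theta_hat - A o (X beta)}||_inf <= lam."""
    n, p = X.shape
    Z = X * (A - pi_hat)[:, None]        # rows X_i scaled by A_i - pi_hat_i
    b = Z.T @ (Y - X @ theta_hat) / n    # constant part of the estimating function
    M = Z.T @ (X * A[:, None]) / n       # linear part acting on beta
    # beta = u - v with u, v >= 0; the constraint |b - M beta| <= lam is linear
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([M, -M]), np.hstack([-M, M])])
    b_ub = np.concatenate([lam + b, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p),
                  method="highs")
    return res.x[:p] - res.x[p:]
```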

To estimate the regime at the first decision point, we define the pseudo-outcome $\hat{V}_i$ using the advantage function (Murphy, 2003) by

\[
\hat{V}_i = Y_i + X_i^T\hat{\beta}_2\{I(X_i^T\hat{\beta}_2 > 0) - A_i^{(2)}\}. \tag{2.9}
\]

Similarly, define

\[
\hat{\alpha}_1 = \operatorname*{arg\,min}_{\alpha_1 \in \mathbb{R}^q} \frac{1}{n}\sum_{i=1}^n\left[\log\{1 + \exp(S_i^T\alpha_1)\} - A_i^{(1)}S_i^T\alpha_1\right] + \sum_{j=1}^q \rho_1^{(1)}(|\alpha_{1j}|, \lambda_{1n}^{(1)}),
\]

and

\[
\hat{\theta}_1 = \operatorname*{arg\,min}_{\theta_1 \in \mathbb{R}^q} \frac{1}{n}\sum_{i=1}^n(1 - A_i^{(1)})(\hat{V}_i - S_i^T\theta_1)^2 + \sum_{j=1}^q \rho_2^{(1)}(|\theta_{1j}|, \lambda_{2n}^{(1)}),
\]

where $\alpha_1 = (\alpha_{11}, \ldots, \alpha_{1q})^T$, $\theta_1 = (\theta_{11}, \ldots, \theta_{1q})^T$, and $\rho_1^{(1)}$ and $\rho_2^{(1)}$ are folded-concave penalty functions. Then, we estimate $\beta_1$ in (2.3) by

\[
\hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_1 \in \Lambda^{(1)}} \|\beta_1\|_1, \tag{2.10}
\]

where

\[
\Lambda^{(1)} = \left\{\beta_1 \in \mathbb{R}^q : \left\|\frac{1}{n}S^T\mathrm{diag}(A^{(1)} - \hat{\pi}^{(1)})\{\hat{V} - S\hat{\theta}_1 - A^{(1)} \circ (S\beta_1)\}\right\|_\infty \le \lambda_{3n}^{(1)}\right\},
\]
\[
\hat{V} = (\hat{V}_1, \ldots, \hat{V}_n)^T, \quad A^{(1)} = (A_1^{(1)}, \ldots, A_n^{(1)})^T \quad\text{and}\quad \hat{\pi}^{(1)} = (\pi^{(1)}(S_1, \hat{\alpha}_1), \ldots, \pi^{(1)}(S_n, \hat{\alpha}_1))^T.
\]

The estimated optimal dynamic treatment regime is given by

\[
\hat{d}_1(S_i) = I(\hat{\beta}_1^TS_i > 0) \quad\text{and}\quad \hat{d}_2(X_i) = I(\hat{\beta}_2^TX_i > 0). \tag{2.11}
\]
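Putting the pieces together, the backward induction of this section can be sketched as below. Purely for brevity, L1-penalized logistic and linear fits stand in for the folded-concave (SCAD) penalized regressions of the paper, and dantzig_alearning is the LP sketch given above; this is an illustrative outline, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

def fit_stage(Z, A, R, lam):
    """One backward-induction step: propensity model, baseline fit on controls,
    then the Dantzig A-learning step. L1 penalties are stand-ins for SCAD."""
    ps = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(Z, A)
    pi_hat = ps.predict_proba(Z)[:, 1]
    theta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(Z[A == 0], R[A == 0]).coef_
    return dantzig_alearning(Z, A, R, theta_hat, pi_hat, lam), pi_hat, theta_hat

def penalized_alearning(S, A1, X, A2, Y, lam2, lam1):
    beta2, _, _ = fit_stage(X, A2, Y, lam2)            # stage 2
    xb2 = X @ beta2
    V = Y + xb2 * ((xb2 > 0).astype(float) - A2)       # pseudo-outcome (2.9)
    beta1, _, _ = fit_stage(S, A1, V, lam1)            # stage 1
    return beta1, beta2                                 # decision rules (2.11)
```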

3. Some Implementation Issues

When the tuning parameters in the optimization problems (2.8) and (2.10) are fixed, the Dantzig selector can be solved by a standard linear programming algorithm. One issue in implementing the Dantzig selector is the choice of the tuning parameters. We use a BIC criterion for selecting them. For the Dantzig selector (2.8), $\lambda_{3n}^{(2)}$ is chosen as the minimizer of

\[
\mathrm{BIC}(\lambda) = n\log\{\mathrm{RSS}(\lambda)/n\} + d(\lambda)\{\log(n) + \log(p+1)\}, \tag{3.1}
\]

where $\mathrm{RSS}(\lambda) = \sum_{i=1}^n\left[\{A_i^{(2)} - \pi^{(2)}(X_i, \hat{\alpha}_2)\}(Y_i - X_i^T\hat{\theta}_2 - A_i^{(2)}X_i^T\hat{\beta}_2)\right]^2$ and $d(\lambda)$ is the number of nonzero components of $\hat{\beta}_2$. A similar BIC criterion was proposed by Chen and Chen (2008). We use an analogous criterion for choosing $\lambda_{3n}^{(1)}$.
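In practice, (3.1) can simply be evaluated over a grid of candidate tuning values; a minimal sketch (names are ours):

```python
import numpy as np

def select_lambda_bic(lams, fit_fn, X, A2, Y, pi_hat, theta_hat):
    """Pick lambda_{3n}^{(2)} minimizing BIC (3.1); fit_fn(lam) should return
    the Dantzig estimate beta_hat for that tuning value."""
    n, p = X.shape
    best_bic, best_lam, best_beta = np.inf, None, None
    for lam in lams:
        beta = fit_fn(lam)
        resid = (A2 - pi_hat) * (Y - X @ theta_hat - A2 * (X @ beta))
        bic = n * np.log(np.sum(resid ** 2) / n) \
              + np.count_nonzero(beta) * (np.log(n) + np.log(p + 1))
        if bic < best_bic:
            best_bic, best_lam, best_beta = bic, lam, beta
    return best_lam, best_beta
```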

It has been observed that the Dantzig estimators may underestimate the true values of the parameters due to shrinkage (Candès and Tao, 2007). Therefore, we use a two-step procedure for practical implementation, referred to as the Gauss-Dantzig selector in Candès and Tao (2007). Specifically, in the first step, we apply the proposed penalized A-learning to select the variables important for making an optimal decision, i.e., those variables with nonzero estimated coefficients. Then, in the second step, their coefficients are re-estimated by solving the unpenalized A-learning estimating equations with the important variables only.
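The refitting step amounts to solving the unpenalized A-learning estimating equation restricted to the selected support, which is a small linear system; a sketch under our notation:

```python
import numpy as np

def gauss_dantzig_refit(X, A2, Y, pi_hat, theta_hat, beta_dantzig):
    """Second step of the Gauss-Dantzig selector: re-solve the unpenalized
    A-learning estimating equation on the support selected in step one."""
    sel = np.flatnonzero(beta_dantzig)
    Zs = X[:, sel] * (A2 - pi_hat)[:, None]
    M = Zs.T @ (X[:, sel] * A2[:, None])     # |sel| x |sel| linear system
    rhs = Zs.T @ (Y - X @ theta_hat)
    beta = np.zeros_like(beta_dantzig)
    beta[sel] = np.linalg.solve(M, rhs)
    return beta
```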

4. Simulation Studies

4.1. Settings

To evaluate the numerical performance of the proposed penalized A-learning method, we consider simulation studies with two treatment decision points, based on the following model:

\[
Y = A^{(1)}A^{(2)} + A^{(2)}(\beta_2^TS^{(1)} + S^{(2)} + \beta_0) + A^{(1)}(\beta_1^TS^{(1)}) + \varepsilon, \tag{4.1}
\]

where $A^{(j)}$, j = 1, 2, is the treatment given at the jth stage, $S^{(j)}$, j = 1, 2, denotes the covariate information collected before the jth treatment is given, and Y is the final response of interest. The random error ε follows a normal distribution with mean 0 and variance 0.25. Here, the covariates $S^{(1)} = (S_1^{(1)}, \ldots, S_q^{(1)})^T$ follow a multivariate normal distribution with mean 0 and variance $I_q$. In addition, the intermediate covariate $S^{(2)}$ is a scalar, generated as $S^{(2)} = S_1^{(1)} + A^{(1)} + A^{(1)}S_1^{(1)} + e$, where e follows a normal distribution with mean 0 and variance 0.25.
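A minimal generator for model (4.1) under the constant-propensity settings (a sketch; the function name is ours):

```python
import numpy as np

def simulate_41(n, q, beta1, beta2, beta0=0.0, seed=None):
    """Draw one dataset from model (4.1) with P(A^(1)=1) = P(A^(2)=1) = 0.5."""
    rng = np.random.default_rng(seed)
    S1 = rng.standard_normal((n, q))                   # S^(1) ~ N(0, I_q)
    A1 = rng.binomial(1, 0.5, n)
    S2 = S1[:, 0] + A1 + A1 * S1[:, 0] + rng.normal(0, 0.5, n)
    A2 = rng.binomial(1, 0.5, n)
    eps = rng.normal(0, 0.5, n)                        # variance 0.25
    Y = A1 * A2 + A2 * (S1 @ beta2 + S2 + beta0) + A1 * (S1 @ beta1) + eps
    return S1, A1, S2, A2, Y
```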

We set $\beta_0 = 0$. Based on model (4.1), the optimal treatment regime at stage 2 is $I(A^{(1)} + \beta_2^TS^{(1)} + S^{(2)} > 0)$. Following this optimal treatment regime at stage 2, the Q-function at stage 1 is given by

\[
Q_1(S^{(1)}, A^{(1)}) = E\{(A^{(1)} + \beta_2^TS^{(1)} + S^{(2)})_+ \mid S^{(1)}, A^{(1)}\} + A^{(1)}(\beta_1^TS^{(1)}) = \frac{1}{\sqrt{8\pi}}\exp(-2\mu^2) + \mu\{1 - \Phi(-2\mu)\} + A^{(1)}(\beta_1^TS^{(1)}),
\]

where $\mu = A^{(1)} + \beta_2^TS^{(1)} + S_1^{(1)} + A^{(1)} + A^{(1)}S_1^{(1)}$ and $a_+ = (|a| + a)/2$. Therefore, the contrast function is $C(S^{(1)}) = Q_1(S^{(1)}, 1) - Q_1(S^{(1)}, 0)$, and the optimal treatment regime at stage 1 is $I\{C(S^{(1)}) > 0\}$.

To evaluate the double robustness of the proposed method, we consider a variety of scenarios with correctly specified and misspecified baseline mean and/or propensity score models. At stage 2, a linear model with covariates $S^{(1)}$, $S^{(2)}$ and $A^{(1)}$ is fitted for the baseline mean function, while the true baseline mean function is $h^{(2)}(X) = A^{(1)}(\beta_1^TS^{(1)})$. We choose $\beta_1 = 0_q$, for which the baseline mean function is correctly specified, and $\beta_1 = (0_4, 1, -1, 0_{q-6})^T$, for which it is misspecified. At stage 1, a linear model with covariates $S^{(1)}$ is fitted for the baseline mean function, which is always misspecified. Logistic models are used for estimating the propensity scores; they are correctly specified for the constant model but misspecified for the probit model. The following four settings are considered:

  • Setting 1: β1 = 0q, P (A(2) = 1) = 0.5;

  • Setting 2: β1 = (04, 1, −1, 0q−6)T, P(A(2) = 1) = 0.5;

  • Setting 3: β1 = 0q, P (A(2) = 1) = Pr(N(0, 1) ≤ ST γ);

  • Setting 4: β1 = (04, 1, −1, 0q−6)T, P (A(2) = 1) = Pr(N(0, 1) ≤ ST γ),

where $S = ((S^{(1)})^T, S^{(2)})^T$ and N(0, 1) is a standard normal random variable. For the other parameters, we choose $P(A^{(1)} = 1) = 0.5$, $\beta_2 = (0, 0, 1, -1, 0_{q-4})^T$, $\sigma_1 = \sigma_2 = 0.5$, $d = (d_0, d_1, d_2, d_3)^T = (0, 1, 1, 1)^T$, and $\gamma = (0_{q-2}, 1, -1, 1)^T$. Table 1 summarizes the model misspecification information for the baseline mean and propensity score models and the associated important variables under the different settings. In the next section, we show simulation results for the four settings with q = 1000 and sample sizes n = 150 and 300 over 500 replications.

Table 1.

Simulation Settings

            Stage    Baseline  Propensity Score  Important Variables
Setting 1   Stage 2  right     right             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)})$
Setting 2   Stage 2  wrong     right             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)}, S_5^{(1)}, S_6^{(1)})$
Setting 3   Stage 2  right     wrong             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)})$
Setting 4   Stage 2  wrong     wrong             $(S^{(2)}, A^{(1)}, S_3^{(1)}, S_4^{(1)})$
            Stage 1  wrong     right             $(S_1^{(1)}, S_3^{(1)}, S_4^{(1)}, S_5^{(1)}, S_6^{(1)})$

4.2. Competing methods

We further compare our method with outcome weighted learning (OWL, Zhao et al., 2012), a robust method that estimates an individualized treatment rule by directly maximizing the estimated value function. Zhao et al. (2015) further introduced backward outcome weighted learning (BOWL) and simultaneous outcome weighted learning (SOWL) to extend these methods to multiple-stage studies. Here, we consider a doubly robust version of BOWL (DR-BOWL) for comparison. For a single-stage study, the resulting DR-BOWL method is similar to the residual weighted learning method (Zhou et al., 2015).

Specifically, we first estimate the propensity score $\hat{\pi}^{(2)} = (\hat{\pi}_1^{(2)}, \ldots, \hat{\pi}_n^{(2)})^T$ and the baseline $\hat{h}^{(2)} = X\hat{\theta}_2 = (\hat{h}_1^{(2)}, \ldots, \hat{h}_n^{(2)})^T$ as in Section 2. We consider the linear decision rule $I(x^T\beta_2 > 0)$ and estimate $\beta_2$ by minimizing the following loss function:

\[
\tilde{\beta}_2 = \operatorname*{arg\,min}_{\beta_2} \frac{1}{n}\sum_i \frac{(Y_i - \hat{h}_i^{(2)})\{1 - (2A_i^{(2)} - 1)X_i^T\beta_2\}_+}{A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})} + \lambda_{3n}^{(2)}\|\beta_2\|_1.
\]

The penalty term in the original OWL is $\lambda_{3n}^{(2)}\|\beta_2\|_2^2$; we replace it with the $L_1$ norm here to simultaneously select variables. Then we construct the pseudo-outcome $\hat{V}_i$ using the augmented inverse propensity weighted estimator (AIPWE; Zhang et al., 2012),

\[
\hat{V}_i = \frac{A_i^{(2)}d_2(X_i) + (1 - A_i^{(2)})\{1 - d_2(X_i)\}}{A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})}Y_i - \left(\frac{A_i^{(2)}d_2(X_i) + (1 - A_i^{(2)})\{1 - d_2(X_i)\}}{A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})} - 1\right)\left[\hat{h}_i^{(2)}\{1 - d_2(X_i)\} + \hat{\Phi}_i^{(2)}d_2(X_i)\right],
\]

where $d_2(X_i) = I(X_i^T\tilde{\beta}_2 > 0)$ and $\hat{\Phi}_i^{(2)}$ is an estimate of $\Phi_i^{(2)} = E(Y \mid A^{(2)} = 1, X = X_i)$. Here, we fit a linear model for $E(Y \mid A^{(2)} = 1, X)$ and use nonconcave penalized regression with the SCAD penalty to obtain $\hat{\Phi}_i^{(2)}$. Denoting by $\hat{\pi}^{(1)} = (\hat{\pi}_1^{(1)}, \ldots, \hat{\pi}_n^{(1)})^T$ and $\hat{h}^{(1)} = S\hat{\theta}_1 = (\hat{h}_1^{(1)}, \ldots, \hat{h}_n^{(1)})^T$ the estimated propensity score and baseline at the first stage, we consider a linear treatment regime of the form $I(s^T\beta_1 > 0)$ and estimate $\beta_1$ by

\[
\tilde{\beta}_1 = \operatorname*{arg\,min}_{\beta_1} \frac{1}{n}\sum_i \frac{(\hat{V}_i - \hat{h}_i^{(1)})\{1 - (2A_i^{(1)} - 1)S_i^T\beta_1\}_+}{A_i^{(1)}\hat{\pi}_i^{(1)} + (1 - A_i^{(1)})(1 - \hat{\pi}_i^{(1)})} + \lambda_{3n}^{(1)}\|\beta_1\|_1.
\]

Tuning parameters λ3n(2) and λ3n(1) are obtained by minimizing a value-based BIC criterion.
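Both displays above are inverse-probability-weighted hinge losses with an $L_1$ penalty. A crude single-stage sketch follows; a generic derivative-free optimizer stands in for the specialized solvers one would use in practice, and all names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def dr_bowl_stage(Z, A, resid, pi_hat, lam, beta_init=None):
    """Minimize the weighted hinge loss of DR-BOWL for one stage:
    resid = outcome minus fitted baseline; weights are inverse propensities."""
    w = resid / (A * pi_hat + (1 - A) * (1 - pi_hat))
    sgn = 2 * A - 1
    def loss(beta):
        return (np.mean(w * np.maximum(1 - sgn * (Z @ beta), 0.0))
                + lam * np.abs(beta).sum())
    x0 = np.zeros(Z.shape[1]) if beta_init is None else beta_init
    return minimize(loss, x0, method="Powell").x    # illustration only
```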

4.3. Results

Table 2 summarizes the variable selection results for optimal treatment decisions and the empirical performance of the estimated optimal treatment regime compared with the true optimal regime, using our penalized A-learning method (denoted by PAL) and DR-BOWL, respectively. Specifically, it reports the false negative (FN) rate (the percentage of important variables that are missed), the false positive (FP) rate (the percentage of unimportant variables that are selected), the ratio of value functions (denoted by VR), calculated as the value function of the estimated optimal treatment regime divided by that of the true optimal regime, and the error rate (ER) of the estimated optimal treatment regimes for treatment decision making, at both stages. Here, the ER at stage 2 is calculated as the mean of $n^{-1}\sum_{i=1}^n |I(\hat{\beta}_2^TX_i > 0) - I(\beta_{2,0}^TX_i > 0)|$ and at stage 1 as the mean of $n^{-1}\sum_{i=1}^n |I(\hat{\beta}_1^TS_i > 0) - I(C(S_i) > 0)|$. The value function of a given treatment regime is calculated using Monte Carlo simulations based on 10,000 replications. The VR at stage 2 (denoted by VR*) compares the regime that follows the estimated optimal rule at stage 2 after a randomly assigned treatment at stage 1, as in the simulated data, with the true optimal dynamic treatment regime for both stages. The VR at stage 1 compares the estimated optimal dynamic treatment regime with the true optimal dynamic treatment regime for both stages.

Table 2.

Variable Selection Simulation Results (%).

n method Stage 2 Stage 1
FN FP VR* ER FN FP VR ER
Setting 1
150 PAL 12.6 0.1 64.7 6.1 63.8 0.1 98.3 7.0
DR-BOWL 85.7 0.1 39.0 34.7 99.5 0.1 39.1 48.3
300 PAL 1.1 0.1 65.4 2.6 41.9 0.1 99.7 6.2
DR-BOWL 78.1 0.1 49.2 27.5 98.0 0.2 50.2 48.3

Setting 2
150 PAL 25.9 0.1 57.8 10.4 56.2 0.2 90.8 15.7
DR-BOWL 86.3 0.1 35.1 35.6 99.0 0.2 35.9 47.2
300 PAL 11.0 0.1 59.6 6.2 32.5 <0.05 97.9 8.0
DR-BOWL 79.8 0.1 42.4 29.9 97.2 0.2 44.6 47.1

Setting 3
150 PAL 33.7 0.3 59.9 13.5 64.5 0.1 93.0 9.1
DR-BOWL 18.8 1.3 60.2 7.5 72.3 0.5 92.4 24.4
300 PAL 12.3 0.3 64.2 7.2 52.7 <0.05 98.3 6.9
DR-BOWL 74.9 0.2 55.3 23.2 97.8 <0.01 56.4 48.4

Setting 4
150 PAL 55.7 0.2 48.2 22.4 62.2 0.1 79.4 17.7
DR-BOWL 75.0 0.1 51.0 23.4 99.0 <0.01 51.7 47.2
300 PAL 26.4 0.3 56.2 13.2 36.4 <0.05 94.3 8.4
DR-BOWL 74.9 0.2 50.9 23.1 97.4 <0.01 52.8 47.0

FN: proportion of related variables with zero coefficients

FP: proportion of unrelated variables with nonzero coefficients

VR: value ratio between estimated and true treatment regimes

ER: error rate of estimated treatment regimes

The DR-BOWL method fails in all settings. Take Setting 1 with n = 300 as an example: FN = 78.1% at the second stage, where the baseline, propensity score and contrast functions are all correctly specified, so it misses approximately three quarters of the important variables. Moreover, VR = 50.2%, indicating the poor performance of the estimated treatment rules.

On the other hand, the overall performance of our penalized A-learning method is good. We make the following observations. First, the FN rates are much higher than the FP rates. This suggests that the Dantzig selector tends to produce conservative variable selection results, which is commonly seen in the literature. Second, the variable selection results and the error rates of the estimated optimal treatment regime at stage 2 are generally much better than those at stage 1, which is expected since the optimal linear treatment decision rule is correctly specified at stage 2 but not at stage 1. At stage 1, for n = 150, over 55% of the important variables are not selected in all four settings. Third, our method requires correct specification of either the propensity score or the baseline model, especially when the sample size is small. This is implied by comparing the results in Setting 4 with the other three settings. For example, when n = 150, the false negative rate at the second stage reaches 55.7%, which is much higher than the FN rates in the other three settings. Besides, our estimator is very efficient in Setting 1, where both models are correctly specified. Even when n = 150, the ratio of the value functions reaches 98.3%, and all error rates are around 6–7%. These results are even comparable with those under Settings 2 and 3 when n = 300. Lastly, the estimation and variable selection performance of the estimated optimal dynamic treatment regimes improves as the sample size increases. In particular, in Settings 1–3 with n = 300, the VRs are all at least 97.9% and the ERs are all at most 8%, which implies that the estimated optimal treatment regimes nearly maximize the value functions.

4.4. Nonregularity

As suggested by one of the referees, we further examine our method under settings with different degrees of nonregularity. Specifically, we consider the setting where all covariates in $S^{(1)}$ are independent Rademacher random variables. We set $S^{(2)}$ to be another Rademacher random variable, independent of $S^{(1)}$ and $A^{(1)}$.

Denote $A^{(1)*} = 2A^{(1)} - 1$. The response Y is generated as follows:

\[
Y = 2A^{(2)}(A^{(1)*} + \delta_1S_1^{(1)} + S^{(2)} - \delta_2) + A^{(1)}(\beta^TS^{(1)}) + \varepsilon, \tag{4.2}
\]

where ε ~ N(0, 0.25).

For each stage, we fit linear models for the baseline and contrast functions and a logistic regression model for the propensity score. The parameter β in (4.2) determines the baseline function at the second stage. Similar to the regular case discussed in Section 4.1, we again consider four settings:

  • Setting 1: β = 0q, P (A(2) = 1) = 0.5;

  • Setting 2: β = (04, 1, −1, 0q−6)T, P(A(2) = 1) = 0.5;

  • Setting 3: β = 0q, P(A(2) = 1) = Pr(N(0, 1) ≤ STγ);

  • Setting 4: β = (04, 1, −1, 0q−6)T, P(A(2) = 1) = Pr(N(0, 1) ≤ STγ),

where S = ((S(1))T, S(2))T and γ = (0q−2, 1, −1, 1)T.

Parameters δ1 and δ2 in (4.2) control the degree of nonregularity at the second stage. We consider three choices of (δ1, δ2). Setting δ1 = δ2 = 1, we obtain

\[
\Pr(C^{(2)}(X) = 0) = \Pr(A^{(1)*} + S_1^{(1)} + S^{(2)} = 1) = 0.375.
\]

Setting δ1 = δ2 = 1.1, we have

\[
\Pr(C^{(2)}(X) = 0) = \Pr(A^{(1)*} + S^{(2)} = 0,\ S_1^{(1)} = 1) = 0.25.
\]

Setting δ1 = 1, δ2 = 1.1, we have

\[
\Pr(C^{(2)}(X) = 0) = 0.
\]

With some calculation, we can show that the Q-function at the first stage takes the following form:

\[
Q(S^{(1)}, A^{(1)}) = A^{(1)}(\beta^TS^{(1)} + f_1S_1^{(1)} + f_2).
\]

Hence, the contrast function is correctly specified at the first stage. When δ1 = δ2 = 1 or δ1 = δ2 = 1.1, we have f1 = f2 = 1. When δ1 = 1, δ2 = 1.1, we have f1 = f2 = 0.95. Information about the model specification and the important variables in the contrast functions is given in Table 3.

Table 3.

Simulation for Nonregular Settings

            Stage    Baseline  Propensity Score  Important Variables
Setting 1   Stage 2  right     right             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)})$
Setting 2   Stage 2  wrong     right             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)}, S_5^{(1)}, S_6^{(1)})$
Setting 3   Stage 2  right     wrong             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)})$
Setting 4   Stage 2  wrong     wrong             $(S^{(2)}, A^{(1)*}, S_1^{(1)})$
            Stage 1  right     right             $(S_1^{(1)}, S_5^{(1)}, S_6^{(1)})$

We also consider two sample sizes, n = 150 and n = 300. This gives a total of 24 scenarios. For each scenario, we report FN, FP, VR and ER as in Section 4.3. The ERs for the first and second stages are calculated as

\[
\left\{\frac{1}{n}\sum_{i=1}^n\left|I(\hat{\beta}_1^TS_i > 0) - I(C(S_i) > 0)\right|I(C(S_i) \neq 0)\right\}\Big/\left\{\frac{1}{n}\sum_{i=1}^nI(C(S_i) \neq 0)\right\}
\]

and

\[
\left\{\frac{1}{n}\sum_{i=1}^n\left|I(\hat{\beta}_2^TX_i > 0) - I(\beta_{2,0}^TX_i > 0)\right|I(\beta_{2,0}^TX_i \neq 0)\right\}\Big/\left\{\frac{1}{n}\sum_{i=1}^nI(\beta_{2,0}^TX_i \neq 0)\right\}.
\]

Compared to the definitions in Section 4.3, the error rates here are computed only over patients with nonzero contrast functions. Such definitions are more meaningful because, for patients with zero contrast, both treatments are optimal. We simulate 200 replications. The results are reported in Table 4.
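Computing these restricted error rates is immediate; a small sketch:

```python
import numpy as np

def restricted_error_rate(decision, contrast):
    """Error rate over patients with nonzero contrast, as in the displays
    above: decision = I(x^T beta_hat > 0), contrast = true contrast values."""
    nz = contrast != 0
    optimal = (contrast > 0).astype(float)
    return np.mean(np.abs(decision[nz].astype(float) - optimal[nz]))
```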

Table 4.

Variable Selection Simulation Results for Non-regular Settings (%).

n Nonregularity Stage 2 Stage 1
FN FP VR* ER FN FP VR ER
Setting 1
150 δ1 = 1, δ2 = 1 0 < 0.01 53.6 0 4.0 0.3 93.9 5.0
δ1 = 1.1, δ2 = 1.1 0 < 0.01 53.5 0.1 0.5 0.3 95.1 3.2
δ1 = 1, δ2 = 1.1 0 0.01 52.9 5.0 1.0 0.3 93.8 4.2
300 δ1 = 1, δ2 = 1 0 < 0.01 53.3 0 0 0.4 97.3 0.9
δ1 = 1.1, δ2 = 1.1 0 < 0.01 53.9 0 0 0.3 97.9 0.9
δ1 = 1, δ2 = 1.1 0 < 0.01 52.8 2.0 0 0.3 97.0 0.9

Setting 2
150 δ1 = 1, δ2 = 1 0 < 0.05 46.1 0 14.7 0.3 90.7 5.6
δ1 = 1.1, δ2 = 1.1 0 < 0.05 45.9 2.0 14 0.3 89.5 5.7
δ1 = 1, δ2 = 1.1 0 < 0.05 44.4 11.6 9.7 0.3 89.7 11.5
300 δ1 = 1, δ2 = 1 0 < 0.01 45.8 0 0 0.2 97.1 0.4
δ1 = 1.1, δ2 = 1.1 0 < 0.01 45.6 0.5 0 0.2 96.4 0.4
δ1 = 1, δ2 = 1.1 0 0.01 45.1 6.8 0 0.2 98.1 7.4

Setting 3
150 δ1 = 1, δ2 = 1 5.7 0.6 45.0 2.9 19.0 0.2 85.6 4.1
δ1 = 1.1, δ2 = 1.1 8.2 0.5 45.1 6.6 17.0 0.2 87.3 3.4
δ1 = 1, δ2 = 1.1 6.3 0.6 44.6 14.6 18.5 0.2 85.8 4.1
300 δ1 = 1, δ2 = 1 0 0.1 53.1 0 0 0.3 97.1 1.0
δ1 = 1.1, δ2 = 1.1 0 0.1 53.8 1.4 0 0.3 98.0 0.6
δ1 = 1, δ2 = 1.1 0 0.1 52.9 8.4 0 0.3 97.9 0.8

Setting 4
150 δ1 = 1, δ2 = 1 20.7 0.5 25.4 8.6 52 0.2 66.6 14.0
δ1 = 1.1, δ2 = 1.1 20.8 0.5 25.3 12.4 54.2 0.2 62.5 14.9
δ1 = 1, δ2 = 1.1 21.5 0.6 23.7 22.6 51.7 0.2 61.7 22.1
300 δ1 = 1, δ2 = 1 0.3 0.2 44.9 0.2 3.3 0.2 95.8 0.7
δ1 = 1.1, δ2 = 1.1 0 0.2 44.8 3.9 0.2 0.2 97.5 0.4
δ1 = 1, δ2 = 1.1 0 0.2 43.8 13.2 0.3 0.2 97.1 8.3

Within each setting, most results are similar across the different choices of δ1 and δ2. This suggests that the nonregularity issues do not have a big impact on the variable selection results. Apart from the results in Setting 4, the false negative and false positive rates are all very small. When the sample size increases to 300, the false negative rates in most scenarios are exactly equal to 0, while the false positive rates in all settings are below 0.4%, demonstrating the excellent variable selection performance of our method. In Settings 1–3, most error rates are below 7%, while the ratios of value functions are all above 85%, indicating that our estimated optimal treatment regimes are very close to the truth in these scenarios.

5. Application to STAR*D Study

We applied the proposed method to a dataset from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, which was conducted to compare different treatments for patients with major depressive disorder (MDD). There were 4041 participants (age 18–75) with nonpsychotic MDD enrolled in this study. At the first level, all participants were treated with citalopram (CIT) for up to 14 weeks. Subsequently, three more levels of treatments were provided for participants without a satisfactory response to CIT. At each level, participants were randomly assigned to the treatment options acceptable to them. At Level 2, participants were eligible for seven treatment options: sertraline (SER), venlafaxine (VEN), bupropion (BUP), cognitive therapy (CT), and augmenting CIT with bupropion (CIT+BUP), buspirone (CIT+BUS) or cognitive therapy (CIT+CT). Participants without a satisfactory response to CT proceeded to Level 2A for additional medication treatments. All participants who did not respond satisfactorily at Level 2 or 2A were eligible for four treatments at Level 3: medication switch to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Participants without a satisfactory response at Level 3 were re-randomized at Level 4 to either tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN). See Fava et al. (2003) and Rush et al. (2004) for more details of the STAR*D study. One goal of the study is to determine which treatment strategies, in what order or sequence, provide the optimal treatment effect.

As an illustration, we focused on the subset of participants who were given treatment BUP or SER at Level 2, did not receive a satisfactory response, and were then randomized to treatment MIRT or NTP at Level 3. For this study, we considered 381 covariates collected at baseline and intermediate levels as possibly relevant predictors. For the treatment regime at Level 3, all 381 covariates as well as the assigned treatment at Level 2 were considered as possible predictors for making the optimal treatment decision. For the treatment regime at Level 2, the 305 covariates collected before the Level 2 treatment was given were considered. The negative of the 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) score, a measurement of the symptomatic status of depression, was used as the final response. There were 73 participants with complete records in the subset of data we are interested in. Among these participants, 36 were treated with BUP and 37 with SER at Level 2, and 33 were treated with NTP and 40 with MIRT at Level 3.

The selection and estimation results are summarized as follows. At Level 3, our method selected two covariates: "age" in the baseline demographics (AGE) and the suicide risk of the patient (SUICD). The estimated optimal treatment regime is I(1.459 − 0.091 × AGE + 0.158 × SUICD ≥ 0), where 1 represents treatment NTP and 0 represents treatment MIRT. This optimal treatment regime assigns 27 participants to NTP and the remaining 46 participants to MIRT. At Level 2, our method also selected two covariates: age and the "QIDS-C percent improvement" in the clinic visit clinical record form at Level 1 (QCIMP). The estimated optimal treatment regime is I(−8.600 + 0.145 × AGE + 0.125 × QCIMP ≥ 0), where 1 stands for treatment BUP and 0 stands for treatment SER. This optimal treatment regime assigns 37 participants to BUP and the remaining 36 participants to SER.

To further examine the estimated optimal dynamic treatment regime, we compare its estimated value function with the values of the four non-dynamic treatment regimes BUP+NTP, BUP+MIRT, SER+NTP and SER+MIRT. For a given dynamic treatment regime $d = (d^{(1)}, d^{(2)})$, we evaluate its value function using AIPWE (Zhang et al., 2013),

\[
\frac{1}{n}\sum_{i=1}^n \frac{d_{A_i}^{(1)}}{\hat{\pi}_{A_i}^{(1)}}\left(\frac{d_{A_i}^{(2)}}{\hat{\pi}_{A_i}^{(2)}}Y_i - \frac{d_{A_i}^{(2)} - \hat{\pi}_{A_i}^{(2)}}{\hat{\pi}_{A_i}^{(2)}}\left[d_i^{(2)}\{\hat{h}_i^{(2)} + X_i^T\hat{\beta}_2\} + (1 - d_i^{(2)})\hat{h}_i^{(2)}\right]\right) - \frac{1}{n}\sum_{i=1}^n \frac{d_{A_i}^{(1)} - \hat{\pi}_{A_i}^{(1)}}{\hat{\pi}_{A_i}^{(1)}}\left[d_i^{(1)}\{\hat{h}_i^{(1)} + S_i^T\hat{\beta}_1\} + (1 - d_i^{(1)})\hat{h}_i^{(1)}\right],
\]

where $d_{A_i}^{(2)} = A_i^{(2)}d_i^{(2)} + (1 - A_i^{(2)})(1 - d_i^{(2)})$, $d_{A_i}^{(1)} = A_i^{(1)}d_i^{(1)} + (1 - A_i^{(1)})(1 - d_i^{(1)})$, $\hat{\pi}_{A_i}^{(2)} = A_i^{(2)}\hat{\pi}_i^{(2)} + (1 - A_i^{(2)})(1 - \hat{\pi}_i^{(2)})$, $\hat{\pi}_{A_i}^{(1)} = A_i^{(1)}\hat{\pi}_i^{(1)} + (1 - A_i^{(1)})(1 - \hat{\pi}_i^{(1)})$, and $d_i^{(2)}$ and $d_i^{(1)}$ are the treatments assigned to the ith patient according to $d^{(2)}$ and $d^{(1)}$. Based on this formula, we report the estimated value functions of the four non-dynamic treatment regimes in Table 5.
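Under our reconstruction of the display above, the AIPWE value of a regime $(d^{(1)}, d^{(2)})$ can be computed as follows (a sketch; argument names are ours):

```python
import numpy as np

def aipwe_value(d1, d2, A1, A2, Y, pi1, pi2, h1, h2, xb2, sb1):
    """AIPWE value estimate of the regime (d1, d2); xb2 = X beta2_hat and
    sb1 = S beta1_hat are the fitted contrasts."""
    dA2 = A2 * d2 + (1 - A2) * (1 - d2)           # regime agreement, stage 2
    dA1 = A1 * d1 + (1 - A1) * (1 - d1)
    piA2 = A2 * pi2 + (1 - A2) * (1 - pi2)
    piA1 = A1 * pi1 + (1 - A1) * (1 - pi1)
    m2 = d2 * (h2 + xb2) + (1 - d2) * h2           # model-based stage-2 mean
    m1 = d1 * (h1 + sb1) + (1 - d1) * h1
    stage2 = (dA2 / piA2) * Y - ((dA2 - piA2) / piA2) * m2
    return np.mean((dA1 / piA1) * stage2 - ((dA1 - piA1) / piA1) * m1)
```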

Table 5.

Estimated Values of Different Treatment Regimes

Treatment Regime Estimated Value
estimated optimal regime −9.02
BUP + NTP −12.86
BUP + MIRT −12.57
SER + NTP −12.57
SER + MIRT −12.28

Estimating the value of the optimal treatment regime is well known to be a nonregular problem when there is a nonzero probability that the contrast function (at either the second or the first stage) equals zero. To evaluate the value function under our estimated optimal treatment regime, we consider the online estimator proposed by Luedtke and van der Laan (2016). Specifically, for $i = l_n + 1, l_n + 2, \ldots, n$, we obtain the estimated optimal dynamic treatment regime $\hat{d}^{opt(i)} = (\hat{d}^{opt(i)(1)}, \hat{d}^{opt(i)(2)})$ and its associated parameters $\hat{\beta}_2^{(i)}$, $\hat{\beta}_1^{(i)}$, propensity score functions $\hat{\pi}^{(i)(2)}$, $\hat{\pi}^{(i)(1)}$ and baseline functions $\hat{h}^{(i)(2)}$, $\hat{h}^{(i)(1)}$ based on the data from patients 1 to i − 1, using penalized A-learning. Then we evaluate the value of $\hat{d}^{opt(i)}$ on the ith patient using AIPWE (Zhang et al., 2013),

\[
\hat{V}_i(i) = \frac{\hat{d}_{A_i}^{opt(i)(1)}}{\hat{\pi}_{A_i}^{(i)(1)}}\left(\frac{\hat{d}_{A_i}^{opt(i)(2)}}{\hat{\pi}_{A_i}^{(i)(2)}}Y_i - \frac{\hat{d}_{A_i}^{opt(i)(2)} - \hat{\pi}_{A_i}^{(i)(2)}}{\hat{\pi}_{A_i}^{(i)(2)}}\left[\hat{d}_i^{opt(i)(2)}\{\hat{h}_i^{(i)(2)} + X_i^T\hat{\beta}_2^{(i)}\} + (1 - \hat{d}_i^{opt(i)(2)})\hat{h}_i^{(i)(2)}\right]\right) - \frac{\hat{d}_{A_i}^{opt(i)(1)} - \hat{\pi}_{A_i}^{(i)(1)}}{\hat{\pi}_{A_i}^{(i)(1)}}\left[\hat{d}_i^{opt(i)(1)}\{\hat{h}_i^{(i)(1)} + S_i^T\hat{\beta}_1^{(i)}\} + (1 - \hat{d}_i^{opt(i)(1)})\hat{h}_i^{(i)(1)}\right].
\]

The variance of $\hat{V}_i(i)$, conditional on the data from patients 1 to i − 1, is estimated by

\[
\sigma_i^2 = \frac{1}{i-1}\sum_{j=1}^{i-1}\hat{V}_i^2(j) - \left(\frac{1}{i-1}\sum_{j=1}^{i-1}\hat{V}_i(j)\right)^2,
\]

where $\hat{V}_i(j)$ is the estimated value of $\hat{d}^{opt(i)}$ on the jth patient.

The final estimator is given by

\[
\hat{V} = \frac{\sum_{j=l_n+1}^n \sigma_j^{-1}\hat{V}_j(j)}{\sum_{j=l_n+1}^n \sigma_j^{-1}},
\]

with the estimated standard error

\[
\hat{\sigma} = \frac{\sqrt{n - l_n}}{\sum_{j=l_n+1}^n \sigma_j^{-1}}.
\]

Since the sample size of our dataset is small, we choose $l_n \approx 2n/3$, i.e., $l_n = 49$. The estimated value $\hat{V}$ equals −9.02, with an estimated standard error $\hat{\sigma} = 1.66$. From Table 5, we can see that the value under our estimated treatment regime is much larger than those under the four non-dynamic treatment regimes.
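The final weighted combination and its standard error are simple to compute once the per-patient values $\hat{V}_j(j)$ and scales $\sigma_j$ are available; a sketch:

```python
import numpy as np

def online_value(v, sigma, l_n):
    """Inverse-sd weighted combination of the online value estimates, with the
    plug-in standard error sqrt(n - l_n) / sum_j sigma_j^{-1}."""
    v = np.asarray(v, float)[l_n:]          # entries l_n+1, ..., n
    w = 1.0 / np.asarray(sigma, float)[l_n:]
    return np.sum(w * v) / np.sum(w), np.sqrt(len(v)) / np.sum(w)
```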

6. Oracle inequalities for $\hat{\beta}_2$ and the value function of the estimated regime at the second stage

We first introduce some notation. For an arbitrary matrix $\Phi \in \mathbb{R}^{M \times M}$ and an arbitrary vector $\phi \in \mathbb{R}^M$, the superscript $\Phi^j$ denotes the jth column of Φ and $\phi_j$ the jth element of ϕ, while the subscript $\Phi_i$ denotes the ith row of Φ. For subsets $J, J' \subseteq \{1, \ldots, M\}$, let |J| be the cardinality of J and $J^c$ the complement of J. We denote by $\phi_J$ the vector in $\mathbb{R}^{|J|}$ that has the same coordinates as ϕ on J, by $\Phi_J$ the submatrix formed by the columns in J, and by $\Phi_{JJ'}$ the submatrix formed by the rows in J and columns in J′. The support of ϕ is defined by $\mathrm{supp}(\phi) = \{j \in \{1, \ldots, M\} : \phi_j \neq 0\}$. Let $\|\phi\|_p$ be the $L_p$ norm of ϕ and $\|\Phi\|_p$ the operator norm induced by the vector p-norm. If Φ is positive semidefinite, define

\[
\rho_{\min}^s(\Phi) = \min_{\|y\|_2 = 1,\, |\mathrm{supp}(y)| \le s}\|\Phi^{1/2}y\|_2^2 \quad\text{and}\quad \rho_{\max}^s(\Phi) = \max_{\|y\|_2 = 1,\, |\mathrm{supp}(y)| \le s}\|\Phi^{1/2}y\|_2^2.
\]

Let $\|Y\|_{\psi_p}$ denote the Orlicz norm of a random variable Y, defined as

\[
\|Y\|_{\psi_p} \equiv \inf\{u > 0 : E\exp(|Y|^p/u^p) \le 2\},
\]

for some p ≥ 1. For any two positive sequences $\{a_n\}$ and $\{b_n\}$, $a_n \gg b_n$ means $\lim_n b_n/a_n = 0$. Throughout this paper, we use $c_0$ and $\bar{c}$ to denote universal constants whose values may change from place to place.

6.1. Oracle inequality for $\hat{\beta}_2$

Recall that $C^{(2)}(x) = x^T\beta_2$ according to our assumption. Let $\beta_{2,0}$ denote the true value of $\beta_2$, $M_{\beta_2}$ the support of $\beta_{2,0}$, and $s_{\beta_2} = |M_{\beta_2}| = O(n^{l_6})$, for some $0 \le l_6 < 1$, the nonsparsity size of $\beta_{2,0}$. We allow the number of covariates p to grow exponentially fast with the sample size n, i.e., $\log p = O(n^{a_2})$ for some $0 < a_2 < 1$. To deal with such NP-dimensionality, following Zhou (2009), we assume

\[
X = U\Sigma^{1/2}, \qquad \Sigma_{jj} = 1, \quad j = 1, \ldots, p, \tag{6.1}
\]

where $U = (U_1^T, \ldots, U_n^T)^T$ and $U_1, \ldots, U_n$ are i.i.d. copies of a p-dimensional isotropic random vector $U_0$. More specifically, we require that, for any vector $a \in \mathbb{R}^p$,

\[
E(a^TU_0)^2 = a^Ta \quad\text{and}\quad \|a^TU_0\|_{\psi_2} \le \omega\|a\|_2, \tag{6.2}
\]

for some isotropic constant ω.

Remark 6.1

The notion of an isotropic random vector was introduced by Milman and Pajor (2003). Independent normal and independent Rademacher random variables are the two most important examples of isotropic random vectors. More generally, the coordinates of an isotropic random vector need not be independent. They can be distributed uniformly on various convex and symmetric bodies, for example, an appropriate multiple of the unit ball in $\mathbb{R}^p$ equipped with the $L_K$-norm for any $1 \le K \le \infty$. For these distributions, we denote by $\omega_K$ their isotropic constants. It is further shown in Milman and Pajor (1989) that the $\omega_K$ are uniformly bounded for K ≥ 1. However, it remains unknown whether the isotropic property holds for all uniform distributions on arbitrary symmetric convex bodies with Lebesgue measure 1.

Remark 6.2

The isotropic formulation requires the covariates in $U_0$ to be uncorrelated, and hence does not allow for correlated Bernoullis. However, according to our definition $X = U\Sigma^{1/2}$, different covariates in the design matrix X can be correlated when $\Sigma_{ij} \neq 0$. Such a formulation allows us to impose conditions on the tail of $U_0$ and on the covariance matrix Σ separately.
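For intuition, a design obeying (6.1) is easy to simulate: draw independent Rademacher (or standard normal) rows and correlate them through a square root of Σ (a sketch; the Cholesky factor stands in for $\Sigma^{1/2}$, since any square root yields the same covariance):

```python
import numpy as np

def isotropic_design(n, p, Sigma, seed=None):
    """Generate X = U Sigma^{1/2} as in (6.1), with i.i.d. Rademacher rows of U
    (an isotropic subgaussian ensemble)."""
    rng = np.random.default_rng(seed)
    U = rng.choice([-1.0, 1.0], size=(n, p))
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    return U @ L.T
```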

Since the A-learning estimating equation involves the plug-in estimators $\hat{\alpha}_2$ and $\hat{\theta}_2$, we need some conditions on these two estimators to establish oracle inequalities for $\hat{\beta}_2$. More precisely, we assume that $\hat{\alpha}_2$ and $\hat{\theta}_2$ converge to some $\alpha_2^*$ and $\theta_2^*$, respectively. When the propensity score model $\pi^{(2)}$ and the baseline model $h^{(2)}$ are correctly specified, $\alpha_2^*$ and $\theta_2^*$ represent the true coefficients in these two models. When the models are misspecified, $\alpha_2^*$ and $\theta_2^*$ correspond to the population-level least favorable parameters. Denote by $M_{\alpha_2}$ and $M_{\theta_2}$ the supports of $\alpha_2^*$ and $\theta_2^*$, respectively, and let $s_{\alpha_2} = |M_{\alpha_2}|$ and $s_{\theta_2} = |M_{\theta_2}|$ be the numbers of nonzero elements. We assume $s_{\alpha_2} = O(n^{l_4})$ and $s_{\theta_2} = O(n^{l_5})$ for some $0 \le l_4, l_5 < 1/2$.

Condition 1

Assume that there exist positive constants $\gamma_{\alpha_2}$ and $\gamma_{\theta_2}$ such that, with probability at least $1 - \bar{c}/(n+p)$,

\[
\hat{\alpha}_{2,M_{\alpha_2}^c} = 0, \qquad \|\hat{\alpha}_{2,M_{\alpha_2}} - \alpha^*_{2,M_{\alpha_2}}\|_\infty = O(n^{-\gamma_{\alpha_2}}\log n), \tag{6.3}
\]
\[
\hat{\theta}_{2,M_{\theta_2}^c} = 0, \qquad \|\hat{\theta}_{2,M_{\theta_2}} - \theta^*_{2,M_{\theta_2}}\|_\infty = O(n^{-\gamma_{\theta_2}}\log n). \tag{6.4}
\]

Moreover, assume $d_n^{\alpha_2} \gg n^{-\gamma_{\alpha_2}}\log n$ and $d_n^{\theta_2} \gg n^{-\gamma_{\theta_2}}\log n$, where $d_n^{\alpha_2} = \min_{j \in M_{\alpha_2}}|\alpha^*_{2j}|/2$ and $d_n^{\theta_2} = \min_{j \in M_{\theta_2}}|\theta^*_{2j}|/2$.

Remark 6.3

Condition 1 assumes the weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$, i.e., selection consistency and consistency under the $L_\infty$ norm. The weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$ are established in Theorems 8.1 and 8.2 of Section 8, respectively.

Define

\[
C^{(2)} = E\{X_i\pi_i^{(2)*}(1 - \pi_i^{(2)*})X_i^T\}, \qquad D^{(2)} = E\{X_iX_i^T(1 - A_i^{(2)})\},
\]

where $\pi_i^{(2)*} \equiv \pi^{(2)}(X_i, \alpha_2^*)$.

Condition 2

Assume that the matrices $D^{(2)}$, $C^{(2)}$ and Σ satisfy

\[
\lambda_{\max}(\Sigma_{M_{\alpha_2}M_{\alpha_2}}) = O(1), \quad \lambda_{\max}(\Sigma_{M_{\theta_2}M_{\theta_2}}) = O(1), \quad \liminf_n \lambda_{\min}(D^{(2)}_{M_{\theta_2}M_{\theta_2}}) > 0, \quad \liminf_n \lambda_{\min}(C^{(2)}_{M_{\alpha_2}M_{\alpha_2}}) > 0.
\]

Define $\Omega^{(2)}(\alpha_2) = E[X_iX_i^TA_i^{(2)}\{1 - \pi^{(2)}(X_i, \alpha_2)\}]$ and $\Omega_n^{(2)} = n^{-1}\sum_i X_iX_i^TA_i^{(2)}(1 - \hat{\pi}_i^{(2)})$ with $\hat{\pi}_i^{(2)} = \pi^{(2)}(X_i, \hat{\alpha}_2)$. For any positive semidefinite matrix $\Psi \in \mathbb{R}^{p \times p}$, integer s and positive number c, define the function $K(s, c, \Psi)$ as follows:

\[
K(s, c, \Psi) = \min_{\substack{J \subseteq \{1, \ldots, p\} \\ |J| \le s}}\ \min_{\substack{y \neq 0 \\ \|y_{J^c}\|_1 \le c\|y_J\|_1}} \frac{\|\Psi^{1/2}y\|_2}{\|y_J\|_2} > 0.
\]
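Evaluating $K(s, c, \Psi)$ exactly is combinatorial. For small p, one can approximate it by enumerating supports and randomly sampling directions in the cone $\|y_{J^c}\|_1 \le c\|y_J\|_1$; the sampled minimum is an upper bound on the true value. A crude Monte Carlo sketch, for illustration only:

```python
import itertools
import numpy as np

def K_approx(s, c, Psi, n_dirs=500, seed=0):
    """Monte Carlo upper bound on K(s, c, Psi) for small p: sample directions y
    obeying ||y_{J^c}||_1 <= c ||y_J||_1 over all supports J of size s."""
    rng = np.random.default_rng(seed)
    p = Psi.shape[0]
    root = np.linalg.cholesky(Psi + 1e-10 * np.eye(p))  # ||Psi^{1/2} y||_2 = ||root^T y||_2
    best = np.inf
    for J in itertools.combinations(range(p), s):
        J = list(J)
        for _ in range(n_dirs):
            y = np.zeros(p)
            y[J] = rng.standard_normal(s)
            off = rng.standard_normal(p)
            off[J] = 0.0
            if np.abs(off).sum() > 0:                   # respect the cone constraint
                off *= rng.uniform() * c * np.abs(y[J]).sum() / np.abs(off).sum()
            best = min(best, np.linalg.norm(root.T @ (y + off)) / np.linalg.norm(y[J]))
    return best
```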

The following condition ensures that the RE condition holds for the matrix $\Omega_n^{(2)}$.

Condition 3

Assume that, for any $0 < \theta_s < 1$ and sufficiently large n, we have

\[
K(s_{\beta_2}, 1, \Omega_n^{(2)}) > (1 - \theta_s)\inf_{\alpha_2 \in H_{\alpha_2}}K(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2)) > 0, \tag{6.5}
\]

where $H_{\alpha_2}$ denotes the set of vectors $\alpha_2$ that satisfy the weak oracle property (6.3).

Remark 6.4

It is tedious to verify (6.5) due to the plug-in estimator $\hat{\pi}_i^{(2)}$. The key to proving such a result is that the estimator $\hat{\alpha}_2$ in $\hat{\pi}_i^{(2)}$ should be sparse. That is the reason we use penalized regression with a folded-concave penalty to obtain $\hat{\alpha}_2$: it ensures selection consistency of the estimator. We provide a general result characterizing the UUP and RE conditions for the random matrix $\Omega_n^{(2)}$ in Lemmas 9.1 and 9.2 of Section 9.

To establish the oracle inequality for $\hat{\beta}_2$, we first provide an upper bound for

\[
\left\|\frac{1}{n}X^T\mathrm{diag}(A^{(2)} - \hat{\pi}^{(2)})(Y - X\hat{\theta}_2 - A^{(2)} \circ X\beta_{2,0})\right\|_\infty,
\]

which is given in the following Lemma.

Lemma 6.1

Assume that Conditions 1 and 2 hold, $\|h^{(2)}(X_i) - X_i^T\theta_2^*\|_{\psi_1} < \infty$, $\|e_i\|_{\psi_2} < \infty$, $a_2 + l_4 < 1$, and that either $\pi^{(2)}$ or $h^{(2)}$ is correctly specified. Then, for sufficiently large n, there exists some constant $c^{(2)}$ such that, with probability at least $1 - \bar{c}/(n+p)$,

\[
\left\|\frac{1}{n}X^T\mathrm{diag}(A^{(2)} - \hat{\pi}^{(2)})(Y - X\hat{\theta}_2 - A^{(2)} \circ X\beta_{2,0})\right\|_\infty \le c^{(2)}(E_1 + E_2 + E_3 + E_4),
\]

where

\[
E_1 = \sqrt{\log p/n}, \qquad E_2 = s_{\alpha_2}n^{-2\gamma_{\alpha_2}}\log^2 n + s_{\theta_2}n^{-2\gamma_{\theta_2}}\log^2 n,
\]
\[
E_3 = \sigma_3\left\{\sqrt{s_{\alpha_2}\log n/n} + s_{\alpha_2}\lambda_{1n}^{(2)}\rho_1^{(2)\prime}(d_n^{\alpha_2})\right\},
\]
\[
E_4 = \sigma_4\left\{\sqrt{s_{\theta_2}\log n/n} + s_{\theta_2}\lambda_{2n}^{(2)}\rho_2^{(2)\prime}(d_n^{\theta_2})\right\},
\]

$\sigma_3^2 = E\{h^{(2)}(X_i) - X_i^T\theta_2^*\}^2$, and $\sigma_4^2 = E\{\pi^{(2)}(X_i) - \pi_i^{(2)*}\}^2$.

Remark 6.5

Recall that $\log p = O(n^{a_2})$ and $s_{\alpha_2} = O(n^{l_4})$ for some $0 \le a_2, l_4 < 1$. The condition $a_2 + l_4 < 1$ implies $n \gg s_{\alpha_2}\log p$.

Remark 6.6

Here, $E_1$ describes how the curse of dimensionality takes effect, $E_2$ is due to the estimation errors of $\hat{\alpha}_2$ and $\hat{\theta}_2$, and $E_3$ and $E_4$ are due to model misspecification. Since we assume that at least one of $h^{(2)}$ and $\pi^{(2)}$ is correctly specified, either $E_3$ or $E_4$ is zero.

Theorem 6.1

Assume that the conditions in Lemma 6.1 and Condition 3 hold, and that $\lambda_{3n}^{(2)} \ge c^{(2)}(E_1 + E_2 + E_3 + E_4)$, where the constant $c^{(2)}$ is defined in Lemma 6.1. Then, for some fixed $0 < \theta_s < 1$ and sufficiently large n, the following two inequalities hold with probability at least $1 - \bar{c}/(n+p)$ for some constant $\bar{c} > 0$:

\[
\|\hat{\beta}_2 - \beta_{2,0}\|_2 \le \frac{12\lambda_{3n}^{(2)}\sqrt{s_{\beta_2}}}{(1 - \theta_s)^2\inf_{\alpha_2 \in H_{\alpha_2}}K^2(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))}, \tag{6.6}
\]
\[
\|\hat{\beta}_2 - \beta_{2,0}\|_1 \le \frac{8\lambda_{3n}^{(2)}s_{\beta_2}}{(1 - \theta_s)^2\inf_{\alpha_2 \in H_{\alpha_2}}K^2(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))}. \tag{6.7}
\]

Moreover, we have $\|\hat{\beta}_{2,M_{\beta_2}^c}\|_1 \le \|\hat{\beta}_{2,M_{\beta_2}} - \beta_{2,0,M_{\beta_2}}\|_1$.

From (6.6), it is immediate to see that $\|\hat{\beta}_2 - \beta_{2,0}\|_2 \xrightarrow{P} 0$ as long as

\[
\frac{\sqrt{s_{\beta_2}}\,(E_1 + E_2 + E_3 + E_4)}{\inf_{\alpha_2 \in H_{\alpha_2}}K^2(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))} \to 0, \tag{6.8}
\]

which implies the double robustness of $\hat{\beta}_2$. We provide a sufficient condition for (6.8) in the following Corollary.

Corollary 6.1 (Double robustness of $\hat{\beta}_2$)

Assume that conditions in Theorem 6.1 and the following conditions hold:

\[
l_6 < \min(4\gamma_{\theta_2} - 2l_5,\ 4\gamma_{\alpha_2} - 2l_4), \tag{6.9}
\]
\[
\lambda_{2n}^{(2)}\rho_2^{(2)\prime}(d_n^{\theta_2}) = O(n^{-1/2}) \quad\text{and}\quad \lambda_{1n}^{(2)}\rho_1^{(2)\prime}(d_n^{\alpha_2}) = O(n^{-1/2}), \tag{6.10}
\]
\[
\liminf_n \inf_{\alpha_2 \in H_{\alpha_2}}K(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2)) > 0. \tag{6.11}
\]

If either the baseline mean function $h^{(2)}$ or the propensity score model $\pi^{(2)}$ is correctly specified, then $\|\hat{\beta}_2 - \beta_{2,0}\|_2 \xrightarrow{P} 0$.

Remark 6.7

Condition (6.9) imposes a constraint between the sparsity of the population parameters and the convergence rates of $\hat{\alpha}_2$ and $\hat{\theta}_2$. When $s_{\beta_2} = O(1)$, it requires $\hat{\alpha}_2$ and $\hat{\theta}_2$ to be consistent under the $L_2$ norm. Condition (6.10) automatically holds for the SCAD penalty function when $d_n^{\theta_2} \gg \lambda_{2n}^{(2)}$ and $d_n^{\alpha_2} \gg \lambda_{1n}^{(2)}$.

6.2. Oracle inequality for the value function of the estimated regime at the second stage

Now we establish the error bound for the difference between the mean responses (i.e., the value functions) of the estimated optimal regime at the second stage, $\hat{d}_2(X_0) = I(X_0^T\hat{\beta}_2 > 0)$, and the true optimal one, $d_2^{opt}(X_0) = I(X_0^T\beta_{2,0} > 0)$, for an individual with covariates $X_0$. Here, $X_0$ is also assumed to have the form $\Sigma^{1/2}U_0$, with Σ and $U_0$ defined in (6.1), independent of $X_i$, i = 1, …, n. In addition, the regime at the first stage is taken to be the same as the actually received treatment $A_0^{(1)}$ at the first stage.

Under the assumptions of SUTVA and no unmeasured confounders, the difference of the corresponding value functions is given by

\[
E\{Y_0(A_0^{(1)}, d_2^{opt})\} - E\{Y_0(A_0^{(1)}, \hat{d}_2)\} = E[X_0^T\beta_{2,0}\{I(X_0^T\beta_{2,0} > 0) - I(X_0^T\hat{\beta}_2 > 0)\}]. \tag{6.12}
\]

Since (6.12) is nonnegative, it suffices to provide an upper bound. Here, we impose the following condition.

Condition 4

The probability density function $g^{(2)}(\cdot)$ of $X_0^T\beta_{2,0}$ exists and is bounded.

Condition 4 is a mild condition on the true optimal decision function, which holds in most cases when at least one of the important covariates (those whose corresponding components of $\beta_{2,0}$ are nonzero) is continuous.

Theorem 6.2

Assume that the conditions in Theorem 6.1 and Condition 4 hold, and that $E(X_0^T\beta_{2,0})^2 = O(1)$. Then, for fixed $0 < \theta_s < 1$ and sufficiently large n,

\[
E[X_0^T\beta_{2,0}\{I(X_0^T\beta_{2,0} > 0) - I(X_0^T\hat{\beta}_2 > 0)\}] \le \frac{\bar{c}\omega}{n} + \frac{c_0\omega^2\rho_{\max}^{s_{\beta_2}}(\Sigma)(\lambda_{3n}^{(2)})^2 s_{\beta_2}\log^2 n}{(1 - \theta_s)^4\inf_{\alpha_2 \in H_{\alpha_2}}K^4(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))}.
\]

Remark 6.8

The error bound for the difference of the value functions follows from the error bound on $\hat{\beta}_2$ and Condition 4. Since the first term in the upper bound is small, the difference of the value functions is mainly characterized by the second term, which is of the order $O(\rho_{\max}^{s_{\beta_2}}(\Sigma)\|\hat{\beta}_2 - \beta_{2,0}\|_2^2\log^2 n)$.

7. Error bounds for $\hat{\beta}_1$ and the value function of the estimated dynamic treatment regime

7.1. Misspecified contrast function

In the context of A-learning, a major challenge arising in multi-stage studies is that the contrast functions are likely to be misspecified in the backward induction. In order to study the finite-sample bounds for $\hat{\beta}_1$, we first need to define the least favorable parameters under misspecification of the contrast function.

Recall that $C^{(1)}(S_i)$ is the true contrast function for the ith patient, which can be a very complex function of $S_i$ due to the backward induction. For notational convenience, we use the shorthand $C(s)$ for $C^{(1)}(s)$. We posit a linear model $S_i^T\beta_1$ for $C(\cdot)$, which is often misspecified. When either the propensity score model $\pi^{(1)}$ or the baseline mean function $h^{(1)}$ is correctly specified, the associated least favorable parameter $\beta_1^*$ is defined as follows:

\[
\beta_1^* = \operatorname*{arg\,min}_{\beta_1 \in \Lambda^*} \|\beta_1\|_1, \tag{7.1}
\]

where

\[
\Lambda^* = \left\{\beta_1 \in \mathbb{R}^q : \left\|E[S_iA_i^{(1)}(1 - \pi_i^{(1)*})\{C(S_i) - S_i^T\beta_1\}]\right\|_\infty \le \kappa_0\right\},
\]

$\pi_i^{(1)*} = \pi^{(1)}(S_i, \alpha_1^*)$, and $\kappa_0$ is a nonnegative constant. Define

\[
\kappa_0^* = \left\|E[S_iA_i^{(1)}(1 - \pi_i^{(1)*})\{C(S_i) - S_i^T\beta_1^*\}]\right\|_\infty.
\]

By simple algebra, we can show $\kappa_0^* \le \min\{\kappa_0, O(\sigma_0)\}$, where $\sigma_0^2 = E[\{C(S_i) - S_i^T\beta_1^*\}^2]$ describes the degree of misspecification of the contrast function. Define $s_{\beta_1} = |M_{\beta_1}| = O(n^{l_3})$ for some $0 \le l_3 < 1/2$, where $M_{\beta_1} = \mathrm{supp}(\beta_1^*)$.

7.2. Error bound for $\hat{\beta}_1$

Assume that $\log q = O(n^{a_1})$ for some $0 < a_1 < 1$ and that $S_1, \ldots, S_n$ are i.i.d. copies of $S_0$ such that

\[
S_0 \overset{d}{=} \Psi^{1/2}V_0, \tag{7.2}
\]

where $\Psi \in \mathbb{R}^{q \times q}$ is a positive definite matrix with $\Psi_{jj} = 1$ for j = 1, …, q, and $V_0$ is a q-dimensional isotropic random vector with isotropic constant ζ. As in the second stage, we first give conditions on $\hat{\alpha}_1$ and $\hat{\theta}_1$. Assume that these two estimators converge to some $\alpha_1^*$ and $\theta_1^*$, respectively, under possible model misspecification. Denote $M_{\alpha_1} = \mathrm{supp}(\alpha_1^*)$, $M_{\theta_1} = \mathrm{supp}(\theta_1^*)$, $s_{\alpha_1} = |M_{\alpha_1}| = O(n^{l_1})$, and $s_{\theta_1} = |M_{\theta_1}| = O(n^{l_2})$ for some $0 \le l_1, l_2 < 1/2$.

Condition 5

Assume that there exist positive constants $\gamma_{\alpha_1}$ and $\gamma_{\theta_1}$ such that, with probability at least $1 - \bar{c}/(n+p+q)$, the following holds:

\[
\hat{\alpha}_{1,M_{\alpha_1}^c} = 0, \qquad \|\hat{\alpha}_{1,M_{\alpha_1}} - \alpha^*_{1,M_{\alpha_1}}\|_\infty = O(n^{-\gamma_{\alpha_1}}\log n), \tag{7.3}
\]
\[
\hat{\theta}_{1,M_{\theta_1}^c} = 0, \qquad \|\hat{\theta}_{1,M_{\theta_1}} - \theta^*_{1,M_{\theta_1}}\|_\infty = O(n^{-\gamma_{\theta_1}}\log n). \tag{7.4}
\]

Moreover, assume $d_n^{\alpha_1} \gg n^{-\gamma_{\alpha_1}}\log n$ and $d_n^{\theta_1} \gg n^{-\gamma_{\theta_1}}\log n$, where $d_n^{\alpha_1} = \min_{j \in M_{\alpha_1}}|\alpha^*_{1j}|/2$ and $d_n^{\theta_1} = \min_{j \in M_{\theta_1}}|\theta^*_{1j}|/2$.

Condition 6

Assume that $D^{(1)}$, $C^{(1)}$ and Ψ satisfy

\[
\lambda_{\max}(\Psi_{M_{\alpha_1}M_{\alpha_1}}) = O(1), \quad \lambda_{\max}(\Psi_{M_{\theta_1}M_{\theta_1}}) = O(1), \quad \liminf_n \lambda_{\min}(D^{(1)}_{M_{\theta_1}M_{\theta_1}}) > 0, \quad \liminf_n \lambda_{\min}(C^{(1)}_{M_{\alpha_1}M_{\alpha_1}}) > 0,
\]

where

\[
D^{(1)} = E\{S_iS_i^T(1 - A_i^{(1)})\}, \qquad C^{(1)} = E\{S_iS_i^T\pi_i^{(1)*}(1 - \pi_i^{(1)*})\},
\]

and $\pi_i^{(1)*} = \pi^{(1)}(S_i, \alpha_1^*)$.

Since both the propensity score model and the contrast function at the first stage can be misspecified, we need the following condition to control their effect on the estimation of $\beta_1^*$.

Condition 7

Assume that

\[
\tau_0 \equiv \left\|F_{M_{\alpha_1}}[C^{(1)}_{M_{\alpha_1}M_{\alpha_1}}]^{-1}b^{(1)}_{M_{\alpha_1}}\right\|_\infty < \infty, \tag{7.5}
\]

where $b^{(1)} = E\{S_i(A_i^{(1)} - \pi_i^{(1)*})\}$ and

\[
F = E[S_iA_i^{(1)}\pi_i^{(1)*}(1 - \pi_i^{(1)*})\{C(S_i) - S_i^T\beta_1^*\}S_i^T].
\]

Remark 7.1

It is immediate to see that $\tau_0 = 0$ when either the contrast function or the propensity score model is correctly specified.

When going back to the first stage, the error bound for $\hat{\beta}_1$ is directly affected by that for $\hat{\beta}_2$. This is because the estimated response $\hat{V}_i$ at the first stage is obtained from $\hat{\beta}_2$ via the advantage function. To simplify the presentation, we introduce the following condition.

Condition 8

Assume that, with probability at least $1 - \bar{c}/(n+p)$, there exists some constant $\mu_1 > 0$ such that

\[
\rho_{\max}^{s_{\beta_2}}(\Sigma)\,\|\hat{\beta}_2 - \beta_{2,0}\|_2 = O(n^{-\mu_1}\log n), \tag{7.6}
\]

and $\|\hat{\beta}_{2,M_{\beta_2}^c}\|_1 \le \|\hat{\beta}_{2,M_{\beta_2}} - \beta_{2,0,M_{\beta_2}}\|_1$.

A more explicit form of the error bound in (7.6) is given in Theorem 6.1. In the next lemma, we provide an upper bound for the term

\[
\left\|S^T\mathrm{diag}(A^{(1)} - \hat{\pi}^{(1)})(\hat{V} - S\hat{\theta}_1 - A^{(1)} \circ S\beta_1^*)/n\right\|_\infty. \tag{7.7}
\]

Lemma 7.1

Assume that Conditions 5–8 and those in Theorem 6.1 hold, $\|C(S_i) - S_i^T\beta_1^*\|_{\psi_1} < \infty$, $\|V_i - E(V_i \mid S_i, A_i^{(1)})\|_{\psi_2} < \infty$, $a_1 + l_1 < 1$, $n \gg s_{\beta_2}\log p\,\{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2/\rho_{\min}^{s_{\beta_2}}(\Sigma)$, and that either $\pi^{(1)}$ or $h^{(1)}$ is correctly specified. Then, for sufficiently large n, with probability at least $1 - \bar{c}/(n+p+q)$, (7.7) can be bounded from above by $c^{(1)}(E_5 + E_6 + E_7 + E_8 + E_9 + E_{10})$ for some constant $c^{(1)} > 0$, where

\[
E_5 = \sqrt{\log q/n}\,\log^2 n, \qquad E_6 = s_{\alpha_1}n^{-2\gamma_{\alpha_1}}\log^2 n + s_{\theta_1}n^{-2\gamma_{\theta_1}}\log^2 n,
\]
\[
E_7 = \sigma_1\left\{\sqrt{s_{\alpha_1}\log n/n} + s_{\alpha_1}\lambda_{1n}^{(1)}\rho_1^{(1)\prime}(d_n^{\alpha_1})\right\},
\]
\[
E_8 = \sigma_2\left\{\sqrt{s_{\theta_1}\log n/n} + s_{\theta_1}\lambda_{2n}^{(1)}\rho_2^{(1)\prime}(d_n^{\theta_1})\right\},
\]
\[
E_9 = \sigma_0\left\{\sqrt{s_{\alpha_1}\log n/n} + s_{\alpha_1}\lambda_{1n}^{(1)}\rho_1^{(1)\prime}(d_n^{\alpha_1}) + \tau_0 + \kappa_0^*\right\},
\]
\[
E_{10} = n^{-\mu_1}\log n, \qquad \sigma_0^2 = E\{C(S_i) - S_i^T\beta_1^*\}^2, \qquad \sigma_1^2 = E\{h^{(1)}(S_i) - S_i^T\theta_1^*\}^2, \qquad \sigma_2^2 = E\{\pi_i^{(1)*} - \pi^{(1)}(S_i)\}^2.
\]

Remark 7.2

The terms $E_5$–$E_8$ have similar interpretations to $E_1$–$E_4$ in Lemma 6.1, respectively. The additional term $E_{10}$ is due to the error bound for $\hat{\beta}_2$ in the backward induction, while $E_9$ is due to the misspecification of the contrast function.

Define $\Omega^{(1)}(\alpha_1) = E[S_iS_i^TA_i^{(1)}\{1 - \pi^{(1)}(S_i, \alpha_1)\}]$ and $\Omega_n^{(1)} = n^{-1}\sum_i S_iS_i^TA_i^{(1)}(1 - \hat{\pi}_i^{(1)})$ with $\hat{\pi}_i^{(1)} = \pi^{(1)}(S_i, \hat{\alpha}_1)$. As in stage 2, we need the following condition to ensure the RE condition for the matrix $\Omega_n^{(1)}$.

Condition 9

Assume that, for any $0 < \theta_s < 1$ and sufficiently large n, we have

\[
K(s_{\beta_1}, 1, \Omega_n^{(1)}) > (1 - \theta_s)\inf_{\alpha_1 \in H_{\alpha_1}}K(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1)) > 0, \tag{7.8}
\]

where $H_{\alpha_1}$ denotes the set of vectors $\alpha_1$ that satisfy the weak oracle property (7.3).

Theorem 7.1

Assume that Condition 9 and the conditions in Lemma 7.1 hold, and that $\lambda_{3n}^{(1)} \ge c^{(1)}\sum_{k=5}^{10}E_k$, where the constant $c^{(1)}$ is defined in Lemma 7.1. Then there exists a constant $c_8$ such that, for sufficiently large n and some fixed $0 < \theta_s < 1$, with probability at least $1 - c_8/(n+p+q)$, the error bounds for $\hat{\beta}_1$ are given by

\[
\|\hat{\beta}_1 - \beta_1^*\|_2 \le \frac{12\lambda_{3n}^{(1)}\sqrt{s_{\beta_1}}}{(1 - \theta_s)^2\inf_{\alpha_1 \in H_{\alpha_1}}K^2(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1))}, \tag{7.9}
\]
\[
\|\hat{\beta}_1 - \beta_1^*\|_1 \le \frac{8\lambda_{3n}^{(1)}s_{\beta_1}}{(1 - \theta_s)^2\inf_{\alpha_1 \in H_{\alpha_1}}K^2(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1))}. \tag{7.10}
\]

7.3. Error bound for the value function of the estimated dynamic treatment regime

Under the SUTVA and sequential randomization assumptions, the value function of a given dynamic treatment regime $(d_1(S_0), d_2(X_0))$ is given by

\[
E\{Y_0(d_1, d_2)\} = E[h^{(2)}(X_0) + (\beta_{2,0}^TX_0)d_2(X_0) + C(S_0)\{d_1(S_0) - A_0^{(1)}\}],
\]

where $S_0$ and $X_0$ denote the baseline covariates and the covariates for the second stage, respectively. Then, the difference between the value functions under the true optimal regime $(d_1^{opt}, d_2^{opt})$ and the estimated optimal dynamic treatment regime (2.11) is given by

\[
E\{Y_0(d_1^{opt}, d_2^{opt})\} - E\{Y_0(\hat{d}_1, \hat{d}_2)\} = E[C(S_0)\{I(C(S_0) > 0) - I(S_0^T\hat{\beta}_1 > 0)\}] + E[X_0^T\beta_{2,0}\{I(X_0^T\beta_{2,0} > 0) - I(X_0^T\hat{\beta}_2 > 0)\}].
\]

Similar to Condition 4, we impose the following condition.

Condition 10

Assume that the probability density function $g^{(1)}(\cdot)$ of $S_0^T\beta_1^*$ exists and is bounded.

Theorem 7.2

Assume that the conditions in Theorem 7.1 and Condition 10 hold, and that $E(X_0^T\beta_{2,0})^2 = O(1)$ and $E(S_0^T\beta_1^*)^2 = O(1)$. Then, for some fixed $0 < \theta_s < 1$ and sufficiently large n,

\[
0 \le E\{Y_0(d_1^{opt}, d_2^{opt})\} - E\{Y_0(\hat{d}_1, \hat{d}_2)\} \le \frac{\bar{c}(\omega + \zeta)}{n} + c_0\sigma_0^{4/3} + \frac{c_0\omega^2\rho_{\max}^{s_{\beta_2}}(\Sigma)(\lambda_{3n}^{(2)})^2 s_{\beta_2}\log^2 n}{(1 - \theta_s)^4\inf_{\alpha_2 \in H_{\alpha_2}}K^4(s_{\beta_2}, 1, \Omega^{(2)}(\alpha_2))} + \frac{c_0\zeta^2\rho_{\max}^{s_{\beta_1}}(\Psi)(\lambda_{3n}^{(1)})^2 s_{\beta_1}\log^2 n}{(1 - \theta_s)^4\inf_{\alpha_1 \in H_{\alpha_1}}K^4(s_{\beta_1}, 1, \Omega^{(1)}(\alpha_1))}.
\]

Remark 7.3

Theorem 7.2 suggests that the upper bound for the difference of the value functions comes from three major components: the misspecification of the contrast function, described by $\sigma_0^2$, and the estimation errors of $\hat{\beta}_2$ and $\hat{\beta}_1$.

8. Weak oracle properties of the $\hat{\alpha}_j$'s and $\hat{\theta}_j$'s

In order to prove the error bounds for $\hat{\beta}_1$, $\hat{\beta}_2$ and the value functions of the estimated treatment regimes presented in Sections 6 and 7, we need to establish the weak oracle properties of $\hat{\alpha}_j$ and $\hat{\theta}_j$ (j = 1, 2) in the posited models for the propensity score and baseline mean functions. Here, we prove the results based on a posited logistic regression model for the propensity score and a linear model for the baseline mean function under a random design setting. However, these results can be extended to generalized linear models (McCullagh and Nelder, 1989).

8.1. Weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$

We assume that $\hat{\alpha}_2$ and $\hat{\theta}_2$ converge to some population parameters $\alpha_2^*$ and $\theta_2^*$, respectively. Under Conditions B.1–B.6 given in the Supplementary Appendix, we establish the weak oracle properties of $\hat{\alpha}_2$ and $\hat{\theta}_2$ in the following two theorems. Recall that $s_{\alpha_2} = |M_{\alpha_2}| = O(n^{l_4})$ for some $0 \le l_4 < 1/2$.

Theorem 8.1

Assume that Conditions B.1–B.3 hold, $l_4 + a_2 < 1$ and $\lambda_{\max}(\Sigma_{M_{\alpha_2}M_{\alpha_2}}) = O(1)$. Then, for sufficiently large n, there exists some constant $\gamma_{\alpha_2} > 0$ such that, with probability at least $1 - \bar{c}/(n+p)$,

  1. $\hat{\alpha}_{2,M_{\alpha_2}^c} = 0$;

  2. $\|\hat{\alpha}_{2,M_{\alpha_2}} - \alpha^*_{2,M_{\alpha_2}}\|_\infty = O(n^{-\gamma_{\alpha_2}}\log n)$.

Theorem 8.2

Assume that Conditions B.4–B.6 hold, $\lambda_{\max}(\Sigma_{M_{\theta_2}M_{\theta_2}}) = O(1)$, and $\|e_i\|_{\psi_2} < \infty$, where $e_i$ is defined in (2.2). Then there exists some constant $\gamma_{\theta_2} > 0$ such that, with probability at least $1 - \bar{c}/(n+p)$,

  1. $\hat{\theta}_{2,M_{\theta_2}^c} = 0$;

  2. $\|\hat{\theta}_{2,M_{\theta_2}} - \theta^*_{2,M_{\theta_2}}\|_\infty = O(n^{-\gamma_{\theta_2}}\log n)$.

Remark 8.1

Theorem 1 in Shi, Song and Lu (2015) established weak oracle results for the penalized estimators in a fixed design setting, mainly for technical convenience; its proofs can be obtained using similar arguments as in Fan and Lv (2011). In this paper, we focus on a random design setting, which is more realistic in medical studies. To the best of our knowledge, the weak oracle properties of penalized estimators have not been studied in a random design setting with NP-dimensionality. The major difficulty lies in developing some random matrix theory, such as controlling the maximum eigenvalues of certain random matrices. Such results are established in Theorems 8.1 and 8.2.

Remark 8.2

The condition $l_4 + a_2 < 1$ ensures that, for large $n$,

$$\max_{1 \le j \le p} \lambda_{\max}\big[(X_{M_{\alpha_2}})^T \mathrm{diag}(|X_j|)\, X_{M_{\alpha_2}}\big] = O(n), \tag{8.1}$$

with probability approaching 1. A major technical difficulty in deriving (8.1) is that the matrix $(X_{M_{\alpha_2}})^T \mathrm{diag}(|X_j|)\, X_{M_{\alpha_2}}$ does not have a subexponential tail (see Definition G.2 in the Supplementary Appendix). When $s_{\alpha_2} \le n$, we can bound $\max_{j \in M_{\alpha_2}} |X_{ij}|$ from above by $2\omega\sqrt{\log n}$ with probability at least $1 - 2/n$, which ensures the subexponential tail of the truncated matrix. Lemma B.2 in the Supplementary Appendix proves such a result in a more general setting.
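As a quick numerical illustration of this truncation device (ours; a Gaussian design stands in for the sub-Gaussian assumption, with $\omega$ playing the role of the scale parameter), entries on the support rarely exceed $2\omega\sqrt{\log n}$:

```python
# Illustration of the truncation bound in Remark 8.2 under a Gaussian design.
import numpy as np

rng = np.random.default_rng(2)
n, s, omega = 1000, 20, 1.0                 # omega: sub-Gaussian scale parameter
X = rng.normal(scale=omega, size=(n, s))    # n observations on the support set
thresh = 2 * omega * np.sqrt(np.log(n))
frac = np.mean(np.max(np.abs(X), axis=1) <= thresh)
print(f"fraction of rows with max |X_ij| <= 2*omega*sqrt(log n): {frac:.4f}")
```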

8.2. Weak oracle properties of $\hat\alpha_1$ and $\hat\theta_1$

The weak oracle properties of $\hat\alpha_1$ can be derived similarly to those of $\hat\alpha_2$. However, unlike the results for $\hat\theta_2$, the weak oracle properties of $\hat\theta_1$ depend on $\hat\beta_2$ even when the baseline mean function $h^{(1)}$ is correctly specified. This is because the estimated response $\hat V_i$ is obtained based on $\hat\beta_2$. A necessary condition to ensure $\hat\theta_1 \xrightarrow{P} \theta_1$ is that $\|\hat\beta_2 - \beta_{2,0}\|_2 \xrightarrow{P} 0$, which is established in Corollary 6.1.
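For concreteness, one standard A-learning construction of the estimated response, stated here as an assumption rather than a restatement of our earlier definition, takes $\hat V_i = Y_i + \{I(X_i^T \hat\beta_2 > 0) - A_i^{(2)}\}\, X_i^T \hat\beta_2$; the sketch below makes the dependence of $\hat V_i$ (and hence $\hat\theta_1$) on $\hat\beta_2$ explicit:

```python
# Sketch of the stage-1 pseudo-outcome: errors in beta2_hat enter V_hat
# directly, which is why consistency of beta2_hat (Corollary 6.1) is needed
# for the weak oracle property of theta1_hat.
import numpy as np

def pseudo_outcome(Y, A2, X2, beta2_hat):
    """V_i = Y_i + {I(X_i^T beta2_hat > 0) - A_i^(2)} * (X_i^T beta2_hat)."""
    blip = X2 @ beta2_hat                           # estimated stage-2 contrast
    return Y + ((blip > 0).astype(float) - A2) * blip
```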

Theorem 8.3

Assume that Condition 8 and Conditions B.7–B.12 in the Supplementary Appendix hold. Further assume that $\lambda_{\max}(\Psi_{M_{\alpha_1} M_{\alpha_1}}) = O(1)$, $\lambda_{\max}(\Psi_{M_{\theta_1} M_{\theta_1}}) = O(1)$, $n \gg s_{\beta_2} \log p\, \{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2 / \rho_{\min}^{s_{\beta_2}}(\Sigma)$, $a_1 + l_1 < 1$, $\|e_i\|_{\psi_2} < \infty$ and $\|V_i - E(V_i \mid S_i^{(1)}, A_i^{(1)})\|_{\psi_2} < \infty$. Then, for sufficiently large $n$, there exist some $\gamma_{\alpha_1} > 0$ and $\gamma_{\theta_1} > 0$ such that, with probability at least $1 - \bar c/(n + q + p)$, the estimators $\hat\alpha_1$ and $\hat\theta_1$ satisfy

  1. $\hat\alpha_{1, M_{\alpha_1}^c} = 0$ and $\hat\theta_{1, M_{\theta_1}^c} = 0$;

  2. $\|\hat\alpha_{1, M_{\alpha_1}} - \alpha_{1, M_{\alpha_1}}\|_\infty = O(n^{-\gamma_{\alpha_1}} \log n)$ and $\|\hat\theta_{1, M_{\theta_1}} - \theta_{1, M_{\theta_1}}\|_\infty = O(n^{-\gamma_{\theta_1}} \log n)$.

9. Uniform uncertainty principle and restricted eigenvalue conditions in A-learning

In this section, we establish the UUP and RE conditions in the context of A-learning. In our setting, these two conditions are needed for the random matrices $\Omega_n^{(2)}$ and $\Omega_n^{(1)}$.

For brevity, we only study the UUP and RE conditions for the random matrix $\Omega_n^{(2)}$; those for $\Omega_n^{(1)}$ can be derived similarly. Recall that $M_{\alpha_2}$ denotes the support of $\alpha_2$, $M_{\beta_2} = \mathrm{supp}(\beta_{2,0})$, and $s_{\beta_2} = |M_{\beta_2}|$. We assume that the weak oracle properties of $\hat\alpha_2$ hold, so that with probability at least $1 - \bar c/(n + p)$,

$$\hat\alpha_{2, M_{\alpha_2}^c} = 0 \quad\text{and}\quad \|\hat\alpha_2 - \alpha_2\|_\infty = O(n^{-\gamma_{\alpha_2}} \log n), \tag{9.1}$$

for some $\gamma_{\alpha_2} > 0$. The following lemma establishes the UUP condition for $\Omega_n^{(2)}$.

Lemma 9.1

Assume that the convergence rate of $\hat\alpha_2$ satisfies

$$\|\hat\alpha_2 - \alpha_2\|_2 = O\big(\sqrt{s_{\alpha_2}}\, n^{-\gamma_{\alpha_2}} \log n\big) = O(1),$$

and that the sample size satisfies

$$n \gg \frac{\{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2 \big(s_{\beta_2} \log p + s_{\alpha_2}^2\big)}{\inf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} \rho_{\min}^{s_{\beta_2}}\big(\Omega^{(2)}(\alpha_2)\big)}. \tag{9.2}$$

Then, for any $0 < \theta < 1$, with probability at least $1 - \bar c/(n + p)$, we have

$$\left|\frac{1}{n}\, y^T \Omega_n^{(2)} y - y^T \Omega^{(2)} y\right| \le 2\left\{\theta + \frac{4\omega^2}{n} + 2\omega^2\, \|\hat\alpha_2 - \alpha_2\|_2\, \lambda_{\max}(\Sigma_{M_{\alpha_2} M_{\alpha_2}})\right\} \rho_{\max}^{s_{\beta_2}}(\Sigma)\, \|y\|_2^2, \tag{9.3}$$

for any $y \in \mathbb{R}^p$ with $|\mathrm{supp}(y)| \le s_{\beta_2}$.

Remark 9.1

In our setting, if the following regularity conditions hold,

$$\liminf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} \rho_{\min}^{s_{\alpha_2}}\big(\Omega^{(2)}(\alpha_2)\big) > 0 \quad\text{and}\quad \rho_{\max}^{s_{\beta_2}}(\Sigma) = O(1),$$

then the sample size requirement (9.2) reduces to $n \gg s_{\beta_2} \log p$, since $s_{\alpha_2}^2 = O(n^{2 l_4}) \ll n$.

Remark 9.2

The second term on the right-hand side of (9.3) represents the difference between $y^T \bar\Omega_n^{(2)} y$ and $y^T \Omega^{(2)} y$, where $\bar\Omega_n^{(2)}$ is defined as the expectation of the truncated random matrix

$$\frac{1}{n} \sum_i X_i X_i^T A_i^{(2)} \big\{1 - \pi^{(2)}(X_i, \hat\alpha_2)\big\}\, I\big(\|X_{i, M_{\alpha_2}}\|_\infty \le 2\omega\sqrt{\log n}\big). \tag{9.4}$$

This term vanishes as $n \to \infty$. The third term represents the estimation error of $\hat\alpha_2$. When $\rho_{\max}^{s_{\beta_2}}(\Sigma) < \infty$ and $\lambda_{\max}(\Sigma_{M_{\alpha_2} M_{\alpha_2}})\, \|\hat\alpha_2 - \alpha_2\|_2 \to 0$, (9.3) establishes the UUP condition for $\Omega_n^{(2)}$.
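The sketch below (ours; names are illustrative) assembles the random matrix in (9.4) under a posited logistic propensity model, so that its expectation corresponds to $\bar\Omega_n^{(2)}$:

```python
# Sketch of the truncated random matrix in (9.4); its expectation is the
# matrix Omega_bar_n^(2) discussed in Remark 9.2.
import numpy as np

def truncated_matrix(X, A2, alpha2_hat, M_alpha2, omega):
    """(1/n) sum_i X_i X_i^T A_i^(2) {1 - pi^(2)(X_i, alpha2_hat)}
       * I(||X_{i, M_alpha2}||_inf <= 2 * omega * sqrt(log n))."""
    n = X.shape[0]
    pi2 = 1.0 / (1.0 + np.exp(-X[:, M_alpha2] @ alpha2_hat[M_alpha2]))
    keep = np.max(np.abs(X[:, M_alpha2]), axis=1) <= 2 * omega * np.sqrt(np.log(n))
    w = A2 * (1.0 - pi2) * keep              # per-observation scalar weights
    return (X.T * w) @ X / n                 # (1/n) sum_i w_i X_i X_i^T
```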

Remark 9.3

A key assumption in Lemma 9.1 is the sparsity of $\alpha_2$, which is needed to bound the infinity norm inside the indicator function in (9.4). This extra requirement stems from the involvement of the estimated propensity scores in $\Omega_n^{(2)}$, which adds significant difficulty to the proof of Lemma 9.1.

The RE condition for $\Omega_n^{(2)}$ follows from Lemma 9.1 after some algebra; it is presented below.

Lemma 9.2

For any integer $c_0$, assume that $\|\hat\alpha_2 - \alpha_2\|_2 = O(1)$ and that the sample size satisfies

$$n \gg \frac{\{\rho_{\max}^{s_{\beta_2}}(\Sigma)\}^2 \big(s_{\beta_2} \log p + s_{\alpha_2}^2\big)}{\inf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} K^2\big(s_{\beta_2}, c_0, \Omega^{(2)}(\alpha_2)\big)}. \tag{9.5}$$

Then, for any $0 < \theta < 1$ and sufficiently large $n$, with probability at least $1 - \bar c/(n + p)$, we have

$$K\big(s_{\beta_2}, c_0, \Omega_n^{(2)}\big) > (1 - \theta) \inf_{\alpha_2 \in \mathcal{H}_{\alpha_2}} K\big(s_{\beta_2}, c_0, \Omega^{(2)}(\alpha_2)\big).$$

Remark 9.4

The sample size requirement (9.5) is stronger than (9.2). To see this, note that for any positive semidefinite matrix $\Psi$ and any positive integers $s$ and $c_0$, we have

$$K^2(s, c_0, \Psi) \le K^2(s, 0, \Psi) = \rho_{\min}^{s}(\Psi).$$
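Computing $K(s, c_0, \Psi)$ exactly is combinatorial. The sketch below gives a crude Monte Carlo upper bound by random search over the cone in a Bickel–Ritov–Tsybakov-style definition of the restricted eigenvalue; it is an illustration under our reading of that definition, not a computation used in the paper:

```python
# Crude random-search upper bound on the restricted eigenvalue K(s, c0, Psi):
# minimize sqrt(y^T Psi y) / ||y_S||_2 over y with ||y_{S^c}||_1 <= c0 ||y_S||_1.
import numpy as np

def re_upper_bound(Psi, s, c0, n_draws=20_000, seed=0):
    rng = np.random.default_rng(seed)
    p = Psi.shape[0]
    best = np.inf
    for _ in range(n_draws):
        S = rng.choice(p, size=s, replace=False)      # candidate support set
        y = np.zeros(p)
        y[S] = rng.normal(size=s)
        off = rng.normal(size=p)
        off[S] = 0.0
        # rescale off-support part so that ||y_{S^c}||_1 = c0 * ||y_S||_1
        off *= c0 * np.abs(y[S]).sum() / max(np.abs(off).sum(), 1e-12)
        y += off
        best = min(best, np.sqrt(max(y @ Psi @ y, 0.0)) / np.linalg.norm(y[S]))
    return best

# Example: for Psi = I, the bound should be close to 1 from above.
print(re_upper_bound(np.eye(50), s=5, c0=1, n_draws=2_000))
```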

10. Discussion

10.1. Post-selection inference

As pointed out by one of the referees, the main goal of constructing optimal DTRs is to find treatments that are significantly superior to other treatment options. This requires addressing a post-selection inference issue, i.e., the problem of conducting inference on the estimated optimal value function (or on the difference between the estimated value and the value function under other treatment options). In the fixed-dimension setting, we can use either the empirical average of the advantage function (Murphy, 2003) or the augmented inverse propensity score weighted estimator (AIPWE; Zhang et al., 2012) to estimate the optimal value function. Both types of estimators are asymptotically normally distributed. However, the inference based on the advantage function may not be valid in high dimensions, because when the number of predictors is large, the parameter estimates in the contrast function may not have the oracle property (i.e., model selection consistency and asymptotic normality).

For a single-stage study, assume a linear interaction form $X^T \beta_0$ for the contrast function. Under certain conditions, we can show that the AIPWE is asymptotically normal even under NP-dimensionality if (i) $\|\hat\beta - \beta_0\|_2 = o_p(n^{-1/4})$, and (ii) with probability going to 1, $\|\hat\beta_{M_\beta^c} - \beta_{0, M_\beta^c}\|_1 \le c_0 \|\hat\beta_{M_\beta} - \beta_{0, M_\beta}\|_1$ for some constant $c_0$, where $M_\beta$ is the support of $\beta_0$. For our penalized A-learning estimator, Assumption (i) can be achieved under certain conditions on the dimension of the covariates, the sample size, and the sparsity of the parameters in the contrast, baseline and propensity score functions. Assumption (ii) is typically satisfied for Lasso, Dantzig and folded-concave type estimators. Similar to Theorem 6.1, we can show that our estimator satisfies $\|\hat\beta_{M_\beta^c} - \beta_{0, M_\beta^c}\|_1 \le \|\hat\beta_{M_\beta} - \beta_{0, M_\beta}\|_1$ with probability going to 1; the asymptotic normality of the AIPWE therefore follows. A standard error for the value estimator can be obtained as in Zhang et al. (2012). Alternatively, one can use the one-step online estimator of Luedtke and van der Laan (2016), although its asymptotic variance will be larger since it does not use all the data to construct the estimator. In summary, it is important and interesting to develop statistical inference for the estimated value function under the obtained optimal treatment regime in high dimensions, but this is beyond the scope of the current paper.
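As a concrete single-stage illustration of the AIPWE discussed above, the following sketch computes the value estimate and a plug-in standard error; the estimator form follows Zhang et al. (2012), while the function names and the pre-fitted nuisance models m_hat and pi_hat are assumptions of the sketch:

```python
# Hedged sketch of the single-stage AIPW value estimator and its plug-in SE.
import numpy as np

def aipwe_value(Y, A, X, rule, m_hat, pi_hat):
    """AIPW estimate of E{Y*(d)} for a binary-treatment rule d(x) in {0, 1}.

    rule:   function mapping X to recommended treatments in {0, 1}
    m_hat:  fitted outcome model, m_hat(X, a) approximates E(Y | X, A = a)
    pi_hat: fitted propensity model, pi_hat(X) approximates P(A = 1 | X)
    """
    d = rule(X)
    p_d = np.where(d == 1, pi_hat(X), 1.0 - pi_hat(X))  # P(A = d(X) | X)
    follows = (A == d).astype(float)                     # followed the rule?
    m_d = m_hat(X, d)
    phi = follows / p_d * (Y - m_d) + m_d                # AIPW pseudo-observations
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(Y))
```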

10.2. Tuning parameter selection

A Bayesian information criterion (BIC) is used to tune the penalty functions. BIC has been widely used for selecting the tuning parameter in model selection when the goal is prediction. In high-dimensional regression, Chen and Chen (2008) proposed an extended BIC for model selection and showed that it is consistent when the number of predictors grows polynomially in the sample size. Fan and Tang (2013) proposed a similar criterion and showed its consistency when the number of predictors is of a non-polynomial order of the sample size. When the goal is to select treatment effect modifiers, Lu et al. (2011) also used a BIC-type criterion, which showed good empirical performance. This motivated us to use a similar BIC-type criterion for selecting the tuning parameter in our method. Our simulations demonstrate that the proposed BIC-type criterion works well empirically. We conjecture that, following arguments similar to the proofs of Theorem 1 in Chen and Chen (2008) and Theorem 3 in Fan and Tang (2013), our BIC-type criterion can be shown to be consistent for selecting the important variables in the contrast function. This is another interesting topic that warrants further investigation.
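A minimal sketch of such a BIC-type criterion is given below; the exact penalty is an assumption of the sketch (the $\log p$ term mimics the extended-BIC correction of Chen and Chen, 2008), and fit_at is a hypothetical routine returning the residual sum of squares and the number of selected variables at a given tuning value:

```python
# Sketch of a BIC-type tuning criterion over a grid of penalty levels.
import numpy as np

def bic_type(rss, df, n, p):
    """Smaller is better; df is the number of selected variables.
    The extra df * log(p) term follows the extended-BIC idea for large p."""
    return n * np.log(rss / n) + df * (np.log(n) + 2.0 * np.log(p))

# Hypothetical usage, with fit_at(lam) -> (rss, df) from the penalized fit:
# best_lam = min(lambda_grid, key=lambda lam: bic_type(*fit_at(lam), n, p))
```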

10.3. Extensions to multiple stages and general models

In this paper, we mainly focus on a two-stage study. Extensions of the results to three-stage studies are provided in the supplementary article. Establishing these results raises additional challenges, since potential misspecification of the contrast functions in the previous two stages can accumulate, and stronger assumptions are needed to guarantee consistency of the parameter estimates. Readers may refer to the supplementary article for details.

For technical convenience, we assume a linear interaction form for the contrast function at the last stage. More general results for a misspecified contrast function can be derived similarly to the three-stage case discussed in the supplementary article.

Supplementary Material

Supplement

Acknowledgments

We thank the editor, the AE and two referees for providing helpful suggestions that significantly improved the quality of the paper. The STAR*D data are provided by National Institute of Mental Health. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.

Contributor Information

Chengchun Shi, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

Alin Fan, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

Rui Song, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

Wenbin Lu, Department of Statistics, North Carolina State University, Raleigh NC, U.S.A.

References

  1. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann Statist. 2009;37:1705–1732.
  2. Candès E, Tao T. Rejoinder: "The Dantzig selector: statistical estimation when p is much larger than n". Ann Statist. 2007;35:2392–2404.
  3. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19:317–343. doi: 10.1177/0962280209105013.
  4. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95:759–771.
  5. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  6. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, et al. Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Psychiatric Clinics of North America. 2003;26:457–494. doi: 10.1016/s0193-953x(02)00107-7.
  7. Luedtke AR, van der Laan MJ. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann Statist. 2016;44:713–742. doi: 10.1214/15-AOS1384.
  8. Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Statist. 2009;37:3498–3528.
  9. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Monographs on Statistics and Applied Probability. London: Chapman & Hall; 1989.
  10. Mendelson S, Pajor A, Tomczak-Jaegermann N. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom Funct Anal. 2007;17:1248–1282.
  11. Mendelson S, Pajor A, Tomczak-Jaegermann N. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constr Approx. 2008;28:277–289.
  12. Milman VD, Pajor A. Isotropic position and inertia ellipsoids and zonoids of the unit ball of a normed n-dimensional space. In: Geometric Aspects of Functional Analysis. Springer; 1989. pp. 64–104.
  13. Milman V, Pajor A. Regularization of star bodies by random hyperplane cut off. Studia Math. 2003;159:247–261.
  14. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65:331–366.
  15. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Statist. 2011;39:1180–1210. doi: 10.1214/10-AOS864.
  16. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011.
  17. Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM, et al. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials. 2004;25:119–142. doi: 10.1016/s0197-2456(03)00112-0.
  18. Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc Ser B Stat Methodol. 2011;73:273–282.
  19. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
  20. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
  21. Zhang B, Tsiatis AA, Laber EB, Davidian M. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika. 2013;100:681–694. doi: 10.1093/biomet/ast014.
  22. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107:1106–1118. doi: 10.1080/01621459.2012.695674.
  23. Zhao YQ, Zeng D, Laber EB, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. J Amer Statist Assoc. 2015;110:583–598. doi: 10.1080/01621459.2014.937488.
  24. Zhou S. Restricted eigenvalue conditions on subgaussian random matrices. 2009. arXiv:0912.4045.
  25. Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR. Residual weighted learning for estimating individualized treatment rules. J Amer Statist Assoc. 2015; just accepted.
