Author manuscript; available in PMC 2017 Aug 4. Published in final edited form as: Electron J Stat. 2016 Oct 13;10:2894–2921. doi: 10.1214/16-EJS1178

Robust learning for optimal treatment decision with NP-dimensionality

Chengchun Shi 1, Rui Song 1, Wenbin Lu 1
PMCID: PMC5544015  NIHMSID: NIHMS834730  PMID: 28781717

Abstract

In order to identify important variables involved in making optimal treatment decisions, Lu, Zhang and Zeng (2013) proposed a penalized least-squares regression framework for a fixed number of predictors, which is robust against misspecification of the conditional mean model. Two problems arise: (i) in a world of explosively big data, effective methods are needed to handle ultra-high dimensional data sets, for example, when the dimension of predictors is of non-polynomial (NP) order of the sample size; (ii) both the propensity score and conditional mean models need to be estimated from data under NP dimensionality.

In this paper, we propose a robust two-step procedure for estimating the optimal treatment regime under NP dimensionality. In both steps, penalized regressions with non-concave penalty functions are employed, and the posited conditional mean model of the response given predictors may be misspecified. The asymptotic properties of the proposed estimators, such as weak oracle properties, selection consistency and oracle distributions, are investigated. In addition, we study the limiting distribution of the estimated value function for the obtained optimal treatment regime. The empirical performance of the proposed estimation method is evaluated by simulations and an application to a depression dataset from the STAR*D study.

Keywords and phrases: Non-concave penalized likelihood, optimal treatment strategy, oracle property, variable selection

1. Introduction

Personalized medicine, which has gained much attention over the past few years, is a medical paradigm that emphasizes the systematic use of individual patient information to optimize that patient's health care. In this paradigm, the primary interest lies in identifying the optimal treatment strategy that assigns the best treatment to a patient based on his/her observed covariates. Formally speaking, a treatment regime is a function that maps the sample space of patients' covariates to the set of treatments.

There is a growing literature on estimating optimal individualized treatment regimes. Existing methods can be cast into two categories: model-based methods and direct search methods. Popular model-based methods include Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010) and A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003), where Q-learning models the conditional mean of the response given predictors and treatment, while A-learning models only the interaction between treatment and predictors, better known as the contrast function. The advantage of A-learning is its robustness against misspecification of the baseline mean function, provided that the propensity score model is correctly specified. Recently, Zhang et al. (2012) proposed inverse propensity score weighted (IPSW) and augmented-IPSW estimators to directly maximize the mean potential outcome under a given treatment regime, i.e. the value function. Moreover, Zhao et al. (2012) recast the estimation of the value function from a classification perspective and used machine learning tools to directly search for the optimal treatment regime.

Rapid advances and breakthroughs in technology and communication systems have made it possible to gather an extraordinarily large number of prognostic factors for each individual. For example, in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, 305 covariates are collected from each patient. With such data at hand, it is of significant importance to organize and integrate the information that is relevant to making optimal individualized treatment decisions, which makes variable selection an emerging need for implementing personalized medicine. There have been extensive developments of variable selection methods for prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), MCP (Zhang, 2010) and many others in the context of penalized regression. Their associated inferential properties have been studied when the number of predictors is fixed, diverging with the sample size, or of non-polynomial order of the sample size.

In contrast to the large amount of work on variable selection methods for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Among those available, Gunter, Zhu and Murphy (2011) proposed variable ranking methods for the marginal qualitative interaction of predictors with treatment. Fan, Lu and Song (2015) developed a sequential advantage selection method that extends the marginal ranking methods by selecting important variables with qualitative interactions in a sequential fashion. However, no theoretical justifications were provided for these methods. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime. However, the associated variable selection properties, such as selection consistency, convergence rates and oracle distributions, were not studied. Lu, Zhang and Zeng (2013) introduced a new penalized least-squares regression framework, which is robust against misspecification of the conditional mean function. However, they only studied the case where the number of covariates is fixed and the propensity score model is known, as in randomized clinical trials. Song et al. (2015) proposed penalized outcome weighted learning for the case with a fixed number of predictors.

In this paper, we study the penalized least-squares regression framework considered in Lu, Zhang and Zeng (2013) when the number of predictors is of non-polynomial (NP) order of the sample size. In addition, we consider the more general situation where the propensity score model may depend on predictors and needs to be estimated from data, as is common in observational studies. A two-step estimation procedure is developed. In the first step, penalized regression models are fitted for the propensity score and the conditional mean of the response given predictors. In the second step, the optimal treatment regime is estimated using a penalized least-squares regression with the estimated propensity score and conditional mean models obtained in the first step. There are several challenges in both the numerical implementation and the derivation of theoretical properties, such as weak oracle and oracle properties, for the proposed estimation procedure. First, since the posited model for the conditional mean of the response given predictors may be misspecified, the associated estimation and variable selection properties under model misspecification with NP dimensionality are not standard. Second, it is unknown how the asymptotic properties of the estimators for the optimal treatment regime obtained in the second step depend on the estimated propensity score and conditional mean models obtained in the first step under NP dimensionality. To our knowledge, these two challenges have not been studied in the literature. Moreover, we estimate the value function of the estimated optimal regime and study this estimator's theoretical properties.

The remainder of the paper is organized as follows. The proposed method for estimating the optimal treatment regime is introduced in Section 2. Simulation results are presented in Section 3. An application to a dataset from the STAR*D study is illustrated in Section 4. Sections 5 and 6 establish the weak oracle and oracle properties of the resulting estimators, respectively. The estimator for the value function of the estimated optimal treatment regime is given in Section 7, followed by concluding remarks in Section 8. All the technical proofs are given in the Appendix.

2. Method

Let Y denote the response, A ∈ 𝒜 denote the treatment received, where 𝒜 is the set of available treatment options, and X denote the baseline covariates, including the constant one. For demonstration purposes, we focus on a binary treatment regime, i.e., 𝒜 = {0, 1}, with 0 for the standard treatment and 1 for the new treatment. We consider the following semiparametric model:

$$Y = h_0(X) + A\,(\beta_0^T X) + e, \qquad (2.1)$$

where $h_0(X)$ is an unspecified baseline function, $\beta_0$ is the p-dimensional vector of regression coefficients and e is an independent error with mean 0 and variance $\sigma^2$. Under the assumptions of stable unit treatment value (SUTVA) and no unmeasured confounders (Rubin, 1974), it can be shown that the optimal treatment regime $d^{opt}(x)$ for patients with baseline covariates X = x takes the form

$$d^{opt}(x) = I\{E(Y|X=x, A=1) - E(Y|X=x, A=0) > 0\} = I(\beta_0^T x > 0),$$

where I(·) is the indicator function.
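To see why the rule takes this form, note that under model (2.1) the treatment contrast is linear in x:

$$E(Y|X=x, A=1) - E(Y|X=x, A=0) = \{h_0(x) + \beta_0^T x\} - h_0(x) = \beta_0^T x,$$

so the optimal regime assigns the new treatment exactly when this contrast is positive, regardless of the form of the baseline function $h_0$.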

Our primary interest is in estimating the regression coefficients $\beta_0$ that define the optimal treatment regime. Let $\pi(x) = P(A = 1|X = x)$ denote the propensity score. We assume a logistic regression model for $\pi(x)$:

$$\pi(x, \alpha_0) = \exp(x^T\alpha_0)/\{1 + \exp(x^T\alpha_0)\}, \qquad (2.2)$$

with p-dimensional parameter $\alpha_0$. Here, we allow the propensity score to depend on covariates, which is common in observational studies, and the parameter $\alpha_0$ is estimated from the data. For randomized clinical trials, $\pi(x, \alpha_0)$ is a constant. We assume that the majority of the elements in $\beta_0$ and $\alpha_0$ are zero and refer to the supports $\mathrm{supp}(\beta_0)$ and $\mathrm{supp}(\alpha_0)$ as the true underlying sparse models.

Consider a study with n subjects. Assume the design matrix $X = (x_1, \ldots, x_n)^T$ is deterministic. The observed data consist of $\{(Y_i, A_i, x_i) : i = 1, \cdots, n\}$. Define $\mu(x) = h_0(x) + \pi(x, \alpha_0)\,x^T\beta_0$, the conditional mean of the response given covariates X = x. We propose the following two-step procedure to estimate the optimal treatment regime. In the first step, we posit a model $\Phi(x, \theta)$ for the conditional mean function $\mu(x)$ and consider penalized estimation of the propensity score and conditional mean models as follows.

Define

$$\hat\alpha = \arg\min_\alpha\ \frac{1}{n}\sum_{i=1}^n\left[\log\{1 + \exp(x_i^T\alpha)\} - A_i x_i^T\alpha\right] + \sum_{j=1}^p \lambda_{1n}\,\rho_1(|\alpha_j|, \lambda_{1n}), \qquad (2.3)$$

and

$$\hat\theta = \arg\min_\theta\ \frac{1}{n}\sum_{i=1}^n\{Y_i - \Phi(x_i, \theta)\}^2 + \sum_{j=1}^q \lambda_{2n}\,\rho_2(|\theta_j|, \lambda_{2n}), \qquad (2.4)$$

where $\alpha_j$ and $\theta_j$ denote the jth elements of α and θ, q is the dimension of θ, and $\rho_1$ and $\rho_2$ are folded concave penalty functions with tuning parameters $\lambda_{1n}$ and $\lambda_{2n}$, respectively. We allow p and q to be of NP order of n and assume $\log p = O(n^{1-2d_\beta})$ and $\log q = O(n^{1-2d_\theta})$ for some $d_\beta, d_\theta \in (0, \frac{1}{2})$. The posited model $\Phi(x, \theta)$ may be misspecified.

Define $\hat\Phi_i = \Phi(x_i, \hat\theta)$ and $\hat\pi_i = \pi(x_i, \hat\alpha)$. In the second step, we consider the following penalized least-squares estimation:

$$\hat\beta = \arg\min_\beta\ \frac{1}{n}\sum_{i=1}^n\{Y_i - \hat\Phi_i - (A_i - \hat\pi_i)\,\beta^T x_i\}^2 + \sum_{j=1}^p \lambda_{3n}\,\rho_3(|\beta_j|, \lambda_{3n}), \qquad (2.5)$$

where ρ3 is a folded-concave penalty function with the tuning parameter λ3n. Here the folded-concave penalty functions ρ1, ρ2 and ρ3 are assumed to satisfy the following condition:

Condition 2.1. ρ(t, λ) is increasing and concave in t ∈ [0, ∞), and has a continuous derivative ρ′(t, λ) with ρ′(0+, λ) > 0. In addition, ρ′(t, λ) is increasing in λ ∈ [0, ∞) and ρ′(0+, λ) is independent of λ.
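As a concrete example, the SCAD penalty (Fan and Li, 2001), written in the scaling of (2.3)-(2.5) where the penalty term is $\lambda\rho(t, \lambda)$ (our reading of the convention, following Fan and Lv, 2011), has derivative

$$\rho'(t, \lambda) = I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda}\,I(t > \lambda), \qquad a > 2,$$

so that $\rho'(0+, \lambda) = 1$ does not depend on λ and $\rho'(t, \lambda)$ is increasing in λ, as Condition 2.1 requires; a = 3.7 is the usual default.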

Popular penalties, such as LASSO, SCAD and MCP, satisfy Condition 2.1. In our implementation, we use the SCAD penalty. Here, we adopt a two-step estimation procedure due to its computational simplicity. Alternatively, we could jointly estimate the parameters θ in the conditional mean model and β in the contrast function in a single penalized regression. However, this joint approach requires more computational effort, since the tuning parameters for θ and β would need to be selected simultaneously. In contrast, our two-step method only requires a single tuning parameter at each step and thus can be easily implemented with existing software, for example, the R package ncvreg, as sketched below.
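A minimal sketch of the two-step procedure using ncvreg follows. The toy data generation and the linear working model for Φ are our illustrative assumptions, not the authors' implementation; only the cv.ncvreg/predict/coef calls are the package's actual interface.

    ## Two-step estimation of the optimal treatment regime with SCAD penalties.
    library(ncvreg)
    set.seed(1)
    n <- 300; p <- 50
    X <- matrix(rnorm(n * p), n, p)                     # covariates (no constant column)
    A <- rbinom(n, 1, plogis(1.5 * X[, 1]))             # treatment from model (2.2)
    Y <- 1 - 2 * X[, 1] - X[, 2] +
         A * (-1.5 * X[, 1] + 1.5 * X[, 2]) + rnorm(n)  # toy outcome of the form (2.1)

    ## Step 1a: penalized logistic regression (2.3) for the propensity score;
    ## the tuning parameter is chosen by 10-fold cross-validation.
    cv_alpha <- cv.ncvreg(X, A, family = "binomial", penalty = "SCAD")
    pi_hat   <- predict(cv_alpha, X, type = "response")

    ## Step 1b: penalized least squares (2.4) for the conditional mean mu(x),
    ## with a (possibly misspecified) linear working model Phi(x, theta).
    cv_theta <- cv.ncvreg(X, Y, family = "gaussian", penalty = "SCAD")
    Phi_hat  <- predict(cv_theta, X)

    ## Step 2: penalized least squares (2.5): regress Y - Phi_hat on
    ## (A - pi_hat) * x, where x includes the constant one.
    Z        <- (A - pi_hat) * cbind(1, X)              # row-wise scaling of [1, X]
    cv_beta  <- cv.ncvreg(Z, Y - Phi_hat, family = "gaussian", penalty = "SCAD")
    beta_hat <- coef(cv_beta)[-1]                       # drop ncvreg's own intercept

    ## Estimated optimal regime: treat (A = 1) when x' beta_hat > 0.
    d_hat <- as.integer(cbind(1, X) %*% beta_hat > 0)

Note that each step involves a single cross-validated tuning parameter, which is what makes the two-step scheme convenient in practice.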

3. Numerical studies

In this section, we evaluate the numerical performance of the proposed estimators in various settings. The treatment was generated from the logistic propensity score model (2.2), with only one important covariate, whose coefficient was set to 1.5. We chose three forms for the baseline function $h_0(x)$: a simple linear form, a quadratic form and a complex nonlinear form,

  • Model I: $Y = 1 + \theta_0^T X + A(\beta_0^T \tilde{X}) + \varepsilon$,
  • Model II: $Y = 1 + 0.5\,(1 + \theta_0^T X)^2 + A(\beta_0^T \tilde{X}) + \varepsilon$,
  • Model III: $Y = 1 + 1.5\sin(\pi\,\theta_0^T X) + X_1^2 + A(\beta_0^T \tilde{X}) + \varepsilon$,

where X is a p-dimensional vector of covariates and $\tilde{X} = (1, X^T)^T$. We set p = 1000. Covariates were generated independently from two distributions: the standard normal, or a shifted exponential distribution with mean 0 and variance 1.

For each model, the first two covariates were chosen as important variables in both the baseline mean function and the contrast function, with $\theta_0 = (-2, -1, 0, \ldots, 0)^T$ and $\beta_0 = (0, -1.5, 1.5, 0, \ldots, 0)^T$. We considered two sample sizes, n = 300 and n = 500. For each scenario, we conducted 1000 replications. In our method, we fitted a linear model for $\Phi(X, \theta)$ and used the SCAD penalty for variable selection. The tuning parameters were chosen by 10-fold cross-validation.

To evaluate the performance of the proposed estimator, we also compared our method with penalized Q-learning using the SCAD penalty. Specifically, we fitted a linear model with baseline covariate effects and treatment-covariate interactions. Note that this working model is correctly specified under Model I but misspecified under Models II and III.

Let β̂ and β̃ denote our estimator and the penalized Q-learning estimator, respectively. We report the L2 losses of β̂ and β̃, the number of missed important variables (denoted as FN), the number of selected noise variables (denoted as FP) and the average percentage of correct decisions (denoted as PCD), defined as $1 - \sum_{i=1}^n |d(x_i) - I(\beta_0^T x_i > 0)|/n$ for the treatment rules $\hat{d}(x) = I(x^T\hat\beta > 0)$ and $\tilde{d}(x) = I(x^T\tilde\beta > 0)$. In addition, we estimated E{Y(d̂)}, E{Y(d̃)} and E{Y(d^opt)}, the value functions of the estimated optimal treatment regimes obtained by our method and by penalized Q-learning, and of the true optimal regime, respectively, using Monte Carlo simulations. For a given treatment rule d(x), we compute E{Y(d)} by averaging the responses of 20000 subjects generated from the true model with A determined by d(x). We report the averages of these mean responses over the 1000 replications as well as their standard deviations.
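A sketch of these evaluation metrics, continuing the notation of the sketch in Section 2 (the object names and the Model I generator are ours, under our reconstruction of the design):

    ## PCD over the observed design points x_i:
    Xt  <- cbind(1, X)                                   # attach the constant one
    pcd <- 1 - mean(abs((Xt %*% beta_hat > 0) - (Xt %*% beta0 > 0)))

    ## Monte Carlo value E{Y(d_hat)} under Model I: draw 20000 new subjects,
    ## fix A at the estimated rule, and average the generated responses.
    n_mc   <- 20000
    X_new  <- matrix(rnorm(n_mc * p), n_mc, p)
    Xt_new <- cbind(1, X_new)
    a_hat  <- as.integer(Xt_new %*% beta_hat > 0)
    y_pot  <- 1 + X_new %*% theta0 + a_hat * (Xt_new %*% beta0) + rnorm(n_mc)
    value  <- mean(y_pot)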

Table 1 summarizes the results. The penalized Q-learning method performs well under Model I, where the fitted linear model is correctly specified, and is more efficient than the proposed method, as expected. For example, when covariates are i.i.d. normal and n = 300, its PCD is around 99.3% and the estimated value function is very close to the true optimal value E{Y(d^opt)}. In contrast, under this setting, the PCD of our proposed method is 97.5%, and the estimated value function is slightly lower.

Table 1. Simulation results for L2 loss, FN, FP, PCD and values.

Robust learning with covariates i.i.d. normal

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.276            1.743            1.171
               500   0.189            1.453            0.700
  FP           300   5.104            9.148            12.481
               500   4.143            9.742            12.616
  FN           300   0.000            0.893            0.125
               500   0.000            0.471            0.002
  PCD          300   0.975            0.734            0.834
               500   0.983            0.789            0.904
  EY*(d̂)      300   1.842 (0.021)    4.544 (0.157)    2.716 (0.089)
               500   1.845 (0.019)    4.643 (0.116)    2.797 (0.048)
  EY*(d^opt)         1.847            4.846            2.847

Penalized Q-learning with covariates i.i.d. normal

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.080            4.861            1.729
               500   0.061            4.928            1.833
  FP           300   0.001            8.191            7.745
               500   0.000            4.438            7.972
  FN           300   0.000            0.050            0.757
               500   0.000            0.006            0.553
  PCD          300   0.993            0.550            0.714
               500   0.994            0.538            0.690
  EY*(d̃)      300   1.846 (0.021)    4.117 (0.165)    2.508 (0.192)
               500   1.846 (0.020)    4.091 (0.093)    2.457 (0.204)
  EY*(d^opt)         1.847            4.846            2.847

Robust learning with covariates i.i.d. shifted exponential

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.290            1.768            1.186
               500   0.199            1.495            0.730
  FP           300   6.596            9.700            13.240
               500   4.972            10.512           13.932
  FN           300   0.000            0.793            0.142
               500   0.000            0.466            0.003
  PCD          300   0.958            0.724            0.809
               500   0.971            0.761            0.871
  EY*(d̂)      300   1.744 (0.018)    4.500 (0.179)    2.670 (0.095)
               500   1.747 (0.018)    4.562 (0.161)    2.736 (0.041)
  EY*(d^opt)         1.751            4.751            2.783

Penalized Q-learning with covariates i.i.d. shifted exponential

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.264            2.580            2.225
               500   0.121            3.236            2.408
  FP           300   0.003            12.257           13.234
               500   0.000            21.383           15.479
  FN           300   0.045            0.824            0.288
               500   0.005            0.377            0.072
  PCD          300   0.954            0.610            0.609
               500   0.978            0.595            0.584
  EY*(d̃)      300   1.744 (0.018)    4.500 (0.179)    2.670 (0.095)
               500   1.743 (0.037)    4.197 (0.289)    2.201 (0.224)
  EY*(d^opt)         1.751            4.751            2.783

However, for Models II and III, the penalized Q-learning method can incur substantial bias and performs much worse than the proposed method. Taking Model II as an example, when covariates are normal and n = 300, $\|\tilde\beta - \beta_0\|_2 = 4.86$, approximately three times as large as $\|\hat\beta - \beta_0\|_2$. The PCD of the treatment regime estimated by penalized Q-learning is 55.0%, only slightly better than a random guess. In contrast, for this scenario, the PCD of our proposed method is 73.4%. Moreover, when the sample size increases, the performance of the penalized Q-learning method becomes even worse, due to the misspecification of the baseline mean function. For our method, the PCD increases markedly as the sample size grows, and the L2 loss and the average number of missed important variables are also greatly reduced. This demonstrates the robustness of the proposed method against misspecification of the baseline mean function.

4. Real data example

We applied our method to the data set from the STAR*D study of 4041 patients with nonpsychotic major depressive disorder (MDD). The aim of the study was to determine the effectiveness of different treatments for people who had not responded to an initial medication. At Level 1, all patients received citalopram (CIT), a selective serotonin reuptake inhibitor (SSRI). After 8-12 weeks, up to three more levels of treatment were offered to participants whose previous treatment did not give an acceptable response. Available options at Level 2 included switching to sertraline (SER), venlafaxine (VEN), bupropion (BUP) or cognitive therapy (CT), and augmenting CIT by combining it with one more treatment. At Level 2A, switch options to VEN or BUP were provided for patients receiving CT but without sufficient improvement. Four treatments were available at Level 3 for participants without an adequate response, including medication switches to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Finally, treatment with tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN) was provided at Level 4 for those without sufficient improvement at Level 3.

Here, we focused on the subset of patients receiving treatment BUP (coded as 1) or SER (coded as 0) at Level 2. The outcome of interest was the 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) score, which indicates the severity of a patient's depressive symptoms. The maximum value of QIDS-C16 is 24 and its distribution was highly skewed; hence, we considered the transformation Yi = log(25 − QIDS-C16) as our response, so that a larger value of Yi indicates a better response. All baseline variables at Level 1 and intermediate outcomes at Level 2 were included as covariates, yielding 305 covariates in total for each patient. There were 383 patients receiving treatment BUP or SER at Level 2; however, only 319 patients had complete records of all 305 covariates and the response. Among them, 153 were treated with BUP and 166 with SER. Our proposed method selected 14 variables as important for the treatment decision. We re-estimated the coefficients of these variables by solving the A-learning estimating equations (Robins, 2004) and obtained the resulting estimated optimal treatment regime.

To examine the performance of the estimated optimal treatment regime, we compared it with the fixed regimes that assign all patients to either BUP or SER, in terms of the estimated value functions obtained by the IPSW method (Zhang et al., 2012). The estimated value functions are given in Table 2. In addition, we report 95% confidence intervals for the difference between the estimated value of the obtained optimal regime and that of each fixed regime, based on 500 bootstrap samples (a sketch of this computation is given after Table 3). Our estimated optimal treatment regime gave larger estimated values than both fixed regimes. The difference is significant at the 5% level when compared with the BUP treatment, but less so when compared with the SER treatment. One reason is that our estimated optimal regime assigns the majority of patients (about two-thirds) to SER. Table 3 reports the numbers of patients receiving BUP or SER according to the estimated optimal regime.

Table 2. Estimated value functions and confidence intervals for the difference of the estimated values.

Treatment regime Estimated value function Diff 95% CI on Diff
Estimated optimal regime 3.10
BUP 2.55 0.55 [0.07, 1.13]
SER 2.80 0.30 [−0.08, 0.64]

Table 3. Number of patients receiving BUP or SER, according to the estimated optimal treatment regime.

receives BUP receives SER total
assigns BUP 66 50 116
assigns SER 93 110 203
total 153 160 319
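A sketch of the IPSW value estimate and bootstrap comparison behind Table 2. The inputs Y, A, pi_hat (fitted propensities) and d_hat (estimated decisions, 1 = BUP) are assumed from the earlier fit; the normalized form of the IPSW estimator is one common variant, and for brevity we do not re-estimate the propensity model within each resample, as a full replication of the analysis would.

    ## Normalized IPSW value estimate for a rule d (0/1 vector).
    ipsw_value <- function(Y, A, pi_hat, d) {
      p_d <- ifelse(d == 1, pi_hat, 1 - pi_hat)  # P(A = d(x) | x)
      w   <- as.numeric(A == d) / p_d            # inverse-probability weights
      sum(w * Y) / sum(w)
    }

    ## 500-resample bootstrap CI for the value difference vs. the all-BUP regime.
    set.seed(1)
    diffs <- replicate(500, {
      idx <- sample(length(Y), replace = TRUE)
      ipsw_value(Y[idx], A[idx], pi_hat[idx], d_hat[idx]) -
        ipsw_value(Y[idx], A[idx], pi_hat[idx], rep(1, length(idx)))
    })
    quantile(diffs, c(0.025, 0.975))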

In addition, as suggested by a referee, we examined the effect of missing data. Specifically, we deleted one patient whose response was missing and imputed all the missing values in the covariates using the R package missForest, available on CRAN. This package uses a random forest, trained on the observed entries of the design matrix, to predict the missing values. The optimal treatment regime obtained from the imputed data was similar to the one obtained from the complete-case analysis above: it selected 14 variables, 11 of which were also included in the estimated optimal treatment regime without imputation. In addition, the bootstrap results suggested that the estimated value of the estimated optimal treatment regime is significantly larger than those of the fixed treatment regimes at the 0.05 significance level. Since the results are similar, we omit them here.
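The imputation step amounts to a one-call use of missForest; X_mis below stands for the covariate matrix with missing entries (our name for it):

    library(missForest)
    set.seed(1)
    imp   <- missForest(X_mis)   # random forest imputation of the missing entries
    X_imp <- imp$ximp            # completed covariate matrix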

5. Non-asymptotic weak oracle properties

In this section we show that the proposed estimators enjoy the weak oracle property; that is, α̂, θ̂ and β̂ defined in (2.3)-(2.5) are sign consistent with probability tending to 1 and are consistent with respect to the $L_\infty$ norm. The weak oracle property of θ̂ is established in the sense that it converges to a least false parameter θ* when the main effect model is misspecified.

Theorem 5.1 provides the main results. Some regularity conditions are discussed in Subsections 5.1 and 5.2. A major technical challenge in deriving the weak oracle property of β̂ is to analyze the deviation in (5.18), for which we develop a general empirical process result in the supplementary article (Shi, Song and Lu, 2016). This result is important in its own right and can be used to analyze many other high-dimensional semiparametric models where the index parameter of an empirical process is a plug-in estimator. The following notation is introduced to simplify the presentation.

Let 1 denote a vector of ones, E the identity matrix and O the zero matrix. For any matrix Ψ, let P(Ψ) denote the projection matrix $\Psi(\Psi^T\Psi)^{-1}\Psi^T$, and $\Psi_M$ the submatrix of Ψ formed by the columns in the subset M. For any vectors a and b, let "○" denote the Hadamard product $a \circ b = (a_1b_1, \ldots, a_nb_n)^T$, $|a| = (|a_1|, \ldots, |a_n|)^T$, diag(a) the diagonal matrix with the elements of a on its diagonal, and $a_M$ the subvector of a formed by the elements in M. The jth element of a is denoted by $a_j$. Let $\|\cdot\|_p$ be the $L_p$ norm of a vector or matrix. For any m ≥ 1, let $\|Y\|_{\psi_m}$ denote the Orlicz norm of a random variable Y,

$$\|Y\|_{\psi_m} = \inf\left\{u > 0 : E\exp(|Y|^m/u^m) \le 2\right\}.$$

Let $M_\alpha = \mathrm{supp}(\alpha_0)$, $M_\beta = \mathrm{supp}(\beta_0)$, $M_\theta = \mathrm{supp}(\theta^*)$, and $M_\alpha^c$, $M_\beta^c$, $M_\theta^c$ be their complements. Assume each column $x^j$ is standardized so that $\|x^j\|_2 = \sqrt{n}$. Let $\Phi(\theta) = [\Phi(x_1, \theta), \ldots, \Phi(x_n, \theta)]^T$ and let $\phi(\theta) = [\phi_1(\theta), \ldots, \phi_q(\theta)]$ denote its Jacobian matrix. The derivatives are taken componentwise, i.e.,

$$\phi_l(\theta) = (\phi_l(x_1, \theta), \ldots, \phi_l(x_n, \theta))^T, \qquad \phi_l(x, \theta) = \partial\Phi(x, \theta)/\partial\theta_l,$$

for all l = 1, …, q. We write Φ(θ*) and ϕ(θ*) simply as Φ and ϕ when there is no confusion, and use the shorthand $\hat\Phi$, $\hat\phi$ for $\Phi(\hat\theta)$, $\phi(\hat\theta)$.

5.1. The misspecified function

We first define the least false parameter under the misspecification of the posited mean function $\Phi(x, \theta)$. For regression models with a fixed number of predictors, the definition of the least false parameter under model misspecification has been widely studied in the literature (e.g., White, 1982; Li and Duan, 1989). However, for regression models with NP dimensionality, its definition is more subtle. Here, we define our least false parameter as follows.

For each $\theta \in \mathbb{R}^q$, let $d_{n\theta} = \min_j\{|\theta_j| : \theta_j \ne 0\}/2$, $M_\theta$ be the support of θ, $\mu = (\mu(x_1), \ldots, \mu(x_n))^T$ and

$$H_\theta = \left\{\delta \in \mathbb{R}^q : \delta_{M_\theta^c} = 0,\ \|\delta_{M_\theta} - \theta_{M_\theta}\|_\infty \le d_{n\theta}\right\}.$$

Consider the set

$$\Theta = \left\{\theta : \sup_{\delta\in H_\theta}\left\|\phi_{M_\theta^c}(\delta)^T\left[E - P\{\phi_{M_\theta}(\delta)\}\right]\{\mu - \Phi(\theta)\}\right\|_\infty \le C_0\,n^{1-d_\theta}\log n,\ |M_\theta| \le s_0\right\},$$

for some constant $C_0$ and $s_0 \ll n$. We assume the set Θ to be nonempty and define the least false parameter as

$$\theta^* = \arg\min_{\theta\in\Theta}\ \sup_{\delta\in H_\theta}\left\|\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\phi_{M_\theta}^T(\delta)\{\mu - \Phi(\theta)\}\right\|_\infty.$$

In addition, we assume

$$\sup_{\delta\in H_{\theta^*}}\left\|\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\phi_{M_\theta}^T(\delta)(\mu - \Phi)\right\|_\infty = O(n^{-\gamma_0}\log n), \qquad (5.1)$$

for some $\gamma_0 \ge 0$. By its definition, θ* satisfies

$$\sup_{\delta\in H_{\theta^*}}\left\|\phi_{M_\theta^c}(\delta)^T\left[E - P\{\phi_{M_\theta}(\delta)\}\right](\mu - \Phi)\right\|_\infty = O(n^{1-d_\theta}\log n), \qquad (5.2)$$

and $|M_\theta| \le s_0$.

Remark 5.1. Conditions (5.1) and (5.2) are the key assumptions determining the degree of model misspecification. Condition (5.1) requires that the posited working model Φ provides a good approximation to μ: in that case, the residual μ − Φ is nearly orthogonal to the Jacobian matrix $\phi_{M_\theta}$ and the left-hand side of (5.1) is small. In general, our assumptions are weaker than the weak sparsity assumption imposed for the Lasso (Bunea, Tsybakov and Wegkamp, 2007), which requires the $L_2$ approximation error $\|\mu - \Phi\|_2$ to converge to 0 at a certain rate.

Condition 5.1. We assume the following conditions:

$$\sup_{\delta\in H_{\theta^*}}\left\|\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\right\|_\infty = O(b_\theta/n), \qquad (5.3)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|\phi_{M_\theta^c}(\delta)^T\phi_{M_\theta}(\delta)\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\right\|_\infty \le \min\left\{\frac{C\rho_3'(0+)}{\rho_3'(d_{n\theta})},\ O(n^{a_3})\right\}, \qquad (5.4)$$
$$\max_{1\le l\le q}\left\|\phi_l \circ (1 + |X\beta_0|)\right\|_2^2 = O(n), \qquad (5.5)$$
$$\max_{1\le l\le q}\sum_{k\in M_\theta}\sup_{\delta\in H_{\theta^*}}\left\|\frac{\partial\phi_l(\delta)}{\partial\theta_k}\circ(1 + |X\beta_0|)\right\|_2^2 = O(n^{\frac12+\gamma_\theta}\sqrt{s_\theta}\log n), \qquad (5.6)$$
$$\sup_{\delta_1\in H_{\theta^*}}\sup_{\delta_2\in H_{\theta^*}}\max_{1\le l\le q}\lambda_{\max}\left((|\phi_l(\delta_1)|)^T\frac{\partial\phi_{M_\theta}(\delta_2)}{\partial\theta_{M_\theta}}\right) = O(n), \qquad (5.7)$$

for some constants $0 \le a_3 \le 1/2$ and $0 \le \gamma_\theta \le \gamma_0$, where $s_\theta = |M_\theta|$. If the response is unbounded, we further require

$$\max_{1\le l\le q}\|\phi_l\|_\infty = o(n^{d_\theta}/\log n), \qquad (5.8)$$

and the right-hand side of (5.6) is modified to $O(n^{\frac12+\gamma_\theta}\log^2 n/\sqrt{s_\theta})$.

Remark 5.2. Conditions (5.6) and (5.7) put constraints on the derivatives of ϕ, requiring the misspecified function to be smooth. The right-hand side order in (5.6) is not too restrictive when $n^{\gamma_\theta} \ge \sqrt{s_\theta}\log n$.

Two common examples of the main-effect function Φ are provided below to examine the validity of Condition 5.1.

Example 1. Set Φ = 0; then no model needs to be fitted for the conditional mean. It is easy to check that Condition 5.1 is satisfied, since ϕ and its derivatives vanish.

Example 2. When a linear model is specified, i.e., $\Phi(x, \theta) = x^T\theta$, conditions (5.6) and (5.7) are automatically satisfied since the second-order derivatives of Φ vanish. In this example, θ* takes the form

$$\theta^*_{M_\theta} = (X_{M_\theta}^T X_{M_\theta})^{-1}X_{M_\theta}^T\mu,$$

and $\theta^*_{M_\theta^c} = 0$. Note that $\theta^*_{M_\theta}$ is the vector of regression coefficients of μ on $X_{M_\theta}$. Condition (5.1) holds automatically since

$$(X_{M_\theta}^T X_{M_\theta})^{-1}X_{M_\theta}^T(\mu - X\theta^*) = 0.$$

Condition (5.2) becomes

$$\left\|X_{M_\theta^c}^T\{E - P(X_{M_\theta})\}\mu\right\|_\infty = O(n^{1-d_\theta}\log n). \qquad (5.9)$$

Each element of the left-hand side vector in (5.9) can be viewed as the inner product of the residuals from regressing a noise variable in $X_{M_\theta^c}$ on $X_{M_\theta}$ with the residuals from regressing μ on $X_{M_\theta}$. When μ depends only on $X_{M_\theta}$, (5.9) holds for the Gaussian linear model.

5.2. The covariates

Condition 5.2. Assume that

$$\sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1}X_{M_\beta}^T W(\delta)\Delta X_{M_\alpha}B_{n\alpha}^{-1}\right\|_\infty = O(b_{\alpha\beta}/n), \qquad (5.10)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|X_{M_\beta^c}^T W_\beta W(\delta)X_{M_\alpha}B_{n\alpha}^{-1}\right\|_\infty = \min\left\{o\left(\frac{\lambda_{2n}\rho_2'(0+)}{\lambda_{1n}\rho_1'(d_{n\beta})}\right),\ O(n^{a_2})\right\}, \qquad (5.11)$$
$$\max_{1\le j\le p}\left\|W(\theta^*)x^j\right\|_2^2 = O(n), \qquad (5.12)$$
$$\max_{1\le j\le p}\sum_{k\in M_\alpha}\left\|x^k\circ x^j\circ(X\beta_0)\right\|_2 = O(n^{1/2+\gamma_\alpha}\log n), \qquad (5.13)$$
$$\max_{1\le j\le p}\sum_{k\in M_\beta}\left\|x^j\circ x^k\right\|_2 = O(n^{1/2+\gamma_\beta}\log n), \qquad (5.14)$$
$$\max_{1\le j\le p}\sum_{l\in M_\theta}\sup_{\delta\in H_{\theta^*}}\left\|x^j\circ\phi_l(\delta)\right\|_2 = O(n^{1/2+\gamma_\theta}\sqrt{s_\theta}\log^3 n), \qquad (5.15)$$
$$\sup_{\delta\in H_{\theta^*}}\max_{1\le j\le p}\lambda_{\max}\left[X_{M_\alpha}^T\mathrm{diag}(|W(\delta)x^j|)X_{M_\alpha}\right] = O(n), \qquad (5.16)$$
$$\max_{1\le j\le p}\lambda_{\max}\left[X_{M_\alpha}^T\mathrm{diag}(|x^j\circ(X\beta_0)|)X_{M_\alpha}\right] = O(n), \qquad (5.17)$$

for some constants $0 \le \gamma_\alpha, \gamma_\beta, a_2 \le 1/2$, where

$$W(\delta) = \mathrm{diag}[\mu - \Phi(\delta)],\quad B_{n\alpha} = X_{M_\alpha}^T\Delta X_{M_\alpha},\quad B_{n\beta} = X_{M_\beta}^T\Delta X_{M_\beta},$$
$$W_\beta = \Delta - \Delta^{1/2}P(\Delta^{1/2}X_{M_\beta})\Delta^{1/2},\quad \Delta = \mathrm{diag}\big(\pi(x_1)\{1-\pi(x_1)\}, \ldots, \pi(x_n)\{1-\pi(x_n)\}\big).$$

The sequence $b_{\alpha\beta}$ in (5.10) must satisfy

$$b_{\alpha\beta} = \min\left\{o\left(\frac{n^{\frac12-\gamma_\beta}}{\log n}\right),\ o\left(\frac{n^{2\gamma_\alpha-\gamma_\beta}}{\sqrt{s_\alpha}\log n}\right)\right\}.$$

Remark 5.3. Conditions (5.10) and (5.11) control the impact on β̂ of the deviation of the estimated propensity score from its true value, and thus are not needed when the propensity scores are known. By the definition of W(δ), the magnitudes of the left-hand sides of these two conditions depend on how well Φ approximates μ. The sequence $b_{\alpha\beta}$ in (5.10) can converge to 0 when $X_{M_\beta}$ and $X_{M_\alpha}$ are weakly correlated. Each element of the left-hand side of (5.11) is the multiple regression coefficient of the corresponding variable in $X_{M_\beta^c}$ on $W(\delta)X_{M_\alpha}$, using weighted least squares with weights π ○ (1 − π) after adjusting for $X_{M_\beta}$, which characterizes their weak dependence given $X_{M_\beta}$. These two conditions are generally weaker than those imposed by Fan and Lv (2011) (Condition 2), and are therefore more likely to hold.

Remark 5.4. The right-hand side of (5.15) can be relaxed to $O(n^{1/2+\gamma_\theta}\log n)$ when a linear model is used; the additional factor $\sqrt{s_\theta}$ is due to the penalty on the complexity of the main effect model. This condition typically controls the deviation

$$\left\|Z^T\{\Phi - \Phi(\hat\theta)\}\right\|_\infty = O_p(\sqrt{\log p}\,\log n), \qquad (5.18)$$

where Z = diag(A − π)X. A common approach to bounding such deviations is the classical Bernstein inequality. However, this approach does not work here, because the indexing parameter of the process Φ(·) in (5.18) is an estimator. To handle this challenge, we bound the left-hand side of (5.18) by

$$\sup_{\delta_1,\delta_2\in H_{\theta^*}}\left\|Z^T\{\Phi(\delta_1) - \Phi(\delta_2)\}\right\|_\infty.$$

A general theory covering the above result is provided in Proposition C.1 in the supplementary article.

Remark 5.5. Conditions (5.16) and (5.17) control the $L_\infty$ norm of the quadratic remainder of the Taylor expansion, viewed as a function of α̂ and expanded around $\alpha_0$. Similar to (5.10) and (5.11), these two conditions are not needed when $\alpha_0$ is known.

5.3. Weak oracle properties

Theorem 5.1 (Weak oracle property). Assume that Conditions B.1 and B.3 in the supplementary Appendix and Conditions 5.1 and 5.2 hold, and that $\max_i\|e_i\|_{\psi_1} < \infty$, where $e_i$ is the error for the ith patient in (2.1). Then there exist local minimizers α̂, θ̂ and β̂ of the loss functions (2.3), (2.4) and (2.5), respectively, such that with probability at least 1 − c/(n + p + q), for some positive constant c:

  1. $\hat\alpha_{M_\alpha^c} = 0$, $\hat\beta_{M_\beta^c} = 0$, $\hat\theta_{M_\theta^c} = 0$;

  2. $\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_\infty = O(n^{-\gamma_\alpha}\log n)$, $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_\infty = O(n^{-\gamma_\beta}\log n)$, $\|\hat\theta_{M_\theta} - \theta^*_{M_\theta}\|_\infty = O(n^{-\gamma_\theta}\log n)$.

Remark 5.6. In Theorem 5.1, part 1 corresponds to sparse recovery, while part 2 gives the estimators' convergence rates. The weak oracle property of α̂ follows directly from Theorem 2 in Fan and Lv (2011). However, proving this property for β̂ requires further effort to account for the variability due to plugging in θ̂ and α̂. The $L_\infty$ convergence rate of $\hat\alpha_{M_\alpha}$, as well as the nonsparsity size $s_\alpha$, plays an important role in determining how fast $\hat\beta_{M_\beta}$ converges.

Remark 5.7. The convergence rate of θ̂ does not affect that of β̂. This is because we require the posited propensity score model to be correct, so the estimation of β is robust against misspecification of the main effect model parametrized by θ. The simulation results also validate this theoretical finding.

6. Oracle properties

In this section we study the oracle property of the estimator β̂. We assume that $\max(s_\alpha, s_\beta) \ll \sqrt{n}$ and $n^{\gamma_\theta} \ge \sqrt{s_\theta}\log n$. The convergence rates of the estimators are established in Section 6.1 and their asymptotic distributions are provided in Section 6.2.

6.1. Rates of convergence

Condition 6.1. In addition to (5.16) and (5.17) in Condition 5.2, assume that the right-hand side of (5.15) is strengthened to $O(n^{\frac12+\gamma_\theta}\log^3 n/\sqrt{s_\theta})$, and that the following conditions hold:

$$\sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1/2}X_{M_\beta}^T W(\delta)\Delta X_{M_\alpha}B_{n\alpha}^{-1/2}\right\|_2 = O(1), \qquad (6.1)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|X_{M_\beta^c}^T W_\beta W(\delta)X_{M_\alpha}\right\|_{2,\infty} = O(n), \qquad (6.2)$$
$$\max_{1\le j\le p}\max_{k\in M_\beta}\left\|x^j\circ x^k\right\|_2^2 = O(n), \qquad (6.3)$$
$$\max_{1\le j\le p}\max_{k\in M_\alpha}\left\|x^j\circ x^k\circ(X\beta_0)\right\|_2^2 = O(n), \qquad (6.4)$$
$$\mathrm{tr}\left[X_{M_\beta}^T W(\theta^*)\Delta W(\theta^*)X_{M_\beta}\right] = O(s_\beta n). \qquad (6.5)$$

Remark 6.1. Similar to the interpretation of (5.10) and (5.11), condition (6.1) corresponds to a notion of weak dependence between the variables in $X_{M_\alpha}$ and $X_{M_\beta}$, while (6.2) requires that $X_{M_\beta^c}$ and $X_{M_\alpha}$ be weakly correlated after adjusting for $X_{M_\beta}$. Besides, it can be verified that (6.3)-(6.5) hold with high probability when the baseline covariates possess sub-Gaussian tails.

Theorem 6.1. Assume that Conditions 2.1, 5.1 and 6.1 and Conditions B.2 and B.4 in the supplementary Appendix hold, and that $\max_i\|e_i\|_{\psi_1} < \infty$. The constraints on $b_\theta$, $d_\theta$, $d_{n\theta}$ and $\lambda_{3n}$ are the same as in Theorem 5.1. Further assume $\max(l_1, l_2) < \frac12$ with $s_\alpha = O(n^{l_1})$ and $s_\beta = O(n^{l_2})$, and $n^{\gamma_\theta} \ge \sqrt{s_\theta}\log n$. Then there exist a strict local minimizer β̂ of the loss function (2.5) and α̂ of (2.3) such that $\hat\alpha_{M_\alpha^c} = 0$ and $\hat\beta_{M_\beta^c} = 0$ with probability tending to 1 as n → ∞, and $\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2 = O(\sqrt{s_\alpha}\,n^{-1/2})$, $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_2 = O(\sqrt{s_\alpha + s_\beta}\,n^{-1/2})$.

Remark 6.2. We note that when establishing the oracle property of β̂, only the weak oracle property of θ̂ is required. This is due to the robustness of the A-learning methods and the fact that the propensity score is correctly specified.

Remark 6.3. The precision of $\hat\beta_{M_\beta}$ is affected by that of $\hat\alpha_{M_\alpha}$, since $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_2$ is at least of the same order of magnitude as $\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2$. When the propensity score is known, the convergence rate of $\hat\beta_{M_\beta}$ improves to $\sqrt{s_\beta/n}$.

6.2. Asymptotic distributions

We define $\Sigma_{12}$ and $\Sigma_{22}$ as

$$\Sigma_{12} = 2B_{n\alpha}^{-1/2}X_{M_\alpha}^T\Delta W X_{M_\beta}B_{n\beta}^{-1/2},$$
$$\Sigma_{22} = B_{n\beta}^{-1/2}X_{M_\beta}^T W\Delta^{1/2}\left\{E - P(\Delta^{1/2}X_{M_\alpha})\right\}\Delta^{1/2}W X_{M_\beta}B_{n\beta}^{-1/2},$$

where W is shorthand for W(θ*).

To establish the weak convergence of the estimators, we introduce the following conditions.

Condition 6.2. Assume that

$$\lambda_{1n}\bar\rho_1(d_{n\alpha}) = o(s_\alpha^{-1/2}n^{-1/2}),\qquad \lambda_{2n}\bar\rho_2(d_{n\beta}) = o(s_\beta^{-1/2}n^{-1/2}), \qquad (6.6)$$
$$\sum_{i=1}^n\left(x_{M_\alpha i}^T B_{n\alpha}^{-1}x_{M_\alpha i}\right)^{3/2} \to 0,\qquad \sum_{i=1}^n\left(x_{M_\beta i}^T B_{n\beta}^{-1}x_{M_\beta i}\right)^{3/2} \to 0, \qquad (6.7)$$
$$\sum_{i=1}^n\left(x_{M_\beta i}^T B_{n\beta}^{-1}x_{M_\beta i}\right)^{3/2}|\mu_i - \Phi_i|^3 \to 0, \qquad (6.8)$$
$$\lambda_{\max}\left(B_{n\beta}^{-1/2}X_{M_\beta}^T W^2 X_{M_\beta}B_{n\beta}^{-1/2}\right) = O(1), \qquad (6.9)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1/2}X_{M_\beta}^T\mathrm{diag}[\Phi - \Phi(\delta)]\,\Delta X_{M_\alpha}B_{n\alpha}^{-1/2}\right\|_2 = o(1), \qquad (6.10)$$

where $x_{M_\alpha i}$ and $x_{M_\beta i}$ denote the ith rows of $X_{M_\alpha}$ and $X_{M_\beta}$, respectively.

Remark 6.4. Conditions (6.7) and (6.8) are Lyapunov-type conditions which guarantee the asymptotic normality of $\hat\alpha_{M_\alpha}$ and $\hat\beta_{M_\beta}$. Condition (6.9) constrains the maximum eigenvalue of the variance-covariance matrix of $X_{M_\beta}^T\mathrm{diag}(A - \pi)(\mu - \Phi)$ by requiring it to be finite. Condition (6.10) holds when Φ(δ) converges to Φ uniformly in the $L_\infty$ norm for δ in the region $H_{\theta^*}$. When $\|\mu - \Phi\|_\infty$ is bounded, (6.8) and (6.9) are simultaneously satisfied.

Theorem 6.2 (Oracle property). Under the conditions of Theorem 6.1 and Condition 6.2, assume $\max(s_\alpha, s_\beta) = o(n^{1/3})$ and that the right-hand side of (5.15) is strengthened to $O(n^{\frac12+\gamma_\theta}\log^3 n/\sqrt{s_\beta s_\theta})$. Then with probability tending to 1 as n → ∞, $\hat\alpha = (\hat\alpha_{M_\alpha}^T, \hat\alpha_2^T)^T$ and $\hat\beta = (\hat\beta_{M_\beta}^T, \hat\beta_2^T)^T$ in Theorem 6.1 must satisfy:

  1. $\hat\alpha_2 = 0$, $\hat\beta_2 = 0$;

  2. $\left[\{A_{1n}B_{n\alpha}^{1/2}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\}^T, \{A_{2n}B_{n\beta}^{1/2}(\hat\beta_{M_\beta} - \beta_{0M_\beta})\}^T\right]^T$ is asymptotically normally distributed with mean 0 and covariance matrix Ω, which is the limit of

$$\begin{pmatrix} A_{1n}A_{1n}^T & A_{1n}\Sigma_{12}A_{2n}^T \\ A_{2n}\Sigma_{21}A_{1n}^T & \sigma^2 A_{2n}A_{2n}^T + A_{2n}\Sigma_{22}A_{2n}^T \end{pmatrix},$$

where $\Sigma_{21} = \Sigma_{12}^T$, $A_{1n}$ is a $q_1 \times s_\alpha$ matrix and $A_{2n}$ is a $q_2 \times s_\beta$ matrix such that

$$\lambda_{\max}(A_{1n}A_{1n}^T) = O(1),\qquad \lambda_{\max}(A_{2n}A_{2n}^T) = O(1).$$

We note that the smoothness condition (5.15) on the misspecified function is strengthened here. To better understand the above theorem, we provide the following two corollaries. The first gives the limiting distribution when both the propensity score and the main-effect model are correctly specified, while the second corresponds to the case where the propensity score is known in advance.

Corollary 6.1. Under the conditions of Theorem 6.2, when the main-effect model is correctly specified, i.e., μ = Φ, $A_{1n}B_{n\alpha}^{1/2}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})$ and $A_{2n}B_{n\beta}^{1/2}(\hat\beta_{M_\beta} - \beta_{0M_\beta})$ are jointly asymptotically normally distributed, with covariance matrix Ω′, the limit of

$$\begin{pmatrix} A_{1n}A_{1n}^T & O \\ O & \sigma^2 A_{2n}A_{2n}^T \end{pmatrix}.$$

Remark 6.5. Comparing the results of Corollary 6.1 and Theorem 6.2, the term $A_{2n}\Sigma_{22}A_{2n}^T$ accounts for the partial specification of model (2.1). In the extreme case where Φ is correctly specified, $\hat\beta_{M_\beta}$ achieves its minimum variance and is asymptotically independent of $\hat\alpha_{M_\alpha}$. In general, we gain efficiency by positing a good working model for Φ. Numerical studies also suggest that a linear working model, Φ = Xθ, is preferable to the constant model. This is in line with our theoretical justification, since W is a diagonal matrix with ith diagonal element $\mu_i - \Phi_i$.

Corollary 6.2. When the propensity score is known, under the conditions of Theorem 6.2 with all α̂'s replaced by $\alpha_0$, with probability tending to 1 as n → ∞, $A_{2n}B_{n\beta}^{1/2}(\hat\beta_{M_\beta} - \beta_{0M_\beta})$ is asymptotically normally distributed with mean 0 and covariance matrix Ω″, which is the limit of

$$\sigma^2 A_{2n}A_{2n}^T + A_{2n}\Sigma_{22}^* A_{2n}^T,$$

where

$$\Sigma_{22}^* = B_{n\beta}^{-1/2}X_{M_\beta}^T W\Delta W X_{M_\beta}B_{n\beta}^{-1/2}.$$

Remark 6.6. An interesting fact implied by Corollary 6.2 is that the asymptotic variance of $\hat\beta_{M_\beta}$ with the estimated propensity score is smaller than that of the same estimator had we known the propensity score in advance. A similar result holds for the asymptotic distribution of the estimated value function in the next section. This is in line with semiparametric theory in the fixed-p case, where the variance of the augmented-IPSW estimator is smaller when we estimate the parameters in the coarsening probability model, even if the true values are known (see Chapter 9 of Tsiatis, 2006). By estimating the propensity score, we effectively borrow information from the linear association between the covariates in $WX_{M_\beta}$ and those in $X_{M_\alpha}$.

7. Evaluation of value function

In this section, we derive a nonparametric estimator of the mean response under the optimal treatment regime. Based on (2.1), define the average population-level response under the regime indexed by β as

$$V_n(\beta) = \frac{1}{n}\sum_{i=1}^n E\left[Y_i\,\big|\,A_i = I(x_i^T\beta > 0), X_i = x_i\right] = \frac{1}{n}\sum_{i=1}^n\left[h_0(x_i) + x_i^T\beta_0\,I(x_i^T\beta > 0)\right],$$

where the treatment decision for the ith patient is $I(x_i^T\beta > 0)$. The mean response under the true optimal regime is denoted by $V_n(\beta_0)$, and it is easy to verify that $\beta_0$ maximizes the function $V_n$.

Similarly to Murphy (2003), we propose to estimate $V_n(\beta_0)$ by

$$\hat{V}_n = \frac{1}{n}\sum_{i=1}^n\left[Y_i + x_i^T\hat\beta\,\{I(x_i^T\hat\beta > 0) - A_i\}\right]. \qquad (7.1)$$
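In practice (7.1) is a one-line computation; a minimal sketch in R, assuming Y, A, the design matrix Xt = cbind(1, X) containing the constant one, and beta_hat from the second step (our names for these objects):

    contrast <- Xt %*% beta_hat              # x_i' beta_hat
    d_hat    <- as.integer(contrast > 0)     # estimated optimal decisions
    V_hat    <- mean(Y + contrast * (d_hat - A))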

This estimator is not doubly robust, but it offers protection against misspecification of the baseline function as well as improved efficiency. It is not doubly robust because we require the propensity score model to be correctly specified to ensure the oracle property of β̂. A key condition guaranteeing the asymptotic normality of (7.1) is given as follows.

Condition 7.1. Assume there exists some constant C′ such that, for all ε > 0,

$$\frac{1}{n}\sum_i I\left(|x_i^T\beta_0| < \varepsilon\right) \le C'\varepsilon.$$

Remark 7.1. The above condition has a similar interpretation to Condition (3.3) in Qian and Murphy (2011), where a random design was utilized. Condition 7.1 requires that the contrast function $x_i^T\beta_0$ not be too close to zero for too large a fraction of subjects; together with the condition $s_\beta = o(n^{1/4})$, it ensures the following stochastic approximation condition:

$$\frac{1}{\sqrt{n}}\sum_i x_i^T\hat\beta\,\{I(x_i^T\hat\beta > 0) - I(x_i^T\beta_0 > 0)\} = o_p(1). \qquad (7.2)$$

Theorem 7.1. Assume that the conditions of Theorem 6.2 hold. If Condition 7.1 holds and the nonsparsity size satisfies $s_\beta = o(n^{1/4})$, then with probability going to 1, $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\nu_0^2$, which is the limit of

$$\sigma^2 + \sigma^2 v_n^T X_{M_\beta}B_{n\beta}^{-1}X_{M_\beta}^T v_n + v_n^T X_{M_\beta}B_{n\beta}^{-1/2}\Sigma_{22}B_{n\beta}^{-1/2}X_{M_\beta}^T v_n, \qquad (7.3)$$

where $v_n = \left[I(x_1^T\beta_0 > 0) - \pi(x_1), \ldots, I(x_n^T\beta_0 > 0) - \pi(x_n)\right]^T/\sqrt{n}$ and $\Sigma_{22}$ is defined in Section 6.2.

Remark 7.2. Note that we only need $s_\beta = o(\sqrt{n})$ to guarantee the weak oracle property of β̂ and the $O(\sqrt{s_\beta/n})$ convergence rate of $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_2$. This is strengthened to $s_\beta = o(n^{1/3})$ to show the asymptotic normality of $\hat\beta_{M_\beta}$. Theorem 7.1 further requires $s_\beta = o(n^{1/4})$ to ensure the approximation condition (7.2).

Remark 7.3. When (7.2) is satisfied, the asymptotic normality of $\hat{V}_n$ follows immediately from the oracle property of the estimator $\hat\beta_{M_\beta}$. The first term $\sigma^2$ in (7.3) is due to the variation of the error terms $e_i$, while the last two terms correspond to the asymptotic variance of $\hat\beta_{M_\beta}$.

We next provide a corollary corresponding to the case where the main-effect model is correctly specified.

Corollary 7.1. In addition to the conditions of Theorem 7.1, if the main-effect model is correct, then $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\nu_1^2$, defined as the limit of

$$\sigma^2 + \sigma^2 v_n^T X_{M_\beta}B_{n\beta}^{-1}X_{M_\beta}^T v_n,$$

where υn is defined in Theorem 7.1.

Similar to the asymptotic distribution of $\hat\beta_{M_\beta}$, the following corollary shows that the proposed estimator is more efficient when the propensity score is estimated by fitting a penalized logistic regression than when the known propensity score is plugged in.

Corollary 7.2. Assume the propensity score is known and the conditions of Theorem 7.1 hold with all α̂'s replaced by $\alpha_0$. Then with probability going to 1, $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\nu_2^2$, which is the limit of

$$\sigma^2 + \sigma^2 v_n^T X_{M_\beta}B_{n\beta}^{-1}X_{M_\beta}^T v_n + v_n^T X_{M_\beta}B_{n\beta}^{-1/2}\Sigma_{22}^* B_{n\beta}^{-1/2}X_{M_\beta}^T v_n,$$

with $v_n$ defined in Theorem 7.1 and $\Sigma_{22}^*$ defined in Corollary 6.2.

By the definition of $v_n$ and the condition $\lambda_{\max}(X_{M_\beta}^T X_{M_\beta}) = O(n)$, the asymptotic variance reaches its minimum when $I(x_i^T\beta_0 > 0)$ is close to the propensity score. We formalize this result in the following corollary.

Corollary 7.3. Under the conditions of Theorem 7.1, if we further assume that

$$\frac{1}{n}\sum_{i=1}^n\left\{I(x_i^T\beta_0 > 0) - \pi(x_i)\right\}^2 = o(1),$$

then with probability going to 1, $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\sigma^2$.

Remark 7.4. Such a result is expected from the following intuition: in an observational study, if the clinician or decision maker has a high chance of assigning the optimal treatment to each individual patient, i.e., the propensity score is close to $I(x_i^T\beta_0 > 0)$, the variation in estimating the value function decreases. In other words, the more skillful the clinician or decision maker is, the closer the observed individual response $Y_i$ is to the potential outcome under the optimal treatment regime.

8. Conclusion

In this article, we propose a two-step procedure for estimating the optimal treatment strategy that selects variables and estimates parameters simultaneously in both the propensity score and outcome regression models using penalized regression. Our methodology can handle data sets whose dimensionality grows exponentially fast with the sample size. Oracle properties of the estimators are established. Variable selection is also carried out in the misspecified working model, and new mathematical techniques are developed to study the estimators' properties in a general optimization framework. The estimator is shown to be more efficient when the misspecified working model is "closer" to the conditional mean of the response, although our approach does not require correct specification of the baseline function. Numerical results demonstrate that the proposed estimator enjoys model selection consistency and has overall satisfactory performance.

In the case where there are multiple local solutions of the objective functions (2.3), (2.4) or (2.5), although our asymptotic theory only guarantees the existence of a local minimizer possessing the oracle property, it is worth mentioning that the desired oracle estimator can actually be identified using existing algorithms (see Fan, Xue and Zou, 2014; Wang, Kim and Li, 2013). Theoretical properties can be established in a similar fashion.

The proposed method requires the propensity score model to be correctly specified. In randomized studies, the propensity score is known in advance, so the assumption is automatically satisfied. However, for observational studies there is no such guarantee. In practice, prior information on the treatment decision mechanism used by physicians may help in building a reasonable propensity score model. In addition, model diagnostic tests can be used to check the goodness-of-fit of the posited propensity score model, such as a logistic regression model. In general, this may be easier than checking the goodness-of-fit of the regression model for the response. Moreover, in our current work we assume the design matrix X to be deterministic, mainly for technical convenience. To the best of our knowledge, penalized regression with folded concave penalties has not been studied in random design settings with NP dimensionality. To consider random designs, we would need to impose tail conditions on X, and the derivation of some technical results would need to be modified. This is beyond the scope of the current paper and will be investigated elsewhere.

The current framework focuses on point treatment studies. It would be interesting and practically useful to extend our results to dynamic treatment regimes, although significant effort is needed to handle model misspecification over multiple stages. This is an interesting research topic that needs further investigation.


Appendix

Here, we only give the proof of Theorem 5.1. More technical conditions and proofs for Theorems 6.1, 6.2 and 7.1 are given in the supplementary Appendix. To establish Theorem 5.1, we need the following lemmas. The proofs of these lemmas are also given in the supplementary Appendix.

Lemma 1. Let $z = (z_1, \ldots, z_n)^T$ be an n-dimensional vector of independent random variables with mean 0, and let $a \in \mathbb{R}^n$.

  1. If $z_1, \ldots, z_n$ are bounded in [c, d], then for any ε ∈ (0, ∞),

$$\Pr(|a^T z| > \varepsilon) \le 2\exp\left(-\frac{\varepsilon^2}{2\|a\|_2^2(d - c)^2}\right).$$

  2. If $z_1, \ldots, z_n$ satisfy $\max_i\|z_i\|_{\psi_1} \le \omega$, then for any ε ∈ (0, ∞),

$$\Pr(|a^T z| > \varepsilon) \le 2\exp\left(-\frac{1}{2}\cdot\frac{\varepsilon^2}{2\|a\|_2^2\omega^2 + \|a\|_\infty\,\varepsilon\,\omega}\right).$$

Lemma 2. Define $\mathcal{E} = \cap_{k=1}^{16}\mathcal{E}_k$, where the events $\mathcal{E}_k$ are defined in Appendix G of the supplementary article. Under the conditions of Theorem 5.1, we have $\Pr(\mathcal{E}) \ge 1 - c/(n + p + q)$ for some c > 0.

Notation. Let $Z = \mathrm{diag}(A - \pi)X$, $\hat{Z} = \mathrm{diag}(A - \hat\pi)X$, and

$$\xi_1 = \hat{Z}^T e,\quad \xi_2 = Z^T(\mu - \Phi),\quad \xi_3 = \phi^T(e + Z\beta_0),\quad \xi_4 = Z^T\mathrm{diag}(X\beta_0)\Delta X_{M_\alpha},$$
$$\xi_5 = X^T\left[\mathrm{diag}\{(A - \pi)\circ(A - \pi)\} - \Delta\right]X_{M_\beta},\quad \xi_6(\delta) = Z^T\{\Phi - \Phi(\delta)\},\quad \xi_7(\delta) = \{\phi(\delta) - \phi\}^T(e + Z\beta_0),$$

where $\pi = (\pi(x_1), \ldots, \pi(x_n))^T$. For a given matrix Ψ, the superscript $\Psi^j$ refers to the jth column of Ψ, while the subscript $\Psi_i$ denotes the ith row of Ψ. For convenience, we write Φ(θ) and ϕ(θ) with $\theta = (\theta_{M_\theta}^T, 0^T)^T$ as $\Phi(\theta_{M_\theta})$ and $\phi(\theta_{M_\theta})$.

Proof of Theorem 5.1

We break the proof into three steps. Based on Theorem 1 in Fan and Lv (2011), it suffices to prove the existence of $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$ inside the hypercube

$$\aleph = \left\{(\delta_\beta^T, \delta_\theta^T)^T : \|\delta_\beta - \beta_{0M_\beta}\|_\infty \le n^{-\gamma_\beta}\log n,\ \|\delta_\theta - \theta^*_{M_\theta}\|_\infty \le K n^{-\gamma_\theta}\log n\right\}$$

with K a large constant, conditional on the event $\mathcal{E}$, satisfying

$$\hat{Z}_{M_\beta}^T\{Y - \Phi(\hat\theta) - \hat{Z}\hat\beta\} = n\lambda_{2n}\bar\rho_2(\hat\beta_{M_\beta}), \qquad (A.1)$$
$$\hat\phi_{M_\theta}^T\{Y - \Phi(\hat\theta)\} = n\lambda_{3n}\bar\rho_3(\hat\theta_{M_\theta}), \qquad (A.2)$$
$$\left\|\hat{Z}_{M_\beta^c}^T\{Y - \Phi(\hat\theta) - \hat{Z}\hat\beta\}\right\|_\infty < n\lambda_{2n}\rho_2'(0+), \qquad (A.3)$$
$$\left\|\hat\phi_{M_\theta^c}^T\{Y - \Phi(\hat\theta)\}\right\|_\infty < n\lambda_{3n}\rho_3'(0+), \qquad (A.4)$$
$$\lambda_{\min}(\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}) > n\lambda_{2n}\kappa(\rho_2, \hat\beta_{M_\beta}), \qquad (A.5)$$
$$\lambda_{\min}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta}) > n\lambda_{3n}\kappa(\rho_3, \hat\theta_{M_\theta}). \qquad (A.6)$$

Step 1. We first show the existence of a solution to equations (A.1) and (A.2) inside ℵ for sufficiently large n. For any $\delta = (\delta_1, \ldots, \delta_{s_\beta+s_\theta})^T \in \aleph$, since $d_{n\beta} \gg n^{-\gamma_\beta}\log n$ and $d_{n\theta} \gg n^{-\gamma_\theta}\log n$, we have

$$\min_{1\le j\le s_\beta}|\delta_j| \ge \min_j|\beta_{0j}| - d_{n\beta} = d_{n\beta},\qquad \min_{1\le j\le s_\theta}|\delta_{j+s_\beta}| \ge \min_j|\theta^*_j| - d_{n\theta} = d_{n\theta},$$

and $\mathrm{sgn}(\delta_\beta) = \mathrm{sgn}(\beta_{0M_\beta})$, $\mathrm{sgn}(\delta_\theta) = \mathrm{sgn}(\theta^*_{M_\theta})$. The monotonicity of $\rho_2'(t)$ and $\rho_3'(t)$ gives

$$\left\|n\lambda_{2n}\bar\rho_2(\delta)\right\|_\infty \le n\lambda_{2n}\rho_2'(d_{n\beta}),\qquad \left\|n\lambda_{3n}\bar\rho_3(\delta)\right\|_\infty \le n\lambda_{3n}\rho_3'(d_{n\theta}). \qquad (A.7)$$

We write the left-hand side of (A.1), evaluated at $(\delta_\beta, \delta_\theta)$, as

$$\hat{Z}_{M_\beta}^T\{Y - \Phi(\delta_\theta) - \hat{Z}_{M_\beta}\delta_\beta\} = \xi_{1M_\beta} + \xi_{2M_\beta} + (\hat{Z}_{M_\beta} - Z_{M_\beta})^T\{\mu - \Phi(\delta_\theta)\} + \hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}(\beta_{0M_\beta} - \delta_\beta) - \hat{Z}_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta} + Z_{M_\beta}^T\{\Phi - \Phi(\delta_\theta)\} \equiv I_1 + I_2 + I_3 + I_4 + I_5 + I_6. \qquad (A.8)$$

On the set $\mathcal{E}_3\cap\mathcal{E}_5\cap\mathcal{E}_{13}$, we have

$$\|I_1 + I_2 + I_6\|_\infty = O(\sqrt{n}\log n). \qquad (A.9)$$

Define

$$\eta_1 = (\hat{Z} - Z)^T\{\mu - \Phi(\delta_\theta)\},\qquad \eta_2 = (\hat{Z} - Z)^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}.$$

Note that $\eta_{1M_\beta} = I_3$ in (A.8), which we represent using a second-order Taylor expansion around $\alpha_{0M_\alpha}$:

$$I_3 = X_{M_\beta}^T W(\delta_\theta)\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha}) + \frac{1}{2}r_{I_3}, \qquad (A.10)$$

where $r_{I_3}$ in (A.10) is the second-order remainder, whose jth component is

$$(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T W(\delta_\theta)\Sigma(\tilde\alpha)\,\mathrm{diag}(x^j)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}),$$

where $\Sigma(\tilde\alpha)$ is a diagonal matrix with ith diagonal element $\pi''(x_i^T\tilde\alpha)$ and $\tilde\alpha$ lies on the line segment between $\hat\alpha_{M_\alpha}$ and $\alpha_{0M_\alpha}$. Since π″(·) is a bounded function, we can bound $\|r_{I_3}\|_\infty$ by a constant multiple of

$$\max_j\ (\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|W(\delta_\theta)x^j|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}), \qquad (A.11)$$

whose order of magnitude is $O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n)$ by (5.16).

We decompose $-I_5$ in (A.8) as $\eta_{2M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}$. Using similar arguments, on the set $\mathcal{E}_9$, it follows from (5.17) that

$$\left\|Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}\right\|_\infty \le \max_j(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|x^j\circ X\beta_0|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}) + \|\xi_{4M_\beta}\|_\infty = O(\sqrt{n}\log n + s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.12)$$

Using a Taylor expansion, it is immediate that

$$\|\eta_{2M_\beta}\|_\infty \le \max_j(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|x^j\circ X\beta_0|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}) = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n), \qquad (A.13)$$

by (5.17). Combining (A.12) and (A.13) gives

$$\left\|\hat{Z}_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}\right\|_\infty = O(\sqrt{n}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.14)$$

So far, we have

$$\left\|I_1 + I_2 + I_3 + I_5 + I_6 - X_{M_\beta}^T W(\delta_\theta)\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha})\right\|_\infty = O(\sqrt{n}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n), \qquad (A.15)$$

by (A.9), (A.10), (A.11) and (A.14). Now we approximate $I_4$ by $X_{M_\beta}^T\Delta X_{M_\beta}(\beta_{0M_\beta} - \delta_\beta)$ and bound the error $\|\omega_{M_\beta}\|_\infty$, where $\omega = (\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta} - X_{M_\beta}^T\Delta X_{M_\beta})(\beta_{0M_\beta} - \delta_\beta)$. We decompose it as

$$\omega_{M_\beta} = \hat{Z}_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})(\beta_{0M_\beta} - \delta_\beta) + (\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta}(\beta_{0M_\beta} - \delta_\beta) + (Z_{M_\beta}^T Z_{M_\beta} - X_{M_\beta}^T\Delta X_{M_\beta})(\beta_{0M_\beta} - \delta_\beta) \equiv \omega_{1M_\beta} + \omega_{2M_\beta} + \xi_{5M_\beta}(\beta_{0M_\beta} - \delta_\beta). \qquad (A.16)$$

It follows from a first-order Taylor expansion that the jth element of $\omega_{1M_\beta}$ can be represented as

$$\left[(A - \hat\pi)\circ x^j\circ\{\Delta(\tilde\alpha_{M_\alpha})X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\}\right]^T X_{M_\beta}(\beta_{0M_\beta} - \delta_\beta), \qquad (A.17)$$

where $\Delta(\tilde\alpha_{M_\alpha})$ is a diagonal matrix with ith diagonal element $\pi(x_i, \tilde\alpha_{M_\alpha})\{1 - \pi(x_i, \tilde\alpha_{M_\alpha})\}$, and $\tilde\alpha_{M_\alpha}$ lies on the line segment between $\hat\alpha_{M_\alpha}$ and $\alpha_{0M_\alpha}$. We decompose $x^j$ as the Hadamard product of two vectors, $\bar{x}^j\circ\check{x}^j$, where

$$\bar{x}^j = \left(\sqrt{|x_{1j}|}, \ldots, \sqrt{|x_{nj}|}\right)^T,\qquad \check{x}^j = \left(\mathrm{sgn}(x_{1j})\sqrt{|x_{1j}|}, \ldots, \mathrm{sgn}(x_{nj})\sqrt{|x_{nj}|}\right)^T.$$

Let $\varphi = (A - \hat\pi)\circ\check{x}^j\circ\{\Delta(\tilde\alpha_{M_\alpha})X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\}$; then the quantity in (A.17) is bounded in absolute value by $\|\varphi^T\mathrm{diag}(\bar{x}^j)X_{M_\beta}\|_2\|\beta_{0M_\beta} - \delta_\beta\|_2$, with

$$\left\|\varphi^T\mathrm{diag}(\bar{x}^j)X_{M_\beta}\right\|_2^2 = \varphi^T\mathrm{diag}(\bar{x}^j)X_{M_\beta}X_{M_\beta}^T\mathrm{diag}(\bar{x}^j)\varphi \le \lambda_{\max}\left(X_{M_\beta}^T\mathrm{diag}(|x^j|)X_{M_\beta}\right)\|\varphi\|_2^2. \qquad (A.18)$$

Since $\|A - \hat\pi\|_\infty \le 1$ and the elements of $\Delta(\tilde\alpha_{M_\alpha})$ are bounded, we have

$$\|\varphi\|_2^2 \le \left\|\mathrm{diag}(\check{x}^j)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\right\|_2^2 \le \lambda_{\max}\left\{X_{M_\alpha}^T\mathrm{diag}(|x^j|)X_{M_\alpha}\right\}\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2^2. \qquad (A.19)$$

Combining (A.18) with (A.19) gives

$$\|\omega_{1M_\beta}\|_\infty^2 \le \max_{1\le j\le p}\lambda_{\max}\left\{X_{M_\alpha}^T\mathrm{diag}(|x^j|)X_{M_\alpha}\right\}\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2^2\ \max_{1\le j\le p}\lambda_{\max}\left\{X_{M_\beta}^T\mathrm{diag}(|x^j|)X_{M_\beta}\right\}\|\delta_\beta - \beta_{0M_\beta}\|_2^2, \qquad (A.20)$$

which yields $\|\omega_{1M_\beta}\|_\infty = O(\sqrt{s_\alpha s_\beta}\,n^{1-\gamma_\alpha-\gamma_\beta}\log^2 n)$ by (B.4) and (B.5).

By the same argument, $\|\omega_{2M_\beta}\|_\infty$ is of the same order. Note that on the set $\mathcal{E}_{11}$,

$$\left\|\xi_{5M_\beta}(\delta_\beta - \beta_{0M_\beta})\right\|_\infty \le \|\xi_{5M_\beta}\|_\infty\|\delta_\beta - \beta_{0M_\beta}\|_\infty = O(s_\beta n^{1-2\gamma_\beta}\log^2 n);$$

these, together with (A.20), yield

$$\|\omega_{M_\beta}\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n). \qquad (A.21)$$

Define the vector-valued function

$$\Psi_1(\delta_\beta, \delta_\theta) = -B_{n\beta}^{-1}\left[\hat{Z}_{M_\beta}^T\{Y - \Phi(\delta_\theta) - \hat{Z}_{M_\beta}\delta_\beta\} - n\lambda_{2n}\bar\rho_2(\delta_\beta)\right] = -B_{n\beta}^{-1}\{I_1 + \cdots + I_6 - n\lambda_{2n}\bar\rho_2(\delta_\beta)\} = \delta_\beta - \beta_{0M_\beta} - B_{n\beta}^{-1}\{I_1 + I_2 + I_3 + \omega_{M_\beta} + I_5 + I_6 - n\lambda_{2n}\bar\rho_2(\delta_\beta)\} \equiv \delta_\beta - \beta_{0M_\beta} + u_\beta; \qquad (A.22)$$

then equation (A.1) is equivalent to $\Psi_1(\delta_\beta, \delta_\theta) = 0$. It follows from (A.7), (A.15) and (A.21) that

$$\|u_\beta\|_\infty \le \sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1}X_{M_\beta}^T W(\delta)\Delta X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\right\|_\infty + \left\|B_{n\beta}^{-1}\right\|_\infty\left\{O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n) + O(\sqrt{n}\log n) + n\lambda_{2n}\rho_2'(d_{n\beta})\right\}.$$

By arguments similar to those in the proof of Theorem 2 in Fan and Lv (2011), we have

$$\left\|B_{n\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\right\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(\sqrt{n}\log n) + n\lambda_{1n}\rho_1'(d_{n\alpha}), \qquad (A.23)$$

on the set $\mathcal{E}_1\cap\mathcal{E}_2$. Thus by (5.10), (B.1), (B.14) and (B.15), we have

$$\|u_\beta\|_\infty \le O\left[b_{\alpha\beta}\left\{s_\alpha n^{-2\gamma_\alpha}\log^2 n + n^{-1/2}\log n + \lambda_{1n}\rho_1'(d_{n\alpha})\right\}\right] + O\left[b_\beta\left\{s_\alpha n^{-2\gamma_\alpha}\log^2 n + s_\beta n^{-2\gamma_\beta}\log^2 n + n^{-1/2}\log n + \lambda_{2n}\rho_2'(d_{n\beta})\right\}\right].$$

Therefore, by (A.20), for sufficiently large n, if $(\delta_\beta - \beta_{0M_\beta})_j = n^{-\gamma_\beta}\log n$, then

$$\Psi_{1j}(\delta_\beta, \delta_\theta) > 0, \qquad (A.24)$$

and if $(\delta_\beta - \beta_{0M_\beta})_j = -n^{-\gamma_\beta}\log n$, then

$$\Psi_{1j}(\delta_\beta, \delta_\theta) < 0. \qquad (A.25)$$

Similarly, we write the left-hand side of (A.2), evaluated at $\delta_\theta$, as

$$\{\phi_{M_\theta}(\delta_\theta) - \phi_{M_\theta}\}^T(e + Z\beta_0) + \xi_{3M_\theta} + \phi_{M_\theta}(\delta_\theta)^T(\mu - \Phi) - \phi_{M_\theta}(\delta_\theta)^T\{\Phi(\delta_\theta) - \Phi\}. \qquad (A.26)$$

It is immediate that

$$\|\xi_{3M_\theta}\|_\infty = O(\sqrt{n}\log n) \qquad (A.27)$$

on the set $\mathcal{E}_5$. The $L_\infty$ norm of the first term in (A.26) is bounded by

$$\sup_{\delta\in H_{\theta^*}}\|\xi_{7M_\theta}(\delta)\|_\infty = O(\sqrt{n}\log n) \qquad (A.28)$$

on the set $\mathcal{E}_{15}$.

Using a second-order Taylor expansion, we approximate the last term in (A.26) by its first-order term $\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)(\delta_\theta - \theta^*_{M_\theta})$. It follows from (5.7) that the $L_\infty$ norm of the remainder is bounded from above by

$$\max_{1\le l\le s_\theta}\lambda_{\max}\left\{(|\phi_l(\tilde\delta_\theta)|)^T\frac{\partial\phi_{M_\theta}(\tilde\delta_\theta)}{\partial\theta_{M_\theta}}\right\}\|\delta_\theta - \theta^*_{M_\theta}\|_2^2 = O(s_\theta n^{1-2\gamma_\theta}\log^2 n), \qquad (A.29)$$

where $\tilde\delta_\theta$ lies on the line segment between $\theta^*_{M_\theta}$ and $\delta_\theta$.

Define $\Psi_2(\delta_\beta, \delta_\theta) = -\{\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)\}^{-1}\left[\phi_{M_\theta}(\delta_\theta)^T\{Y - \Phi(\delta_\theta)\} - n\lambda_{3n}\bar\rho_3(\delta_\theta)\right]$; equation (A.2) is equivalent to $\Psi_2(\delta_\beta, \delta_\theta) = 0$. Similarly to $\Psi_1$, we now show that $\Psi_2(\delta_\beta, \delta_\theta)$ is dominated by $\delta_\theta - \theta^*_{M_\theta}$. Define $u_\theta = \Psi_2(\delta_\beta, \delta_\theta) - \delta_\theta + \theta^*_{M_\theta}$; it follows from (5.1), (5.3), (B.13), (A.26), (A.27), (A.28) and (A.29) that

$$\|u_\theta\|_\infty \le \left\|\{\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)\}^{-1}\right\|_\infty\left\{\|\xi_{3M_\theta}\|_\infty + \|\xi_{7M_\theta}(\delta_\theta)\|_\infty + \left\|\phi_{M_\theta}(\delta_\theta)^T\{\Phi(\delta_\theta) - \Phi\} - \phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)(\delta_\theta - \theta^*_{M_\theta})\right\|_\infty + n\lambda_{3n}\rho_3'(d_{n\theta})\right\} + \left\|\{\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)\}^{-1}\phi_{M_\theta}(\delta_\theta)^T(\mu - \Phi)\right\|_\infty = o(n^{-\gamma_\theta}\log n) + O(n^{-\gamma_\theta}\log n). \qquad (A.30)$$

Therefore, we can find a large constant K < ∞ such that, for n large enough, if $(\delta_\theta - \theta^*_{M_\theta})_j = Kn^{-\gamma_\theta}\log n$, then

$$\Psi_{2j}(\delta_\beta, \delta_\theta) > 0, \qquad (A.31)$$

and if $(\delta_\theta - \theta^*_{M_\theta})_j = -Kn^{-\gamma_\theta}\log n$, then

$$\Psi_{2j}(\delta_\beta, \delta_\theta) < 0. \qquad (A.32)$$

Combining (A.24) and (A.25) with (A.31) and (A.32), an application of Miranda's existence theorem shows that equations (A.1) and (A.2) have a solution $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$ in ℵ.

Step 2. Let $(\hat\beta^T, \hat\theta^T)^T$ be a solution to equations (A.1) and (A.2) with $\hat\beta_{M_\beta^c} = 0$ and $\hat\theta_{M_\theta^c} = 0$. We show that $(\hat\beta^T, \hat\theta^T)^T$ satisfies inequalities (A.3) and (A.4). Decompose the left-hand side of (A.3) as the sum of the following terms:

$$\hat{Z}_{M_\beta^c}^T(Y - \hat\Phi - \hat{Z}_{M_\beta}\hat\beta_{M_\beta}) = \xi_{1M_\beta^c} + \xi_{2M_\beta^c} + Z_{M_\beta^c}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta} + \xi_{5M_\beta^c}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) + \omega_{1M_\beta^c} + \omega_{2M_\beta^c} + \eta_{1M_\beta^c} + X_{M_\beta^c}^T\Delta X_{M_\beta}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) + \eta_{2M_\beta^c} + Z_{M_\beta^c}^T(\Phi - \hat\Phi). \qquad (A.33)$$

On the set $\mathcal{E}_4\cap\mathcal{E}_6\cap\mathcal{E}_{10}\cap\mathcal{E}_{12}$, it is immediate that

$$\left\|\xi_{1M_\beta^c} + \xi_{2M_\beta^c} + \xi_{5M_\beta^c}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) + Z_{M_\beta^c}^T(\Phi - \hat\Phi)\right\|_\infty = O(n^{1-d_\beta}\log n). \qquad (A.34)$$

By (B.4), (B.5) and (A.20), a first-order Taylor expansion gives

$$\|\omega_{1M_\beta^c} + \omega_{2M_\beta^c}\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n). \qquad (A.35)$$

Similarly, it follows from (5.17) and (A.13) that

$$\|\eta_{2M_\beta^c}\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.36)$$

On the set $\mathcal{E}_{10}$, by (5.17) and (A.12), we have

$$\left\|Z_{M_\beta^c}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}\right\|_\infty = O(n^{1-d_\beta}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.37)$$

Approximating $\eta_{1M_\beta^c}$ by $X_{M_\beta^c}^T W(\hat\theta_{M_\theta})\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha})$, the $L_\infty$ norm of the remainder is bounded from above by

$$\max_j(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|W(\hat\theta_{M_\theta})x^j|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}) = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n), \qquad (A.38)$$

by (5.16). Let

$$u_\beta = \hat{Z}_{M_\beta^c}^T(Y - \hat\Phi - \hat{Z}_{M_\beta}\hat\beta_{M_\beta}) - X_{M_\beta^c}^T\Delta X_{M_\beta}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) - X_{M_\beta^c}^T W(\hat\theta_{M_\theta})\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha});$$

it follows from (A.33)-(A.38) that

$$\|u_\beta\|_\infty = O\left(n^{1-d_\beta}\log n + s_\alpha n^{1-2\gamma_\alpha}\log^2 n + s_\beta n^{1-2\gamma_\beta}\log^2 n\right). \qquad (A.39)$$

Since $\hat\beta_{M_\beta}$ solves (A.1), we have $\Psi_1(\hat\beta_{M_\beta}, \hat\theta_{M_\theta}) = 0$ and hence

$$\hat\beta_{M_\beta} - \beta_{0M_\beta} = -u_\beta^\dagger, \qquad (A.40)$$

where $u_\beta^\dagger$ denotes $u_\beta$ in (A.22) evaluated at $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$. Combining (A.40) with (A.23) and (A.39) gives

$$\frac{1}{n\lambda_{2n}}\left\|\hat{Z}_{M_\beta^c}^T(Y - \hat\Phi - \hat{Z}_{M_\beta}\hat\beta_{M_\beta})\right\|_\infty \le \frac{1}{n\lambda_{2n}}\Big[\|u_\beta\|_\infty + \left\|X_{M_\beta^c}^T\Delta X_{M_\beta}(X_{M_\beta}^T\Delta X_{M_\beta})^{-1}\right\|_\infty\left\{\|u_\beta^\dagger\|_\infty + \left\|X_{M_\beta}^T W(\hat\theta_{M_\theta})\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha})\right\|_\infty\right\} + \left\|X_{M_\beta^c}^T W_\beta W(\hat\theta_{M_\theta})X_{M_\alpha}(X_{M_\alpha}^T\Delta X_{M_\alpha})^{-1}\right\|_\infty\left\{n\lambda_{1n}\rho_1'(d_{n\alpha}) + O(\sqrt{n}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n)\right\}\Big] \le o(1) + C\rho_2'(0+),$$

by (5.11), (B.3), (B.16) and (B.19). Since C < 1, (A.3) is satisfied for sufficiently large n.

We now verify (A.4), decomposing $\hat\phi_{M_\theta^c}^T(Y - \hat\Phi)$ as the sum of

$$\{\hat\phi_{M_\theta^c} - \phi_{M_\theta^c}\}^T(e + Z\beta_0) + \xi_{3M_\theta^c} + \hat\phi_{M_\theta^c}^T(\mu - \Phi) + \hat\phi_{M_\theta^c}^T(\Phi - \hat\Phi). \qquad (A.41)$$

On the set $\mathcal{E}_8\cap\mathcal{E}_{16}$, we have

$$\left\|\xi_{3M_\theta^c} + \{\hat\phi_{M_\theta^c} - \phi_{M_\theta^c}\}^T(e + Z\beta_0)\right\|_\infty = O(n^{1-d_\theta}\log n). \qquad (A.42)$$

Similar to (A.29), a second-order Taylor expansion gives

$$\left\|\hat\phi_{M_\theta^c}^T(\hat\Phi - \Phi) - \hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\theta_{M_\theta} - \theta^*_{M_\theta})\right\|_\infty = O(s_\theta n^{1-2\gamma_\theta}\log^2 n), \qquad (A.43)$$

by (5.7). Since $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$ solves $\Psi_2(\delta_\beta, \delta_\theta) = 0$, it follows from (A.30) that

$$\left\|\hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\theta_{M_\theta} - \theta^*_{M_\theta}) - \hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta})^{-1}\hat\phi_{M_\theta}^T(\mu - \Phi)\right\|_\infty \le \left\|\hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta})^{-1}\right\|_\infty\left\{O(\sqrt{n}\log n + s_\theta n^{1-2\gamma_\theta}\log^2 n) + n\lambda_{3n}\rho_3'(d_{n\theta})\right\}. \qquad (A.44)$$

By (A.41)-(A.44) and conditions (5.2), (5.4), (B.15) and (B.20), the left-hand side of (A.4) can be bounded by

$$\frac{1}{n\lambda_{3n}}\left\{O(n^{1-d_\theta}\log n) + O(s_\theta n^{1-2\gamma_\theta}\log^2 n)\right\} + \frac{1}{n\lambda_{3n}}\left\|\hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta})^{-1}\right\|_\infty\left\{O(\sqrt{n}\log n) + O(s_\theta n^{1-2\gamma_\theta}\log^2 n) + n\lambda_{3n}\rho_3'(d_{n\theta})\right\} + \frac{1}{n\lambda_{3n}}\left\|\hat\phi_{M_\theta^c}^T\{E - P(\hat\phi_{M_\theta})\}(\mu - \Phi)\right\|_\infty \le o(1) + C\rho_3'(0+),$$

for C < 1. Therefore (A.4) is satisfied.

Step 3. We now show that the second-order conditions (A.5) and (A.6) hold. Because (A.6) is directly implied by (B.17), it suffices to show that $\lambda_{\min}(\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}) \ge \lambda_{\min}(X_{M_\beta}^T\Delta X_{M_\beta}) - o(n)$ for sufficiently large n. Since $(\hat{Z}_{M_\beta} - Z_{M_\beta})^T(\hat{Z}_{M_\beta} - Z_{M_\beta})$ is positive semi-definite, we have

$$\lambda_{\min}(\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}) \ge \lambda_{\min}(X_{M_\beta}^T\Delta X_{M_\beta}) + \lambda_{\min}\left\{(\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta}) + \xi_{5M_\beta}\right\}. \qquad (A.45)$$

Since for any symmetric matrix Ψ the absolute value of the minimum eigenvalue can be bounded as

$$|\lambda_{\min}(\Psi)| \le \sqrt{\lambda_{\max}(\Psi^2)} \le \sqrt{\|\Psi\|_\infty\|\Psi\|_1} = \|\Psi\|_\infty,$$

(A.5) follows if we can show that $\|\xi_{5M_\beta}\|_\infty + \|(\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\|_\infty = o(n)$. This is immediate, since

$$\|\xi_{5M_\beta}\|_\infty = O(n^{1/2+\gamma_\beta}/\log n) = o(n)$$

on the set $\mathcal{E}_{11}$. Similar to (A.20), $\|(\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\|_\infty$ can be bounded from above by

$$2\max_{j\le s_\beta}\sqrt{\lambda_{\max}\{X_{M_\beta}^T\mathrm{diag}(|x^j|)X_{M_\beta}\}\,\lambda_{\max}\{X_{M_\alpha}^T\mathrm{diag}(|x^j|)X_{M_\alpha}\}}\ \|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2^2, \qquad (A.46)$$

which is $O(\sqrt{s_\alpha s_\beta}\,n^{1-\gamma_\alpha}\log n) = o(n)$, implied by the constraint $\max(l_1, l_2) < \gamma_\alpha$. This completes the proof.

Footnotes

* The research of Chengchun Shi and Rui Song is supported in part by Grants NSF-DMS 1309465 and 1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is supported in part by Grant NCI P01 CA142538.

Supplementary Material

Supplement to “Robust Learning for Optimal Treatment Decision with NP-Dimensionality”

(doi: 10.1214/16-EJS1178SUPP; .pdf).

References

  1. Bunea F, Tsybakov A, Wegkamp M. Sparsity oracle inequalities for the Lasso. Electron J Stat. 2007;1:169–194. MR2312149.
  2. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19:317–343. MR2757118.
  3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. MR1946581.
  4. Fan A, Lu W, Song R. Sequential advantage selection for optimal treatment regime. Ann Appl Stat. 2015, to appear. MR3480486.
  5. Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory. 2011;57:5467–5484. MR2849368.
  6. Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. Ann Statist. 2014;42:819–849. MR3210988.
  7. Gunter L, Zhu J, Murphy SA. Variable selection for qualitative interactions. Stat Methodol. 2011;8:42–55. MR2741508.
  8. Li KC, Duan N. Regression analysis under link violation. Ann Statist. 1989;17:1009–1052. MR1015136.
  9. Lu W, Zhang HH, Zeng D. Variable selection for optimal treatment decision. Stat Methods Med Res. 2013;22:493–504. MR3190671.
  10. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65:331–366. MR1983752.
  11. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Statist. 2011;39:1180–1210. MR2816351.
  12. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. Lecture Notes in Statistics, Vol. 179. New York: Springer; 2004. pp. 189–326. MR2129402.
  13. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.
  14. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
  15. Shi C, Song R, Lu W. Supplement to "Robust learning for optimal treatment decision with NP-dimensionality". 2016. doi: 10.1214/16-EJS1178SUPP.
  16. Song R, Kosorok M, Zeng D, Zhao Y, Laber E, Yuan M. On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat. 2015;4:59–68. MR3405390.
  17. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267–288.
  18. Tsiatis AA. Semiparametric Theory and Missing Data. Springer Series in Statistics. New York: Springer; 2006. MR2233926.
  19. Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension. Ann Statist. 2013;41:2505–2536. MR3127873.
  20. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
  21. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. MR0640163.
  22. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942. MR2604701.
  23. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. MR3040007.
  24. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107:1106–1118. MR3010898.
