Author manuscript; available in PMC 2017 Aug 4. Published in final edited form as: Electron J Stat. 2016 Oct 13;10:2894–2921. doi: 10.1214/16-EJS1178

Robust learning for optimal treatment decision with NP-dimensionality

Chengchun Shi 1, Rui Song 1, Wenbin Lu 1
PMCID: PMC5544015  NIHMSID: NIHMS834730  PMID: 28781717

Abstract

In order to identify important variables involved in making optimal treatment decisions, Lu, Zhang and Zeng (2013) proposed a penalized least-squares regression framework for a fixed number of predictors, which is robust against misspecification of the conditional mean model. Two problems arise: (i) in a world of explosively big data, effective methods are needed to handle ultra-high dimensional data sets, for example, when the dimension of predictors is of non-polynomial (NP) order of the sample size; (ii) both the propensity score and conditional mean models need to be estimated from data under NP dimensionality.

In this paper, we propose a robust two-step procedure for estimating the optimal treatment regime under NP dimensionality. In both steps, penalized regressions with non-concave penalty functions are employed, and the posited conditional mean model of the response given predictors may be misspecified. The asymptotic properties of the proposed estimators, such as weak oracle properties, selection consistency and oracle distributions, are investigated. In addition, we study the limiting distribution of the estimated value function for the obtained optimal treatment regime. The empirical performance of the proposed estimation method is evaluated by simulations and an application to a depression dataset from the STAR*D study.

Keywords and phrases: Non-concave penalized likelihood, optimal treatment strategy, oracle property, variable selection

1. Introduction

Personalized medicine, which has gained much attention over the past few years, is a medical paradigm that emphasizes the systematic use of individual patient information to optimize that patient's health care. In this paradigm, the primary interest lies in identifying the optimal treatment strategy that assigns the best treatment to a patient based on his/her observed covariates. Formally speaking, a treatment regime is a function that maps the sample space of patients' covariates to the set of treatments.

There is a growing literature on estimating optimal individualized treatment regimes. Existing methods can be cast into two categories: model-based methods and direct search methods. Popular model-based methods include Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010) and A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003), where Q-learning models the conditional mean of the response given predictors and treatment, while A-learning models only the interaction between treatment and predictors, better known as the contrast function. The advantage of A-learning is its robustness against misspecification of the baseline mean function, provided that the propensity score model is correctly specified. Recently, Zhang et al. (2012) proposed inverse propensity score weighted (IPSW) and augmented-IPSW estimators to directly maximize the mean potential outcome under a given treatment regime, i.e. the value function. Moreover, Zhao et al. (2012) recast the estimation of the value function from a classification perspective and used machine learning tools to directly search for the optimal treatment regime.

Rapid advances and breakthroughs in technology and communication systems have made it possible to gather an extraordinarily large number of prognostic factors for each individual. For example, in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, 305 covariates are collected from each patient. With such data at hand, it is of significant importance to organize and integrate the information that is relevant to making optimal individualized treatment decisions, which makes variable selection an emerging need for implementing personalized medicine. There have been extensive developments of variable selection methods for prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), MCP (Zhang, 2010) and many others in the context of penalized regression. Their associated inferential properties have been studied when the number of predictors is fixed, diverging with the sample size, or of non-polynomial order of the sample size.

In contrast to the large amount of work on variable selection methods for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Among those available, Gunter, Zhu and Murphy (2011) proposed variable ranking methods for the marginal qualitative interaction of predictors with treatment. Fan, Lu and Song (2015) developed a sequential advantage selection method that extends the marginal ranking methods by selecting important variables with qualitative interactions in a sequential fashion. However, no theoretical justifications were provided for these methods. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime. However, the associated variable selection properties, such as selection consistency, convergence rates and oracle distributions, were not studied. Lu, Zhang and Zeng (2013) introduced a new penalized least-squares regression framework, which is robust against misspecification of the conditional mean function. However, they only studied the case where the number of covariates is fixed and the propensity score model is known, as in randomized clinical trials. Song et al. (2015) proposed penalized outcome weighted learning for the case with a fixed number of predictors.

In this paper, we study the penalized least-squares regression framework considered in Lu, Zhang and Zeng (2013) when the number of predictors is of non-polynomial (NP) order of the sample size. In addition, we consider the more general situation where the propensity score model may depend on predictors and needs to be estimated from data, as is common in observational studies. A two-step estimation procedure is developed. In the first step, penalized regression models are fitted for the propensity score and the conditional mean of the response given predictors. In the second step, the optimal treatment regime is estimated using a penalized least-squares regression with the estimated propensity score and conditional mean models obtained in the first step. There are several challenges in both the numerical implementation and the derivation of theoretical properties, such as weak oracle and oracle properties, for the proposed estimation procedure. First, since the posited model for the conditional mean of the response given predictors may be misspecified, the associated estimation and variable selection properties under model misspecification with NP dimensionality are not standard. Second, it is unknown how the asymptotic properties of the estimators for the optimal treatment regime obtained in the second step depend on the estimated propensity score and conditional mean models obtained in the first step under NP dimensionality. To our knowledge, these two challenges have not been studied in the literature. Moreover, we estimate the value function of the estimated optimal regime and study this estimator's theoretical properties.

The remainder of the paper is organized as follows. The proposed method for estimating the optimal treatment regime is introduced in Section 2. Simulation results are presented in Section 3. An application to a dataset from the STAR*D study is illustrated in Section 4. Sections 5 and 6 establish the weak oracle and oracle properties of the resulting estimators, respectively. The estimator for the value function of the estimated optimal treatment regime is given in Section 7, followed by concluding remarks in Section 8. All the technical proofs are given in the Appendix.

2. Method

Let Y denote the response, A ∈ 𝒜 denote the treatment received, where 𝒜 is the set of available treatment options, and X denote the baseline covariates, including the constant one. For demonstration purposes, we focus on a binary treatment regime, i.e., 𝒜 = {0, 1}, with 0 for the standard treatment and 1 for the new treatment. We consider the following semiparametric model:

$$Y = h_0(X) + A\,(\beta_0^T X) + e, \qquad (2.1)$$

where $h_0(X)$ is an unspecified baseline function, $\beta_0$ is the p-dimensional vector of regression coefficients and e is an independent error with mean 0 and variance $\sigma^2$. Under the assumptions of stable unit treatment value (SUTVA) and no unmeasured confounders (Rubin, 1974), it can be shown that the optimal treatment regime $d^{opt}(x)$ for patients with baseline covariates X = x takes the form

$$d^{opt}(x) = I\{E(Y|X=x, A=1) - E(Y|X=x, A=0) > 0\} = I(\beta_0^T x > 0),$$

where I(·) is the indicator function.
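To see why the rule takes this form, note that under model (2.1) the treatment contrast is linear in x:

$$E(Y|X=x, A=1) - E(Y|X=x, A=0) = \{h_0(x) + \beta_0^T x\} - h_0(x) = \beta_0^T x,$$

so the optimal regime assigns the new treatment exactly when this contrast is positive, regardless of the form of the baseline function $h_0$.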

Our primary interest is in estimating the regression coefficients $\beta_0$ that define the optimal treatment regime. Let $\pi(x) = P(A = 1|X = x)$ denote the propensity score. We assume a logistic regression model for $\pi(x)$:

$$\pi(x, \alpha_0) = \exp(x^T\alpha_0)/\{1 + \exp(x^T\alpha_0)\}, \qquad (2.2)$$

with p-dimensional parameter $\alpha_0$. Here, we allow the propensity score to depend on covariates, which is common in observational studies, and the parameter $\alpha_0$ is estimated from the data. For randomized clinical trials, $\pi(x, \alpha_0)$ is a constant. We assume that the majority of the elements in $\beta_0$ and $\alpha_0$ are zero and refer to the supports $\mathrm{supp}(\beta_0)$ and $\mathrm{supp}(\alpha_0)$ as the true underlying sparse models.

Consider a study with n subjects. Assume the design matrix $X = (x_1, \ldots, x_n)^T$ is deterministic. The observed data consist of $\{(Y_i, A_i, x_i) : i = 1, \cdots, n\}$. Define $\mu(x) = h_0(x) + \pi(x, \alpha_0)\,x^T\beta_0$, the conditional mean of the response given covariates X = x. We propose the following two-step procedure to estimate the optimal treatment regime. In the first step, we posit a model $\Phi(x, \theta)$ for the conditional mean function $\mu(x)$ and consider penalized estimation of the propensity score and conditional mean models as follows.

Define

$$\hat\alpha = \arg\min_\alpha\ \frac{1}{n}\sum_{i=1}^n\left[\log\{1 + \exp(x_i^T\alpha)\} - A_i x_i^T\alpha\right] + \sum_{j=1}^p \lambda_{1n}\,\rho_1(|\alpha_j|, \lambda_{1n}), \qquad (2.3)$$

and

$$\hat\theta = \arg\min_\theta\ \frac{1}{n}\sum_{i=1}^n\{Y_i - \Phi(x_i, \theta)\}^2 + \sum_{j=1}^q \lambda_{2n}\,\rho_2(|\theta_j|, \lambda_{2n}), \qquad (2.4)$$

where $\alpha_j$ and $\theta_j$ denote the jth elements of α and θ, q is the dimension of θ, and $\rho_1$ and $\rho_2$ are folded concave penalty functions with tuning parameters $\lambda_{1n}$ and $\lambda_{2n}$, respectively. We allow p and q to be of NP order of n and assume $\log p = O(n^{1-2d_\beta})$ and $\log q = O(n^{1-2d_\theta})$ for some $d_\beta, d_\theta \in (0, \frac{1}{2})$. The posited model $\Phi(x, \theta)$ may be misspecified.

Define $\hat\Phi_i = \Phi(x_i, \hat\theta)$ and $\hat\pi_i = \pi(x_i, \hat\alpha)$. In the second step, we consider the following penalized least-squares estimation:

$$\hat\beta = \arg\min_\beta\ \frac{1}{n}\sum_{i=1}^n\{Y_i - \hat\Phi_i - (A_i - \hat\pi_i)\,\beta^T x_i\}^2 + \sum_{j=1}^p \lambda_{3n}\,\rho_3(|\beta_j|, \lambda_{3n}), \qquad (2.5)$$

where ρ3 is a folded-concave penalty function with the tuning parameter λ3n. Here the folded-concave penalty functions ρ1, ρ2 and ρ3 are assumed to satisfy the following condition:

Condition 2.1. ρ(t, λ) is increasing and concave in t ∈ [0, ∞), and has a continuous derivative ρ′(t, λ) with ρ′(0+, λ) > 0. In addition, ρ′(t, λ) is increasing in λ ∈ [0, ∞) and ρ′(0+, λ) is independent of λ.
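As a concrete example, the SCAD penalty (Fan and Li, 2001), written in the scaling of (2.3)-(2.5) where the penalty term is $\lambda\rho(t, \lambda)$ (our reading of the convention, following Fan and Lv, 2011), has derivative

$$\rho'(t, \lambda) = I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda}\,I(t > \lambda), \qquad a > 2,$$

so that $\rho'(0+, \lambda) = 1$ does not depend on λ and $\rho'(t, \lambda)$ is increasing in λ, as Condition 2.1 requires; a = 3.7 is the usual default.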

Popular penalties, such as LASSO, SCAD and MCP, satisfy Condition 2.1. In our implementation, we use the SCAD penalty. Here, we adopt a two-step estimation procedure due to its computational simplicity. Alternatively, we could jointly estimate the parameters θ in the conditional mean model and β in the contrast function in a single penalized regression. However, this joint approach requires more computational effort, since the tuning parameters for θ and β would need to be selected simultaneously. In contrast, our two-step method only requires a single tuning parameter at each step and thus can be easily implemented with existing software, for example, the R package ncvreg, as sketched below.
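A minimal sketch of the two-step procedure using ncvreg follows. The toy data generation and the linear working model for Φ are our illustrative assumptions, not the authors' implementation; only the cv.ncvreg/predict/coef calls are the package's actual interface.

    ## Two-step estimation of the optimal treatment regime with SCAD penalties.
    library(ncvreg)
    set.seed(1)
    n <- 300; p <- 50
    X <- matrix(rnorm(n * p), n, p)                     # covariates (no constant column)
    A <- rbinom(n, 1, plogis(1.5 * X[, 1]))             # treatment from model (2.2)
    Y <- 1 - 2 * X[, 1] - X[, 2] +
         A * (-1.5 * X[, 1] + 1.5 * X[, 2]) + rnorm(n)  # toy outcome of the form (2.1)

    ## Step 1a: penalized logistic regression (2.3) for the propensity score;
    ## the tuning parameter is chosen by 10-fold cross-validation.
    cv_alpha <- cv.ncvreg(X, A, family = "binomial", penalty = "SCAD")
    pi_hat   <- predict(cv_alpha, X, type = "response")

    ## Step 1b: penalized least squares (2.4) for the conditional mean mu(x),
    ## with a (possibly misspecified) linear working model Phi(x, theta).
    cv_theta <- cv.ncvreg(X, Y, family = "gaussian", penalty = "SCAD")
    Phi_hat  <- predict(cv_theta, X)

    ## Step 2: penalized least squares (2.5): regress Y - Phi_hat on
    ## (A - pi_hat) * x, where x includes the constant one.
    Z        <- (A - pi_hat) * cbind(1, X)              # row-wise scaling of [1, X]
    cv_beta  <- cv.ncvreg(Z, Y - Phi_hat, family = "gaussian", penalty = "SCAD")
    beta_hat <- coef(cv_beta)[-1]                       # drop ncvreg's own intercept

    ## Estimated optimal regime: treat (A = 1) when x' beta_hat > 0.
    d_hat <- as.integer(cbind(1, X) %*% beta_hat > 0)

Note that each step involves a single cross-validated tuning parameter, which is what makes the two-step scheme convenient in practice.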

3. Numerical studies

In this section, we evaluate the numerical performance of the proposed estimators in various settings. The treatment was generated from the logistic propensity score model (2.2), with only one important covariate, whose coefficient was set to 1.5. We chose three forms for the baseline function $h_0(x)$: a simple linear form, a quadratic form and a complex nonlinear form,

  • Model I: $Y = 1 + \theta_0^T X + A(\beta_0^T \tilde{X}) + \varepsilon$,
  • Model II: $Y = 1 + 0.5\,(1 + \theta_0^T X)^2 + A(\beta_0^T \tilde{X}) + \varepsilon$,
  • Model III: $Y = 1 + 1.5\sin(\pi\,\theta_0^T X) + X_1^2 + A(\beta_0^T \tilde{X}) + \varepsilon$,

where X is a p-dimensional vector of covariates and $\tilde{X} = (1, X^T)^T$. We set p = 1000. Covariates were generated independently from two distributions: the standard normal, or a shifted exponential distribution with mean 0 and variance 1.

For each model, the first two covariates were chosen as important variables in both the baseline mean function and the contrast function, with $\theta_0 = (-2, -1, 0, \ldots, 0)^T$ and $\beta_0 = (0, -1.5, 1.5, 0, \ldots, 0)^T$. We considered two sample sizes, n = 300 and n = 500. For each scenario, we conducted 1000 replications. In our method, we fitted a linear model for $\Phi(X, \theta)$ and used the SCAD penalty for variable selection. The tuning parameters were chosen by 10-fold cross-validation.

To evaluate the performance of the proposed estimator, we also compared our method with penalized Q-learning using the SCAD penalty. Specifically, we fitted a linear model with baseline covariate effects and treatment-covariate interactions. Note that this working model is correctly specified under Model I but misspecified under Models II and III.

Let β̂ and β̃ denote our estimator and the penalized Q-learning estimator, respectively. We report the L2 losses of β̂ and β̃, the number of missed important variables (denoted as FN), the number of selected noise variables (denoted as FP) and the average percentage of correct decisions (denoted as PCD), defined as $1 - \sum_{i=1}^n |d(x_i) - I(\beta_0^T x_i > 0)|/n$ for the treatment rules $\hat{d}(x) = I(x^T\hat\beta > 0)$ and $\tilde{d}(x) = I(x^T\tilde\beta > 0)$. In addition, we estimated E{Y(d̂)}, E{Y(d̃)} and E{Y(d^opt)}, the value functions of the estimated optimal treatment regimes obtained by our method and by penalized Q-learning, and of the true optimal regime, respectively, using Monte Carlo simulations. For a given treatment rule d(x), we compute E{Y(d)} by averaging the responses of 20000 subjects generated from the true model with A determined by d(x). We report the averages of these mean responses over the 1000 replications as well as their standard deviations.
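A sketch of these evaluation metrics, continuing the notation of the sketch in Section 2 (the object names and the Model I generator are ours, under our reconstruction of the design):

    ## PCD over the observed design points x_i:
    Xt  <- cbind(1, X)                                   # attach the constant one
    pcd <- 1 - mean(abs((Xt %*% beta_hat > 0) - (Xt %*% beta0 > 0)))

    ## Monte Carlo value E{Y(d_hat)} under Model I: draw 20000 new subjects,
    ## fix A at the estimated rule, and average the generated responses.
    n_mc   <- 20000
    X_new  <- matrix(rnorm(n_mc * p), n_mc, p)
    Xt_new <- cbind(1, X_new)
    a_hat  <- as.integer(Xt_new %*% beta_hat > 0)
    y_pot  <- 1 + X_new %*% theta0 + a_hat * (Xt_new %*% beta0) + rnorm(n_mc)
    value  <- mean(y_pot)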

Table 1 summarizes the results. The penalized Q-learning method performs well under Model I, where the fitted linear model is correctly specified, and is more efficient than the proposed method, as expected. For example, when covariates are i.i.d. normal and n = 300, its PCD is around 99.3% and the estimated value function is very close to the true optimal value E{Y(d^opt)}. In contrast, under this setting, the PCD of our proposed method is 97.5%, and the estimated value function is slightly lower.

Table 1. Simulation results for L2 loss, FN, FP, PCD and values.

Robust learning with covariates i.i.d. normal

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.276            1.743            1.171
               500   0.189            1.453            0.700
  FP           300   5.104            9.148            12.481
               500   4.143            9.742            12.616
  FN           300   0.000            0.893            0.125
               500   0.000            0.471            0.002
  PCD          300   0.975            0.734            0.834
               500   0.983            0.789            0.904
  EY*(d̂)      300   1.842 (0.021)    4.544 (0.157)    2.716 (0.089)
               500   1.845 (0.019)    4.643 (0.116)    2.797 (0.048)
  EY*(d^opt)         1.847            4.846            2.847

Penalized Q-learning with covariates i.i.d. normal

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.080            4.861            1.729
               500   0.061            4.928            1.833
  FP           300   0.001            8.191            7.745
               500   0.000            4.438            7.972
  FN           300   0.000            0.050            0.757
               500   0.000            0.006            0.553
  PCD          300   0.993            0.550            0.714
               500   0.994            0.538            0.690
  EY*(d̃)      300   1.846 (0.021)    4.117 (0.165)    2.508 (0.192)
               500   1.846 (0.020)    4.091 (0.093)    2.457 (0.204)
  EY*(d^opt)         1.847            4.846            2.847

Robust learning with covariates i.i.d. shifted exponential

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.290            1.768            1.186
               500   0.199            1.495            0.730
  FP           300   6.596            9.700            13.240
               500   4.972            10.512           13.932
  FN           300   0.000            0.793            0.142
               500   0.000            0.466            0.003
  PCD          300   0.958            0.724            0.809
               500   0.971            0.761            0.871
  EY*(d̂)      300   1.744 (0.018)    4.500 (0.179)    2.670 (0.095)
               500   1.747 (0.018)    4.562 (0.161)    2.736 (0.041)
  EY*(d^opt)         1.751            4.751            2.783

Penalized Q-learning with covariates i.i.d. shifted exponential

  Measure      n     Model I          Model II         Model III
  L2 loss      300   0.264            2.580            2.225
               500   0.121            3.236            2.408
  FP           300   0.003            12.257           13.234
               500   0.000            21.383           15.479
  FN           300   0.045            0.824            0.288
               500   0.005            0.377            0.072
  PCD          300   0.954            0.610            0.609
               500   0.978            0.595            0.584
  EY*(d̃)      300   1.744 (0.018)    4.500 (0.179)    2.670 (0.095)
               500   1.743 (0.037)    4.197 (0.289)    2.201 (0.224)
  EY*(d^opt)         1.751            4.751            2.783

However, for Models II and III, the penalized Q-learning method can incur substantial bias and performs much worse than the proposed method. Taking Model II as an example, when covariates are normal and n = 300, $\|\tilde\beta - \beta_0\|_2 = 4.86$, approximately three times as large as $\|\hat\beta - \beta_0\|_2$. The PCD of the treatment regime estimated by penalized Q-learning is 55.0%, only slightly better than a random guess. In contrast, for this scenario, the PCD of our proposed method is 73.4%. Moreover, when the sample size increases, the performance of the penalized Q-learning method becomes even worse, due to the misspecification of the baseline mean function. For our method, the PCD increases markedly as the sample size grows, and the L2 loss and the average number of missed important variables are also greatly reduced. This demonstrates the robustness of the proposed method against misspecification of the baseline mean function.

4. Real data example

We applied our method to the data set from the STAR*D study of 4041 patients with nonpsychotic major depressive disorder (MDD). The aim of the study was to determine the effectiveness of different treatments for people who had not responded to an initial medication. At Level 1, all patients received citalopram (CIT), a selective serotonin reuptake inhibitor (SSRI). After 8-12 weeks, up to three more levels of treatment were offered to participants whose previous treatment did not give an acceptable response. Available options at Level 2 included switching to sertraline (SER), venlafaxine (VEN), bupropion (BUP) or cognitive therapy (CT), and augmenting CIT by combining it with one more treatment. At Level 2A, switch options to VEN or BUP were provided for patients receiving CT but without sufficient improvement. Four treatments were available at Level 3 for participants without an adequate response, including medication switches to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Finally, treatment with tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN) was provided at Level 4 for those without sufficient improvement at Level 3.

Here, we focused on the subset of patients receiving treatment BUP (coded as 1) or SER (coded as 0) at Level 2. The outcome of interest was the 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) score, which indicates the severity of a patient's depressive symptoms. The maximum value of QIDS-C16 is 24 and its distribution was highly skewed; hence, we considered the transformation Yi = log(25 − QIDS-C16) as our response, so that a larger value of Yi indicates a better response. All baseline variables at Level 1 and intermediate outcomes at Level 2 were included as covariates, yielding 305 covariates in total for each patient. There were 383 patients receiving treatment BUP or SER at Level 2; however, only 319 patients had complete records of all 305 covariates and the response. Among them, 153 were treated with BUP and 166 with SER. Our proposed method selected 14 variables as important for the treatment decision. We re-estimated the coefficients of these variables by solving the A-learning estimating equations (Robins, 2004) and obtained the resulting estimated optimal treatment regime.

To examine the performance of the estimated optimal treatment regime, we compared it with the fixed regimes that assign all patients to either BUP or SER, in terms of the estimated value functions obtained by the IPSW method (Zhang et al., 2012). The estimated value functions are given in Table 2. In addition, we report 95% confidence intervals for the difference between the estimated value of the obtained optimal regime and that of each fixed regime, based on 500 bootstrap samples (a sketch of this computation is given after Table 3). Our estimated optimal treatment regime gave larger estimated values than both fixed regimes. The difference is significant at the 5% level when compared with the BUP treatment, but less so when compared with the SER treatment. One reason is that our estimated optimal regime assigns the majority of patients (about two-thirds) to SER. Table 3 reports the numbers of patients receiving BUP or SER according to the estimated optimal regime.

Table 2. Estimated value functions and confidence intervals for the difference of the estimated values.

Treatment regime Estimated value function Diff 95% CI on Diff
Estimated optimal regime 3.10
BUP 2.55 0.55 [0.07, 1.13]
SER 2.80 0.30 [−0.08, 0.64]

Table 3. Number of patients receiving BUP or SER, according to the estimated optimal treatment regime.

receives BUP receives SER total
assigns BUP 66 50 116
assigns SER 93 110 203
total 153 160 319
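A sketch of the IPSW value estimate and bootstrap comparison behind Table 2. The inputs Y, A, pi_hat (fitted propensities) and d_hat (estimated decisions, 1 = BUP) are assumed from the earlier fit; the normalized form of the IPSW estimator is one common variant, and for brevity we do not re-estimate the propensity model within each resample, as a full replication of the analysis would.

    ## Normalized IPSW value estimate for a rule d (0/1 vector).
    ipsw_value <- function(Y, A, pi_hat, d) {
      p_d <- ifelse(d == 1, pi_hat, 1 - pi_hat)  # P(A = d(x) | x)
      w   <- as.numeric(A == d) / p_d            # inverse-probability weights
      sum(w * Y) / sum(w)
    }

    ## 500-resample bootstrap CI for the value difference vs. the all-BUP regime.
    set.seed(1)
    diffs <- replicate(500, {
      idx <- sample(length(Y), replace = TRUE)
      ipsw_value(Y[idx], A[idx], pi_hat[idx], d_hat[idx]) -
        ipsw_value(Y[idx], A[idx], pi_hat[idx], rep(1, length(idx)))
    })
    quantile(diffs, c(0.025, 0.975))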

In addition, as suggested by a referee, we examined the effect of missing data. Specifically, we deleted one patient whose response was missing and imputed all the missing values in the covariates using the R package missForest, available on CRAN. This package uses a random forest, trained on the observed entries of the design matrix, to predict the missing values. The optimal treatment regime obtained from the imputed data was similar to the one obtained from the complete-case analysis above: it selected 14 variables, 11 of which were also included in the estimated optimal treatment regime without imputation. In addition, the bootstrap results suggested that the estimated value of the estimated optimal treatment regime is significantly larger than those of the fixed treatment regimes at the 0.05 significance level. Since the results are similar, we omit them here.
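The imputation step amounts to a one-call use of missForest; X_mis below stands for the covariate matrix with missing entries (our name for it):

    library(missForest)
    set.seed(1)
    imp   <- missForest(X_mis)   # random forest imputation of the missing entries
    X_imp <- imp$ximp            # completed covariate matrix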

5. Non-asymptotic weak oracle properties

In this section we show that the proposed estimators enjoy the weak oracle property; that is, α̂, θ̂ and β̂ defined in (2.3)-(2.5) are sign consistent with probability tending to 1 and are consistent with respect to the $L_\infty$ norm. The weak oracle property of θ̂ is established in the sense that it converges to a least false parameter θ* when the main effect model is misspecified.

Theorem 5.1 provides the main results. Some regularity conditions are discussed in Subsections 5.1 and 5.2. A major technical challenge in deriving the weak oracle property of β̂ is to analyze the deviation in (5.18), for which we develop a general empirical process result in the supplementary article (Shi, Song and Lu, 2016). This result is important in its own right and can be used to analyze many other high-dimensional semiparametric models where the index parameter of an empirical process is a plug-in estimator. The following notation is introduced to simplify the presentation.

Let 1 denote a vector of ones, E the identity matrix and O the zero matrix. For any matrix Ψ, let P(Ψ) denote the projection matrix $\Psi(\Psi^T\Psi)^{-1}\Psi^T$, and $\Psi_M$ the submatrix of Ψ formed by the columns in the subset M. For any vectors a and b, let "○" denote the Hadamard product $a \circ b = (a_1b_1, \ldots, a_nb_n)^T$, $|a| = (|a_1|, \ldots, |a_n|)^T$, diag(a) the diagonal matrix with the elements of a on its diagonal, and $a_M$ the subvector of a formed by the elements in M. The jth element of a is denoted by $a_j$. Let $\|\cdot\|_p$ be the $L_p$ norm of a vector or matrix. For any m ≥ 1, let $\|Y\|_{\psi_m}$ denote the Orlicz norm of a random variable Y,

$$\|Y\|_{\psi_m} = \inf\left\{u > 0 : E\exp(|Y|^m/u^m) \le 2\right\}.$$

Let $M_\alpha = \mathrm{supp}(\alpha_0)$, $M_\beta = \mathrm{supp}(\beta_0)$, $M_\theta = \mathrm{supp}(\theta^*)$, and $M_\alpha^c$, $M_\beta^c$, $M_\theta^c$ be their complements. Assume each column $x^j$ is standardized so that $\|x^j\|_2 = \sqrt{n}$. Let $\Phi(\theta) = [\Phi(x_1, \theta), \ldots, \Phi(x_n, \theta)]^T$ and let $\phi(\theta) = [\phi_1(\theta), \ldots, \phi_q(\theta)]$ denote its Jacobian matrix. The derivatives are taken componentwise, i.e.,

$$\phi_l(\theta) = (\phi_l(x_1, \theta), \ldots, \phi_l(x_n, \theta))^T, \qquad \phi_l(x, \theta) = \partial\Phi(x, \theta)/\partial\theta_l,$$

for all l = 1, …, q. We write Φ(θ*) and ϕ(θ*) simply as Φ and ϕ when there is no confusion, and use the shorthand $\hat\Phi$, $\hat\phi$ for $\Phi(\hat\theta)$, $\phi(\hat\theta)$.

5.1. The misspecified function

We first define the least false parameter under the misspecification of the posited mean function $\Phi(x, \theta)$. For regression models with a fixed number of predictors, the definition of the least false parameter under model misspecification has been widely studied in the literature (e.g., White, 1982; Li and Duan, 1989). However, for regression models with NP dimensionality, its definition is more subtle. Here, we define our least false parameter as follows.

For each $\theta \in \mathbb{R}^q$, let $d_{n\theta} = \min_j\{|\theta_j| : \theta_j \ne 0\}/2$, $M_\theta$ be the support of θ, $\mu = (\mu(x_1), \ldots, \mu(x_n))^T$ and

$$H_\theta = \left\{\delta \in \mathbb{R}^q : \delta_{M_\theta^c} = 0,\ \|\delta_{M_\theta} - \theta_{M_\theta}\|_\infty \le d_{n\theta}\right\}.$$

Consider the set

$$\Theta = \left\{\theta : \sup_{\delta\in H_\theta}\left\|\phi_{M_\theta^c}(\delta)^T\left[E - P\{\phi_{M_\theta}(\delta)\}\right]\{\mu - \Phi(\theta)\}\right\|_\infty \le C_0\,n^{1-d_\theta}\log n,\ |M_\theta| \le s_0\right\},$$

for some constant $C_0$ and $s_0 \ll n$. We assume the set Θ to be nonempty and define the least false parameter as

$$\theta^* = \arg\min_{\theta\in\Theta}\ \sup_{\delta\in H_\theta}\left\|\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\phi_{M_\theta}^T(\delta)\{\mu - \Phi(\theta)\}\right\|_\infty.$$

In addition, we assume

$$\sup_{\delta\in H_{\theta^*}}\left\|\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\phi_{M_\theta}^T(\delta)(\mu - \Phi)\right\|_\infty = O(n^{-\gamma_0}\log n), \qquad (5.1)$$

for some $\gamma_0 \ge 0$. By its definition, θ* satisfies

$$\sup_{\delta\in H_{\theta^*}}\left\|\phi_{M_\theta^c}(\delta)^T\left[E - P\{\phi_{M_\theta}(\delta)\}\right](\mu - \Phi)\right\|_\infty = O(n^{1-d_\theta}\log n), \qquad (5.2)$$

and $|M_\theta| \le s_0$.

Remark 5.1. Conditions (5.1) and (5.2) are the key assumptions determining the degree of model misspecification. Condition (5.1) requires that the posited working model Φ provides a good approximation to μ: in that case, the residual μ − Φ is nearly orthogonal to the Jacobian matrix $\phi_{M_\theta}$ and the left-hand side of (5.1) is small. In general, our assumptions are weaker than the weak sparsity assumption imposed for the Lasso (Bunea, Tsybakov and Wegkamp, 2007), which requires the $L_2$ approximation error $\|\mu - \Phi\|_2$ to converge to 0 at a certain rate.

Condition 5.1. We assume the following conditions:

$$\sup_{\delta\in H_{\theta^*}}\left\|\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\right\|_\infty = O(b_\theta/n), \qquad (5.3)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|\phi_{M_\theta^c}(\delta)^T\phi_{M_\theta}(\delta)\{\phi_{M_\theta}(\delta)^T\phi_{M_\theta}(\delta)\}^{-1}\right\|_\infty \le \min\left\{\frac{C\rho_3'(0+)}{\rho_3'(d_{n\theta})},\ O(n^{a_3})\right\}, \qquad (5.4)$$
$$\max_{1\le l\le q}\left\|\phi_l \circ (1 + |X\beta_0|)\right\|_2^2 = O(n), \qquad (5.5)$$
$$\max_{1\le l\le q}\sum_{k\in M_\theta}\sup_{\delta\in H_{\theta^*}}\left\|\frac{\partial\phi_l(\delta)}{\partial\theta_k}\circ(1 + |X\beta_0|)\right\|_2^2 = O(n^{\frac12+\gamma_\theta}\sqrt{s_\theta}\log n), \qquad (5.6)$$
$$\sup_{\delta_1\in H_{\theta^*}}\sup_{\delta_2\in H_{\theta^*}}\max_{1\le l\le q}\lambda_{\max}\left((|\phi_l(\delta_1)|)^T\frac{\partial\phi_{M_\theta}(\delta_2)}{\partial\theta_{M_\theta}}\right) = O(n), \qquad (5.7)$$

for some constants $0 \le a_3 \le 1/2$ and $0 \le \gamma_\theta \le \gamma_0$, where $s_\theta = |M_\theta|$. If the response is unbounded, we further require

$$\max_{1\le l\le q}\|\phi_l\|_\infty = o(n^{d_\theta}/\log n), \qquad (5.8)$$

and the right-hand side of (5.6) is modified to $O(n^{\frac12+\gamma_\theta}\log^2 n/\sqrt{s_\theta})$.

Remark 5.2. Conditions (5.6) and (5.7) put constraints on the derivatives of ϕ, requiring the misspecified function to be smooth. The right-hand side order in (5.6) is not too restrictive when $n^{\gamma_\theta} \ge \sqrt{s_\theta}\log n$.

Two common examples of the main-effect function Φ are provided below to examine the validity of Condition 5.1.

Example 1. Set Φ = 0; then no model needs to be fitted for the conditional mean. It is easy to check that Condition 5.1 is satisfied, since ϕ and its derivatives vanish.

Example 2. When a linear model is specified, i.e., $\Phi(x, \theta) = x^T\theta$, conditions (5.6) and (5.7) are automatically satisfied since the second-order derivatives of Φ vanish. In this example, θ* takes the form

$$\theta^*_{M_\theta} = (X_{M_\theta}^T X_{M_\theta})^{-1}X_{M_\theta}^T\mu,$$

and $\theta^*_{M_\theta^c} = 0$. Note that $\theta^*_{M_\theta}$ is the vector of regression coefficients of μ on $X_{M_\theta}$. Condition (5.1) holds automatically since

$$(X_{M_\theta}^T X_{M_\theta})^{-1}X_{M_\theta}^T(\mu - X\theta^*) = 0.$$

Condition (5.2) becomes

$$\left\|X_{M_\theta^c}^T\{E - P(X_{M_\theta})\}\mu\right\|_\infty = O(n^{1-d_\theta}\log n). \qquad (5.9)$$

Each element of the left-hand side vector in (5.9) can be viewed as the inner product of the residuals from regressing a noise variable in $X_{M_\theta^c}$ on $X_{M_\theta}$ with the residuals from regressing μ on $X_{M_\theta}$. When μ depends only on $X_{M_\theta}$, (5.9) holds for the Gaussian linear model.

5.2. The covariates

Condition 5.2. Assume that

$$\sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1}X_{M_\beta}^T W(\delta)\Delta X_{M_\alpha}B_{n\alpha}^{-1}\right\|_\infty = O(b_{\alpha\beta}/n), \qquad (5.10)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|X_{M_\beta^c}^T W_\beta W(\delta)X_{M_\alpha}B_{n\alpha}^{-1}\right\|_\infty = \min\left\{o\left(\frac{\lambda_{2n}\rho_2'(0+)}{\lambda_{1n}\rho_1'(d_{n\beta})}\right),\ O(n^{a_2})\right\}, \qquad (5.11)$$
$$\max_{1\le j\le p}\left\|W(\theta^*)x^j\right\|_2^2 = O(n), \qquad (5.12)$$
$$\max_{1\le j\le p}\sum_{k\in M_\alpha}\left\|x^k\circ x^j\circ(X\beta_0)\right\|_2 = O(n^{1/2+\gamma_\alpha}\log n), \qquad (5.13)$$
$$\max_{1\le j\le p}\sum_{k\in M_\beta}\left\|x^j\circ x^k\right\|_2 = O(n^{1/2+\gamma_\beta}\log n), \qquad (5.14)$$
$$\max_{1\le j\le p}\sum_{l\in M_\theta}\sup_{\delta\in H_{\theta^*}}\left\|x^j\circ\phi_l(\delta)\right\|_2 = O(n^{1/2+\gamma_\theta}\sqrt{s_\theta}\log^3 n), \qquad (5.15)$$
$$\sup_{\delta\in H_{\theta^*}}\max_{1\le j\le p}\lambda_{\max}\left[X_{M_\alpha}^T\mathrm{diag}(|W(\delta)x^j|)X_{M_\alpha}\right] = O(n), \qquad (5.16)$$
$$\max_{1\le j\le p}\lambda_{\max}\left[X_{M_\alpha}^T\mathrm{diag}(|x^j\circ(X\beta_0)|)X_{M_\alpha}\right] = O(n), \qquad (5.17)$$

for some constants $0 \le \gamma_\alpha, \gamma_\beta, a_2 \le 1/2$, where

$$W(\delta) = \mathrm{diag}[\mu - \Phi(\delta)],\quad B_{n\alpha} = X_{M_\alpha}^T\Delta X_{M_\alpha},\quad B_{n\beta} = X_{M_\beta}^T\Delta X_{M_\beta},$$
$$W_\beta = \Delta - \Delta^{1/2}P(\Delta^{1/2}X_{M_\beta})\Delta^{1/2},\quad \Delta = \mathrm{diag}\big(\pi(x_1)\{1-\pi(x_1)\}, \ldots, \pi(x_n)\{1-\pi(x_n)\}\big).$$

The sequence $b_{\alpha\beta}$ in (5.10) must satisfy

$$b_{\alpha\beta} = \min\left\{o\left(\frac{n^{\frac12-\gamma_\beta}}{\log n}\right),\ o\left(\frac{n^{2\gamma_\alpha-\gamma_\beta}}{\sqrt{s_\alpha}\log n}\right)\right\}.$$

Remark 5.3. Conditions (5.10) and (5.11) control the impact on β̂ of the deviation of the estimated propensity score from its true value, and thus are not needed when the propensity scores are known. By the definition of W(δ), the magnitudes of the left-hand sides of these two conditions depend on how well Φ approximates μ. The sequence $b_{\alpha\beta}$ in (5.10) can converge to 0 when $X_{M_\beta}$ and $X_{M_\alpha}$ are weakly correlated. Each element of the left-hand side of (5.11) is the multiple regression coefficient of the corresponding variable in $X_{M_\beta^c}$ on $W(\delta)X_{M_\alpha}$, using weighted least squares with weights π ○ (1 − π) after adjusting for $X_{M_\beta}$, which characterizes their weak dependence given $X_{M_\beta}$. These two conditions are generally weaker than those imposed by Fan and Lv (2011) (Condition 2), and are therefore more likely to hold.

Remark 5.4. The right-hand side of (5.15) can be relaxed to $O(n^{1/2+\gamma_\theta}\log n)$ when a linear model is used; the additional factor $\sqrt{s_\theta}$ is due to the penalty on the complexity of the main effect model. This condition typically controls the deviation

$$\left\|Z^T\{\Phi - \Phi(\hat\theta)\}\right\|_\infty = O_p(\sqrt{\log p}\,\log n), \qquad (5.18)$$

where Z = diag(A − π)X. A common approach to bounding such deviations is the classical Bernstein inequality. However, this approach does not work here, because the indexing parameter of the process Φ(·) in (5.18) is an estimator. To handle this challenge, we bound the left-hand side of (5.18) by

$$\sup_{\delta_1,\delta_2\in H_{\theta^*}}\left\|Z^T\{\Phi(\delta_1) - \Phi(\delta_2)\}\right\|_\infty.$$

A general theory covering the above result is provided in Proposition C.1 in the supplementary article.

Remark 5.5. Conditions (5.16) and (5.17) control the $L_\infty$ norm of the quadratic remainder of the Taylor expansion, viewed as a function of α̂ and expanded around $\alpha_0$. Similar to (5.10) and (5.11), these two conditions are not needed when $\alpha_0$ is known.

5.3. Weak oracle properties

Theorem 5.1 (Weak oracle property). Assume that Conditions B.1 and B.3 in the supplementary Appendix and Conditions 5.1 and 5.2 hold, and that $\max_i\|e_i\|_{\psi_1} < \infty$, where $e_i$ is the error for the ith patient in (2.1). Then there exist local minimizers α̂, θ̂ and β̂ of the loss functions (2.3), (2.4) and (2.5), respectively, such that with probability at least 1 − c/(n + p + q), for some positive constant c:

  1. $\hat\alpha_{M_\alpha^c} = 0$, $\hat\beta_{M_\beta^c} = 0$, $\hat\theta_{M_\theta^c} = 0$;

  2. $\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_\infty = O(n^{-\gamma_\alpha}\log n)$, $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_\infty = O(n^{-\gamma_\beta}\log n)$, $\|\hat\theta_{M_\theta} - \theta^*_{M_\theta}\|_\infty = O(n^{-\gamma_\theta}\log n)$.

Remark 5.6. In Theorem 5.1, part 1 corresponds to sparse recovery, while part 2 gives the estimators' convergence rates. The weak oracle property of α̂ follows directly from Theorem 2 in Fan and Lv (2011). However, proving this property for β̂ requires further effort to account for the variability due to plugging in θ̂ and α̂. The $L_\infty$ convergence rate of $\hat\alpha_{M_\alpha}$, as well as the nonsparsity size $s_\alpha$, plays an important role in determining how fast $\hat\beta_{M_\beta}$ converges.

Remark 5.7. The convergence rate of θ̂ does not affect that of β̂. This is because we require the posited propensity score model to be correct, so the estimation of β is robust against misspecification of the main effect model parametrized by θ. The simulation results also validate this theoretical finding.

6. Oracle properties

In this section we study the oracle property of the estimator β̂. We assume that $\max(s_\alpha, s_\beta) \ll \sqrt{n}$ and $n^{\gamma_\theta} \ge \sqrt{s_\theta}\log n$. The convergence rates of the estimators are established in Section 6.1 and their asymptotic distributions are provided in Section 6.2.

6.1. Rates of convergence

Condition 6.1. In addition to (5.16) and (5.17) in Condition 5.2, assume that the right-hand side of (5.15) is strengthened to $O(n^{\frac12+\gamma_\theta}\log^3 n/\sqrt{s_\theta})$, and that the following conditions hold:

$$\sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1/2}X_{M_\beta}^T W(\delta)\Delta X_{M_\alpha}B_{n\alpha}^{-1/2}\right\|_2 = O(1), \qquad (6.1)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|X_{M_\beta^c}^T W_\beta W(\delta)X_{M_\alpha}\right\|_{2,\infty} = O(n), \qquad (6.2)$$
$$\max_{1\le j\le p}\max_{k\in M_\beta}\left\|x^j\circ x^k\right\|_2^2 = O(n), \qquad (6.3)$$
$$\max_{1\le j\le p}\max_{k\in M_\alpha}\left\|x^j\circ x^k\circ(X\beta_0)\right\|_2^2 = O(n), \qquad (6.4)$$
$$\mathrm{tr}\left[X_{M_\beta}^T W(\theta^*)\Delta W(\theta^*)X_{M_\beta}\right] = O(s_\beta n). \qquad (6.5)$$

Remark 6.1. Similar to the interpretation of (5.10) and (5.11), condition (6.1) corresponds to a notion of weak dependence between the variables in $X_{M_\alpha}$ and $X_{M_\beta}$, while (6.2) requires that $X_{M_\beta^c}$ and $X_{M_\alpha}$ be weakly correlated after adjusting for $X_{M_\beta}$. Besides, it can be verified that (6.3)-(6.5) hold with high probability when the baseline covariates possess sub-Gaussian tails.

Theorem 6.1. Assume that Conditions 2.1, 5.1 and 6.1 and Conditions B.2 and B.4 in the supplementary Appendix hold, and that $\max_i\|e_i\|_{\psi_1} < \infty$. The constraints on $b_\theta$, $d_\theta$, $d_{n\theta}$ and $\lambda_{3n}$ are the same as in Theorem 5.1. Further assume $\max(l_1, l_2) < \frac12$ with $s_\alpha = O(n^{l_1})$ and $s_\beta = O(n^{l_2})$, and $n^{\gamma_\theta} \ge \sqrt{s_\theta}\log n$. Then there exist a strict local minimizer β̂ of the loss function (2.5) and α̂ of (2.3) such that $\hat\alpha_{M_\alpha^c} = 0$ and $\hat\beta_{M_\beta^c} = 0$ with probability tending to 1 as n → ∞, and $\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2 = O(\sqrt{s_\alpha}\,n^{-1/2})$, $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_2 = O(\sqrt{s_\alpha + s_\beta}\,n^{-1/2})$.

Remark 6.2. We note that when establishing the oracle property of β̂, only the weak oracle property of θ̂ is required. This is due to the robustness of the A-learning methods and the fact that the propensity score is correctly specified.

Remark 6.3. The precision of $\hat\beta_{M_\beta}$ is affected by that of $\hat\alpha_{M_\alpha}$, since $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_2$ is at least of the same order of magnitude as $\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2$. When the propensity score is known, the convergence rate of $\hat\beta_{M_\beta}$ improves to $\sqrt{s_\beta/n}$.

6.2. Asymptotic distributions

We define $\Sigma_{12}$ and $\Sigma_{22}$ as

$$\Sigma_{12} = 2B_{n\alpha}^{-1/2}X_{M_\alpha}^T\Delta W X_{M_\beta}B_{n\beta}^{-1/2},$$
$$\Sigma_{22} = B_{n\beta}^{-1/2}X_{M_\beta}^T W\Delta^{1/2}\left\{E - P(\Delta^{1/2}X_{M_\alpha})\right\}\Delta^{1/2}W X_{M_\beta}B_{n\beta}^{-1/2},$$

where W is shorthand for W(θ*).

To establish the weak convergence of the estimators, we introduce the following conditions.

Condition 6.2. Assume that

$$\lambda_{1n}\bar\rho_1(d_{n\alpha}) = o(s_\alpha^{-1/2}n^{-1/2}),\qquad \lambda_{2n}\bar\rho_2(d_{n\beta}) = o(s_\beta^{-1/2}n^{-1/2}), \qquad (6.6)$$
$$\sum_{i=1}^n\left(x_{M_\alpha i}^T B_{n\alpha}^{-1}x_{M_\alpha i}\right)^{3/2} \to 0,\qquad \sum_{i=1}^n\left(x_{M_\beta i}^T B_{n\beta}^{-1}x_{M_\beta i}\right)^{3/2} \to 0, \qquad (6.7)$$
$$\sum_{i=1}^n\left(x_{M_\beta i}^T B_{n\beta}^{-1}x_{M_\beta i}\right)^{3/2}|\mu_i - \Phi_i|^3 \to 0, \qquad (6.8)$$
$$\lambda_{\max}\left(B_{n\beta}^{-1/2}X_{M_\beta}^T W^2 X_{M_\beta}B_{n\beta}^{-1/2}\right) = O(1), \qquad (6.9)$$
$$\sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1/2}X_{M_\beta}^T\mathrm{diag}[\Phi - \Phi(\delta)]\,\Delta X_{M_\alpha}B_{n\alpha}^{-1/2}\right\|_2 = o(1), \qquad (6.10)$$

where $x_{M_\alpha i}$ and $x_{M_\beta i}$ denote the ith rows of $X_{M_\alpha}$ and $X_{M_\beta}$, respectively.

Remark 6.4. Conditions (6.7) and (6.8) are Lyapunov-type conditions which guarantee the asymptotic normality of $\hat\alpha_{M_\alpha}$ and $\hat\beta_{M_\beta}$. Condition (6.9) constrains the maximum eigenvalue of the variance-covariance matrix of $X_{M_\beta}^T\mathrm{diag}(A - \pi)(\mu - \Phi)$ by requiring it to be finite. Condition (6.10) holds when Φ(δ) converges to Φ uniformly in the $L_\infty$ norm for δ in the region $H_{\theta^*}$. When $\|\mu - \Phi\|_\infty$ is bounded, (6.8) and (6.9) are simultaneously satisfied.

Theorem 6.2 (Oracle property). Under the conditions of Theorem 6.1 and Condition 6.2, assume $\max(s_\alpha, s_\beta) = o(n^{1/3})$ and that the right-hand side of (5.15) is strengthened to $O(n^{\frac12+\gamma_\theta}\log^3 n/\sqrt{s_\beta s_\theta})$. Then with probability tending to 1 as n → ∞, $\hat\alpha = (\hat\alpha_{M_\alpha}^T, \hat\alpha_2^T)^T$ and $\hat\beta = (\hat\beta_{M_\beta}^T, \hat\beta_2^T)^T$ in Theorem 6.1 must satisfy:

  1. $\hat\alpha_2 = 0$, $\hat\beta_2 = 0$;

  2. $\left[\{A_{1n}B_{n\alpha}^{1/2}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\}^T, \{A_{2n}B_{n\beta}^{1/2}(\hat\beta_{M_\beta} - \beta_{0M_\beta})\}^T\right]^T$ is asymptotically normally distributed with mean 0 and covariance matrix Ω, which is the limit of

$$\begin{pmatrix} A_{1n}A_{1n}^T & A_{1n}\Sigma_{12}A_{2n}^T \\ A_{2n}\Sigma_{21}A_{1n}^T & \sigma^2 A_{2n}A_{2n}^T + A_{2n}\Sigma_{22}A_{2n}^T \end{pmatrix},$$

where $\Sigma_{21} = \Sigma_{12}^T$, $A_{1n}$ is a $q_1 \times s_\alpha$ matrix and $A_{2n}$ is a $q_2 \times s_\beta$ matrix such that

$$\lambda_{\max}(A_{1n}A_{1n}^T) = O(1),\qquad \lambda_{\max}(A_{2n}A_{2n}^T) = O(1).$$

We note that the smoothness condition (5.15) on the misspecified function is strengthened here. To better understand the above theorem, we provide the following two corollaries. The first gives the limiting distribution when both the propensity score and the main-effect model are correctly specified, while the second corresponds to the case where the propensity score is known in advance.

Corollary 6.1. Under the conditions of Theorem 6.2, when the main-effect model is correctly specified, i.e., μ = Φ, $A_{1n}B_{n\alpha}^{1/2}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})$ and $A_{2n}B_{n\beta}^{1/2}(\hat\beta_{M_\beta} - \beta_{0M_\beta})$ are jointly asymptotically normally distributed, with covariance matrix Ω′, the limit of

$$\begin{pmatrix} A_{1n}A_{1n}^T & O \\ O & \sigma^2 A_{2n}A_{2n}^T \end{pmatrix}.$$

Remark 6.5. Comparing the results of Corollary 6.1 and Theorem 6.2, the term $A_{2n}\Sigma_{22}A_{2n}^T$ accounts for the partial specification of model (2.1). In the extreme case where Φ is correctly specified, $\hat\beta_{M_\beta}$ achieves its minimum variance and is asymptotically independent of $\hat\alpha_{M_\alpha}$. In general, we gain efficiency by positing a good working model for Φ. Numerical studies also suggest that a linear working model, Φ = Xθ, is preferable to the constant model. This is in line with our theoretical justification, since W is a diagonal matrix with ith diagonal element $\mu_i - \Phi_i$.

Corollary 6.2. When the propensity score is known, under the conditions of Theorem 6.2 with all α̂'s replaced by $\alpha_0$, with probability tending to 1 as n → ∞, $A_{2n}B_{n\beta}^{1/2}(\hat\beta_{M_\beta} - \beta_{0M_\beta})$ is asymptotically normally distributed with mean 0 and covariance matrix Ω″, which is the limit of

$$\sigma^2 A_{2n}A_{2n}^T + A_{2n}\Sigma_{22}^* A_{2n}^T,$$

where

$$\Sigma_{22}^* = B_{n\beta}^{-1/2}X_{M_\beta}^T W\Delta W X_{M_\beta}B_{n\beta}^{-1/2}.$$

Remark 6.6. An interesting fact implied by Corollary 6.2 is that the asymptotic variance of $\hat\beta_{M_\beta}$ with the estimated propensity score is smaller than that of the same estimator had we known the propensity score in advance. A similar result holds for the asymptotic distribution of the estimated value function in the next section. This is in line with semiparametric theory in the fixed-p case, where the variance of the augmented-IPSW estimator is smaller when we estimate the parameters in the coarsening probability model, even if the true values are known (see Chapter 9 of Tsiatis, 2006). By estimating the propensity score, we effectively borrow information from the linear association between the covariates in $WX_{M_\beta}$ and those in $X_{M_\alpha}$.

7. Evaluation of value function

In this section, we derive a nonparametric estimator of the mean response under the optimal treatment regime. Based on (2.1), define the average population-level response under the regime indexed by β as

$$V_n(\beta) = \frac{1}{n}\sum_{i=1}^n E\left[Y_i\,\big|\,A_i = I(x_i^T\beta > 0), X_i = x_i\right] = \frac{1}{n}\sum_{i=1}^n\left[h_0(x_i) + x_i^T\beta_0\,I(x_i^T\beta > 0)\right],$$

where the treatment decision for the ith patient is $I(x_i^T\beta > 0)$. The mean response under the true optimal regime is denoted by $V_n(\beta_0)$, and it is easy to verify that $\beta_0$ maximizes the function $V_n$.

Similarly to Murphy (2003), we propose to estimate $V_n(\beta_0)$ by

$$\hat{V}_n = \frac{1}{n}\sum_{i=1}^n\left[Y_i + x_i^T\hat\beta\,\{I(x_i^T\hat\beta > 0) - A_i\}\right]. \qquad (7.1)$$
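In practice (7.1) is a one-line computation; a minimal sketch in R, assuming Y, A, the design matrix Xt = cbind(1, X) containing the constant one, and beta_hat from the second step (our names for these objects):

    contrast <- Xt %*% beta_hat              # x_i' beta_hat
    d_hat    <- as.integer(contrast > 0)     # estimated optimal decisions
    V_hat    <- mean(Y + contrast * (d_hat - A))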

This estimator is not doubly robust, but it offers protection against misspecification of the baseline function as well as improved efficiency. It is not doubly robust because we require the propensity score model to be correctly specified to ensure the oracle property of β̂. A key condition guaranteeing the asymptotic normality of (7.1) is given as follows.

Condition 7.1. Assume there exists some constant C′ such that, for all ε > 0,

$$\frac{1}{n}\sum_i I\left(|x_i^T\beta_0| < \varepsilon\right) \le C'\varepsilon.$$

Remark 7.1. The above condition has a similar interpretation to Condition (3.3) in Qian and Murphy (2011), where a random design was utilized. Condition 7.1 requires that the contrast function $x_i^T\beta_0$ not be too close to zero for too large a fraction of subjects; together with the condition $s_\beta = o(n^{1/4})$, it ensures the following stochastic approximation condition:

$$\frac{1}{\sqrt{n}}\sum_i x_i^T\hat\beta\,\{I(x_i^T\hat\beta > 0) - I(x_i^T\beta_0 > 0)\} = o_p(1). \qquad (7.2)$$

Theorem 7.1. Assume that the conditions of Theorem 6.2 hold. If Condition 7.1 holds and the nonsparsity size satisfies $s_\beta = o(n^{1/4})$, then with probability going to 1, $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\nu_0^2$, which is the limit of

$$\sigma^2 + \sigma^2 v_n^T X_{M_\beta}B_{n\beta}^{-1}X_{M_\beta}^T v_n + v_n^T X_{M_\beta}B_{n\beta}^{-1/2}\Sigma_{22}B_{n\beta}^{-1/2}X_{M_\beta}^T v_n, \qquad (7.3)$$

where $v_n = \left[I(x_1^T\beta_0 > 0) - \pi(x_1), \ldots, I(x_n^T\beta_0 > 0) - \pi(x_n)\right]^T/\sqrt{n}$ and $\Sigma_{22}$ is defined in Section 6.2.

Remark 7.2. Note that we only need $s_\beta = o(\sqrt{n})$ to guarantee the weak oracle property of β̂ and the $O(\sqrt{s_\beta/n})$ convergence rate of $\|\hat\beta_{M_\beta} - \beta_{0M_\beta}\|_2$. This is strengthened to $s_\beta = o(n^{1/3})$ to show the asymptotic normality of $\hat\beta_{M_\beta}$. Theorem 7.1 further requires $s_\beta = o(n^{1/4})$ to ensure the approximation condition (7.2).

Remark 7.3. When (7.2) is satisfied, the asymptotic normality of $\hat{V}_n$ follows immediately from the oracle property of the estimator $\hat\beta_{M_\beta}$. The first term $\sigma^2$ in (7.3) is due to the variation of the error terms $e_i$, while the last two terms correspond to the asymptotic variance of $\hat\beta_{M_\beta}$.

We next provide a corollary corresponding to the case where the main-effect model is correctly specified.

Corollary 7.1. In addition to the conditions of Theorem 7.1, if the main-effect model is correct, then $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\nu_1^2$, defined as the limit of

$$\sigma^2 + \sigma^2 v_n^T X_{M_\beta}B_{n\beta}^{-1}X_{M_\beta}^T v_n,$$

where υn is defined in Theorem 7.1.

Similar to the asymptotic distribution of $\hat\beta_{M_\beta}$, the following corollary shows that the proposed estimator is more efficient when the propensity score is estimated by fitting a penalized logistic regression than when the known propensity score is plugged in.

Corollary 7.2. Assume the propensity score is known and the conditions of Theorem 7.1 hold with all α̂'s replaced by $\alpha_0$. Then with probability going to 1, $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\nu_2^2$, which is the limit of

$$\sigma^2 + \sigma^2 v_n^T X_{M_\beta}B_{n\beta}^{-1}X_{M_\beta}^T v_n + v_n^T X_{M_\beta}B_{n\beta}^{-1/2}\Sigma_{22}^* B_{n\beta}^{-1/2}X_{M_\beta}^T v_n,$$

with $v_n$ defined in Theorem 7.1 and $\Sigma_{22}^*$ defined in Corollary 6.2.

By the definition of $v_n$ and the condition $\lambda_{\max}(X_{M_\beta}^T X_{M_\beta}) = O(n)$, the asymptotic variance reaches its minimum when $I(x_i^T\beta_0 > 0)$ is close to the propensity score. We formalize this result in the following corollary.

Corollary 7.3. Under the conditions of Theorem 7.1, if we further assume that

$$\frac{1}{n}\sum_{i=1}^n\left\{I(x_i^T\beta_0 > 0) - \pi(x_i)\right\}^2 = o(1),$$

then with probability going to 1, $\sqrt{n}\{\hat{V}_n - V_n(\beta_0)\}$ is asymptotically normally distributed with variance $\sigma^2$.

Remark 7.4. Such a result is expected from the following intuition: in an observational study, if the clinician or decision maker has a high chance of assigning the optimal treatment to each individual patient, i.e., the propensity score is close to $I(x_i^T\beta_0 > 0)$, the variation in estimating the value function decreases. In other words, the more skillful the clinician or decision maker is, the closer the observed individual response $Y_i$ is to the potential outcome under the optimal treatment regime.

8. Conclusion

In this article, we propose a two-step procedure for estimating the optimal treatment strategy that selects variables and estimates parameters simultaneously in both the propensity score and outcome regression models using penalized regression. Our methodology can handle data sets whose dimensionality grows exponentially fast with the sample size. Oracle properties of the estimators are established. Variable selection is also carried out in the misspecified working model, and new mathematical techniques are developed to study the estimators' properties in a general optimization framework. The estimator is shown to be more efficient when the misspecified working model is "closer" to the conditional mean of the response, although our approach does not require correct specification of the baseline function. Numerical results demonstrate that the proposed estimator enjoys model selection consistency and has overall satisfactory performance.

In the case where there are multiple local solutions of the objective functions (2.3), (2.4) or (2.5), although our asymptotic theory only guarantees the existence of a local minimizer possessing the oracle property, it is worth mentioning that the desired oracle estimator can actually be identified using existing algorithms (see Fan, Xue and Zou, 2014; Wang, Kim and Li, 2013). Theoretical properties can be established in a similar fashion.

The proposed method requires the propensity score model to be correctly specified. In randomized studies, the propensity score is known in advance, so the assumption is automatically satisfied. However, for observational studies there is no such guarantee. In practice, prior information on the treatment decision mechanism used by physicians may help in building a reasonable propensity score model. In addition, model diagnostic tests can be used to check the goodness-of-fit of the posited propensity score model, such as a logistic regression model. In general, this may be easier than checking the goodness-of-fit of the regression model for the response. Moreover, in our current work we assume the design matrix X to be deterministic, mainly for technical convenience. To the best of our knowledge, penalized regression with folded concave penalties has not been studied in random design settings with NP dimensionality. To consider random designs, we would need to impose tail conditions on X, and the derivation of some technical results would need to be modified. This is beyond the scope of the current paper and will be investigated elsewhere.

The current framework focuses on point treatment studies. It would be interesting and practically useful to extend our results to dynamic treatment regimes, although significant effort is needed to handle model misspecification over multiple stages. This is an interesting research topic that needs further investigation.


Appendix

Here, we only give the proof of Theorem 5.1. More technical conditions and proofs for Theorems 6.1, 6.2 and 7.1 are given in the supplementary Appendix. To establish Theorem 5.1, we need the following lemmas. The proofs of these lemmas are also given in the supplementary Appendix.

Lemma 1. Let $z = (z_1, \ldots, z_n)^T$ be an n-dimensional vector of independent random variables with mean 0, and let $a \in \mathbb{R}^n$.

  1. If $z_1, \ldots, z_n$ are bounded in [c, d], then for any ε ∈ (0, ∞),

$$\Pr(|a^T z| > \varepsilon) \le 2\exp\left(-\frac{\varepsilon^2}{2\|a\|_2^2(d - c)^2}\right).$$

  2. If $z_1, \ldots, z_n$ satisfy $\max_i\|z_i\|_{\psi_1} \le \omega$, then for any ε ∈ (0, ∞),

$$\Pr(|a^T z| > \varepsilon) \le 2\exp\left(-\frac{1}{2}\cdot\frac{\varepsilon^2}{2\|a\|_2^2\omega^2 + \|a\|_\infty\,\varepsilon\,\omega}\right).$$

Lemma 2. Define $\mathcal{E} = \cap_{k=1}^{16}\mathcal{E}_k$, where the events $\mathcal{E}_k$ are defined in Appendix G of the supplementary article. Under the conditions of Theorem 5.1, we have $\Pr(\mathcal{E}) \ge 1 - c/(n + p + q)$ for some c > 0.

Notation. Let $Z = \mathrm{diag}(A - \pi)X$, $\hat{Z} = \mathrm{diag}(A - \hat\pi)X$, and

$$\xi_1 = \hat{Z}^T e,\quad \xi_2 = Z^T(\mu - \Phi),\quad \xi_3 = \phi^T(e + Z\beta_0),\quad \xi_4 = Z^T\mathrm{diag}(X\beta_0)\Delta X_{M_\alpha},$$
$$\xi_5 = X^T\left[\mathrm{diag}\{(A - \pi)\circ(A - \pi)\} - \Delta\right]X_{M_\beta},\quad \xi_6(\delta) = Z^T\{\Phi - \Phi(\delta)\},\quad \xi_7(\delta) = \{\phi(\delta) - \phi\}^T(e + Z\beta_0),$$

where $\pi = (\pi(x_1), \ldots, \pi(x_n))^T$. For a given matrix Ψ, the superscript $\Psi^j$ refers to the jth column of Ψ, while the subscript $\Psi_i$ denotes the ith row of Ψ. For convenience, we write Φ(θ) and ϕ(θ) with $\theta = (\theta_{M_\theta}^T, 0^T)^T$ as $\Phi(\theta_{M_\theta})$ and $\phi(\theta_{M_\theta})$.

Proof of Theorem 5.1

We break the proof into three steps. Based on Theorem 1 in Fan and Lv (2011), it suffices to prove the existence of $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$ inside the hypercube

$$\aleph = \left\{(\delta_\beta^T, \delta_\theta^T)^T : \|\delta_\beta - \beta_{0M_\beta}\|_\infty \le n^{-\gamma_\beta}\log n,\ \|\delta_\theta - \theta^*_{M_\theta}\|_\infty \le K n^{-\gamma_\theta}\log n\right\}$$

with K a large constant, conditional on the event $\mathcal{E}$, satisfying

$$\hat{Z}_{M_\beta}^T\{Y - \Phi(\hat\theta) - \hat{Z}\hat\beta\} = n\lambda_{2n}\bar\rho_2(\hat\beta_{M_\beta}), \qquad (A.1)$$
$$\hat\phi_{M_\theta}^T\{Y - \Phi(\hat\theta)\} = n\lambda_{3n}\bar\rho_3(\hat\theta_{M_\theta}), \qquad (A.2)$$
$$\left\|\hat{Z}_{M_\beta^c}^T\{Y - \Phi(\hat\theta) - \hat{Z}\hat\beta\}\right\|_\infty < n\lambda_{2n}\rho_2'(0+), \qquad (A.3)$$
$$\left\|\hat\phi_{M_\theta^c}^T\{Y - \Phi(\hat\theta)\}\right\|_\infty < n\lambda_{3n}\rho_3'(0+), \qquad (A.4)$$
$$\lambda_{\min}(\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}) > n\lambda_{2n}\kappa(\rho_2, \hat\beta_{M_\beta}), \qquad (A.5)$$
$$\lambda_{\min}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta}) > n\lambda_{3n}\kappa(\rho_3, \hat\theta_{M_\theta}). \qquad (A.6)$$

Step 1. We first show the existence of a solution to equations (A.1) and (A.2) inside ℵ for sufficiently large n. For any $\delta = (\delta_1, \ldots, \delta_{s_\beta+s_\theta})^T \in \aleph$, since $d_{n\beta} \gg n^{-\gamma_\beta}\log n$ and $d_{n\theta} \gg n^{-\gamma_\theta}\log n$, we have

$$\min_{1\le j\le s_\beta}|\delta_j| \ge \min_j|\beta_{0j}| - d_{n\beta} = d_{n\beta},\qquad \min_{1\le j\le s_\theta}|\delta_{j+s_\beta}| \ge \min_j|\theta^*_j| - d_{n\theta} = d_{n\theta},$$

and $\mathrm{sgn}(\delta_\beta) = \mathrm{sgn}(\beta_{0M_\beta})$, $\mathrm{sgn}(\delta_\theta) = \mathrm{sgn}(\theta^*_{M_\theta})$. The monotonicity of $\rho_2'(t)$ and $\rho_3'(t)$ gives

$$\left\|n\lambda_{2n}\bar\rho_2(\delta)\right\|_\infty \le n\lambda_{2n}\rho_2'(d_{n\beta}),\qquad \left\|n\lambda_{3n}\bar\rho_3(\delta)\right\|_\infty \le n\lambda_{3n}\rho_3'(d_{n\theta}). \qquad (A.7)$$

We write the left-hand side of (A.1), evaluated at $(\delta_\beta, \delta_\theta)$, as

$$\hat{Z}_{M_\beta}^T\{Y - \Phi(\delta_\theta) - \hat{Z}_{M_\beta}\delta_\beta\} = \xi_{1M_\beta} + \xi_{2M_\beta} + (\hat{Z}_{M_\beta} - Z_{M_\beta})^T\{\mu - \Phi(\delta_\theta)\} + \hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}(\beta_{0M_\beta} - \delta_\beta) - \hat{Z}_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta} + Z_{M_\beta}^T\{\Phi - \Phi(\delta_\theta)\} \equiv I_1 + I_2 + I_3 + I_4 + I_5 + I_6. \qquad (A.8)$$

On the set $\mathcal{E}_3\cap\mathcal{E}_5\cap\mathcal{E}_{13}$, we have

$$\|I_1 + I_2 + I_6\|_\infty = O(\sqrt{n}\log n). \qquad (A.9)$$

Define

$$\eta_1 = (\hat{Z} - Z)^T\{\mu - \Phi(\delta_\theta)\},\qquad \eta_2 = (\hat{Z} - Z)^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}.$$

Note that $\eta_{1M_\beta} = I_3$ in (A.8), which we represent using a second-order Taylor expansion around $\alpha_{0M_\alpha}$:

$$I_3 = X_{M_\beta}^T W(\delta_\theta)\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha}) + \frac{1}{2}r_{I_3}, \qquad (A.10)$$

where $r_{I_3}$ in (A.10) is the second-order remainder, whose jth component is

$$(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T W(\delta_\theta)\Sigma(\tilde\alpha)\,\mathrm{diag}(x^j)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}),$$

where $\Sigma(\tilde\alpha)$ is a diagonal matrix with ith diagonal element $\pi''(x_i^T\tilde\alpha)$ and $\tilde\alpha$ lies on the line segment between $\hat\alpha_{M_\alpha}$ and $\alpha_{0M_\alpha}$. Since π″(·) is a bounded function, we can bound $\|r_{I_3}\|_\infty$ by a constant multiple of

$$\max_j\ (\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|W(\delta_\theta)x^j|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}), \qquad (A.11)$$

whose order of magnitude is $O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n)$ by (5.16).

We decompose $-I_5$ in (A.8) as $\eta_{2M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}$. Using similar arguments, on the set $\mathcal{E}_9$, it follows from (5.17) that

$$\left\|Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}\right\|_\infty \le \max_j(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|x^j\circ X\beta_0|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}) + \|\xi_{4M_\beta}\|_\infty = O(\sqrt{n}\log n + s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.12)$$

Using a Taylor expansion, it is immediate that

$$\|\eta_{2M_\beta}\|_\infty \le \max_j(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|x^j\circ X\beta_0|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}) = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n), \qquad (A.13)$$

by (5.17). Combining (A.12) and (A.13) gives

$$\left\|\hat{Z}_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}\right\|_\infty = O(\sqrt{n}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.14)$$

So far, we have

$$\left\|I_1 + I_2 + I_3 + I_5 + I_6 - X_{M_\beta}^T W(\delta_\theta)\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha})\right\|_\infty = O(\sqrt{n}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n), \qquad (A.15)$$

by (A.9), (A.10), (A.11) and (A.14). Now we approximate $I_4$ by $X_{M_\beta}^T\Delta X_{M_\beta}(\beta_{0M_\beta} - \delta_\beta)$ and bound the error $\|\omega_{M_\beta}\|_\infty$, where $\omega = (\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta} - X_{M_\beta}^T\Delta X_{M_\beta})(\beta_{0M_\beta} - \delta_\beta)$. We decompose it as

$$\omega_{M_\beta} = \hat{Z}_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})(\beta_{0M_\beta} - \delta_\beta) + (\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta}(\beta_{0M_\beta} - \delta_\beta) + (Z_{M_\beta}^T Z_{M_\beta} - X_{M_\beta}^T\Delta X_{M_\beta})(\beta_{0M_\beta} - \delta_\beta) \equiv \omega_{1M_\beta} + \omega_{2M_\beta} + \xi_{5M_\beta}(\beta_{0M_\beta} - \delta_\beta). \qquad (A.16)$$

It follows from a first-order Taylor expansion that the jth element of $\omega_{1M_\beta}$ can be represented as

$$\left[(A - \hat\pi)\circ x^j\circ\{\Delta(\tilde\alpha_{M_\alpha})X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\}\right]^T X_{M_\beta}(\beta_{0M_\beta} - \delta_\beta), \qquad (A.17)$$

where $\Delta(\tilde\alpha_{M_\alpha})$ is a diagonal matrix with ith diagonal element $\pi(x_i, \tilde\alpha_{M_\alpha})\{1 - \pi(x_i, \tilde\alpha_{M_\alpha})\}$, and $\tilde\alpha_{M_\alpha}$ lies on the line segment between $\hat\alpha_{M_\alpha}$ and $\alpha_{0M_\alpha}$. We decompose $x^j$ as the Hadamard product of two vectors, $\bar{x}^j\circ\check{x}^j$, where

$$\bar{x}^j = \left(\sqrt{|x_{1j}|}, \ldots, \sqrt{|x_{nj}|}\right)^T,\qquad \check{x}^j = \left(\mathrm{sgn}(x_{1j})\sqrt{|x_{1j}|}, \ldots, \mathrm{sgn}(x_{nj})\sqrt{|x_{nj}|}\right)^T.$$

Let $\varphi = (A - \hat\pi)\circ\check{x}^j\circ\{\Delta(\tilde\alpha_{M_\alpha})X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\}$; then the quantity in (A.17) is bounded in absolute value by $\|\varphi^T\mathrm{diag}(\bar{x}^j)X_{M_\beta}\|_2\|\beta_{0M_\beta} - \delta_\beta\|_2$, with

$$\left\|\varphi^T\mathrm{diag}(\bar{x}^j)X_{M_\beta}\right\|_2^2 = \varphi^T\mathrm{diag}(\bar{x}^j)X_{M_\beta}X_{M_\beta}^T\mathrm{diag}(\bar{x}^j)\varphi \le \lambda_{\max}\left(X_{M_\beta}^T\mathrm{diag}(|x^j|)X_{M_\beta}\right)\|\varphi\|_2^2. \qquad (A.18)$$

Since $\|A - \hat\pi\|_\infty \le 1$ and the elements of $\Delta(\tilde\alpha_{M_\alpha})$ are bounded, we have

$$\|\varphi\|_2^2 \le \left\|\mathrm{diag}(\check{x}^j)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\right\|_2^2 \le \lambda_{\max}\left\{X_{M_\alpha}^T\mathrm{diag}(|x^j|)X_{M_\alpha}\right\}\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2^2. \qquad (A.19)$$

Combining (A.18) with (A.19) gives

$$\|\omega_{1M_\beta}\|_\infty^2 \le \max_{1\le j\le p}\lambda_{\max}\left\{X_{M_\alpha}^T\mathrm{diag}(|x^j|)X_{M_\alpha}\right\}\|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2^2\ \max_{1\le j\le p}\lambda_{\max}\left\{X_{M_\beta}^T\mathrm{diag}(|x^j|)X_{M_\beta}\right\}\|\delta_\beta - \beta_{0M_\beta}\|_2^2, \qquad (A.20)$$

which yields $\|\omega_{1M_\beta}\|_\infty = O(\sqrt{s_\alpha s_\beta}\,n^{1-\gamma_\alpha-\gamma_\beta}\log^2 n)$ by (B.4) and (B.5).

By the same argument, $\|\omega_{2M_\beta}\|_\infty$ is of the same order. Note that on the set $\mathcal{E}_{11}$,

$$\left\|\xi_{5M_\beta}(\delta_\beta - \beta_{0M_\beta})\right\|_\infty \le \|\xi_{5M_\beta}\|_\infty\|\delta_\beta - \beta_{0M_\beta}\|_\infty = O(s_\beta n^{1-2\gamma_\beta}\log^2 n);$$

these, together with (A.20), yield

$$\|\omega_{M_\beta}\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n). \qquad (A.21)$$

Define the vector-valued function

$$\Psi_1(\delta_\beta, \delta_\theta) = -B_{n\beta}^{-1}\left[\hat{Z}_{M_\beta}^T\{Y - \Phi(\delta_\theta) - \hat{Z}_{M_\beta}\delta_\beta\} - n\lambda_{2n}\bar\rho_2(\delta_\beta)\right] = -B_{n\beta}^{-1}\{I_1 + \cdots + I_6 - n\lambda_{2n}\bar\rho_2(\delta_\beta)\} = \delta_\beta - \beta_{0M_\beta} - B_{n\beta}^{-1}\{I_1 + I_2 + I_3 + \omega_{M_\beta} + I_5 + I_6 - n\lambda_{2n}\bar\rho_2(\delta_\beta)\} \equiv \delta_\beta - \beta_{0M_\beta} + u_\beta; \qquad (A.22)$$

then equation (A.1) is equivalent to $\Psi_1(\delta_\beta, \delta_\theta) = 0$. It follows from (A.7), (A.15) and (A.21) that

$$\|u_\beta\|_\infty \le \sup_{\delta\in H_{\theta^*}}\left\|B_{n\beta}^{-1}X_{M_\beta}^T W(\delta)\Delta X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\right\|_\infty + \left\|B_{n\beta}^{-1}\right\|_\infty\left\{O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n) + O(\sqrt{n}\log n) + n\lambda_{2n}\rho_2'(d_{n\beta})\right\}.$$

By arguments similar to those in the proof of Theorem 2 in Fan and Lv (2011), we have

$$\left\|B_{n\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})\right\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(\sqrt{n}\log n) + n\lambda_{1n}\rho_1'(d_{n\alpha}), \qquad (A.23)$$

on the set $\mathcal{E}_1\cap\mathcal{E}_2$. Thus by (5.10), (B.1), (B.14) and (B.15), we have

$$\|u_\beta\|_\infty \le O\left[b_{\alpha\beta}\left\{s_\alpha n^{-2\gamma_\alpha}\log^2 n + n^{-1/2}\log n + \lambda_{1n}\rho_1'(d_{n\alpha})\right\}\right] + O\left[b_\beta\left\{s_\alpha n^{-2\gamma_\alpha}\log^2 n + s_\beta n^{-2\gamma_\beta}\log^2 n + n^{-1/2}\log n + \lambda_{2n}\rho_2'(d_{n\beta})\right\}\right].$$

Therefore, by (A.20), for sufficiently large n, if $(\delta_\beta - \beta_{0M_\beta})_j = n^{-\gamma_\beta}\log n$, then

$$\Psi_{1j}(\delta_\beta, \delta_\theta) > 0, \qquad (A.24)$$

and if $(\delta_\beta - \beta_{0M_\beta})_j = -n^{-\gamma_\beta}\log n$, then

$$\Psi_{1j}(\delta_\beta, \delta_\theta) < 0. \qquad (A.25)$$

Similarly, we write the left-hand side of (A.2), evaluated at $\delta_\theta$, as

$$\{\phi_{M_\theta}(\delta_\theta) - \phi_{M_\theta}\}^T(e + Z\beta_0) + \xi_{3M_\theta} + \phi_{M_\theta}(\delta_\theta)^T(\mu - \Phi) - \phi_{M_\theta}(\delta_\theta)^T\{\Phi(\delta_\theta) - \Phi\}. \qquad (A.26)$$

It is immediate that

$$\|\xi_{3M_\theta}\|_\infty = O(\sqrt{n}\log n) \qquad (A.27)$$

on the set $\mathcal{E}_5$. The $L_\infty$ norm of the first term in (A.26) is bounded by

$$\sup_{\delta\in H_{\theta^*}}\|\xi_{7M_\theta}(\delta)\|_\infty = O(\sqrt{n}\log n) \qquad (A.28)$$

on the set $\mathcal{E}_{15}$.

Using a second-order Taylor expansion, we approximate the last term in (A.26) by its first-order term $\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)(\delta_\theta - \theta^*_{M_\theta})$. It follows from (5.7) that the $L_\infty$ norm of the remainder is bounded from above by

$$\max_{1\le l\le s_\theta}\lambda_{\max}\left\{(|\phi_l(\tilde\delta_\theta)|)^T\frac{\partial\phi_{M_\theta}(\tilde\delta_\theta)}{\partial\theta_{M_\theta}}\right\}\|\delta_\theta - \theta^*_{M_\theta}\|_2^2 = O(s_\theta n^{1-2\gamma_\theta}\log^2 n), \qquad (A.29)$$

where $\tilde\delta_\theta$ lies on the line segment between $\theta^*_{M_\theta}$ and $\delta_\theta$.

Define $\Psi_2(\delta_\beta, \delta_\theta) = -\{\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)\}^{-1}\left[\phi_{M_\theta}(\delta_\theta)^T\{Y - \Phi(\delta_\theta)\} - n\lambda_{3n}\bar\rho_3(\delta_\theta)\right]$; equation (A.2) is equivalent to $\Psi_2(\delta_\beta, \delta_\theta) = 0$. Similarly to $\Psi_1$, we now show that $\Psi_2(\delta_\beta, \delta_\theta)$ is dominated by $\delta_\theta - \theta^*_{M_\theta}$. Define $u_\theta = \Psi_2(\delta_\beta, \delta_\theta) - \delta_\theta + \theta^*_{M_\theta}$; it follows from (5.1), (5.3), (B.13), (A.26), (A.27), (A.28) and (A.29) that

$$\|u_\theta\|_\infty \le \left\|\{\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)\}^{-1}\right\|_\infty\left\{\|\xi_{3M_\theta}\|_\infty + \|\xi_{7M_\theta}(\delta_\theta)\|_\infty + \left\|\phi_{M_\theta}(\delta_\theta)^T\{\Phi(\delta_\theta) - \Phi\} - \phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)(\delta_\theta - \theta^*_{M_\theta})\right\|_\infty + n\lambda_{3n}\rho_3'(d_{n\theta})\right\} + \left\|\{\phi_{M_\theta}(\delta_\theta)^T\phi_{M_\theta}(\delta_\theta)\}^{-1}\phi_{M_\theta}(\delta_\theta)^T(\mu - \Phi)\right\|_\infty = o(n^{-\gamma_\theta}\log n) + O(n^{-\gamma_\theta}\log n). \qquad (A.30)$$

Therefore, we can find a large constant K < ∞ such that, for n large enough, if $(\delta_\theta - \theta^*_{M_\theta})_j = Kn^{-\gamma_\theta}\log n$, then

$$\Psi_{2j}(\delta_\beta, \delta_\theta) > 0, \qquad (A.31)$$

and if $(\delta_\theta - \theta^*_{M_\theta})_j = -Kn^{-\gamma_\theta}\log n$, then

$$\Psi_{2j}(\delta_\beta, \delta_\theta) < 0. \qquad (A.32)$$

Combining (A.24) and (A.25) with (A.31) and (A.32), an application of Miranda's existence theorem shows that equations (A.1) and (A.2) have a solution $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$ in ℵ.

Step 2. Let $(\hat\beta^T, \hat\theta^T)^T$ be a solution to equations (A.1) and (A.2) with $\hat\beta_{M_\beta^c} = 0$ and $\hat\theta_{M_\theta^c} = 0$. We show that $(\hat\beta^T, \hat\theta^T)^T$ satisfies inequalities (A.3) and (A.4). Decompose the left-hand side of (A.3) as the sum of the following terms:

$$\hat{Z}_{M_\beta^c}^T(Y - \hat\Phi - \hat{Z}_{M_\beta}\hat\beta_{M_\beta}) = \xi_{1M_\beta^c} + \xi_{2M_\beta^c} + Z_{M_\beta^c}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta} + \xi_{5M_\beta^c}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) + \omega_{1M_\beta^c} + \omega_{2M_\beta^c} + \eta_{1M_\beta^c} + X_{M_\beta^c}^T\Delta X_{M_\beta}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) + \eta_{2M_\beta^c} + Z_{M_\beta^c}^T(\Phi - \hat\Phi). \qquad (A.33)$$

On the set $\mathcal{E}_4\cap\mathcal{E}_6\cap\mathcal{E}_{10}\cap\mathcal{E}_{12}$, it is immediate that

$$\left\|\xi_{1M_\beta^c} + \xi_{2M_\beta^c} + \xi_{5M_\beta^c}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) + Z_{M_\beta^c}^T(\Phi - \hat\Phi)\right\|_\infty = O(n^{1-d_\beta}\log n). \qquad (A.34)$$

By (B.4), (B.5) and (A.20), a first-order Taylor expansion gives

$$\|\omega_{1M_\beta^c} + \omega_{2M_\beta^c}\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n) + O(s_\beta n^{1-2\gamma_\beta}\log^2 n). \qquad (A.35)$$

Similarly, it follows from (5.17) and (A.13) that

$$\|\eta_{2M_\beta^c}\|_\infty = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.36)$$

On the set $\mathcal{E}_{10}$, by (5.17) and (A.12), we have

$$\left\|Z_{M_\beta^c}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\beta_{0M_\beta}\right\|_\infty = O(n^{1-d_\beta}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n). \qquad (A.37)$$

Approximating $\eta_{1M_\beta^c}$ by $X_{M_\beta^c}^T W(\hat\theta_{M_\theta})\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha})$, the $L_\infty$ norm of the remainder is bounded from above by

$$\max_j(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha})^T X_{M_\alpha}^T\mathrm{diag}(|W(\hat\theta_{M_\theta})x^j|)X_{M_\alpha}(\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}) = O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n), \qquad (A.38)$$

by (5.16). Let

$$u_\beta = \hat{Z}_{M_\beta^c}^T(Y - \hat\Phi - \hat{Z}_{M_\beta}\hat\beta_{M_\beta}) - X_{M_\beta^c}^T\Delta X_{M_\beta}(\hat\beta_{M_\beta} - \beta_{0M_\beta}) - X_{M_\beta^c}^T W(\hat\theta_{M_\theta})\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha});$$

it follows from (A.33)-(A.38) that

$$\|u_\beta\|_\infty = O\left(n^{1-d_\beta}\log n + s_\alpha n^{1-2\gamma_\alpha}\log^2 n + s_\beta n^{1-2\gamma_\beta}\log^2 n\right). \qquad (A.39)$$

Since $\hat\beta_{M_\beta}$ solves (A.1), we have $\Psi_1(\hat\beta_{M_\beta}, \hat\theta_{M_\theta}) = 0$ and hence

$$\hat\beta_{M_\beta} - \beta_{0M_\beta} = -u_\beta^\dagger, \qquad (A.40)$$

where $u_\beta^\dagger$ denotes $u_\beta$ in (A.22) evaluated at $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$. Combining (A.40) with (A.23) and (A.39) gives

$$\frac{1}{n\lambda_{2n}}\left\|\hat{Z}_{M_\beta^c}^T(Y - \hat\Phi - \hat{Z}_{M_\beta}\hat\beta_{M_\beta})\right\|_\infty \le \frac{1}{n\lambda_{2n}}\Big[\|u_\beta\|_\infty + \left\|X_{M_\beta^c}^T\Delta X_{M_\beta}(X_{M_\beta}^T\Delta X_{M_\beta})^{-1}\right\|_\infty\left\{\|u_\beta^\dagger\|_\infty + \left\|X_{M_\beta}^T W(\hat\theta_{M_\theta})\Delta X_{M_\alpha}(\alpha_{0M_\alpha} - \hat\alpha_{M_\alpha})\right\|_\infty\right\} + \left\|X_{M_\beta^c}^T W_\beta W(\hat\theta_{M_\theta})X_{M_\alpha}(X_{M_\alpha}^T\Delta X_{M_\alpha})^{-1}\right\|_\infty\left\{n\lambda_{1n}\rho_1'(d_{n\alpha}) + O(\sqrt{n}\log n) + O(s_\alpha n^{1-2\gamma_\alpha}\log^2 n)\right\}\Big] \le o(1) + C\rho_2'(0+),$$

by (5.11), (B.3), (B.16) and (B.19). Since C < 1, (A.3) is satisfied for sufficiently large n.

We now verify (A.4), decomposing $\hat\phi_{M_\theta^c}^T(Y - \hat\Phi)$ as the sum of

$$\{\hat\phi_{M_\theta^c} - \phi_{M_\theta^c}\}^T(e + Z\beta_0) + \xi_{3M_\theta^c} + \hat\phi_{M_\theta^c}^T(\mu - \Phi) + \hat\phi_{M_\theta^c}^T(\Phi - \hat\Phi). \qquad (A.41)$$

On the set $\mathcal{E}_8\cap\mathcal{E}_{16}$, we have

$$\left\|\xi_{3M_\theta^c} + \{\hat\phi_{M_\theta^c} - \phi_{M_\theta^c}\}^T(e + Z\beta_0)\right\|_\infty = O(n^{1-d_\theta}\log n). \qquad (A.42)$$

Similar to (A.29), a second-order Taylor expansion gives

$$\left\|\hat\phi_{M_\theta^c}^T(\hat\Phi - \Phi) - \hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\theta_{M_\theta} - \theta^*_{M_\theta})\right\|_\infty = O(s_\theta n^{1-2\gamma_\theta}\log^2 n), \qquad (A.43)$$

by (5.7). Since $(\hat\beta_{M_\beta}, \hat\theta_{M_\theta})$ solves $\Psi_2(\delta_\beta, \delta_\theta) = 0$, it follows from (A.30) that

$$\left\|\hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\theta_{M_\theta} - \theta^*_{M_\theta}) - \hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta})^{-1}\hat\phi_{M_\theta}^T(\mu - \Phi)\right\|_\infty \le \left\|\hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta})^{-1}\right\|_\infty\left\{O(\sqrt{n}\log n + s_\theta n^{1-2\gamma_\theta}\log^2 n) + n\lambda_{3n}\rho_3'(d_{n\theta})\right\}. \qquad (A.44)$$

By (A.41)-(A.44) and conditions (5.2), (5.4), (B.15) and (B.20), the left-hand side of (A.4) can be bounded by

$$\frac{1}{n\lambda_{3n}}\left\{O(n^{1-d_\theta}\log n) + O(s_\theta n^{1-2\gamma_\theta}\log^2 n)\right\} + \frac{1}{n\lambda_{3n}}\left\|\hat\phi_{M_\theta^c}^T\hat\phi_{M_\theta}(\hat\phi_{M_\theta}^T\hat\phi_{M_\theta})^{-1}\right\|_\infty\left\{O(\sqrt{n}\log n) + O(s_\theta n^{1-2\gamma_\theta}\log^2 n) + n\lambda_{3n}\rho_3'(d_{n\theta})\right\} + \frac{1}{n\lambda_{3n}}\left\|\hat\phi_{M_\theta^c}^T\{E - P(\hat\phi_{M_\theta})\}(\mu - \Phi)\right\|_\infty \le o(1) + C\rho_3'(0+),$$

for C < 1. Therefore (A.4) is satisfied.

Step 3. We now show that the second-order conditions (A.5) and (A.6) hold. Because (A.6) is directly implied by (B.17), it suffices to show that $\lambda_{\min}(\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}) \ge \lambda_{\min}(X_{M_\beta}^T\Delta X_{M_\beta}) - o(n)$ for sufficiently large n. Since $(\hat{Z}_{M_\beta} - Z_{M_\beta})^T(\hat{Z}_{M_\beta} - Z_{M_\beta})$ is positive semi-definite, we have

$$\lambda_{\min}(\hat{Z}_{M_\beta}^T\hat{Z}_{M_\beta}) \ge \lambda_{\min}(X_{M_\beta}^T\Delta X_{M_\beta}) + \lambda_{\min}\left\{(\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta}) + \xi_{5M_\beta}\right\}. \qquad (A.45)$$

Since for any symmetric matrix Ψ the absolute value of the minimum eigenvalue can be bounded as

$$|\lambda_{\min}(\Psi)| \le \sqrt{\lambda_{\max}(\Psi^2)} \le \sqrt{\|\Psi\|_\infty\|\Psi\|_1} = \|\Psi\|_\infty,$$

(A.5) follows if we can show that $\|\xi_{5M_\beta}\|_\infty + \|(\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\|_\infty = o(n)$. This is immediate, since

$$\|\xi_{5M_\beta}\|_\infty = O(n^{1/2+\gamma_\beta}/\log n) = o(n)$$

on the set $\mathcal{E}_{11}$. Similar to (A.20), $\|(\hat{Z}_{M_\beta} - Z_{M_\beta})^T Z_{M_\beta} + Z_{M_\beta}^T(\hat{Z}_{M_\beta} - Z_{M_\beta})\|_\infty$ can be bounded from above by

$$2\max_{j\le s_\beta}\sqrt{\lambda_{\max}\{X_{M_\beta}^T\mathrm{diag}(|x^j|)X_{M_\beta}\}\,\lambda_{\max}\{X_{M_\alpha}^T\mathrm{diag}(|x^j|)X_{M_\alpha}\}}\ \|\hat\alpha_{M_\alpha} - \alpha_{0M_\alpha}\|_2^2, \qquad (A.46)$$

which is $O(\sqrt{s_\alpha s_\beta}\,n^{1-\gamma_\alpha}\log n) = o(n)$, implied by the constraint $\max(l_1, l_2) < \gamma_\alpha$. This completes the proof.

Footnotes

* The research of Chengchun Shi and Rui Song is supported in part by Grants NSF-DMS 1309465 and 1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is supported in part by Grant NCI P01 CA142538.

Supplementary Material

Supplement to “Robust Learning for Optimal Treatment Decision with NP-Dimensionality”

(doi: 10.1214/16-EJS1178SUPP; .pdf).

References

  1. Bunea F, Tsybakov A, Wegkamp M. Sparsity oracle inequalities for the Lasso. Electron J Stat. 2007;1:169–194. MR2312149.
  2. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19:317–343. MR2757118.
  3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. MR1946581.
  4. Fan A, Lu W, Song R. Sequential advantage selection for optimal treatment regime. Ann Appl Stat. 2015, to appear. MR3480486.
  5. Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory. 2011;57:5467–5484. MR2849368.
  6. Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. Ann Statist. 2014;42:819–849. MR3210988.
  7. Gunter L, Zhu J, Murphy SA. Variable selection for qualitative interactions. Stat Methodol. 2011;8:42–55. MR2741508.
  8. Li KC, Duan N. Regression analysis under link violation. Ann Statist. 1989;17:1009–1052. MR1015136.
  9. Lu W, Zhang HH, Zeng D. Variable selection for optimal treatment decision. Stat Methods Med Res. 2013;22:493–504. MR3190671.
  10. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65:331–366. MR1983752.
  11. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Statist. 2011;39:1180–1210. MR2816351.
  12. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. Lecture Notes in Statistics, Vol. 179. New York: Springer; 2004. pp. 189–326. MR2129402.
  13. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.
  14. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
  15. Shi C, Song R, Lu W. Supplement to "Robust learning for optimal treatment decision with NP-dimensionality". 2016. doi: 10.1214/16-EJS1178SUPP.
  16. Song R, Kosorok M, Zeng D, Zhao Y, Laber E, Yuan M. On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat. 2015;4:59–68. MR3405390.
  17. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267–288.
  18. Tsiatis AA. Semiparametric Theory and Missing Data. Springer Series in Statistics. New York: Springer; 2006. MR2233926.
  19. Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension. Ann Statist. 2013;41:2505–2536. MR3127873.
  20. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
  21. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. MR0640163.
  22. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942. MR2604701.
  23. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. MR3040007.
  24. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107:1106–1118. MR3010898.
