Abstract
In order to identify important variables involved in making optimal treatment decisions, Lu, Zhang and Zeng (2013) proposed a penalized least squares regression framework for a fixed number of predictors, which is robust against misspecification of the conditional mean model. Two problems arise: (i) in a world of explosively big data, effective methods are needed to handle ultra-high dimensional data sets in which, for example, the dimension of predictors is of non-polynomial (NP) order of the sample size; (ii) both the propensity score and conditional mean models need to be estimated from data under NP dimensionality.
In this paper, we propose a robust procedure for estimating the optimal treatment regime under NP dimensionality. In both steps, penalized regressions are employed with the non-concave penalty function, where the conditional mean model of the response given predictors may be misspecified. The asymptotic properties, such as weak oracle properties, selection consistency and oracle distributions, of the proposed estimators are investigated. In addition, we study the limiting distribution of the estimated value function for the obtained optimal treatment regime. The empirical performance of the proposed estimation method is evaluated by simulations and an application to a depression dataset from the STAR*D study.
Keywords and phrases: Non-concave penalized likelihood, optimal treatment strategy, oracle property, variable selection
1. Introduction
Personalized medicine, which has gained much attention over the past few years, is a medical paradigm that emphasizes systematic use of individual patient information to optimize that patient's health care. In this paradigm, the primary interest lies in identifying the optimal treatment strategy that assigns the best treatment to a patient based on his/her observed covariates. Formally speaking, a treatment regime is a function that maps the sample space of patients' covariates to the set of treatments.
There is a growing literature on estimating optimal individualized treatment regimes. Existing methods can be cast into model-based methods and direct search methods. Popular model-based methods include Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010) and A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003), where Q-learning models the conditional mean of the response given predictors and treatment, while A-learning models the interaction between treatment and predictors, better known as the contrast function. The advantage of A-learning is robustness against misspecification of the baseline mean function, provided that the propensity score model is correctly specified. Recently, Zhang et al. (2012) proposed inverse propensity score weighted (IPSW) and augmented-IPSW estimators to directly maximize the mean potential outcome under a given treatment regime, i.e., the value function. Moreover, Zhao et al. (2012) recast the estimation of the value function from a classification perspective and used machine learning tools to directly search for optimal treatment regimes.
The rapid advances and breakthroughs in technology and communication systems make it possible to gather an extraordinarily large number of prognostic factors for each individual. For example, in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, 305 covariates are collected from each patient. With such data at hand, it is of significant importance to organize and integrate the information relevant to making optimal individualized treatment decisions, which makes variable selection an emerging need for implementing personalized medicine. There have been extensive developments of variable selection methods for prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), MCP (Zhang, 2010) and many others in the context of penalized regression. Their associated inferential properties have been studied when the number of predictors is fixed, diverging with the sample size, or of non-polynomial order of the sample size.
In contrast to the large amount of work on variable selection for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Among the available methods, Gunter, Zhu and Murphy (2011) proposed variable ranking methods for the marginal qualitative interaction of predictors with treatment. Fan, Lu and Song (2015) developed a sequential advantage selection method that extends the marginal ranking methods by selecting important variables with qualitative interactions in a sequential fashion. However, no theoretical justifications are provided for these methods. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime; however, the associated variable selection properties, such as selection consistency, convergence rates and oracle distributions, are not studied. Lu, Zhang and Zeng (2013) introduced a new penalized least squares regression framework, which is robust against misspecification of the conditional mean function. However, they only studied the case where the number of covariates is fixed and the propensity score model is known, as in randomized clinical trials. Song et al. (2015) proposed penalized outcome weighted learning for the case with a fixed number of predictors.
In this paper, we study the penalized least squares regression framework considered in Lu, Zhang and Zeng (2013) when the number of predictors is of non-polynomial (NP) order of the sample size. In addition, we consider a more general situation where the propensity score model may depend on predictors and needs to be estimated from data, as is common in observational studies. A two-step estimation procedure is developed. In the first step, penalized regression models are fitted for the propensity score and the conditional mean of the response given predictors. In the second step, the optimal treatment regime is estimated using penalized least squares regression with the estimated propensity score and conditional mean models obtained in the first step. There are several challenges in both the numerical implementation and the derivation of theoretical properties, such as weak oracle and oracle properties, for the proposed estimation procedure. First, since the posited model for the conditional mean of the response given predictors may be misspecified, the associated estimation and variable selection properties under model misspecification with NP dimensionality are not standard. Second, it is unknown how the asymptotic properties of the estimators for the optimal treatment regime obtained in the second step depend on the estimated propensity score and conditional mean models obtained in the first step under NP dimensionality. To our knowledge, these two challenges have not been studied in the literature. Moreover, we estimate the value function of the estimated optimal regime and study the estimator's theoretical properties.
The remainder of the paper is organized as follows. The proposed method for estimating the optimal treatment regime is introduced in Section 2. Simulation results are presented in Section 3, and an application to a dataset from the STAR*D study is illustrated in Section 4. Sections 5 and 6 establish the weak oracle and oracle properties of the resulting estimators, respectively. The estimator for the value function of the estimated optimal treatment regime is studied in Section 7, followed by a concluding section. All technical proofs are given in the Appendix.
2. Method
Let Y denote the response, A ∈ 𝒜 denote the treatment received, where 𝒜 is the set of available treatment options, and X denote the baseline covariates, including the constant 1. For ease of presentation, we focus on a binary treatment, i.e., 𝒜 = {0, 1}, with 0 for the standard treatment and 1 for the new treatment. We consider the following semiparametric model:
(2.1) Y = h0(X) + A·XTβ0 + e,
where h0(X) is an unspecified baseline function, β0 is a p-dimensional vector of regression coefficients and e is an independent error with mean 0 and variance σ2. Under the assumptions of stable unit treatment value (SUTVA) and no unmeasured confounders (Rubin, 1974), it can be shown that the optimal treatment regime dopt(x) for patients with baseline covariates X = x takes the form

dopt(x) = I(xTβ0 > 0),

where I(·) is the indicator function.
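To see why this form is optimal, note that under SUTVA and no unmeasured confounders, model (2.1) yields the following decomposition of the value of an arbitrary regime d (a one-line verification):

```latex
E\{Y^{\star}(d)\} \;=\; E\bigl\{h_0(X) + d(X)\,X^{T}\beta_0\bigr\}
               \;=\; E\{h_0(X)\} + E\bigl\{d(X)\,X^{T}\beta_0\bigr\}.
```

The first term does not involve d, and the second term is maximized pointwise by taking d(x) = 1 exactly when xTβ0 > 0, which gives dopt.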
Our primary interest is in estimating the regression coefficients β0 defining the optimal treatment regime. Let π(x) = P(A = 1|X = x) be the propensity score. We assume a logistic regression model for π(x):
(2.2) π(x, α0) = exp(xTα0)/{1 + exp(xTα0)},
with p-dimensional parameter α0. Here, we allow the propensity score to depend on covariates, which is common in observational studies, and the parameter α0 can be estimated from the data; for randomized clinical trials, π(x, α0) is a constant. We assume that the majority of the elements of β0 and α0 are zero and refer to the supports supp(β0) and supp(α0) as the true underlying sparse models.
Consider a study with n subjects. Assume X = (x1, …, xn)T is deterministic. The observed data consist of {(Yi, Ai, xi) : i = 1, …, n}. Define μ(x) = h0(x) + π(x, α0)xTβ0, the conditional mean of the response given covariates X = x. We propose the following two-step procedure to estimate the optimal treatment regime. In the first step, we posit a model Φ(x, θ) for the conditional mean function μ(x) and consider penalized estimation of the propensity score and conditional mean models as follows.
Define
(2.3) α̂ = argminα (1/n) Σi=1n [log{1 + exp(xiTα)} − AixiTα] + Σj=1p ρ1(|αj|, λ1n),
and
(2.4) θ̂ = argminθ (1/2n) Σi=1n {Yi − Φ(xi, θ)}2 + Σj=1q ρ2(|θj|, λ2n),
where αj and θj refer to the jth elements of α and θ, q is the dimension of θ, and ρ1 and ρ2 are folded-concave penalty functions with tuning parameters λ1n and λ2n, respectively. We allow p and q to be of NP order of n and assume log p = O(n1−2dβ) and log q = O(n1−2dθ) for some dβ and dθ, respectively. The posited model Φ(x, θ) may be misspecified.
Define Φ̂i = Φ(xi, θ̂) and π̂i = π(xi, α̂). In the second step, we consider the following penalized least square estimation:
(2.5) β̂ = argminβ (1/2n) Σi=1n {Yi − Φ̂i − (Ai − π̂i)xiTβ}2 + Σj=1p ρ3(|βj|, λ3n),
where ρ3 is a folded-concave penalty function with the tuning parameter λ3n. Here the folded-concave penalty functions ρ1, ρ2 and ρ3 are assumed to satisfy the following condition:
Condition 2.1. ρ(t, λ) is increasing and concave in t ∈ [0, ∞), and has a continuous derivative ρ′(t, λ) with ρ′(0+, λ) > 0. In addition, ρ′(t, λ) is increasing in λ ∈ [0, ∞) and ρ′(0+, λ) is independent of λ.
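As a concrete example, the SCAD penalty of Fan and Li (2001) satisfies Condition 2.1; its derivative is

```latex
\rho'(t, \lambda) \;=\; \lambda \Bigl\{ I(t \le \lambda)
  + \frac{(a\lambda - t)_{+}}{(a - 1)\lambda}\, I(t > \lambda) \Bigr\},
  \qquad t \ge 0,\; a > 2,
```

with a = 3.7 as the conventional default. The derivative is constant (at λ) near the origin, decays linearly, and vanishes for t ≥ aλ, so large coefficients are left nearly unpenalized.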
LASSO and MCP also satisfy Condition 2.1. In our implementation, we use the SCAD penalty. We adopt a two-step estimation procedure because of its computational simplicity. Alternatively, one could jointly estimate the parameters θ in the conditional mean model and β in the contrast function in a single penalized regression; however, this joint approach would require more computational effort, since the tuning parameters for θ and β would need to be selected simultaneously. In contrast, our two-step method requires only a single tuning parameter at each step and thus can be easily implemented with existing software, for example, the R package ncvreg.
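The following is a minimal sketch, not the authors' code, of how the two-step procedure can be carried out with ncvreg (SCAD penalty, 10-fold cross-validation). Here X is the n × p covariate matrix without the constant column (ncvreg adds its own intercept), A is the binary treatment, Y is the response, and a linear working model Φ(x, θ) = xTθ is assumed.

```r
library(ncvreg)

## Step 1a: penalized logistic regression for the propensity score, cf. (2.3)
cv_ps  <- cv.ncvreg(X, A, family = "binomial", penalty = "SCAD", nfolds = 10)
pi_hat <- predict(cv_ps$fit, X, type = "response", lambda = cv_ps$lambda.min)

## Step 1b: penalized least squares for the conditional mean, cf. (2.4)
cv_mu   <- cv.ncvreg(X, Y, penalty = "SCAD", nfolds = 10)
Phi_hat <- predict(cv_mu$fit, X, lambda = cv_mu$lambda.min)

## Step 2: penalized least squares for the contrast coefficients, cf. (2.5);
## regress Y - Phi_hat on (A - pi_hat) * (1, x)
Z    <- (A - pi_hat) * cbind(1, X)  # each column scaled elementwise by (A - pi_hat)
cv_b <- cv.ncvreg(Z, Y - Phi_hat, penalty = "SCAD", nfolds = 10)
beta <- coef(cv_b)[-1]              # drop ncvreg's own (nuisance) intercept

## Estimated optimal regime: treat iff (1, x)' beta > 0
d_hat <- as.numeric(cbind(1, X) %*% beta > 0)
```

Note that ncvreg penalizes every column of Z, including the contrast intercept column (A − π̂)·1; this is immaterial for the sketch but can be changed via the penalty.factor argument.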
3. Numerical studies
In this section, we evaluate the numerical performance of the proposed estimators in various settings. We generated the propensity score from the logistic regression model (2.2), with a single important covariate whose coefficient is 1.5. We chose three forms for the baseline function h0(x): a simple linear form, a quadratic form and a complex nonlinear form,
Model I: ,
Model II: ,
Model III: ,
where X is a p-dimensional vector of covariates and X̃ = (1, XT)T. We set p = 1000. Covariates were generated independently from two distributions: standard normal or a shifted exponential distribution with mean 0 and variance 1.
For each model, the first two covariates were chosen as important variables in both the baseline mean function and the contrast function, with θ0 = (−2, −1, 0, …, 0)T and β0 = (0, −1.5, 1.5, 0, …, 0)T. We considered two sample sizes, n = 300 and n = 500, and conducted 1000 replications for each scenario. In our method, we fitted a linear model for Φ(X, θ) and used the SCAD penalty for variable selection; the tuning parameter was chosen by 10-fold cross-validation.
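For concreteness, a sketch of the covariate generation (the shifted exponential is Exp(1) − 1, which has mean 0 and variance 1):

```r
n <- 300; p <- 1000
X_normal <- matrix(rnorm(n * p), n, p)      # standard normal covariates
X_expo   <- matrix(rexp(n * p) - 1, n, p)   # shifted exponential covariates
```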
To evaluate the proposed estimator, we also compared our method with penalized Q-learning using the SCAD penalty. Specifically, we fitted a linear model with baseline covariate effects and treatment–covariate interactions; this model is correctly specified under Model I but misspecified under Models II and III.
Let β̂ and β̃ denote our estimator and the penalized Q-learning estimator, respectively. We report the L2 losses of β̂ and β̃, the number of missed important variables (denoted as FN), the number of selected noise variables (denoted as FP) and the average percentage of correct decisions (denoted as PCD), defined as the proportion of subjects for whom the treatment rule d̂(x) = I(xTβ̂ > 0) or d̃(x) = I(xTβ̃ > 0) agrees with the true optimal rule. In addition, we estimated E{Y⋆(d̂)}, E{Y⋆(d̃)} and E{Y⋆(dopt)}, the value functions of the optimal treatment regimes estimated by our method and by penalized Q-learning, and of the true optimal regime, respectively, using Monte Carlo simulation. For a given treatment rule d(x), we computed E{Y⋆(d)} by averaging the responses of 20000 subjects generated from the true model with A determined by d(x). We report the averages of these mean responses over 1000 replications as well as their standard deviations.
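A hypothetical sketch of this evaluation is given below; generate_X() and generate_Y() stand in for the data-generating mechanisms of Models I–III, which are not reproduced here.

```r
# PCD: proportion of test subjects for whom the estimated rule agrees
# with the true optimal rule; value: Monte Carlo estimate of E{Y*(d_hat)}
# based on 20000 subjects whose treatment is assigned by the rule.
evaluate_regime <- function(beta_hat, beta0, n_test = 20000) {
  Xt    <- generate_X(n_test)                # test covariates, with leading 1
  d_hat <- as.numeric(Xt %*% beta_hat > 0)   # estimated rule
  d_opt <- as.numeric(Xt %*% beta0 > 0)      # true optimal rule
  list(PCD   = mean(d_hat == d_opt),
       value = mean(generate_Y(Xt, A = d_hat)))
}
```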
Table 1 summarizes the results. The penalized Q-learning method performs quite well under Model I, where the fitted linear model is correctly specified, and, as expected, it is more efficient than the proposed method. For example, when covariates are i.i.d. normal and n = 300, its PCD is around 99.3% and its estimated value function is very close to the true optimum E{Y⋆(dopt)}. In this setting, the PCD of our proposed method is 97.5% and the estimated value function is slightly lower.
Table 1. Simulation results for L2 loss, FN, FP, PCD and values.

| Method | Measure | n | Model I | Model II | Model III |
|---|---|---|---|---|---|
| Robust learning, covariates i.i.d. normal | L2 loss | 300 | 0.276 | 1.743 | 1.171 |
| | | 500 | 0.189 | 1.453 | 0.700 |
| | FP | 300 | 5.104 | 9.148 | 12.481 |
| | | 500 | 4.143 | 9.742 | 12.616 |
| | FN | 300 | 0.000 | 0.893 | 0.125 |
| | | 500 | 0.000 | 0.471 | 0.002 |
| | PCD | 300 | 0.975 | 0.734 | 0.834 |
| | | 500 | 0.983 | 0.789 | 0.904 |
| | EY*(d̂) | 300 | 1.842 (0.021) | 4.544 (0.157) | 2.716 (0.089) |
| | | 500 | 1.845 (0.019) | 4.643 (0.116) | 2.797 (0.048) |
| | EY*(dopt) | | 1.847 | 4.846 | 2.847 |
| Penalized Q-learning, covariates i.i.d. normal | L2 loss | 300 | 0.080 | 4.861 | 1.729 |
| | | 500 | 0.061 | 4.928 | 1.833 |
| | FP | 300 | 0.001 | 8.191 | 7.745 |
| | | 500 | 0.000 | 4.438 | 7.972 |
| | FN | 300 | 0.000 | 0.050 | 0.757 |
| | | 500 | 0.000 | 0.006 | 0.553 |
| | PCD | 300 | 0.993 | 0.550 | 0.714 |
| | | 500 | 0.994 | 0.538 | 0.690 |
| | EY*(d̃) | 300 | 1.846 (0.021) | 4.117 (0.165) | 2.508 (0.192) |
| | | 500 | 1.846 (0.020) | 4.091 (0.093) | 2.457 (0.204) |
| | EY*(dopt) | | 1.847 | 4.846 | 2.847 |
| Robust learning, covariates i.i.d. exponential | L2 loss | 300 | 0.290 | 1.768 | 1.186 |
| | | 500 | 0.199 | 1.495 | 0.730 |
| | FP | 300 | 6.596 | 9.700 | 13.240 |
| | | 500 | 4.972 | 10.512 | 13.932 |
| | FN | 300 | 0.000 | 0.793 | 0.142 |
| | | 500 | 0.000 | 0.466 | 0.003 |
| | PCD | 300 | 0.958 | 0.724 | 0.809 |
| | | 500 | 0.971 | 0.761 | 0.871 |
| | EY*(d̂) | 300 | 1.744 (0.018) | 4.500 (0.179) | 2.670 (0.095) |
| | | 500 | 1.747 (0.018) | 4.562 (0.161) | 2.736 (0.041) |
| | EY*(dopt) | | 1.751 | 4.751 | 2.783 |
| Penalized Q-learning, covariates i.i.d. exponential | L2 loss | 300 | 0.264 | 2.580 | 2.225 |
| | | 500 | 0.121 | 3.236 | 2.408 |
| | FP | 300 | 0.003 | 12.257 | 13.234 |
| | | 500 | 0.000 | 21.383 | 15.479 |
| | FN | 300 | 0.045 | 0.824 | 0.288 |
| | | 500 | 0.005 | 0.377 | 0.072 |
| | PCD | 300 | 0.954 | 0.610 | 0.609 |
| | | 500 | 0.978 | 0.595 | 0.584 |
| | EY*(d̃) | 300 | 1.744 (0.018) | 4.500 (0.179) | 2.670 (0.095) |
| | | 500 | 1.743 (0.037) | 4.197 (0.289) | 2.201 (0.224) |
| | EY*(dopt) | | 1.751 | 4.751 | 2.783 |
However, for Models II and III, penalized Q-learning can be substantially biased and performs much worse than the proposed method. Taking Model II as an example, when covariates are normal and n = 300, ‖β̃ − β0‖2 = 4.86, approximately three times as large as ‖β̂ − β0‖2. The PCD of the treatment regime estimated by penalized Q-learning is 55.0%, only slightly better than a random guess, whereas the PCD of our proposed method is 73.4% in this scenario. Moreover, as the sample size increases, the performance of penalized Q-learning deteriorates further; this is due to the misspecification of the baseline mean function. For our method, the PCD increases markedly with the sample size, and the L2 loss and the average number of missed important variables are also greatly reduced. This demonstrates the robustness of the proposed method against misspecification of the baseline mean function.
4. Real data example
We applied our method to data from the STAR*D study of 4041 patients with nonpsychotic major depressive disorder (MDD). The aim of the study was to determine the effectiveness of different treatments for patients who did not respond to an initial medication. At Level 1, all patients received citalopram (CIT), a selective serotonin reuptake inhibitor (SSRI). After 8–12 weeks, up to three more levels of treatment were offered to participants whose previous treatment did not give an acceptable response. Available treatments at Level 2 included switching to sertraline (SER), venlafaxine (VEN), bupropion (BUP) or cognitive therapy (CT), as well as augmenting CIT with one more treatment. At Level 2A, switch options to VEN or BUP were provided for patients receiving CT but without sufficient improvement. Four treatments were available at Level 3 for participants without the anticipated response: medication switch to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Finally, treatment with tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN) was provided at Level 4 for those without sufficient improvement at Level 3.
Here, we focused on the subset of patients receiving BUP (coded as 1) or SER (coded as 0) at Level 2. The outcome of interest was the 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16), which measures the severity of a patient's depressive symptoms. The maximum value of QIDS-C16 is 24 and its distribution was highly skewed; hence, we took the transformation Yi = log(25 − QIDS-C16) as our response, so that a larger value of Yi indicates a better response. All baseline variables at Level 1 and intermediate outcomes at Level 2 were included as covariates, yielding 305 covariates in total for each patient. Of the 383 patients receiving BUP or SER at Level 2, only 319 have complete records of all 305 covariates and the response; among them, 153 were treated with BUP and 166 with SER. Our proposed method selected 14 variables as important for the treatment decision. We then re-estimated the coefficients of these variables by solving the A-learning estimating equations (Robins, 2004) and obtained the resulting estimated optimal treatment regime.
To examine the performance of the estimated optimal treatment regime, we compared it with the fixed regimes that assign all patients to either BUP or SER, in terms of the estimated value functions obtained by the IPSW method (Zhang et al., 2012). The estimated value functions are given in Table 2. In addition, we report 95% confidence intervals, based on 500 bootstrap samples, for the difference between the estimated value of the obtained optimal regime and that of each fixed regime. Our estimated optimal treatment regime gives a larger estimated value than either fixed regime. The difference is significant at the 5% level when compared with the BUP treatment, but less so when compared with the SER treatment. One reason is that our estimated optimal regime assigns the majority of patients (about two-thirds) to SER; see Table 3 for the numbers of patients receiving BUP or SER according to the estimated optimal regime.
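For reference, the IPSW value estimator of Zhang et al. (2012) used in this comparison takes, up to the normalization convention, the form

```latex
\hat{V}_{\mathrm{IPSW}}(d) \;=\;
\frac{\sum_{i=1}^{n} C_i Y_i / \hat{\pi}_{d,i}}{\sum_{i=1}^{n} C_i / \hat{\pi}_{d,i}},
\qquad
C_i = I\{A_i = d(x_i)\}, \quad
\hat{\pi}_{d,i} = \hat{\pi}_i\, d(x_i) + (1 - \hat{\pi}_i)\{1 - d(x_i)\},
```

where π̂i is the estimated propensity score; each patient whose observed treatment agrees with the regime is weighted to stand in for all patients whom d would assign the same treatment.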
Table 2. Estimated value functions and confidence intervals for the difference of the estimated values.

| Treatment regime | Estimated value function | Diff | 95% CI on Diff |
|---|---|---|---|
| Estimated optimal regime | 3.10 | | |
| BUP | 2.55 | 0.55 | [0.07, 1.13] |
| SER | 2.80 | 0.30 | [−0.08, 0.64] |
Table 3. Number of patients receiving BUP or SER, according to the estimated optimal treatment regime.

| | Receives BUP | Receives SER | Total |
|---|---|---|---|
| Assigns BUP | 66 | 50 | 116 |
| Assigns SER | 93 | 110 | 203 |
| Total | 153 | 160 | 319 |
In addition, as suggested by a referee, we examined the effect of missing data. Specifically, we deleted one patient whose response was missing and imputed all missing covariate values using the R package missForest, available on CRAN, which uses random forests trained on the observed entries of the design matrix to predict the missing values. The optimal treatment regime obtained from the imputed data was similar to that from the complete-case analysis above: it selected 14 variables, 11 of which were also included in the estimated optimal treatment regime without imputation. Moreover, the bootstrap results suggested that the estimated value of the estimated optimal treatment regime is significantly larger than those of the fixed treatment regimes at the 0.05 significance level. Since the results are similar, we omit them here.
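A minimal sketch of this imputation step (the object X_with_NA, the covariate matrix with missing entries, is illustrative):

```r
library(missForest)
imp   <- missForest(X_with_NA)  # iterative random-forest imputation
X_imp <- imp$ximp               # completed covariate matrix
```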
5. Non-asymptotic weak oracle properties
In this section we show that the proposed estimators enjoy the weak oracle property; that is, α̂, θ̂ and β̂ defined in (2.3)–(2.5) are sign consistent with probability tending to 1 and are consistent with respect to the L∞ norm. The weak oracle property of θ̂ is established in the sense that θ̂ converges to a least false parameter θ⋆ when the main-effect model is misspecified.
Theorem 5.1 provides the main results; some regularity conditions are discussed in Subsections 5.1 and 5.2. A major technical challenge in deriving the weak oracle property of β̂ is to analyze the deviation in (5.18), for which we develop a general empirical process result in the supplementary article (Shi, Song and Lu, 2016). This result is of independent interest and can be used to analyze many other high-dimensional semiparametric models in which the index parameter of an empirical process is a plug-in estimator. The following notation is introduced to simplify the presentation.
Let 1 denote a vector of ones, E the identity matrix and O the zero matrix. For any matrix Ψ, let P(Ψ) denote the projection matrix Ψ(ΨTΨ)−1ΨT and ΨM the submatrix of Ψ formed by the columns in the subset M. For any vectors a and b, let "○" denote the Hadamard product a ○ b = (a1b1, …, anbn)T, |a| = (|a1|, …, |an|)T, diag(a) the diagonal matrix with the elements of a on its diagonal, and aM the subvector of a formed by the elements in M. The jth element of a is denoted by aj. Let ‖ · ‖p be the Lp norm of vectors or matrices, and let ‖Y‖ψm be the Orlicz norm of a random variable Y,

‖Y‖ψm = inf{C > 0 : E exp(|Y|m/Cm) ≤ 2}

for any m ≥ 1.
Let Mα = supp(α0), Mβ = supp(β0) and Mθ⋆ = supp(θ⋆), and let Mαc, Mβc and Mθ⋆c be their complements. Assume each xj is standardized such that ‖xj‖2 = √n. Let Φ(θ) = [Φ(x1, θ), …, Φ(xn, θ)]T and let ϕ(θ) = [ϕ1(θ), …, ϕq(θ)] denote its Jacobian matrix, where the derivatives are taken componentwise, i.e.,

ϕl(θ) = ∂Φ(θ)/∂θl = [∂Φ(x1, θ)/∂θl, …, ∂Φ(xn, θ)/∂θl]T

for all l = 1, …, q. We write Φ(θ⋆) and ϕ(θ⋆) as Φ and ϕ when there is no confusion, and use the shorthand Φ̂, ϕ̂ for Φ(θ̂), ϕ(θ̂).
5.1. The misspecified function
We first define the least false parameter under the misspecification induced by the posited mean function Φ(x, θ). For regression models with a fixed number of predictors, the least false parameter under model misspecification has been widely studied in the literature (e.g., White, 1982; Li and Duan, 1989). For regression models with NP dimensionality, however, its definition is more delicate. We define our least false parameter as follows.
For each θ ∈ ℝq, let dnθ = 1/2 minj{|θj| : θj ≠ 0}, Mθ be the support of θ, μ = (μ(x1), …, μ(xn))T and
Consider the set
for some constant C0, and s0 ≪ n. We assume the set Θ to be nonempty and define the least false parameter as
In addition, we assume
(5.1) |
for some γ0 ≥ 0. By its definition, θ⋆ satisfies
(5.2) |
and |Mθ⋆| ≤ s0.
Remark 5.1. Conditions (5.1) and (5.2) are key assumptions determining the degree of model misspecification. Condition (5.1) requires that the posited working model Φ provide a good approximation to μ; in that case, the residual μ − Φ is nearly orthogonal to the Jacobian matrix ϕMθ⋆ and the left-hand side of (5.1) is small. In general, our assumptions are weaker than the weak sparsity assumption imposed for the Lasso (Bunea, Tsybakov and Wegkamp, 2007), which requires the L2 approximation error ‖μ − Φ‖2 to converge to 0 at a certain rate.
Condition 5.1. We assume the following conditions:
(5.3) |
(5.4) |
(5.5) |
(5.6) |
(5.7) |
for some constants 0 ≤ a3 ≤ 1/2, 0 ≤ γθ⋆ ≤ γ0, sθ⋆ = |Mθ⋆|. If the response is unbounded, we require
(5.8) |
and the right-hand side of (5.6) shall be modified to .
Remark 5.2. Conditions (5.6) and (5.7) put constraints on the derivatives of ϕ, requiring the misspecified function to be smooth. The right-hand side order in (5.6) is not too restrictive when nγθ⋆ ≫ sθ⋆ logn.
Two common examples of the main-effect function Φ are provided below to examine the validity of Condition 5.1.
Example 1. Set Φ = 0; that is, no model is posited for the baseline mean. It is easy to check that Condition 5.1 is satisfied.
Example 2. When a linear model is specified, i.e., Φ(x, θ) = xTθ, conditions (5.6) and (5.7) are automatically satisfied since the second-order derivatives of Φ vanish. In this example, θ⋆ takes the form

θ⋆Mθ⋆ = (XMθ⋆TXMθ⋆)−1XMθ⋆Tμ, θ⋆Mθ⋆c = 0;

that is, θ⋆Mθ⋆ is the vector of regression coefficients from regressing μ on XMθ⋆. Condition (5.1) holds automatically since

XMθ⋆T(μ − XMθ⋆θ⋆Mθ⋆) = 0.
Condition (5.2) becomes
(5.9) |
Each element of the left-hand-side vector in (5.9) can be viewed as the inner product between the residuals from regressing a noise variable in Mθ⋆c on XMθ⋆ and the residuals from regressing μ on XMθ⋆. When μ depends only on XMθ⋆, (5.9) holds for the Gaussian linear model.
5.2. The covariates
Condition 5.2. Assume that
(5.10) |
(5.11) |
(5.12) |
(5.13) |
(5.14) |
(5.15) |
(5.16) |
(5.17) |
for some constants 0 ≤ γα, γβ, a2 ≤ 1/2, where
The sequence bαβ in (5.10) shall satisfy
Remark 5.3. Conditions (5.10) and (5.11) control the impact on β̂ of the deviation of the estimated propensity score from its true value, and thus are not needed when the propensity scores are known. By the definition of W(δ), the magnitudes of the left-hand sides of these two conditions depend on how accurately Φ models μ. The sequence bαβ in (5.10) can converge to 0 when XMβ and XMα are weakly correlated. Each element of the left-hand side of (5.11) is the multiple regression coefficient of the corresponding variable in XMβc on W(δ)XMα, using weighted least squares with weights π ○ (1 − π) after adjusting for XMβ, which characterizes their weak dependence given XMβ. These two conditions are generally weaker than those imposed by Fan and Lv (2011) (Condition 2) and are therefore more likely to hold.
Remark 5.4. The right-hand side of (5.15) can be relaxed to O(n1/2+γθ⋆/log n) when a linear model is used. The additional term is due to the penalty on the complexity of the main-effect model. This condition typically controls the deviation
(5.18) |
where Z = diag(A−π)X. A common approach to bounding such a deviation is the classical Bernstein inequality. However, this approach does not work here, because the indexing parameter of the process Φ(·) in (5.18) is an estimator. To handle this challenge, we bound the left-hand side of (5.18) by
A general theory that covers the above result is provided in Proposition C.1 in the supplementary article.
Remark 5.5. Conditions (5.16) and (5.17) control the L∞ norm of the quadratic term of the Taylor expansion in α̂ around α0. As with (5.10) and (5.11), these two conditions are not needed when α0 is known.
5.3. Weak oracle properties
Theorem 5.1 (Weak oracle property). Assume that Conditions B.1 and B.3 in the supplementary Appendix and Conditions 5.1 and 5.2 hold, and that maxi ‖ei‖ψ1 < ∞, where ei is the error for the ith patient in (2.1). Then there exist local minimizers α̂, θ̂ and β̂ of the loss functions (2.3), (2.4) and (2.5), respectively, such that with probability at least 1 − c̄/(n + p + q):

(a) (sparsity) α̂Mαc = 0, θ̂Mθ⋆c = 0, β̂Mβc = 0;

(b) (L∞ loss) ‖α̂Mα − α0Mα‖∞ = O(n−γα log n), ‖β̂Mβ − β0Mβ‖∞ = O(n−γβ log n), ‖θ̂Mθ⋆ − θ⋆Mθ⋆‖∞ = O(n−γθ⋆ log n),

where c̄ is some positive constant.
Remark 5.6. In Theorem 5.1, part (a) corresponds to sparse recovery, while part (b) gives the estimators' convergence rates. The weak oracle property of α̂ follows directly from Theorem 2 in Fan and Lv (2011). Proving this property for β̂, however, requires further effort to account for the variability due to plugging in θ̂ and α̂. The L∞ convergence rate of α̂Mα, together with the nonsparsity size sα, plays an important role in determining how fast β̂Mβ converges.
Remark 5.7. The convergence rate of θ̂ does not affect that of β̂. Because we require the posited propensity score model to be correct, the estimation of β is robust to misspecification of the main-effect parameters θ. Simulation results also validate this theoretical finding.
6. Oracle properties
In this section we study the oracle property of the estimator β̂. We assume that max(sα, sβ) ≪ √n and nγθ⋆ ≫ sθ⋆logn. The convergence rates of the estimators are established in Section 6.1 and their asymptotic distributions are provided in Section 6.2.
6.1. Rates of convergence
Condition 6.1. In addition to (5.16) and (5.17) in Condition 5.2, assume that the right-hand side of (5.15) is strengthened to , and the following conditions hold,
(6.1) |
(6.2) |
(6.3) |
(6.4) |
(6.5) |
Remark 6.1. Similar to the interpretation of (5.10) and (5.11), condition (6.1) corresponds to a notion of weak dependence between the variables in XMα and XMβ, while (6.2) requires that XMβc and XMα be weakly correlated after adjusting for XMβ. Moreover, it can be verified that (6.3)–(6.5) hold with large probability when the baseline covariates possess sub-Gaussian tails.
Theorem 6.1. Assume that Conditions 2.1, 5.1 and 6.1 and Conditions B.2 and B.4 in the supplementary Appendix hold, and that maxi ‖ei‖ψ1 < ∞. The constraints on bθ⋆, dθ, dnθ and λ3n are the same as in Theorem 5.1. Further assume sα = O(nl1) and sβ = O(nl2) with max(l1, l2) < γα, and that nγθ⋆ ≫ sθ⋆ log n. Then there exist a strict local minimizer β̂ of the loss function (2.5) and α̂ of (2.3) such that, with probability tending to 1 as n → ∞, ‖α̂Mα − α0Mα‖2 = O(√(sα/n)) and ‖β̂Mβ − β0Mβ‖2 = O(√((sα + sβ)/n)).
Remark 6.2. We note that when establishing the oracle property of β̂, only the weak oracle property of θ̂ is required. This is due to the robustness of the A-learning method and the fact that the propensity score model is correctly specified.
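To make this robustness explicit, note that under model (2.1), for any working model Φ (a one-display sketch, assuming the propensity score π(X) is correctly specified),

```latex
Y - \Phi(X,\theta) - \{A - \pi(X)\}\,X^{T}\beta_0
  \;=\; \bigl\{ h_0(X) - \Phi(X,\theta) + \pi(X)\,X^{T}\beta_0 \bigr\} + e,
```

where the bracketed term depends on X only. Since E{A − π(X) | X} = 0, the population normal equations of the second-step least squares problem are solved at β0 no matter how poorly Φ approximates μ.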
Remark 6.3. The precision of β̂Mβ is affected by that of α̂Mα, since ‖β̂Mβ − β0Mβ‖2 is at least of the same order of magnitude as ‖α̂Mα − α0Mα‖2. When the propensity score is known, the convergence rate of β̂Mβ improves to O(√(sβ/n)).
6.2. Asymptotic distributions
We define Σ12 and Σ22 as
where W is a shorthand for W(θ⋆).
To establish the weak convergence of the estimators, we introduce the following conditions.
Condition 6.2. Assume that
(6.6) |
(6.7) |
(6.8) |
(6.9) |
(6.10) |
where xMαi and xMβi stand for the ith rows of the matrices XMα and XMβ, respectively.
Remark 6.4. Conditions (6.7) and (6.8) are Lyapunov conditions, which guarantee the asymptotic normality of α̂Mα and β̂Mβ. Condition (6.9) constrains the maximum eigenvalue of the relevant variance–covariance matrix, requiring it to be finite. Condition (6.10) holds when Φ(δ) converges to Φ uniformly in the L∞ norm for δ in the region Hθ⋆. When ‖μ − Φ‖∞ is bounded, (6.8) and (6.9) are simultaneously satisfied.
Theorem 6.2 (Oracle property). Under the conditions of Theorem 6.1 and Condition 6.2, assume max(sα, sβ) = o(n1/3) and that the right-hand side of (5.15) is suitably strengthened as n → ∞. Then, with probability tending to 1, the local minimizers α̂ and β̂ in Theorem 6.1 must satisfy

α̂2 = 0, β̂2 = 0 (the subvectors corresponding to Mαc and Mβc),

and the vector √n(A1n(α̂Mα − α0Mα), A2n(β̂Mβ − β0Mβ)) is asymptotically normally distributed with mean 0 and covariance matrix Ω, which is the limit of
where A1n is a q1 × sα matrix and A2n is a q2 × sβ matrix such that
Note that the condition on the smoothness of the misspecified function, (5.15), is strengthened here. To better understand the above theorem, we provide the following two corollaries. The first gives the limiting distribution when the main-effect model is correctly specified, while the second corresponds to the case where the propensity score is known in advance.
Corollary 6.1. Under the conditions of Theorem 6.2, when the main-effect model is correctly specified, i.e., μ = Φ, α̂Mα and β̂Mβ are jointly asymptotically normally distributed with covariance matrix Ω′, which is the limit of the following matrix,
Remark 6.5. Comparing the results of Corollary 6.1 with those of Theorem 6.2, the additional term accounts for the partial specification of model (2.1). In the extreme case where Φ is correctly specified, β̂Mβ achieves its minimum variance and is independent of α̂Mα. In general, we gain efficiency by positing a good working model for Φ; numerical studies also suggest that a linear model such as Φ = Xθ is preferable to the constant model. This is in line with our theoretical justification, since W is a diagonal matrix with ith diagonal element μi − Φi.
Corollary 6.2. When the propensity score is known, under conditions of Theorem 6.2 with all α̂'s replaced by α0, then with probability tending to 1 as n → ∞, is asymptotically normally distributed with mean 0, co-variance matrix Ω″ which is the limit of
where
Remark 6.6. An interesting fact implied by Corollary 6.2 is that the asymptotic variance of β̂Mβ is smaller when the propensity score is estimated than it would be had we known the propensity score in advance. A similar result holds for the asymptotic distribution of the estimated value function in the next section. This is in line with semiparametric theory in the fixed-p case, where the variance of the augmented-IPSW estimator is smaller when the parameter in the coarsening probability model is estimated, even if its true value is known (see Chapter 9 of Tsiatis, 2006). By estimating the propensity score, we effectively borrow information from the linear association between the covariates in WXMβ and those in XMα.
7. Evaluation of value function
In this section, we derive a nonparametric estimate of the mean response under the optimal treatment regime. By (2.1), define the average population-level response under a specific regime as
where the treatment decision for the ith patient is given as di(β) = I(xiTβ > 0). The mean response under the true optimal regime is denoted by Vn(β0), and it is easy to verify that β0 is the maximizer of the function Vn.
As in Murphy (2003), we propose to estimate Vn(β0) by
(7.1) |
This estimator is not doubly robust, but it offers protection against misspecification of the baseline function as well as improved efficiency. It is not doubly robust because we require the propensity score model to be correctly specified to ensure the oracle property of β̂. A key condition that guarantees the asymptotic normality of (7.1) is stated in Condition 7.1 below.
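To fix ideas, note that under model (2.1), E[Y + {d(X) − A}XTβ0] = E{h0(X) + d(X)XTβ0} for any regime d, so a natural plug-in estimator (a sketch of the construction, not necessarily the exact display (7.1)) is

```latex
\hat{V}_n \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \Bigl[ Y_i + \bigl\{ I(x_i^{T}\hat{\beta} > 0) - A_i \bigr\}\, x_i^{T}\hat{\beta} \Bigr].
```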
Condition 7.1. Assume there exists some constant C′, such that for all ε > 0,
Remark 7.1. The above condition has an interpretation similar to that of Condition (3.3) in Qian and Murphy (2011), where a random design was utilized. Condition 7.1 requires that the absolute value of the average contrast function not be too small, which, together with the condition sβ = o(n1/4), ensures the following stochastic approximation condition:
(7.2) |
Theorem 7.1. Assume that the conditions in Theorem 6.2 hold. If Condition 7.1 holds and the nonsparsity size sβ satisfies sβ = o(n1/4), then, with probability tending to 1, √n{V̂n − Vn(β0)} is asymptotically normally distributed, with variance given by the limit of
(7.3) |
where υn stands for the vector , and Σ22 is defined in Theorem 6.2.
Remark 7.2. Note that only sβ = o(n1/2) is needed to guarantee the weak oracle property of β̂ and the convergence rate of ‖β̂Mβ − β0Mβ‖2. This condition is strengthened to sβ = o(n1/3) to show the asymptotic normality of β̂Mβ, and Theorem 7.1 further requires sβ = o(n1/4) to ensure the approximation condition (7.2).
Remark 7.3. When (7.2) is satisfied, the asymptotic normality of V̂n follows immediately from the oracle property of β̂Mβ. The first term σ2 in (7.3) is due to the variation of the error terms ei, while the last two terms correspond to the asymptotic variance of β̂Mβ.
The following corollary corresponds to the case where the main-effect model is correctly specified.
Corollary 7.1. In addition to the conditions in Theorem 7.1, if the main-effect model is correct, then √n{V̂n − Vn(β0)} is asymptotically normally distributed, with variance defined as the limit of
where υn is defined in Theorem 7.1.
Similar to the asymptotic distribution of β̂Mβ, the following corollary shows that the proposed estimator is more efficient when the propensity score is estimated by fitting a penalized logistic regression.
Corollary 7.2. Assume the propensity score is known and the conditions in Theorem 7.1 hold with all α̂'s replaced by α0. Then, with probability tending to 1, √n{V̂n − Vn(β0)} is asymptotically normally distributed, with variance given by the limit of
with υn defined in Theorem 7.1, and defined in Corollary 6.2.
By the definition of υn, the asymptotic variance reaches its minimum when the optimal treatment indicator I(xiTβ0 > 0) is close to the propensity score. We characterize this result in the following corollary.
Corollary 7.3. Under the conditions in Theorem 7.1, if we further assume that
then with probability going to 1, √n{V̂n − Vn(β0)} is asymptotically normally distributed with the variance σ2.
Remark 7.4. Such a result is expected from the following intuition: in an observational study, if the clinician or decision maker has a high chance of assigning the optimal treatment to an individual patient, i.e., the propensity score is close to I(xiTβ0 > 0), then the variation in estimating the value function is decreased. In other words, the more skillful the clinician or decision maker is, the closer the observed individual response Yi is to the potential outcome under the optimal treatment regime.
8. Conclusion
In this article, we propose a two-step estimator of the optimal treatment strategy that selects variables and estimates parameters simultaneously in both the propensity score and outcome regression models using penalized regression. Our methodology can handle data sets whose dimensionality grows exponentially with the sample size. Oracle properties of the estimators are established. Variable selection is also involved in the misspecified model, and new mathematical techniques are developed to study the estimator's properties in a general form of optimization. The estimator is shown to be more efficient when the misspecified working model is "closer" to the conditional mean of the response, although our approach does not require correct specification of the baseline function. Numerical results demonstrate that the proposed estimator enjoys model selection consistency and has overall satisfactory performance.
When there are multiple local solutions to the objective functions (2.3)–(2.5), our asymptotic theory only guarantees the existence of a local minimum possessing the oracle property. It is worth mentioning, however, that the desired oracle estimator can be identified using existing algorithms (see Fan, Xue and Zou, 2014; Wang, Kim and Li, 2013), and theoretical properties can be established in a similar fashion.
The proposed method requires the propensity score model to be correctly specified. In randomized studies, the propensity score is known in advance, so this assumption is automatically satisfied; for observational studies, however, there is no such guarantee. In practice, prior information on the treatment decision mechanism used by physicians may help in building a reasonable propensity score model, and model diagnostic tests can be used to check the goodness of fit of the posited propensity score model, such as a logistic regression model. In general, this may be easier than checking the goodness of fit of the regression model for the response. In addition, in our current work we assume the design matrix X to be deterministic, mainly for technical convenience. To the best of our knowledge, penalized regression with folded-concave penalties has not been studied in random design settings with NP dimensionality; handling random designs would require imposing tail conditions on X and modifying the derivation of some technical results. This is beyond the scope of the current paper and will be investigated elsewhere.
The current framework focuses on point treatment studies. It would be interesting and practically useful to extend our results to dynamic treatment regimes, although significant effort is needed to handle model misspecification across multiple stages. This is an interesting topic for further investigation.
Appendix
Here, we give only the proof of Theorem 5.1. Additional technical conditions and the proofs of Theorems 6.1, 6.2 and 7.1 are given in the supplementary Appendix. To establish Theorem 5.1, we need the following lemmas, whose proofs are also given in the supplementary Appendix.
Lemma 1. Let z = (z1, …, zn)T be an n-dimensional vector of independent random variables with mean 0, and let a ∈ ℝn.

(i) If z1, …, zn are bounded in [c, d], then for any ε ∈ (0, ∞),

(ii) If z1, …, zn satisfy maxi ‖zi‖ψ1 ≤ ω, then for any ε ∈ (0, ∞),
Lemma 2. Define ε = ⋂k εk, where the events εk are defined in Appendix G. Under the conditions of Theorem 5.1, we have Pr(ε) ≥ 1 − c̄/(n + p + q) for some c̄ > 0.
Notation. Let Z = diag(A − π)X, Ẑ = diag(A − π̂)X, Δ = diag{π ○ (1 − π)}, and π = (π(x1), …, π(xn))T. For a given matrix Ψ, the superscript Ψj refers to the jth column of Ψ, while the subscript Ψi stands for the ith row of Ψ. For convenience, we write Φ(θ) and ϕ(θ), for θ supported on Mθ⋆, as Φ(θMθ⋆) and ϕ(θMθ⋆).
Proof of Theorem 5.1
We break the proof into three steps. Based on Theorem 1 in Fan and Lv (2011), it suffices to prove the existence of β̂Mβ and θ̂Mθ⋆ inside the hypercube
with K a large constant, conditional on the event ε, satisfying
(A.1) |
(A.2) |
(A.3) |
(A.4) |
(A.5) |
(A.6) |
Step 1. We first show the existence of a solution to equations (A.1) and (A.2) inside ℵ for sufficiently large n. For any δ = (δ1, …, δsβ+sθ⋆)T ∈ ℵ, since dnβ ≥ n−γβ log n and dnθ ≫ n−γθ⋆ log n, we have
and sgn(δβ) = sgn(β0Mβ). The monotonicity condition on ρ′ then gives
(A.7) |
We write the left hand side of (A.1) as
(A.8) |
on the set ε3 ∪ ε5 ∪ ε13, we have
(A.9) |
Define
Note that η1Mβ = I3 in (A.8), which we expand here using a second-order Taylor expansion around α0Mα,
(A.10) |
where rI3 in (A.10) is the second-order remainder, whose jth component is given as
where Σ(α̃) is a diagonal matrix whose ith diagonal element is evaluated at α̃, with α̃ lying on the line segment between α̂Mα and α0Mα. Since π″(·) is a bounded function, we can bound ‖rI3‖∞ by
(A.11) |
whose order of magnitude is O(sαn1−2γα log2 n) by (5.16).
We decompose I4 in (A.8) into two parts. Using similar arguments, on the set ε9, it follows from (5.17) that
(A.12) |
Using a Taylor expansion, it is immediate to see that
(A.13) |
by (5.17). Combining (A.12) and (A.13) gives
(A.14) |
So far, we have
(A.15) |
by (A.9), (A.10), (A.11) and (A.14). Now we approximate I4 and bound the magnitude of the error ‖ωMβ‖∞, where ω = (ẐTẐMβ − XTΔXMβ)(δβ − β0Mβ). We write it as
(A.16) |
It follows from first-order Taylor expansion that the jth element in ω1Mβ can be presented as
(A.17) |
where Δ(α̃Mα) is a diagonal matrix with ith diagonal component π(xi, α̃Mα){1 − π(xi, α̃Mα)}, where α̃Mα lies on the line segment between α̂Mα and α0Mα. We decompose xj as the Hadamard product of two vectors, denoted by x̄j ○ x̃j, where
Letting φ = (A − π̂) ○ x̃j ○ {Δ(α̃Mα)XMα(α̂Mα − α0Mα)}, we have
(A.18) |
Since ‖A − π̂‖∞ ≤ 1 and the elements of Δ(α̃Mα) are bounded, we have
(A.19) |
Combining (A.18) with (A.19) gives
(A.20) |
which follows from (B.4) and (B.5).
By the same argument, we can verify that ‖ω2Mβ‖∞ is of the same order. Note that on the set ε11,
these bounds, together with (A.20), yield
(A.21) |
Define the vector-valued function
(A.22) |
then equation (A.1) is equivalent to Ψ1(δβ, δθ) = 0. It follows from (A.7), (A.15) and (A.21) that
By arguments similar to those in the proof of Theorem 2 in Fan and Lv (2011), we have
(A.23) |
on the set ε1 ∪ ε2. Thus by (5.10), (B.1), (B.14) and (B.15), we have
Therefore by (A.20), for sufficiently large n, if (δβ − β0Mβ)j = n−γβ logn,
(A.24) |
and if (δβ − β0Mβ)j = −n−γβ log n,
(A.25) |
Similarly we write the left-hand side of (A.2) as
(A.26) |
It is immediate to see that
(A.27) |
on the set ε5. The L∞ norm of the first term in (A.26) is bounded by
(A.28) |
on the set ε15.
Using a second-order Taylor expansion, we approximate the last term in (A.26) by its first-order term. It follows from (5.7) that the L∞ norm of the remainder term is bounded from above by
(A.29) |
where δ̃θ lies on the line segment between θ⋆Mθ⋆ and δθ.
Define Ψ2(δβ, δθ) = {ϕMθ⋆(δθ)TϕMθ⋆(δθ)}−1[ϕMθ⋆(δθ)T{Y − Φ(δθ)} − nλ3nρ̄3(δθ)]; then equation (A.2) is equivalent to Ψ2(δβ, δθ) = 0. Similarly to Ψ1(δβ, δθ), we now show that Ψ2(δβ, δθ) is dominated by its leading term. It follows from (5.1), (5.3), (B.13), (A.26), (A.27), (A.28) and (A.29) that
(A.30) |
Therefore, we can find a large constant K < ∞ such that, for n large enough, if
(A.31) |
and if ,
(A.32) |
Combining (A.24) and (A.25) with (A.31) and (A.32), an application of Miranda's existence theorem shows that equations (A.1) and (A.2) have a solution (β̂Mβ, θ̂Mθ⋆) in ℵ.
Step 2. Let (β̂T, θ̂T)T be a solution to equations (A.1) and (A.2) with and . We show that (β̂T, θ̂T)T satisfies inequalities (A.3) and (A.4). Decompose (A.3) as the sum of the following terms,
(A.33) |
On the set ε4 ∪ ε6 ∪ ε10 ∪ ε12, it is immediate to see that
(A.34) |
By (B.4), (B.5) and (A.20), a first-order Taylor expansion gives
(A.35) |
Similarly it follows from (5.17) and (A.13) that
(A.36) |
On the set ε10, by (5.17) and (A.12), we have
(A.37) |
Approximating this term by its first-order expansion, the L∞ norm of the remainder term is bounded from above by
(A.38) |
by (5.16). It then follows from (A.33)–(A.38) that
(A.39) |
Since β̂Mβ solves (A.1), we have
(A.40) |
where uβ is defined as Ψ1(β̂Mβ, θ̂Mθ⋆) + β0Mβ − β̂Mβ. Combining (A.40) with (A.23) and (A.39) gives
by (5.11), (B.3), (B.16) and (B.19). Since C < 1, for sufficiently large n, (A.3) is satisfied.
Now we verify (A.4), decomposing its left-hand side as the sum of
(A.41) |
on the set ε8 ∪ ε16, we have
(A.42) |
Similar to (A.29), a second-order Taylor expansion gives
(A.43) |
by (5.7). Since (β̂Mβ, θ̂Mθ⋆) is the solution to Ψ2(δβ, δθ) = 0, it follows from (A.30) that
(A.44) |
By (A.41)–(A.44) and conditions in (5.2), (5.4), (B.15) and (B.20), the left-hand side of (A.4) can be bounded by
for C < 1. Therefore (A.4) is satisfied.
Step 3. Now we show that the second-order conditions (A.5) and (A.6) hold. Because (A.6) is directly implied by (B.17), it suffices to establish (A.5) for sufficiently large n. Since (ẐMβ − ZMβ)T(ẐMβ − ZMβ) is positive semi-definite, we have
(A.45) |
Since, for any symmetric matrix Ψ, the absolute value of the minimum eigenvalue can be bounded by
(A.5) follows if we can show that the above bound is of smaller order. This is immediate because
on the set ε11. Similar to (A.20), this term can be bounded from above by
(A.46) |
which is implied by the constraint max(l1, l2) < γα. This completes the proof.
Footnotes

The research of Chengchun Shi and Rui Song was supported in part by Grants NSF-DMS 1309465 and 1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu was supported in part by Grant NCI P01 CA142538.

Supplementary Material

Supplement to "Robust Learning for Optimal Treatment Decision with NP-Dimensionality" (doi: 10.1214/16-EJS1178SUPP; .pdf).
References
- Bunea F, Tsybakov A, Wegkamp M. Sparsity oracle inequalities for the Lasso. Electron J Stat. 2007;1:169–194. MR2312149.
- Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19:317–343. MR2757118.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. MR1946581.
- Fan A, Lu W, Song R. Sequential advantage selection for optimal treatment regime. Ann Appl Stat. 2015, to appear. MR3480486.
- Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory. 2011;57:5467–5484. MR2849368.
- Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. Ann Statist. 2014;42:819–849. MR3210988.
- Gunter L, Zhu J, Murphy SA. Variable selection for qualitative interactions. Stat Methodol. 2011;8:42–55. MR2741508.
- Li KC, Duan N. Regression analysis under link violation. Ann Statist. 1989;17:1009–1052. MR1015136.
- Lu W, Zhang HH, Zeng D. Variable selection for optimal treatment decision. Stat Methods Med Res. 2013;22:493–504. MR3190671.
- Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B Stat Methodol. 2003;65:331–366. MR1983752.
- Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Statist. 2011;39:1180–1210. MR2816351.
- Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. Lecture Notes in Statist. 179. New York: Springer; 2004. pp. 189–326. MR2129402.
- Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.
- Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
- Shi C, Song R, Lu W. Supplement to "Robust learning for optimal treatment decision with NP-dimensionality". 2016. doi:10.1214/16-EJS1178SUPP.
- Song R, Kosorok M, Zeng D, Zhao Y, Laber E, Yuan M. On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat. 2015;4:59–68. MR3405390.
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267–288.
- Tsiatis AA. Semiparametric Theory and Missing Data. Springer Series in Statistics. New York: Springer; 2006. MR2233926.
- Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension. Ann Statist. 2013;41:2505–2536. MR3127873.
- Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8:279–292.
- White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. MR0640163.
- Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942. MR2604701.
- Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. MR3040007.
- Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Amer Statist Assoc. 2012;107:1106–1118. MR3010898.