Published in final edited form as: Electron J Stat. 2018 Jun 21;12(1):2074–2089. doi: 10.1214/18-EJS1439

High-Dimensional Inference for Personalized Treatment Decision

X. Jessie Jeng*, Wenbin Lu, Huimin Peng

Abstract

Recent developments in statistical methodology for personalized treatment decisions have employed high-dimensional regression to accommodate a large number of patient covariates and have described personalized treatment decisions through interactions between treatment and covariates. While existing variable selection methods can identify a subset of interaction terms indicating which covariates are relevant for making treatment decisions, the results often lack statistical interpretation. This paper proposes an asymptotically unbiased estimator, based on the Lasso solution, for the interaction coefficients. We derive the limiting distribution of the estimator when the baseline function of the regression model is unknown and possibly misspecified. Confidence intervals and p-values are derived to infer the effects of the patient covariates on treatment decisions. We confirm the accuracy of the proposed method and its robustness against a misspecified baseline function in simulation and apply the method to the STAR*D study of major depressive disorder.

Keywords: Large p Small n, Model Misspecification, Optimal Treatment Regime, Robust Regression

1. Introduction

Precision medicine aims to design the optimal treatment for each individual according to his or her specific conditions, in the hope of lowering medical costs and improving treatment efficacy. Existing methods can be broadly partitioned into regression-based and classification-based approaches. Popular regression-based approaches include Q-learning [21, 11, 3, 6, 7, 17] and A-learning [13, 10, 8, 16, 15]. Q-learning models the conditional mean of the outcome given covariates and treatment, while A-learning directly models the interaction between treatment and covariates, which is sufficient for treatment decisions. Other regression-based methods have been developed in [18], [12], [19], among others. In contrast, classification-based approaches, also known as policy search or value search, estimate the marginal mean of the outcome for every treatment regime within a pre-specified class and then take the maximizer as the estimated optimal regime; see, for example, [13], [22], [23], [25], [26]. As emerging technology makes it possible to gather an extraordinarily large number of prognostic factors for each individual, such as genetic information and clinical measurements, regression-based and classification-based methods have been extended to handle high-dimensional data [12, 25, 26, 19].

This paper focuses on regression-based approaches and addresses the current limitation that the selected interaction effects often lack statistical interpretation when the number of covariates exceeds the sample size. Consider the following semiparametric model. Denote $X$ as the $n \times p$ design matrix, where $n$ is the number of patients and $p$ is the number of patient covariates. Let $\tilde{X}$ be the design matrix with an additional column of 1's and $\tilde{X}_i$ the $i$-th row of $\tilde{X}$. Denote $Y_i$ as the outcome and $A_i \in \{0,1\}$ the treatment assignment of the $i$-th patient. Assume

$$Y_i = \mu(X_i) + A_i\,(\beta_0^T \tilde{X}_i) + \epsilon_i, \tag{1}$$

where $\mu(X_i)$ is an unspecified baseline function and $\beta_0$ is the vector of unknown coefficients for the interactions between treatment and patient covariates. We are interested in inference on $\beta_0$ and in optimal treatment decisions based on the estimated $\beta_0$.

We propose an asymptotically unbiased estimator for $\beta_0$ and derive its limiting distribution when $p > n$. Consequently, confidence intervals and p-values for the interaction coefficients can be calculated. Because the baseline function is unknown and possibly misspecified, we investigate the robustness of the estimator to a misspecified $\mu(X_i)$.

The proposed estimator for $\beta_0$ and the p-values calculated from the data can be used to make significance-based optimal treatment decisions. We illustrate the efficiency of the new treatment regime in a real application to the STAR*D study of major depressive disorder.

2. Method and Theory

2.1. An A-learning framework

We adopt the robust regression approach of [16] and transform the interaction from $A_i(\beta_0^T \tilde{X}_i)$ to $(A_i - \pi(X_i))(\beta_0^T \tilde{X}_i)$, where $\pi(X_i) = E(A_i \mid X_i)$ is the propensity score. Because $E(A_i - \pi(X_i) \mid X_i) = 0$, the transformed interaction is orthogonal to the baseline function $\mu(X_i)$ given $X_i$. This protects the estimation of $\beta_0$ from the effect of baseline function misspecification [8, 16]. For simplicity of presentation, we consider a completely randomized study with a prespecified propensity score, i.e., $A_i \in \{0,1\}$ and $\pi(X_i) = P(A_i = 1 \mid X_i) = \pi$. Let $\mu_\pi(X_i) = \mu(X_i) + \pi(\beta_0^T \tilde{X}_i)$; then (1) is equivalent to

$$Y_i = \mu_\pi(X_i) + (A_i - \pi)(\beta_0^T \tilde{X}_i) + \epsilon_i.$$

The true $\mu_\pi(\cdot)$ is unknown. Let $\hat{\mu}_\pi(\cdot)$ be an estimator of $\mu_\pi(\cdot)$ that does not necessarily converge to the true $\mu_\pi(\cdot)$. Assume $\hat{\mu}_\pi(\cdot) \to \mu_\pi^*(\cdot)$ uniformly for some function $\mu_\pi^*(\cdot)$. Because $E(A_i - \pi \mid X_i) = 0$, $\mu_\pi^*(X_i)$ and $(A_i - \pi)\beta_0^T \tilde{X}_i$ are orthogonal, which implies that

$$\beta_0 = \arg\min_{\beta}\Big\{E\big[Y - \mu_\pi^*(X) - D(A,\pi)\tilde{X}\beta\big]^2\Big\}, \tag{2}$$

where $D(A,\pi) = \mathrm{diag}\{A_1 - \pi, \ldots, A_n - \pi\}$. We apply the Lasso to regress $Y - \hat{\mu}_\pi(X)$ on $D(A,\pi)\tilde{X}$ and obtain

$$\hat{\beta} = \arg\min_{\beta}\Big\{\big\|Y - \hat{\mu}_\pi(X) - D(A,\pi)\tilde{X}\beta\big\|_2^2/n + 2\lambda_{n,p}\|\beta\|_1\Big\}. \tag{3}$$
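As a concrete illustration, the following is a minimal sketch of the Lasso step in (3), assuming a completely randomized study with known propensity score `pi`, outcome vector `y`, treatment indicator `a`, covariate matrix `X`, and a baseline estimate `mu_hat`; all names are illustrative rather than part of any released implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_interaction_lasso(y, a, X, mu_hat, pi, lam):
    """Lasso regression of Y - mu_hat on D(A, pi) X_tilde, as in (3)."""
    n, p = X.shape
    X_tilde = np.column_stack([np.ones(n), X])   # add the column of 1's
    Z = (a - pi)[:, None] * X_tilde              # rows of D(A, pi) X_tilde
    # sklearn's Lasso minimizes ||r - Z b||_2^2 / (2n) + alpha * ||b||_1,
    # which matches (3) up to a factor of 2, so alpha = lambda_{n,p}.
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    fit.fit(Z, y - mu_hat)
    return fit.coef_                             # beta_hat of length p + 1

# A tuning choice suggested by Lemma 2.1 is lam of order sqrt(log(p) / n);
# in practice lam may also be selected by cross-validation.
```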

It is well known that the limiting distribution of the Lasso solution is difficult to derive. The misspecified baseline function adds another layer of difficulty [2].

2.2. Unbiased estimation with misspecified baseline function

We propose a de-sparsified estimator for β0 motivated by [24] and [20]:

$$\hat{b} = \hat{\beta} + \hat{\Theta}\{D(A,\pi)\tilde{X}\}^T\big\{Y - \hat{\mu}_\pi(X) - D(A,\pi)\tilde{X}\hat{\beta}\big\}/n, \tag{4}$$

where $\hat{\Theta}$ is an estimator of the precision matrix of $D(A,\pi)\tilde{X}$.
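Given the Lasso fit and an estimated precision matrix, the bias correction in (4) is a single matrix-vector update. Below is a minimal sketch under the same illustrative names as before, with `Theta_hat` standing for an estimate of the precision matrix of $D(A,\pi)\tilde{X}$ (for example, from the nodewise regression of Section A.1).

```python
import numpy as np

def desparsified_estimator(y, a, X, mu_hat, pi, beta_hat, Theta_hat):
    """De-sparsified estimator b_hat in (4), built from the Lasso solution."""
    n, p = X.shape
    X_tilde = np.column_stack([np.ones(n), X])
    Z = (a - pi)[:, None] * X_tilde                # D(A, pi) X_tilde
    resid = y - mu_hat - Z @ beta_hat              # Lasso residuals
    return beta_hat + Theta_hat @ Z.T @ resid / n  # bias-corrected estimate
```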

We show that this estimator is asymptotically unbiased and derive its limiting distribution as follows. Decompose $\hat{b} - \beta_0$ into three terms:

$$\hat{b} - \beta_0 = \eta - \Delta_1 - \Delta_2,$$

where $\eta = \hat{\Theta}\{D(A,\pi)\tilde{X}\}^T\big(Y - \mu_\pi^*(X) - D(A,\pi)\tilde{X}\beta_0\big)/n$, $\Delta_1 = (\hat{\Theta}\hat{\Sigma} - I)(\hat{\beta} - \beta_0)$, $\Delta_2 = \hat{\Theta}\{D(A,\pi)\tilde{X}\}^T\big(\hat{\mu}_\pi(X) - \mu_\pi^*(X)\big)/n$, and $\hat{\Sigma}$ is the sample covariance matrix of $D(A,\pi)\tilde{X}$.

Consequently, for an arbitrary $q \times (p+1)$ matrix $H$, we can decompose $\sqrt{n}H(\hat{b} - \beta_0)$ into three terms. The reason to study $\sqrt{n}H(\hat{b} - \beta_0)$ is to explore the asymptotic joint distribution of $q$ arbitrary linear contrasts of $\{\sqrt{n}(\hat{b}_1 - \beta_{0,1}), \ldots, \sqrt{n}(\hat{b}_p - \beta_{0,p})\}$, which includes the limiting distribution of a single $\sqrt{n}(\hat{b}_j - \beta_{0,j})$ as a special case. We derive the limiting distribution of $\sqrt{n}H(\hat{b} - \beta_0)$ by showing that $\sqrt{n}H\Delta_1 = o_P(1)$, $\sqrt{n}H\Delta_2 = o_P(1)$, and $\sqrt{n}H\eta \to_d N(0, G)$.

We first present a preliminary result on the Lasso solution $\hat{\beta}$ under a misspecified baseline function. As $\hat{\beta} - \beta_0$ is an important component of $\Delta_1$, this preliminary result helps to derive the asymptotic properties of $\sqrt{n}H\Delta_1$. Consider the following assumptions, where $C$ denotes a generic constant, possibly varying from place to place:

Condition 1: The random errors $\epsilon_i$ of the true model (1) are independent and identically distributed with $E(\epsilon_i \mid \tilde{X}, A) = 0$ and $E(\epsilon_i^2 \mid \tilde{X}, A) \le C$.

Condition 2: There exists a $\mu_\pi^*(\cdot)$ such that for any $\tilde{X}_i \in \mathbb{R}^{p+1}$,

  (a) $\sup_{\tilde{X}_i} \big|\hat{\mu}_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big| = o_p(1)$, and

  (b) $E\big(\mu_\pi^*(\tilde{X}_i) - \mu_\pi(\tilde{X}_i)\big)^2 \le C$.

Condition 3: Denote $\tilde{X}_j$ as the $j$th column of $\tilde{X}$. $\|\tilde{X}_j\|_\infty \le C$ for all $1 \le j \le p + 1$.

Condition 4: Define $\Sigma$ as the covariance matrix of $D(A,\pi)\tilde{X}$. $\Sigma$ has smallest eigenvalue $\Lambda_{\min}^2(\Sigma) \ge C > 0$.

Condition 5: The number of important interaction terms satisfies $s_0 = \|\beta_0\|_0 = o\{\sqrt{n}/\log(p)\}$.

Lemma 2.1. Consider model (1). Assume Conditions 1–5. The Lasso solution from (3) with $\lambda_{n,p} \asymp \sqrt{\log(p)/n}$ satisfies

$$\|\hat{\beta} - \beta_0\|_1 = o_p(1/\sqrt{\log p}). \tag{5}$$

Next, we show that $\sqrt{n}H\Delta_1$ and $\sqrt{n}H\Delta_2$ are of order $o_P(1)$. Similar to [20], we apply the Lasso for nodewise regression to construct $\hat{\Theta}$. Details of the construction are presented in Section A.1. Consider the following additional assumptions:

Condition 6: The prespecified matrix $H_{q \times (p+1)}$ satisfies

  (a) $\|H\|_\infty \le C$, and

  (b) $h_t = |\{j \ne t: H_{t,j} \ne 0\}| \le C$ for any $t \in \{1,\ldots,q\}$.

Condition 7: The precision matrix $\Theta$ of $D(A,\pi)\tilde{X}$ satisfies

  (a) $\|\Theta\|_\infty \le C$, and

  (b) $s_j = |\{k \ne j: \Theta_{j,k} \ne 0\}| \le C$ for any $j \in \{1,\ldots,p\}$.

Lemma 2.2. Consider model (1). Assume Conditions 1–7. Obtain $\hat{\beta}$ by (3) with $\lambda_{n,p} \asymp \sqrt{\log(p)/n}$ and $\hat{\Theta}$ by nodewise regression with $\lambda_j \asymp \sqrt{\log(p)/n}$ for any $j \in \{1,\ldots,p\}$. Then

$$\sqrt{n}H\Delta_1 = o_P(1), \tag{6}$$
$$\sqrt{n}H\Delta_2 = o_P(1). \tag{7}$$

The remaining component of $\sqrt{n}H(\hat{b} - \beta_0)$ is $\sqrt{n}H\eta$. We show that $\sqrt{n}H\eta$ converges to a multivariate normal distribution. Since $\eta$ does not involve $\hat{\beta} - \beta_0$, fewer conditions are needed in the following lemma.

Lemma 2.3. Consider model (1). Assume Conditions 1–3 and 6–7. Obtain $\hat{\Theta}$ by nodewise regression with $\lambda_j \asymp \sqrt{\log(p)/n}$ for any $j \in \{1,\ldots,p\}$. Then

$$\sqrt{n}H\eta \to_d N(0, G), \tag{8}$$

where $G = H\Theta V \Theta^T H^T$ is a $q \times q$ nonnegative definite matrix, $V = E\big[(A_i - \pi)^2 \sigma_\epsilon^2(A_i,\tilde{X}_i)\,\tilde{X}_i\tilde{X}_i^T\big] + \pi(1-\pi)E\big[\big(\mu_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big)^2 \tilde{X}_i\tilde{X}_i^T\big]$, $\sigma_\epsilon^2(A_i,\tilde{X}_i) = \mathrm{Var}(\epsilon_i \mid A_i, \tilde{X}_i)$, and $\|G\|_\infty < \infty$.

We summarize the above results in the following theorem.

Theorem 2.4. Consider model (1). Assume Conditions 1–7. Obtain $\hat{\beta}$ by (3) with $\lambda_{n,p} \asymp \sqrt{\log(p)/n}$ and $\hat{\Theta}$ by nodewise regression with $\lambda_j \asymp \sqrt{\log(p)/n}$ for any $j \in \{1,\ldots,p\}$. Then

$$\sqrt{n}H(\hat{b} - \beta_0) \to_d N(0, G),$$

where G is defined in (8).

2.3. Some examples for μ^π

In real applications, we need to choose an estimator μ^π for the baseline function. The only requirement for μ^π is Condition 2, which is very general. Here we discuss two examples for μ^π.

Example 1: $\hat{\mu}_\pi(\tilde{X}_i) = \text{constant}$. In this case, Condition 2 (a) holds trivially.

Condition 2 (b) is implied by

Condition 8: $E(Y_i^2) \le C$.

The following corollary presents the limiting distribution of nH(b^β0) in this case. Proof of this corollary is straightforward and, thus, omitted.

Corollary 2.5. Consider model (1). Implement $\hat{\mu}_\pi(\tilde{X}_i) = \text{constant}$. Assume Conditions 1, 3–7, and 8. Obtain $\hat{\beta}$ by (3) with $\lambda_{n,p} \asymp \sqrt{\log(p)/n}$ and $\hat{\Theta}$ by nodewise regression with $\lambda_j \asymp \sqrt{\log(p)/n}$ for any $j \in \{1,\ldots,p\}$. Then

$$\sqrt{n}H(\hat{b} - \beta_0) \to_d N(0, G),$$

where $G$ is defined in (8).

Example 2: $\hat{\mu}_\pi(\tilde{X}_i) = \hat{\gamma}^T\tilde{X}_i$, where $\hat{\gamma}$ is obtained by

$$(\hat{\gamma}, \hat{\beta}) = \arg\min_{\gamma,\beta}\Big\{\big\|Y - \tilde{X}\gamma - D(A,\pi)\tilde{X}\beta\big\|_2^2/n + 2\lambda_{n,p}\big(\|\gamma\|_1 + \|\beta\|_1\big)\Big\}. \tag{9}$$
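A minimal sketch of the joint Lasso in (9) follows: the outcome is regressed on the augmented design $[\tilde{X},\ D(A,\pi)\tilde{X}]$ with a single $\ell_1$ penalty, and the solution is split into the baseline part $\hat{\gamma}$ and the interaction part $\hat{\beta}$. Names are illustrative, as in the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_joint_lasso(y, a, X, pi, lam):
    """Joint Lasso (9): returns gamma_hat, beta_hat, and mu_hat = X_tilde @ gamma_hat."""
    n, p = X.shape
    X_tilde = np.column_stack([np.ones(n), X])
    Z = (a - pi)[:, None] * X_tilde
    W = np.hstack([X_tilde, Z])                  # augmented design [X_tilde, D(A,pi) X_tilde]
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    fit.fit(W, y)
    gamma_hat = fit.coef_[: p + 1]
    beta_hat = fit.coef_[p + 1 :]
    mu_hat = X_tilde @ gamma_hat                 # working baseline mu_hat_pi(X_i)
    return gamma_hat, beta_hat, mu_hat
```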

Note that the solution $\hat{\beta}$ from (9) is equivalent to the solution from (3) (details are given in the proof of Corollary 2.6). In this case, we replace Condition 2 with Condition 9: Define $\gamma^* = \arg\min_{\gamma}\big\{E\big(\mu_\pi(\tilde{X}_i) - \gamma^T\tilde{X}_i\big)^2\big\}$; $\gamma^*$ satisfies

  (a) $s_\gamma = \|\gamma^*\|_0 = o\big(\sqrt{n}/\log(p)\big)$, and

  (b) $E\big(\gamma^{*T}\tilde{X}_i - \mu_\pi(\tilde{X}_i)\big)^2 \le C$.

Corollary 2.6. Consider model (1). Implement $\hat{\mu}_\pi(\tilde{X}_i) = \hat{\gamma}^T\tilde{X}_i$ and obtain $\hat{\gamma}$ and $\hat{\beta}$ from (9) with $\lambda_{n,p} \asymp \sqrt{\log(p)/n}$. Assume Conditions 1, 3–7, and 9. Obtain $\hat{\Theta}$ by nodewise regression with $\lambda_j \asymp \sqrt{\log(p)/n}$ for any $j \in \{1,\ldots,p\}$. Then

$$\sqrt{n}H(\hat{b} - \beta_0) \to_d N(0, G),$$

where $G$ is defined in (8).

2.4. Interval estimation and p-value

The theoretical result on the asymptotic normality of the de-sparsified estimator $\hat{b}$ can be utilized to construct confidence intervals for the coefficients of interest and to perform hypothesis testing. The variance-covariance matrix $G$ can be approximated by $H\hat{\Theta}\hat{V}\hat{\Theta}^TH^T$, where

$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n}(A_i - \pi)^2\big\{Y_i - \hat{\mu}_\pi(\tilde{X}_i) - (A_i - \pi)\hat{\beta}^T\tilde{X}_i\big\}^2\tilde{X}_i\tilde{X}_i^T.$$

Therefore, a pointwise $(1 - \alpha)$ confidence interval for $\beta_{0,j}$ can be constructed as

$$\big\{\hat{b}_j - c(\alpha, n),\ \hat{b}_j + c(\alpha, n)\big\}, \tag{10}$$

where $c(\alpha, n) = \Phi^{-1}(1 - \alpha/2)\sqrt{\big(H\hat{\Theta}\hat{V}\hat{\Theta}^TH^T\big)_{j,j}/n}$, and $\Phi(\cdot)$ is the c.d.f. of $N(0,1)$.

One can also calculate the asymptotic p-value for testing

$$H_0: \beta_{0,j} = 0 \quad \text{vs.} \quad H_A: \beta_{0,j} \ne 0$$

for a given $j \in \{1,\ldots,p\}$. A nonzero $\beta_{0,j}$ means that covariate $X_j$ is relevant for making the treatment decision for a patient.
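The interval (10) and the corresponding two-sided p-value reduce to a standard Wald-type calculation once $\hat{V}$ and $\hat{\Theta}$ are available. The sketch below takes $H$ to be the identity, so each coordinate of $\hat{b}$ is handled separately; inputs follow the earlier illustrative sketches.

```python
import numpy as np
from scipy.stats import norm

def wald_inference(y, a, X, mu_hat, pi, beta_hat, b_hat, Theta_hat, alpha=0.05):
    """Pointwise confidence intervals (10) and p-values for H0: beta_{0,j} = 0."""
    n, p = X.shape
    X_tilde = np.column_stack([np.ones(n), X])
    resid = y - mu_hat - (a - pi) * (X_tilde @ beta_hat)
    w = (a - pi) ** 2 * resid ** 2
    V_hat = (X_tilde * w[:, None]).T @ X_tilde / n          # V_hat as defined above
    cov = Theta_hat @ V_hat @ Theta_hat.T / n               # Theta_hat V_hat Theta_hat^T / n
    se = np.sqrt(np.diag(cov))                              # coordinate-wise standard errors
    z = norm.ppf(1 - alpha / 2)
    ci = np.column_stack([b_hat - z * se, b_hat + z * se])  # (1 - alpha) intervals
    pvals = 2 * (1 - norm.cdf(np.abs(b_hat) / se))          # two-sided asymptotic p-values
    return ci, pvals
```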

3. Simulation

We consider three simulation settings with different baseline functions:

  • Case 1: $Y = 1 + X\gamma + A(\tilde{X}\beta_0) + \epsilon$,

  • Case 2: $Y = 1 + 0.5(1 + X\gamma)^2 + A(\tilde{X}\beta_0) + \epsilon$,

  • Case 3: $Y = 1 + 1.5\sin(\pi X\gamma) + X_1^2 + A(\tilde{X}\beta_0) + \epsilon$, where $X_1$ is the first covariate of $X$.

We simulate the random error from $N(0,1)$ and the design matrix $X$ from the multivariate normal $MN(0,\Sigma)$, where $\Sigma = AR(0.5)$. The parameter in the baseline functions is set as $\gamma = (1,1,0,\cdots,0)^T$, and the parameter of interest $\beta_0$ has $s$ nonzero elements with value 1 or 1.5. The treatment main effect is set to zero.
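A minimal sketch of this data-generating step is given below, assuming an AR(0.5) correlation structure for $X$, a completely randomized treatment with $\pi = 0.5$, and a common intensity (1 or 1.5) for the $s$ nonzero interaction coefficients; the randomization probability and other defaults are illustrative assumptions.

```python
import numpy as np

def simulate(case, n=200, p=300, s=5, intensity=1.0, pi=0.5, seed=None):
    """Generate (Y, A, X, beta0) for simulation Cases 1-3."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(0.5)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X_tilde = np.column_stack([np.ones(n), X])
    gamma = np.zeros(p); gamma[:2] = 1.0                 # gamma = (1, 1, 0, ..., 0)'
    beta0 = np.zeros(p + 1); beta0[1:s + 1] = intensity  # zero treatment main effect
    A = rng.binomial(1, pi, size=n)
    eps = rng.standard_normal(n)
    xg = X @ gamma
    if case == 1:
        baseline = 1 + xg
    elif case == 2:
        baseline = 1 + 0.5 * (1 + xg) ** 2
    else:
        baseline = 1 + 1.5 * np.sin(np.pi * xg) + X[:, 0] ** 2
    Y = baseline + A * (X_tilde @ beta0) + eps
    return Y, A, X, beta0
```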

We apply our de-sparsified estimator $\hat{b}$ in (4) with $\hat{\mu}_\pi(\tilde{X}_i) = \hat{\gamma}^T\tilde{X}_i$ and derive confidence intervals by (10). To evaluate the finite-sample performance of our method, we report the following measures. Denote $CI_j$ as the 95% confidence interval for $\beta_{0,j}$ and $S_{true} = \{j \in \{2,\ldots,p+1\}: \beta_{0,j} \ne 0\}$.

  • MAB($\hat{b}$): the mean absolute bias of $\hat{b}$ for relevant variables, calculated as $s^{-1}\sum_{j \in S_{true}} |\hat{b}_j - \beta_{0,j}|$.

  • MAB($\hat{\beta}$): the mean absolute bias of the Lasso solution for relevant variables, calculated as $s^{-1}\sum_{j \in S_{true}} |\hat{\beta}_j - \beta_{0,j}|$.

  • Coverage for noise variables: the empirical value of $(p - s)^{-1}\sum_{j \in S_{true}^c} P[0 \in CI_j]$.

  • Coverage for relevant variables: the empirical value of $s^{-1}\sum_{j \in S_{true}} P[\beta_{0,j} \in CI_j]$.

  • Length of CI for noise variables: the empirical value of $(p - s)^{-1}\sum_{j \in S_{true}^c} \mathrm{length}(CI_j)$.

  • Length of CI for relevant variables: the empirical value of $s^{-1}\sum_{j \in S_{true}} \mathrm{length}(CI_j)$.

Tables 1–2 summarize the performance measures for Cases 1–3 from 100 simulations with n = 200, p = 300, and s = 5 or 10. The mean absolute bias for relevant variables of our estimator $\hat{b}$ is significantly smaller than that of the Lasso solution in all settings. The coverages of the confidence intervals are fairly consistent across Cases 1–3, concurring with the theoretical results on the robustness of the method in Corollary 2.6. We notice that the coverage of the confidence intervals for noise variables is very close to the nominal level of 0.95, but that for the relevant variables is lower than 0.95. Similar phenomena have been observed for the original de-sparsified estimator in [20] and [4]. This indicates that even though the bias of the original Lasso estimator has been corrected asymptotically by the de-sparsifying procedure, the nonzero coefficients can still be underestimated in finite samples. Further, given the implemented $\hat{\mu}_\pi(\tilde{X}_i) = \hat{\gamma}^T\tilde{X}_i$, it agrees with our expectation that the lengths of the confidence intervals are shortest in Case 1, where the true baseline function is linear, and increase in Cases 2 and 3, which have nonlinear baseline functions.

Table 1.

Coverage of confidence interval and mean absolute bias (MAB) of b^ with p = 300 and s = 5. Standard errors are in parentheses.

                                     Intensity = 1                          Intensity = 1.5
                                     Case 1       Case 2       Case 3       Case 1       Case 2       Case 3
MAB(b^)                              0.17(0.06)   0.40(0.14)   0.30(0.10)   0.17(0.06)   0.41(0.15)   0.30(0.11)
MAB(β^)                              0.23(0.06)   0.50(0.14)   0.39(0.11)   0.23(0.06)   0.53(0.18)   0.40(0.13)
Coverage for noise variables         0.95(0.01)   0.95(0.01)   0.95(0.03)   0.95(0.01)   0.94(0.05)   0.95(0.01)
Coverage for relevant variables      0.88(0.13)   0.89(0.15)   0.85(0.18)   0.88(0.13)   0.86(0.20)   0.86(0.17)
Length of CI for noise variables     0.65(0.04)   1.50(0.17)   1.21(0.91)   0.65(0.04)   1.50(0.17)   1.12(0.12)
Length of CI for relevant variables  0.65(0.04)   1.49(0.16)   1.20(0.92)   0.65(0.04)   1.49(0.16)   1.11(0.11)

Table 2.

Coverage of confidence interval and mean absolute bias (MAB) of b^ with p = 300 and s = 10. Standard errors are in parentheses.

                                     Intensity = 1                          Intensity = 1.5
                                     Case 1       Case 2       Case 3       Case 1       Case 2       Case 3
MAB(b^)                              0.19(0.04)   0.42(0.10)   0.33(0.08)   0.19(0.04)   0.46(0.27)   0.43(0.96)
MAB(β^)                              0.24(0.06)   0.51(0.10)   0.40(0.09)   0.24(0.06)   0.57(0.27)   0.52(0.96)
Coverage for noise variables         0.95(0.03)   0.94(0.04)   0.95(0.03)   0.94(0.05)   0.94(0.04)   0.94(0.05)
Coverage for relevant variables      0.86(0.14)   0.86(0.16)   0.84(0.14)   0.84(0.18)   0.84(0.16)   0.82(0.18)
Length of CI for noise variables     0.69(0.05)   1.58(0.18)   1.19(0.13)   0.69(0.05)   1.64(0.46)   1.23(0.47)
Length of CI for relevant variables  0.69(0.05)   1.58(0.18)   1.18(0.13)   0.69(0.05)   1.63(0.45)   1.22(0.47)

4. Application to STAR*D study

We consider the dataset from the multi-site, randomized, multi-step STAR*D (Sequenced Treatment Alternatives to Relieve Depression) study of major depressive disorder (MDD), a common chronic and recurrent disorder [5, 14]. In the STAR*D study, patients received citalopram (CIT), a selective serotonin reuptake inhibitor antidepressant, at Level 1. Patients who had an unsatisfactory outcome at Level 1 were included in Level 2 and randomly received one of two treatment switch options: sertraline (SER) or bupropion (BUP). Patients who received treatment at Level 2 but did not show sufficient improvement were randomized at Level 2A.

Our data contain 319 patients who received treatment switch options at Level 2. Among them, 48% of the patients received BUP and 52% received SER. In addition to the treatment indicator, 308 covariates of the patients are included. The outcome of interest is the 16-item Quick Inventory of Depressive Symptomatology-Self-Report (QIDS-SR16) score. Similar to existing studies on STAR*D [15], we transform the outcome to be the negative of the original outcome so that larger values indicate better outcomes. We are interested in making the optimal treatment decision between SER and BUP to maximize the mean outcome.

We apply the de-sparsified estimator $\hat{b}$ in (4) with $\hat{\mu}_\pi(\tilde{X}_i) = \hat{\gamma}^T\tilde{X}_i$ and derive p-values based on the limiting distribution in Corollary 2.6. The top-ranked p-values that are less than 0.05 are presented in Table 3 together with the corresponding covariates.

Table 3.

Top-ranked p-values and the corresponding covariates that are most relevant for making treatment decision.

Index Covariate p-value b^ se of b^
1 Qccur_r_rate 0.0056 −1.84 0.66
2 URNONE 0.0079 2.35 0.88
3 NVTRM 0.0097 −1.45 0.56
4 IMPWR 0.010 −1.47 0.57
5 hWL 0.011 −1.58 0.62
6 GLT2W 0.013 −1.52 0.61
7 DSMTD 0.017 −1.50 0.63
8 IMSPY 0.019 −1.36 0.58
9 hMNIN 0.020 1.40 0.60
10 EARNG 0.028 −1.43 0.65
11 URPN 0.029 −1.23 0.56
12 PETLK 0.039 −1.18 0.57
13 NVCRD 0.041 1.17 0.57
14 EMSTU 0.046 −1.39 0.69

The meanings of the notations in Table 3 are as follows. Qccur_r_rate: QIDS-C score changing rate. URNONE: no symptoms in the patient's urination category. NVTRM: tremors. IMPWR: whether the patient thought he or she had special powers. hWL: hRS weight loss. DSMTD: recurrent thoughts of death, recurrent suicidal ideation, or a suicide attempt. hMNIN: hRS middle insomnia. EMSTU: whether the patient worried a lot about doing something that would make people think he or she was stupid or foolish.

We have also derived the 95% confidence intervals by (10) for the interaction effects between all covariates and treatment options. The confidence intervals for the top 14 interaction effects are presented in Figure 1.

Fig 1. The 95% confidence intervals of the top 14 interaction effects.

We also apply cross-validation to evaluate the effects of the selected variables in making the optimal treatment decision. Specifically, we randomly divide the data in half into a training dataset and a testing dataset. On the training dataset, we use the 14 selected variables to compute the least squares estimator of the interaction coefficients and formulate the estimated optimal treatment regime. On the corresponding testing dataset, we use the estimated value function derived from the inverse probability weighted estimator [23] to measure the performance of the estimated optimal treatment regime. For comparison, we also compute the estimated value functions for the regimes of assigning only SER and only BUP. We conduct 1000 data splittings and present the boxplot of the estimated value functions of all the treatment regimes in Figure 2. It shows that our estimated optimal treatment regime generally gives larger estimated value functions than those of assigning only SER or only BUP.
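For reference, a minimal sketch of a normalized inverse probability weighted value estimator is given below; it evaluates any regime d on a test split by reweighting patients whose observed treatment agrees with the regime's recommendation. The normalized form and the propensity value of 0.5 are illustrative assumptions rather than the exact estimator of [23].

```python
import numpy as np

def ipw_value(y, a, d, pi=0.5):
    """Normalized IPW estimate of the value of regime d (0/1 recommendations)."""
    prob = np.where(a == 1, pi, 1 - pi)      # probability of the observed treatment
    weights = (a == d) / prob                # keep patients consistent with the regime
    return np.sum(weights * y) / np.sum(weights)
```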

Fig 2. Boxplot of the estimated value functions from the cross-validation testing samples. 'optimal' denotes the estimated optimal treatment regime; 'all.BUP' the regime of assigning only BUP; 'all.SER' the regime of assigning only SER.

5. Discussion

In this paper, we have considered randomized studies with a pre-specified propensity score. The proposed method can be extended to observational studies with high-dimensional covariates, where the propensity score needs to be estimated from the data using a correctly specified model. Related work on propensity score estimation can be found in [16]. However, the derivation of the limiting distribution of our de-sparsified estimator for the treatment-covariate interaction coefficients would be much more involved and requires further investigation.

Appendix A: Appendix

A.1. Construction of Θ^

We apply the Lasso for nodewise regression to obtain a matrix $\hat{\Theta}$ such that $\hat{\Theta}\hat{\Sigma}$ is close to $I$, as in [9]. Let $\tilde{X}_{-j}$ denote the matrix obtained by removing the $j$th column of $D(A,\pi)\tilde{X}$. For each $j = 1,\ldots,p+1$, let

$$\hat{\gamma}_j = \arg\min_{\gamma \in \mathbb{R}^{p}}\Big(n^{-1}\big\|D(A,\pi)\tilde{X}_j - \tilde{X}_{-j}\gamma\big\|_2^2 + 2\lambda_j\|\gamma\|_1\Big)$$

with components $\hat{\gamma}_{j,k}$, $k = 1,\ldots,p+1$ and $k \ne j$. Further, define

$$\hat{\tau}_j^2 = n^{-1}\big\|D(A,\pi)\tilde{X}_j - \tilde{X}_{-j}\hat{\gamma}_j\big\|_2^2 + 2\lambda_j\|\hat{\gamma}_j\|_1$$

and

$$\hat{\Theta} = \mathrm{diag}\big(\hat{\tau}_1^{-2},\ldots,\hat{\tau}_{p+1}^{-2}\big)
\begin{pmatrix}
1 & -\hat{\gamma}_{1,2} & \cdots & -\hat{\gamma}_{1,p+1} \\
-\hat{\gamma}_{2,1} & 1 & \cdots & -\hat{\gamma}_{2,p+1} \\
\vdots & \vdots & \ddots & \vdots \\
-\hat{\gamma}_{p+1,1} & -\hat{\gamma}_{p+1,2} & \cdots & 1
\end{pmatrix}.$$
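The following is a minimal sketch of this construction, looping the Lasso over the columns of $Z = D(A,\pi)\tilde{X}$; as before, the variable names and the use of scikit-learn are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_precision(Z, lam):
    """Nodewise-regression estimate Theta_hat of the precision matrix of Z."""
    n, q = Z.shape                               # q = p + 1 columns
    C = np.eye(q)                                # row j will hold (1, -gamma_hat_j)
    tau2 = np.zeros(q)
    for j in range(q):
        others = np.delete(np.arange(q), j)
        # sklearn's alpha corresponds to lambda_j in the objective above.
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        fit.fit(Z[:, others], Z[:, j])           # regress column j on the rest
        gamma_j = fit.coef_
        resid = Z[:, j] - Z[:, others] @ gamma_j
        tau2[j] = resid @ resid / n + 2 * lam * np.abs(gamma_j).sum()
        C[j, others] = -gamma_j
    return C / tau2[:, None]                     # Theta_hat = diag(tau_j^{-2}) C
```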

A.2. A preliminary lemma

Define

$$\epsilon_{\hat{\mu}} = \epsilon - \big[\hat{\mu}_\pi(\tilde{X}) - \mu_\pi(\tilde{X})\big]. \tag{11}$$

Lemma A.1. Assume Conditions 1–3. Then

$$\big\|2\epsilon_{\hat{\mu}}^T D(A,\pi)\tilde{X}/n\big\|_\infty = O_p\big(\sqrt{\log p/n}\big).$$

Proof of Lemma A.1:

$$\epsilon_{\hat{\mu}}^T D(A,\pi)\tilde{X} = \big(\epsilon - [\mu_\pi^*(\tilde{X}) - \mu_\pi(\tilde{X})] - [\hat{\mu}_\pi(\tilde{X}) - \mu_\pi^*(\tilde{X})]\big)^T D(A,\pi)\tilde{X} = \epsilon^T D(A,\pi)\tilde{X} - [\mu_\pi^*(\tilde{X}) - \mu_\pi(\tilde{X})]^T D(A,\pi)\tilde{X} - [\hat{\mu}_\pi(\tilde{X}) - \mu_\pi^*(\tilde{X})]^T D(A,\pi)\tilde{X}. \tag{12}$$

For the first term of (12), it is clear by Condition 1 that

$$E\big(\epsilon^T D(A,\pi)\tilde{X}\big) = E\Big\{E\big(\epsilon^T D(A,\pi)\tilde{X} \mid A, \tilde{X}\big)\Big\} = 0.$$

Further, by the moment inequality in Chapter 6.2.2 of [1],

$$E\Big(\big\|\epsilon^T D(A,\pi)\tilde{X}\big\|_\infty^2\Big) \le 8\log(2p)\sum_{i=1}^{n}(A_i - \pi)^2\Big(\max_{1\le j\le p}|X_{ij}|\Big)^2 E(\epsilon_i^2) \le Cn\log(p),$$

where the second inequality is by Conditions 1 and 3. Then, by Markov's inequality,

$$\big\|2\epsilon^T D(A,\pi)\tilde{X}/n\big\|_\infty = O_p\big(\sqrt{\log(p)/n}\big). \tag{13}$$

Next, consider the second term of (12). Since $E\big(D(A,\pi) \mid \tilde{X}\big) = 0$,

$$E\Big(\big[\mu_\pi^*(\tilde{X}) - \mu_\pi(\tilde{X})\big]^T D(A,\pi)\tilde{X}\Big) = E\Big\{E\Big(\big[\mu_\pi^*(\tilde{X}) - \mu_\pi(\tilde{X})\big]^T D(A,\pi)\tilde{X} \mid \tilde{X}\Big)\Big\} = 0.$$

Then, arguments similar to those leading to (13), combined with Conditions 2 (b) and 3, give

$$\big\|2\big[\mu_\pi^*(\tilde{X}) - \mu_\pi(\tilde{X})\big]^T D(A,\pi)\tilde{X}/n\big\|_\infty = O_p\big(\sqrt{\log(p)/n}\big). \tag{14}$$

Finally, by Condition 2 (a), the third term of (12) satisfies

$$\big\|2\big[\hat{\mu}_\pi(\tilde{X}) - \mu_\pi^*(\tilde{X})\big]^T D(A,\pi)\tilde{X}/n\big\|_\infty = o_p\big(\sqrt{\log(p)/n}\big). \tag{15}$$

Combining (13) - (15) gives the claim in Lemma A.1.

A.3. Proof of Lemma 2.1

By the construction of β^, we have

$$\big\|Y - \hat{\mu}_\pi(\tilde{X}) - D(A,\pi)\tilde{X}\hat{\beta}\big\|_2^2/n + 2\lambda_{n,p}\|\hat{\beta}\|_1 \le \big\|Y - \hat{\mu}_\pi(\tilde{X}) - D(A,\pi)\tilde{X}\beta_0\big\|_2^2/n + 2\lambda_{n,p}\|\beta_0\|_1.$$

Recalling the definition of $\epsilon_{\hat{\mu}}$ in (11), we obtain

$$\big\|D(A,\pi)\tilde{X}(\hat{\beta} - \beta_0)\big\|_2^2/n + 2\lambda_{n,p}\|\hat{\beta}\|_1 \le 2\epsilon_{\hat{\mu}}^T D(A,\pi)\tilde{X}(\hat{\beta} - \beta_0)/n + 2\lambda_{n,p}\|\beta_0\|_1.$$

The first term on the right-hand side satisfies

$$2\epsilon_{\hat{\mu}}^T D(A,\pi)\tilde{X}(\hat{\beta} - \beta_0)/n \le \big\|2\epsilon_{\hat{\mu}}^T D(A,\pi)\tilde{X}/n\big\|_\infty\,\big\|\hat{\beta} - \beta_0\big\|_1,$$

where the order of $\big\|2\epsilon_{\hat{\mu}}^T D(A,\pi)\tilde{X}/n\big\|_\infty$ is derived in Lemma A.1.

The rest of the proof is similar to the proof of the second claim of Lemma 2 in [2], where it is shown that given conditions 3 – 5, the compatibility condition holds with probability tending to 1. Then using Lemma A.1 and similar arguments in Section 6.2.2 of [1], the inequality in (5) holds.

A.4. Proof of Lemma 2.2

Consider (7) first. Recall

$$\sqrt{n}H\Delta_2 = \frac{1}{\sqrt{n}}H\hat{\Theta}\{D(A,\pi)\tilde{X}\}^T\big\{\hat{\mu}_\pi(\tilde{X}) - \mu_\pi^*(\tilde{X})\big\}.$$

Rewrite $\sqrt{n}H\Delta_2$ as

$$\sqrt{n}H\Delta_2 = H\hat{\Theta}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(A_i - \pi)\tilde{X}_i\big\{\hat{\mu}_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big\} \le \sup_{\tilde{X}_i}\big|\hat{\mu}_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big|\cdot H\hat{\Theta}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(A_i - \pi)\tilde{X}_i.$$

Since $E[(A_i - \pi)\tilde{X}_i] = 0$ and $E[(A_i - \pi)^2\tilde{X}_i\tilde{X}_i^T] < \infty$ by Condition 3, the multivariate central limit theorem implies that the $(p+1)$-vector

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(A_i - \pi)\tilde{X}_i = O_p(1).$$

On the other hand, Condition 7 combined with arguments similar to those in the proof of the first claim of Lemma 2 in [2] gives $\|\hat{\Theta}_j - \Theta_j\|_1 = o_p(1/\sqrt{\log p})$. Summarizing the above with Condition 6 and Slutsky's theorem gives

$$H\hat{\Theta}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(A_i - \pi)\tilde{X}_i = O_p(1). \tag{16}$$

(7) follows from (16) and Condition 2 (a).

Next, consider (6). We have

$$\sqrt{n}H\Delta_1 = \sqrt{n}H(\hat{\Theta}\hat{\Sigma} - I)(\hat{\beta} - \beta_0) \le \sqrt{n}\,\big\|H(\hat{\Theta}\hat{\Sigma} - I)\big\|_\infty\big\|\hat{\beta} - \beta_0\big\|_1.$$

By Condition 6, $\|H(\hat{\Theta}\hat{\Sigma} - I)\|_\infty \le C\|\hat{\Theta}\hat{\Sigma} - I\|_\infty$. Arguments similar to those leading to inequality (10) in [20], combined with Conditions 3 and 7, give $\|\hat{\Theta}\hat{\Sigma} - I\|_\infty = O_p(\sqrt{\log p/n})$. Combining the above with Lemma 2.1 gives (6).

A.5. Proof of Lemma 2.3

We first show that $E(\sqrt{n}H\eta) = 0$. By the definition of $\eta$, we have

$$E(\sqrt{n}H\eta) = E\Big[\frac{1}{\sqrt{n}}H\hat{\Theta}\{D(A,\pi)\tilde{X}\}^T\big(Y - \mu_\pi^*(\tilde{X}) - D(A,\pi)\tilde{X}\beta_0\big)\Big] = \frac{1}{\sqrt{n}}HE\big[\Theta\{D(A,\pi)\tilde{X}\}^T\epsilon\big] + \frac{1}{\sqrt{n}}HE\big[\Theta\{D(A,\pi)\tilde{X}\}^T\big(\mu_\pi(\tilde{X}) - \mu_\pi^*(\tilde{X})\big)\big].$$

The first term above equals 0 because $E(\epsilon \mid \tilde{X}, A) = 0$ by Condition 1. The second term also equals 0 since $E\big(D(A,\pi) \mid \tilde{X}\big) = 0$.

Next, rewrite $\sqrt{n}H\eta$ as

$$\sqrt{n}H\eta = H\hat{\Theta}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}W_i,$$

where

$$W_i = (A_i - \pi)\big(Y_i - \mu_\pi^*(\tilde{X}_i) - (A_i - \pi)\beta_0^T\tilde{X}_i\big)\tilde{X}_i.$$

The second moment of $W_i$ is

$$E\big[(A_i - \pi)^2\big(\epsilon_i + \mu_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big)^2\tilde{X}_i\tilde{X}_i^T\big] = E\big[(A_i - \pi)^2\epsilon_i^2\tilde{X}_i\tilde{X}_i^T\big] + E\big[(A_i - \pi)^2\big(\mu_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big)^2\tilde{X}_i\tilde{X}_i^T\big],$$

where the cross term vanishes because $E(\epsilon_i \mid A_i, \tilde{X}_i) = 0$ by Condition 1.

The first term equals

$$E\big[(A_i - \pi)^2\epsilon_i^2\tilde{X}_i\tilde{X}_i^T\big] = E\big[(A_i - \pi)^2E\big(\epsilon_i^2 \mid A_i, \tilde{X}_i\big)\tilde{X}_i\tilde{X}_i^T\big] = E\big[(A_i - \pi)^2\sigma_\epsilon^2(A_i,\tilde{X}_i)\tilde{X}_i\tilde{X}_i^T\big].$$

The second term equals

$$E\big[(A_i - \pi)^2\big(\mu_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big)^2\tilde{X}_i\tilde{X}_i^T\big] = E\big[E\big((A_i - \pi)^2 \mid \tilde{X}_i\big)\big(\mu_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big)^2\tilde{X}_i\tilde{X}_i^T\big] = \pi(1 - \pi)E\big[\big(\mu_\pi(\tilde{X}_i) - \mu_\pi^*(\tilde{X}_i)\big)^2\tilde{X}_i\tilde{X}_i^T\big].$$

Note that $\sigma_\epsilon^2(A_i,\tilde{X}_i) < \infty$ by Condition 1. This, combined with Conditions 2 (b) and 3, gives $\|V\|_\infty < \infty$. Consequently, $\|H\Theta V\Theta^TH^T\|_\infty < \infty$ is implied by Conditions 6–7.

On the other hand, Condition 7 combined with arguments similar to those in the proof of the first claim of Lemma 2 in [2] gives $\|\hat{\Theta}_j - \Theta_j\|_1 = o_p(1/\sqrt{\log p})$. By the multivariate central limit theorem and Slutsky's theorem,

$$H\hat{\Theta}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}W_i \to_d N\big(0, H\Theta V\Theta^TH^T\big).$$

The result in (8) follows.

A.6. Proof of Corollary 2.6

After obtaining $\hat{\gamma}$ and $\hat{\beta}$ from (9), implement $\hat{\mu}_\pi(\tilde{X}_i) = \hat{\gamma}^T\tilde{X}_i$. Define

$$\beta^* = \arg\min_{\beta}\Big\{\big\|Y - \tilde{X}\hat{\gamma} - D(A,\pi)\tilde{X}\beta\big\|_2^2/n + 2\lambda_{n,p}\|\beta\|_1\Big\}.$$

First, we show that $\hat{\beta} = \beta^*$. By the construction of $\beta^*$,

$$\big\|Y - \tilde{X}\hat{\gamma} - D(A,\pi)\tilde{X}\beta^*\big\|_2^2/n + 2\lambda_{n,p}\|\beta^*\|_1 \le \big\|Y - \tilde{X}\hat{\gamma} - D(A,\pi)\tilde{X}\hat{\beta}\big\|_2^2/n + 2\lambda_{n,p}\|\hat{\beta}\|_1. \tag{17}$$

On the other hand, the construction of $(\hat{\gamma}, \hat{\beta})$ in (9) implies

$$\big\|Y - \tilde{X}\hat{\gamma} - D(A,\pi)\tilde{X}\hat{\beta}\big\|_2^2/n + 2\lambda_{n,p}\big(\|\hat{\gamma}\|_1 + \|\hat{\beta}\|_1\big) \le \big\|Y - \tilde{X}\hat{\gamma} - D(A,\pi)\tilde{X}\beta^*\big\|_2^2/n + 2\lambda_{n,p}\big(\|\hat{\gamma}\|_1 + \|\beta^*\|_1\big) \le \big\|Y - \tilde{X}\hat{\gamma} - D(A,\pi)\tilde{X}\hat{\beta}\big\|_2^2/n + 2\lambda_{n,p}\big(\|\hat{\gamma}\|_1 + \|\hat{\beta}\|_1\big),$$

where the second inequality is by (17). Therefore, by the uniqueness of the solution of the convex optimization problem, $\hat{\beta} = \beta^*$.

Second, we show that

$$\|\hat{\gamma} - \gamma^*\|_1 + \|\hat{\beta} - \beta_0\|_1 = o_p(1/\sqrt{\log p}). \tag{18}$$

Given the fact that $\mu_\pi(\tilde{X}_i)$ and $(A_i - \pi)\beta_0^T\tilde{X}_i$ are orthogonal and $E(\epsilon_i \mid \tilde{X}) = 0$ from Condition 1, the definition of $\gamma^*$ implies

$$(\gamma^*, \beta_0) = \arg\min_{\gamma,\beta}\Big\{E\big(Y_i - \gamma^T\tilde{X}_i - (A_i - \pi)\beta^T\tilde{X}_i\big)^2\Big\}.$$

Define $\xi_i = Y_i - \gamma^{*T}\tilde{X}_i - (A_i - \pi)\beta_0^T\tilde{X}_i = \epsilon_i + \mu_\pi(\tilde{X}_i) - \gamma^{*T}\tilde{X}_i$. Then Conditions 1 and 9 (b) imply

$$E(\xi_i^2) \le C. \tag{19}$$

By the second claim of Lemma 2 of [2], (19) combined with conditions 3 – 5 and 9 (a) gives (18).

Next, it is easy to see that condition 2 (a) is implied by (18) and condition 3, and condition 2 (b) holds trivially given condition 9 (b). Therefore, condition 2 is satisfied and the rest of the proof is the same as the proof of Theorem 2.4.

Footnotes

MSC 2010 subject classifications: Primary 62J05, 62F35; secondary 62P10.

References

  • [1] Bühlmann P and van de Geer S (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
  • [2] Bühlmann P and van de Geer S (2015). High-dimensional inference in misspecified linear models. Electronic Journal of Statistics 9, 1449–1473.
  • [3] Chakraborty B, Murphy SA, and Strecher VJ (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research 19(3), 317–343.
  • [4] Dezeure R, Bühlmann P, Meier L, and Meinshausen N (2015). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science 30(4), 533–558.
  • [5] Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, et al. (2003). Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Psychiatric Clinics of North America 26(2), 457–494.
  • [6] Goldberg Y, Song R, and Kosorok MR (2013). Adaptive Q-learning. Institute of Mathematical Statistics Collections 9, 150–162.
  • [7] Laber EB, Linn KA, and Stefanski LA (2014). Interactive model building for Q-learning. Biometrika 101, 831–847.
  • [8] Lu W, Zhang HH, and Zeng D (2013). Variable selection for optimal treatment decision. Statistical Methods in Medical Research 22(5), 493–504.
  • [9] Meinshausen N and Bühlmann P (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics 34(3), 1436–1462.
  • [10] Murphy S (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B 65, 331–366.
  • [11] Murphy S (2005). A generalization error for Q-learning. Journal of Machine Learning Research 6, 1073–1097.
  • [12] Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. The Annals of Statistics 39, 1180–1210.
  • [13] Robins J, Hernan MA, and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560.
  • [14] Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM, et al. (2004). Sequenced Treatment Alternatives to Relieve Depression (STAR*D): rationale and design. Controlled Clinical Trials 25(1), 119–142.
  • [15] Shi C, Fan A, Song R, and Lu W (2018). High-dimensional A-learning for optimal dynamic treatment regimes. The Annals of Statistics 46(3), 925–957.
  • [16] Shi C, Song R, and Lu W (2016). Robust learning for optimal treatment decision with NP-dimensionality. Electronic Journal of Statistics 10, 2894–2921.
  • [17] Song R, Wang W, Zeng D, and Kosorok MR (2015). Penalized Q-learning for dynamic treatment regimens. Statistica Sinica 25, 901–920.
  • [18] Su X, Tsai C-L, Wang H, Nickerson DM, and Li B (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research 10, 141–158.
  • [19] Tian L, Alizadeh AA, Gentles AJ, and Tibshirani R (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109, 1517–1532.
  • [20] van de Geer S, Bühlmann P, Ritov Y, and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42(3), 1166–1202.
  • [21] Watkins CJCH and Dayan P (1992). Q-learning. Machine Learning 8, 279–292.
  • [22] Zhang B, Tsiatis AA, Davidian M, Zhang M, and Laber EB (2012). Estimating optimal treatment regimes from a classification perspective. Stat 1, 103–104.
  • [23] Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.
  • [24] Zhang C and Zhang SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B 76(1), 217–242.
  • [25] Zhao Y, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.
  • [26] Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, and Kosorok MR (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102, 151–168.
