Improved doubly robust estimation in learning optimal individualized treatment rules

Yinghao Pan; Ying-Qi Zhao

doi:10.1080/01621459.2020.1725522

Author manuscript; available in PMC: 2022 Jan 1.

Published in final edited form as: J Am Stat Assoc. 2020 Sep 8;116(533):283–294. doi: 10.1080/01621459.2020.1725522


PMCID: PMC8132732 NIHMSID: NIHMS1578951 PMID: 34024961

Abstract

Individualized treatment rules (ITRs) recommend treatment according to patient characteristics. There is a growing interest in developing novel and efficient statistical methods for constructing ITRs. We propose an improved doubly robust estimator of the optimal ITRs. The proposed estimator is based on a direct optimization of an augmented inverse-probability weighted estimator (AIPWE) of the expected clinical outcome over a class of ITRs. The method enjoys two key properties. First, it is doubly robust, meaning that the proposed estimator is consistent when either the propensity score or the outcome model is correct. Second, it achieves the smallest variance among the class of doubly robust estimators when the propensity score model is correctly specified, regardless of the specification of the outcome model. Simulation studies show that the estimated ITRs obtained from our method yield better results than those obtained from current popular methods. Data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study are analyzed as an illustrative example.

Keywords: Double robustness, Individualized treatment rule, Personalized medicine, Propensity score

1. Introduction

In recent years, personalized medicine, or precision medicine, has received tremendous attention in clinical practice and medical research (Hamburg and Collins, 2010; Chan and Ginsburg, 2011; Collins and Varmus, 2015). Its development originates from the fact that patients often exhibit heterogeneous responses to treatments. A drug that works for the majority of individuals may not work for a subgroup of patients with certain characteristics. For example, trastuzumab is shown to be effective for treating HER2-overexpressing metastatic breast cancer as it is specifically designed to target HER2 amplification (Vogel et al., 2002). Individualized treatment rules (ITRs) formalize personalized treatment decisions: they recommend treatments using patients’ own information, and the optimal ITR maximizes the mean of a pre-specified clinical outcome if followed by the patient population.

Using data collected from clinical trials or observational studies, numerous methods have been developed for estimating optimal ITRs. One approach is to fully or partly specify a model of the clinical outcome given treatment and covariates, and then use the fitted model to infer the optimal ITR. This includes Q-learning (Qian and Murphy, 2011) and A-learning (Murphy, 2003; Robins, 2004; Blatt et al., 2004). Q-learning models the conditional mean of the outcome given treatment and covariates while A-learning directly models the differential treatment effects between treatments. However, one drawback of Q- and A-learning is that the optimal treatment rule is indirectly estimated through posited regression models, and is thus sensitive to model misspecification. Value-search or direct-maximization methods offer an alternative to regression-based methods by directly maximizing an estimator of the marginal mean outcome over a pre-specified class of ITRs (Zhao et al., 2012; Zhang et al., 2012; Zhou et al., 2017; Zhao et al., 2019), thereby separating the class of decision rules from the posited regression models.

In particular, Zhang et al. (2012) estimated the optimal ITR by maximizing an augmented inverse-probability weighted estimator (AIPWE) for the population mean outcome over a class of ITRs. The aforementioned estimator is doubly robust (DR) in the sense that it consistently estimates the optimal ITR if either the propensity score or the outcome regression model is correctly specified. Doubly robust estimation has enjoyed great popularity in missing data and causal inference models (Scharfstein et al., 1999; Robins and Rotnitzky, 2001; Van der Laan and Robins, 2003; Bang and Robins, 2005). The DR estimators require specification of two nuisance working models, one for the missingness or treatment assignment mechanism, and another one for the distribution of complete data or potential outcomes. Historically, estimation of the nuisance parameters indexing the working models in DR estimators had received little attention, partly because the asymptotic properties of the DR estimators do not depend on the choice of nuisance parameter estimates when both working models are correctly specified (Tsiatis, 2007). As a result, standard maximum likelihood estimators were typically used, i.e., logistic regression for the propensity score model, and linear regression for the outcome model. This standard practice began to change after Kang and Schafer (2007) cautioned against the use of DR estimators when both working models are misspecified. Several discussion articles (Robins et al., 2007; Tsiatis and Davidian, 2007; Tan, 2007) further pointed out that the choice of nuisance parameter estimates can have a dramatic impact on the properties of the DR estimators when at least one working model is misspecified. Indeed, in the context of estimating optimal ITRs using DR methods, there is still room for improved performance. For example, as illustrated in simulation studies of Zhao et al. (2019), the usual DR estimator (Zhang et al., 2012) can be inefficient, i.e., exhibit a large variation, when the outcome regression model is misspecified. The poor performance may be partly a consequence of the default use of maximum likelihood estimators for the coefficients in the misspecified outcome regression model (Cao et al., 2009). This motivates us to develop improved DR approaches for learning optimal ITRs.

Several improved DR estimators have been proposed in missing data and causal inference models for the purpose of variance reduction. The nuisance parameters indexing the outcome model are estimated so as to minimize the variance of the DR estimator under a correctly specified propensity score model (Rubin and van der Laan, 2008; Cao et al., 2009; Tan, 2010; Tsiatis et al., 2011). In this article, we propose to estimate the optimal ITR by maximizing an improved DR estimator of the population mean outcome among a set of ITRs. Our proposed estimator is doubly robust. In addition, it achieves the smallest variance among its class of DR estimators when the propensity score models are correctly specified, regardless of the specification of the outcome models. As we demonstrate, this approach leads to estimated optimal regimes achieving comparable or better performance than those from Zhang et al. (2012).

The heterogeneity in response to treatments exists not only between patients but also within each patient. A patient’s response to treatment can change over time because individual characteristics, and the nature of disease itself, evolve. This motivates the development of dynamic treatment regimes (DTRs) (Murphy, 2003), which are sequential decision rules that adapt over time to the clinical status of each patient. At each decision point, the available patient history data are used as input for the decision rule, and an individualized treatment is recommended for the next stage. Construction of optimal DTRs has been of great interest, and several methods have been developed to handle multi-stage problems (Zhang et al., 2013; Laber et al., 2014; Schulte et al., 2014; Zhao et al., 2015; Wallace and Moodie, 2015; Liu et al., 2018). In this paper, we also discuss extending the proposed method to estimate optimal DTRs with added efficiency and robustness.

This article makes two major contributions to the literature. (1) We propose improved DR approaches for estimating optimal ITRs, which have not been investigated in the field of personalized medicine. (2) Current literature such as Cao et al. (2009) and Tsiatis et al. (2011) employed inverse-probability weighted estimating equations to estimate the nuisance parameters. Instead, we propose augmented inverse-probability weighted estimating equations for this purpose, which bring further stability.

The remainder of the article is organized as follows. In Section 2, we introduce background information and review existing doubly robust estimators in learning optimal ITRs. We then formally describe the proposed improved doubly robust estimator in single-stage optimal treatment problems. Theoretical results are presented in Section 3. In Section 4, we present simulation studies to evaluate finite sample performance of the proposed method. The method is then illustrated using data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) Study in Section 5. Some concluding remarks are given in Section 6. Technical results are relegated to the supplementary material.

2. Method

2.1. Background and preliminaries

We consider the estimation of the optimal ITR in the single-stage setting. We observe ${(X_{i}, A_{i}, Y_{i})}_{i = 1}^{n}$ , comprising n independent and identically distributed triplets of $(X, A, Y)$ , where $X \in \mathcal{X}$ denotes the patient’s baseline variables; $A \in \mathcal{A} = {- 1, 1}$ denotes the assigned treatment; and Y denotes the clinical outcome of interest, coded so that larger values are better. The data come from either randomized trials or observational studies. An ITR is a map $d : \mathcal{X} \mapsto \mathcal{A}$ such that a patient presenting with $X = x$ will receive treatment $d (x)$ .

Let $D$ denote a class of ITRs of interest. To formally define the optimal ITR, $d^{opt}$ , we adopt the potential outcome framework (Rubin, 1974). Let Y(a) denote the potential outcome under treatment $a \in {- 1, 1}$ . The potential outcome under any ITR, d, can be defined as $Y (d) = Y (1) I {d (X) = 1} + Y (- 1) I {d (X) = - 1}$ , where $I {\cdot}$ is the indicator function. Here we suppress the dependence of Y(d) on X. The performance of d is measured by the marginal mean outcome $V (d) ≜ E {Y (d)}$ , the so-called value function associated with the rule d. In other words, the value function V(d) represents the overall population mean if treatment were to be assigned according to d. The optimal ITR, $d^{opt}$ , is a rule that maximizes V(d) among $D$ , i.e., $V (d^{opt}) \geq V (d)$ for all $d \in D$ .

In order to connect the potential outcomes with the observed data, we make the following assumptions: (i) consistency, $Y = Y (1) I (A = 1) + Y (- 1) I (A = - 1)$ ; (ii) positivity, $P (A = a | X) > 0$ for $a = \pm 1$ and for all X; (iii) no unmeasured confounding, $A ⊥ {Y (- 1), Y (1)} | X$ . These are standard and well-studied assumptions in causal inference (Imbens and Rubin, 2015). Assumption (iii) is trivially satisfied in a randomized trial but unverifiable in an observational study (Robins et al., 2000).

Define $Q_{0} (x, a) ≜ E (Y | X = x, A = a)$ , then under the aforementioned assumptions, it can be shown that

V (d) = E_{X} [Q_{0} (X, 1) I {d (X) = 1} + Q_{0} (X, - 1) I {d (X) = - 1}],

where the outer expectation $E_{X} [\cdot]$ is taken with respect to the marginal distribution of X. The above formulation implies that $d^{opt} (x) = \arg \max_{a \in {- 1, 1}} Q_{0} (x, a)$ . One approach is to posit a regression model $Q (X, A; β)$ for $Q_{0} (X, A)$ , and estimate the nuisance parameter β by some estimator $\hat{β}$ , e.g., least squares. Subsequently the optimal ITR is estimated by $\hat{d} (x) = \arg \max_{a \in {- 1, 1}} Q (x, a; \hat{β})$ (Qian and Murphy, 2011). This is usually referred to as an indirect approach, which could lead to inconsistent estimators of $d^{opt}$ when the posited model Q(X, A; β) is incorrect.

To alleviate the above issue, value-search or direct-maximization methods attempt to estimate $d^{opt}$ by directly maximizing an estimator of the value function over the class $D$ . The key step is to construct a consistent and robust estimator of the value function, say $\hat{V} (\cdot)$ . Then $d^{opt}$ is estimated by $\hat{d} = \arg \max_{d \in D} \hat{V} (d)$ . Let $π_{0} (a, x) ≜ P (A = a | X = x)$ denote the true propensity score, so the value function can be rewritten as (Qian and Murphy, 2011; Zhao et al., 2012)

V (d) = E [\frac{Y}{π_{0} (A, X)} I {A = d (X)}] .

In an observational study, $π_{0} (A, X)$ is unknown. A parametric model $π (A, X; γ)$ may be posited; for example, a logistic regression model $π (1, X; γ) = {1 + \exp (- \tilde{X}^{⊤} γ)}^{- 1}$ , where $\tilde{X} = {(1, X^{⊤})}^{⊤}$ . Let $\hat{γ}$ denote the maximum likelihood estimator for $γ$ based on ${(A_{i}, X_{i})}_{i = 1}^{n}$ . An inverse-probability weighted estimator (IPWE) for V(d) is then

{\hat{V}}^{IPWE} (d; \hat{γ}) ≜ ℙ_{n} [\frac{Y}{π (A, X; \hat{γ})} I {A = d (X)}],

where $ℙ_{n}$ is the empirical measure. It is straightforward to show that the IPWE is consistent for V(d) if $π (A, X; γ)$ is correctly specified, that is, $π_{0} (A, X) = π (A, X; γ_{0})$ for some γ₀, but may not be otherwise.
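The IPWE above can be illustrated with a minimal Python sketch. The function name `ipwe_value`, the callables `rule` and `propensity`, and the toy data are assumptions introduced here for illustration only; the propensity is taken as known and equal to 0.5, as in a balanced randomized trial, and the covariate is scalar.

```python
def ipwe_value(X, A, Y, rule, propensity):
    """IPWE of V(d): sample mean of Y * I{A = d(X)} / pi(A, X)."""
    total = 0.0
    for x, a, y in zip(X, A, Y):
        if a == rule(x):                  # only subjects with A = d(X) contribute
            total += y / propensity(a, x)
    return total / len(Y)


# Hypothetical toy data: scalar covariate, treatments coded -1 / 1.
X = [0.2, -1.0, 0.7, -0.3]
A = [1, -1, -1, 1]
Y = [3.0, 1.0, 2.0, 0.5]

rule = lambda x: 1 if x > 0 else -1       # candidate ITR d(x) = sign(x)
propensity = lambda a, x: 0.5             # assumed known, P(A = a | X) = 0.5

value = ipwe_value(X, A, Y, rule, propensity)
print(value)  # 2.0: only the first two subjects follow the rule
```

Subjects whose assigned treatment disagrees with the rule receive weight zero, which is the source of the instability discussed later when few subjects satisfy $A = d (X)$ .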

Following ideas from Robins et al. (1994), an AIPWE can be constructed:

{\hat{V}}^{AIPWE} (d; \hat{γ}, β) ≜ ℙ_{n} [\frac{Y I {A = d (X)}}{π {d (X), X; \hat{γ}}} - \frac{I {A = d (X)} - π {d (X), X; \hat{γ}}}{π {d (X), X; \hat{γ}}} Q {X, d (X); β}] .

(1)

By adding an augmentation term that involves both estimated propensity scores and regression models, the AIPWE improves efficiency and provides additional protection against model misspecification. The AIPWE is doubly robust in that it consistently estimates V(d) as long as one of the nuisance working models is correctly specified, i.e., $π (A, X; \hat{γ}) \overset{p}{\to} π_{0} (A, X)$ or $Q (X, A; β) \overset{p}{\to} Q_{0} (X, A)$ .
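A minimal Python sketch of the AIPWE in (1) follows, under the simplifying assumptions that the propensity and the outcome working model are supplied as already-fitted callables. The names `aipwe_value`, `propensity`, and `q_model`, as well as the toy data and the deliberately arbitrary working model $Q (x, a) = x$ , are hypothetical and serve only to show how the two terms of (1) combine.

```python
def aipwe_value(X, A, Y, rule, propensity, q_model):
    """AIPWE of V(d) as in (1), given fitted pi(a, x) and Q(x, a)."""
    total = 0.0
    for x, a, y in zip(X, A, Y):
        d = rule(x)
        p = propensity(d, x)              # pi{d(X), X}
        r = 1.0 if a == d else 0.0        # I{A = d(X)}
        # IPW term minus the augmentation term
        total += r * y / p - (r - p) / p * q_model(x, d)
    return total / len(Y)


# Hypothetical toy data: scalar covariate, treatments coded -1 / 1.
X = [0.2, -1.0, 0.7, -0.3]
A = [1, -1, -1, 1]
Y = [3.0, 1.0, 2.0, 0.5]

rule = lambda x: 1 if x > 0 else -1
propensity = lambda a, x: 0.5             # assumed known

# With Q identically zero the augmentation vanishes and the IPWE is recovered.
value_no_aug = aipwe_value(X, A, Y, rule, propensity, lambda x, a: 0.0)
# An arbitrary working model Q(x, a) = x changes the estimate (about 2.3 here).
value_aug = aipwe_value(X, A, Y, rule, propensity, lambda x, a: x)
```

Unlike the IPWE, subjects with $A \neq d (X)$ still contribute through the augmentation term, which is why the choice of β in $Q (X, A; β)$ affects the variance of the estimator.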

Throughout the paper, we focus on the AIPWE, and suppress the superscript ‘AIPWE’ in $\hat{V} (d; \hat{γ}, β)$ . We refer to the estimator (1), with γ estimated by maximum likelihood and β estimated by least squares, as the usual doubly robust estimator from Zhang et al. (2012). However, when the propensity score model is correctly specified, but the outcome model is not, it is inefficient to adopt the least squares estimates of β in (1), where $\hat{V} (d; \hat{γ}, β)$ could have a large variation. This motivates us to develop improved DR estimators with desirable efficiency properties.

2.2. Improved doubly robust estimators when the propensity score is fully specified

We first consider a fully specified propensity score model $π (A, X)$ , that is, one involving no nuisance parameters. We will relax this shortly. Here, the specified propensity score model $π (A, X)$ may or may not be the same as the true propensity $π_{0} (A, X)$ . For a fixed treatment regime d, the class of AIPW estimators for V(d) is

\hat{V} (d; β) ≜ ℙ_{n} [\frac{Y I {A = d (X)}}{π {d (X), X}} - \frac{I {A = d (X)} - π {d (X), X}}{π {d (X), X}} Q {X, d (X); β}] .

(2)

We use $\hat{V} (d; β)$ to emphasize that the estimator does not involve the nuisance parameters related to propensity score, and varying the choice of β leads to different DR estimators with potentially very different behaviors. In the following, we will derive an estimator for β, denoted by $β^{opt}$ , such that the resulting value function estimator $\hat{V} (d; β^{opt})$ satisfies two properties:

(i) Doubly robust. $\hat{V} (d; β^{opt})$ consistently estimates V(d) when either the propensity score or the outcome model is correctly specified.
(ii) If the propensity score is correctly specified, $\hat{V} (d; β^{opt})$ achieves the smallest asymptotic variance among all estimators of form (2), regardless of the specification of the outcome model.

Hence, when the outcome model is correctly specified, but the propensity score may not be, the desired $β^{opt}$ must converge in probability to β₀, where $Q (X, A; β_{0}) = Q_{0} (X, A)$ . On the other hand, when the propensity score is correctly specified, $Var {\hat{V} (d; β^{opt})} \leq Var {\hat{V} (d; β)}$ for any $β$ .

Lemma 1.

Let β be any root-n consistent estimator converging in probability to some $β^{*}$ , i.e., $β - β^{*} = O_{p} (n^{- 1 / 2})$ . When the propensity score is correct, $π (A, X) = π_{0} (A, X)$ , but $Q (X, A; β)$ may or may not be, the influence function for $\hat{V} (d; β)$ is

\frac{Y I {A = d (X)}}{π_{0} {d (X), X}} - \frac{I {A = d (X)} - π_{0} {d (X), X}}{π_{0} {d (X), X}} Q {X, d (X); β^{*}} - V (d) .

(3)

A proof is given in Appendix A. The preceding result shows that when the propensity score is correct, the asymptotic variance of $\hat{V} (d; β)$ does not depend on the sampling variation of β but only on its limit in probability β*. Based on Lemma 1 and the law of total variance, its asymptotic variance is proportional to

E (Var [\frac{Y I {A = d (X)}}{π_{0} {d (X), X}} - \frac{I {A = d (X)} - π_{0} {d (X), X}}{π_{0} {d (X), X}} Q {X, d (X); β^{*}} | X]) + Var (E [\frac{Y I {A = d (X)}}{π_{0} {d (X), X}} - \frac{I {A = d (X)} - π_{0} {d (X), X}}{π_{0} {d (X), X}} Q {X, d (X); β^{*}} | X]) = (I) + (I I) .

(4)

Notice that (II) in (4) equals $Var [Q_{0} {X, d (X)}]$ , which does not depend on β*. Furthermore, it can be shown that (see Appendix A for details)

(I) = E [\frac{1 - π_{0} {d (X), X}}{π_{0} {d (X), X}} Q^{2} {X, d (X); β^{*}}] + E {[\frac{Y I {A = d (X)}}{π_{0} {d (X), X}} - Q_{0} {X, d (X)}]}^{2} - 2 E [\frac{1 - π_{0} {d (X), X}}{π_{0} {d (X), X}} Q_{0} {X, d (X)} \cdot Q {X, d (X); β^{*}}]

Denote the minimizer of (4) as β^opt. By taking the derivative of (4) with respect to β* and setting it equal to zero, β^opt is the solution to

E (\frac{1 - π_{0} {d (X), X}}{π_{0} {d (X), X}} [Q_{0} {X, d (X)} - Q {X, d (X); β}] Q_{β} {X, d (X); β}) = 0,

(5)

where $Q_{β} (X, A; β) ≜ \partial Q (X, A; β) / \partial β$ .
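The link between the expression for (I) and equation (5) can be made explicit by completing the square. The display below is a sketch under the stated assumptions, using shorthand not present in the original text: $π_{0}$ for $π_{0} {d (X), X}$ , $Q_{0}$ for $Q_{0} {X, d (X)}$ , $Q^{*}$ for $Q {X, d (X); β^{*}}$ , and $σ^{2} (X) ≜ Var {Y | X, A = d (X)}$ .

```latex
\begin{aligned}
E \left[ \left( \frac{Y I \{A = d (X)\}}{π_{0}} - Q_{0} \right)^{2} \right]
  &= E \left[ \frac{σ^{2} (X)}{π_{0}} \right] + E \left[ \frac{1 - π_{0}}{π_{0}} Q_{0}^{2} \right], \\
(I) &= E \left[ \frac{σ^{2} (X)}{π_{0}} \right]
     + E \left[ \frac{1 - π_{0}}{π_{0}} \left( Q_{0} - Q^{*} \right)^{2} \right].
\end{aligned}
```

Only the second term of (I) depends on $β^{*}$ , so $β^{opt}$ can be read as a weighted least-squares projection of $Q_{0}$ onto the working model with weight $(1 - π_{0}) / π_{0}$ ; differentiating that term with respect to β gives (5).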

Hence, if the outcome model $Q (X, A; β)$ is correct; that is, $Q (X, A; β_{0}) = Q_{0} (X, A)$ for some β₀, then in fact $β^{opt} = β_{0}$ . If the outcome model is incorrect, such $β^{opt}$ still exists and minimizes (4). However, the usual least squares estimate $β^{LS}$ , which solves $ℙ_{n} [{Y - Q (X, A; β)} Q_{β} (X, A; β)] = 0$ , does not in general converge to this $β^{opt}$ . This explains why the usual DR estimator $\hat{V} (d; β^{LS})$ is sub-optimal when the outcome regression model is misspecified.

In the following, we propose two estimators of $β^{opt}$ , denoted $β^{opt 1}$ and $β^{opt 2}$ , which converge in probability to $β^{opt}$ and yield value estimators satisfying (i) and (ii) simultaneously. We first consider $β^{opt1}$ as the solution to the following inverse-probability weighted estimating equation

ℙ_{n} (\frac{I {A = d (X)}}{π {d (X), X}} \frac{1 - π {d (X), X}}{π {d (X), X}} [Y - Q {X, d (X); β}] Q_{β} {X, d (X); β}) = 0.

(6)

This can be viewed as a weighted least squares based on subjects whose treatment assignments coincide with those recommended by d, with weights $[1 - π {d (X), X}] / π^{2} {d (X), X}$ . When the propensity score is correct, but the outcome regression may not be, the left-hand side of (6) converges in probability to the left-hand side of (5), hence $β^{opt 1} \overset{p}{\to} β^{opt}$ . On the other hand, when the outcome regression is correct but the propensity score may not be, the left-hand side of (6) converges in probability to

E (\frac{π_{0} {d (X), X} [1 - π {d (X), X}]}{π^{2} {d (X), X}} [Q_{0} {X, d (X)} - Q {X, d (X); β}] Q_{β} {X, d (X); β}),

which equals 0 when $β = β_{0}$ , thus $β^{opt 1} \overset{p}{\to} β_{0}$ . The following lemma formally establishes the improved doubly robust property of the proposed estimator $\hat{V} (d; β^{opt 1})$ . See Appendix A for the proof.
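For a working model that is linear in β, equation (6) reduces to weighted normal equations, which the following Python sketch solves directly. The linear form $Q (x, a; β) = f (x, a)^{⊤} β$ , the feature map `features`, the helper `solve_linear_system`, and the toy data are all assumptions made for illustration; the original text does not restrict Q to be linear.

```python
def solve_linear_system(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = s / aug[r][r]
    return z


def beta_opt1(X, A, Y, rule, propensity, features):
    """Solve (6) for a linear working model Q(x, a; beta) = features(x, a) . beta."""
    k = len(features(X[0], rule(X[0])))
    M = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for x, a, y in zip(X, A, Y):
        d = rule(x)
        if a != d:                        # (6) uses only subjects with A = d(X)
            continue
        p = propensity(d, x)
        w = (1.0 - p) / p ** 2            # weight [1 - pi] / pi^2
        f = features(x, d)                # Q_beta for a linear model
        for i in range(k):
            b[i] += w * f[i] * y
            for j in range(k):
                M[i][j] += w * f[i] * f[j]
    return solve_linear_system(M, b)


# Hypothetical data: outcomes among rule-followers lie exactly on 1 + 2x.
X = [0.5, 1.0, -0.5, -1.0, 2.0]
A = [1, 1, -1, -1, -1]
Y = [2.0, 3.0, 0.0, -1.0, 99.0]          # last subject does not follow the rule
rule = lambda x: 1 if x > 0 else -1


def propensity(a, x):
    p1 = 0.7 if x > 0 else 0.4            # assumed P(A = 1 | X = x)
    return p1 if a == 1 else 1.0 - p1


beta = beta_opt1(X, A, Y, rule, propensity, lambda x, a: [1.0, x])
```

Because the fit uses only subjects with $A = d (X)$ , the matrix `M` can become nearly singular when few subjects follow the candidate rule, which is the practical concern that motivates the augmented estimating equation.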

Lemma 2.

$\hat{V} (d; β^{opt 1}) \overset{p}{\to} V (d)$ when either the propensity score or the outcome regression model is correctly specified. In addition, when the propensity score model is correct, $\hat{V} (d; β^{opt 1})$ achieves the smallest asymptotic variance among all estimators of form (2).

The estimating equation (6) only utilizes the subjects whose treatment assignments coincide with those recommended by d. Since we need to search for the best treatment rule in a large class of ITRs, $D$ , it is possible that for some d, there are very few subjects satisfying $A = d (X)$ . This leads to a highly unstable $β^{opt 1}$ and could be problematic, in particular when the sample size n is small. To address this issue, we propose an augmented inverse-probability weighted estimating equation, whose solution we denote by $β^{opt 2}$ :

(*) - ℙ_{n} (\frac{I {A = d (X)} - π {d (X), X}}{π {d (X), X}} \cdot \frac{1 - π {d (X), X}}{π {d (X), X}} [{\hat{Q}}_{0} {X, d (X)} - Q {X, d (X); β}] Q_{β} {X, d (X); β}) = 0 .

Here, (*) is the left hand side of (6), and ${\hat{Q}}_{0} {X, d (X)} = {\hat{Q}}_{0} (X, 1) I {d (X) = 1} + {\hat{Q}}_{0} (X, - 1) I {d (X) = - 1}$ , where ${\hat{Q}}_{0} (X, a)$ is an estimator for $E (Y | X, A = a)$ . We propose to use nonparametric techniques for obtaining ${\hat{Q}}_{0} (X, a)$ , which provides flexibility in model specification. For continuous X, we apply the kernel regression method, i.e.,

{\hat{Q}}_{0} (X, a) = \frac{\sum_{i = 1}^{n} K_{H} (X_{i} - X) I (A_{i} = a) Y_{i}}{\sum_{i = 1}^{n} K_{H} (X_{i} - X) I (A_{i} = a)},

where $K_{H} (\cdot) = | H |^{- 1 / 2} K (H^{- 1 / 2} \cdot)$ is a multivariate kernel with a bandwidth matrix H. When X contains both continuous and categorical variables, the ‘generalized product kernels’ from Racine and Li (2004) are used. Under some regularity conditions, ${\hat{Q}}_{0} {X, d (X)}$ is a consistent estimator for $Q_{0} {X, d (X)}$ . As a consequence, $β^{opt 2} \overset{p}{\to} β^{opt}$ when the propensity score is correct; and $β^{opt 2} \overset{p}{\to} β_{0}$ when the outcome regression is correct. The following lemma formally establishes the improved doubly robust property of the estimator $\hat{V} (d; β^{opt 2})$ . The technical conditions and the proofs are provided in Appendix A.
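The kernel regression estimator ${\hat{Q}}_{0} (X, a)$ can be sketched in a few lines of Python. This sketch assumes a scalar covariate, a Gaussian kernel, and a fixed user-chosen bandwidth; the function name `kernel_q0` and the toy data are hypothetical, and bandwidth selection and the mixed-data product kernels are not covered.

```python
import math


def kernel_q0(x, a, X, A, Y, bandwidth):
    """Nadaraya-Watson estimate of E(Y | X = x, A = a) for scalar X.

    Uses a Gaussian kernel; its normalizing constant cancels in the ratio.
    """
    num = 0.0
    den = 0.0
    for x_i, a_i, y_i in zip(X, A, Y):
        if a_i != a:                      # I(A_i = a): use arm a only
            continue
        k = math.exp(-0.5 * ((x_i - x) / bandwidth) ** 2)
        num += k * y_i
        den += k
    return num / den                      # assumes arm a has nearby observations


# Hypothetical toy data: three subjects in arm 1, one in arm -1.
X = [0.0, 1.0, 2.0, 1.0]
A = [1, 1, 1, -1]
Y = [1.0, 2.0, 3.0, 10.0]

q_arm_pos = kernel_q0(1.0, 1, X, A, Y, bandwidth=0.5)   # symmetric weights around x = 1
q_arm_neg = kernel_q0(1.0, -1, X, A, Y, bandwidth=0.5)  # single observation in arm -1
```

Because ${\hat{Q}}_{0}$ does not depend on the candidate rule d, it can be computed once for $a = \pm 1$ at every observed $X_{i}$ and reused across all rules visited during the search.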

Lemma 3.

$\hat{V} (d; β^{opt 2}) \overset{p}{\to} V (d)$ when either the propensity score or the outcome regression model is correctly specified. In addition, when the propensity score is correct, $\hat{V} (d; β^{opt 2})$ achieves the smallest asymptotic variance among all estimators of form (2).

2.3. Scenario where there is a nuisance parameter in the propensity score model

In practice, if the propensity scores are unknown, we can posit a parametric propensity score model $π (A, X; γ)$ involving some nuisance parameters. To construct an improved DR estimator for the value function, we must take into account the effect of estimating γ. Consider the class of AIPW estimators presented in (1). Let $\hat{γ}$ be the maximum likelihood estimator of γ based on ${(A_{i}, X_{i})}_{i = 1}^{n}$ . We aim to find $β^{opt}$ such that $\hat{V} (d; \hat{γ}, β^{opt})$ is doubly robust, and has the smallest asymptotic variance among the class of estimators (1) when the propensity score is correctly specified.

Since $\hat{γ}$ is the maximizer of the binomial likelihood

\prod_{i = 1}^{n} {π (1, X_{i}; γ)}^{I (A_{i} = 1)} {1 - π (1, X_{i}; γ)}^{I (A_{i} = - 1)},

the score vector for γ is

S_{γ} (A, X, γ) = I (A = 1) \frac{π_{γ} (1, X; γ)}{π (1, X; γ)} - I (A = - 1) \frac{π_{γ} (1, X; γ)}{1 - π (1, X; γ)},

where $π_{γ} (1, X; γ) = \partial π (1, X; γ) / \partial γ$ . When $π (A, X; γ)$ is correctly specified, i.e. $π (A, X; γ_{0}) = π_{0} (A, X)$ for some γ₀, and β converging in probability to β*, the influence functions corresponding to estimators of the form (1) have the following expression

\tilde{φ} (Y, A, X, γ_{0}, β^{*}) - Γ_{0} (β^{*}) Σ_{γ γ, 0}^{- 1} S_{γ} (A, X, γ_{0}),

(7)

where $Σ_{γ γ, 0} = E {S_{γ} (A, X, γ_{0}) S_{γ}^{⊤} (A, X, γ_{0})}, Γ_{0} (β) = - E {\partial \tilde{φ} (Y, A, X, γ_{0}, β) / \partial γ^{⊤}}$ , and $\tilde{φ} (Y, A, X, γ, β) = \frac{Y I {A = d (X)}}{π {d (X), X; γ}} - \frac{I {A = d (X)} - π {d (X), X; γ}}{π {d (X), X; γ}} Q {X, d (X); β} - V (d) .$
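For the logistic working model mentioned in Section 2.1, the score vector and the matrix $Σ_{γ γ, 0}$ take a simple closed form. The display below is a worked special case, assuming $π (1, X; γ) = {1 + \exp (- \tilde{X}^{⊤} γ)}^{- 1}$ with $\tilde{X} = {(1, X^{⊤})}^{⊤}$ ; it is not needed for the general argument.

```latex
\begin{aligned}
π_{γ} (1, X; γ) &= π (1, X; γ) \{1 - π (1, X; γ)\} \tilde{X}, \\
S_{γ} (A, X, γ) &= I (A = 1) \{1 - π (1, X; γ)\} \tilde{X} - I (A = - 1) π (1, X; γ) \tilde{X}
                 = \{I (A = 1) - π (1, X; γ)\} \tilde{X}, \\
Σ_{γ γ, 0} &= E \left[ π_{0} (1, X) \{1 - π_{0} (1, X)\} \tilde{X} \tilde{X}^{⊤} \right].
\end{aligned}
```

The second line uses $I (A = 1) + I (A = - 1) = 1$ , and the third follows from $E [{I (A = 1) - π_{0} (1, X)}^{2} | X] = π_{0} (1, X) {1 - π_{0} (1, X)}$ when the propensity score model is correctly specified.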

Compared with (3), the influence functions (7) involve an additional term due to estimation of γ. However, this additional term disappears when both models are correct. Define $Φ_{0} (β) ≜ - E {\partial^{2} \tilde{φ} (Y, A, X, γ_{0}, β) / \partial γ^{⊤} \partial β}$ . In a slight abuse of notation, denote the minimizer of the variance of (7) as $β^{opt}$ . It is the solution to

E (\frac{1 - π_{0} {d (X), X}}{π_{0} {d (X), X}} [Q_{β} {X, d (X); β} + Φ_{0} (β) Σ_{γ γ, 0}^{- 1} \frac{π_{γ} {d (X), X; γ_{0}}}{1 - π_{0} {d (X), X}}] \cdot [Q_{0} {X, d (X)} - Q {X, d (X); β} - Γ_{0} (β) Σ_{γ γ, 0}^{- 1} \frac{π_{γ} {d (X), X; γ_{0}}}{1 - π_{0} {d (X), X}}]) = 0 .

(8)

Detailed derivations of the influence function and its variance are deferred to Appendix A.

To simplify notation, we write $R ≜ I {A = d (X)}$ . Consider $β^{opt 3}$ as the solution to

ℙ_{n} (\frac{R [1 - π {d (X), X; \hat{γ}}]}{π^{2} {d (X), X; \hat{γ}}} [Q_{β} {X, d (X); β} + \hat{Φ} (β) {\hat{Σ}}_{γ γ}^{- 1} \frac{π_{γ} {d (X), X; \hat{γ}}}{1 - π {d (X), X; \hat{γ}}}] \cdot [Y - Q {X, d (X); β} - \hat{Γ} (β) {\hat{Σ}}_{γ γ}^{- 1} \frac{π_{γ} {d (X), X; \hat{γ}}}{1 - π {d (X), X; \hat{γ}}}]) = 0,

(9)

where ${\hat{Σ}}_{γ γ} = ℙ_{n} {S_{γ} (A, X, \hat{γ}) S_{γ}^{⊤} (A, X, \hat{γ})}, \hat{Γ} (β) = - ℙ_{n} {\partial \tilde{φ} (Y, A, X, \hat{γ}, β) / \partial γ^{⊤}}$ , and $\hat{Φ} (β) = - ℙ_{n} {\partial^{2} \tilde{φ} (Y, A, X, \hat{γ}, β) / \partial γ^{⊤} \partial β}$ . In Appendix A, we show that $\hat{V} (d; \hat{γ}, β^{opt 3})$ is doubly robust, and achieves the smallest asymptotic variance with $β^{opt 3} \overset{p}{\to} β^{opt}$ when the propensity score is correct.

Correspondingly, we can construct an augmented inverse-probability weighted estimating equation and consider $β^{opt4}$ as the solution to

(* *) - ℙ_{n} (\frac{[R - π {d (X), X; \hat{γ}}] [1 - π {d (X), X; \hat{γ}}]}{π^{2} {d (X), X; \hat{γ}}} \cdot [Q_{β} {X, d (X); β} + \hat{Φ} (β) {\hat{Σ}}_{γ γ}^{- 1} \cdot \frac{π_{γ} {d (X), X; \hat{γ}}}{1 - π {d (X), X; \hat{γ}}}] \cdot [{\hat{Q}}_{0} {X, d (X)} - Q {X, d (X); β} - \hat{Γ} (β) {\hat{Σ}}_{γ γ}^{- 1} \frac{π_{γ} {d (X), X; \hat{γ}}}{1 - π {d (X), X; \hat{γ}}}]) = 0,

(10)

where (**) is the left hand side of (9). Using a similar argument, $\hat{V} (d; \hat{γ}, β^{opt 4})$ satisfies (i) and (ii), and is improved doubly robust.

In the above discussion, we proposed improved DR estimators of V(d) for a fixed treatment regime d. Notice that by (8), the optimal value $β^{opt}$ is d-dependent, i.e., different d’s correspond to different $β^{opt}$ . Rigorously speaking, we should write $β^{opt} (d)$ for the optimal value and $\hat{β}^{opt} (d)$ for its estimate. To estimate $d^{opt}$ , we first compute the corresponding $\hat{β}^{opt} (d)$ and the $\hat{V} {d; \hat{γ}, \hat{β}^{opt} (d)}$ , for each $d \in D$ . We then find the optimal d among the class $D$ that leads to the largest $\hat{V} {d; \hat{γ}, \hat{β}^{opt} (d)}$ , i.e., $\hat{d} = \arg \max_{d \in D} \hat{V} {d; \hat{γ}, \hat{β}^{opt} (d)}$ . In practice, the ITR is often indexed by a set of parameters, for instance, $d (x) = sign {\tilde{x}^{⊤} η}$ , where $\tilde{x} = {(1, x^{⊤})}^{⊤}$ . Since $\hat{V} {d; \hat{γ}, \hat{β}^{opt} (d)}$ is a nonsmooth function of η, standard optimization methods can be problematic. We used a genetic algorithm discussed by Goldberg (1989), which is available in the R package rgenoud (Mebane Jr and Sekhon, 2011). In the rest of the paper, we suppress the letter d in $β^{opt} (d)$ and $\hat{β}^{opt} (d)$ when there is no confusion.
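The search over a parametric class of rules can be sketched as follows. This Python sketch replaces the genetic algorithm with a plain exhaustive grid search, which is only practical for very low-dimensional η, and it assumes a scalar covariate. The names `make_linear_rule`, `grid_search_rule`, and `value_estimator` are hypothetical. In the proposed method the value estimator would refit the nuisance parameter for each candidate rule; here a simple IPWE with known propensity 0.5 stands in for it.

```python
from itertools import product


def make_linear_rule(eta0, eta1):
    """Rule d(x) = sign(eta0 + eta1 * x), with sign(0) mapped to -1."""
    return lambda x: 1 if eta0 + eta1 * x > 0 else -1


def grid_search_rule(value_estimator, eta0_grid, eta1_grid):
    """Return the (eta0, eta1) maximizing the estimated value over a grid."""
    best_eta, best_value = None, float("-inf")
    for eta0, eta1 in product(eta0_grid, eta1_grid):
        v = value_estimator(make_linear_rule(eta0, eta1))
        if v > best_value:
            best_eta, best_value = (eta0, eta1), v
    return best_eta, best_value


# Hypothetical toy data: each covariate value observed under both treatments;
# treatment 1 helps when x > 0 and treatment -1 helps when x < 0.
X = [x for x in (-2.0, -1.0, 1.0, 2.0) for _ in (0, 1)]
A = [a for _ in range(4) for a in (-1, 1)]
Y = [2.0 if a * x > 0 else 0.0 for x, a in zip(X, A)]


def ipwe(rule):
    # Stand-in value estimator with known propensity 0.5.
    return sum(y / 0.5 for x, a, y in zip(X, A, Y) if a == rule(x)) / len(Y)


best_eta, best_value = grid_search_rule(ipwe, [-1.5, 0.0, 1.5], [-1.0, 1.0])
```

Since $sign {\tilde{x}^{⊤} η}$ is unchanged when η is multiplied by a positive constant, restricting the slope to $\pm 1$ loses no rules in this one-covariate illustration.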

3. Theoretical results

In this section, we establish asymptotic normality of the proposed estimators and the usual doubly robust estimator of V(d). We do not discuss the situation when both propensity score and outcome models are misspecified, given that the resulting estimator is not consistent for V(d). We first consider the case where propensity score is fully specified. We have the following result.

Theorem 1.

(Asymptotic normality when propensity score model is fully specified). When either the propensity score or the outcome model is correct,

\sqrt{n} {\hat{V} (d; β^{LS}) - V (d)} \overset{D}{\to} N (0, U_{1} (θ_{0}^{LS})),

\sqrt{n} {\hat{V} (d; β^{opt1}) - V (d)} \overset{D}{\to} N (0, U_{2} (θ_{0}^{opt1})),

\sqrt{n} {\hat{V} (d; β^{opt2}) - V (d)} \overset{D}{\to} N (0, U_{3} (θ_{0}^{opt2})) .

See Appendix C for detailed expressions of $U_{1} (θ), U_{2} (θ), U_{3} (θ)$ . The true parameters are $θ_{0}^{LS} = {(β_{LS}^{* ⊤}, V (d))}^{⊤}$ where $β_{LS}^{*}$ satisfies $E [Q_{β} (X, A; β_{LS}^{*}) {Y - Q (X, A; β_{LS}^{*})}] = 0$ ; $θ_{0}^{opt 1} = {({β_{opt 1}^{*}}^{⊤}, V (d))}^{⊤}$ where $β_{opt 1}^{*}$ satisfies $E (\frac{I {A = d (X)}}{π {d (X), X}} \cdot \frac{1 - π {d (X), X}}{π {d (X), X}} [Y - Q {X, d (X); β_{opt 1}^{*}}] Q_{β} {X, d (X); β_{opt 1}^{*}}) = 0$ ; and $θ_{0}^{opt 2} = {({β_{opt 2}^{*}}^{⊤}, V (d))}^{⊤}$ where $β_{opt 2}^{*}$ satisfies

E (\frac{I {A = d (X)}}{π {d (X), X}} \cdot \frac{1 - π {d (X), X}}{π {d (X), X}} [Y - Q {X, d (X); β_{opt 2}^{*}}] Q_{β} {X, d (X); β_{opt 2}^{*}}) - E (\frac{I {A = d (X)} - π {d (X), X}}{π {d (X), X}} \cdot \frac{1 - π {d (X), X}}{π {d (X), X}} \cdot [Q_{0} {X, d (X)} - Q {X, d (X); β_{opt 2}^{*}}] Q_{β} {X, d (X); β_{opt 2}^{*}}) = 0 .

The estimators $\hat{V} (d; β^{LS})$ and $\hat{V} (d; β^{opt1})$ involve solving jointly a set of M-estimating equations (Stefanski and Boos, 2002). Thus, the asymptotic variance of $\hat{V} (d; β^{LS})$ and $\hat{V} (d; β^{opt1})$ can be calculated based on standard M-estimation theory. The estimator $\hat{V} (d; β^{opt2})$ is obtained by solving a set of estimating equations where some infinite dimensional parameters, in this case, $Q_{0} (X, a), a = \pm 1$ , are estimated nonparametrically in the first stage; such estimators are referred to as semiparametric M-estimators (Andrews, 1994; Newey, 1994; Chen et al., 2003; Ichimura and Lee, 2010). In Appendix B, we establish asymptotic normality of such semiparametric M-estimators by extending Theorem 2 in Chen et al. (2003). The detailed proof of Theorem 1 can be found in Appendix C.

Remark 1.

When the propensity score is correct, it can be shown that $U_{1} (θ) = U_{2} (θ) = U_{3} (θ)$ , which equals (4), the asymptotic variance of the influence function for $\hat{V} (d; β)$ . In addition, when the propensity score is correct but the outcome model is incorrect, $β_{opt1}^{*} = β_{opt2}^{*} = β^{opt}$ . Recall that $β^{opt}$ is defined in (5), which minimizes the asymptotic variance (4). However, $β_{LS}^{*}$ , the limit of least squares estimates, is different from $β^{opt}$ . Consequently, $\hat{V} (d; β^{opt1})$ and $\hat{V} (d; β^{opt2})$ have the same asymptotic variance, which is smaller than that of $\hat{V} (d; β^{LS})$ . In small samples, however, $\hat{V} (d; β^{opt2})$ is preferred since it utilizes the complete data and could lead to a more stable estimate. When both models are correct, $β^{opt} = β_{LS}^{*} = β_{0}$ , where $β_{0}$ satisfies $Q (X, A; β_{0}) = Q_{0} (X, A)$ . As a result, all three estimators have the same asymptotic variance. When the outcome model is correct but the propensity score is incorrect, it is not possible to directly compare the asymptotic variances of these estimators.

The following theorem presents asymptotic properties of $\hat{V} (d; \hat{γ}, β^{opt 3})$ and $\hat{V} (d; \hat{γ}, β^{opt 4})$ , the estimators for the value function of d where there is a nuisance parameter in the propensity score model.

Theorem 2.

(Asymptotic normality when there is a nuisance parameter in the propensity score model). When either the propensity score or the outcome model is correct,

\sqrt{n} {\hat{V} (d; \hat{γ}, β^{LS}) - V (d)} \overset{D}{\to} N (0, U_{4} (θ_{0}^{LS 2})) .

The true parameters are $θ_{0}^{LS 2} = {(γ^{* ⊤}, {β_{LS}^{*}}^{⊤}, V (d))}^{⊤}$ , where $γ^{*}$ satisfies $E {S_{γ} (A, X, γ^{*})} = 0$ .

When either the propensity score or the outcome model is correct,

\sqrt{n} {\hat{V} (d; \hat{γ}, β^{opt 3}) - V (d)} \overset{D}{\to} N (0, U_{5} (θ_{0}^{opt 3})),

\sqrt{n} {\hat{V} (d; \hat{γ}, β^{opt4}) - V (d)} \overset{D}{\to} N (0, U_{6} (θ_{0}^{opt4})) .

The true parameters are $θ_{0}^{opt 3} = {(γ^{* ⊤}, {ζ_{opt3}^{*}}^{⊤}, {β_{opt 3}^{*}}^{⊤}, V (d))}^{⊤}$ where $(ζ_{opt 3}^{*}, β_{opt3}^{*})$ is the solution to the following set of equations:

α + E {\partial \tilde{φ} (Y, A, X, γ^{*}, β) / \partial γ} = 0, [ψ_{1}, \dots, ψ_{q}] - E {S_{γ} (A, X, γ^{*}) S_{γ}^{⊤} (A, X, γ^{*})} = 0, [ϕ_{1}, \dots, ϕ_{q}] + E {\partial^{2} \tilde{φ} (Y, A, X, γ^{*}, β) / \partial γ^{⊤} \partial β} = 0,

(11)

and

E (\frac{R [1 - π {d (X), X; γ^{*}}]}{π^{2} {d (X), X; γ^{*}}} [Q_{β} {X, d (X); β} + [ϕ_{1}, \dots, ϕ_{q}] {[ψ_{1}, \dots, ψ_{q}]}^{- 1} \frac{π_{γ} {d (X), X; γ^{*}}}{1 - π {d (X), X; γ^{*}}}] \cdot [Y - Q {X, d (X); β} - α^{⊤} {[ψ_{1}, \dots, ψ_{q}]}^{- 1} \frac{π_{γ} {d (X), X; γ^{*}}}{1 - π {d (X), X; γ^{*}}}]) = 0 .

(12)

The true parameters are $θ_{0}^{opt 4} = {(γ^{* ⊤}, {ζ_{opt 4}^{*}}^{⊤}, {β_{opt 4}^{*}}^{⊤}, V (d))}^{⊤}$ where $(ζ_{opt 4}^{*}, β_{opt 4}^{*})$ is the solution to (11) and

(* * *) - E (\frac{[R - π {d (X), X; γ^{*}}] [1 - π {d (X), X; γ^{*}}]}{π^{2} {d (X), X; γ^{*}}} \cdot [Q_{β} {X, d (X); β} + [ϕ_{1}, \dots, ϕ_{q}] {[ψ_{1}, \dots, ψ_{q}]}^{- 1} \frac{π_{γ} {d (X), X; γ^{*}}}{1 - π {d (X), X; γ^{*}}}] \cdot [Q_{0} {X, d (X)} - Q {X, d (X); β} - α^{⊤} {[ψ_{1}, \dots, ψ_{q}]}^{- 1} \frac{π_{γ} {d (X), X; γ^{*}}}{1 - π {d (X), X; γ^{*}}}]) = 0,

where (***) is the left-hand side of (12). See Appendix D for detailed expressions of $U_{4} (θ), U_{5} (θ), U_{6} (θ)$ and the definitions of $ζ = {(α^{⊤}, ψ^{⊤}, ϕ^{⊤})}^{⊤}$, $ψ = {(ψ_{1}^{⊤}, \dots, ψ_{q}^{⊤})}^{⊤}$, $ϕ = {(ϕ_{1}^{⊤}, \dots, ϕ_{q}^{⊤})}^{⊤}$. Note that here we use $[ϕ_{1}, \dots, ϕ_{q}]$ to represent a matrix with the j-th column being $ϕ_{j}$, and similarly for $[ψ_{1}, \dots, ψ_{q}]$.

Remark 2.

When the propensity score model is correct, observe that $θ_{0}^{opt 3} = θ_{0}^{opt 4}$ with $β_{opt 3}^{*} = β_{opt 4}^{*} = β^{opt}$, where $β^{opt}$ is the minimizer of the variance of (7) in this case. Furthermore, it can be shown that $U_{4} (θ_{0}^{LS 2})$ equals the variance of (7) evaluated at $β = β_{LS}^{*}$, while $U_{5} (θ_{0}^{opt 3})$ and $U_{6} (θ_{0}^{opt 4})$ both equal the variance of (7) evaluated at $β = β^{opt}$. Therefore, when the propensity score is correct but the outcome model is incorrect, $\hat{V} (d; \hat{γ}, β^{opt 3})$ and $\hat{V} (d; \hat{γ}, β^{opt4})$ are asymptotically equivalent and more efficient than $\hat{V} (d; \hat{γ}, β^{LS})$. When both models are correct, all three estimators are asymptotically equivalent.

Remark 3.

An estimator for $V (d^{opt})$ , the overall population mean under the optimal regime, may be found as $\hat{V} {\hat{d}; \hat{γ}, β^{opt} (\hat{d})}$ . Following Zhang et al. (2012), $n^{1 / 2} [\hat{V} {\hat{d}; \hat{γ}, β^{opt} (\hat{d})} - V (d^{opt})] = n^{1 / 2} [\hat{V} {d^{opt}; \hat{γ}, β^{opt} (d^{opt})} - V (d^{opt})] + o_{p} (1)$ .

Thus, the asymptotic variance of $\hat{V} {\hat{d}; \hat{γ}, β^{opt} (\hat{d})}$ can be approximated by that of $\hat{V} {d^{opt}; \hat{γ}, β^{opt} (d^{opt})}$ , which by Theorem 2 can be estimated using the usual sandwich technique.

4. Simulation Studies

We conducted several simulation studies to evaluate the finite sample performance of our proposed method. The following six methods were compared: Q-learning based on linear regression (QL-LR, Qian and Murphy (2011)); Q-learning based on kernel regression (QL-KR); maximizing ${\hat{V}}^{IPWE} (d; \hat{γ})$ within a pre-specified class of ITRs (IPWE); maximizing $\hat{V} (d; \hat{γ}, β)$ where standard maximum likelihood estimators are used for the nuisance parameters (Usual-DR, Zhang et al. (2012)); maximizing $\hat{V} (d; \hat{γ}, β^{opt3})$ where $β^{opt3}$ solves the IPW estimating equation (9) (Improved-DR); and maximizing $\hat{V} (d; \hat{γ}, β^{opt 4})$ where $β^{opt 4}$ solves the augmented IPW estimating equation (10) (Aug-Improved-DR).
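The three doubly robust methods above differ only in how the outcome-model parameter β is estimated before the value estimator is maximized over d. As a minimal sketch, assuming the AIPWE takes the standard form used by Zhang et al. (2012), $ℙ_{n} [R Y / π_{d} - (R - π_{d}) Q_{d} / π_{d}]$ with $R = I {A = d (X)}$, $π_{d} = π {d (X), X; γ}$ and $Q_{d} = Q {X, d (X); β}$, the estimator can be written as follows. The function and argument names are illustrative and do not come from the article.

```python
from typing import Sequence


def aipwe_value(
    y: Sequence[float],
    a: Sequence[int],
    d: Sequence[int],
    pi_d: Sequence[float],
    q_d: Sequence[float],
) -> float:
    """AIPWE of the value of a rule (assumed standard form).

    y    : observed outcomes
    a    : observed treatments in {-1, 1}
    d    : treatments recommended by the rule, d(X_i)
    pi_d : fitted probability of receiving d(X_i) given X_i
    q_d  : fitted outcome-model prediction Q{X_i, d(X_i); beta}
    """
    total = 0.0
    for yi, ai, di, pi, qi in zip(y, a, d, pi_d, q_d):
        r = 1.0 if ai == di else 0.0  # R = I{A = d(X)}
        # IPW term minus the augmentation term
        total += r * yi / pi - (r - pi) / pi * qi
    return total / len(y)


def ipwe_value(y, a, d, pi_d) -> float:
    """IPWE as the special case with the augmentation term set to zero."""
    return aipwe_value(y, a, d, pi_d, [0.0] * len(y))
```

In this sketch, Usual-DR, Improved-DR and Aug-Improved-DR would all call `aipwe_value` with the same `pi_d` but with `q_d` computed from different estimates of β.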

The simulation setup is similar to Kang and Schafer (2007) with some modifications. $Z = (Z_{1}, Z_{2}, Z_{3}, Z_{4})$ was generated as standard multivariate normal, and $X = (X_{1}, X_{2}, X_{3}, X_{4})$ was defined as $X_{1} = \exp (Z_{1} / 2), X_{2} = Z_{2} / {1 + \exp (Z_{1})} + 10, X_{3} = {(Z_{1} Z_{3} / 25 + 0.6)}^{3}, X_{4} = {(Z_{2} + Z_{4} + 20)}^{2}$, so that Z can be expressed in terms of X. The treatment A was generated from ${- 1, 1}$ according to the model $P (A = 1 | X) = \exp {l (X)} / [1 + \exp {l (X)}]$, where $l (x) = - z_{1} + 0.5 z_{2} - 0.25 z_{3} - 0.1 z_{4}$ in Scenario 1, and $l (x) = 0.5 z_{1} - 0.5$ in Scenario 2. The response variable was normally distributed with $Y = 10 + 27.4 Z_{1} + 13.7 Z_{2} + 13.7 Z_{3} + 13.7 Z_{4} + A (- 1 - 10 Z_{1} + 10 Z_{2}) + ϵ$, where $ϵ ~ N (0, 1)$. It is straightforward to deduce that $d^{opt} (x) = sign (- 1 - 10 z_{1} + 10 z_{2})$. Via Monte Carlo simulation with $10^{6}$ replicates, we obtained $E {Y (d^{opt})} = 21.32$. The following modeling choices are considered for the propensity and outcome regression models.
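The data-generating mechanism just described can be sketched in a few lines. The code below is an illustration using only the Python standard library, not the authors' implementation, and the function names are hypothetical. It also checks the reported optimal value: since the main-effect terms and ϵ have mean zero, $E {Y (d^{opt})} = 10 + E | W |$ with $W = - 1 - 10 Z_{1} + 10 Z_{2} ~ N (- 1, 200)$, and the mean of a folded normal is available in closed form.

```python
import math
import random


def expit(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def draw_subject(rng: random.Random, scenario: int = 1):
    """Draw one (Z, X, A, Y) tuple from the simulation model in the text."""
    z1, z2, z3, z4 = (rng.gauss(0.0, 1.0) for _ in range(4))
    x = [
        math.exp(z1 / 2.0),
        z2 / (1.0 + math.exp(z1)) + 10.0,
        (z1 * z3 / 25.0 + 0.6) ** 3,
        (z2 + z4 + 20.0) ** 2,
    ]
    if scenario == 1:
        lin = -z1 + 0.5 * z2 - 0.25 * z3 - 0.1 * z4
    else:
        lin = 0.5 * z1 - 0.5
    a = 1 if rng.random() < expit(lin) else -1
    contrast = -1.0 - 10.0 * z1 + 10.0 * z2
    y = (10.0 + 27.4 * z1 + 13.7 * (z2 + z3 + z4)
         + a * contrast + rng.gauss(0.0, 1.0))
    return [z1, z2, z3, z4], x, a, y


def optimal_value_closed_form() -> float:
    """10 + E|W| for W ~ N(-1, 200), using the folded-normal mean."""
    mu, sigma = -1.0, math.sqrt(200.0)
    density = math.exp(-0.5 * (mu / sigma) ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))
    return 10.0 + 2.0 * sigma * density + mu * (2.0 * cdf - 1.0)


def optimal_value_monte_carlo(n: int, seed: int = 0) -> float:
    """Monte Carlo version; mean-zero terms are dropped to reduce noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        total += 10.0 + abs(-1.0 - 10.0 * z1 + 10.0 * z2)
    return total / n
```

The closed-form value is approximately 21.31, which agrees with the reported 21.32 up to Monte Carlo error.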

CC A correctly specified logistic regression model for $π_{0} (A; X)$ with Z as predictors in both scenarios, and a correctly specified model for $Q_{0} (X, A)$ with Z, A, ZA as predictors in both scenarios.

CI A correctly specified logistic regression model for $π_{0} (A; X)$ with Z as predictors in both scenarios, and an incorrectly specified model for $Q_{0} (X, A)$ with $X, A, Z A$ as predictors in both scenarios.

IC An incorrectly specified logistic regression model for $π_{0} (A; X)$ with X as predictors in Scenario 1, and without any predictors in Scenario 2, and a correctly specified model for $Q_{0} (X, A)$ with Z, A, ZA as predictors in both scenarios.

II An incorrectly specified logistic regression model for $π_{0} (A; X)$ with X as predictors in Scenario 1, and without any predictors in Scenario 2, and an incorrectly specified model for $Q_{0} (X, A)$ with $X, A, Z A$ as predictors in both scenarios.

For IPWE, we use C. and I. to denote correct and incorrect propensity models, respectively. For QL-LR, we use .C and .I to denote correct and incorrect linear regression models. For QL-KR, we use .C and .I to denote kernel regression based on (Z, A) and based on (X, A), respectively. In all direct-maximization methods (IPWE, Usual-DR, Improved-DR, Aug-Improved-DR), we choose $D = {sign (η_{0} + η_{1} z_{1} + η_{2} z_{2} + η_{3} z_{3} + η_{4} z_{4})}$ so that $d^{opt} \in D$. By imposing $‖ η ‖ = 1$, $d^{opt}$ corresponds to $(η_{0}, η_{1}, η_{2}, η_{3}, η_{4}) = (- 0.07, - 0.71, 0.71, 0, 0)$.
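The normalization is needed because a linear rule is unchanged when η is multiplied by a positive constant. The short sketch below, with hypothetical helper names, rescales the coefficients of $d^{opt}$ to unit norm and recovers the values quoted above.

```python
import math


def normalize_rule(eta):
    """Rescale a coefficient vector to unit Euclidean norm."""
    norm = math.sqrt(sum(e * e for e in eta))
    return [e / norm for e in eta]


def linear_rule(eta, z):
    """Return 1 if eta_0 + eta_1 z_1 + ... + eta_4 z_4 > 0, else -1."""
    score = eta[0] + sum(e * zj for e, zj in zip(eta[1:], z))
    return 1 if score > 0 else -1


# Coefficients of d_opt(x) = sign(-1 - 10 z1 + 10 z2), before normalization
eta_raw = [-1.0, -10.0, 10.0, 0.0, 0.0]
eta_opt = normalize_rule(eta_raw)
print([round(e, 2) for e in eta_opt])  # [-0.07, -0.71, 0.71, 0.0, 0.0]
```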

For each scenario, we considered four sample sizes for training datasets: n = 100, 250, 500 or 1000, and repeated the simulation 500 times. The ITRs are constructed based on the training set and then evaluated on a large and independent test set (size 10000) based on two criteria: value function, i.e., the overall population mean when we apply the estimated optimal ITR to the test dataset; the misclassification error rate of the estimated optimal ITR from the true optimal ITR, i.e., $ℙ_{n}^{*} [I {\hat{d} (X) \neq d^{opt} (X)}]$ . Here $ℙ_{n}^{*}$ denotes the empirical measure using the test data.
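Both criteria are straightforward to compute on a simulated test set because the true model is known. The sketch below is one possible reading, under the assumption that the value is computed from the conditional mean $E (Y | Z, A = \hat{d})$ and not from noisy realized outcomes. The article does not say which was used, and the function names are hypothetical.

```python
import random


def contrast(z):
    """Treatment contrast in the simulation model."""
    return -1.0 - 10.0 * z[0] + 10.0 * z[1]


def d_opt(z):
    """True optimal rule, sign of the contrast."""
    return 1 if contrast(z) > 0 else -1


def mean_outcome(z, a):
    """E(Y | Z = z, A = a) under the simulation model."""
    return 10.0 + 27.4 * z[0] + 13.7 * (z[1] + z[2] + z[3]) + a * contrast(z)


def evaluate_rule(rule, test_z):
    """Return (value, misclassification rate) of a rule on a test set."""
    n = len(test_z)
    value = sum(mean_outcome(z, rule(z)) for z in test_z) / n
    error = sum(rule(z) != d_opt(z) for z in test_z) / n
    return value, error


# Independent test set of size 10000, as in the text
rng = random.Random(0)
test_z = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10000)]
value_opt, error_opt = evaluate_rule(d_opt, test_z)
value_all, error_all = evaluate_rule(lambda z: 1, test_z)  # treat everyone
```

By construction, the true optimal rule has zero misclassification error and a test-set value at least as large as that of any other rule evaluated on the same test set.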

Results for Scenario 1 are presented in Figure 1, where we draw boxplots of the value functions over 500 replications. Here we only report the results for n = 250 or 1000 (see Appendix E for further results, e.g., n = 100 or 500). As expected, Q-learning works the best if the outcome model is correctly specified but has relatively poor performance if this model is incorrect. When the outcome model is correct (CC, IC), Aug-Improved-DR and Usual-DR have similar performance. This is not surprising: when the outcome model is correct, the proposed nuisance parameter estimate $β^{opt4}$ converges in probability to $β_{0}$, the same limit as the least squares estimates. When the propensity model is correct but the outcome regression model is misspecified (CI), Aug-Improved-DR dominates Usual-DR, as evidenced by larger value functions and smaller variance in value functions. For example, the mean (sd) of the value function for Aug-Improved-DR is 21.07 (0.43) and 21.23 (0.39) when the sample size is 250 and 1000, respectively. Comparatively, for Usual-DR, the mean (sd) of the value function is 20.52 (0.76) and 21.07 (0.40). In addition, note that Improved-DR and Aug-Improved-DR have almost identical performance when the sample size is large (n = 1000). However, Improved-DR is unstable when the sample size is small (n = 250). This justifies the need to construct augmented IPW estimating equations to estimate the nuisance parameters, as we discussed in the method section.

Fig. 1 — Simulation results for Scenario 1. Value functions over 500 replications. The optimal value is E{Y(d^opt)} = 21.32.

To better demonstrate the superior performance of our proposed method under the CI setting, we focus on the comparison between Aug-Improved-DR and Usual-DR in terms of the misclassification rates. Results for Scenario 1 are shown in Figure 2 with sample sizes ranging from 100 to 1000. Notice that Aug-Improved-DR produced much smaller misclassification rates as well as smaller variations. In particular, it outperforms the usual DR estimator by a large margin when the sample size is small.

Fig. 2 — Simulation results for Scenario 1 under CI: propensity score correct, outcome model incorrect. Misclassification rates over 500 replications.

Simulation results for Scenario 2 are provided in Figure 3 and Appendix E. Again, the proposed method outperforms other competing methods in both value functions and misclassification rates. In Appendix E, we also report the mean squared errors (MSE) of different methods in terms of estimating η. Aug-Improved-DR has smaller MSE than its competitors.

Fig. 3 — Simulation results for Scenario 2. Value functions over 500 replications. The optimal value is E{Y(d^opt)} = 21.32.

5. Application to the STAR*D Study

We apply the proposed method to analyze data from the STAR*D Study (Rush et al., 2004). Funded by the National Institute of Mental Health, the study was conducted to compare various treatment options for major depressive disorder when patients fail to respond to the initial treatment of citalopram (CIT). From 2001 to 2006, a total of 4041 outpatients with nonpsychotic depression, aged 18–75, were enrolled from 41 clinical sites in the U.S. The score on the 16-item Quick Inventory of Depressive Symptomatology (QIDS) was the primary outcome. The QIDS score ranges from 0 to 27, where higher scores indicate more severe depression.

The trial had four levels (see Fig. 1 in Rush et al. (2004)). Here, we focused on the first two levels. At level-1, patients received CIT for 12 to 14 weeks. Those who achieved clinically meaningful response (total QIDS score under 5) were remitted from future treatments. At level-2, participants without a satisfactory response to CIT had the option to either switch to a different medication, or to augment their existing citalopram. Those in the “switch” group were randomly assigned to bupropion (BUP), cognitive therapy (CT), sertraline (SER), or venlafaxine (VEN). Those in the “augment” group were randomly assigned to CIT+BUP, CIT+buspirone (BUS), or CIT+CT. If a patient had no preference, he/she was assigned to any of the above treatments.

We use the QIDS score at the end of level-2 as the clinical outcome Y and compare two categories of treatments: (i) treatment with selective serotonin reuptake inhibitors (SSRI): CIT+BUP, CIT+BUS, CIT+CT, and SER; (ii) non-SSRI: BUP, CT, and VEN. Denote A = 1 for SSRI and A = − 1 for non-SSRI. Since patients in the “augment” group were all treated with SSRIs (violating the positivity assumption), we exclude these subjects from our analysis, which leaves a total of 817 subjects. Among them, 656 and 161 patients were in the “switch” and “no preference” group, respectively. 296 patients received SSRI treatments, while 521 patients received non-SSRI treatments. Comparisons using t-tests show that there is no significant difference between the SSRI and the non-SSRI category with respect to QIDS scores.

We applied four methods to estimate the optimal ITR for those patients who had entered level-2. Prognostic variables X include QIDS score at the start of level-2, change of QIDS score during the level-1 period, preference regarding level-2 treatment, and other demographic variables such as gender, race, age, education level and employment status. The propensity scores $π_{0} (A, X)$ are estimated by empirical proportions based on preferring to switch or no preference. We used a linear regression of Y given $(X, A, X A)$ for the outcome model. For all methods, we randomly split the data into training and test sets with a 1:1 ratio. The estimated ITR was obtained using the training set, and then evaluated on the test set by $ℙ_{n}^{*} [Y I {A = \hat{d} (X)} / {\hat{π}}_{0} (A, X)] / ℙ_{n}^{*} [I {A = \hat{d} (X)} / {\hat{π}}_{0} (A, X)]$. This procedure is repeated 500 times. Results for IPWE, QL-LR, Usual-DR and Aug-Improved-DR are displayed in Figure 4, where lower scores are desirable. The estimated QIDS score by using Aug-Improved-DR is 9.62 (sd = 0.37), which is smaller than IPWE (10.15, sd = 0.36), QL-LR (9.87, sd = 0.38), and Usual-DR (9.65, sd = 0.40). In addition, Aug-Improved-DR outperformed the one-size-fits-all approaches (QIDS score of 9.98 for SSRI and 10.12 for non-SSRI).
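The test-set criterion above is a normalized inverse-probability-weighted mean: only subjects whose observed treatment agrees with the recommendation contribute, each weighted by the inverse of the estimated propensity of the treatment actually received. A minimal sketch follows; the function name and argument layout are illustrative, not the authors' code.

```python
from typing import Sequence


def normalized_ipw_value(
    y: Sequence[float],
    a: Sequence[int],
    d_hat: Sequence[int],
    pi_hat: Sequence[float],
) -> float:
    """Normalized IPW estimate of the value of an estimated rule.

    y      : observed outcomes on the test set
    a      : observed treatments
    d_hat  : treatments recommended by the estimated rule
    pi_hat : estimated propensity of the observed treatment, pi_0(A_i, X_i)
    """
    num = 0.0
    den = 0.0
    for yi, ai, di, pi in zip(y, a, d_hat, pi_hat):
        if ai == di:  # indicator I{A = d_hat(X)}
            num += yi / pi
            den += 1.0 / pi
    if den == 0.0:
        raise ValueError("no test subject received the recommended treatment")
    return num / den
```

Dividing by the sum of the weights, instead of by the test-set size, keeps the estimate within the range of the observed outcomes, which is convenient for a bounded score such as QIDS.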

Fig. 4 — QIDS score based on 500 replications. Lower scores are more preferable.

6. Discussion

In this article, we proposed an improved DR estimator for the optimal ITRs by directly maximizing an AIPWE of the marginal mean outcome over a class of ITRs. Our estimator is doubly robust, and designed to be more efficient than other DR estimators when the propensity score model is correctly specified, regardless of the specification of the outcome model. As shown in the numerical studies, the proposed method achieves better performance compared to other existing methods. The proposed method is appealing, given that in many practical applications, correct specification of the outcome model can be challenging, while the propensity score is either known by design or more likely to be correctly specified.

There are several important ways this work may be extended. The first is to extend it to the multi-stage decision setting. Zhang et al. (2013) proposed a doubly robust estimator for the optimal DTR where the nuisance parameters indexing the outcome models are estimated iteratively by a sequence of least squares regressions. More efficient DR estimators could be obtained if we use IPW or augmented IPW estimating equations to estimate these nuisance parameters. This is the direction we are currently pursuing.

Another future direction is to consider bias-reduced doubly robust estimation, i.e., to estimate the nuisance parameters so as to minimize the bias of the DR estimator under misspecification of both working models. Vermeulen and Vansteelandt (2015) proposed bias-reduced DR estimators for several missing data and causal inference models. It would be interesting to investigate whether this principle can be adapted to the context of estimating optimal treatment regimes.

Supplementary Material

Supp 1

NIHMS1578951-supplement-Supp_1.pdf (1 MB, PDF)

Acknowledgments

The authors gratefully acknowledge support by R01DK108073 awarded by the National Institutes of Health.

Footnotes

SUPPLEMENTARY MATERIAL

The online supplementary materials contain the appendices for the article.

References

Andrews DW (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica, 62:43–72.
Bang H. and Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
Blatt D, Murphy SA, and Zhu J. (2004). A-learning for approximate planning. Ann Arbor, 1001:48109–2122.
Cao W, Tsiatis AA, and Davidian M. (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734.
Chan IS and Ginsburg GS (2011). Personalized medicine: progress and promise. Annual review of genomics and human genetics, 12:217–244.
Chen X, Linton O, and Van Keilegom I. (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608.
Collins FS and Varmus H. (2015). A new initiative on precision medicine. New England journal of medicine, 372(9):793–795.
Goldberg DE (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition.
Hamburg MA and Collins FS (2010). The path to personalized medicine. New England Journal of Medicine, 363(4):301–304.
Ichimura H. and Lee S. (2010). Characterization of the asymptotic distribution of semiparametric M-estimators. Journal of Econometrics, 159(2):252–266.
Imbens GW and Rubin DB (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Kang JD and Schafer JL (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539.
Laber EB, Linn KA, and Stefanski LA (2014). Interactive model building for Q-learning. Biometrika, 101(4):831–847.
Liu Y, Wang Y, Kosorok MR, Zhao Y, and Zeng D. (2018). Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens. Statistics in medicine, 37(26):3776–3788.
Mebane WR Jr and Sekhon JS (2011). Genetic optimization using derivatives: the rgenoud package for R. Journal of Statistical Software, 42(11):1–26.
Murphy SA (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355.
Newey WK (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349–1382.
Qian M. and Murphy SA (2011). Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180.
Racine J. and Li Q. (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119(1):99–130.
Robins J, Sued M, Lei-Gomez Q, and Rotnitzky A. (2007). Comment: Performance of double-robust estimators when “inverse probability” weights are highly variable. Statistical Science, 22(4):544–559.
Robins JM (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the second seattle Symposium in Biostatistics, pages 189–326. Springer.
Robins JM, Hernan MA, and Brumback B. (2000). Marginal structural models and causal inference in epidemiology.
Robins JM and Rotnitzky A. (2001). Comment on the bickel and kwon article, “inference for semiparametric models: Some questions and an answer”. Statistica Sinica, 11(4):920–936.
Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866.
Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688.
Rubin DB and van der Laan MJ (2008). Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics, 4(1).
Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM, et al. (2004). Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled clinical trials, 25(1):119–142.
Scharfstein DO, Rotnitzky A, and Robins JM (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120.
Schulte PJ, Tsiatis AA, Laber EB, and Davidian M. (2014). Q-and A-learning methods for estimating optimal dynamic treatment regimes. Statistical science, 29(4):640.
Stefanski LA and Boos DD (2002). The calculus of M-estimation. The American Statistician, 56(1):29–38.
Tan Z. (2007). Comment: Understanding OR, PS and DR. Statistical Science, 22(4):560–568.
Tan Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3):661–682.
Tsiatis A. (2007). Semiparametric theory and missing data. Springer Science & Business Media.
Tsiatis AA and Davidian M. (2007). Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):569.
Tsiatis AA, Davidian M, and Cao W. (2011). Improved doubly robust estimation when data are monotonely coarsened, with application to longitudinal studies with dropout. Biometrics, 67(2):536–545.
Van der Laan MJ and Robins JM (2003). Unified methods for censored longitudinal data and causality. Springer Science & Business Media.
Vermeulen K. and Vansteelandt S. (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association, 110(511):1024–1036.
Vogel CL, Cobleigh MA, Tripathy D, Gutheil JC, Harris LN, Fehrenbacher L, Slamon DJ, Murphy M, Novotny WF, Burchmore M, et al. (2002). Efficacy and safety of trastuzumab as a single agent in first-line treatment of her2-overexpressing metastatic breast cancer. Journal of Clinical Oncology, 20(3):719–726.
Wallace MP and Moodie EE (2015). Doubly-robust dynamic treatment regimen estimation via weighted least squares. Biometrics, 71(3):636–644.
Zhang B, Tsiatis AA, Laber EB, and Davidian M. (2012). A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018.
Zhang B, Tsiatis AA, Laber EB, and Davidian M. (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika, 100(3):681–694.
Zhao Y-Q, Laber EB, Ning Y, Saha S, and Sands B. (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research. In press.
Zhao Y-Q, Zeng D, Laber EB, and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510):583–598.
Zhao Y-Q, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118.
Zhou X, Mayer-Hamblett N, Khan U, and Kosorok MR (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187.
