Published in final edited form as: Biometrika. 2013;100(3). doi: 10.1093/biomet/ast014

Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions

Baqun Zhang 1, Anastasios A Tsiatis 2, Eric B Laber 3, Marie Davidian 4

Summary

A dynamic treatment regime is a list of sequential decision rules for assigning treatment based on a patient’s history. Q- and A-learning are two main approaches for estimating the optimal regime, i.e., that yielding the most beneficial outcome in the patient population, using data from a clinical trial or observational study. Q-learning requires postulated regression models for the outcome, while A-learning involves models for that part of the outcome regression representing treatment contrasts and for treatment assignment. We propose an alternative to Q- and A-learning that maximizes a doubly robust augmented inverse probability weighted estimator for population mean outcome over a restricted class of regimes. Simulations demonstrate the method’s performance and robustness to model misspecification, which is a key concern.

Keywords: A-learning, Double robustness, Outcome regression, Propensity score, Q-learning

1. Introduction

Treatment of patients with chronic disease involves a series of decisions, where the clinician determines the next treatment to be administered based on all information available to that point. A dynamic treatment regime is a set of sequential decision rules, each corresponding to a decision point in the treatment process. Each rule inputs the available information and outputs the treatment to be given from among the possible options. The optimal regime is that yielding the most favorable outcome on average if followed by the patient population.

Q- and A-learning are two main approaches for estimating the optimal dynamic treatment regime using data from a clinical trial or observational study. Q-learning (Watkins & Dayan, 1992) involves postulating at each decision point regression models for outcome as a function of patient information to that point. In A-learning (Robins, 2004; Murphy, 2003), models are posited only for the part of the regression involving contrasts among treatments and for treatment assignment at each decision point. Both are implemented through a backward recursive fitting procedure based on a dynamic programming algorithm (Bather, 2000). Under certain assumptions and correct specification of these models, Q- and A-learning lead to consistent estimation of the optimal regime. See Rosthøj et al. (2006), Murphy et al. (2007), Zhao et al. (2009) and Henderson et al. (2010) for applications; related methods are discussed by Robins (2004), Moodie et al. (2007), Robins et al. (2008), Almirall et al. (2010) and Orellana et al. (2010).

A concern with both Q- and A-learning is the effect of model misspecification on the quality of the estimated optimal regime. If one attempts to circumvent this difficulty by using flexible nonparametric regression techniques (Zhao et al., 2009), the estimated optimal rules may be complicated functions of possibly high-dimensional patient information that are difficult to interpret or implement and hence are unappealing to clinicians wary of black box approaches.

Given these drawbacks, we focus on a restricted class of treatment regimes indexed by a finite number of parameters, where the form of regimes in the class may be derived from posited regression models or prespecified on the grounds of interpretability or cost to depend on key subsets of patient information. Zhang et al. (2012) proposed an approach for estimating the optimal regime within such a restricted class for a single treatment decision, based on directly maximizing a doubly robust augmented inverse probability weighted estimator for the population mean outcome over all regimes in the class, assuming that larger outcomes are preferred. Via the double robustness property, the estimated optimal regimes enjoy protection against model misspecification and performance comparable or superior to that of competing methods. With judicious choice of the augmentation term, increased efficiency of estimation of the mean outcome is achieved, which translates into more precise estimators for the optimal regime.

We adapt this approach to two or more decision points. This is considerably more complex than for one decision and is based on casting the problem as one of monotone coarsening (Tsiatis, 2006, Chapter 7). We focus for simplicity on the case of two treatment options at each decision point, though the methods extend to a finite number of options. The methods lead to estimated optimal regimes achieving comparable performance to those derived via Q- or A-learning under correctly specified models and have the added benefit of protection against misspecification.

2. Framework

Assume there are K prespecified, ordered decision points and an outcome of interest, a function of information collected across all K decisions or ascertained after the Kth decision, with larger values preferred. At each decision k = 1, …, K, there are two k-specific treatment options coded as 0,1 in the set of options 𝒜k; write ak to denote an element of 𝒜k. Denote a possible treatment history up to and including decision k as āk = (a1, …, ak) ∈ 𝒜1 × ⋯ × 𝒜k = 𝒜̄k.

We consider a potential outcomes framework. For a randomly chosen patient, let $X_1$ denote baseline covariates recorded prior to the first decision, and let $X_k^*(\bar{a}_{k-1})$ be the covariate information that would accrue between decisions $k-1$ and $k$ were s/he to receive treatment history $\bar{a}_{k-1}$ ($k = 2, \ldots, K$), taking values $x_k \in \mathcal{X}_k$. Let $Y^*(\bar{a}_K)$ be the outcome that would result were s/he to receive full treatment history $\bar{a}_K$. Then define the potential outcomes (Robins, 1986) as

$$W = \{X_1, X_2^*(a_1), \ldots, X_K^*(\bar{a}_{K-1}), Y^*(\bar{a}_K) \text{ for all } \bar{a}_K \in \bar{\mathcal{A}}_K\}.$$

For convenience later, we include $X_1$, which is always observed and hence is not strictly a potential outcome, in $W$, and write $\bar{X}_k^*(\bar{a}_{k-1}) = \{X_1, X_2^*(a_1), \ldots, X_k^*(\bar{a}_{k-1})\}$ and $\bar{x}_k = (x_1, \ldots, x_k)$ for $k = 1, \ldots, K$, where then $\bar{x}_k \in \bar{\mathcal{X}}_k = \mathcal{X}_1 \times \cdots \times \mathcal{X}_k$.

A dynamic treatment regime $g = (g_1, \ldots, g_K)$ is an ordered set of decision rules, where $g_k(\bar{x}_k, \bar{a}_{k-1})$ corresponding to the $k$th decision takes as input a patient's realized covariate and treatment history up to decision $k$ and outputs a treatment option $a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1}) \subseteq \mathcal{A}_k$. In general, $\Phi_k(\bar{x}_k, \bar{a}_{k-1})$ is the set of feasible options at decision $k$ for a patient with realized history $(\bar{x}_k, \bar{a}_{k-1})$, allowing that some options in $\mathcal{A}_k$ may not be possible for patients with certain histories; here, $\Phi_k(\bar{x}_k, \bar{a}_{k-1}) \subseteq \{0, 1\}$. Thus, a feasible treatment regime must satisfy $g_k(\bar{x}_k, \bar{a}_{k-1}) \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})$ ($k = 1, \ldots, K$). Denote the class of all feasible regimes by $\mathcal{G}$.

For $g \in \mathcal{G}$, writing $\bar{g}_k = (g_1, \ldots, g_k)$ for $k = 1, \ldots, K$ and $\bar{g}_K = g$, define the potential outcomes associated with $g$ to be $W_g = \{X_1, X_2^*(g_1), \ldots, X_K^*(\bar{g}_{K-1}), Y^*(g)\}$, where $X_k^*(\bar{g}_{k-1})$ is the covariate information that would be seen between decisions $k-1$ and $k$ were a patient to receive the treatments dictated sequentially by the first $k-1$ rules in $g$, and $Y^*(g)$ is the outcome if s/he were to receive the $K$ treatments determined by $g$. Thus, $W_g$ is composed of elements of $W$.

Define an optimal treatment regime $g^{\mathrm{opt}} = (g_1^{\mathrm{opt}}, \ldots, g_K^{\mathrm{opt}}) \in \mathcal{G}$ as satisfying

$$E\{Y^*(g^{\mathrm{opt}})\} \geq E\{Y^*(g)\}, \quad g \in \mathcal{G}. \tag{1}$$

That is, $g^{\mathrm{opt}}$ is a regime that maximizes the expected outcome were all patients in the population to follow it. The optimal regime $g^{\mathrm{opt}}$ may be determined via dynamic programming, also referred to as backward induction. At the $K$th decision point, for any $\bar{x}_K \in \bar{\mathcal{X}}_K$, $\bar{a}_{K-1} \in \bar{\mathcal{A}}_{K-1}$, define

$$g_K^{\mathrm{opt}}(\bar{x}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \bar{X}_K^*(\bar{a}_{K-1}) = \bar{x}_K\}, \tag{2}$$
$$V_K(\bar{x}_K, \bar{a}_{K-1}) = \max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \bar{X}_K^*(\bar{a}_{K-1}) = \bar{x}_K\}. \tag{3}$$

For $k = K-1, \ldots, 2$ and any $\bar{x}_k \in \bar{\mathcal{X}}_k$, $\bar{a}_{k-1} \in \bar{\mathcal{A}}_{k-1}$, define

$$g_k^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} E[V_{k+1}\{\bar{x}_k, X_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \bar{X}_k^*(\bar{a}_{k-1}) = \bar{x}_k],$$
$$V_k(\bar{x}_k, \bar{a}_{k-1}) = \max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} E[V_{k+1}\{\bar{x}_k, X_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \bar{X}_k^*(\bar{a}_{k-1}) = \bar{x}_k].$$

For $k = 1$ and $x_1 \in \mathcal{X}_1$, $g_1^{\mathrm{opt}}(x_1) = \arg\max_{a_1 \in \Phi_1(x_1)} E[V_2\{x_1, X_2^*(a_1), a_1\} \mid X_1 = x_1]$ and $V_1(x_1) = \max_{a_1 \in \Phi_1(x_1)} E[V_2\{x_1, X_2^*(a_1), a_1\} \mid X_1 = x_1]$. Thus, $g_K^{\mathrm{opt}}$ yields the treatment option at decision $K$ that maximizes the expected potential outcome given prior covariate and treatment history. At decisions $k = K-1, \ldots, 1$, $g_k^{\mathrm{opt}}$ dictates the option that maximizes the expected potential outcome that would be achieved if the optimal rules were followed at all future decisions. An argument that $g^{\mathrm{opt}}$, so defined, satisfies (1) is given in an unpublished report by Schulte et al. (2013) available from the last author.
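To make the backward induction concrete, the short R sketch below computes $g^{\mathrm{opt}}$ for a two-decision problem in which the required conditional expectations are available in closed form, namely the generative model used for the first simulation scenario of §5. It is a toy illustration under our own coding conventions (the function names Q2, g2opt, V2, Q1 and g1opt are ours), not the authors' software.

# True outcome regression E(Y | x1, x2, a1, a2) for the first scenario of Section 5
Q2 <- function(x1, x2, a1, a2)
  400 + 1.6 * x1 - abs(250 - x1) * (a1 - (250 - x1 > 0))^2 -
  (1 - a1) * abs(720 - 2 * x2) * (a2 - (720 - 2 * x2 > 0))^2

# Decision 2: maximize Q2 over the feasible set (patients with a1 = 1 must continue)
g2opt <- function(x1, x2, a1) a1 + (1 - a1) * as.numeric(720 - 2 * x2 > 0)
V2    <- function(x1, x2, a1) Q2(x1, x2, a1, g2opt(x1, x2, a1))

# Decision 1: Q1(x1, a1) = E{V2(x1, X2, a1) | X1 = x1, A1 = a1}, with
# X2 | (x1, a1) ~ N(1.25 x1, 60), approximated by Monte Carlo integration
Q1 <- function(x1, a1, M = 1e4) mean(V2(x1, rnorm(M, 1.25 * x1, sqrt(60)), a1))
g1opt <- function(x1) as.numeric(Q1(x1, 1) > Q1(x1, 0))

g1opt(200); g1opt(300)   # 1 then 0, agreeing with g1opt(x1) = I(250 - x1 > 0)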

This definition of an optimal regime is intuitive, but it is given in terms of potential outcomes. In practice, with the exception of $X_1$, $W$ cannot be observed for any patient; rather, a patient is observed to experience only a single treatment history. Let $A_k$ be the observed treatment received at decision $k$ and let $\bar{A}_k = (A_1, \ldots, A_k)$ be the observed treatment history up to decision $k$. Let $X_k$ be the covariate information observed between decisions $k-1$ and $k$ under the observed treatment history $\bar{A}_{k-1}$ ($k = 2, \ldots, K$), with covariate history $\bar{X}_k = (X_1, \ldots, X_k)$ to decision $k$ for $k = 1, \ldots, K$. Let $Y$ be the observed outcome under $\bar{A}_K$. The observed data on a patient are $(\bar{X}_K, \bar{A}_K, Y)$, and the data available from a clinical trial or observational study involving $n$ subjects are independent and identically distributed $(\bar{X}_{Ki}, \bar{A}_{Ki}, Y_i)$ for $i = 1, \ldots, n$.

Under the following standard assumptions, $g^{\mathrm{opt}}$ may equivalently be expressed in terms of the observed data. The consistency assumption states that $X_k = X_k^*(\bar{A}_{k-1}) = \sum_{\bar{a}_{k-1} \in \bar{\mathcal{A}}_{k-1}} X_k^*(\bar{a}_{k-1}) I(\bar{A}_{k-1} = \bar{a}_{k-1})$ for $k = 2, \ldots, K$, and $Y = Y^*(\bar{A}_K) = \sum_{\bar{a}_K \in \bar{\mathcal{A}}_K} Y^*(\bar{a}_K) I(\bar{A}_K = \bar{a}_K)$; that is, a patient's observed covariates and outcome are the same as the potential ones s/he would exhibit under the treatment history actually received. The stable unit treatment value assumption (Rubin, 1978) implies that a patient's covariates and outcome are not influenced by treatments received by other patients. A version of the sequential randomization assumption (Robins, 2004) states that $W$ is independent of $A_k$ conditional on $(\bar{X}_k, \bar{A}_{k-1})$. This is satisfied by default for data from a sequentially randomized clinical trial (Murphy, 2005), but is not verifiable from data from an observational study. It is reasonable to believe that decisions made in an observational study are based on a patient's covariate and treatment history; however, all such information associated with treatment assignment and outcome must be recorded in $\bar{X}_k$ for the assumption to be valid.

Under these assumptions, from §1 of the Supplementary Material, $p_{Y^*(\bar{a}_K) \mid \bar{X}_K^*(\bar{a}_{K-1})}(y \mid \bar{x}_K) = p_{Y \mid \bar{X}_K, \bar{A}_K}(y \mid \bar{x}_K, \bar{a}_K)$, so that $E\{Y^*(\bar{a}_K) \mid \bar{X}_K^*(\bar{a}_{K-1}) = \bar{x}_K\} = E(Y \mid \bar{X}_K = \bar{x}_K, \bar{A}_K = \bar{a}_K)$. Thus, letting $Q_K(\bar{x}_K, \bar{a}_K) = E(Y \mid \bar{X}_K = \bar{x}_K, \bar{A}_K = \bar{a}_K)$, (2) and (3) become

$$g_K^{\mathrm{opt}}(\bar{x}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} Q_K(\bar{x}_K, \bar{a}_{K-1}, a_K), \qquad V_K(\bar{x}_K, \bar{a}_{K-1}) = \max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} Q_K(\bar{x}_K, \bar{a}_{K-1}, a_K).$$

Using $p_{X_k^*(\bar{a}_{k-1}) \mid \bar{X}_{k-1}^*(\bar{a}_{k-2})}(x_k \mid \bar{x}_{k-1}) = p_{X_k \mid \bar{X}_{k-1}, \bar{A}_{k-1}}(x_k \mid \bar{x}_{k-1}, \bar{a}_{k-1})$ for $k = K, \ldots, 2$,

$$Q_k(\bar{x}_k, \bar{a}_k) = E\{V_{k+1}(\bar{x}_k, X_{k+1}, \bar{a}_k) \mid \bar{X}_k = \bar{x}_k, \bar{A}_k = \bar{a}_k\} \quad (k = K-1, \ldots, 1),$$
$$g_k^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} Q_k(\bar{x}_k, \bar{a}_{k-1}, a_k) \quad (k = K-1, \ldots, 2),$$
$$V_k(\bar{x}_k, \bar{a}_{k-1}) = \max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} Q_k(\bar{x}_k, \bar{a}_{k-1}, a_k) \quad (k = K-1, \ldots, 2),$$

and $g_1^{\mathrm{opt}}(x_1) = \arg\max_{a_1 \in \Phi_1(x_1)} Q_1(x_1, a_1)$, $V_1(x_1) = \max_{a_1 \in \Phi_1(x_1)} Q_1(x_1, a_1)$. The $Q_k(\bar{x}_k, \bar{a}_k)$ and $V_k(\bar{x}_k, \bar{a}_{k-1})$ are referred to as Q-functions and value functions and are derived from the distribution of the observed data.

3. Q- and A-learning

Q-learning is based on the developments in §2. Linear or nonlinear models $Q_k(\bar{x}_k, \bar{a}_k; \beta_k)$ in a finite-dimensional parameter $\beta_k$ may be posited and estimators $\hat{\beta}_k$ obtained via a backward iterative process for $k = K, \ldots, 1$ by solving least squares estimating equations; see §2 of the Supplementary Material. The estimated optimal regime is $\hat{g}_Q^{\mathrm{opt}} = (\hat{g}_{Q,1}^{\mathrm{opt}}, \ldots, \hat{g}_{Q,K}^{\mathrm{opt}})$, where $\hat{g}_{Q,1}^{\mathrm{opt}}(x_1) = g_{Q,1}^{\mathrm{opt}}(x_1; \hat{\beta}_1) = \arg\max_{a_1 \in \Phi_1(x_1)} Q_1(x_1, a_1; \hat{\beta}_1)$ and $\hat{g}_{Q,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = g_{Q,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}; \hat{\beta}_k) = \arg\max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} Q_k(\bar{x}_k, \bar{a}_{k-1}, a_k; \hat{\beta}_k)$ for $k = 2, \ldots, K$. Unless all models are correctly specified, $\hat{g}_Q^{\mathrm{opt}}$ may not be a good estimator for $g^{\mathrm{opt}}$.
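As a concrete illustration, a minimal R sketch of this backward fitting for $K = 2$, using the linear working models that appear in §5, is given below. The data layout (a data frame dat with columns x1, a1, x2, a2, y) and the use of lm are our own assumptions for the sketch; the details of the paper's implementation are in the Supplementary Material.

# Stage-2 working model Q2(x1, x2, a1, a2; beta2), fitted by least squares
fit2 <- lm(y ~ x1 + a1 + I(a1 * x1) + I((1 - a1) * x2) +
             I(a2 * (1 - a1)) + I(a2 * (1 - a1) * x2), data = dat)
b2 <- coef(fit2)

# Estimated decision-2 rule: a1 = 1 patients must stay on treatment; otherwise
# treat when the estimated contrast beta25 + beta26 * x2 is positive
dat$a2opt  <- with(dat, a1 + (1 - a1) * as.numeric(b2[6] + b2[7] * x2 > 0))
dat$vtilde <- predict(fit2, newdata = transform(dat, a2 = a2opt))   # predicted outcome under that rule

# Stage-1 working model Q1(x1, a1; beta1), fitted to the stage-2 pseudo-outcome
fit1 <- lm(vtilde ~ x1 + a1 + I(a1 * x1), data = dat)
b1 <- coef(fit1)
ghat1 <- function(x1) as.numeric(b1[3] + b1[4] * x1 > 0)            # estimated decision-1 rule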

The A-learning method we consider is a version of g-estimation (Robins, 2004); see §2 of the Supplementary Material. Write $Q_k(\bar{x}_k, \bar{a}_k)$ as $h_k(\bar{x}_k, \bar{a}_{k-1}) + a_k C_k(\bar{x}_k, \bar{a}_{k-1})$, where $h_k(\bar{x}_k, \bar{a}_{k-1}) = Q_k(\bar{x}_k, \bar{a}_{k-1}, 0)$ and $C_k(\bar{x}_k, \bar{a}_{k-1}) = Q_k(\bar{x}_k, \bar{a}_{k-1}, 1) - Q_k(\bar{x}_k, \bar{a}_{k-1}, 0)$. We refer to $C_k(\bar{x}_k, \bar{a}_{k-1})$ as the Q-contrast function; with two treatment options, $A_k C_k(\bar{x}_k, \bar{a}_{k-1})$ is the optimal-blip-to-zero function of Robins (2004). Posit models $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$ and $C_1(x_1; \psi_1)$, depending on parameters $\psi_k$, and models $h_k(\bar{x}_k, \bar{a}_{k-1}; \alpha_k)$ and $h_1(x_1; \alpha_1)$, with parameters $\alpha_k$, for $k = K, \ldots, 2$. Let $\pi_k(\bar{x}_k, \bar{a}_{k-1}) = \mathrm{pr}(A_k = 1 \mid \bar{X}_k = \bar{x}_k, \bar{A}_{k-1} = \bar{a}_{k-1})$ and $\pi_1(x_1) = \mathrm{pr}(A_1 = 1 \mid X_1 = x_1)$ be the propensities for treatment, which are unknown unless the data are from a sequentially randomized trial, and specify models $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$, $k = K, \ldots, 2$, and $\pi_1(x_1; \gamma_1)$, e.g., logistic regression models. Estimators $\hat{\psi}_k$ may be found iteratively for $k = K, \ldots, 1$ by solving for $\psi_k$ and $\alpha_k$ the estimating equations given in §2 of the Supplementary Material, substituting the maximum likelihood estimators $\hat{\gamma}_k$. As $Q_k(\bar{x}_k, \bar{a}_k)$ is maximized by $a_k = I\{C_k(\bar{x}_k, \bar{a}_{k-1}) > 0\}$, the estimated optimal regime is $\hat{g}_A^{\mathrm{opt}} = (\hat{g}_{A,1}^{\mathrm{opt}}, \ldots, \hat{g}_{A,K}^{\mathrm{opt}})$, where $\hat{g}_{A,1}^{\mathrm{opt}}(x_1) = g_{A,1}^{\mathrm{opt}}(x_1; \hat{\psi}_1) = I\{C_1(x_1; \hat{\psi}_1) > 0\}$ and $\hat{g}_{A,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = g_{A,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}; \hat{\psi}_k) = I\{C_k(\bar{x}_k, \bar{a}_{k-1}; \hat{\psi}_k) > 0\}$ for $k = 2, \ldots, K$. If the contrast and propensity models are correctly specified, then $\hat{\psi}_k$ is consistent for $\psi_k$ even if $h_k(\bar{x}_k, \bar{a}_{k-1}; \alpha_k)$, $k = K, \ldots, 2$, and $h_1(x_1; \alpha_1)$ are misspecified, and $\hat{g}_A^{\mathrm{opt}}$ consistently estimates $g^{\mathrm{opt}}$. Thus, the quality of $\hat{g}_A^{\mathrm{opt}}$ depends on how close the $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$ are to the true contrast functions.
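The estimating equations themselves are given in the Supplementary Material. To convey the flavour of the approach, the R sketch below solves a simple version of the single-decision ($K = 1$) A-learning equations with linear working models $h(x_1; \alpha) = \alpha_0 + \alpha_1 x_1$ and $C(x_1; \psi) = \psi_0 + \psi_1 x_1$ and a logistic propensity model. The data layout (data frame dat with columns x1, a1, y) and this particular arrangement of the equations as a linear system are our own assumptions; the efficient multi-decision version used in §5 is more involved.

# Single-decision A-learning (g-estimation) with linear working models
pfit <- glm(a1 ~ x1, family = binomial, data = dat)      # propensity model pi(x1; gamma)
ph   <- fitted(pfit)

X <- cbind(1, dat$x1)                    # basis for both h(x1; alpha) and C(x1; psi)
D <- cbind(X, dat$a1 * X)                # residual is y - D %*% c(alpha0, alpha1, psi0, psi1)
M <- cbind(X, (dat$a1 - ph) * X)         # dh/dalpha and (a1 - pihat) * dC/dpsi
theta <- solve(crossprod(M, D), crossprod(M, dat$y))     # solves sum_i M_i (y_i - D_i theta) = 0
psi   <- theta[3:4]                      # estimated contrast parameters (psi0, psi1)
ghat  <- as.numeric(psi[1] + psi[2] * dat$x1 > 0)        # estimated rule I{C(x1; psihat) > 0}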

As discussed in §2 of the Supplementary Material, the efficient version of A-learning is so complex as to be infeasible to implement. The implementation of A-learning we use in the empirical studies of §5 is likely as close to efficient as could be hoped in practice.

See the unpublished report of Schulte et al. (2013) for a detailed account of both methods.

4. Proposed robust method

Q- and A-learning are predicated on the postulated models for the Q-functions and Q-contrast functions, respectively, so the resulting estimated regime may be far from gopt if these models are misspecified. We propose an alternative approach that may be robust to such misspecification, based on directly estimating the optimal regime in a specified class of regimes.

Models $Q_k(\bar{x}_k, \bar{a}_k; \beta_k)$ or $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$, whether correct or not, define classes of regimes $\mathcal{G}_\beta$, indexed by $\beta = (\beta_1^{\mathrm T}, \ldots, \beta_K^{\mathrm T})^{\mathrm T}$, or $\mathcal{G}_\psi$, indexed analogously by $\psi$, whose elements may often be simplified. For example, with $K = 2$, if $C_2(\bar{x}_2, a_1; \psi_2) = \psi_{02} + \psi_{12} x_2$ and $C_1(x_1; \psi_1) = \psi_{01} + \psi_{11} x_1$, the corresponding regimes $g_\psi = (g_{\psi 1}, g_{\psi 2})$ take $g_{\psi 1}(x_1) = I(\psi_{01} + \psi_{11} x_1 > 0)$ and $g_{\psi 2}(\bar{x}_2, a_1) = I(\psi_{02} + \psi_{12} x_2 > 0)$. If prior knowledge suggests that treatment 1 would benefit patients with smaller values of $X_1$ or $X_2$, then all reasonable regimes should have $\psi_{11} < 0$ and $\psi_{12} < 0$, and elements of $\mathcal{G}_\psi$ may be expressed in terms of $\eta_1 = -\psi_{01}/\psi_{11}$ and $\eta_2 = -\psi_{02}/\psi_{12}$ as $g_\eta = (g_{\eta 1}, g_{\eta 2})$, with $g_{\eta 1}(x_1) = I(\eta_1 > x_1)$, $g_{\eta 2}(\bar{x}_2, a_1) = I(\eta_2 > x_2)$ and $\eta = (\eta_1, \eta_2)^{\mathrm T}$.

This suggests considering a class $\mathcal{G}_\eta$, with elements $g_\eta = (g_{\eta 1}, \ldots, g_{\eta K})$ of the form $\{g_{\eta 1}(x_1), \ldots, g_{\eta K}(\bar{x}_K, \bar{a}_{K-1})\}$, indexed by $\eta = (\eta_1^{\mathrm T}, \ldots, \eta_K^{\mathrm T})^{\mathrm T}$. If $\mathcal{G}_\eta$ is derived from models $Q_k(\bar{x}_k, \bar{a}_k; \beta_k)$ or $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$, then $\eta = \eta(\beta)$ or $\eta = \eta(\psi)$ is a many-to-one function of $\beta$ or $\psi$, and $g^{\mathrm{opt}} \in \mathcal{G}_\eta$ if these models are correct. Here, estimating $\eta^{\mathrm{opt}} = \arg\max_\eta E\{Y^*(g_\eta)\}$, which defines the regime $g_{\eta^{\mathrm{opt}}}$, say, will yield an estimator for $g^{\mathrm{opt}}$. If these models are misspecified, $\eta(\hat{\beta})$ or $\eta(\hat{\psi})$ may not converge in probability to $\eta^{\mathrm{opt}}$, and the resulting regimes may be far from optimal. If instead the form of elements of $\mathcal{G}_\eta$ is chosen directly based on interpretability or cost, independently of such models, $\mathcal{G}_\eta$ may or may not contain $g^{\mathrm{opt}}$, but $g_{\eta^{\mathrm{opt}}}$ is still of interest as the optimal regime among those deemed realistic in practice.

We propose an approach to estimation of $g_{\eta^{\mathrm{opt}}}$ in a given class $\mathcal{G}_\eta$ by developing an estimator for $E\{Y^*(g_\eta)\}$ that is robust to model misspecification and maximizing it in $\eta$. We cast the problem as one of monotone coarsening. Following Tsiatis (2006, §7.1), for fixed $\eta$, let $\bar{g}_{\eta k} = (g_{\eta 1}, \ldots, g_{\eta k})$ for $k = 1, \ldots, K$, with $\bar{g}_{\eta K} = g_\eta$. Identify the full data as the potential outcomes $W_{g_\eta} = \{X_1, X_2^*(g_{\eta 1}), \ldots, X_K^*(\bar{g}_{\eta, K-1}), Y^*(g_\eta)\}$, and let $\bar{X}_k^*(\bar{g}_{\eta, k-1}) = \{X_1, X_2^*(g_{\eta 1}), \ldots, X_k^*(\bar{g}_{\eta, k-1})\}$. Let $\mathcal{C}_\eta$ be a discrete coarsening variable taking values $1, \ldots, K, \infty$ corresponding to $K + 1$ levels of coarsening, reflecting the extent to which the observed treatments received are consistent with those dictated by $g_\eta$. In the general coarsened data set-up, when $\mathcal{C}_\eta = k$, we observe $G_k(W_{g_\eta})$, a many-to-one function of $W_{g_\eta}$; when $\mathcal{C}_\eta = \infty$, we observe $G_\infty(W_{g_\eta}) = W_{g_\eta}$, the full data. Here, under the consistency assumption, this is as follows. If $A_1 \neq g_{\eta 1}(X_1)$, then $\mathcal{C}_\eta = 1$; that is, $I(\mathcal{C}_\eta = 1) = I\{A_1 \neq g_{\eta 1}(X_1)\}$, and we observe $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_1(W_{g_\eta}) = X_1$: none of the observed treatments is consistent with following $g_\eta$, so $X_2, \ldots, X_K, Y$ are not consistent with $g_\eta$. If $A_1 = g_{\eta 1}(X_1)$ and $A_2 \neq g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}$, then $\mathcal{C}_\eta = 2$, $I(\mathcal{C}_\eta = 2) = I\{A_1 = g_{\eta 1}(X_1)\} I[A_2 \neq g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}]$, and $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_2(W_{g_\eta}) = \bar{X}_2^*(g_{\eta 1}) = \bar{X}_2$: only the treatment at decision 1 and the ensuing $X_2$ are consistent with $g_\eta$. Likewise, $I(\mathcal{C}_\eta = 3) = I\{A_1 = g_{\eta 1}(X_1)\} I[A_2 = g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}] I\{A_3 \neq g_{\eta 3}(\bar{X}_3)\}$, where $g_{\eta 3}(\bar{X}_3)$ is shorthand for $g_{\eta 3}[\bar{X}_3, g_{\eta 1}(X_1), g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}] = g_{\eta 3}\{\bar{X}_3, \bar{g}_{\eta 2}(\bar{X}_2)\}$ with $\bar{g}_{\eta 2}(\bar{X}_2) = [g_{\eta 1}(X_1), g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}]$, and similarly for general $k$; and $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_3(W_{g_\eta}) = \bar{X}_3^*(\bar{g}_{\eta 2}) = \bar{X}_3$. Continuing in this fashion, $I(\mathcal{C}_\eta = K) = I[\bar{A}_{K-1} = \bar{g}_{\eta, K-1}\{\bar{X}_{K-1}, \bar{g}_{\eta, K-2}(\bar{X}_{K-2})\}] I[A_K \neq g_{\eta K}\{\bar{X}_K, \bar{g}_{\eta, K-1}(\bar{X}_{K-1})\}]$, and $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_K(W_{g_\eta}) = \bar{X}_K^*(\bar{g}_{\eta, K-1}) = \bar{X}_K$. Finally, if $\bar{A}_K = \bar{g}_{\eta K}\{\bar{X}_K, \bar{g}_{\eta, K-1}(\bar{X}_{K-1})\}$, then $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_\infty(W_{g_\eta}) = W_{g_\eta} = (X_1, \ldots, X_K, Y)$: the observed data are consistent with having followed all $K$ rules in $g_\eta$. The coarsening is monotone in that $G_k(W_{g_\eta})$ is a coarsened version of $G_{k'}(W_{g_\eta})$ for $k' > k$, and $G_k(W_{g_\eta})$ is a many-to-one function of $G_{k+1}(W_{g_\eta})$.

Coarsened data are said to be coarsened at random if, for each $k$, the probability that the data are coarsened at level $k$, given the full data, depends only on the coarsened data, and thus only on data observed at level $k$ (Tsiatis, 2006, §7.1). Under the consistency and sequential randomization assumptions, it may be shown using results in §3 of the Supplementary Material that the coarsening here is at random. Define the coarsening discrete hazard $\mathrm{pr}(\mathcal{C}_\eta = k \mid \mathcal{C}_\eta \geq k, W_{g_\eta})$ to be the probability that the observed treatments cease to be consistent with $g_\eta$ at decision $k$, given that they are consistent prior to $k$ and given all potential outcomes. Under coarsening at random, this hazard is a function only of the coarsened data, that is, the data observed through decision $k$, which we write as $\mathrm{pr}(\mathcal{C}_\eta = k \mid \mathcal{C}_\eta \geq k, W_{g_\eta}) = \lambda_{\eta,k}\{G_k(W_{g_\eta})\}$. Then, from above, for $k = 1$, $\lambda_{\eta,1}\{G_1(W_{g_\eta})\} = \lambda_{\eta,1}(X_1) = \mathrm{pr}\{A_1 \neq g_{\eta 1}(X_1) \mid X_1\}$, which can be expressed in terms of the propensity for treatment at decision 1 as $\pi_1(X_1)^{1 - g_{\eta 1}(X_1)} \{1 - \pi_1(X_1)\}^{g_{\eta 1}(X_1)}$. Similarly, for $k = 2, \ldots, K$,

$$\lambda_{\eta,k}\{G_k(W_{g_\eta})\} = \lambda_{\eta,k}(\bar{X}_k) = \mathrm{pr}\{A_k \neq g_{\eta k}(\bar{X}_k, \bar{A}_{k-1}) \mid \bar{X}_k, \bar{A}_{k-1} = \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}$$
$$= \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}^{1 - g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}} \times [1 - \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}]^{g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}}.$$

We may then express the probabilities of being consistent with $g_\eta$ through at least the $k$th decision, so of having $\mathcal{C}_\eta > k$, given all potential outcomes, in terms of the discrete hazards. Under coarsening at random, these probabilities depend only on the observed data through decision $k$; that is, $\mathrm{pr}(\mathcal{C}_\eta > k \mid W_{g_\eta}) = K_{\eta,k}\{G_k(W_{g_\eta})\} = K_{\eta,k}(\bar{X}_k)$, where $K_{\eta,k}(\bar{X}_k) = \prod_{k'=1}^{k} \{1 - \lambda_{\eta,k'}(\bar{X}_{k'})\}$ (Tsiatis, 2006, §8.1).

We now use these developments to deduce the form of estimators for $E\{Y^*(g_\eta)\}$. From the theory of Robins et al. (1994) for general monotonely coarsened data, under coarsening at random, if the coarsening mechanism is correctly specified, which corresponds here to correct specification of the $\lambda_{\eta,k}(\bar{X}_k)$, and hence of the propensity models, all regular, asymptotically linear, consistent estimators (Tsiatis, 2006, Chapter 3) for $E\{Y^*(g_\eta)\}$ for fixed $\eta$ have the form

$$n^{-1} \sum_{i=1}^{n} \left[ \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki})} Y_i + \sum_{k=1}^{K} \frac{I(\mathcal{C}_{\eta,i} = k) - \lambda_{\eta,k}(\bar{X}_{ki}) I(\mathcal{C}_{\eta,i} \geq k)}{K_{\eta,k}(\bar{X}_{ki})} L_k(\bar{X}_{ki}) \right], \tag{4}$$

where the $L_k(\bar{x}_k)$ are arbitrary functions of $\bar{x}_k$. The optimal choice, leading to the estimator (4) with smallest asymptotic variance, is $L_{\eta,k}^{\mathrm{opt}}(\bar{x}_k) = E\{Y^*(g_\eta) \mid \bar{X}_k^*(\bar{g}_{\eta,k-1}) = \bar{x}_k\}$. The right-hand term in (4) augments the first term, itself a consistent estimator for $E\{Y^*(g_\eta)\}$ when the $\lambda_{\eta,k}(\bar{X}_k)$ are correctly specified, to gain efficiency. As in Tsiatis (2006, §10.3), (4) is doubly robust in that it is a consistent estimator for $E\{Y^*(g_\eta)\}$ if either the $\lambda_{\eta,k}(\bar{X}_{ki})$ are correctly specified or the $L_k(\bar{X}_{ki})$ are equal to $L_{\eta,k}^{\mathrm{opt}}(\bar{X}_{ki})$ $(k = 1, \ldots, K)$; see §4 of the Supplementary Material.

To implement (4), one must specify the $\lambda_{\eta,k}(\bar{X}_{ki})$ and $L_k(\bar{X}_{ki})$. The first follow from specifying $\pi_1(x_1) = \mathrm{pr}(A_1 = 1 \mid X_1 = x_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}) = \mathrm{pr}(A_k = 1 \mid \bar{X}_k = \bar{x}_k, \bar{A}_{k-1} = \bar{a}_{k-1})$ for $k = K, \ldots, 2$. If these are unknown, as in A-learning, posit models $\pi_1(x_1; \gamma_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$ for $k = 2, \ldots, K$, and estimate $\gamma_k$ by $\hat{\gamma}_k$ $(k = 1, \ldots, K)$. With $\gamma = (\gamma_1^{\mathrm T}, \ldots, \gamma_K^{\mathrm T})^{\mathrm T}$ and $\hat{\gamma} = (\hat{\gamma}_1^{\mathrm T}, \ldots, \hat{\gamma}_K^{\mathrm T})^{\mathrm T}$, this implies that $\lambda_{\eta,1}(X_1; \gamma_1) = \pi_1(X_1; \gamma_1)^{1 - g_{\eta 1}(X_1)} \{1 - \pi_1(X_1; \gamma_1)\}^{g_{\eta 1}(X_1)}$,

$$\lambda_{\eta,k}(\bar{X}_k; \gamma_k) = \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1}); \gamma_k\}^{1 - g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}} \times [1 - \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1}); \gamma_k\}]^{g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}}$$

and $K_{\eta,k}(\bar{X}_k; \gamma) = \prod_{k'=1}^{k} \{1 - \lambda_{\eta,k'}(\bar{X}_{k'}; \gamma_{k'})\}$, and suggests substituting $\lambda_{\eta,k}(\bar{X}_k; \hat{\gamma}_k)$ and $K_{\eta,k}(\bar{X}_k; \hat{\gamma})$ in (4).
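For $K = 2$ these quantities reduce to simple arithmetic on the observed data. The R sketch below computes the coarsening indicators, discrete hazards and products $K_{\eta,k}(\bar{X}_k; \hat{\gamma})$ for a regime $g_\eta$ of the form used in §5, assuming a data frame dat with columns x1, a1, x2, a2 and logistic propensity models fitted as in the first scenario there (where patients with a1 = 1 remain on treatment). These assumptions and the object names are ours, not the authors' code.

# Fitted propensities pi1(x1; gammahat1) and pi2(x2, a1 = 0; gammahat2)
p1fit <- glm(a1 ~ x1, family = binomial, data = dat)
p2fit <- glm(a2 ~ x2, family = binomial, data = subset(dat, a1 == 0))
p1 <- predict(p1fit, newdata = dat, type = "response")
p2 <- ifelse(dat$a1 == 1, 1, predict(p2fit, newdata = dat, type = "response"))

eta <- c(250, -1, 360, -1)                                               # a candidate value of eta
g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)                           # g_eta1(x1)
g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)   # g_eta2(x2, a1)

lam1 <- p1^(1 - g1) * (1 - p1)^g1        # pr{A1 != g_eta1(X1) | X1}
lam2 <- p2^(1 - g2) * (1 - p2)^g2        # pr{A2 != g_eta2 | consistent through decision 1}
K1 <- 1 - lam1                           # K_eta,1 = pr(C_eta > 1)
K2 <- K1 * (1 - lam2)                    # K_eta,2 = pr(C_eta > 2)
C1ind <- as.numeric(dat$a1 != g1)                       # I(C_eta = 1)
C2ind <- as.numeric(dat$a1 == g1 & dat$a2 != g2)        # I(C_eta = 2)
Cinf  <- as.numeric(dat$a1 == g1 & dat$a2 == g2)        # I(C_eta = infinity)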

Several options exist for specification of the $L_k(\bar{X}_k)$. The simplest is to take $L_k(\bar{X}_k) \equiv 0$, yielding the inverse probability weighted estimator

$$\mathrm{IPWE}(\eta) = n^{-1} \sum_{i=1}^{n} \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki}; \hat{\gamma})} Y_i, \tag{5}$$

which is consistent for $E\{Y^*(g_\eta)\}$ if $\pi_1(x_1; \gamma_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$ $(k = 2, \ldots, K)$, and hence $K_{\eta,K}(\bar{X}_K; \gamma)$, are correctly specified, but otherwise may be inconsistent. The corresponding estimator for $g_{\eta^{\mathrm{opt}}}$ is found by estimating $\eta^{\mathrm{opt}}$ by $\hat{\eta}^{\mathrm{opt}}_{\mathrm{IPWE}}$, say, maximizing (5) in $\eta$. Because (5) uses data only from subjects whose entire treatment history is consistent with $g_\eta$, it is relatively inefficient compared with estimators that use all the data, discussed next.
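Continuing the $K = 2$ sketch above (same assumed data layout, regime class and fitted propensities p1, p2; not the authors' code), the estimator (5) is just a weighted average of outcomes from subjects whose treatments agree with $g_\eta$:

ipwe <- function(eta, dat, p1, p2) {
  g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)
  g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)
  K2 <- (1 - p1^(1 - g1) * (1 - p1)^g1) * (1 - p2^(1 - g2) * (1 - p2)^g2)
  cons <- as.numeric(dat$a1 == g1 & dat$a2 == g2)    # I(C_eta,i = infinity)
  mean(cons * dat$y / K2)                            # n^{-1} sum_i I(C_eta,i = inf) Y_i / K_eta,2
}
ipwe(c(250, -1, 360, -1), dat, p1, p2)               # value of (5) at one candidate eta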

To take greatest advantage of the potential for improved efficiency through the augmentation term in (4), an obvious approach is to posit and fit parametric models approximating the conditional expectations $L_{\eta,k}^{\mathrm{opt}}(\bar{x}_k) = E\{Y^*(g_\eta) \mid \bar{X}_k^*(\bar{g}_{\eta,k-1}) = \bar{x}_k\}$, and substitute these into (4) along with $\lambda_{\eta,k}(\bar{X}_k; \hat{\gamma}_k)$ and $K_{\eta,k}(\bar{X}_k; \hat{\gamma})$. To this end, let $\mu_{\eta K}(\bar{x}_K, \bar{a}_K) = E(Y \mid \bar{X}_K = \bar{x}_K, \bar{A}_K = \bar{a}_K)$ and $f_{\eta K}(\bar{x}_K, \bar{a}_{K-1}) = \mu_{\eta K}\{\bar{x}_K, \bar{a}_{K-1}, g_{\eta K}(\bar{x}_K, \bar{a}_{K-1})\}$. Then define iteratively, for $k = K-1, \ldots, 2$, the quantities $\mu_{\eta k}(\bar{x}_k, \bar{a}_k) = E\{f_{\eta,k+1}(\bar{x}_k, X_{k+1}, \bar{a}_k) \mid \bar{X}_k = \bar{x}_k, \bar{A}_k = \bar{a}_k\}$ and $f_{\eta k}(\bar{x}_k, \bar{a}_{k-1}) = \mu_{\eta k}\{\bar{x}_k, \bar{a}_{k-1}, g_{\eta k}(\bar{x}_k, \bar{a}_{k-1})\}$; for $k = 1$, $\mu_{\eta 1}(x_1, a_1) = E\{f_{\eta 2}(x_1, X_2, a_1) \mid X_1 = x_1, A_1 = a_1\}$ and $f_{\eta 1}(x_1) = \mu_{\eta 1}\{x_1, g_{\eta 1}(x_1)\}$. In §5 of the Supplementary Material, we demonstrate that $L_{\eta,k}^{\mathrm{opt}}(\bar{x}_k) = \mu_{\eta k}\{\bar{x}_k, \bar{g}_{\eta k}(\bar{x}_k)\}$.

This suggests specifying $\eta$-dependent models $\mu_{\eta k}(\bar{x}_k, \bar{a}_k; \xi_k)$ depending on parameters $\xi_k$, $k = 1, \ldots, K$. For fixed $\eta$, estimators $\hat{\xi}_k$ for $\xi_k$ may be found iteratively by solving in $\xi_k$

$$\sum_{i=1}^{n} \frac{\partial \mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)}{\partial \xi_k} \{\tilde{V}_{(k+1)i} - \mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)\} = 0 \quad (k = 1, \ldots, K),$$

where $\partial \mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)/\partial \xi_k$ is the vector of partial derivatives of $\mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)$ with respect to the elements of $\xi_k$, $\tilde{V}_{(K+1)i} = Y_i$ and $\tilde{V}_{ki} = \mu_{\eta k}[\bar{X}_{ki}, \bar{A}_{(k-1)i}, g_{\eta k}\{\bar{X}_{ki}, \bar{A}_{(k-1)i}\}; \hat{\xi}_k]$ $(k = K, \ldots, 2)$, so that the equations are solved backward in $k$. The fitted $\mu_{\eta k}\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\xi}_k\}$ may then be used to approximate $L_{\eta,k}^{\mathrm{opt}}(\bar{X}_{ki})$ in (4). While these models almost certainly are not correct, as specification of a compatible sequence of models for $k = 1, \ldots, K$ is a significant challenge, they may be reasonable approximations to the true conditional expectations. Thus, define

$$\mathrm{DR}(\eta) = n^{-1} \sum_{i=1}^{n} \left[ \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki}; \hat{\gamma})} Y_i + \sum_{k=1}^{K} \frac{I(\mathcal{C}_{\eta,i} = k) - \lambda_{\eta,k}(\bar{X}_{ki}; \hat{\gamma}_k) I(\mathcal{C}_{\eta,i} \geq k)}{K_{\eta,k}(\bar{X}_{ki}; \hat{\gamma})} \mu_{\eta k}\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\xi}_k\} \right], \tag{6}$$

which, by virtue of the double robustness property, is consistent for $E\{Y^*(g_\eta)\}$ if either $\pi_1(x_1; \gamma_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$ $(k = K, \ldots, 2)$ are correctly specified or the $\mu_{\eta k}(\bar{x}_k, \bar{a}_k; \xi_k)$ are. If all of these models were correct, then (6) would achieve optimal efficiency. As for (5), estimation of $g_{\eta^{\mathrm{opt}}}$ follows by maximizing (6) in $\eta$ to obtain $\hat{\eta}^{\mathrm{opt}}_{\mathrm{DR}}$.
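To convey the mechanics of (6) for $K = 2$, a condensed R sketch is given below; the linear working models for $\mu_{\eta 2}$ and $\mu_{\eta 1}$ mirror those reported in §5, while the data layout and the fitted propensities p1, p2 are as in the earlier sketches. All object names are our own, and the function refits the working models at each $\eta$, as the definition of (6) requires.

dr <- function(eta, dat, p1, p2) {
  g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)
  g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)
  lam1 <- p1^(1 - g1) * (1 - p1)^g1;  K1 <- 1 - lam1
  lam2 <- p2^(1 - g2) * (1 - p2)^g2;  K2 <- K1 * (1 - lam2)
  C1 <- as.numeric(dat$a1 != g1)                    # I(C_eta = 1)
  C2 <- as.numeric(dat$a1 == g1 & dat$a2 != g2)     # I(C_eta = 2)
  Ci <- as.numeric(dat$a1 == g1 & dat$a2 == g2)     # I(C_eta = infinity)

  # mu_eta2(x1, x2, a1, a2; xi2): least squares fit, then evaluated at a2 = g_eta2
  m2  <- lm(y ~ x1 + a1 + I(a1 * x1) + I((1 - a1) * x2) +
              I(a2 * (1 - a1)) + I(a2 * (1 - a1) * x2), data = dat)
  mu2 <- predict(m2, newdata = transform(dat, a2 = g2))

  # mu_eta1(x1, a1; xi1): regress the decision-2 pseudo-outcome on (x1, a1),
  # then evaluate at a1 = g_eta1(x1)
  dat$v2 <- mu2
  m1  <- lm(v2 ~ x1 + a1 + I(a1 * x1), data = dat)
  mu1 <- predict(m1, newdata = transform(dat, a1 = g1))

  mean(Ci * dat$y / K2 + (C1 - lam1) / K1 * mu1 +
       (C2 - lam2 * as.numeric(dat$a1 == g1)) / K2 * mu2)   # I(C_eta >= 1) = 1 for every subject
}

Estimation of $g_{\eta^{\mathrm{opt}}}$ then amounts to maximizing dr(eta, dat, p1, p2) over $\eta$ using the grid search or genetic algorithm described in §5.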

A computational challenge is that the models $\mu_{\eta k}(\bar{x}_k, \bar{a}_k; \xi_k)$ must be refitted for each value of $\eta$ encountered in the optimization algorithm used to carry out the maximization. A practical alternative when regimes in $\mathcal{G}_\eta$ are derived from models is to substitute for $L_k(\bar{X}_{ki})$ in (4) the fitted Q-functions $Q_k\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\beta}_k\}$ for $k = K, \ldots, 1$ obtained from Q-learning; holding $\hat{\beta}_k$ fixed, these depend on $\eta$ only through $\bar{g}_{\eta k}(\bar{X}_{ki})$. While these are not strictly models for $E\{Y^*(g_\eta) \mid \bar{X}_k^*(\bar{g}_{\eta,k-1})\}$, the hope is that they will be close enough to achieve near-optimal efficiency gains over (5). Thus, estimate $g_{\eta^{\mathrm{opt}}}$ by maximizing in $\eta$, to obtain $\hat{\eta}^{\mathrm{opt}}_{\mathrm{AIPWE}}$, the quantity

$$\mathrm{AIPWE}(\eta) = n^{-1} \sum_{i=1}^{n} \left[ \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki}; \hat{\gamma})} Y_i + \sum_{k=1}^{K} \frac{I(\mathcal{C}_{\eta,i} = k) - \lambda_{\eta,k}(\bar{X}_{ki}; \hat{\gamma}_k) I(\mathcal{C}_{\eta,i} \geq k)}{K_{\eta,k}(\bar{X}_{ki}; \hat{\gamma})} Q_k\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\beta}_k\} \right]. \tag{7}$$

See §6 of the Supplementary Material for a similar proposal when 𝒢η is determined directly.
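A sketch of (7) for $K = 2$ follows; it differs from the dr sketch above only in that the augmentation terms come from the Q-learning fits (fit2 and fit1 from the earlier Q-learning sketch), which are computed once and then held fixed as $\eta$ varies. The object names and data layout are again our own assumptions.

aipwe <- function(eta, dat, p1, p2, fit2, fit1) {
  g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)
  g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)
  lam1 <- p1^(1 - g1) * (1 - p1)^g1;  K1 <- 1 - lam1
  lam2 <- p2^(1 - g2) * (1 - p2)^g2;  K2 <- K1 * (1 - lam2)
  C1 <- as.numeric(dat$a1 != g1)
  C2 <- as.numeric(dat$a1 == g1 & dat$a2 != g2)
  Ci <- as.numeric(dat$a1 == g1 & dat$a2 == g2)
  # Q-functions evaluated at the regime-dictated treatments; wherever the k = 2
  # augmentation weight is nonzero, the observed a1 already equals g_eta1(x1)
  q2 <- predict(fit2, newdata = transform(dat, a1 = g1, a2 = g2))
  q1 <- predict(fit1, newdata = transform(dat, a1 = g1))
  mean(Ci * dat$y / K2 + (C1 - lam1) / K1 * q1 +
       (C2 - lam2 * as.numeric(dat$a1 == g1)) / K2 * q2)
}

In the simulations of §5, aipwe is then maximized over $\eta$ with $(\eta_{11}, \eta_{21})$ fixed at $(-1, -1)$.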

Standard errors for these estimators of $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$ may be obtained via the sandwich technique (Stefanski & Boos, 2002) based on the argument in Zhang et al. (2012, Equation (4)).

5. Simulation studies

We have carried out several simulation studies to evaluate the performance of the proposed methods, each involving 1000 Monte Carlo data sets.

The first simulation adopts the scenario in Moodie et al. (2007) of a study in which HIV-infected patients are randomized to initiate antiretroviral therapy or not, coded as 1 or 0, at baseline and again at six months, to determine the optimal regime for therapy initiation. We generated baseline CD4 count $X_1 \sim N(450, 150)$, where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$; baseline treatment $A_1$ as Bernoulli with success probability $\mathrm{pr}(A_1 = 1 \mid X_1) = \mathrm{expit}(2 - 0.006 X_1)$, where $\mathrm{expit}(u) = e^u/(1 + e^u)$; six-month CD4 count $X_2$, conditional on $(X_1, A_1)$, as $N(1.25 X_1, 60)$; and treatment at six months $A_2$ as Bernoulli with $\mathrm{pr}(A_2 = 1 \mid \bar{X}_2, A_1) = A_1 + (1 - A_1)\mathrm{expit}(0.8 - 0.004 X_2)$. Here, patients with $A_1 = 1$ continue on therapy with certainty. The outcome $Y$, one-year CD4 count, conditional on $(\bar{X}_2, \bar{A}_2)$, was normal with mean $400 + 1.6 X_1 - |250 - X_1|\{A_1 - I(250 - X_1 > 0)\}^2 - (1 - A_1)|720 - 2X_2|\{A_2 - I(720 - 2X_2 > 0)\}^2$ and variance $60^2$. The true Q-contrast functions are thus $C_2(x_1, x_2, a_1) = (1 - a_1)(720 - 2x_2)$ and $C_1(x_1) = 250 - x_1$; the optimal treatment regime $g^{\mathrm{opt}} = (g_1^{\mathrm{opt}}, g_2^{\mathrm{opt}})$ has $g_1^{\mathrm{opt}}(x_1) = I(250 - x_1 > 0)$ and $g_2^{\mathrm{opt}}(\bar{x}_2, a_1) = I\{a_1 + (1 - a_1)(720 - 2x_2) > 0\} = I\{a_1 + (1 - a_1)(360 - x_2) > 0\}$, and $E\{Y^*(g^{\mathrm{opt}})\} = 1120$.
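For reference, a minimal R transcription of this data-generating mechanism is given below; it is our own sketch, following the displayed distributions and the paper's convention that $N(\mu, \sigma^2)$ is parameterized by the variance.

expit <- function(u) exp(u) / (1 + exp(u))
n  <- 500
x1 <- rnorm(n, 450, sqrt(150))                       # baseline CD4 count
a1 <- rbinom(n, 1, expit(2 - 0.006 * x1))            # baseline treatment
x2 <- rnorm(n, 1.25 * x1, sqrt(60))                  # six-month CD4 count
a2 <- ifelse(a1 == 1, 1, rbinom(n, 1, expit(0.8 - 0.004 * x2)))   # a1 = 1 patients stay on therapy
mu <- 400 + 1.6 * x1 - abs(250 - x1) * (a1 - (250 - x1 > 0))^2 -
      (1 - a1) * abs(720 - 2 * x2) * (a2 - (720 - 2 * x2 > 0))^2
y  <- rnorm(n, mu, 60)                               # one-year CD4 count, standard deviation 60
dat <- data.frame(x1, a1, x2, a2, y)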

For A-learning, we took

$$h_2(\bar{x}_2, a_1; \alpha_2) = \alpha_{20} + \alpha_{21} x_1 + \alpha_{22} a_1 + \alpha_{23} a_1 x_1 + \alpha_{24}(1 - a_1) x_2, \qquad C_2(\bar{x}_2, a_1; \psi_2) = (1 - a_1)(\psi_{20} + \psi_{21} x_2),$$

$h_1(x_1; \alpha_1) = \alpha_{10} + \alpha_{11} x_1$, and $C_1(x_1; \psi_1) = \psi_{10} + \psi_{11} x_1$; and, analogously, for Q-learning,

$$Q_2(\bar{x}_2, \bar{a}_2; \beta_2) = \beta_{20} + \beta_{21} x_1 + a_1(\beta_{22} + \beta_{23} x_1) + \beta_{24}(1 - a_1) x_2 + a_2(1 - a_1)(\beta_{25} + \beta_{26} x_2),$$
$$Q_1(x_1, a_1; \beta_1) = \beta_{10} + \beta_{11} x_1 + a_1(\beta_{12} + \beta_{13} x_1),$$

so the Q-contrast functions are correct, but the Q-functions are misspecified. Here, $C_2(\bar{x}_2, 1; \psi_2) = 0$, respecting that $\Phi_2(\bar{x}_2, 1) = \{1\}$. We used correct propensity models $\pi_2(\bar{x}_2, a_1 = 0; \gamma_2) = \mathrm{expit}(\gamma_{20} + \gamma_{21} x_2)$ and $\pi_1(x_1; \gamma_1) = \mathrm{expit}(\gamma_{10} + \gamma_{11} x_1)$, and incorrect models $\pi_2(\bar{x}_2, a_1 = 0; \gamma_2) = \gamma_2$, $\pi_1(x_1; \gamma_1) = \gamma_1$.

For maximizing ipwe(η) in (5), dr(η) in (6), and aipwe(η) in (7) to obtain $\hat{\eta}^{\mathrm{opt}}_{\mathrm{IPWE}}$, $\hat{\eta}^{\mathrm{opt}}_{\mathrm{DR}}$, and $\hat{\eta}^{\mathrm{opt}}_{\mathrm{AIPWE}}$, we considered the class of regimes $\mathcal{G}_\eta$ with elements $g_\eta = (g_{\eta 1}, g_{\eta 2})$,

$$g_{\eta 2}(\bar{x}_2, a_1) = I\{a_1 + (1 - a_1)(\eta_{20} + \eta_{21} x_2) > 0\}, \qquad g_{\eta 1}(x_1) = I(\eta_{10} + \eta_{11} x_1 > 0),$$

so that $\eta_2 = (\eta_{20}, \eta_{21})^{\mathrm T}$, $\eta_1 = (\eta_{10}, \eta_{11})^{\mathrm T}$, $\eta = (\eta_1^{\mathrm T}, \eta_2^{\mathrm T})^{\mathrm T}$ and $\eta^{\mathrm{opt}} = (250, -1, 360, -1)^{\mathrm T}$. Clearly, $g^{\mathrm{opt}} \in \mathcal{G}_\eta$. We used the same propensity models and, for (7), the Q-function models as above; for (6), we posited $\mu_{\eta 2}(\bar{x}_2, \bar{a}_2; \xi_2) = \xi_{20} + \xi_{21} x_1 + a_1(\xi_{22} + \xi_{23} x_1) + \xi_{24}(1 - a_1) x_2 + a_2(1 - a_1)(\xi_{25} + \xi_{26} x_2)$ and $\mu_{\eta 1}(x_1, a_1; \xi_1) = \xi_{10} + \xi_{11} x_1 + a_1(\xi_{12} + \xi_{13} x_1)$ for each $\eta$. To achieve a unique representation, we fixed $(\eta_{21}, \eta_{11}) = (-1, -1)$ and determined $\eta_{20}$, $\eta_{10}$ via a grid search; because ipwe(η), dr(η) and aipwe(η) are step functions of $\eta$ with jumps at $(x_{1i}, x_{2j})$ $(i, j = 1, \ldots, n)$, we maximized in $\eta$ over all $(x_{1i}, x_{2j})$.
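The grid search can be coded directly: with $(\eta_{11}, \eta_{21})$ fixed at $(-1, -1)$, the estimators are piecewise constant in $(\eta_{10}, \eta_{20})$, so it suffices to evaluate them at the observed covariate values. The sketch below does this for the ipwe function of the earlier sketch (the same could be done with dr or aipwe); it is our own illustration and evaluates $n^2$ grid points, which is slow but straightforward.

grid <- expand.grid(eta10 = dat$x1, eta20 = dat$x2)                  # all (x_{1i}, x_{2j}) pairs
vals <- apply(grid, 1, function(e) ipwe(c(e[1], -1, e[2], -1), dat, p1, p2))
etahat <- unlist(grid[which.max(vals), ])                            # estimated (eta10, eta20)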

The second scenario is the same as the first except that the models for the Q-contrast functions are misspecified. Specifically, the generative distribution of $Y$ given $(\bar{X}_2, \bar{A}_2)$ is now normal with mean $400 + 1.6 X_1 - |250 - 0.6 X_1|\{A_1 - I(250 - X_1 > 0)\}^2 - (1 - A_1)|720 - 1.4 X_2|\{A_2 - I(720 - 2X_2 > 0)\}^2$ and variance $60^2$, so that, from the discussion below (2) of Moodie et al. (2007), the implied true contrast functions are no longer of the form above, but all posited models were taken to be the same as in the first simulation.

Tables 1 and 2 show the results. For Q- and A-learning, we report $\eta(\hat{\beta})$ and $\eta(\hat{\psi})$. The column Ê(η̂opt) shows, for each estimator, the Monte Carlo average and standard deviation of the estimated values of $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$, reflecting performance for estimating the true achievable mean outcome under the true optimal regime, while E(η̂opt) reflects performance of the estimated optimal regime itself: for each Monte Carlo data set, the true mean outcome that would be achieved if the estimated optimal regime were followed by the population was determined by simulation, and the values reported are the Monte Carlo average and standard deviation of these simulated quantities. When compared to the true $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$, these measure the extent to which the estimated optimal regimes approach the performance of the true optimal regime.

Table 1.

Results for the first simulation scenario, Q-contrast functions correct, 1000 Monte Carlo data sets, n = 500. For the true optimal regime $g^{\mathrm{opt}} = g_{\eta^{\mathrm{opt}}} \in \mathcal{G}_\eta$, $\eta^{\mathrm{opt}} = (250, -1, 360, -1)^{\mathrm T}$ and $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$

Estimator η̂10 η̂20 Ê(η̂opt) SE Cov. E(η̂opt)
Q-learning 228 (17) 322 (25) 1117 (12) 1119 (1)
Propensity score correct
A-learning 245 (18) 359 (20) 1121 (11) 1120 (1)
AIPWE (7) 210 (73) 363 (33) 1125 (12) 12 93.1 1118 (2)
DR (6) 211 (73) 363 (34) 1125 (12) 12 93.1 1118 (2)
IPWE (5) 268 (72) 397 (83) 1183 (24) 34 59.2 1105 (18)
Propensity score incorrect
A-learning 228 (17) 322 (25) 1117 (12) 1119 (1)
AIPWE (7) 259 (51) 390 (47) 1123 (12) 12 93.8 1116 (4)
DR (6) 262 (48) 386 (45) 1123 (12) 12 94.3 1116 (4)
IPWE (5) 349 (49) 471 (63) 1554 (56) 64 0.0 1075 (22)

AIPWE, DR, and IPWE, estimators based on maximizing aipwe(η), dr(η), and ipwe(η), respectively; η̂10, η̂20, Monte Carlo average estimates (standard deviation); Ê(η̂opt), Monte Carlo average (standard deviation) of estimated $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$; SE, Monte Carlo average of sandwich standard errors; Cov., coverage of associated 95% Wald-type confidence intervals for $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$; E(η̂opt), Monte Carlo average (standard deviation) of values $E\{Y^*(g_{\hat{\eta}^{\mathrm{opt}}})\}$ obtained using $10^6$ Monte Carlo simulations for each data set.

Table 2.

Results for the second simulation scenario, Q-contrast functions incorrect, 1000 Monte Carlo data sets, n = 500. For the true optimal regime $g^{\mathrm{opt}} = g_{\eta^{\mathrm{opt}}} \in \mathcal{G}_\eta$, $\eta^{\mathrm{opt}} = (250, -1, 360, -1)^{\mathrm T}$ and $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$. All quantities are as in Table 1

Estimator η̂10 η̂20 Ê(η̂opt) SE Cov. E(η̂opt)
Q-learning 381 (33) 386 (45) 1104 (12) 1088 (6)
Propensity score correct
A-learning 364 (29) 453 (26) 1115 (12) 1087 (3)
AIPWE (7) 250 (21) 359 (9) 1120 (12) 12 94.7 1118 (3)
DR (6) 250 (23) 360 (13) 1121 (11) 12 96.3 1118 (3)
IPWE (5) 305 (67) 432 (86) 1182 (27) 38 70.1 1096 (12)
Propensity score incorrect
A-learning 381 (33) 386 (45) 1104 (12) 1088 (6)
AIPWE (7) 255 (24) 363 (28) 1116 (12) 12 93.5 1118 (6)
DR (6) 255 (25) 364 (28) 1116 (12) 12 93.3 1117 (7)
IPWE (5) 361 (47) 480 (69) 1571 (59) 67 0.0 1086 (5)

For the first simulation, from Table 1, because the Q-functions are misspecified, the Q-learning estimators for η10 and η20 are biased, while those from A-learning, based on postulated Q-contrast functions that include the truth, are consistent when the propensity model is correct. When the propensity model is incorrect, Q-learning is unaffected; however, A-learning yields biased estimators for η10 and η20 identical to those from Q-learning, as linear models are used for $C_2(\bar{x}_2, a_1; \psi_2)$, $C_1(x_1; \psi_1)$, $h_2(\bar{x}_2, a_1; \alpha_2)$ and $h_1(x_1; \alpha_1)$ (Chakraborty et al., 2010). Although Q-learning results in poor estimation of η10 and η20, the loss for estimating the optimal regime is negligible, as the proportion of the benefit of the true optimal regime that the estimated regime achieves if used in the entire population is virtually one. A possible explanation is that patients near the true decision boundary have $C_2(\bar{X}_2, A_1)$ and $C_1(X_1)$ close to zero, and few patients would receive treatment 1 according to the true decision rule at the first decision point. This is also consistent with the fact that, for the regime $g = (0, g_2^{\mathrm{opt}})$, the corresponding expectation is 1114. When the propensity model is correct, the estimators based on dr(η) and aipwe(η) yield estimated regimes comparable to those found by A-learning in terms of true mean outcome achieved, despite yielding less efficient estimators for η10 and η20 than A-learning, perhaps for the same reason as above. When the propensity model is incorrect, the dr(η) and aipwe(η) estimators yield estimated regimes that are still close to optimal. The ipwe(η) estimator shows relatively poorer performance, especially when the propensity score model is incorrect, which is not unexpected; this estimator uses information only from patients whose treatment histories are consistent with following gη and hence is inefficient.

In the second simulation, the values of $|C_2(\bar{X}_2, A_1)|$ and $|C_1(X_1)|$ for patients near the true decision boundary are larger than in the first simulation, and the posited Q-contrast functions are no longer correct. From Table 2, the A- and Q-learning estimators perform similarly, both yielding estimated regimes far from optimal. Those based on dr(η) and aipwe(η) are almost identical to $g^{\mathrm{opt}}$ on average and perform almost identically to the true optimal regime, regardless of whether or not the propensity model is correct. Again, the estimator based on ipwe(η) in (5) performs poorly. Evidently, augmentation, even using incorrect models, leads to considerable gains over ipwe(η) regardless of whether or not the propensity model is correct.

The third scenario involved K = 3 decision points. To achieve average numbers of patients consistent with the regime comparable to those in the K = 2 cases, we took n = 1000. We generated $X_1$, $A_1$, $X_2$ as in the previous two scenarios; $A_2$ as Bernoulli with $\mathrm{pr}(A_2 = 1 \mid \bar{X}_2, A_1) = \mathrm{expit}(0.8 - 0.004 X_2)$; twelve-month CD4 count $X_3$, conditional on $(\bar{X}_2, \bar{A}_2)$, as $N(0.8 X_2, 60)$; treatment at twelve months $A_3$ as Bernoulli with $\mathrm{pr}(A_3 = 1 \mid \bar{X}_3, \bar{A}_2) = \mathrm{expit}(1 - 0.004 X_3)$; and the outcome $Y$, 18-month CD4 count, conditional on $(\bar{X}_3, \bar{A}_3)$, as normal with mean $400 + 1.6 X_1 - |500 - 1.4 X_1|\{A_1 - I(500 - 2X_1 > 0)\}^2 - |720 - 1.4 X_2|\{A_2 - I(720 - 2X_2 > 0)\}^2 - |600 - 1.4 X_3|\{A_3 - I(600 - 2X_3 > 0)\}^2$ and variance $60^2$. The optimal treatment regime $g^{\mathrm{opt}} = (g_1^{\mathrm{opt}}, g_2^{\mathrm{opt}}, g_3^{\mathrm{opt}})$ has $g_1^{\mathrm{opt}}(x_1) = I(250 - x_1 > 0)$, $g_2^{\mathrm{opt}}(\bar{x}_2, a_1) = I(360 - x_2 > 0)$, $g_3^{\mathrm{opt}}(\bar{x}_3, \bar{a}_2) = I(300 - x_3 > 0)$ and $E\{Y^*(g^{\mathrm{opt}})\} = 1120$.

For A-learning, we took

$$h_3(\bar{x}_3, \bar{a}_2; \alpha_3) = \alpha_{30} + \alpha_{31} x_1 + a_1(\alpha_{32} + \alpha_{33} x_1) + \alpha_{34} x_2 + a_2(\alpha_{35} + \alpha_{36} x_2) + \alpha_{37} x_3, \qquad C_3(\bar{x}_3, \bar{a}_2; \psi_3) = \psi_{30} + \psi_{31} x_3,$$
$$h_2(\bar{x}_2, a_1; \alpha_2) = \alpha_{20} + \alpha_{21} x_1 + a_1(\alpha_{22} + \alpha_{23} x_1) + \alpha_{24} x_2, \qquad C_2(\bar{x}_2, a_1; \psi_2) = \psi_{20} + \psi_{21} x_2,$$
$$h_1(x_1; \alpha_1) = \alpha_{10} + \alpha_{11} x_1, \qquad C_1(x_1; \psi_1) = \psi_{10} + \psi_{11} x_1,$$

and for Q-learning

$$Q_3(\bar{x}_3, \bar{a}_3; \beta_3) = \beta_{30} + \beta_{31} x_1 + a_1(\beta_{32} + \beta_{33} x_1) + \beta_{34} x_2 + a_2(\beta_{35} + \beta_{36} x_2) + \beta_{37} x_3 + a_3(\beta_{38} + \beta_{39} x_3),$$
$$Q_2(\bar{x}_2, \bar{a}_2; \beta_2) = \beta_{20} + \beta_{21} x_1 + a_1(\beta_{22} + \beta_{23} x_1) + \beta_{24} x_2 + a_2(\beta_{25} + \beta_{26} x_2),$$
$$Q_1(x_1, a_1; \beta_1) = \beta_{10} + \beta_{11} x_1 + a_1(\beta_{12} + \beta_{13} x_1);$$

thus, both the Q-functions and the Q-contrast functions are misspecified. We used correct propensity models $\pi_3(\bar{x}_3, \bar{a}_2; \gamma_3) = \mathrm{expit}(\gamma_{30} + \gamma_{31} x_3)$, $\pi_2(\bar{x}_2, a_1; \gamma_2) = \mathrm{expit}(\gamma_{20} + \gamma_{21} x_2)$, $\pi_1(x_1; \gamma_1) = \mathrm{expit}(\gamma_{10} + \gamma_{11} x_1)$ and incorrect models $\pi_3(\bar{x}_3, \bar{a}_2; \gamma_3) = \gamma_3$, $\pi_2(\bar{x}_2, a_1; \gamma_2) = \gamma_2$, $\pi_1(x_1; \gamma_1) = \gamma_1$.

For the three proposed estimators, we took the class of regimes $\mathcal{G}_\eta$ to have elements $g_\eta = (g_{\eta 1}, g_{\eta 2}, g_{\eta 3})$ with $g_{\eta 3}(\bar{x}_3, \bar{a}_2) = I(\eta_{30} + \eta_{31} x_3 > 0)$, $g_{\eta 2}(\bar{x}_2, a_1) = I(\eta_{20} + \eta_{21} x_2 > 0)$, $g_{\eta 1}(x_1) = I(\eta_{10} + \eta_{11} x_1 > 0)$, so $\eta_3 = (\eta_{30}, \eta_{31})^{\mathrm T}$, $\eta_2 = (\eta_{20}, \eta_{21})^{\mathrm T}$, $\eta_1 = (\eta_{10}, \eta_{11})^{\mathrm T}$, $\eta = (\eta_1^{\mathrm T}, \eta_2^{\mathrm T}, \eta_3^{\mathrm T})^{\mathrm T}$ and $\eta^{\mathrm{opt}} = (250, -1, 360, -1, 300, -1)^{\mathrm T}$, so that $g^{\mathrm{opt}} \in \mathcal{G}_\eta$. We used the same propensity models and, for (7), the Q-function models as above, and fixed $(\eta_{31}, \eta_{21}, \eta_{11}) = (-1, -1, -1)$. To carry out the maximizations, we used a genetic algorithm discussed by Goldberg (1989), implemented in the rgenoud package in R (Mebane & Sekhon, 2011); see §7 of the Supplementary Material for details.
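An illustrative call to the genetic optimizer is sketched below; the tuning actually used is described in the Supplementary Material, and the objective aipwe3 stands for a hypothetical K = 3 analogue of the aipwe function sketched in §4, with phat and qfits holding the fitted propensity and Q-function models. All of these names and settings are our own assumptions.

library(rgenoud)
obj <- function(th)                                   # th = (eta10, eta20, eta30); slopes fixed at -1
  aipwe3(c(th[1], -1, th[2], -1, th[3], -1), dat, phat, qfits)
out <- genoud(obj, nvars = 3, max = TRUE,             # maximize rather than minimize
              pop.size = 1000, print.level = 0,
              Domains = cbind(rep(0, 3), rep(1000, 3)))   # search box for the three thresholds
out$par                                               # estimated (eta10, eta20, eta30)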

Table 3 shows the results. Q-learning performs poorly, as expected. When the propensity model is correctly specified, results for A-learning and the proposed methods are similar to those in the second scenario, with the estimated regimes based on dr(η) and aipwe(η) achieving near-optimal performance and reliable inference on the true achievable mean outcome $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$. When the propensity models are misspecified, the situation is similar for these estimators in terms of performance; however, inference on $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$ is markedly degraded. In both cases, performance of the estimator based on ipwe(η) is quite poor. As the number of decisions K increases, it is not unexpected that all methods can suffer from diminished performance. Research is needed on the design of sequentially randomized trials to ensure adequate sample size for reliable inference on multi-decision regimes.

Table 3.

Results for the third simulation scenario, K = 3, Q-contrast functions incorrect, 1000 Monte Carlo data sets, n = 1000. For the true optimal regime $g^{\mathrm{opt}} = g_{\eta^{\mathrm{opt}}} \in \mathcal{G}_\eta$, $\eta^{\mathrm{opt}} = (250, -1, 360, -1, 300, -1)^{\mathrm T}$ and $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$. All quantities are as in Table 1

Estimator η̂10 η̂20 η̂30 Ê(η̂opt) SE Cov. E(η̂opt)
Q-learning 179 (58) 412.9 (28) 341 (33) 1058 (13) 1086 (9)
Propensity score correct
A-learning 319 (12) 462 (11) 387 (12) 1108 (12) 1071 (3)
AIPWE (7) 263 (41) 362 (14) 300 (7) 1121 (10) 10 94.6 1116 (5)
DR (6) 263 (37) 361 (11) 300 (8) 1121 (10) 10 94.2 1117 (5)
IPWE (5) 399 (132) 618 (138) 450 (132) 1297 (63) 103 56.2 1008 (75)
Propensity score incorrect
A-learning 179 (58) 413 (28) 341 (33) 1041 (12) 1086 (9)
AIPWE (7) 360 (48) 371 (39) 310 (30) 1200 (26) 27 9.0 1104 (10)
DR (6) 386 (35) 362 (26) 314 (39) 1208 (26) 27 4.5 1102 (9)
IPWE (5) 412 (42) 521 (60) 415 (58) 2459 (148) 167 0.0 1055 (14)

In §8 of the Supplementary Material, we present results of a more complex scenario; the qualitative conclusions are similar.

All simulations here, and others we have conducted, suggest that Q- and A-learning can yield biased estimators for the parameters defining the optimal regime if the Q-functions or Q-contrast functions are misspecified. Under these conditions, the resulting estimated optimal regimes can perform poorly in terms of achieving the expected potential outcome of the true optimal regime. In contrast, the proposed approach using (6) or (7) is robust to misspecification of either the outcome regression models or the propensity score models. Under these circumstances, the estimators of regime parameters are relatively unbiased, and the expected potential outcome under the estimated optimal regime approaches that of the true optimal regime. Moreover, the proposed methods lead to reliable estimation of the expected potential outcome under the true optimal regime, with coverage probabilities close to the nominal level. Even when both the outcome regression and propensity models are misspecified, the proposed methods can yield estimated optimal regimes that do not show substantial degradation of performance in terms of achieved expected potential outcome relative to the true optimal regime. In this case, inference on the expected outcome under the true optimal regime can be compromised, although, interestingly, the methods performed well in this regard under these conditions in the second simulation scenario. Collectively, our results suggest that the proposed methods are attractive alternatives to Q- and A-learning owing to their robustness to such model misspecification. As the estimator based on aipwe(η) is much less computationally intensive than dr(η) and performs similarly, we recommend it for practical use.

In §9 of the Supplementary Material, we report on application of the methods to a study to compare treatment options in patients with nonpsychotic major depressive disorder.


Acknowledgment

This research was supported by grants from the US National Institutes of Health.

Footnotes

Supplementary Material

Supplementary material available at Biometrika online includes technical arguments, more details on the estimators studied, and additional simulation results.

Contributor Information

Baqun Zhang, Email: baqun.zhang@northwestern.edu, Department of Preventive Medicine, 680 N. Lakeshore Drive, Suite 1400 Northwestern University, Chicago, Illinois, 60611 U.S.A.

Anastasios A. Tsiatis, Email: tsiatis@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695-8203, U.S.A.

Eric B. Laber, Email: eblaber@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695-8203, U.S.A.

Marie Davidian, Email: davidian@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695-8203, U.S.A.

References

1. Almirall D, Ten Have T, Murphy SA. Structural nested mean models for assessing time-varying effect moderation. Biometrics. 2010;66:131–139. doi: 10.1111/j.1541-0420.2009.01238.x.
2. Bather J. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. Chichester: Wiley; 2000.
3. Chakraborty B, Murphy SA, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Statist. Meth. Med. Res. 2010;19:317–343. doi: 10.1177/0962280209105013.
4. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley; 1989.
5. Henderson R, Ansell P, Alshibani D. Regret-regression for optimal dynamic treatment regimes. Biometrics. 2010;66:1192–1201. doi: 10.1111/j.1541-0420.2009.01368.x.
6. Mebane WR, Sekhon JS. Genetic optimization using derivatives: the rgenoud package for R. J. Statist. Soft. 2011;42:1–26.
7. Moodie EEM, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63:447–455. doi: 10.1111/j.1541-0420.2006.00686.x.
8. Murphy SA. Optimal dynamic treatment regimes (with discussion). J. Royal Statist. Soc., Ser. B. 2003;65:331–366.
9. Murphy SA. An experimental design for the development of adaptive treatment strategies. Statist. Med. 2005;24:1455–1481. doi: 10.1002/sim.2022.
10. Murphy SA, Oslin DW, Rush AJ, Zhu J. Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology. 2007;32:257–262. doi: 10.1038/sj.npp.1301241.
11. Orellana L, Rotnitzky A, Robins J. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. Int. J. Biostatist. 2010;6(2): Article 8.
12. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods: applications to control of the healthy worker survivor effect. Math. Model. 1986;7:1393–1512.
13. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty PJ, editors. Proceedings of the Second Seattle Symposium on Biostatistics. New York: Springer; 2004. pp. 189–326.
14. Robins J, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Statist. Med. 2008;27:4678–4721. doi: 10.1002/sim.3301.
15. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc. 1994;89:846–866.
16. Rosthøj S, Fullwood C, Henderson R, Stewart S. Estimation of optimal dynamic anticoagulation regimes from observational data: a regret-based approach. Statist. Med. 2006;25:4197–4215. doi: 10.1002/sim.2694.
17. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann. Statist. 1978;6:34–58.
18. Stefanski LA, Boos DD. The calculus of M-estimation. Amer. Statist. 2002;56:29–38.
19. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
20. Watkins CJCH, Dayan P. Q-learning. Mach. Learn. 1992;8:279–292.
21. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
22. Zhao Y, Kosorok MR, Zeng D. Reinforcement learning design for cancer clinical trials. Statist. Med. 2009;28:3294–3315. doi: 10.1002/sim.3720.
