Published in final edited form as: Biometrika. 2013;100(3). doi: 10.1093/biomet/ast014

Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions

Baqun Zhang 1, Anastasios A Tsiatis 2, Eric B Laber 3, Marie Davidian 4

Summary

A dynamic treatment regime is a list of sequential decision rules for assigning treatment based on a patient’s history. Q- and A-learning are two main approaches for estimating the optimal regime, i.e., that yielding the most beneficial outcome in the patient population, using data from a clinical trial or observational study. Q-learning requires postulated regression models for the outcome, while A-learning involves models for that part of the outcome regression representing treatment contrasts and for treatment assignment. We propose an alternative to Q- and A-learning that maximizes a doubly robust augmented inverse probability weighted estimator for population mean outcome over a restricted class of regimes. Simulations demonstrate the method’s performance and robustness to model misspecification, which is a key concern.

Keywords: A-learning, Double robustness, Outcome regression, Propensity score, Q-learning

1. Introduction

Treatment of patients with chronic disease involves a series of decisions, where the clinician determines the next treatment to be administered based on all information available to that point. A dynamic treatment regime is a set of sequential decision rules, each corresponding to a decision point in the treatment process. Each rule inputs the available information and outputs the treatment to be given from among the possible options. The optimal regime is that yielding the most favorable outcome on average if followed by the patient population.

Q- and A-learning are two main approaches for estimating the optimal dynamic treatment regime using data from a clinical trial or observational study. Q-learning (Watkins & Dayan, 1992) involves postulating at each decision point regression models for outcome as a function of patient information to that point. In A-learning (Robins, 2004; Murphy, 2003), models are posited only for the part of the regression involving contrasts among treatments and for treatment assignment at each decision point. Both are implemented through a backward recursive fitting procedure based on a dynamic programming algorithm (Bather, 2000). Under certain assumptions and correct specification of these models, Q- and A-learning lead to consistent estimation of the optimal regime. See Rosthøj et al. (2006), Murphy et al. (2007), Zhao et al. (2009) and Henderson et al. (2010) for applications; related methods are discussed by Robins (2004), Moodie et al. (2007), Robins et al. (2008), Almirall et al. (2010) and Orellana et al. (2010).

A concern with both Q- and A-learning is the effect of model misspecification on the quality of the estimated optimal regime. If one attempts to circumvent this difficulty by using flexible nonparametric regression techniques (Zhao et al., 2009), the estimated optimal rules may be complicated functions of possibly high-dimensional patient information that are difficult to interpret or implement and hence are unappealing to clinicians wary of black box approaches.

Given these drawbacks, we focus on a restricted class of treatment regimes indexed by a finite number of parameters, where the form of regimes in the class may be derived from posited regression models or prespecified on the grounds of interpretability or cost to depend on key subsets of patient information. Zhang et al. (2012) proposed an approach for estimating the optimal regime within such a restricted class for a single treatment decision, based on directly maximizing a doubly robust augmented inverse probability weighted estimator for the population mean outcome over all regimes in the class, assuming that larger outcomes are preferred. Via the double robustness property, the estimated optimal regimes enjoy protection against model misspecification and performance comparable or superior to that of competing methods. With judicious choice of the augmentation term, increased efficiency of estimation of the mean outcome is achieved, which translates into more precise estimators for the optimal regime.

We adapt this approach to two or more decision points. This is considerably more complex than for one decision and is based on casting the problem as one of monotone coarsening (Tsiatis, 2006, Chapter 7). We focus for simplicity on the case of two treatment options at each decision point, though the methods extend to a finite number of options. The methods lead to estimated optimal regimes achieving comparable performance to those derived via Q- or A-learning under correctly specified models and have the added benefit of protection against misspecification.

2. Framework

Assume there are K prespecified, ordered decision points and an outcome of interest, a function of information collected across all K decisions or ascertained after the Kth decision, with larger values preferred. At each decision k = 1, …, K, there are two k-specific treatment options coded as 0,1 in the set of options 𝒜k; write ak to denote an element of 𝒜k. Denote a possible treatment history up to and including decision k as āk = (a1, …, ak) ∈ 𝒜1 × ⋯ × 𝒜k = 𝒜̄k.

We consider a potential outcomes framework. For a randomly chosen patient, let $X_1$ denote baseline covariates recorded prior to the first decision, and let $X_k^*(\bar{a}_{k-1})$ be the covariate information that would accrue between decisions $k-1$ and $k$ were s/he to receive treatment history $\bar{a}_{k-1}$ ($k = 2, \ldots, K$), taking values $x_k \in \mathcal{X}_k$. Let $Y^*(\bar{a}_K)$ be the outcome that would result were s/he to receive full treatment history $\bar{a}_K$. Then define the potential outcomes (Robins, 1986) as

$$W = \{X_1, X_2^*(a_1), \ldots, X_K^*(\bar{a}_{K-1}), Y^*(\bar{a}_K) \text{ for all } \bar{a}_K \in \bar{\mathcal{A}}_K\}.$$

For convenience later, we include $X_1$, which is always observed and hence is not strictly a potential outcome, in $W$, and write $\bar{X}_k^*(\bar{a}_{k-1}) = \{X_1, X_2^*(a_1), \ldots, X_k^*(\bar{a}_{k-1})\}$ and $\bar{x}_k = (x_1, \ldots, x_k)$ for $k = 1, \ldots, K$, where then $\bar{x}_k \in \bar{\mathcal{X}}_k = \mathcal{X}_1 \times \cdots \times \mathcal{X}_k$.

A dynamic treatment regime $g = (g_1, \ldots, g_K)$ is an ordered set of decision rules, where $g_k(\bar{x}_k, \bar{a}_{k-1})$ corresponding to the $k$th decision takes as input a patient's realized covariate and treatment history up to decision $k$ and outputs a treatment option $a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1}) \subseteq \mathcal{A}_k$. In general, $\Phi_k(\bar{x}_k, \bar{a}_{k-1})$ is the set of feasible options at decision $k$ for a patient with realized history $(\bar{x}_k, \bar{a}_{k-1})$, allowing that some options in $\mathcal{A}_k$ may not be possible for patients with certain histories; here, $\Phi_k(\bar{x}_k, \bar{a}_{k-1}) \subseteq \{0, 1\}$. Thus, a feasible treatment regime must satisfy $g_k(\bar{x}_k, \bar{a}_{k-1}) \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})$ ($k = 1, \ldots, K$). Denote the class of all feasible regimes by $\mathcal{G}$.

For $g \in \mathcal{G}$, writing $\bar{g}_k = (g_1, \ldots, g_k)$ for $k = 1, \ldots, K$ and $\bar{g}_K = g$, define the potential outcomes associated with $g$ to be $W_g = \{X_1, X_2^*(g_1), \ldots, X_K^*(\bar{g}_{K-1}), Y^*(g)\}$, where $X_k^*(\bar{g}_{k-1})$ is the covariate information that would be seen between decisions $k-1$ and $k$ were a patient to receive the treatments dictated sequentially by the first $k-1$ rules in $g$, and $Y^*(g)$ is the outcome if s/he were to receive the $K$ treatments determined by $g$. Thus, $W_g$ is composed of elements of $W$.

Define an optimal treatment regime $g^{\mathrm{opt}} = (g_1^{\mathrm{opt}}, \ldots, g_K^{\mathrm{opt}}) \in \mathcal{G}$ as satisfying

$$E\{Y^*(g^{\mathrm{opt}})\} \geq E\{Y^*(g)\}, \quad g \in \mathcal{G}. \tag{1}$$

That is, $g^{\mathrm{opt}}$ is a regime that maximizes the expected outcome were all patients in the population to follow it. The optimal regime $g^{\mathrm{opt}}$ may be determined via dynamic programming, also referred to as backward induction. At the $K$th decision point, for any $\bar{x}_K \in \bar{\mathcal{X}}_K$, $\bar{a}_{K-1} \in \bar{\mathcal{A}}_{K-1}$, define

$$g_K^{\mathrm{opt}}(\bar{x}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \bar{X}_K^*(\bar{a}_{K-1}) = \bar{x}_K\}, \tag{2}$$
$$V_K(\bar{x}_K, \bar{a}_{K-1}) = \max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \bar{X}_K^*(\bar{a}_{K-1}) = \bar{x}_K\}. \tag{3}$$

For $k = K-1, \ldots, 2$ and any $\bar{x}_k \in \bar{\mathcal{X}}_k$, $\bar{a}_{k-1} \in \bar{\mathcal{A}}_{k-1}$, define

$$g_k^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} E[V_{k+1}\{\bar{x}_k, X_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \bar{X}_k^*(\bar{a}_{k-1}) = \bar{x}_k],$$
$$V_k(\bar{x}_k, \bar{a}_{k-1}) = \max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} E[V_{k+1}\{\bar{x}_k, X_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \bar{X}_k^*(\bar{a}_{k-1}) = \bar{x}_k].$$

For $k = 1$ and $x_1 \in \mathcal{X}_1$, $g_1^{\mathrm{opt}}(x_1) = \arg\max_{a_1 \in \Phi_1(x_1)} E[V_2\{x_1, X_2^*(a_1), a_1\} \mid X_1 = x_1]$ and $V_1(x_1) = \max_{a_1 \in \Phi_1(x_1)} E[V_2\{x_1, X_2^*(a_1), a_1\} \mid X_1 = x_1]$. Thus, $g_K^{\mathrm{opt}}$ yields the treatment option at decision $K$ that maximizes the expected potential outcome given prior covariate and treatment history. At decisions $k = K-1, \ldots, 1$, $g_k^{\mathrm{opt}}$ dictates the option that maximizes the expected potential outcome that would be achieved if the optimal rules were followed at all future decisions. An argument that $g^{\mathrm{opt}}$, so defined, satisfies (1) is given in an unpublished report by Schulte et al. (2013) available from the last author.
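To make the backward induction concrete, the short R sketch below computes $g^{\mathrm{opt}}$ for a two-decision problem in which the required conditional expectations are available in closed form, namely the generative model used for the first simulation scenario of §5. It is a toy illustration under our own coding conventions (the function names Q2, g2opt, V2, Q1 and g1opt are ours), not the authors' software.

# True outcome regression E(Y | x1, x2, a1, a2) for the first scenario of Section 5
Q2 <- function(x1, x2, a1, a2)
  400 + 1.6 * x1 - abs(250 - x1) * (a1 - (250 - x1 > 0))^2 -
  (1 - a1) * abs(720 - 2 * x2) * (a2 - (720 - 2 * x2 > 0))^2

# Decision 2: maximize Q2 over the feasible set (patients with a1 = 1 must continue)
g2opt <- function(x1, x2, a1) a1 + (1 - a1) * as.numeric(720 - 2 * x2 > 0)
V2    <- function(x1, x2, a1) Q2(x1, x2, a1, g2opt(x1, x2, a1))

# Decision 1: Q1(x1, a1) = E{V2(x1, X2, a1) | X1 = x1, A1 = a1}, with
# X2 | (x1, a1) ~ N(1.25 x1, 60), approximated by Monte Carlo integration
Q1 <- function(x1, a1, M = 1e4) mean(V2(x1, rnorm(M, 1.25 * x1, sqrt(60)), a1))
g1opt <- function(x1) as.numeric(Q1(x1, 1) > Q1(x1, 0))

g1opt(200); g1opt(300)   # 1 then 0, agreeing with g1opt(x1) = I(250 - x1 > 0)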

This definition of an optimal regime is intuitive, but it is given in terms of potential outcomes. In practice, with the exception of $X_1$, $W$ cannot be observed for any patient; rather, a patient is observed to experience only a single treatment history. Let $A_k$ be the observed treatment received at decision $k$ and let $\bar{A}_k = (A_1, \ldots, A_k)$ be the observed treatment history up to decision $k$. Let $X_k$ be the covariate information observed between decisions $k-1$ and $k$ under the observed treatment history $\bar{A}_{k-1}$ ($k = 2, \ldots, K$), with covariate history $\bar{X}_k = (X_1, \ldots, X_k)$ to decision $k$ for $k = 1, \ldots, K$. Let $Y$ be the observed outcome under $\bar{A}_K$. The observed data on a patient are $(\bar{X}_K, \bar{A}_K, Y)$, and the data available from a clinical trial or observational study involving $n$ subjects are independent and identically distributed $(\bar{X}_{Ki}, \bar{A}_{Ki}, Y_i)$ for $i = 1, \ldots, n$.

Under the following standard assumptions, $g^{\mathrm{opt}}$ may equivalently be expressed in terms of the observed data. The consistency assumption states that $X_k = X_k^*(\bar{A}_{k-1}) = \sum_{\bar{a}_{k-1} \in \bar{\mathcal{A}}_{k-1}} X_k^*(\bar{a}_{k-1}) I(\bar{A}_{k-1} = \bar{a}_{k-1})$ for $k = 2, \ldots, K$, and $Y = Y^*(\bar{A}_K) = \sum_{\bar{a}_K \in \bar{\mathcal{A}}_K} Y^*(\bar{a}_K) I(\bar{A}_K = \bar{a}_K)$; that is, a patient's observed covariates and outcome are the same as the potential ones s/he would exhibit under the treatment history actually received. The stable unit treatment value assumption (Rubin, 1978) implies that a patient's covariates and outcome are not influenced by treatments received by other patients. A version of the sequential randomization assumption (Robins, 2004) states that $W$ is independent of $A_k$ conditional on $(\bar{X}_k, \bar{A}_{k-1})$. This is satisfied by default for data from a sequentially randomized clinical trial (Murphy, 2005), but is not verifiable from data from an observational study. It is reasonable to believe that decisions made in an observational study are based on a patient's covariate and treatment history; however, all such information associated with treatment assignment and outcome must be recorded in $\bar{X}_k$ for the assumption to be valid.

Under these assumptions, from §1 of the Supplementary Material, $p_{Y^*(\bar{a}_K) \mid \bar{X}_K^*(\bar{a}_{K-1})}(y \mid \bar{x}_K) = p_{Y \mid \bar{X}_K, \bar{A}_K}(y \mid \bar{x}_K, \bar{a}_K)$, so that $E\{Y^*(\bar{a}_K) \mid \bar{X}_K^*(\bar{a}_{K-1}) = \bar{x}_K\} = E(Y \mid \bar{X}_K = \bar{x}_K, \bar{A}_K = \bar{a}_K)$. Thus, letting $Q_K(\bar{x}_K, \bar{a}_K) = E(Y \mid \bar{X}_K = \bar{x}_K, \bar{A}_K = \bar{a}_K)$, (2) and (3) become

$$g_K^{\mathrm{opt}}(\bar{x}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} Q_K(\bar{x}_K, \bar{a}_{K-1}, a_K), \qquad V_K(\bar{x}_K, \bar{a}_{K-1}) = \max_{a_K \in \Phi_K(\bar{x}_K, \bar{a}_{K-1})} Q_K(\bar{x}_K, \bar{a}_{K-1}, a_K).$$

Using $p_{X_k^*(\bar{a}_{k-1}) \mid \bar{X}_{k-1}^*(\bar{a}_{k-2})}(x_k \mid \bar{x}_{k-1}) = p_{X_k \mid \bar{X}_{k-1}, \bar{A}_{k-1}}(x_k \mid \bar{x}_{k-1}, \bar{a}_{k-1})$ for $k = K, \ldots, 2$,

$$Q_k(\bar{x}_k, \bar{a}_k) = E\{V_{k+1}(\bar{x}_k, X_{k+1}, \bar{a}_k) \mid \bar{X}_k = \bar{x}_k, \bar{A}_k = \bar{a}_k\} \quad (k = K-1, \ldots, 1),$$
$$g_k^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} Q_k(\bar{x}_k, \bar{a}_{k-1}, a_k) \quad (k = K-1, \ldots, 2),$$
$$V_k(\bar{x}_k, \bar{a}_{k-1}) = \max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} Q_k(\bar{x}_k, \bar{a}_{k-1}, a_k) \quad (k = K-1, \ldots, 2),$$

and $g_1^{\mathrm{opt}}(x_1) = \arg\max_{a_1 \in \Phi_1(x_1)} Q_1(x_1, a_1)$, $V_1(x_1) = \max_{a_1 \in \Phi_1(x_1)} Q_1(x_1, a_1)$. The $Q_k(\bar{x}_k, \bar{a}_k)$ and $V_k(\bar{x}_k, \bar{a}_{k-1})$ are referred to as Q-functions and value functions and are derived from the distribution of the observed data.

3. Q- and A-learning

Q-learning is based on the developments in §2. Linear or nonlinear models $Q_k(\bar{x}_k, \bar{a}_k; \beta_k)$ in a finite-dimensional parameter $\beta_k$ may be posited and estimators $\hat{\beta}_k$ obtained via a backward iterative process for $k = K, \ldots, 1$ by solving least squares estimating equations; see §2 of the Supplementary Material. The estimated optimal regime is $\hat{g}_Q^{\mathrm{opt}} = (\hat{g}_{Q,1}^{\mathrm{opt}}, \ldots, \hat{g}_{Q,K}^{\mathrm{opt}})$, where $\hat{g}_{Q,1}^{\mathrm{opt}}(x_1) = g_{Q,1}^{\mathrm{opt}}(x_1; \hat{\beta}_1) = \arg\max_{a_1 \in \Phi_1(x_1)} Q_1(x_1, a_1; \hat{\beta}_1)$ and $\hat{g}_{Q,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = g_{Q,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}; \hat{\beta}_k) = \arg\max_{a_k \in \Phi_k(\bar{x}_k, \bar{a}_{k-1})} Q_k(\bar{x}_k, \bar{a}_{k-1}, a_k; \hat{\beta}_k)$ for $k = 2, \ldots, K$. Unless all models are correctly specified, $\hat{g}_Q^{\mathrm{opt}}$ may not be a good estimator for $g^{\mathrm{opt}}$.
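As a concrete illustration, a minimal R sketch of this backward fitting for $K = 2$, using the linear working models that appear in §5, is given below. The data layout (a data frame dat with columns x1, a1, x2, a2, y) and the use of lm are our own assumptions for the sketch; the details of the paper's implementation are in the Supplementary Material.

# Stage-2 working model Q2(x1, x2, a1, a2; beta2), fitted by least squares
fit2 <- lm(y ~ x1 + a1 + I(a1 * x1) + I((1 - a1) * x2) +
             I(a2 * (1 - a1)) + I(a2 * (1 - a1) * x2), data = dat)
b2 <- coef(fit2)

# Estimated decision-2 rule: a1 = 1 patients must stay on treatment; otherwise
# treat when the estimated contrast beta25 + beta26 * x2 is positive
dat$a2opt  <- with(dat, a1 + (1 - a1) * as.numeric(b2[6] + b2[7] * x2 > 0))
dat$vtilde <- predict(fit2, newdata = transform(dat, a2 = a2opt))   # predicted outcome under that rule

# Stage-1 working model Q1(x1, a1; beta1), fitted to the stage-2 pseudo-outcome
fit1 <- lm(vtilde ~ x1 + a1 + I(a1 * x1), data = dat)
b1 <- coef(fit1)
ghat1 <- function(x1) as.numeric(b1[3] + b1[4] * x1 > 0)            # estimated decision-1 rule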

The A-learning method we consider is a version of g-estimation (Robins, 2004); see §2 of the Supplementary Material. Write $Q_k(\bar{x}_k, \bar{a}_k)$ as $h_k(\bar{x}_k, \bar{a}_{k-1}) + a_k C_k(\bar{x}_k, \bar{a}_{k-1})$, where $h_k(\bar{x}_k, \bar{a}_{k-1}) = Q_k(\bar{x}_k, \bar{a}_{k-1}, 0)$ and $C_k(\bar{x}_k, \bar{a}_{k-1}) = Q_k(\bar{x}_k, \bar{a}_{k-1}, 1) - Q_k(\bar{x}_k, \bar{a}_{k-1}, 0)$. We refer to $C_k(\bar{x}_k, \bar{a}_{k-1})$ as the Q-contrast function; with two treatment options, $A_k C_k(\bar{x}_k, \bar{a}_{k-1})$ is the optimal-blip-to-zero function of Robins (2004). Posit models $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$ and $C_1(x_1; \psi_1)$, depending on parameters $\psi_k$, and models $h_k(\bar{x}_k, \bar{a}_{k-1}; \alpha_k)$ and $h_1(x_1; \alpha_1)$, with parameters $\alpha_k$, for $k = K, \ldots, 2$. Let $\pi_k(\bar{x}_k, \bar{a}_{k-1}) = \mathrm{pr}(A_k = 1 \mid \bar{X}_k = \bar{x}_k, \bar{A}_{k-1} = \bar{a}_{k-1})$ and $\pi_1(x_1) = \mathrm{pr}(A_1 = 1 \mid X_1 = x_1)$ be the propensities for treatment, which are unknown unless the data are from a sequentially randomized trial, and specify models $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$, $k = K, \ldots, 2$, and $\pi_1(x_1; \gamma_1)$, e.g., logistic regression models. Estimators $\hat{\psi}_k$ may be found iteratively for $k = K, \ldots, 1$ by solving for $\psi_k$ and $\alpha_k$ the estimating equations given in §2 of the Supplementary Material, substituting the maximum likelihood estimators $\hat{\gamma}_k$. As $Q_k(\bar{x}_k, \bar{a}_k)$ is maximized by $a_k = I\{C_k(\bar{x}_k, \bar{a}_{k-1}) > 0\}$, the estimated optimal regime is $\hat{g}_A^{\mathrm{opt}} = (\hat{g}_{A,1}^{\mathrm{opt}}, \ldots, \hat{g}_{A,K}^{\mathrm{opt}})$, where $\hat{g}_{A,1}^{\mathrm{opt}}(x_1) = g_{A,1}^{\mathrm{opt}}(x_1; \hat{\psi}_1) = I\{C_1(x_1; \hat{\psi}_1) > 0\}$ and $\hat{g}_{A,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}) = g_{A,k}^{\mathrm{opt}}(\bar{x}_k, \bar{a}_{k-1}; \hat{\psi}_k) = I\{C_k(\bar{x}_k, \bar{a}_{k-1}; \hat{\psi}_k) > 0\}$ for $k = 2, \ldots, K$. If the contrast and propensity models are correctly specified, then $\hat{\psi}_k$ is consistent for $\psi_k$ even if $h_k(\bar{x}_k, \bar{a}_{k-1}; \alpha_k)$, $k = K, \ldots, 2$, and $h_1(x_1; \alpha_1)$ are misspecified, and $\hat{g}_A^{\mathrm{opt}}$ consistently estimates $g^{\mathrm{opt}}$. Thus, the quality of $\hat{g}_A^{\mathrm{opt}}$ depends on how close the $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$ are to the true contrast functions.
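The estimating equations themselves are given in the Supplementary Material. To convey the flavour of the approach, the R sketch below solves a simple version of the single-decision ($K = 1$) A-learning equations with linear working models $h(x_1; \alpha) = \alpha_0 + \alpha_1 x_1$ and $C(x_1; \psi) = \psi_0 + \psi_1 x_1$ and a logistic propensity model. The data layout (data frame dat with columns x1, a1, y) and this particular arrangement of the equations as a linear system are our own assumptions; the efficient multi-decision version used in §5 is more involved.

# Single-decision A-learning (g-estimation) with linear working models
pfit <- glm(a1 ~ x1, family = binomial, data = dat)      # propensity model pi(x1; gamma)
ph   <- fitted(pfit)

X <- cbind(1, dat$x1)                    # basis for both h(x1; alpha) and C(x1; psi)
D <- cbind(X, dat$a1 * X)                # residual is y - D %*% c(alpha0, alpha1, psi0, psi1)
M <- cbind(X, (dat$a1 - ph) * X)         # dh/dalpha and (a1 - pihat) * dC/dpsi
theta <- solve(crossprod(M, D), crossprod(M, dat$y))     # solves sum_i M_i (y_i - D_i theta) = 0
psi   <- theta[3:4]                      # estimated contrast parameters (psi0, psi1)
ghat  <- as.numeric(psi[1] + psi[2] * dat$x1 > 0)        # estimated rule I{C(x1; psihat) > 0}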

As discussed in §2 of the Supplementary Material, the efficient version of A-learning is so complex as to be infeasible to implement. The implementation of A-learning we use in the empirical studies of §5 is likely as close to efficient as could be hoped in practice.

See the unpublished report of Schulte et al. (2013) for a detailed account of both methods.

4. Proposed robust method

Q- and A-learning are predicated on the postulated models for the Q-functions and Q-contrast functions, respectively, so the resulting estimated regime may be far from gopt if these models are misspecified. We propose an alternative approach that may be robust to such misspecification, based on directly estimating the optimal regime in a specified class of regimes.

Models $Q_k(\bar{x}_k, \bar{a}_k; \beta_k)$ or $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$, whether correct or not, define classes of regimes $\mathcal{G}_\beta$, indexed by $\beta = (\beta_1^{\mathrm T}, \ldots, \beta_K^{\mathrm T})^{\mathrm T}$, or $\mathcal{G}_\psi$, indexed analogously by $\psi$, whose elements may often be simplified. For example, with $K = 2$, if $C_2(\bar{x}_2, a_1; \psi_2) = \psi_{02} + \psi_{12} x_2$ and $C_1(x_1; \psi_1) = \psi_{01} + \psi_{11} x_1$, the corresponding regimes $g_\psi = (g_{\psi 1}, g_{\psi 2})$ take $g_{\psi 1}(x_1) = I(\psi_{01} + \psi_{11} x_1 > 0)$ and $g_{\psi 2}(\bar{x}_2, a_1) = I(\psi_{02} + \psi_{12} x_2 > 0)$. If prior knowledge suggests that treatment 1 would benefit patients with smaller values of $X_1$ or $X_2$, then all reasonable regimes should have $\psi_{11} < 0$ and $\psi_{12} < 0$, and elements of $\mathcal{G}_\psi$ may be expressed in terms of $\eta_1 = -\psi_{01}/\psi_{11}$ and $\eta_2 = -\psi_{02}/\psi_{12}$ as $g_\eta = (g_{\eta 1}, g_{\eta 2})$, with $g_{\eta 1}(x_1) = I(\eta_1 > x_1)$, $g_{\eta 2}(\bar{x}_2, a_1) = I(\eta_2 > x_2)$ and $\eta = (\eta_1, \eta_2)^{\mathrm T}$.

This suggests considering a class $\mathcal{G}_\eta$, with elements $g_\eta = (g_{\eta 1}, \ldots, g_{\eta K})$ of the form $\{g_{\eta 1}(x_1), \ldots, g_{\eta K}(\bar{x}_K, \bar{a}_{K-1})\}$, indexed by $\eta = (\eta_1^{\mathrm T}, \ldots, \eta_K^{\mathrm T})^{\mathrm T}$. If $\mathcal{G}_\eta$ is derived from models $Q_k(\bar{x}_k, \bar{a}_k; \beta_k)$ or $C_k(\bar{x}_k, \bar{a}_{k-1}; \psi_k)$, then $\eta = \eta(\beta)$ or $\eta = \eta(\psi)$ is a many-to-one function of $\beta$ or $\psi$, and $g^{\mathrm{opt}} \in \mathcal{G}_\eta$ if these models are correct. Here, estimating $\eta^{\mathrm{opt}} = \arg\max_\eta E\{Y^*(g_\eta)\}$, which defines the regime $g_{\eta^{\mathrm{opt}}}$, say, will yield an estimator for $g^{\mathrm{opt}}$. If these models are misspecified, $\eta(\hat{\beta})$ or $\eta(\hat{\psi})$ may not converge in probability to $\eta^{\mathrm{opt}}$, and the resulting regimes may be far from optimal. If instead the form of elements of $\mathcal{G}_\eta$ is chosen directly based on interpretability or cost, independently of such models, $\mathcal{G}_\eta$ may or may not contain $g^{\mathrm{opt}}$, but $g_{\eta^{\mathrm{opt}}}$ is still of interest as the optimal regime among those deemed realistic in practice.

We propose an approach to estimation of $g_{\eta^{\mathrm{opt}}}$ in a given class $\mathcal{G}_\eta$ by developing an estimator for $E\{Y^*(g_\eta)\}$ that is robust to model misspecification and maximizing it in $\eta$. We cast the problem as one of monotone coarsening. Following Tsiatis (2006, §7.1), for fixed $\eta$, let $\bar{g}_{\eta k} = (g_{\eta 1}, \ldots, g_{\eta k})$ for $k = 1, \ldots, K$, with $\bar{g}_{\eta K} = g_\eta$. Identify the full data as the potential outcomes $W_{g_\eta} = \{X_1, X_2^*(g_{\eta 1}), \ldots, X_K^*(\bar{g}_{\eta, K-1}), Y^*(g_\eta)\}$, and let $\bar{X}_k^*(\bar{g}_{\eta, k-1}) = \{X_1, X_2^*(g_{\eta 1}), \ldots, X_k^*(\bar{g}_{\eta, k-1})\}$. Let $\mathcal{C}_\eta$ be a discrete coarsening variable taking values $1, \ldots, K, \infty$ corresponding to $K + 1$ levels of coarsening, reflecting the extent to which the observed treatments received are consistent with those dictated by $g_\eta$. In the general coarsened data set-up, when $\mathcal{C}_\eta = k$, we observe $G_k(W_{g_\eta})$, a many-to-one function of $W_{g_\eta}$; when $\mathcal{C}_\eta = \infty$, we observe $G_\infty(W_{g_\eta}) = W_{g_\eta}$, the full data. Here, under the consistency assumption, this is as follows. If $A_1 \neq g_{\eta 1}(X_1)$, then $\mathcal{C}_\eta = 1$; that is, $I(\mathcal{C}_\eta = 1) = I\{A_1 \neq g_{\eta 1}(X_1)\}$, and we observe $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_1(W_{g_\eta}) = X_1$: none of the observed treatments is consistent with following $g_\eta$, so $X_2, \ldots, X_K, Y$ are not consistent with $g_\eta$. If $A_1 = g_{\eta 1}(X_1)$ and $A_2 \neq g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}$, then $\mathcal{C}_\eta = 2$, $I(\mathcal{C}_\eta = 2) = I\{A_1 = g_{\eta 1}(X_1)\} I[A_2 \neq g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}]$, and $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_2(W_{g_\eta}) = \bar{X}_2^*(g_{\eta 1}) = \bar{X}_2$: only the treatment at decision 1 and the ensuing $X_2$ are consistent with $g_\eta$. Likewise, $I(\mathcal{C}_\eta = 3) = I\{A_1 = g_{\eta 1}(X_1)\} I[A_2 = g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}] I\{A_3 \neq g_{\eta 3}(\bar{X}_3)\}$, where $g_{\eta 3}(\bar{X}_3)$ is shorthand for $g_{\eta 3}[\bar{X}_3, g_{\eta 1}(X_1), g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}] = g_{\eta 3}\{\bar{X}_3, \bar{g}_{\eta 2}(\bar{X}_2)\}$ with $\bar{g}_{\eta 2}(\bar{X}_2) = [g_{\eta 1}(X_1), g_{\eta 2}\{\bar{X}_2, g_{\eta 1}(X_1)\}]$, and similarly for general $k$; and $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_3(W_{g_\eta}) = \bar{X}_3^*(\bar{g}_{\eta 2}) = \bar{X}_3$. Continuing in this fashion, $I(\mathcal{C}_\eta = K) = I[\bar{A}_{K-1} = \bar{g}_{\eta, K-1}\{\bar{X}_{K-1}, \bar{g}_{\eta, K-2}(\bar{X}_{K-2})\}] I[A_K \neq g_{\eta K}\{\bar{X}_K, \bar{g}_{\eta, K-1}(\bar{X}_{K-1})\}]$, and $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_K(W_{g_\eta}) = \bar{X}_K^*(\bar{g}_{\eta, K-1}) = \bar{X}_K$. Finally, if $\bar{A}_K = \bar{g}_{\eta K}\{\bar{X}_K, \bar{g}_{\eta, K-1}(\bar{X}_{K-1})\}$, then $G_{\mathcal{C}_\eta}(W_{g_\eta}) = G_\infty(W_{g_\eta}) = W_{g_\eta} = (X_1, \ldots, X_K, Y)$: the observed data are consistent with having followed all $K$ rules in $g_\eta$. The coarsening is monotone in that $G_k(W_{g_\eta})$ is a coarsened version of $G_{k'}(W_{g_\eta})$ for $k' > k$, and $G_k(W_{g_\eta})$ is a many-to-one function of $G_{k+1}(W_{g_\eta})$.

Coarsened data are said to be coarsened at random if, for each $k$, the probability that the data are coarsened at level $k$, given the full data, depends only on the coarsened data, and thus only on data observed at level $k$ (Tsiatis, 2006, §7.1). Under the consistency and sequential randomization assumptions, it may be shown using results in §3 of the Supplementary Material that the coarsening here is at random. Define the coarsening discrete hazard $\mathrm{pr}(\mathcal{C}_\eta = k \mid \mathcal{C}_\eta \geq k, W_{g_\eta})$ to be the probability that the observed treatments cease to be consistent with $g_\eta$ at decision $k$, given that they are consistent prior to $k$ and given all potential outcomes. Under coarsening at random, this hazard is a function only of the coarsened data, that is, the data observed through decision $k$, which we write as $\mathrm{pr}(\mathcal{C}_\eta = k \mid \mathcal{C}_\eta \geq k, W_{g_\eta}) = \lambda_{\eta,k}\{G_k(W_{g_\eta})\}$. Then, from above, for $k = 1$, $\lambda_{\eta,1}\{G_1(W_{g_\eta})\} = \lambda_{\eta,1}(X_1) = \mathrm{pr}\{A_1 \neq g_{\eta 1}(X_1) \mid X_1\}$, which can be expressed in terms of the propensity for treatment at decision 1 as $\pi_1(X_1)^{1 - g_{\eta 1}(X_1)} \{1 - \pi_1(X_1)\}^{g_{\eta 1}(X_1)}$. Similarly, for $k = 2, \ldots, K$,

$$\lambda_{\eta,k}\{G_k(W_{g_\eta})\} = \lambda_{\eta,k}(\bar{X}_k) = \mathrm{pr}\{A_k \neq g_{\eta k}(\bar{X}_k, \bar{A}_{k-1}) \mid \bar{X}_k, \bar{A}_{k-1} = \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}$$
$$= \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}^{1 - g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}} \times [1 - \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}]^{g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}}.$$

We may then express the probabilities of being consistent with $g_\eta$ through at least the $k$th decision, so of having $\mathcal{C}_\eta > k$, given all potential outcomes, in terms of the discrete hazards. Under coarsening at random, these probabilities depend only on the observed data through decision $k$; that is, $\mathrm{pr}(\mathcal{C}_\eta > k \mid W_{g_\eta}) = K_{\eta,k}\{G_k(W_{g_\eta})\} = K_{\eta,k}(\bar{X}_k)$, where $K_{\eta,k}(\bar{X}_k) = \prod_{k'=1}^{k} \{1 - \lambda_{\eta,k'}(\bar{X}_{k'})\}$ (Tsiatis, 2006, §8.1).

We now use these developments to deduce the form of estimators for $E\{Y^*(g_\eta)\}$. From the theory of Robins et al. (1994) for general monotonely coarsened data, under coarsening at random, if the coarsening mechanism is correctly specified, which corresponds here to correct specification of the $\lambda_{\eta,k}(\bar{X}_k)$, and hence of the propensity models, all regular, asymptotically linear, consistent estimators (Tsiatis, 2006, Chapter 3) for $E\{Y^*(g_\eta)\}$ for fixed $\eta$ have the form

$$n^{-1} \sum_{i=1}^{n} \left[ \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki})} Y_i + \sum_{k=1}^{K} \frac{I(\mathcal{C}_{\eta,i} = k) - \lambda_{\eta,k}(\bar{X}_{ki}) I(\mathcal{C}_{\eta,i} \geq k)}{K_{\eta,k}(\bar{X}_{ki})} L_k(\bar{X}_{ki}) \right], \tag{4}$$

where the $L_k(\bar{x}_k)$ are arbitrary functions of $\bar{x}_k$. The optimal choice, leading to the estimator (4) with smallest asymptotic variance, is $L_{\eta,k}^{\mathrm{opt}}(\bar{x}_k) = E\{Y^*(g_\eta) \mid \bar{X}_k^*(\bar{g}_{\eta,k-1}) = \bar{x}_k\}$. The right-hand term in (4) augments the first term, itself a consistent estimator for $E\{Y^*(g_\eta)\}$ when the $\lambda_{\eta,k}(\bar{X}_k)$ are correctly specified, to gain efficiency. As in Tsiatis (2006, §10.3), (4) is doubly robust in that it is a consistent estimator for $E\{Y^*(g_\eta)\}$ if either the $\lambda_{\eta,k}(\bar{X}_{ki})$ are correctly specified or the $L_k(\bar{X}_{ki})$ are equal to $L_{\eta,k}^{\mathrm{opt}}(\bar{X}_{ki})$ $(k = 1, \ldots, K)$; see §4 of the Supplementary Material.

To implement (4), one must specify the $\lambda_{\eta,k}(\bar{X}_{ki})$ and $L_k(\bar{X}_{ki})$. The first follow from specifying $\pi_1(x_1) = \mathrm{pr}(A_1 = 1 \mid X_1 = x_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}) = \mathrm{pr}(A_k = 1 \mid \bar{X}_k = \bar{x}_k, \bar{A}_{k-1} = \bar{a}_{k-1})$ for $k = K, \ldots, 2$. If these are unknown, as in A-learning, posit models $\pi_1(x_1; \gamma_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$ for $k = 2, \ldots, K$, and estimate $\gamma_k$ by $\hat{\gamma}_k$ $(k = 1, \ldots, K)$. With $\gamma = (\gamma_1^{\mathrm T}, \ldots, \gamma_K^{\mathrm T})^{\mathrm T}$ and $\hat{\gamma} = (\hat{\gamma}_1^{\mathrm T}, \ldots, \hat{\gamma}_K^{\mathrm T})^{\mathrm T}$, this implies that $\lambda_{\eta,1}(X_1; \gamma_1) = \pi_1(X_1; \gamma_1)^{1 - g_{\eta 1}(X_1)} \{1 - \pi_1(X_1; \gamma_1)\}^{g_{\eta 1}(X_1)}$,

$$\lambda_{\eta,k}(\bar{X}_k; \gamma_k) = \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1}); \gamma_k\}^{1 - g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}} \times [1 - \pi_k\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1}); \gamma_k\}]^{g_{\eta k}\{\bar{X}_k, \bar{g}_{\eta,k-1}(\bar{X}_{k-1})\}}$$

and $K_{\eta,k}(\bar{X}_k; \gamma) = \prod_{k'=1}^{k} \{1 - \lambda_{\eta,k'}(\bar{X}_{k'}; \gamma_{k'})\}$, and suggests substituting $\lambda_{\eta,k}(\bar{X}_k; \hat{\gamma}_k)$ and $K_{\eta,k}(\bar{X}_k; \hat{\gamma})$ in (4).
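For $K = 2$ these quantities reduce to simple arithmetic on the observed data. The R sketch below computes the coarsening indicators, discrete hazards and products $K_{\eta,k}(\bar{X}_k; \hat{\gamma})$ for a regime $g_\eta$ of the form used in §5, assuming a data frame dat with columns x1, a1, x2, a2 and logistic propensity models fitted as in the first scenario there (where patients with a1 = 1 remain on treatment). These assumptions and the object names are ours, not the authors' code.

# Fitted propensities pi1(x1; gammahat1) and pi2(x2, a1 = 0; gammahat2)
p1fit <- glm(a1 ~ x1, family = binomial, data = dat)
p2fit <- glm(a2 ~ x2, family = binomial, data = subset(dat, a1 == 0))
p1 <- predict(p1fit, newdata = dat, type = "response")
p2 <- ifelse(dat$a1 == 1, 1, predict(p2fit, newdata = dat, type = "response"))

eta <- c(250, -1, 360, -1)                                               # a candidate value of eta
g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)                           # g_eta1(x1)
g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)   # g_eta2(x2, a1)

lam1 <- p1^(1 - g1) * (1 - p1)^g1        # pr{A1 != g_eta1(X1) | X1}
lam2 <- p2^(1 - g2) * (1 - p2)^g2        # pr{A2 != g_eta2 | consistent through decision 1}
K1 <- 1 - lam1                           # K_eta,1 = pr(C_eta > 1)
K2 <- K1 * (1 - lam2)                    # K_eta,2 = pr(C_eta > 2)
C1ind <- as.numeric(dat$a1 != g1)                       # I(C_eta = 1)
C2ind <- as.numeric(dat$a1 == g1 & dat$a2 != g2)        # I(C_eta = 2)
Cinf  <- as.numeric(dat$a1 == g1 & dat$a2 == g2)        # I(C_eta = infinity)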

Several options exist for specification of the $L_k(\bar{X}_k)$. The simplest is to take $L_k(\bar{X}_k) \equiv 0$, yielding the inverse probability weighted estimator

$$\mathrm{IPWE}(\eta) = n^{-1} \sum_{i=1}^{n} \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki}; \hat{\gamma})} Y_i, \tag{5}$$

which is consistent for $E\{Y^*(g_\eta)\}$ if $\pi_1(x_1; \gamma_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$ $(k = 2, \ldots, K)$, and hence $K_{\eta,K}(\bar{X}_K; \gamma)$, are correctly specified, but otherwise may be inconsistent. The corresponding estimator for $g_{\eta^{\mathrm{opt}}}$ is found by estimating $\eta^{\mathrm{opt}}$ by $\hat{\eta}^{\mathrm{opt}}_{\mathrm{IPWE}}$, say, maximizing (5) in $\eta$. Because (5) uses data only from subjects whose entire treatment history is consistent with $g_\eta$, it is relatively inefficient compared with estimators that use all the data, discussed next.
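Continuing the $K = 2$ sketch above (same assumed data layout, regime class and fitted propensities p1, p2; not the authors' code), the estimator (5) is just a weighted average of outcomes from subjects whose treatments agree with $g_\eta$:

ipwe <- function(eta, dat, p1, p2) {
  g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)
  g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)
  K2 <- (1 - p1^(1 - g1) * (1 - p1)^g1) * (1 - p2^(1 - g2) * (1 - p2)^g2)
  cons <- as.numeric(dat$a1 == g1 & dat$a2 == g2)    # I(C_eta,i = infinity)
  mean(cons * dat$y / K2)                            # n^{-1} sum_i I(C_eta,i = inf) Y_i / K_eta,2
}
ipwe(c(250, -1, 360, -1), dat, p1, p2)               # value of (5) at one candidate eta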

To take greatest advantage of the potential for improved efficiency through the augmentation term in (4), an obvious approach is to posit and fit parametric models approximating the conditional expectations $L_{\eta,k}^{\mathrm{opt}}(\bar{x}_k) = E\{Y^*(g_\eta) \mid \bar{X}_k^*(\bar{g}_{\eta,k-1}) = \bar{x}_k\}$, and substitute these into (4) along with $\lambda_{\eta,k}(\bar{X}_k; \hat{\gamma}_k)$ and $K_{\eta,k}(\bar{X}_k; \hat{\gamma})$. To this end, let $\mu_{\eta K}(\bar{x}_K, \bar{a}_K) = E(Y \mid \bar{X}_K = \bar{x}_K, \bar{A}_K = \bar{a}_K)$ and $f_{\eta K}(\bar{x}_K, \bar{a}_{K-1}) = \mu_{\eta K}\{\bar{x}_K, \bar{a}_{K-1}, g_{\eta K}(\bar{x}_K, \bar{a}_{K-1})\}$. Then define iteratively, for $k = K-1, \ldots, 2$, the quantities $\mu_{\eta k}(\bar{x}_k, \bar{a}_k) = E\{f_{\eta,k+1}(\bar{x}_k, X_{k+1}, \bar{a}_k) \mid \bar{X}_k = \bar{x}_k, \bar{A}_k = \bar{a}_k\}$ and $f_{\eta k}(\bar{x}_k, \bar{a}_{k-1}) = \mu_{\eta k}\{\bar{x}_k, \bar{a}_{k-1}, g_{\eta k}(\bar{x}_k, \bar{a}_{k-1})\}$; for $k = 1$, $\mu_{\eta 1}(x_1, a_1) = E\{f_{\eta 2}(x_1, X_2, a_1) \mid X_1 = x_1, A_1 = a_1\}$ and $f_{\eta 1}(x_1) = \mu_{\eta 1}\{x_1, g_{\eta 1}(x_1)\}$. In §5 of the Supplementary Material, we demonstrate that $L_{\eta,k}^{\mathrm{opt}}(\bar{x}_k) = \mu_{\eta k}\{\bar{x}_k, \bar{g}_{\eta k}(\bar{x}_k)\}$.

This suggests specifying $\eta$-dependent models $\mu_{\eta k}(\bar{x}_k, \bar{a}_k; \xi_k)$ depending on parameters $\xi_k$, $k = 1, \ldots, K$. For fixed $\eta$, estimators $\hat{\xi}_k$ for $\xi_k$ may be found iteratively by solving in $\xi_k$

$$\sum_{i=1}^{n} \frac{\partial \mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)}{\partial \xi_k} \{\tilde{V}_{(k+1)i} - \mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)\} = 0 \quad (k = 1, \ldots, K),$$

where $\partial \mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)/\partial \xi_k$ is the vector of partial derivatives of $\mu_{\eta k}(\bar{X}_{ki}, \bar{A}_{ki}; \xi_k)$ with respect to the elements of $\xi_k$, $\tilde{V}_{(K+1)i} = Y_i$ and $\tilde{V}_{ki} = \mu_{\eta k}[\bar{X}_{ki}, \bar{A}_{(k-1)i}, g_{\eta k}\{\bar{X}_{ki}, \bar{A}_{(k-1)i}\}; \hat{\xi}_k]$ $(k = K, \ldots, 2)$, so that the equations are solved backward in $k$. The fitted $\mu_{\eta k}\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\xi}_k\}$ may then be used to approximate $L_{\eta,k}^{\mathrm{opt}}(\bar{X}_{ki})$ in (4). While these models almost certainly are not correct, as specification of a compatible sequence of models for $k = 1, \ldots, K$ is a significant challenge, they may be reasonable approximations to the true conditional expectations. Thus, define

$$\mathrm{DR}(\eta) = n^{-1} \sum_{i=1}^{n} \left[ \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki}; \hat{\gamma})} Y_i + \sum_{k=1}^{K} \frac{I(\mathcal{C}_{\eta,i} = k) - \lambda_{\eta,k}(\bar{X}_{ki}; \hat{\gamma}_k) I(\mathcal{C}_{\eta,i} \geq k)}{K_{\eta,k}(\bar{X}_{ki}; \hat{\gamma})} \mu_{\eta k}\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\xi}_k\} \right], \tag{6}$$

which, by virtue of the double robustness property, is consistent for $E\{Y^*(g_\eta)\}$ if either $\pi_1(x_1; \gamma_1)$ and $\pi_k(\bar{x}_k, \bar{a}_{k-1}; \gamma_k)$ $(k = K, \ldots, 2)$ are correctly specified or the $\mu_{\eta k}(\bar{x}_k, \bar{a}_k; \xi_k)$ are. If all of these models were correct, then (6) would achieve optimal efficiency. As for (5), estimation of $g_{\eta^{\mathrm{opt}}}$ follows by maximizing (6) in $\eta$ to obtain $\hat{\eta}^{\mathrm{opt}}_{\mathrm{DR}}$.
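To convey the mechanics of (6) for $K = 2$, a condensed R sketch is given below; the linear working models for $\mu_{\eta 2}$ and $\mu_{\eta 1}$ mirror those reported in §5, while the data layout and the fitted propensities p1, p2 are as in the earlier sketches. All object names are our own, and the function refits the working models at each $\eta$, as the definition of (6) requires.

dr <- function(eta, dat, p1, p2) {
  g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)
  g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)
  lam1 <- p1^(1 - g1) * (1 - p1)^g1;  K1 <- 1 - lam1
  lam2 <- p2^(1 - g2) * (1 - p2)^g2;  K2 <- K1 * (1 - lam2)
  C1 <- as.numeric(dat$a1 != g1)                    # I(C_eta = 1)
  C2 <- as.numeric(dat$a1 == g1 & dat$a2 != g2)     # I(C_eta = 2)
  Ci <- as.numeric(dat$a1 == g1 & dat$a2 == g2)     # I(C_eta = infinity)

  # mu_eta2(x1, x2, a1, a2; xi2): least squares fit, then evaluated at a2 = g_eta2
  m2  <- lm(y ~ x1 + a1 + I(a1 * x1) + I((1 - a1) * x2) +
              I(a2 * (1 - a1)) + I(a2 * (1 - a1) * x2), data = dat)
  mu2 <- predict(m2, newdata = transform(dat, a2 = g2))

  # mu_eta1(x1, a1; xi1): regress the decision-2 pseudo-outcome on (x1, a1),
  # then evaluate at a1 = g_eta1(x1)
  dat$v2 <- mu2
  m1  <- lm(v2 ~ x1 + a1 + I(a1 * x1), data = dat)
  mu1 <- predict(m1, newdata = transform(dat, a1 = g1))

  mean(Ci * dat$y / K2 + (C1 - lam1) / K1 * mu1 +
       (C2 - lam2 * as.numeric(dat$a1 == g1)) / K2 * mu2)   # I(C_eta >= 1) = 1 for every subject
}

Estimation of $g_{\eta^{\mathrm{opt}}}$ then amounts to maximizing dr(eta, dat, p1, p2) over $\eta$ using the grid search or genetic algorithm described in §5.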

A computational challenge is that the models $\mu_{\eta k}(\bar{x}_k, \bar{a}_k; \xi_k)$ must be refitted for each value of $\eta$ encountered in the optimization algorithm used to carry out the maximization. A practical alternative when regimes in $\mathcal{G}_\eta$ are derived from models is to substitute for $L_k(\bar{X}_{ki})$ in (4) the fitted Q-functions $Q_k\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\beta}_k\}$ for $k = K, \ldots, 1$ obtained from Q-learning; holding $\hat{\beta}_k$ fixed, these depend on $\eta$ only through $\bar{g}_{\eta k}(\bar{X}_{ki})$. While these are not strictly models for $E\{Y^*(g_\eta) \mid \bar{X}_k^*(\bar{g}_{\eta,k-1})\}$, the hope is that they will be close enough to achieve near-optimal efficiency gains over (5). Thus, estimate $g_{\eta^{\mathrm{opt}}}$ by maximizing in $\eta$, to obtain $\hat{\eta}^{\mathrm{opt}}_{\mathrm{AIPWE}}$, the quantity

$$\mathrm{AIPWE}(\eta) = n^{-1} \sum_{i=1}^{n} \left[ \frac{I(\mathcal{C}_{\eta,i} = \infty)}{K_{\eta,K}(\bar{X}_{Ki}; \hat{\gamma})} Y_i + \sum_{k=1}^{K} \frac{I(\mathcal{C}_{\eta,i} = k) - \lambda_{\eta,k}(\bar{X}_{ki}; \hat{\gamma}_k) I(\mathcal{C}_{\eta,i} \geq k)}{K_{\eta,k}(\bar{X}_{ki}; \hat{\gamma})} Q_k\{\bar{X}_{ki}, \bar{g}_{\eta k}(\bar{X}_{ki}); \hat{\beta}_k\} \right]. \tag{7}$$

See §6 of the Supplementary Material for a similar proposal when 𝒢η is determined directly.
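A sketch of (7) for $K = 2$ follows; it differs from the dr sketch above only in that the augmentation terms come from the Q-learning fits (fit2 and fit1 from the earlier Q-learning sketch), which are computed once and then held fixed as $\eta$ varies. The object names and data layout are again our own assumptions.

aipwe <- function(eta, dat, p1, p2, fit2, fit1) {
  g1 <- as.numeric(eta[1] + eta[2] * dat$x1 > 0)
  g2 <- dat$a1 + (1 - dat$a1) * as.numeric(eta[3] + eta[4] * dat$x2 > 0)
  lam1 <- p1^(1 - g1) * (1 - p1)^g1;  K1 <- 1 - lam1
  lam2 <- p2^(1 - g2) * (1 - p2)^g2;  K2 <- K1 * (1 - lam2)
  C1 <- as.numeric(dat$a1 != g1)
  C2 <- as.numeric(dat$a1 == g1 & dat$a2 != g2)
  Ci <- as.numeric(dat$a1 == g1 & dat$a2 == g2)
  # Q-functions evaluated at the regime-dictated treatments; wherever the k = 2
  # augmentation weight is nonzero, the observed a1 already equals g_eta1(x1)
  q2 <- predict(fit2, newdata = transform(dat, a1 = g1, a2 = g2))
  q1 <- predict(fit1, newdata = transform(dat, a1 = g1))
  mean(Ci * dat$y / K2 + (C1 - lam1) / K1 * q1 +
       (C2 - lam2 * as.numeric(dat$a1 == g1)) / K2 * q2)
}

In the simulations of §5, aipwe is then maximized over $\eta$ with $(\eta_{11}, \eta_{21})$ fixed at $(-1, -1)$.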

Standard errors for these estimators of $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$ may be obtained via the sandwich technique (Stefanski & Boos, 2002) based on the argument in Zhang et al. (2012, Equation (4)).

5. Simulation studies

We have carried out several simulation studies to evaluate the performance of the proposed methods, each involving 1000 Monte Carlo data sets.

The first simulation adopts the scenario in Moodie et al. (2007) of a study in which HIV-infected patients are randomized to initiate antiretroviral therapy or not, coded as 1 or 0, at baseline and again at six months, to determine the optimal regime for therapy initiation. We generated baseline CD4 count $X_1 \sim N(450, 150)$, where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$; baseline treatment $A_1$ as Bernoulli with success probability $\mathrm{pr}(A_1 = 1 \mid X_1) = \mathrm{expit}(2 - 0.006 X_1)$, where $\mathrm{expit}(u) = e^u/(1 + e^u)$; six-month CD4 count $X_2$, conditional on $(X_1, A_1)$, as $N(1.25 X_1, 60)$; and treatment at six months $A_2$ as Bernoulli with $\mathrm{pr}(A_2 = 1 \mid \bar{X}_2, A_1) = A_1 + (1 - A_1)\mathrm{expit}(0.8 - 0.004 X_2)$. Here, patients with $A_1 = 1$ continue on therapy with certainty. The outcome $Y$, one-year CD4 count, conditional on $(\bar{X}_2, \bar{A}_2)$, was normal with mean $400 + 1.6 X_1 - |250 - X_1|\{A_1 - I(250 - X_1 > 0)\}^2 - (1 - A_1)|720 - 2X_2|\{A_2 - I(720 - 2X_2 > 0)\}^2$ and variance $60^2$. The true Q-contrast functions are thus $C_2(x_1, x_2, a_1) = (1 - a_1)(720 - 2x_2)$ and $C_1(x_1) = 250 - x_1$; the optimal treatment regime $g^{\mathrm{opt}} = (g_1^{\mathrm{opt}}, g_2^{\mathrm{opt}})$ has $g_1^{\mathrm{opt}}(x_1) = I(250 - x_1 > 0)$ and $g_2^{\mathrm{opt}}(\bar{x}_2, a_1) = I\{a_1 + (1 - a_1)(720 - 2x_2) > 0\} = I\{a_1 + (1 - a_1)(360 - x_2) > 0\}$, and $E\{Y^*(g^{\mathrm{opt}})\} = 1120$.
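For reference, a minimal R transcription of this data-generating mechanism is given below; it is our own sketch, following the displayed distributions and the paper's convention that $N(\mu, \sigma^2)$ is parameterized by the variance.

expit <- function(u) exp(u) / (1 + exp(u))
n  <- 500
x1 <- rnorm(n, 450, sqrt(150))                       # baseline CD4 count
a1 <- rbinom(n, 1, expit(2 - 0.006 * x1))            # baseline treatment
x2 <- rnorm(n, 1.25 * x1, sqrt(60))                  # six-month CD4 count
a2 <- ifelse(a1 == 1, 1, rbinom(n, 1, expit(0.8 - 0.004 * x2)))   # a1 = 1 patients stay on therapy
mu <- 400 + 1.6 * x1 - abs(250 - x1) * (a1 - (250 - x1 > 0))^2 -
      (1 - a1) * abs(720 - 2 * x2) * (a2 - (720 - 2 * x2 > 0))^2
y  <- rnorm(n, mu, 60)                               # one-year CD4 count, standard deviation 60
dat <- data.frame(x1, a1, x2, a2, y)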

For A-learning, we took

$$h_2(\bar{x}_2, a_1; \alpha_2) = \alpha_{20} + \alpha_{21} x_1 + \alpha_{22} a_1 + \alpha_{23} a_1 x_1 + \alpha_{24}(1 - a_1) x_2, \qquad C_2(\bar{x}_2, a_1; \psi_2) = (1 - a_1)(\psi_{20} + \psi_{21} x_2),$$

$h_1(x_1; \alpha_1) = \alpha_{10} + \alpha_{11} x_1$, and $C_1(x_1; \psi_1) = \psi_{10} + \psi_{11} x_1$; and, analogously, for Q-learning,

$$Q_2(\bar{x}_2, \bar{a}_2; \beta_2) = \beta_{20} + \beta_{21} x_1 + a_1(\beta_{22} + \beta_{23} x_1) + \beta_{24}(1 - a_1) x_2 + a_2(1 - a_1)(\beta_{25} + \beta_{26} x_2),$$
$$Q_1(x_1, a_1; \beta_1) = \beta_{10} + \beta_{11} x_1 + a_1(\beta_{12} + \beta_{13} x_1),$$

so the Q-contrast functions are correct, but the Q-functions are misspecified. Here, $C_2(\bar{x}_2, 1; \psi_2) = 0$, respecting that $\Phi_2(\bar{x}_2, 1) = \{1\}$. We used correct propensity models $\pi_2(\bar{x}_2, a_1 = 0; \gamma_2) = \mathrm{expit}(\gamma_{20} + \gamma_{21} x_2)$ and $\pi_1(x_1; \gamma_1) = \mathrm{expit}(\gamma_{10} + \gamma_{11} x_1)$, and incorrect models $\pi_2(\bar{x}_2, a_1 = 0; \gamma_2) = \gamma_2$, $\pi_1(x_1; \gamma_1) = \gamma_1$.

For maximizing ipwe(η) in (5), dr(η) in (6), and aipwe(η) in (7) to obtain $\hat{\eta}^{\mathrm{opt}}_{\mathrm{IPWE}}$, $\hat{\eta}^{\mathrm{opt}}_{\mathrm{DR}}$, and $\hat{\eta}^{\mathrm{opt}}_{\mathrm{AIPWE}}$, we considered the class of regimes $\mathcal{G}_\eta$ with elements $g_\eta = (g_{\eta 1}, g_{\eta 2})$,

$$g_{\eta 2}(\bar{x}_2, a_1) = I\{a_1 + (1 - a_1)(\eta_{20} + \eta_{21} x_2) > 0\}, \qquad g_{\eta 1}(x_1) = I(\eta_{10} + \eta_{11} x_1 > 0),$$

so that $\eta_2 = (\eta_{20}, \eta_{21})^{\mathrm T}$, $\eta_1 = (\eta_{10}, \eta_{11})^{\mathrm T}$, $\eta = (\eta_1^{\mathrm T}, \eta_2^{\mathrm T})^{\mathrm T}$ and $\eta^{\mathrm{opt}} = (250, -1, 360, -1)^{\mathrm T}$. Clearly, $g^{\mathrm{opt}} \in \mathcal{G}_\eta$. We used the same propensity models and, for (7), the Q-function models as above; for (6), we posited $\mu_{\eta 2}(\bar{x}_2, \bar{a}_2; \xi_2) = \xi_{20} + \xi_{21} x_1 + a_1(\xi_{22} + \xi_{23} x_1) + \xi_{24}(1 - a_1) x_2 + a_2(1 - a_1)(\xi_{25} + \xi_{26} x_2)$ and $\mu_{\eta 1}(x_1, a_1; \xi_1) = \xi_{10} + \xi_{11} x_1 + a_1(\xi_{12} + \xi_{13} x_1)$ for each $\eta$. To achieve a unique representation, we fixed $(\eta_{21}, \eta_{11}) = (-1, -1)$ and determined $\eta_{20}$, $\eta_{10}$ via a grid search; because ipwe(η), dr(η) and aipwe(η) are step functions of $\eta$ with jumps at $(x_{1i}, x_{2j})$ $(i, j = 1, \ldots, n)$, we maximized in $\eta$ over all $(x_{1i}, x_{2j})$.
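The grid search can be coded directly: with $(\eta_{11}, \eta_{21})$ fixed at $(-1, -1)$, the estimators are piecewise constant in $(\eta_{10}, \eta_{20})$, so it suffices to evaluate them at the observed covariate values. The sketch below does this for the ipwe function of the earlier sketch (the same could be done with dr or aipwe); it is our own illustration and evaluates $n^2$ grid points, which is slow but straightforward.

grid <- expand.grid(eta10 = dat$x1, eta20 = dat$x2)                  # all (x_{1i}, x_{2j}) pairs
vals <- apply(grid, 1, function(e) ipwe(c(e[1], -1, e[2], -1), dat, p1, p2))
etahat <- unlist(grid[which.max(vals), ])                            # estimated (eta10, eta20)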

The second scenario is the same as the first except that the models for the Q-contrast functions are misspecified. Specifically, the generative distribution of $Y$ given $(\bar{X}_2, \bar{A}_2)$ is now normal with mean $400 + 1.6 X_1 - |250 - 0.6 X_1|\{A_1 - I(250 - X_1 > 0)\}^2 - (1 - A_1)|720 - 1.4 X_2|\{A_2 - I(720 - 2X_2 > 0)\}^2$ and variance $60^2$, so that, from the discussion below (2) of Moodie et al. (2007), the implied true contrast functions are no longer of the form above, but all posited models were taken to be the same as in the first simulation.

Tables 1 and 2 show the results. For Q- and A-learning, we report $\eta(\hat{\beta})$ and $\eta(\hat{\psi})$. The column Ê(η̂opt) shows, for each estimator, the Monte Carlo average and standard deviation of the estimated values of $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$, reflecting performance for estimating the true achievable mean outcome under the true optimal regime, while E(η̂opt) reflects performance of the estimated optimal regime itself: for each Monte Carlo data set, the true mean outcome that would be achieved if the estimated optimal regime were followed by the population was determined by simulation, and the values reported are the Monte Carlo average and standard deviation of these simulated quantities. When compared to the true $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$, these measure the extent to which the estimated optimal regimes approach the performance of the true optimal regime.

Table 1.

Results for the first simulation scenario, Q-contrast functions correct, 1000 Monte Carlo data sets, n = 500. For the true optimal regime $g^{\mathrm{opt}} = g_{\eta^{\mathrm{opt}}} \in \mathcal{G}_\eta$, $\eta^{\mathrm{opt}} = (250, -1, 360, -1)^{\mathrm T}$ and $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$

Estimator η̂10 η̂20 Ê(η̂opt) SE Cov. E(η̂opt)
Q-learning 228 (17) 322 (25) 1117 (12) 1119 (1)
Propensity score correct
A-learning 245 (18) 359 (20) 1121 (11) 1120 (1)
AIPWE (7) 210 (73) 363 (33) 1125 (12) 12 93.1 1118 (2)
DR (6) 211 (73) 363 (34) 1125 (12) 12 93.1 1118 (2)
IPWE (5) 268 (72) 397 (83) 1183 (24) 34 59.2 1105 (18)
Propensity score incorrect
A-learning 228 (17) 322 (25) 1117 (12) 1119 (1)
AIPWE (7) 259 (51) 390 (47) 1123 (12) 12 93.8 1116 (4)
DR (6) 262 (48) 386 (45) 1123 (12) 12 94.3 1116 (4)
IPWE (5) 349 (49) 471 (63) 1554 (56) 64 0.0 1075 (22)

AIPWE, DR, and IPWE, estimators based on maximizing aipwe(η), dr(η), and ipwe(η), respectively; η̂10, η̂20, Monte Carlo average estimates (standard deviation); Ê(η̂opt), Monte Carlo average (standard deviation) of estimated $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$; SE, Monte Carlo average of sandwich standard errors; Cov., coverage of associated 95% Wald-type confidence intervals for $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$; E(η̂opt), Monte Carlo average (standard deviation) of values $E\{Y^*(g_{\hat{\eta}^{\mathrm{opt}}})\}$ obtained using $10^6$ Monte Carlo simulations for each data set.

Table 2.

Results for the second simulation scenario, Q-contrast functions incorrect, 1000 Monte Carlo data sets, n = 500. For the true optimal regime $g^{\mathrm{opt}} = g_{\eta^{\mathrm{opt}}} \in \mathcal{G}_\eta$, $\eta^{\mathrm{opt}} = (250, -1, 360, -1)^{\mathrm T}$ and $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$. All quantities are as in Table 1

Estimator η̂10 η̂20 Ê(η̂opt) SE Cov. E(η̂opt)
Q-learning 381 (33) 386 (45) 1104 (12) 1088 (6)
Propensity score correct
A-learning 364 (29) 453 (26) 1115 (12) 1087 (3)
AIPWE (7) 250 (21) 359 (9) 1120 (12) 12 94.7 1118 (3)
DR (6) 250 (23) 360 (13) 1121 (11) 12 96.3 1118 (3)
IPWE (5) 305 (67) 432 (86) 1182 (27) 38 70.1 1096 (12)
Propensity score incorrect
A-learning 381 (33) 386 (45) 1104 (12) 1088 (6)
AIPWE (7) 255 (24) 363 (28) 1116 (12) 12 93.5 1118 (6)
DR (6) 255 (25) 364 (28) 1116 (12) 12 93.3 1117 (7)
IPWE (5) 361 (47) 480 (69) 1571 (59) 67 0.0 1086 (5)

For the first simulation, from Table 1, because the Q-functions are misspecified, the Q-learning estimators for η10 and η20 are biased, while those from A-learning, based on postulated Q-contrast functions that include the truth, are consistent when the propensity model is correct. When the propensity model is incorrect, Q-learning is unaffected; however, A-learning yields biased estimators for η10 and η20 identical to those from Q-learning, as linear models are used for $C_2(\bar{x}_2, a_1; \psi_2)$, $C_1(x_1; \psi_1)$, $h_2(\bar{x}_2, a_1; \alpha_2)$ and $h_1(x_1; \alpha_1)$ (Chakraborty et al., 2010). Although Q-learning results in poor estimation of η10 and η20, the loss for estimating the optimal regime is negligible, as the proportion of the benefit of the true optimal regime that the estimated regime achieves if used in the entire population is virtually one. A possible explanation is that patients near the true decision boundary have $C_2(\bar{X}_2, A_1)$ and $C_1(X_1)$ close to zero, and few patients would receive treatment 1 according to the true decision rule at the first decision point. This is also consistent with the fact that, for the regime $g = (0, g_2^{\mathrm{opt}})$, the corresponding expectation is 1114. When the propensity model is correct, the estimators based on dr(η) and aipwe(η) yield estimated regimes comparable to those found by A-learning in terms of true mean outcome achieved, despite yielding less efficient estimators for η10 and η20 than A-learning, perhaps for the same reason as above. When the propensity model is incorrect, the dr(η) and aipwe(η) estimators yield estimated regimes that are still close to optimal. The ipwe(η) estimator shows relatively poorer performance, especially when the propensity score model is incorrect, which is not unexpected; this estimator uses information only from patients whose treatment histories are consistent with following gη and hence is inefficient.

In the second simulation, the values of $|C_2(\bar{X}_2, A_1)|$ and $|C_1(X_1)|$ for patients near the true decision boundary are larger than in the first simulation, and the posited Q-contrast functions are no longer correct. From Table 2, the A- and Q-learning estimators perform similarly, both yielding estimated regimes far from optimal. Those based on dr(η) and aipwe(η) are almost identical to $g^{\mathrm{opt}}$ on average and perform almost identically to the true optimal regime, regardless of whether or not the propensity model is correct. Again, the estimator based on ipwe(η) in (5) performs poorly. Evidently, augmentation, even using incorrect models, leads to considerable gains over ipwe(η) regardless of whether or not the propensity model is correct.

The third scenario involved K = 3 decision points. To achieve average numbers of patients consistent with the regime comparable to those in the K = 2 cases, we took n = 1000. We generated $X_1$, $A_1$, $X_2$ as in the previous two scenarios; $A_2$ as Bernoulli with $\mathrm{pr}(A_2 = 1 \mid \bar{X}_2, A_1) = \mathrm{expit}(0.8 - 0.004 X_2)$; twelve-month CD4 count $X_3$, conditional on $(\bar{X}_2, \bar{A}_2)$, as $N(0.8 X_2, 60)$; treatment at twelve months $A_3$ as Bernoulli with $\mathrm{pr}(A_3 = 1 \mid \bar{X}_3, \bar{A}_2) = \mathrm{expit}(1 - 0.004 X_3)$; and the outcome $Y$, 18-month CD4 count, conditional on $(\bar{X}_3, \bar{A}_3)$, as normal with mean $400 + 1.6 X_1 - |500 - 1.4 X_1|\{A_1 - I(500 - 2X_1 > 0)\}^2 - |720 - 1.4 X_2|\{A_2 - I(720 - 2X_2 > 0)\}^2 - |600 - 1.4 X_3|\{A_3 - I(600 - 2X_3 > 0)\}^2$ and variance $60^2$. The optimal treatment regime $g^{\mathrm{opt}} = (g_1^{\mathrm{opt}}, g_2^{\mathrm{opt}}, g_3^{\mathrm{opt}})$ has $g_1^{\mathrm{opt}}(x_1) = I(250 - x_1 > 0)$, $g_2^{\mathrm{opt}}(\bar{x}_2, a_1) = I(360 - x_2 > 0)$, $g_3^{\mathrm{opt}}(\bar{x}_3, \bar{a}_2) = I(300 - x_3 > 0)$ and $E\{Y^*(g^{\mathrm{opt}})\} = 1120$.

For A-learning, we took

$$h_3(\bar{x}_3, \bar{a}_2; \alpha_3) = \alpha_{30} + \alpha_{31} x_1 + a_1(\alpha_{32} + \alpha_{33} x_1) + \alpha_{34} x_2 + a_2(\alpha_{35} + \alpha_{36} x_2) + \alpha_{37} x_3, \qquad C_3(\bar{x}_3, \bar{a}_2; \psi_3) = \psi_{30} + \psi_{31} x_3,$$
$$h_2(\bar{x}_2, a_1; \alpha_2) = \alpha_{20} + \alpha_{21} x_1 + a_1(\alpha_{22} + \alpha_{23} x_1) + \alpha_{24} x_2, \qquad C_2(\bar{x}_2, a_1; \psi_2) = \psi_{20} + \psi_{21} x_2,$$
$$h_1(x_1; \alpha_1) = \alpha_{10} + \alpha_{11} x_1, \qquad C_1(x_1; \psi_1) = \psi_{10} + \psi_{11} x_1,$$

and for Q-learning

$$Q_3(\bar{x}_3, \bar{a}_3; \beta_3) = \beta_{30} + \beta_{31} x_1 + a_1(\beta_{32} + \beta_{33} x_1) + \beta_{34} x_2 + a_2(\beta_{35} + \beta_{36} x_2) + \beta_{37} x_3 + a_3(\beta_{38} + \beta_{39} x_3),$$
$$Q_2(\bar{x}_2, \bar{a}_2; \beta_2) = \beta_{20} + \beta_{21} x_1 + a_1(\beta_{22} + \beta_{23} x_1) + \beta_{24} x_2 + a_2(\beta_{25} + \beta_{26} x_2),$$
$$Q_1(x_1, a_1; \beta_1) = \beta_{10} + \beta_{11} x_1 + a_1(\beta_{12} + \beta_{13} x_1);$$

thus, both the Q-functions and the Q-contrast functions are misspecified. We used correct propensity models $\pi_3(\bar{x}_3, \bar{a}_2; \gamma_3) = \mathrm{expit}(\gamma_{30} + \gamma_{31} x_3)$, $\pi_2(\bar{x}_2, a_1; \gamma_2) = \mathrm{expit}(\gamma_{20} + \gamma_{21} x_2)$, $\pi_1(x_1; \gamma_1) = \mathrm{expit}(\gamma_{10} + \gamma_{11} x_1)$ and incorrect models $\pi_3(\bar{x}_3, \bar{a}_2; \gamma_3) = \gamma_3$, $\pi_2(\bar{x}_2, a_1; \gamma_2) = \gamma_2$, $\pi_1(x_1; \gamma_1) = \gamma_1$.

For the three proposed estimators, we took the class of regimes $\mathcal{G}_\eta$ to have elements $g_\eta = (g_{\eta 1}, g_{\eta 2}, g_{\eta 3})$ with $g_{\eta 3}(\bar{x}_3, \bar{a}_2) = I(\eta_{30} + \eta_{31} x_3 > 0)$, $g_{\eta 2}(\bar{x}_2, a_1) = I(\eta_{20} + \eta_{21} x_2 > 0)$, $g_{\eta 1}(x_1) = I(\eta_{10} + \eta_{11} x_1 > 0)$, so $\eta_3 = (\eta_{30}, \eta_{31})^{\mathrm T}$, $\eta_2 = (\eta_{20}, \eta_{21})^{\mathrm T}$, $\eta_1 = (\eta_{10}, \eta_{11})^{\mathrm T}$, $\eta = (\eta_1^{\mathrm T}, \eta_2^{\mathrm T}, \eta_3^{\mathrm T})^{\mathrm T}$ and $\eta^{\mathrm{opt}} = (250, -1, 360, -1, 300, -1)^{\mathrm T}$, so that $g^{\mathrm{opt}} \in \mathcal{G}_\eta$. We used the same propensity models and, for (7), the Q-function models as above, and fixed $(\eta_{31}, \eta_{21}, \eta_{11}) = (-1, -1, -1)$. To carry out the maximizations, we used a genetic algorithm discussed by Goldberg (1989), implemented in the rgenoud package in R (Mebane & Sekhon, 2011); see §7 of the Supplementary Material for details.
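An illustrative call to the genetic optimizer is sketched below; the tuning actually used is described in the Supplementary Material, and the objective aipwe3 stands for a hypothetical K = 3 analogue of the aipwe function sketched in §4, with phat and qfits holding the fitted propensity and Q-function models. All of these names and settings are our own assumptions.

library(rgenoud)
obj <- function(th)                                   # th = (eta10, eta20, eta30); slopes fixed at -1
  aipwe3(c(th[1], -1, th[2], -1, th[3], -1), dat, phat, qfits)
out <- genoud(obj, nvars = 3, max = TRUE,             # maximize rather than minimize
              pop.size = 1000, print.level = 0,
              Domains = cbind(rep(0, 3), rep(1000, 3)))   # search box for the three thresholds
out$par                                               # estimated (eta10, eta20, eta30)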

Table 3 shows the results. Q-learning performs poorly, as expected. When the propensity model is correctly specified, results for A-learning and the proposed methods are similar to those in the second scenario, with the estimated regimes based on dr(η) and aipwe(η) achieving near-optimal performance and reliable inference on the true achievable mean outcome $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$. When the propensity models are misspecified, the situation is similar for these estimators in terms of performance; however, inference on $E\{Y^*(g_{\eta^{\mathrm{opt}}})\}$ is markedly degraded. In both cases, performance of the estimator based on ipwe(η) is quite poor. As the number of decisions K increases, it is not unexpected that all methods can suffer from diminished performance. Research is needed on the design of sequentially randomized trials to ensure adequate sample size for reliable inference on multi-decision regimes.

Table 3.

Results for the third simulation scenario, K = 3, Q-contrast functions incorrect, 1000 Monte Carlo data sets, n = 1000. For the true optimal regime $g^{\mathrm{opt}} = g_{\eta^{\mathrm{opt}}} \in \mathcal{G}_\eta$, $\eta^{\mathrm{opt}} = (250, -1, 360, -1, 300, -1)^{\mathrm T}$ and $E\{Y^*(g_{\eta^{\mathrm{opt}}})\} = 1120$. All quantities are as in Table 1

Estimator η̂10 η̂20 η̂30 Ê(η̂opt) SE Cov. E(η̂opt)
Q-learning 179 (58) 412.9 (28) 341 (33) 1058 (13) 1086 (9)
Propensity score correct
A-learning 319 (12) 462 (11) 387 (12) 1108 (12) 1071 (3)
AIPWE (7) 263 (41) 362 (14) 300 (7) 1121 (10) 10 94.6 1116 (5)
DR (6) 263 (37) 361 (11) 300 (8) 1121 (10) 10 94.2 1117 (5)
IPWE (5) 399 (132) 618 (138) 450 (132) 1297 (63) 103 56.2 1008 (75)
Propensity score incorrect
A-learning 179 (58) 413 (28) 341 (33) 1041 (12) 1086 (9)
AIPWE (7) 360 (48) 371 (39) 310 (30) 1200 (26) 27 9.0 1104 (10)
DR (6) 386 (35) 362 (26) 314 (39) 1208 (26) 27 4.5 1102 (9)
IPWE (5) 412 (42) 521 (60) 415 (58) 2459 (148) 167 0.0 1055 (14)

In §8 of the Supplementary Material, we present results of a more complex scenario; the qualitative conclusions are similar.

All simulations here, and others we have conducted, suggest that Q- and A-learning can yield biased estimators for the parameters defining the optimal regime if the Q-functions or Q-contrast functions are misspecified. Under these conditions, the resulting estimated optimal regimes can perform poorly in terms of achieving the expected potential outcome of the true optimal regime. In contrast, the proposed approach using (6) or (7) is robust to misspecification of either the outcome regression models or the propensity score models. Under these circumstances, the estimators of regime parameters are relatively unbiased, and the expected potential outcome under the estimated optimal regime approaches that of the true optimal regime. Moreover, the proposed methods lead to reliable estimation of the expected potential outcome under the true optimal regime, with coverage probabilities close to the nominal level. Even when both the outcome regression and propensity models are misspecified, the proposed methods can yield estimated optimal regimes that do not show substantial degradation of performance in terms of achieved expected potential outcome relative to the true optimal regime. In this case, inference on the expected outcome under the true optimal regime can be compromised, although, interestingly, the methods performed well in this regard under these conditions in the second simulation scenario. Collectively, our results suggest that the proposed methods are attractive alternatives to Q- and A-learning owing to their robustness to such model misspecification. As the estimator based on aipwe(η) is much less computationally intensive than dr(η) and performs similarly, we recommend it for practical use.

In §9 of the Supplementary Material, we report on application of the methods to a study to compare treatment options in patients with nonpsychotic major depressive disorder.


Acknowledgment

This research was supported by grants from the US National Institutes of Health.

Footnotes

Supplementary Material

Supplementary material available at Biometrika online includes technical arguments, more details on the estimators studied, and additional simulation results.

Contributor Information

Baqun Zhang, Email: baqun.zhang@northwestern.edu, Department of Preventive Medicine, 680 N. Lakeshore Drive, Suite 1400 Northwestern University, Chicago, Illinois, 60611 U.S.A.

Anastasios A. Tsiatis, Email: tsiatis@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695-8203, U.S.A.

Eric B. Laber, Email: eblaber@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695-8203, U.S.A.

Marie Davidian, Email: davidian@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695-8203, U.S.A.

References

1. Almirall D, Ten Have T, Murphy SA. Structural nested mean models for assessing time-varying effect moderation. Biometrics. 2010;66:131–139. doi: 10.1111/j.1541-0420.2009.01238.x.
2. Bather J. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. Chichester: Wiley; 2000.
3. Chakraborty B, Murphy SA, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Statist. Meth. Med. Res. 2010;19:317–343. doi: 10.1177/0962280209105013.
4. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley; 1989.
5. Henderson R, Ansell P, Alshibani D. Regret-regression for optimal dynamic treatment regimes. Biometrics. 2010;66:1192–1201. doi: 10.1111/j.1541-0420.2009.01368.x.
6. Mebane WR, Sekhon JS. Genetic optimization using derivatives: the rgenoud package for R. J. Statist. Soft. 2011;42:1–26.
7. Moodie EEM, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63:447–455. doi: 10.1111/j.1541-0420.2006.00686.x.
8. Murphy SA. Optimal dynamic treatment regimes (with discussion). J. Royal Statist. Soc., Ser. B. 2003;65:331–366.
9. Murphy SA. An experimental design for the development of adaptive treatment strategies. Statist. Med. 2005;24:1455–1481. doi: 10.1002/sim.2022.
10. Murphy SA, Oslin DW, Rush AJ, Zhu J. Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology. 2007;32:257–262. doi: 10.1038/sj.npp.1301241.
11. Orellana L, Rotnitzky A, Robins J. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. Int. J. Biostatist. 2010;6(2): Article 8.
12. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods: applications to control of the healthy worker survivor effect. Math. Model. 1986;7:1393–1512.
13. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty PJ, editors. Proceedings of the Second Seattle Symposium on Biostatistics. New York: Springer; 2004. pp. 189–326.
14. Robins J, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Statist. Med. 2008;27:4678–4721. doi: 10.1002/sim.3301.
15. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc. 1994;89:846–866.
16. Rosthøj S, Fullwood C, Henderson R, Stewart S. Estimation of optimal dynamic anticoagulation regimes from observational data: a regret-based approach. Statist. Med. 2006;25:4197–4215. doi: 10.1002/sim.2694.
17. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann. Statist. 1978;6:34–58.
18. Stefanski LA, Boos DD. The calculus of M-estimation. Amer. Statist. 2002;56:29–38.
19. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
20. Watkins CJCH, Dayan P. Q-learning. Mach. Learn. 1992;8:279–292.
21. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
22. Zhao Y, Kosorok MR, Zeng D. Reinforcement learning design for cancer clinical trials. Statist. Med. 2009;28:3294–3315. doi: 10.1002/sim.3720.
