Author manuscript; available in PMC 2014 Dec 23. Published in final edited form as: Biometrika. 2014 Oct 20;101(4):831–847. doi: 10.1093/biomet/asu043

Interactive model building for Q-learning

Eric B Laber 1, Kristin A Linn 1, Leonard A Stefanski 1
PMCID: PMC4274394  NIHMSID: NIHMS613829  PMID: 25541562

Summary

Evidence-based rules for optimal treatment allocation are key components in the quest for efficient, effective health care delivery. Q-learning, an approximate dynamic programming algorithm, is a popular method for estimating optimal sequential decision rules from data. Q-learning requires the modeling of nonsmooth, nonmonotone transformations of the data, complicating the search for adequately expressive, yet parsimonious, statistical models. The default Q-learning working model is multiple linear regression, which is not only provably misspecified under most data-generating models, but also results in nonregular regression estimators, complicating inference. We propose an alternative strategy for estimating optimal sequential decision rules for which the requisite statistical modeling does not depend on nonsmooth, nonmonotone transformed data, does not result in nonregular regression estimators, is consistent under a broader array of data-generation models than Q-learning, results in estimated sequential decision rules that have better sampling properties, and is amenable to established statistical approaches for exploratory data analysis, model building, and validation. We derive the new method, IQ-learning, via an interchange in the order of certain steps in Q-learning. In simulated experiments IQ-learning improves on Q-learning in terms of integrated mean squared error and power. The method is illustrated using data from a study of major depressive disorder.

Some key words: Dynamic Treatment Regime, Personalized Medicine, Treatment Selection

1. Introduction

Clinical treatment decisions are based on a patient’s treatment history and current health status. Sequential decision rules, also known as dynamic treatment regimes, formalize this process by specifying a sequence of decision rules, one for each treatment decision, that take as input a patient’s treatment and covariate history and output recommended treatments. An optimal sequential decision rule is one that maximizes a desirable clinical outcome. The sequential nature of clinical decision making problems has led researchers to estimate optimal sequential decision rules using approximate dynamic programming procedures (Robins, 2004; Murphy, 2005a). Q-learning with function approximation, hereafter Q-learning, is one such popular method (Murphy, 2005a). However, Q-learning and its variants involve modeling nonsmooth, nonmonotone functions of the data. Nonmonotonicity complicates the regression function, whereas nonsmoothness imparts nonregularity to estimators. Inference in the presence of such nonregularity has been well studied by Robins (2004), Chakraborty et al. (2010), Laber et al. (2014), and Moodie and Richardson (2010). However, much less attention has been directed toward the effect of nonsmoothness on the equally important applied problems of model building and diagnostics. The preceding references all study linear working models, making little mention of their appropriateness or of how to interactively build a model using data. As we demonstrate, even under simple generative models, linear working models result in questionable fits.

Rather than develop specialized exploratory and model building techniques for Q-learning, we propose to model the data before applying the necessary nonmonotone, nonsmooth operations. Because standard interactive model building techniques can be used with our new version of Q-learning, we call it IQ-learning for interactive Q-learning. Interactive model building is an essential part of extracting meaningful information from data (Henderson and Velleman, 1981; Cook and Weisberg, 1982; Henderson et al., 2010; Rich et al., 2010; Chakraborty and Moodie, 2013). This is especially true when the analysis is intended to inform clinical practice or provide scientific insight, and thus IQ-learning is especially attractive in applications.

IQ-learning has several advantages over Q-learning. For a large class of generative models, IQ-learning involves only simple, well-studied, and well-understood conditional mean and variance modeling of smooth transformations of the data, resulting in better fitting and more interpretable models (Carroll and Ruppert, 1988). Furthermore, inference for coefficients indexing the working models in IQ-learning is greatly simplified by regular normal limit theory. However, IQ-learning does not resolve the nonregularity of the estimated non-terminal Q-functions, only that of the coefficients on which they depend. Nevertheless, this is an important distinction from Q-learning and related methods. This issue is discussed further in Section 2·3.

2. Q- and IQ-Learning

2·1. Setup

For simplicity we consider the case of two treatment decisions and two treatment options per stage. In addition to being the most common in practice (see www4.stat.ncsu.edu/~laber/smarts.html), the two-stage, binary-treatment setting facilitates exposition while maintaining the key features of more general problems. While we focus on randomized studies, the proposed approach is valid for observational data under the same causal conditions required for Q-learning (Schulte et al., 2012).

Training data for the two-stage, two-treatment case, 𝒟 = {(X1,i, A1,i, X2,i, A2,i, Yi)}_{i=1}^{n}, consist of independent identically distributed copies of the quintuple (X1, A1, X2, A2, Y) containing data collected on a single subject. Each quintuple is time ordered and thus called a trajectory: X1 ∈ ℝp1 is baseline covariate information; A1 ∈ {−1, 1} is the first treatment; X2 ∈ ℝp2 is covariate information collected between the first and second treatment assignments; A2 ∈ {−1, 1} is the second treatment; and Y ∈ ℝ is the outcome, coded so that higher values coincide with more desirable clinical outcomes. For notational compactness and conformity with established practice, we denote the information available prior to the tth treatment assignment by Ht. Thus, H1 = X1 and H2 = (X1, A1, X2).

2·2. Q-Learning

The goal of Q- and IQ-learning is estimation of a pair of decision rules π = (π1, π2) such that πt maps the domain of Ht into the set of treatments, πt : ℋt ↦ {−1, 1}. An optimal sequential decision rule πopt maximizes expected outcome. Let Eπ denote expectation under the restriction At = πt(Ht); then πopt satisfies Eπopt(Y) = supπ Eπ(Y).

Q-learning, a regression-based, approximate dynamic programming algorithm, is commonly used to estimate πopt (Murphy, 2005a). The algorithm depends on the Q-functions,

Q2(h2, a2) = E(Y | H2 = h2, A2 = a2),   (1)
Q1(h1, a1) = E{max_{a2∈{−1,1}} Q2(H2, a2) | H1 = h1, A1 = a1}.   (2)

Thus, Qt(ht, at) measures the quality of treatment at when assigned to a patient with history ht, assuming that optimal treatment decisions are made in future stages (Sutton and Barto, 1998). If the Q-functions were known, the optimal sequential decision rule could be determined using dynamic programming (Bellman, 1957), yielding the solution πtopt(ht) = arg max_{at∈{−1,1}} Qt(ht, at) (t = 1, 2). The Q-functions are seldom known, and in practice Q-learning mimics the dynamic programming solution by replacing the unknown conditional expectations with fitted regression models. For reasons of data scarcity, model parsimony, and simplicity, it is common to use linear models for the Q-functions:

Qt(ht, at; βt) = ht,0βt,0 + at ht,1βt,1,   t = 1, 2,   (3)

where ht,0 and ht,1 are, possibly the same, subvectors of ht, and βt = (βt,0, βt,1). Q-learning with linear models consists of the following three steps, illustrated in the sketch that follows the list:

  • Q1. Estimate β2, and hence Q2, via least squares regression of Y on H2 and A2 using the working model (3), i.e., β̂2 = arg min_{β2} Σ_{i=1}^{n} {Yi − Q2(H2,i, A2,i; β2)}²;

  • Q2. Calculate predicted future outcomes Ỹ assuming optimal Stage 2 decisions, Ỹ = max_{a2∈{−1,1}} Q2(H2, a2; β̂2) = H2,0β̂2,0 + |H2,1β̂2,1|; then estimate β1, and hence Q1, via least squares regression of Ỹ on H1 and A1 using the working model (3), i.e., β̂1 = arg min_{β1} Σ_{i=1}^{n} {Ỹi − Q1(H1,i, A1,i; β1)}²;

  • Q3. Calculate the estimated Q-learning optimal treatment policy π̂Q = (π̂1Q, π̂2Q) as π̂tQ(ht) = arg max_{at∈{−1,1}} Qt(ht, at; β̂t).
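The three steps can be written compactly in code. The following is a minimal sketch in R under the linear working model (3) with scalar covariates; the data frame `d` and its column names X1, A1, X2, A2, Y are illustrative assumptions, and this is not the iqLearn implementation.

```r
# Minimal sketch of Q-learning with linear working models (steps Q1-Q3).
# Assumes a data frame `d` with columns X1, X2 (scalar covariates),
# treatments A1, A2 coded -1/1, and outcome Y; names are illustrative.
q_learn <- function(d) {
  # Q1: least squares fit of the stage-2 working model
  fit2 <- lm(Y ~ X1 + A1 + X2 + A2 + A2:X2 + A2:A1, data = d)

  # Q2: predicted outcome under the best stage-2 treatment for each subject,
  # followed by a least squares fit of the stage-1 working model
  Ytilde <- pmax(predict(fit2, transform(d, A2 =  1)),
                 predict(fit2, transform(d, A2 = -1)))
  fit1 <- lm(Ytilde ~ X1 + A1 + A1:X1, data = d)

  # Q3: the estimated rules pick the treatment maximizing each fitted Q-function
  pi2 <- function(newd) ifelse(predict(fit2, transform(newd, A2 = 1)) >=
                               predict(fit2, transform(newd, A2 = -1)), 1, -1)
  pi1 <- function(newd) ifelse(predict(fit1, transform(newd, A1 = 1)) >=
                               predict(fit1, transform(newd, A1 = -1)), 1, -1)
  list(fit2 = fit2, fit1 = fit1, pi1 = pi1, pi2 = pi2)
}
```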

There is nothing unusual about the regression modeling in the first step, which is amenable to well-studied techniques and diagnostics. The same is not true of the modeling in the second step. The absolute value function in the definition of Ỹ means that even if the relationships among the components of H1 and H2 are approximately linear, the dependence of Ỹ on H1 would necessarily be nonmonotone. Thus the working model for Ỹ commonly used in practice is generally wrong. Furthermore, there are no simple transformations of the observed variables or simple modifications to the working model that render the true regression of Ỹ on H1 and A1 linear, even asymptotically. We illustrate this point with data generated from the model

X1 ~ N(0, σ²),   ξ ~ N(0, τ²),   X2 = ζX1 + ξ,   At ~ Unif{−1, 1} (t = 1, 2),   ϕ ~ N(0, γ²),   Y = 1.25A1A2 + A2X2 − A1X1 + ϕ,   (4)

where σ², τ², ζ, and γ² are fixed parameters. In most applications one expects X1 and X2 to be correlated, i.e., ζ ≠ 0. Treatments are randomly assigned at each stage as in a sequential multiple assignment randomized trial design (Murphy, 2005b; see also Lavori and Dawson, 2000, 2004). The working model in (3) for Q2(H2, A2) is correct, so the resulting predicted value Ỹ approximates the true predicted value ỸTrue = |1.25A1 + X2| − A1X1. It follows that the regression in Step Q2 approximates the regression of ỸTrue on H1 = X1 and A1. Substituting for X2 in ỸTrue shows that ỸTrue = |1.25A1 + ζX1 + ξ| − A1X1, from which it is apparent that E(ỸTrue | X1, A1) is linear in X1 for fixed A1 only in the unlikely case that ζ = 0. Thus correlation between X1 and X2 induces a nonlinear dependence of Ỹ on X1.
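To make the nonlinearity concrete, the following short simulation, a sketch using the Fig. 1 parameter values (ζ = 0.85, σ = 1, τ = γ = 1/√2), draws trajectories from (4), forms the oracle pseudo-outcome ỸTrue, and smooths it against X1 within each first-stage treatment arm.

```r
# Simulate the toy model (4) and inspect the dependence of the oracle
# pseudo-outcome on X1; parameter values follow the Fig. 1 setting.
set.seed(1)
n    <- 1000
zeta <- 0.85; sigma <- 1; tau <- 1 / sqrt(2); gam <- 1 / sqrt(2)
X1   <- rnorm(n, 0, sigma)
A1   <- sample(c(-1, 1), n, replace = TRUE)
X2   <- zeta * X1 + rnorm(n, 0, tau)
A2   <- sample(c(-1, 1), n, replace = TRUE)
Y    <- 1.25 * A1 * A2 + A2 * X2 - A1 * X1 + rnorm(n, 0, gam)

# Oracle pseudo-outcome max_{a2} E(Y | H2, a2); in practice Q-learning uses
# the fitted analogue from step Q2
Ytrue <- abs(1.25 * A1 + X2) - A1 * X1

# The smooth of Ytrue on X1 within each A1 arm is visibly curved and the
# spread changes with X1 (heteroscedasticity)
plot(X1, Ytrue, col = ifelse(A1 == 1, "grey60", "black"), pch = 20)
for (a in c(-1, 1)) lines(lowess(X1[A1 == a], Ytrue[A1 == a]), lwd = 2)
```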

The left-hand panel of Figure 1 displays a scatterplot of Ỹ against X1 for each value of A1 based on 1,000 random draws from model (4) with ζ = 0.85, σ = 1, and τ = γ = 1/√2, using the Q-learning algorithm to calculate Ỹ. The figure illustrates nonlinearity in the regression of Ỹ on X1 and also heteroscedastic variation induced by the max operation in Step Q2. As this toy model makes clear, identifying the correct form of the regression model for E(ỸTrue | H1, A1) and fitting it efficiently would be difficult in the realistic case that the data-generating model is unknown; one approach would be to adopt nonparametric models for the Q-functions (e.g., Zhao et al., 2011; Moodie et al., 2013), but some clinicians are wary of black-box approaches and it can be difficult to glean scientific knowledge from these models. Thus, common practice is to ignore the problem and settle for the best approximation afforded by fitting linear models. This problem is shared by variants of Q-learning (e.g., A-learning, Murphy, 2003; Blatt et al., 2004; Robins, 2004; Schulte et al., 2012). In contrast, the right-hand panel of Fig. 1 shows the first-stage regression model that must be fit in our proposed method, IQ-learning, described next. It is a common analysis of covariance model in X1 and A1.

Fig. 1. Scatterplots of Ỹ (left) and Δ̂(H2) (right) against X1 for A1 = −1 (black circles) and A1 = 1 (grey crosses) for 1,000 random samples from the toy model. Step 2 of the Q-learning algorithm requires modeling the data in the left plot; note the nonlinearity and heteroscedasticity. Data in the right plot must be modeled for IQ-learning; note the common analysis of covariance structure.

2·3. IQ-Learning

IQ-learning replaces the difficult problem of modeling the predicted future optimal outcomes Ỹ = max_{a2∈{−1,1}} Q2(H2, a2; β̂2) = H2,0β̂2,0 + |H2,1β̂2,1| with two ordinary mean-variance function modeling problems. Its practical advantages result from the fact that there is a wealth of models and theory for mean-variance function modeling (Carroll and Ruppert, 1988). Thus it has the potential for better model building and diagnostics. The modeling required is familiar and is generally interactive. We first describe the IQ-learning algorithm in general terms and then discuss special cases that are useful in practice.

Whereas Q-learning models max_{a2∈{−1,1}} Q2(H2, a2) directly, IQ-learning starts with the Q2 contrast and main-effect functions: Δ(H2) = {Q2(H2, 1) − Q2(H2, −1)}/2; μ(H2) = {Q2(H2, 1) + Q2(H2, −1)}/2. The contrast and main-effect functions are linear, and hence smooth and monotone, functions of Q2(·), and max_{a2∈{−1,1}} Q2(H2, a2) = μ(H2) + |Δ(H2)|. Let gh1,a1(·) denote the conditional distribution of the contrast Δ(H2) given H1 = h1 and A1 = a1. With these definitions Q1(h1, a1) defined in (2) can be written as

Q1(h1, a1) = E{μ(H2) | H1 = h1, A1 = a1} + ∫ |z| gh1,a1(z) dz.   (5)

The IQ-learning estimator of Q1(h1, a1) has the form

Q̂1IQ(h1, a1) = L̂(h1, a1) + ∫ |z| ĝh1,a1(z) dz,   (6)

where L̂(h1, a1) and ĝh1,a1(·) are estimators of E{μ(H2) | H1 = h1, A1 = a1} and gh1,a1(·), respectively.

Let Q̂2(H2, A2) denote the estimator obtained in Step Q1 of the Q-learning algorithm. Define the estimated main-effect and contrast functions

μ̂(H2) = {Q̂2(H2, 1) + Q̂2(H2, −1)}/2,   Δ̂(H2) = {Q̂2(H2, 1) − Q̂2(H2, −1)}/2.   (7)

Then L̂(h1, a1) is obtained by modeling the regression of μ̂(H2) on H1 and A1, for which linear models are often adequate as no unusual transformations are involved. Obtaining ĝh1,a1(·) is accomplished by estimating the conditional distribution of Δ̂(H2) given H1 and A1. For this we exploit mean-variance function modeling as explained in Section 2·4. Thus we have the following algorithm for IQ-learning:

  • IQ1. Use Step Q1 of the Q-learning algorithm to obtain β̂2 and Q̂2IQ(H2, A2) = Q2(H2, A2; β̂2);

  • IQ2. (a) Regress the estimated main-effect function μ̂(H2) from (7) on H1 and A1 to obtain an estimator L̂(h1, a1) of E{μ(H2) | H1 = h1, A1 = a1};
    (b) model the conditional distribution of the estimated contrast function Δ̂(H2) from (7) given H1 = h1 and A1 = a1 to obtain an estimator ĝh1,a1(·) of gh1,a1(·);
    (c) combine the estimators from IQ2a and IQ2b to obtain Q̂1IQ(h1, a1) = L̂(h1, a1) + ∫ |z| ĝh1,a1(z) dz;

  • IQ3. Define the IQ-learning estimated optimal treatment policy π̂IQ = (π̂1IQ, π̂2IQ) so that π̂tIQ(ht) = arg max_{at∈{−1,1}} Q̂tIQ(ht, at).

Completing our algorithm requires specific models for Steps IQ1, IQ2a, and IQ2b. As noted previously, Steps IQ1 and IQ2a are usually straightforward and linear models will often suffice. We now show how to accomplish the modeling in Step IQ2b efficiently and with sufficient flexibility for many applications by using mean-variance models.

2·4. Location-Scale Working Models for gh1, a1(·)

Henceforth we consider mean-variance, location-scale estimators of gh1, a1(·) of the form

ĝh1,a1(z) = {1/σ̂(h1, a1)} ϕ̂[{z − m̂(h1, a1)}/σ̂(h1, a1)],   (8)

where m̂(h1, a1) is an estimator of m(h1, a1) = E{Δ(H2) | H1 = h1, A1 = a1}, σ̂²(h1, a1) is an estimator of σ²(h1, a1) = E[{Δ(H2) − m(h1, a1)}² | H1 = h1, A1 = a1], and ϕ̂ is an estimator of the density of the standardized residuals {Δ(H2) − m(h1, a1)}/σ(h1, a1), say ϕh1,a1, which we assume does not depend on the history h1 or the treatment a1. In other words, we assume that all of the dependence of Δ(H2) on (H1, A1) is captured by the conditional mean and variance functions. The great success of mean-variance function modeling (Carroll and Ruppert, 1988) suggests that this assumption is reasonable quite generally; however, we also note that substantial departures from the assumption can be investigated by stratifying on h1 and a1 and comparing higher-order moments, such as skewness and kurtosis, or nonparametric density estimates of the empirical residuals {Δ̂(H2) − m̂(H1, A1)}/σ̂(H1, A1) across the strata.
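A simplified version of this check, stratifying on the first-stage treatment only, is sketched below; the vectors ehat (standardized residuals) and A1 are assumed to come from fits such as those in the preceding algorithm, and the names are illustrative.

```r
# Informal check that the standardized residual distribution does not depend
# on (H1, A1): compare moments and densities of ehat across A1 strata.
# `ehat` and `A1` are assumed to be available from earlier fits.
skewkurt <- function(e) {
  z <- (e - mean(e)) / sd(e)
  c(skewness = mean(z^3), kurtosis = mean(z^4))
}
sapply(split(ehat, A1), skewkurt)      # compare higher-order moments by arm

# Overlaid nonparametric density estimates give a visual comparison
plot(density(ehat[A1 == 1]), main = "Standardized residuals by A1")
lines(density(ehat[A1 == -1]), lty = 2)
```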

We now describe two special cases of the estimator in (8). Let ϕ denote the density of a standard normal random variable. A simple but useful estimator of gh1,a1(·) is the normal location-scale model:

ĝh1,a1N(z) = {1/σ̂(h1, a1)} ϕ[{z − m̂(h1, a1)}/σ̂(h1, a1)],   (9)

which is a special case of (8) with ϕ̂ = ϕ. An advantage of this model is that ∫ |z| ĝh1,a1N(z) dz can be evaluated in closed form. In particular,

∫ |z| ĝh1,a1N(z) dz = m̂(h1, a1) [2Φ{m̂(h1, a1)/σ̂(h1, a1)} − 1] + 2σ̂(h1, a1) ϕ{m̂(h1, a1)/σ̂(h1, a1)},   (10)

where Φ is the standard normal cumulative distribution function. If the mean and variance functions are correctly specified and ϕh1,a1 = ϕ, then the IQ-learning location-scale model is correct. As commonly implemented, Q-learning fits a misspecified model and thus its estimators are not consistent. The closed-form expression in (10) makes it possible to study the bias in Q-learning when the true data-generating model is a normal mean-variance function model. Details are in the Supplementary Material.
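Expression (10) is the mean of |Z| for Z normally distributed with mean m̂(h1, a1) and standard deviation σ̂(h1, a1). A small sketch, with illustrative values standing in for the fitted mean and scale, verifies the closed form against a Monte Carlo average.

```r
# E|Z| for Z ~ N(m, s^2), as in (10): closed form versus a Monte Carlo check.
e_abs_normal <- function(m, s) m * (2 * pnorm(m / s) - 1) + 2 * s * dnorm(m / s)

m <- 0.4; s <- 1.3                 # illustrative values of mhat, sigmahat
e_abs_normal(m, s)
mean(abs(rnorm(1e6, m, s)))        # agrees up to Monte Carlo error
```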

The normal location-scale model assumes that ϕh1,a1 = ϕ, the standard normal density. Violations of normality can be diagnosed via examination of the observed standardized residuals, êi = {Δ̂(H2,i) − m̂(H1,i, A1,i)}/σ̂(H1,i, A1,i) (i = 1, …, n). When greater modeling flexibility is desired, the normality assumption can be dropped and the empirical distribution of the êi used instead. Defining

Ĝh1,a1(z) = n⁻¹ Σ_{i=1}^{n} 1(êi ≤ z)   and   ĝh1,a1E(z) dz = dĜh1,a1(z)   (11)

leads to the nonparametric location-scale estimator of ∫ |z| gh1,a1(z) dz, namely ∫ |z| ĝh1,a1E(z) dz = n⁻¹ Σ_{i=1}^{n} |m̂(h1, a1) + σ̂(h1, a1) êi|. We show in Section 2·5 that the nonparametric location-scale estimator is consistent and asymptotically normal.
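In code the nonparametric estimator is a one-line average over the standardized residuals; mhat, shat, and ehat below stand in for the fitted mean, fitted scale, and residual vector and are illustrative names.

```r
# Nonparametric location-scale estimate of the integral of |z| g(z) dz at a
# fixed (h1, a1): average |mhat + shat * ehat_i| over the standardized
# residuals ehat_1, ..., ehat_n.
int_abs_E <- function(mhat, shat, ehat) mean(abs(mhat + shat * ehat))
```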

Data for estimating optimal sequential decision rules are typically expensive to collect, so samples are seldom large and parametric mean and variance function models are more useful than nonparametric models. Thus we assume that m(h1, a1) = m(h1, a1; θ) and σ(h1, a1) = σ(h1, a1; γ) for some θ ∈ ℝpm and γ ∈ ℝps. Similarly, we assume that L(h1, a1) = L(h1, a1; α), α ∈ ℝpL. For the results in Section 3 we completed the specification of the IQ-learning algorithm in Section 2·3 as follows. Steps IQ2a and IQ2b are amended to:

  • IQ2a. Set L̂(h1, a1) = L(h1, a1; α̂), where α̂ = arg min_{α} Σ_{i=1}^{n} {μ̂(H2,i) − L(H1,i, A1,i; α)}²;

  • IQ2bi. Set m̂(h1, a1) = m(h1, a1; θ̂), where θ̂ = arg min_{θ} Σ_{i=1}^{n} {Δ̂(H2,i) − m(H1,i, A1,i; θ)}²;

  • IQ2bii. Set σ̂(h1, a1) = σ(h1, a1; γ̂), where γ̂ = arg min_{γ} Σ_{i=1}^{n} {log |Δ̂(H2,i) − m(H1,i, A1,i; θ̂)| − log σ(H1,i, A1,i; γ)}²;

  • IQ2biii. Set ĝh1,a1 to either ĝh1,a1N in (9) or ĝh1,a1E in (11).

We have used a simple model for the conditional variance in step IQ2bii; for a discussion of other conditional variance estimators and their asymptotic properties see Carroll and Ruppert (1988).
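Putting Steps IQ1, IQ2a, IQ2bi–IQ2biii, and IQ3 together with linear mean models and a log-linear variance model gives the following sketch. It assumes scalar covariates and illustrative column names, and it is a schematic rather than the iqLearn implementation.

```r
# Sketch of IQ-learning with linear working models and the location-scale
# estimators of g_{h1,a1}; columns X1, X2, A1, A2, Y of `d` are illustrative.
iq_learn <- function(d, normal = TRUE) {
  # IQ1: second-stage fit, as in Step Q1 of Q-learning
  fit2 <- lm(Y ~ X1 + A1 + X2 + A2 + A2:X2 + A2:A1, data = d)
  q2p  <- predict(fit2, transform(d, A2 =  1))
  q2m  <- predict(fit2, transform(d, A2 = -1))
  mu    <- (q2p + q2m) / 2                    # estimated main effect, as in (7)
  delta <- (q2p - q2m) / 2                    # estimated contrast, as in (7)

  # IQ2a: regression of the main-effect function on (H1, A1)
  fitL <- lm(mu ~ X1 * A1, data = d)
  # IQ2bi: conditional mean of the contrast
  fitm <- lm(delta ~ X1 * A1, data = d)
  # IQ2bii: log-linear model for the conditional standard deviation
  fits <- lm(log(abs(residuals(fitm))) ~ X1 * A1, data = d)
  ehat <- residuals(fitm) / exp(fitted(fits)) # standardized residuals

  # IQ2biii and IQ3: first-stage Q-function and decision rule
  q1 <- function(newd) {
    Lhat <- predict(fitL, newd)
    mhat <- predict(fitm, newd)
    shat <- exp(predict(fits, newd))
    if (normal)                               # closed form (10)
      Lhat + mhat * (2 * pnorm(mhat / shat) - 1) + 2 * shat * dnorm(mhat / shat)
    else                                      # nonparametric version of (11)
      Lhat + sapply(seq_along(mhat),
                    function(i) mean(abs(mhat[i] + shat[i] * ehat)))
  }
  pi1 <- function(newd) ifelse(q1(transform(newd, A1 = 1)) >=
                               q1(transform(newd, A1 = -1)), 1, -1)
  list(q1 = q1, pi1 = pi1, fit2 = fit2)
}
```

The returned rule functions can be applied to new patient histories, for example pi1(data.frame(X1 = 0.3)) for a hypothetical baseline covariate value.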

2·5. Asymptotic Theory

Asymptotic distribution theory for Q̂2IQ(h2, a2) is covered by standard results for linear regression, so we address only the asymptotic distribution of Q̂1IQ(h1, a1) for the particular parametric estimators defined in the IQ-learning algorithm. Define the population residuals

ε(H2, H1, A1; θ, γ, β2) = {Δ(H2; β2) − m(H1, A1; θ)}/σ(H1, A1; γ)

and the population parameters:

β2* = arg min_{β2} E{Y − Q2(H2, A2; β2)}²,
θ* = arg min_{θ} E{Δ(H2; β2*) − m(H1, A1; θ)}²,
γ* = arg min_{γ} E{log |Δ(H2; β2*) − m(H1, A1; θ*)| − log σ(H1, A1; γ)}²,
α* = arg min_{α} E{μ(H2; β2*) − L(H1, A1; α)}².

Let θ̂, γ̂, β̂2, and α̂ denote n^{1/2}-consistent estimators of their population analogs θ*, γ*, β2*, and α*. For x ∈ ℝp, let ℬd(x) denote a ball of radius d centered at x, and let En denote the empirical expectation operator, so that En f = n⁻¹ Σ_{i=1}^{n} f(H1,i, A1,i, H2,i, A2,i, Yi). The asymptotic results are stated in terms of the seven centered statistics:

ΔL = L(h1, a1; α̂) − L(h1, a1; α*),   Δm = m(h1, a1; θ̂) − m(h1, a1; θ*),   Δβ = β̂2 − β2*,
Δσ = σ(h1, a1; γ̂) − σ(h1, a1; γ*),   Δθ = θ̂ − θ*,   Δγ = γ̂ − γ*,
Δε = En{|m(h1, a1; θ*) + σ(h1, a1; γ*) ε(H2, H1, A1; θ*, γ*, β2*)|} − E{|m(h1, a1; θ*) + σ(h1, a1; γ*) ε(H2, H1, A1; θ*, γ*, β2*)|}.

The following assumptions are used to establish the limit theory for IQ-learning:

  • (A1N)

    n1/2L, Δm, Δσ) is asymptotically N {0,ΣN(h1, a1)};

  • (A1E)

    n1/2L, Δθ, Δγ, Δβ, Δε) is asymptotically N {0, ΣE(h1, a1)};

  • (A2)
    let k1, k2, and k3 denote fixed positive constants; the class of functions
    ℱ = {f(H2, H1, A1; θ, γ, β2) = |m(h1, a1; θ) + σ(h1, a1; γ) ε(H2, H1, A1; θ, γ, β2)| : θ ∈ ℬk1(θ*), γ ∈ ℬk2(γ*), β2 ∈ ℬk3(β2*)}
    is a P-measurable Donsker class with a square-integrable envelope. In addition, J(θ, γ, β2) = E f(H2, H1, A1; θ, γ, β2) is continuously differentiable with a bounded derivative in a neighborhood of (θ*, γ*, β2*);
  • (A3)

    the random variable ε(H2, H1, A1; θ*, γ*, β2*) has a continuously differentiable density κ with derivative κ′(z) satisfying |∫ z² κ′(z) dz| < ∞.

These assumptions are relatively mild with (A1N), (A1E), and the first part of (A2) verifiable using standard techniques, e.g., the multivariate central limit and Donsker preservation theorems (see Kosorok, 2008, and the Supplementary Material). Assumption (A3) is generally more difficult to verify, but its validity can be roughly assessed using the observed residuals ê1,…, ên. We have not sought the most general assumptions, but rather a set of simple assumptions that illustrate what is needed for the IQ-learning estimator to be well-behaved.

The first result states the asymptotic normality of the normal and nonparametric location-scale IQ-learning estimators of the first-stage Q-function. Let Q̂1IQ,N(h1, a1) and Q̂1IQ,E(h1, a1) denote the normal and nonparametric location-scale estimators, respectively, so that

Q̂1IQ,N(h1, a1) = L(h1, a1; α̂) + ∫ |z| ĝh1,a1N(z) dz,   Q̂1IQ,E(h1, a1) = L(h1, a1; α̂) + ∫ |z| ĝh1,a1E(z) dz.

Define I(υ, t, s) = υ + s⁻¹ ∫ |z| ϕ{(z − t)/s} dz. Let I*(h1, a1) denote I{L(h1, a1; α*), m(h1, a1; θ*), σ(h1, a1; γ*)} and ∇I*(h1, a1) the gradient of I evaluated at the same arguments. The following is proved in the Supplementary Material.

Theorem 1 (Asymptotic normality). Let h1 and a1 be fixed.

  1. Assume (A1N). Then
    n^{1/2} [Q̂1IQ,N(h1, a1) − L(h1, a1; α*) − {1/σ(h1, a1; γ*)} ∫ |z| ϕ[{z − m(h1, a1; θ*)}/σ(h1, a1; γ*)] dz]
    converges in distribution to N{0, ∇I*(h1, a1)⊤ ΣN(h1, a1) ∇I*(h1, a1)}.
  2. Assume (A1E), (A2), and (A3). Then
    n^{1/2} [Q̂1IQ,E(h1, a1) − L(h1, a1; α*) − {1/σ(h1, a1; γ*)} ∫ |z| κ[{z − m(h1, a1; θ*)}/σ(h1, a1; γ*)] dz]
    converges in distribution to N[0, {1, ∇J(θ*, γ*, β2*), 1} ΣE(h1, a1) {1, ∇J(θ*, γ*, β2*), 1}⊤].

Theorem 1 shows that both location-scale estimators Q̂1IQ,N(h1, a1) and Q̂1IQ,E(h1, a1) are asymptotically normal under the stated conditions, which do not require correct specification of the IQ-learning models. Under (C1) and (C2) below, the IQ-learning models are correctly specified, and consistency and asymptotic normality of the IQ-learning estimators follow.

  • (C1)

    Let Z denote a standard normal random variable; then Δ(H2; β2*) = m(H1, A1; θ*) + σ(H1, A1; γ*)Z.

  • (C2)

    Let W be a random variable with density κ(·), then Δ(H2;β2*)=m(H1,A1;θ*)+σ(H1,A1;γ*)W.

The following results are direct consequences of Theorem 1 and we omit their proofs.

Corollary 1. Assume Q2(h2,a2)=Q2(h2,a2;β2*) and E{μ(H2;β2*)|H1=h1,A1=a1}=L(h1,a1;α*). Then

  1. if (C1), Q1(h1, a1) = L(h1, a1; α*) + {1/σ(h1, a1; γ*)} ∫ |z| ϕ[{z − m(h1, a1; θ*)}/σ(h1, a1; γ*)] dz;

  2. if (C2), Q1(h1, a1) = L(h1, a1; α*) + {1/σ(h1, a1; γ*)} ∫ |z| κ[{z − m(h1, a1; θ*)}/σ(h1, a1; γ*)] dz.

Theorem 2. Assume (A1N) and the conditions of Theorem 1. Let h1 and a1 be fixed.

  1. If (C1) then n^{1/2}{Q̂1IQ,N(h1, a1) − Q1(h1, a1)} converges in distribution to N{0, ∇I*(h1, a1)⊤ ΣN(h1, a1) ∇I*(h1, a1)}.

  2. If (A1E), (A2), (A3), and (C2) then n^{1/2}{Q̂1IQ,E(h1, a1) − Q1(h1, a1)} converges in distribution to N[0, {1, ∇J(θ*, γ*, β2*), 1} ΣE(h1, a1) {1, ∇J(θ*, γ*, β2*), 1}⊤].

Remark 1. Theorem 2 can be used to construct asymptotically valid confidence intervals for the first-stage Q-function for fixed patient history h1 and first-stage treatment a1. This is a notoriously difficult task with Q-learning (Laber et al., 2014). In practice, due to the complexity of the variance terms, the bootstrap may be preferred. Remarks about how to extend the above results to obtain bootstrap consistency are made in the Supplementary Material.

Remark 2. As noted in the introduction, IQ-learning does not alleviate the inherent nonregularity present in sequential decision making problems; see Robins (2004), Laber et al. (2014), and Chakraborty et al. (2010). However, IQ-learning is consistent in a nonregular scenario of interest, the so-called global null in which there is no treatment effect for any patient at the second stage, i.e., Δ(H2; β2*) = 0 almost surely. To see this, note that assuming (A1N), (C1) holds with m(H1, A1; θ*) = 0 and σ(H1, A1; γ*) → 0 almost surely. Part 1 of Corollary 1 depends only on part 1 of Theorem 1 and part 1 of Theorem 2. For the more complex case in which 0 < pr{Δ(H2; β2*) = 0} < 1 we conjecture that using a mixture of normals to estimate gh1,a1(·) may lead to improved small-sample performance.

3. Monte Carlo Results

We compare the small-sample performance of IQ- and Q-learning in terms of the average value of the learned treatment regimes, the integrated mean squared error of the first-stage Q-function, and the coverage and width of 95% nonparametric bootstrap confidence intervals for the first-stage Q-function. A key advantage of IQ-learning is its compatibility with common model building steps, which we illustrate in the Supplementary Material with a study of the power to detect nonlinear effects in the first-stage Q-function using IQ-learning and Q-learning. Software implementing the IQ-learning estimators is available as part of the iqLearn package on the Comprehensive R Archive Network (cran.us.r-project.org/). The Monte Carlo results show that for the class of generative models we consider, IQ-learning generally performs better than Q-learning in terms of value, integrated mean squared error, and coverage and width of the confidence intervals. Simulations in the Supplementary Material show that IQ-learning also has higher power to detect nonlinear effects.

Simulations in this section use data from the following class of generative models:

X1 ~ Normalp{0.1, ΩAR1(0.5)},   At ~ Uniform{−1, 1} (t = 1, 2),   X2 = (1.5 − 0.5A1)X1 + ζA1 ξ,   Y = H2,0β2,0 + A2H2,1β2,1 + ϕ,

where {ΩAR1(0.5)}i,j = (0.5)^{|i−j|}, H2,0 = H2,1 = (1, X2, A1, A1X2), and ζA1 = (1.5 + 0.5A1)^{1/2}. Thus, the class is indexed by the dimension p, the distributions of ξ and ϕ, and the coefficient vectors β2,0 and β2,1. Here we fix p = 4; results for p = 8 are similar and are provided in the Supplementary Material. We consider ξ ~ Normalp(0, Ip). We fix the main effect parameter β2,0 and vary the second-stage treatment effect size by scaling β2,1 as follows: β2,0 = 1_{2p+2}/‖1_{2p+2}‖, β2,1 = C{0.25(1_{p+1}), 1_{p+1}}/‖{0.25(1_{p+1}), 1_{p+1}}‖, where C ranges over a grid from 0 to 2, and 1d denotes a d-dimensional vector of 1s. In addition, we fix the theoretical R² of the second-stage regression model at 0.6 by specifying ϕ ~ Normal{0, σϕ²(C)} and solving for the variance σϕ²(C) that yields the desired R². Additional simulations, provided in the Supplementary Material, show results for R² = 0.4, 0.8 and non-normal error distributions for ξ.

We consider linear working models for the mean and log variance functions

Q2(h2, a2; β2) = h2β2,0 + a2h2β2,1,   Q1(h1, a1; β1) = h1β1,0 + a1h1β1,1,   L(h1, a1; α) = h1α0 + a1h1α1,
m(h1, a1; θ) = h1θ0 + a1h1θ1,   log{σ(h1, a1; γ)} = h1γ0 + a1h1γ1,

where now H1 = (1, X1). In addition to Q-learning with linear working models, we include results from support vector regression with a Gaussian kernel (Zhao et al., 2011) used to estimate both Q-functions. We compare the two versions of Q-learning with two IQ-learning estimators that differ in the estimation of gh1,a1(·) and the model for σ(h1, a1; γ): a normal estimator ĝh1,a1N(·) of the residual distribution paired with a restricted variance model, log{σ(h1, a1; γ)} = γ0 + a1γ1, that depends only on treatment; and a nonparametric estimator ĝh1,a1E(·) of the residual distribution paired with a log-linear variance model that depends on h1 and a1. When ξ ~ Normalp(0, Ip), both these estimators are correctly specified. Q-learning is always correctly specified at the second stage but only correctly specified at the first stage when C = 0 and hence β2,1 = 0.

Results are based on a training set of size n = 250 and M = 2,000 Monte Carlo data sets for each generative model. Additional results for n = 500 are provided in the Supplementary Material. For the nonparametric IQ-learning estimator, which is always correctly specified, the true Q-functions and the corresponding optimal regime are estimated using a test set of 10,000 observations. Recall that the value, Eπ(Y), of an arbitrary policy π is the expected outcome if all patients are assigned treatment according to π, that is, Eπ(Y) = E(E[E{Y | H2, a2} | H1, a1]) evaluated at a2 = π2(H2) and a1 = π1(H1). For a given training set of size n and an algorithm that produces an estimated optimal policy, say π̂, we define the average value as E{Eπ̂(Y)}, where the outer expectation is taken over all training sets of size n. We estimate the average value of the IQ- and Q-learning estimators using a test set of size 10,000 to estimate the inner expectation and 2,000 Monte Carlo replications to estimate the outer expectation. We compare average values of the learned IQ and Q regimes to the value of the true optimal regime and present the proportion of optimal value obtained. For an estimator Q̂1(h1, a1) of the first-stage Q-function, Q1(h1, a1), define the integrated mean squared error as E[{Q̂1(H1, A1) − Q1(H1, A1)}²], where the expectation is taken over the joint distribution of (H1, A1) as well as the training data.
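Estimating the value of a fitted policy with a test set amounts to simulating trajectories with treatments assigned by the policy and averaging the mean outcome. For concreteness the sketch below reuses the toy model (4) of Section 2·2 rather than the generative class above, and assumes policy functions such as those returned by the earlier q_learn and iq_learn sketches.

```r
# Monte Carlo estimate of the value E_pi(Y) of a policy pi = (pi1, pi2) under
# the toy model (4); the policy functions take data frames of histories.
estimate_value <- function(pi1, pi2, N = 1e4,
                           zeta = 0.85, sigma = 1, tau = 1 / sqrt(2)) {
  X1 <- rnorm(N, 0, sigma)
  A1 <- pi1(data.frame(X1 = X1))                    # stage-1 assignment
  X2 <- zeta * X1 + rnorm(N, 0, tau)
  A2 <- pi2(data.frame(X1 = X1, A1 = A1, X2 = X2))  # stage-2 assignment
  mean(1.25 * A1 * A2 + A2 * X2 - A1 * X1)          # average of E(Y | H2, A2)
}
```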

Confidence intervals for Q1(h1, a1) based on IQ-learning and Q-learning estimators are formed by bootstrapping the respective estimators and taking percentiles. For example, if l̂ and û denote the 100 × η/2 and 100 × (1 − η/2) percentiles of the bootstrap distribution of Q̂1IQ(h1, a1) based on 1,000 bootstrap resamples, then the 100 × (1 − η)% confidence interval is given by {2Q̂1IQ(h1, a1) − û, 2Q̂1IQ(h1, a1) − l̂}. Bootstrap intervals of this form are sometimes referred to as hybrid bootstrap confidence intervals (Efron and Tibshirani, 1993). Coverage and width of the foregoing confidence intervals are estimated using 2,000 Monte Carlo replications with a new instance (h1, a1) of (H1, A1) drawn for each replication. Figure 2 displays the results from this simulation, where ξ ~ Normalp(0, Ip).
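The hybrid interval reflects the estimator around its bootstrap percentiles. The sketch below assumes a fitting function that maps a training data frame and a fixed history to Q̂1(h1, a1), for example a wrapper around the iq_learn sketch; the function name is a stand-in.

```r
# Hybrid (basic) bootstrap confidence interval for Q1(h1, a1): resample the
# training data, refit, and reflect the bootstrap percentiles around the
# original estimate. `fit_q1(d, newd)` is a stand-in that refits on `d` and
# evaluates the first-stage Q-function at the fixed history `newd`.
hybrid_ci <- function(d, newd, fit_q1, B = 1000, eta = 0.05) {
  est  <- fit_q1(d, newd)
  boot <- replicate(B, fit_q1(d[sample(nrow(d), replace = TRUE), ], newd))
  q    <- unname(quantile(boot, c(eta / 2, 1 - eta / 2)))
  c(lower = 2 * est - q[2], upper = 2 * est - q[1])
}
```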

Fig. 2. Performance of the normal IQ-learning estimator, nonparametric IQ-learning estimator, support vector regression Q-learning, and linear Q-learning, given by gray lines with squares, gray lines with circles, light gray lines, and black lines, respectively. Left to right: average proportion of optimal value attained; integrated mean squared error of the Q1 estimates; coverage of 95% confidence intervals for Q1; width of 95% confidence intervals for Q1.

Figure 2 indicates that the IQ-learning estimators perform better than both Q-learning estimators. Although some gains are achieved using the more flexible support vector regression version of Q-learning, the far left panel indicates that IQ-learning attains higher average value than the Q-learning algorithms across most values of C. The second panel from the left shows that IQ-learning reduces integrated mean squared error the most, with greater reduction as the second-stage effects increase. IQ-learning also demonstrates a large improvement over linear Q-learning in terms of the coverage of 95% confidence intervals for Q1(h1, a1), as seen in the third panel. The poor coverage of linear Q-learning is attributed to bias, whereas IQ-learning is consistent because the regularity conditions given in Section 2·5 for Theorem 2 hold for the generative models in this section. Thus, IQ-learning estimators come close to achieving the nominal level. The average widths of the confidence intervals are similar for linear Q-learning and IQ-learning, as illustrated by the far right panel. Support vector regression Q-learning greatly improves coverage compared to linear Q-learning, but at the expense of wider intervals. Results from the simulation where the elements of ξ are generated independently from a t5 distribution are included in the Supplementary Material and appear similar to those in Fig. 2, suggesting the normal IQ-learning estimator is robust to slight misspecification of the residual distribution.

Next, we consider the model

X1 ~ Normal(2, 1),   ξ ~ Normal(0, 1),   At ~ Uniform{−1, 1} (t = 1, 2),   X2 = X1 + ξ,   ϕ ~ Normal(0, 1),   Y = H2,0β2,0 + A2H2,1β2,1 + ϕ,

where H2,0 = H2,1 = (1, X1, A1, X1A1, X2), β2,0 = (3, −1, 0.1, −0.1, −0.1), and β2,1 = C(−6, −2, 5, 3, −0.2). We use this example to illustrate a scenario where IQ-learning achieves a large gain in value over Q-learning with linear models. Since the predictor X1 is univariate, we can visualize which patients are treated differently by IQ-learning compared with Q-learning. The left plot in Figure 3 was obtained by deriving the true first-stage Q-function, for which IQ-learning is consistent in this case, and comparing the true first-stage rule to the rule recommended by Q-learning with linear models. For each combination of (X1, C), the plot shows whether or not Q-learning makes the correct treatment decision. With this generative model, Q-learning assigns the wrong treatment to approximately half the population for a wide range of effect sizes because the true first-stage Q-function is nonlinear in X1. In contrast, with a sufficiently large sample size, IQ-learning treats all patients according to the optimal rule. A remark explaining the observed pattern is included in Section 6 of the Supplementary Material. Consequently, IQ-learning achieves higher average value than Q-learning, as displayed in the right plot of Fig. 3. Results are based on training sets of size n = 250 and M = 1,000 Monte Carlo data sets. In this scenario, both IQ-learning and support vector regression Q-learning achieve gains of approximately 15% in the proportion of optimal value attained as the second-stage effect size grows.

Fig. 3. A scenario where IQ-learning achieves a large gain in value over Q-learning. The constant C determines the second-stage treatment effect size, from no treatment effects (C = 0) to large effects (C = 2). In the left panel, (X1, C) pairs where linear Q-learning agrees and disagrees with the true first-stage rule are shown in dark and light gray, respectively, where X1 is a normally distributed first-stage covariate. On the right, the average proportion of optimal value attained by the normal IQ-learning estimator, nonparametric IQ-learning estimator, support vector regression Q-learning, and linear Q-learning regimes is shown by gray lines with squares, gray lines with circles, light gray lines, and black lines, respectively.

4. Application to STAR*D

Sequenced Treatment Alternatives to Relieve Depression (STAR*D; Fava et al., 2003; Rush et al., 2004) is a sequentially randomized study of major depressive disorder. A key feature of this trial is that patients experiencing early remission of symptoms were exempt from future randomization, complicating the analysis. We use a subset of the STAR*D data to illustrate how IQ-learning can be used to estimate an optimal dynamic treatment regime in the presence of responder-status dependent designs. There were four stages in the trial, but each patient received Citalopram in the first stage, and thus there was no randomization at that stage. Since our aim is to demonstrate how to learn a regime using the IQ-learning machinery, we opt to perform a complete-case analysis and consider only the first two of the three randomized stages. We refer to the second and third stages of the trial as stages one and two, respectively. At each stage, treatments can be categorized as either a Selective Serotonin Reuptake Inhibitor or not; this is the binary treatment variable in our analysis. The first-line treatment Citalopram given in the non-randomized stage is in the class of Selective Serotonin Reuptake Inhibitors.

We use a measure of efficacy, the Quick Inventory of Depression Symptomatology score, as the outcome (Rush et al., 2004); side-effects or other competing outcomes could be accommodated using set-valued treatment regimes (Lizotte et al., 2012; Laber et al., 2013) or composite outcomes (Wang et al., 2012). The depression score ranges from 0 to 27, with higher values corresponding to more severe negative symptoms. To be consistent with our development, we recode these scores by subtracting them from 27; thus, higher values correspond to better clinical outcomes. The depression score was recorded at multiple time points throughout each stage; intermediate depression scores used as predictors in our analysis are also recoded (Rush et al., 2004; Schulte et al., 2012). At each stage, patients experiencing remission left the study. Remission was defined as a depression score of 5 or less, that is, greater than 21 after recoding. Henceforth, we refer to patients who achieved remission and left after stage one as responders and to second-stage participants as non-responders. The data we use here consist of n = 795 patient trajectories with complete information, not including 481 patients who dropped out for reasons other than remission. The variables composing each trajectory are given in Table 1 of the Supplementary Material.

Of the 795 total patients, 329 were non-responders in the first stage and continued on to the second stage. We define our primary outcome as Y = RY1 + (1 − R)(Y1 + Y2)/2. That is, Y is taken to be Y1 for patients who left the study after stage one, and Y is taken to be the average of the depression scores measured at the end of the first and second stages for non-responders. Our responder-status version of IQ-learning is based on the Q-learning implementation described in Schulte et al. (2012). The first-stage history vector contains all information available prior to the first-stage treatment randomization. Thus, H1 = (X1,1, X1,2). The second-stage history is H2 = (X1,1, X1,2, A1, Y1, R, X2,1, X2,2), which contains all information observed before the second-stage treatment assignment. The second-stage Q-function is Q2(h2, a2) = E(Y | H2 = h2, A2 = a2) = RY1 + (1 − R){Y1 + E(Y2 | H2 = h2, A2 = a2)}/2, where we have used the fact that R and Y1 are contained in H2. Thus the first step in the IQ-learning algorithm, and in Q-learning, is to specify and fit a model for E(Y2 | H2 = h2, A2 = a2). Defining the second-stage summary vectors as H2,0 = H2,1 = (1, X2,1, X2,2), we consider a working model of the form E(Y2 | H2 = h2, A2 = a2) = h2,0β2,0 + a2h2,1β2,1, which we fit via least squares. Standard regression diagnostics based on the residuals do not indicate any major departures from the usual linear modeling assumptions.

In our subset of the data, all non-responders who did not receive a Selective Serotonin Reuptake Inhibitor in stage one did not receive one in stage two. Table 2 of the Supplementary Material provides the number of patients assigned to each treatment strategy. We define Ỹ = max_{a2} Q2(H2, a2)R + {Q2(H2, −1)(1 − A1) + max_{a2} Q2(H2, a2)(1 + A1)}(1 − R)/2. Thus, Ỹ is Q2(H2, −1) for non-responders who received A1 = −1 and max_{a2} Q2(H2, a2) otherwise. With this working model, after substituting in Q2(H2, a2) and simplifying,

Q1(H1, A1) = E(RY1 | H1, A1) + (1/2)E{(1 − R)Y1 | H1, A1} + (1/2)E{(1 − R)H2,0β2,0 | H1, A1} − {(1 − A1)/4}E{(1 − R)H2,1β2,1 | H1, A1} + {(1 + A1)/4}E{(1 − R)|H2,1β2,1| | H1, A1}.   (12)

The Q-learning algorithm models E(Ỹ | H1, A1) directly. The left panel in Figure 4 shows a scatterplot of the pseudo-response Ỹ against baseline depression score by first-stage treatment. Cubic smoothing spline fits to the data are indicated by solid gray and dashed black lines for A1 = 1 and A1 = −1, respectively. These fits appear approximately linear; however, the variance appears non-constant across baseline depression score, and there is clear separation between the responder and non-responder groups.

Fig. 4. First-stage summaries of the STAR*D data. Gray circles and black crosses represent treatments A1 = 1 and A1 = −1, respectively. The left panel shows a scatterplot of Ỹ against baseline depression score by first-stage treatment. The middle panel contains a scatterplot of H2,0β̂2,0 against baseline depression score by first-stage treatment for non-responders. The right panel contains a scatterplot of H2,1β̂2,1 against baseline depression score by first-stage treatment for non-responders. Cubic smoothing spline fits to the data for A1 = 1 and A1 = −1 are represented by solid gray and dashed black lines, respectively.

We can write the first three expectation terms in (12) as:

E(RY1 | H1, A1) = E(Y1 | H1, A1, R = 1) pr(R = 1 | H1, A1),
E{(1 − R)Y1 | H1, A1} = E(Y1 | H1, A1, R = 0) pr(R = 0 | H1, A1),
E{(1 − R)H2,0β2,0 | H1, A1} = E(H2,0β2,0 | H1, A1, R = 0) pr(R = 0 | H1, A1).

We estimate the right-hand conditional expectations by fitting three separate linear regressions, and we use logistic regression to estimate pr(R = r | H1, A1) (r = 0, 1). In particular, we specify linear models of the form E(Y1 | H1 = h1, A1 = a1, R = r) = h1,0λr,0 + a1h1,1λr,1 for E(Y1 | H1, A1, R = r), where H1,0 = H1,1 = (1, X1,1, X1,2). We posit the model E(H2,0β2,0 | H1 = h1, A1 = a1, R = 0) = h1,0α0 + a1h1,1α1 for the main-effect term, which we fit with least squares using only the non-responder data. The middle and right plots in Fig. 4 display scatterplots of the non-responder realizations of the main-effect term and contrast function, respectively, against baseline depression score by first-stage treatment. Cubic smoothing spline fits to the data appear mostly linear. For the logistic regression, we fit the model logit{pr(R = 1 | H1 = h1, A1 = a1)} = h1,0δ0 + a1h1,1δ1.
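These component fits reduce to standard lm and glm calls. The sketch below assumes an analysis data frame with illustrative column names (X11 and X12 for the two baseline covariates, R for responder status) and a column muhat0 holding the fitted second-stage main-effect values H2,0β̂2,0; the names are stand-ins rather than the actual STAR*D variable names.

```r
# Component models for the first-stage Q-function in (12); column names and
# `muhat0` (fitted H2,0' beta2,0 values) are illustrative stand-ins.
nonresp   <- subset(d, R == 0)
fit_y1_r1 <- lm(Y1 ~ (X11 + X12) * A1, data = subset(d, R == 1))    # E(Y1 | ., R = 1)
fit_y1_r0 <- lm(Y1 ~ (X11 + X12) * A1, data = nonresp)              # E(Y1 | ., R = 0)
fit_main  <- lm(muhat0 ~ (X11 + X12) * A1, data = nonresp)          # main-effect term
fit_resp  <- glm(R ~ (X11 + X12) * A1, family = binomial, data = d) # pr(R = 1 | .)
```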

Finally, we must obtain estimates of E{(1 − R)|H2,1β2,1| | H1, A1} and E{(1 − R)H2,1β2,1 | H1, A1} in equation (12). Notice that

E{(1 − R)|H2,1β2,1| | H1, A1} = pr(R = 0 | H1, A1) ∫ |z| gH1,A1,R=0(z) dz.

We can use the IQ-learning machinery to obtain an estimate of gH1, A1, R=0(·). The logistic regression previously described provides an estimate of pr(R = 0 | H1, A1). To estimate gH1, A1, R=0(·), we first specify a model for the mean of the contrast function for non-responders. We posit the model

E{Δ̂(H2; β̂2) | H1, A1} = H1,0β0 + A1H1,1β1,   (13)

where H1,0 = (1, X1,1, X1,2) and H1,1 = (1, X1,1, X1,2). This model is also used along with the logistic regression model to estimate E{(1 − R)H2,1β2,1 | H1, A1} = E(H2,1β2,1 | H1, A1, R = 0) × pr(R = 0 | H1, A1), the fourth expectation term in (12). We fit model (13) using least squares. Exploratory analysis suggests that a constant variance assumption is reasonable, so we standardize the residuals of the mean fit using the sample standard deviation. A normal quantile-quantile plot of the standardized residuals suggests heavier tails than would be expected from a normal distribution. Thus, we opt to use the nonparametric density estimator described in Section 2·4.

Assembling the foregoing estimates of the four terms in equation (12) yields

Q̂1IQ(h1, a1; θ̂1) = (h1,0λ̂1,0 + a1h1,1λ̂1,1) expit(h1,0δ̂0 + a1h1,1δ̂1) + {1 − expit(h1,0δ̂0 + a1h1,1δ̂1)} [(1/2)(h1,0λ̂0,0 + a1h1,1λ̂0,1) + (1/2)(h1,0α̂0 + a1h1,1α̂1) − {(1 − a1)/4}(h1,0β̂0 + a1h1,1β̂1) + {(1 + a1)/4} ∫ |z| ĝh1,a1,R=0(z) dz],

where ĝh1,a1,R=0(·) is the nonparametric density estimator described in Section 2·4 and θ̂1 = (λ̂0,0, λ̂0,1, λ̂1,0, λ̂1,1, δ̂0, δ̂1, α̂0, α̂1, β̂0, β̂1).

The first-stage rule estimated by Q-learning treats all training data patients with a Selective Serotonin Reuptake Inhibitor. Roughly twelve percent of these patients are not recommended a Selective Serotonin Reuptake Inhibitor by the estimated first-stage IQ-learning rule. Broadly summarizing the IQ-learning rule, patients with very low recoded depression scores after the non-randomized stage should switch from Citalopram, a Selective Serotonin Reuptake Inhibitor, to another drug not in the class of Selective Serotonin Reuptake Inhibitors. This rule recommends a different strategy for patients who respond poorly to the initial Selective Serotonin Reuptake Inhibitor; otherwise Selective Serotonin Reuptake Inhibitor treatment strategies should continue.


Acknowledgments

We acknowledge grant support from the National Institutes of Health (Laber, Stefanski) and National Science Foundation (Stefanski). The authors thank the National Institutes of Health for providing the STAR*D data and Bibhas Chakraborty for his consultation on the analysis presented in Section 4.

Supplementary Material

Supplementary material available at Biometrika online includes proofs of the theoretical results stated in Section 2, additional simulation results, and a list of covariates used in Section 4.

Contributor Information

Eric B. Laber, Email: laber@stat.ncsu.edu.

Kristin A. Linn, Email: kalinn@ncsu.edu.

Leonard A. Stefanski, Email: stefansk@ncsu.edu.

References

  1. Bellman R. Dynamic Programming. Princeton: Princeton University Press; 1957.
  2. Blatt D, Murphy SA, Zhu J. A-Learning for Approximate Planning. Technical report, University of Michigan, Ann Arbor; 2004.
  3. Carroll RJ, Ruppert D. Transformation and Weighting in Regression. New York: Chapman and Hall; 1988.
  4. Chakraborty B, Moodie EE. Statistical reinforcement learning. In: Statistical Methods for Dynamic Treatment Regimes. Springer; 2013. pp. 31–52.
  5. Chakraborty B, Murphy SA, Strecher VJ. Inference for Non-Regular Parameters in Optimal Dynamic Treatment Regimes. Statistical Methods in Medical Research. 2010;19(3):317–343. doi: 10.1177/0962280209105013.
  6. Cook RD, Weisberg S. Residuals and Influence in Regression. New York: Chapman and Hall; 1982.
  7. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.
  8. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski SR, Lavori PW, Rosenbaum JF, Kupfer DJ; STAR*D Investigators Group. Background and Rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) Study. Psychiatric Clinics of North America. 2003;26(1):457+. doi: 10.1016/s0193-953x(02)00107-7.
  9. Henderson HV, Velleman PF. Building Multiple Regression Models Interactively. Biometrics. 1981;37(2):391–411.
  10. Henderson R, Ansell P, Alshibani D. Regret-regression for optimal dynamic treatment regimes. Biometrics. 2010;66(4):1192–1201. doi: 10.1111/j.1541-0420.2009.01368.x.
  11. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. New York: Springer; 2008.
  12. Laber E, Lizotte D, Ferguson B. Set-valued dynamic treatment regimes for competing outcomes. Biometrics. 2013. doi: 10.1111/biom.12132.
  13. Laber EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA. Statistical Inference in Dynamic Treatment Regimes. Under Review. 2014.
  14. Lavori PW, Dawson R. A Design for Testing Clinical Strategies: Biased Adaptive Within-Subject Randomization. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2000;163(1):29–38.
  15. Lavori PW, Dawson R. Dynamic Treatment Regimes: Practical Design Considerations. Clinical Trials. 2004;1(1):9–20. doi: 10.1191/1740774s04cn002oa.
  16. Lizotte DJ, Bowling M, Murphy SA. Linear fitted-Q iteration with multiple reward functions. Journal of Machine Learning Research. 2012;13:3253–3295.
  17. Moodie EE, Dean N, Sun YR. Q-learning: Flexible learning about useful utilities. Statistics in Biosciences. 2013:1–21.
  18. Moodie EE, Richardson TS. Estimating optimal dynamic regimes: Correcting bias under the null. Scandinavian Journal of Statistics. 2010;37(1):126–146. doi: 10.1111/j.1467-9469.2009.00661.x.
  19. Murphy SA. Optimal Dynamic Treatment Regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(2):331–355.
  20. Murphy SA. A Generalization Error for Q-Learning. Journal of Machine Learning Research. 2005a;6(7):1073–1097.
  21. Murphy SA. An Experimental Design for the Development of Adaptive Treatment Strategies. Statistics in Medicine. 2005b;24(10):1455–1481. doi: 10.1002/sim.2022.
  22. Rich B, Moodie EE, Stephens DA, Platt RW. Model checking with residuals for g-estimation of optimal dynamic treatment regimes. The International Journal of Biostatistics. 2010;6(2). doi: 10.2202/1557-4679.1210.
  23. Robins JM. Optimal Structural Nested Models for Optimal Sequential Decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. New York: Springer; 2004. pp. 189–326.
  24. Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner T, Kupfer DJ, Rosenbaum JF, Alpert J, Stewart JW, McGrath PJ, Biggs MM, Shores-Wilson K, Lebowitz BD, Ritz L, Niederehe G; STAR*D Investigators Group. Sequenced Treatment Alternatives to Relieve Depression (STAR*D): Rationale and Design. Controlled Clinical Trials. 2004;25(1):119–142. doi: 10.1016/s0197-2456(03)00112-0.
  25. Schulte PJ, Tsiatis AA, Laber EB, Davidian M. Q- and A-learning methods for estimating optimal dynamic treatment regimes. arXiv preprint arXiv:1202.4177. 2012. doi: 10.1214/13-STS450.
  26. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. 1st edition. Cambridge, MA: MIT Press; 1998.
  27. Wang L, Rotnitzky A, Lin X, Millikan RE, Thall PF. Evaluation of viable dynamic treatment regimes in a sequentially randomized trial of advanced prostate cancer. Journal of the American Statistical Association. 2012;107(498):493–508. doi: 10.1080/01621459.2011.641416.
  28. Zhao Y, Zeng D, Socinski MA, Kosorok MR. Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics. 2011;67(4):1422–1433. doi: 10.1111/j.1541-0420.2011.01572.x.
