Estimating tree-based dynamic treatment regimes using observational data with restricted treatment sequences

Nina Zhou; Lu Wang; Daniel Almirall

doi:10.1111/biom.13754

. Author manuscript; available in PMC: 2025 Nov 8.

Published in final edited form as: Biometrics. 2022 Oct 9;79(3):2260–2271. doi: 10.1111/biom.13754

Estimating tree-based dynamic treatment regimes using observational data with restricted treatment sequences

Nina Zhou ¹, Lu Wang ¹, Daniel Almirall ²

PMCID: PMC12593199 NIHMSID: NIHMS2110496 PMID: 36063542

Abstract

A dynamic treatment regime (DTR) is a sequence of decision rules that provide guidance on how to treat individuals based on their static and time-varying status. Existing observational data are often used to generate hypotheses about effective DTRs. A common challenge with observational data, however, is the need for analysts to consider “restrictions” on the treatment sequences. Such restrictions may be necessary for settings where (1) one or more treatment sequences that were offered to individuals when the data were collected are no longer considered viable in practice, (2) specific treatment sequences are no longer available, or (3) the scientific focus of the analysis concerns a specific type of treatment sequences (eg, “stepped-up” treatments). To address this challenge, we propose a restricted tree–based reinforcement learning (RT-RL) method that searches for an interpretable DTR with the maximum expected outcome, given a (set of) user-specified restriction(s), which specifies treatment options (at each stage) that ought not to be considered as part of the estimated tree-based DTR. In simulations, we evaluate the performance of RT-RL versus the standard approach of ignoring the partial data for individuals not following the (set of) restriction(s). The method is illustrated using an observational data set to estimate a two-stage stepped-up DTR for guiding the level of care placement for adolescents with substance use disorder.

Keywords: constrained optimization, dynamic treatment regime, observational studies, tree-based reinforcement learning, viable decision rules

1 |. INTRODUCTION

A dynamic treatment regime (DTR) is a sequence of decision rules that provide guidance on how to treat individuals based on their static and time-varying status (Murphy, 2003). In a DTR, a patient’s observed past and current disease status and medical history, including treatments offered in the past or adherence to those treatments, are used to decide what treatment to offer next.

Various methods have been developed for estimating an optimal DTR from observational data, where optimal is defined as maximizing the expected value of a desired health outcome. Methods such as marginal structural models with inverse probability weighting (IPW) (Murphy et al., 2001; Wang et al., 2012), Q- and A-learning methods (Murphy, 2003; Nahum-Shani et al., 2012; Schulte et al., 2014; Sutton & Barto, 1998; Watkins & Dayan, 1992), and dynamical system models (Rivera et al., 2007) stand out as the most commonly used methods for evaluating an optimal DTR.

New approaches for estimating optimal “tree-based DTRs” have been proposed more recently. Laber and Zhao (2015) introduced tree-based learning for estimating a single-stage treatment decision rule; they employed a novel inverse probability weighted purity measure to guide the tree construction. Tao et al. (2018) generalized the method with augmented AIPW for use with multi-stage multi-treatment observational data to estimate an optimal tree-based DTR (ie, multistage treatment regime). Sun and Wang (2021) proposed a stochastic tree search method. In a tree-based DTR, the decision rule at each stage is a collection of nested binary splits. Each split is on a value (ie, a threshold) along the range of one of the possible inputs for the decision rule. As a result, tree-based DTRs are potentially more easily interpretable than other types of DTRs because the thresholds are values on already interpretable metrics and scales. To better appreciate this distinction, consider Figure 1, which describes two hypothetical examples of decision rules to guide the level of care (LOC) placement for adolescents with substance use disorder, based loosely on our study’s motivating example. The general pattern is the same for both decision rules: recommend a higher LOC to individuals with severe conditions at program entry. However, the decision rule in panel (A) (tree-based DTR) is arguably easier to interpret than panel (B) (DTR that is not tree based). For panel (B), practitioners require some intuition about the meaning of the linear combination and relative “weights” of the two tailoring variables, in addition to the threshold value 3. In panel (A), on the other hand, the thresholds are more easy to understand; for individuals with mental health severity $(X_{1, 1})$ in the lower tercile of its range or with substance use severity $(X_{1, 2})$ in the lower 2 terciles of its range, the recommended treatment is outpatient care $(A_{1} = 1)$ while otherwise, the recommendation is inpatient care $(A_{1} = - 1)$ .

Compare a tree-based DTR (A) with a DTR that is not tree-based (B). To guide the level of care (LOC) placement for adolescents with substance use disorder, we consider two continuous candidate tailoring variables on the range [0,3] (from less to more severe), where $X_{1, 1}$ = mental health severity at program entry and $X_{1, 2}$ = substance use severity at program entry. Suppose the first-stage treatment recommendation $(A_{1})$ can be outpatient LOC $(A_{1} = 1)$ or inpatient LOC $(A_{1} = - 1)$ . The decision rule in panel (b) requires extra understanding of the linear combination $2 X_{1, 1} + X_{1, 2}$ including weights (2 for $X_{1, 1}$ and 1 for $X_{1, 2}$ ) and its cutoff 3, whereas the decision rule in panel (A) can be easily understood

In this paper, we develop a further extension of methods for estimating tree-based DTRs with observational data. The extension proposes a solution to a common challenge: the need to consider “restrictions” to the class of DTRs offered to individuals when the observational data were collected. There are at least two reasons why a domain scientist or analyst would be interested in considering a restricted class of DTRs in analyses with observational data. First, there may be concerns about the current viability or availability of some treatment sequences that were offered to individuals when the observational data were collected. For example, if a drug is withdrawn completely from the market, such as with the weight-loss drug lorcaserin (US Food and Drug Administration, 2020), then analyses of observational data that include this drug (Fidler et al., 2011; O’neil et al., 2012; Smith et al., 2010) may consider restricting to the class of DTRs that does not include lorcaserin. Similarly, if a drug is recalled by the Food and Drug Administration (FDA) due to problems in manufacturing and distribution, such as with lidocaine (US Food and Drug Administration, 2021), then analyses of observational data that include this drug (Hall et al., 2016) may consider restricting to the class of DTRs that does not include lidocaine. In both examples, although in the data, some individuals are observed to be offered the drug that is no longer considered available or viable, respectively, when estimating an optimal tree-based DTR for future use, one needs to exclude such treatments at one or more stages. A second reason is if the scientific focus of the analysis concerns a specific type of DTR, but when the data were collected, some individuals were observed under a broader set of DTRs. This is the case with our motivating example, which seeks to estimate a two-stage DTR guiding the LOC placement for adolescents with substance use disorder. In the data, it was observed that adolescents receiving inpatient treatment in stage 1 could have treatment discontinued in stage 2 (ie, a “stepped down” strategy). However, the domain scientists we are working with were interested solely in the class of “stepped up” DTRs (Sobell & Sobell, 2000), whereby more resource-intensive treatments (such as inpatient treatment) are reserved as a stage 2 treatment for individuals who do not benefit from a less resource-intensive stage 1 treatment.

To address this challenge, we propose an restricted tree–based reinforcement learning (RT-RL) method that searches for a tree-based DTR with the maximum expected outcome, given a (set of) user-specified restriction(s) on treatment options at one or more stages that ought not to be considered as part of the estimated tree-based DTR. For additional protection against time-varying confounding bias and to potentially improve the method’s statistical efficiency, the proposed method employs a doubly robust augmented inverse probability weighted (AIPW) estimator as part of the algorithm.

The rest of the paper is organized as follows: In Section 2, we formalize the restricted optimization problem, describe a proposed solution for handling the patient history and treatment options, and present the RT-RL estimating procedure. In Section 3, using simulation studies, we compare the performance of the proposed method with a naive method where the observed data along the restricted sequence(s) are censored from the analysis. In Section 4, we apply the RT-RL approach to the Global Appraisal for Individual Needs (GAIN) data to estimate a two-stage stepped-up DTR to guide the LOC placement for adolescents with substance use disorder. The results of the proposed method are compared with those using the regular T-RL and with nondynamic regimes. Finally, in Section 5, we summarize and discuss the findings.

2 |. METHOD

2.1 |. Notation and setup

We consider an observational study with $n$ patients and $T$ treatment stages where there are $K_{j} (K_{j} \geq 2)$ potential treatment options at the $j th$ treatment stage, $j = 1, \dots, T$ . Patients are observed to follow one of the treatments available at each stage. For brevity, we suppress the patient index $i (i = 1, \dots, n)$ when no confusion exists. Let $X_{j}$ denote patient’s covariates observed during the 𝑗th treatment stage and $A_{j}$ denote the treatment at the $j th$ treatment stage, where $A_{j}$ may take a value $a_{j}$ , a specific value for the $j th$ treatment that belongs to the stage-specific treatment space $A_{j}$ containing $K_{j}$ treatment options. As a general notation in the following, we use an overline “-” and a subscript $j$ on a variable to denote the history of this variable up to the $j th$ stage. Thus, ${\bar{A}}_{j - 1} \equiv (A_{1}, \dots, A_{j - 1})$ denotes a random vector of past treatments prior to stage $j$ and ${\bar{a}}_{T} \equiv (a_{1}, \dots, a_{T})$ denotes a specific treatment route. We denote the clinical outcome observed following $A_{j}$ with $R_{j}$ , also known as rewards, which depends on the precedent patient characteristics ${\bar{X}}_{j}$ and previous treatments ${\bar{A}}_{j}$ . We consider the overall outcome of interest $Y$ as some functional of the reward history such that $Y \equiv f (R_{1}, \dots, R_{T})$ , where $f (.)$ is a prespecified function (eg, sum), assuming $Y$ is bounded with larger values preferred. To recommend personalized optimal treatment for future patients at stage $j$ , we would like to use the observational data to infer an optimal decision rule depending on the patient history $H_{j} = {({\bar{A}}_{j - 1}, {\bar{X}}_{j}^{T})}^{T} \in H_{j}$ at stage $j$ .

However, some of the observed treatment sequences are not viable for future patients when certain “restrictions” to the class of DTRs are required as introduced in Section 1. We call such treatment routes as nonviable treatment arms, which should be excluded from the domain of the DTRs of interest during estimation. Suppose there are $M_{j}$ nonviable arms at stage $j$ , and denote them as ${\bar{b}}_{m, j} = (b_{1, m, j}, \dots, b_{j, m, j})$ where $m = 1, \dots, M_{j}$ and $b_{j^{'}, m, j} \in A_{j^{'}}$ for $j^{'} = 1, \dots, j$ . Excluding nonviable treatment arms from the observed treatment stages, we obtain a set of viable treatment sequences from stage 1 to stage $j$ denoted as ${\bar{A}}_{j}^{r e s} \equiv {\bar{A}}_{j} / \{{\bar{b}}_{m, j}, m = 1, \dots, M_{j}\}$ . Correspondingly, the domain of viable patient history is $H_{j}^{r e s} = \{H_{j} : {\bar{A}}_{j - 1} \in {\bar{A}}_{j - 1}^{r e s}\}$ . In the rest of the paper, we denote the random variable of viable patient history as $H_{j}^{r e s}$ , and the value of $H_{j}^{r e s}$ is in $H_{j}^{r e s}$ .

We use $g^{r e s} = (g_{1}^{r e s}, \dots, g_{T}^{r e s})$ to denote a sequence of restricted viable rules for personalized treatment decisions across the different treatment stages, that is, a viable restricted DTR, where $g_{j}^{r e s}$ is a meaningful mapping function from the domain of restricted patient history $H_{j}^{r e s}$ to $A_{j}^{r e s}$ , the range of viable treatment options at $j$ , given the past. Denote the collection of such restricted meaningful mappings as $G^{res} = (G_{1}^{res}, \dots, G_{T}^{res})$ . Specifically, $g_{j}^{res} \in G_{j}^{res}$ maps from viable patient histories to applicable stage $j$ treatments conditional on ${\bar{A}}_{j - 1}$ such that $({\bar{A}}_{j - 1}, g_{j}^{r e s} (H_{j}^{r e s})) \in {\bar{A}}_{j}^{r e s}$ .

To identify the optimal restricted DTR among $G^{r e s}$ , we consider the counterfactual framework for causal inference defined in Robins (1986). At stage $T$ , let $Y^{*} (A_{1}, \dots, A_{T - 1}, a_{T})$ or $Y^{*} (a_{T})$ denote the counterfactual outcome for patients had all received treatment $a_{T}$ at stage $T$ conditional on previous treatments. Our goal is to find the optimal one among the restricted DTRs in $G_{T}^{res}$ that maximizes the expected counterfactual outcome, that is,

g_{T}^{res,opt} = \underset{g_{T}^{r e s} (H_{T}^{r e s}) \in G_{T}^{r e s}}{a r g m a x} E [\sum_{a_{T} \in A_{T}^{r e s}} Y^{*} (a_{T}) I \{g_{T}^{r e s} (H_{T}^{r e s}) = a_{T}\}] .

For any stage $j$ before $T$ , we seek the best treatment decision rule by maximizing the expected counterfactual outcome with all future treatments optimized to avoid confounding. This is because the causal impact of current treatment on the expected counterfactual outcome can only be examined when future treatments are optimized and past history is controlled. We denote the counterfactual outcome at stage $j$ given ${\bar{A}}_{j - 1}$ and future optimal treatments $g_{j + 1}^{res, opt}, \dots, g_{T}^{res, opt}$ as $Y^{*} ({\bar{A}}_{j - 1}, a_{j}, g_{j + 1}^{res, opt}, \dots, g_{T}^{res, opt})$ . Sincesuchcounterfactual outcomes cannot be observed, we estimate the stage-wise pseudo-outcome with observed data considering viable restricted treatment space by

P O_{j}^{r e s} = {\hat{E}}^{r e s} \{Y^{*} (A_{1}, \dots, A_{j}, g_{j + 1}^{res, opt}, \dots, g_{T}^{res, opt})\},

at stage $j$ for $j = 1, \dots, T - 1$ , where ${\hat{E}}^{r e s}$ denotes the estimated expectation considering only the viable patient history and treatment routes space, $g_{j}^{r e s, o p t} \in G_{j}^{r e s}$ and $(A_{1}, \dots, A_{j}) \in {\bar{A}}_{j}^{r e s}$ , which is equivalent to the recursive form

P O_{j}^{r e s} = \hat{E} \{P O_{j + 1}^{r e s} ∣ A_{j + 1} = g_{j + 1}^{res,opt} (H_{j + 1}^{r e s}), H_{j + 1}^{r e s}\} .

At the final stage, we define $P O_{T}^{r e s} = Y$ . Let ${PO}_{j}^{r e s, *} ({\bar{A}}_{j - 1}, a_{j})$ , or ${PO}_{j}^{r e s, *} (a_{j})$ for brevity, denote the counterfactual pseudo-outcome for a patient treated with $a_{j} \in A_{j}^{r e s}$ and viable past treatments $(A_{1}, \dots, A_{j - 1}) \in {\bar{A}}_{j - 1}^{r e s}$ . We have

{PO}_{j}^{r e s, *} (g_{j}^{r e s}) = \sum_{a_{j} \in A_{j}^{r e s}} {PO}_{j}^{r e s, *} (a_{j}) I \{g_{j}^{r e s} (H_{j}^{r e s}) = a_{j}\} .

Our optimization problem at stage $j (j < T)$ among the meaningful mappings becomes:

g_{j}^{res,opt} = \underset{g_{j} (H_{j}^{r e s}) \in G_{j}^{r e s}}{a r g m a x} E [{PO}_{j}^{r e s, *} (g_{j}^{r e s})] .

2.2 |. Constrained optimization procedure

We make the following three assumptions considering only the applicable treatment routes (Murphy et al., 2001; Robins & Hernán, 2009):

Consistency. The observed outcome would be the same as the counterfactual outcome that would have been observed given the same treatments, which indicates $Y = \sum_{a_{T} \in A_{T}^{r e s}} Y^{*} (a_{T}) I (A_{T} = a_{T})$ for stage $T$ , where $A_{T} \in A_{T}^{r e s}$ . Similarly, the pseudo-outcome agrees with the counterfactual pseudo-outcome $P O_{j}^{r e s} = \sum_{a_{j} \in A_{j}^{r e s}} {PO}_{j}^{r e s, *} (a_{j}) I \{A_{j} = a_{j}\}$ for stage $j < T$ and $A_{j} \in A_{j}^{r e s}$ .
No unmeasured confounding. For any treatment sequence ${\bar{a}}_{T} = (a_{1}, \dots, a_{T})$ , treatment $A_{j}$ is independent of future outcomes (rewards), given $H_{j}^{r e s} \in H_{j}^{r e s}$ . Specifically, $A_{j} ⊥ (R_{j} ({\bar{a}}_{j}), \dots, R_{T} ({\bar{a}}_{T})) ∣ H_{j}^{r e s}, \forall j = 1, \dots, T$ .
Positivity. There exist constants $0 < c_{0} < c_{1} < 1$ such that with probability 1 the propensity score $π_{a_{j}} (H_{j}^{r e s}) = P r (A_{j} = a_{j} ∣ H_{j}^{r e s}) \in (c_{0}, c_{1})$ , for $A_{j} \in A_{j}^{r e s}$ .

With the three assumptions, we can bridge the pseudo-outcome derived from the observational data with the expected counterfactual pseudo-outcome for a specific $g_{j}^{r e s} \in G_{j}^{r e s}$ at any stage $j < T$ . Conditional on $H_{j}^{r e s}$ , we have

\begin{array}{l} E \{{PO}_{j}^{r e s, *} (g_{j}^{r e s})\} \\ = E_{H_{j}^{r e s}} [\sum_{a_{j} \in A_{j}^{r e s}} E \{{PO}_{j}^{r e s, *} (a_{j}) ∣ H_{j}^{r e s}\} I \{g_{j}^{r e s} (H_{j}^{r e s}) = a_{j}\}] . \end{array}

With no unmeasured confounder assumption, it is easy to show that

\begin{array}{l} E \{{PO}_{j}^{r e s, *} (g_{j}^{r e s})\} \\ = E_{H_{j}^{r e s}} [\sum_{a_{j} \in A_{j}^{r e s}} E \{{PO}_{j}^{r e s, *} (a_{j}) ∣ A_{j} = a_{j}, H_{j}^{r e s}\} I \{g_{j}^{r e s} (H_{j}^{r e s}) = a_{j}\}] . \end{array}

Then, using consistency assumption and positivity assumption, we can link

\begin{array}{l} E \{{PO}_{j}^{r e s, *} (g_{j}^{r e s})\} \\ = E_{H_{j}^{r e s}} [\sum_{a_{j} \in A_{j}^{r e s}} E \{{PO}_{j}^{r e s} ∣ A_{j} = a_{j}, H_{j}^{r e s}\} I \{g_{j}^{r e s} (H_{j}^{r e s}) = a_{j}\}] . \end{array}

Let $μ_{j, a_{j}}^{r e s} (H_{j}^{r e s}) = E (P O_{j}^{r e s} ∣ A_{j} = a_{j}, H_{j}^{r e s})$ , then our goal is to find

g_{j}^{res,opt} = \underset{g_{j}^{r e s} \in G_{j}^{r e s}}{a r g m a x} E_{H_{j}^{r e s}} [\sum_{a_{j} \in A_{j}^{r e s}} μ_{j, a_{j}}^{r e s} (H_{j}^{r e s}) I \{g_{j}^{r e s} (H_{j}^{r e s}) = a_{j}\}],

at stage $j$ in the space of applicable treatment options. Likewise, we have

g_{T}^{res,opt} = \underset{g_{T}^{r e s} \in G_{T}^{res}}{a r g m a x} E_{H_{T}^{r e s}} [\sum_{a_{T} \in A_{T}^{r e s}} μ_{T, a_{T}}^{r e s} (H_{T}^{r e s}) I \{g_{T}^{r e s} (H_{T}^{r e s}) = a_{T}\}]

at stage $T$ , with $μ_{T, a_{T}}^{r e s} (H_{T}^{r e s}) = E (Y ∣ A_{T} = a_{T}, H_{T}^{r e s})$ .

The proposed method RT-RL utilizes the backward induction technique (Bather, 2000) to estimate the restricted DTR. We estimate the counterfactual mean outcome using an AIPW estimator similar to Tao and Wang (2017). Specifically, we modify the estimation of $E \{Y^{*} (a_{T})\}$ for $a_{T} \in A_{T}^{r e s}$ under the viable patient history space $H_{T}^{r e s}$ at final stage $T$ . Suppose that we have $n$ patients with $n^{r e s, T}$ patients received applicable treatment combinations until stage $T$ . We propose to estimate $E \{Y^{*} (a_{T})\}$ with $ℙ_{n^{r e s, T}} \{{\hat{μ}}_{T, a_{T}}^{res,AIPW} (H_{T}^{r e s})\}$ , where ${\hat{μ}}_{T, a_{T}}^{res,AIPW} (H_{T}^{r e s}) = \frac{I (A_{T} = a_{T})}{{\hat{π}}_{T, a_{T}} (H_{T}^{r e s})} Y + \{1 - \frac{I (A_{T} = a_{T})}{{\hat{π}}_{T, a_{T}} (H_{T}^{r e s})}\} {\hat{μ}}_{T, a_{T}}^{r e s} (H_{T}^{r e s})$ , ${\hat{π}}_{T, a_{T}} (H_{T}^{r e s})$ is the estimated propensity score, and ${\hat{μ}}_{T, a_{T}}^{r e s} (H_{T}^{r e s})$ denotes the estimated conditional mean.

Proposition 1 (Double Robustness).

Suppose patient observations ${\{{\bar{X}}_{i, T}, {\bar{A}}_{i, (T - 1)}, A_{i, T}, Y_{i}\}}_{i = 1}^{n}$ are independent and identically distributed. $A$ subset of $n^{r e s, T}$ patients have viable past treatment routes until stage $T$ . Without loss of generality, we define viable patient observations as ${\{H_{i, T}^{r e s}, A_{i, T}, Y_{i}\}}_{i = 1}^{n^{r e s, T}} \equiv {\{{\bar{X}}_{i, T}, {\bar{A}}_{i, (T - 1)}, A_{i, T}, Y_{i}\}}_{i = 1}^{n^{r e s, T}}$ such that ${\bar{A}}_{i, (T - 1)} \in {\bar{A}}_{T - 1}^{r e s}$ . $ℙ_{n^{res, T}} \{{\hat{μ}}_{T, a_{T}}^{r e s, AIPW} (H_{T}^{r e s})\}$ is a consistent estimator of $E \{Y^{*} (a_{T})\}$ if either the propensity score model of $π_{T, a_{T}} (H_{T}^{r e s})$ or the conditional mean model of $μ_{T, a_{T}}^{r e s} (H_{T}^{r e s})$ is correctly specified (the proof for Proposition 1 is provided in Supporting Information A).

For the final stage $T$ , we propose to estimate the optimal regime with

\begin{array}{l} {\hat{g}}_{T}^{res,opt} = \underset{g_{T}^{r e s} \in G_{T}^{res}}{a r g m a x} ℙ_{n^{r e s, T}} [\sum_{a_{T} \in A_{T}^{r e s}} {\hat{μ}}_{T, a_{T}}^{r e s, A I P W} (H_{T}^{r e s}) I \{g_{T}^{r e s} (H_{T}^{r e s}) = a_{T}\}] \\ = \underset{g_{T}^{r e s} \in G_{T}^{res}}{a r g m a x} \frac{1}{n^{r e s, T}} \sum_{i = 1}^{n^{r e s, T}} [\frac{I (A_{i, T} = g_{T}^{r e s} (H_{i, T}^{r e s}))}{{\hat{π}}_{T, g_{T}^{r e s}} (H_{i, T}^{r e s})} Y_{i} \\ + \{1 - \frac{I (A_{i, T} = g_{T}^{r e s} (H_{i, T}^{r e s}))}{{\hat{π}}_{T, g_{T}^{r e s}} (H_{i, T}^{r e s})}\} {\hat{μ}}_{T, g_{T}^{r e s}}^{r e s} (H_{i, T}^{r e s})] . \end{array}

Similarly, for stage $j (1 \leq j < T)$ , our proposed estimator ${\hat{g}}_{j}^{res,opt}$ is

\begin{array}{l} \underset{g_{j}^{r e s} \in G_{j}^{res}}{a r g m a x} \frac{1}{n^{r e s, j}} \sum_{i = 1}^{n^{r e s, j}} [\frac{I (A_{i, j} = g_{j}^{r e s} (H_{i, j}^{r e s}))}{{\hat{π}}_{j, g_{j}^{r e s}} (H_{i, j}^{r e s})} P O_{i, j}^{r e s} \\ + \{1 - \frac{I (A_{i, j} = g_{j}^{res} (H_{i, j}^{r e s}))}{{\hat{π}}_{j, g_{j}^{res}} (H_{i, j}^{r e s})}\} {\hat{μ}}_{j, g_{j}^{res}}^{res} (H_{i, j}^{r e s})], \end{array}

given a propensity score model and a conditional mean model.

We utilize tree-based reinforcement learning (T-RL) to search for the optimal treatment regime that closely follows the procedure proposed by Tao et al. (2018), but only considering the viable restricted space. We propose to use the AIPW estimator $ℙ_{n^{r e s, j}} \{{\hat{μ}}_{j, a_{j}}^{res, AIPW} (H_{j}^{r e s})\}$ in the purity measure for tree model construction. The tree model divides patients into subgroups with similar histories and recommends the corresponding optimal treatment to each subgroup, maximizing the estimated group average stage-wise pseudo-outcome. The proposed procedure provides decision rules at each stage to decide on viable individualized treatment tailored by patient characteristics recorded in $H_{j}^{r e s}$ . While considering a partition over node $Ω$ that split patients into two groups $ω$ and $ω^{c}$ , we define $P_{j} (Ω, ϕ)$ , the purity measure before the partition, as:

\begin{array}{l} \max_{a_{j} \in A_{j}^{r e s}} ℙ_{n^{r e s, j}} [\{\frac{I (A_{j} = a_{j})}{{\hat{π}}_{j, a_{j}} (H_{j}^{r e s})} {PO}_{j}^{r e s} + \{1 - \frac{I (A_{j} = a_{j})}{{\hat{π}}_{j, a_{j}} (H_{j}^{r e s})}\} {\hat{μ}}_{j, a_{j}}^{r e s} (H_{j}^{r e s})\}\} \\ \times I (H_{j}^{r e s} \in Ω)], \end{array}

for $j = 1, \dots, T$ . We compare $P_{j} (Ω, ϕ)$ with $P_{j} (Ω, ω)$ , the purity measure after the partition: $\max_{a_{1}, a_{2} \in A_{j}^{r e s}} ℙ_{n^{r e s, j}} \{\frac{I (A_{j} = a_{j})}{{\hat{π}}_{j, a_{j}} (H_{j}^{r e s})} {PO}_{j}^{r e s} + \{1 - \frac{I (A_{j} = a_{j})}{{\hat{π}}_{j, a_{j}} (H_{j}^{r e s})}\} {\hat{μ}}_{j, a_{j}}^{r e s} (H_{j}^{r e s})\}\} \times I \{g_{j, ω, a_{1}, a_{2}}^{r e s} (H_{j}^{r e s}) = a_{j}\} I (H_{j}^{r e s} \in Ω)]$ , where $g_{j, ω, a_{1}, a_{2}}^{r e s}$ is the treatment rule that assigns $a_{1}$ to patient subgroup $ω$ and assigns $a_{2}$ to patient subgroup $ω^{c}$ . We chose the best split $ω$ based on the largest improvement in purity measure $P_{j} (Ω, ω) - P_{j} (Ω, ϕ)$ .

2.3 |. Implementing RT-RL

We implement RT-RL by restricting and allocating patients based on past treatment sequences during the optimization procedure. This implies that only patientswith viable treatment combinations up to stage $j$ , that is, ${\bar{A}}_{j} \in {\bar{A}}_{j}^{r e s}$ , can contribute to the treatment regime estimation at current stage $j$ for $j = 1, \dots, T$ . Among patients with viable treatment sequences up to stage $j - 1$ but at risk of receiving nonviable sequences at stage $j$ , we fit separate T-RL models based on past treatment sequences. In each model, we search the best treatment among $\{a_{j} \in A_{j}^{r e s} : ({\bar{A}}_{j - 1}, a_{j}) \in {\bar{A}}_{j}^{r e s}\}$ . An additional T-RL is fitted using the remaining patient records without restrictions on treatment options.

Moreover, when the observed treatments of a patient satisfy ${\bar{A}}_{j} \in {\bar{A}}_{j}^{r e s}$ and ${\bar{A}}_{j^{*}} \notin {\bar{A}}_{j^{*}}^{r e s}$ , $\forall j^{*} > j$ with $j < T$ , this patient can only contribute to the optimal regime estimation for the first $j$ stages. To use the information in this patient record, we need to estimate what is the best viable treatment within the restricted set and adjust for any suboptimal loss by calculating $P O_{j}^{res}$ using data from similar patients with applicable treatments until stage $j + 1$ .

When fitting tree models, we adapt the stopping criteria proposed by Tao et al. (2018). Node size $(n_{0})$ controls the minimal number of subjects in each final node, and the depth of the tree $(d)$ avoids complicated regimes. Moreover, a positive constant $λ$ for minimal purity improvement is used for examining if the current best split ${\hat{ω}}^{opt} = a r g m a x_{ω} [P_{j} (Ω, ω)]$ satisfies $P_{j} (Ω, ω) - P_{j} (Ω, ϕ) > λ$ , which guarantees user-specified meaningful splits. We set values for $n_{0}$ , $d$ , and $λ$ for pruning based on specific applications. A detailed RT-RL algorithm for $T \geq 2$ can be found in Supporting Information B.

For demonstration purposes, from now on, we consider a two-stage problem $(T = 2)$ with $M$ nonviable treatment routes at stage 2 denoted as ${\bar{b}}_{m, 2} = (b_{1, m}, b_{2, m})$ , $m = 1, \dots, M$ , where $b_{1, m}$ and $b_{2, m}$ denote treatments for stage 1 and stage 2, respectively. The optimal DTR should consider only viable treatment routes that have $A_{1} \notin \{b_{1, 1}, \dots, b_{1, M}\}$ or $A_{1} = b_{1, m}$ and $A_{2} \neq b_{2, m}$ , $\forall m$ . In stage 2, we estimate the restricted treatment decision rule by allocating patients based on if $A_{1}$ agrees with any nonviable sequences. First, we construct an overall conditional mean model ${\hat{μ}}_{2, a_{2}}^{r e s}$ with patients who have received viable treatment routes. Then, we create $M + 1$ groups so that patients with $A_{1} = b_{1, m}$ are allocated to the $m th$ group for $m = 1, 2, \dots, M$ , and the rest are in the 0th group. We define the restricted patient history, treatment set, and treatment rules as follows. For the 0th group: $H_{0, 2}^{res} = H_{2} ∖ \{(b_{1, m}, {\bar{X}}_{2}^{T}) : m = 1, \dots, M\}$ , $A_{0, 2}^{res} = A_{2}$ , and $G_{0, 2}^{res} = \{g_{2}^{r e s} : H_{0, 2}^{res} \mapsto A_{0, 2}^{res}\}$ . For the $m th$ group: $H_{m, 2}^{res} = \{(b_{1, m}, {\bar{X}}_{2}^{T})$ , $A_{m, 2}^{res} = A_{2} ∖ \{b_{2, m}\}$ , and $G_{m, 2}^{res} = \{g_{2}^{r e s} : H_{m, 2}^{res} \mapsto A_{m, 2}^{res}\}$ , where $m = 1, \dots, M$ .

Group-specific propensity scores are estimated using data from patients who received applicable treatment routes

{\hat{π}}_{2, A_{2}} (H_{2}^{r e s}) = \{\begin{matrix} {\hat{π}}_{2, A_{2}}^{0} (H_{2}^{r e s}), H_{2}^{r e s} \in H_{0, 2}^{r e s} \\ {\hat{π}}_{2, A_{2}}^{m} (H_{2}^{r e s}), A_{2} \in A_{m, 2}^{r e s} and H_{2}^{r e s} \in H_{m, 2}^{r e s}, m = 1, \dots, M . \end{matrix}

Then, separate T-RL models are fitted using patient observations in each restricted group that have ${\bar{a}}_{2} = (b_{1, m}, c_{2, m})$ , where $c_{2, m} \neq b_{2, m}$ . We use all patients in the 0th group to fit the remaining T-RL model since all observed treatment sequences in this group are compatible with those of our interest.

Then the restricted optimal decision rule ${\hat{g}}_{2}^{r e s, o p t}$ at stage 2 is estimated by:

\{\begin{array}{l} \underset{g_{2}^{res} \in G_{m, 2}^{res}}{a r g m a x} ℙ_{n^{res, m, 2}} [\sum_{a_{2} \in A_{m, 2}^{res}} μ_{2, a_{2}}^{r e s, A I P W} (H_{2}^{r e s}) I \{g_{2}^{r e s} (H_{2}^{r e s}) = a_{2}\}] & H_{2}^{r e s} \in H_{m, 2}^{r e s} \\ \underset{g_{2}^{res} \in G_{0, 2}^{res}}{a r g m a x} ℙ_{n^{res, 0, 2}} [\sum_{a_{2} \in A_{2}} μ_{2, a_{2}}^{r e s, A I P W} (H_{2}^{r e s}) I \{g_{2}^{r e s} (H_{2}^{r e s}) = a_{2}\}] & H_{2}^{r e s} \in H_{0, 2}^{r e s} \end{array},

where $m = 1, \dots, M$ , $n^{res, m, 2}$ is the number of patients in the $m th$ group with $A_{2} \neq b_{2, m}$ while $n^{r e s, 0, 2}$ is the number of patients in the 0th group. Notice that $H_{2}^{res} = H_{0, 2}^{res} \cup H_{1, 2}^{res} \cup \dots \cup H_{M, 2}^{res}$ .

In stage 1, we perform backward induction to estimate the optimal decision rule. The pseudo-outcome $P O_{1}^{res}$ is derived as $\hat{E} \{Y ∣ A_{2} = {\hat{g}}_{2}^{r e s, o p t} (H_{2}^{r e s}), H_{2}^{r e s}\} \overset{Δ}{=} {\hat{μ}}_{2, {\hat{g}}_{2}^{r e s, o p t}} (H_{2}^{r e s})$ , which is the estimated expectation of the potential outcome one would have observed if the future treatment at stage 2 is already optimized by ${\hat{g}}_{2}^{r e s, o p t}$ . At stage 1, we do not have any constrains on $A_{1}$ , $H_{1}$ , and $G_{1}$ . Hence, we have $H_{1}^{r e s} = H_{1}$ . Let ${\hat{μ}}_{1, a_{1}}^{res,AIPW} (H_{1})$ denote $\hat{E} (P O_{1}^{r e s} ∣ A_{1} = a_{1}, H_{1})$ . The optimal treatment decision rule at stage 1 is estimated as:

{\hat{g}}_{1}^{r e s, o p t} = \underset{g_{1}^{r e s} \in G_{1}}{a r g m a x} ℙ_{n} [\sum_{a_{1} \in A_{1}} {\hat{μ}}_{1, a_{1}}^{r e s, A I P W} (H_{1}) I \{g_{1}^{r e s} (H_{1}) = a_{1}\}] .

In practice, we use a modified pseudo-outcome ${\tilde{PO}}_{j}^{res}$ (Huangetal., 2015) to reduce the bias due to possiblemodel misspecification from the conditional model:

{\tilde{PO}}_{j}^{res} = {\tilde{PO}}_{j + 1}^{res} + {\hat{μ}}_{j, g_{j}^{r e s, o p t}}^{res} (H_{j}^{r e s}) - {\hat{μ}}_{j, A_{j}}^{r e s} (H_{j})

for $j < T$ . With backward induction and defining ${\tilde{PO}}_{T}^{res} = Y$ , it is easy to show that:

{\tilde{PO}}_{j}^{res} = Y + \sum_{t = j + 1}^{T} \{{\hat{μ}}_{t, g_{t}^{r e s, o p t}}^{res} (H_{t}^{r e s}) - {\hat{μ}}_{t, A_{t}}^{res} (H_{t})\}

for $j < T$ . Thus, for stage 1, since $H_{1}^{r e s} = H_{1}$ we have:

{\hat{μ}}_{1, a_{1}}^{res, AIPW} (H_{1}) = \frac{I (A_{1} = a_{1})}{{\hat{π}}_{1, a_{1}} (H_{1})} {\tilde{PO}}_{1}^{res} + \{1 - \frac{I (A_{1} = a_{1})}{{\hat{π}}_{1, a_{1}} (H_{1})}\} {\hat{μ}}_{1, a_{1}} (H_{1}),

where ${\hat{π}}_{1, a_{1}} (H_{1})$ and ${\hat{μ}}_{1, a_{1}} (H_{1})$ are estimated using all records.

To give an illustrative example, Figure 2 shows a two-stage three-treatment RT-RL patient allocation procedure with a nonviable arm $A_{1} = 1$ , $A_{2} = 2$ . In the second stage, patients with $A_{1} = 1$ are allocated to the first group for being at risk of receiving the nonviable arm. We use patient records with $A_{1} = 1$ , $A_{2} \neq 2$ to fit the T-RL that seeks personalized optimal treatment among $A_{2} = 0$ or $A_{2} = 1$ for patients in group 1. Then, we “correct” the outcomes for patients in the nonviable arm with the pseudo-outcomes ${\tilde{PO}}_{1}^{res}$ so that they can contribute to the first-stage optimal treatment decision rule estimation. This procedure optimizes the restricted patients’ second-stage treatment by only considering viable treatment combinations. Note that their best viable treatment is estimated by patients with similar characteristics who received viable treatments. The rest of the patients are allocated to the 0th group, where a normal T-RL is fitted with no restrictions. The treatment regimes estimated by RT-RL provide viable treatment recommendations to future patients based on past treatment sequences.

Patient allocation for a two-stage three-treatment RT-RL estimation with nonviable route $A_{1} = 1$ , $A_{2} = 2$ . Group 0 contains patients with no risk of receiving the nonviable route. Group 1 contains patients with $A_{1} = 1$ and hence are at risk of receiving the nonviable route. Patients with the nonviable treatment route only contribute to stage 1 decision rule estimation. Their stage 2 rewards will be “corrected” according to similar patients with $A_{1} = 1$ , $A_{2} \neq 2$ to create pseudo-outcomes in stage 1

Through this process, we use the observed data of the patients in the nonviable route before their last stage to provide better estimates of $g_{1}^{r e s, o p t}$ and avoid selection bias in the study population. In the meanwhile, the above process accommodates the restrictions during the optimization procedure and guarantees the estimated optimal regime is interpretable and viable in practice.

3 |. SIMULATION STUDIES

We conduct simulation studies to evaluate the performances of the proposed RT-RL compared to a naive method, T-RL with data censoring, that is, using the regular T-RL and ignoring partial patient records in the nonviable arm. The simulation studies have two treatment stages $(T = 2)$ , and three treatments $(K = 3)$ per stage ( $A_{j} \in {0, 1, 2}$ for $j = 1, 2$ ) with one nonviable treatment route ( $A_{1} = 1$ and $A_{2} = 2$ ). Specifically, in our setup, patients in the nonviable treatment route have healthier baseline conditions and better outcomes. Thus, the traditional methods using the observed compatible data would still try to recommend the nonviable route to future patients in order to maximize the health outcomes.

We generate three continuous covariates $(X_{1}, X_{2}, X_{3})$ independently from $N (0, 1)$ . The clinical outcomes are generated such that a higher value of the sum of rewards at each stage $(Y = Y_{1} + Y_{2})$ is desired. Treatment $A_{1}$ is generated from a multinomial distribution $P (A_{1} = k) = π_{k 1}$ , $k = 0, 1, 2$ with $π_{0, 1} = \frac{p_{1, 1}}{p_{1, 1} + p_{2, 1} + 1}$ , $π_{1, 1} = \frac{p_{2, 1}}{p_{1, 1} + p_{2, 1} + 1}$ , and $π_{2, 1} = 1 - π_{0, 1} - π_{1, 1}$ , where $p_{1, 1} = \exp \{- 0.2 X_{1} + 0.3 X_{2} - 0.2 I (X_{3} > - 0.5) + 0.5\}$ , $p_{2, 1} = \exp \{0.3 X_{2} + 1.5 I (X_{3} > - 0.5) + 0.5\}$ . The underlying optimal treatments for stage 1 follow:

g_{1}^{opt} (H) = \{\begin{matrix} 0 & X_{1} < 0, X_{2} < 0 \\ 1 & X_{2} \geq 0 \\ 2 & X_{1} \geq 0, X_{2} < 0. \end{matrix}

The reward at stage 1 follows: $Y_{1} = \exp (1 + 0.05 * X_{2} - 2 * |A_{1} - A_{opt}|) + ϵ$ , where $ϵ ~ N (0, 0.5)$ . To mimic a typical setting, where some observed treatments with good targeted clinical outcomes are not viable for future patients under certain “restrictions,” we allow a large portion of patients who received the nonviable treatment sequence to have better clinical outcomes than viable sequences. We assign $A_{2}$ conditional on $A_{1}$ such that patients have a greater possibility of receiving the nonviable arm when $A_{1} = 1$ . $A_{2}$ follows a three-level multinomial distribution with corresponding probabilities $(π_{0, 2}, π_{1, 2}, π_{2, 2}) = (\frac{p_{1, 2}}{p_{1, 2} + p_{2, 2} + 1}, \frac{p_{2, 2}}{p_{1, 2} + p_{2, 2} + 1}, \frac{1}{p_{1, 2} + p_{2, 2} + 1})$ , where $p_{1, 2} = \exp \{0.05 Y_{1} - 0.05 X_{1} - 0.2 I (X_{3} > - 0.5) - 1\}$ and $p_{2, 2} = \exp \{0.08 Y_{1} - 0.2 X_{1} - 0.1 I (X_{3} > - 0.5) - 1.6\}$ . On the other hand, when $A_{1} \neq 1$ , $(π_{0, 2}^{'}, π_{1, 2}^{'}, π_{2, 2}^{'}) = (\frac{p_{1, 2}^{'}}{p_{1, 2}^{'} + p_{2, 2}^{'} + 10}, \frac{p_{2, 2}^{'}}{p_{1, 2}^{'} + p_{2, 2}^{'} + 10}, \frac{10}{p_{1, 2}^{'} + p_{2, 2}^{'} + 10})$ , where $p_{1, 2}^{'} = \exp (- 0.2 Y_{1} - 0.05 X_{1} + 4)$ , $p_{2, 2}^{'} = \exp (- 0.08 Y_{1} + 0.6 X_{1} + 3)$ . We define the stage 2 optimal decision rule as:

g_{2}^{opt} (H ∣ A_{1} = 1) = \{\begin{matrix} 0 & Y_{1} \leq 0.5 \\ 1 & 0.5 < Y_{1} \leq 1.5 \\ 2 & Y_{1} > 1.5 \end{matrix}, g_{2}^{opt} (H ∣ A_{1} \neq 1) = \{\begin{matrix} 0 & X_{2} > - 0.5 \\ 1 & X_{2} \leq - 0.5 \end{matrix},

and the reward as:

Y_{2} = \exp \{1 + I (X_{3} > - 0.5) * I (A_{1} = 1) - 6 * \exp (|A_{2} - g_{2}^{opt} (H)|)\} + ϵ,

where $ϵ ~ N (0, 0.5)$ . At stage 2, the optimal rewards for patients in the nonviable route are $E [Y_{2} ∣ A_{1} = 1, A_{2} = g_{2}^{opt} (H) = 2] = e^{2}$ , which is higher than that of the patients in the viable routes $E [Y_{2} ∣ A_{1} \neq 1, A_{2} = g_{2}^{opt} (H)] = e^{1}$ . However, we would like to restrict $(A_{1} = 1, A_{2} = 2)$ out of our estimated treatment decision domain.

We compare our approach with a naive method that rules out the partial data of patients in the nonviable arm for stage 2 estimations but only includes them in stage 1 estimation when their treatments are still viable. For both methods, we utilize the AIPW version of the purity measure to allow robust estimation. We set the minimum size of the terminal nodes as 20, specify $λ$ as 5% in the splitting criteria, and the maximal depth as 3. One hundred replications are generated under each scenario considering different training sample sizes ( $n = 3000, 5000$ ) and different propensity score model specifications (correct or incorrect). The two methods are evaluated on the agreement between the optimal treatments recommended by the estimated DTR and the true constrained optimal treatments of a randomly generated external test data set with $n = 1000$ .

Table 1 summarizes the simulation results. Opt % shows the empirical mean of the percentage of subjects correctly classified to their constrained optimal treatments. Both methods perform well in the first stage, correctly specifying the true optimal treatment for over 99% of patients. The naive approach using T-RL with data censoring is slightly more stable than the proposed method, having a smaller standard deviation for opt %. However, for the second stage and overall, RT-RL has a much higher opt % than the naive method. Although censoring patients’ data with nonviable treatment sequences, the naive method still recommends the “problematic” treatment sequence to 36%–42% of patients in the evaluation set. Both estimated regimes improve patient outcomes, while the proposed algorithm improves less, which is not surprising as it rules out the nonviable arm for future patients. As demonstrated in the simulation results, the proposed and naive methods are doubly robust as similar results are observed under the correctly or incorrectly specified $π$ model.

TABLE 1.

Comparing simulation results between RT-RL and T-RL with data censoring. Standard errors are in parenthesis. Opt % records the average percent of subjects in the evaluation set being recommended the true optimal treatment route. $\hat{E} (Y ∣ {\hat{g}}^{opt}) - \hat{E} (Y ∣ g^{obs})$ is the improvement of the estimated outcomes when following the estimated DTR versus the observed DTR. “% of Recommendation with Nonviable Arm” shows the percent of subjects in the evaluation set being recommended the nonviable treatment sequence

	Correct $π$		Incorrect $π$
Method	RT-RL	Naive T-RL with Data Censoring	RT-RL	Naive T-RL with data censoring
Training sample	n = 3000
Opt % stage 1	99.37 (0.44)	99.43 (0.34)	98.69 (4.91)	99.43 (0.34)
Opt % stage 2	83.73 (12.25)	31.58 (17.58)	80.89 (16.11)	31.56 (17.53)
Opt % overall	83.42 (12.24)	31.31 (17.57)	80.28 (16.40)	31.29 (17.53)
$\hat{E} (Y {\hat{g}}^{o p t}) - \hat{E} (Y g^{o b s})$	1.28 (1.54)	1.38 (1.50)	1.27 (1.55)	1.38 (1.50)
% of recommendation with nonviable arm	0	36.35	0	36.98
Training sample	n = 5000
Opt % stage 1	99.56 (0.33)	99.58 (0.30)	99.39 (0.44)	99.59 (0.29)
Opt % Sstage 2	90.84 (9.17)	33.28 (12.33)	85.16 (17.55)	33.21 (11.34)
Opt % overall	90.52 (9.12)	33.06 (12.36)	84.85 (17.54)	32.99 (11.37)
$\hat{E} (Y ∣ {\hat{g}}^{o p t}) - \hat{E} (Y ∣ g^{o b s})$	1.30 (1.54)	1.40 (1.49)	1.30 (1.55)	1.40 (1.49)
% of recommendation with nonviable arm	0	41.71	0	42.39

Open in a new tab

4 |. APPLICATION TO LOC PLACEMENT FOR ADOLESCENT SUBSTANCE USERS

We use the proposed RT-RL method to estimate a viable optimal two-stage DTR to guide the LOC placement for adolescent substance users over 0–3 months (stage 1) and 3–6 months (stage 2) to minimize substance use at month 12 after intake, based on a longitudinal observational data set ( $n = 10,131$ ), known as the GAIN (Dennis et al., 2003). The observed possible LOC treatment options at both stages include inpatient $(A = 1)$ and outpatient $(A = 2)$ , while stage 2 also allows no treatment $(A = 3)$ . These treatment options represent different levels of patient care during each 3-month period. Inpatient or residential is the most intensive treatment, where youth are admitted for at least one night to a residential, inpatient, or hospital program. Outpatient refers to a regular (1–8 h per week) or intensive outpatient (more than 8 h per week) program. The outcome of interest for this study is the Substance Frequency Scale (SFS) at 12 months after intake, which is coded as $Y = - 1 \times S F S$ to ensure that a higher value is preferable.

According to National Institutes of Health (NIH), since relapses often occur for adolescents with substance abuse, more than one episode of treatment may be necessary (Godley et al., 2007). Also, inpatient treatment is recommendedforaseverelevelofaddiction.Stayingintreatment for an adequate period of time is especially important for patients with $A_{1} = 1$ . However, in the collected data, we observed a group of patients who started with inpatient treatment $(A_{1} = 1)$ and followed with no treatment $(A_{2} = 3)$ . Thus, given the collected data and NIH recommendations, we consider constraining the possible treatment regimes over $A_{2} \leq 2$ given $A_{1} = 1$ by restricting to “stepped up” DTRs that do not allow treatment discontinuation for adolescents starting with inpatient treatment. Note that an alternative way to address the relapse issue is recording the frequency of relapse during the data collection phase and optimizing a combined outcome of SFS and the number of relapses. However, it is challenging to track patients after they leave the healthcare facility, and extra efforts are required to determine the weights or offsets in the combined outcome. The advantage of the proposed methodology is using the collected data and prespecified value function related to SFS to search for an optimal treatment strategy, which is demonstrated below.

Specifically, we randomly split the data 3:1 for training and evaluation. We split the data multiple times using different random seeds and the results were similar, hence only one example is shown here. The training data contain 7599 randomly selected records with the remaining 2532 records in the evaluation data. This split facilitates the comparison of the RT-RL with a naive T-RL method (ignoring partial records for patients in the nonviable arm) and nondynamic regimes such as $A_{1} = A_{2} = 1$ and $A_{1} = A_{2} = 2$ . The naive T-RL method censors 404 patients’ data who received the nonviable treatment route $A_{1} = 1$ , $A_{2} = 3$ . The viability and interpretability of ${\hat{g}}^{opt}$ provided by the restricted and naive T-RL methods can be assessed with the probability of recommending the nonviable route to patients in the evaluation set. Moreover, we compare $\hat{E} (Y ∣ {\hat{g}}^{opt}) - \hat{E} (Y ∣ g^{obs})$ and the percent of patients who have $\hat{E} (Y ∣ {\hat{g}}^{opt}) > \hat{E} (Y ∣ g^{obs})$ to examine the methods’ ability to suggest optimal treatments.

According to Figure 3, RT-RL provides a viable DTR based on SFS at intake (sfs8p_0), living risk index at (lri7_0) intake, and first-stage treatment $(A_{1})$ . It recommends $A_{1} = A_{2} = 1$ to 947 (37.4%) patients and $A_{1} = A_{2} = 2$ to 1585 (62.6%) patients in the evaluation data. Although training set patients’ data were censored in the nonviable arm when using the naive T-RL, it still allows “stepped down” treatments in the evaluation set for patients who greatly improved their SFS at month 6, which leads to 674 (26.6%) patients being recommended for the nonviable treatment sequence.

Estimated optimal DTR from RT-RL and Naive T-RL. *sfs8p_t*: −1* SFS at t months post intake, higher value preferred; *lri*7_0: living risk index at intake, lower value preferred

Table 2 shows that RT-RL provides an optimal DTR in the desired class with the greatest improvement (0.7 points) in the expected outcome of interest (negative SFS) at 12 months post intake. The scale of the improvement can be standardized using pooled standard deviation. Given that the pooled standard deviation of the expected counterfactual outcomes under the observed DTR and the RT-RL estimated optimal DTR is $\sqrt{\frac{v a r \{\hat{E} (Y ∣ {\hat{g}}^{o p t})\} + v a r \{\hat{E} (Y ∣ g^{o b s})\}}{2}} = 4.11$ in the evaluation set, the improvement of 0.7 using RT-RL is equivalent to 0.17 SD. When comparing our method with the naive T-RL with data censoring, the improvement in the expected outcome of interest is 0.22 SD. Similarly, compared with the non-DTRs $A_{1} = A_{2} = 1$ and $A_{1} = A_{2} = 2$ , the improvements are 0.21 SD and 0.13 SD, respectively. Moreover, RT-RL is expected to help 56% of patients improve their potential mean substance use outcomes compared to the observed treatment in the evaluation set. In contrast, the naive T-RL cannot provide a viable and well-performing regime due to the lack of cross-stagerestrictionsand thebias introducedbyignoring partial records for patients in the nonviable arm.

TABLE 2.

Treatment regime performance on evaluation data. % improved refers to the percent of patients who are expected to benefit from the estimated regime compared to the observed treatment

	RT-RL	Naive T-RL	$A_{1} = 1$ , $A_{2} = 1$	$A_{1} = 2$ , $A_{2} = 2$
$\hat{E} (Y ∣ {\hat{g}}^{opt}) - \hat{E} (Y ∣ g^{obs})$ (SD)	0.70 (1.4)	−0.08 (1.9)	−0.03 (1.9)	0.14 (1.1)
% improved	56.0	38.6	39.8	42.3
SD of $\hat{E} (Y ∣ {\hat{g}}^{opt})$	3.78	3.34	3.11	4.60

Open in a new tab

5 |. DISCUSSION

The proposed method RT-RL searches for viable and interpretable DTR within clinically meaningful treatment trajectories. It allows user-specified restrictions on the observed treatment routes. According to simulation and GAIN data application results, RT-RL avoids selection bias when using partial information from patients in the nonviable arm to estimate optimal treatment decision rules for stages before their first “problematic” treatment. We achieve this by “correcting” the counterfactual rewards of the patients in the nonviable arms using information from similar patients who received viable treatments. The idea of stratifying based on patient history and truncating treatment options is not limited to tree-based methods. Such a framework for DTR estimation considering suboptimal correction can be applied to other methods when appropriate.

The naive T-RL method suffers from information loss and possible selection bias yet still recommends nonviable treatment routes to new patients. This happens because the rules are estimated stage wise, and the naive method considers all observed treatments in the current stage as possible treatments. Without restrictions based on previous treatment sequences and constraints on the current stage treatment space, it is not guaranteed that the recommended treatment is different from the nonviable treatment $b_{j, m, j}$ when the previous recommendations are ${\bar{a}}_{j - 1} = (b_{1, m, j}, .., b_{j - 1, m, j})$ .

In this study, we focused on constraints with nonviable treatment sequences. This approach provides a solution for many observational studies with relatively large sample sizes. However, when we have little observational data in any of the 0th to $M_{j} th$ subgroups, the restricted DTR is difficult to estimate. Similarly, the sample size of each subgroup will decrease with an increased number of nonviable arms. This small sample size problem may affect the reliability of the resulting DTR. Thus, we require the positivity assumption to hold for estimating a restricted optimal DTR, where we observe a group of similar patients treated with each potential treatment sequence. Even when the positivity assumption holds, the decision rules’ uncertainties may differ based on the sample size of each subgroup. Practically, to quantify the uncertainty in a subgroup, we can estimate the optimal DTRs using different bootstrapped training samples and examine the consistency of the suggested optimal treatments in one test sample. When significant uncertainty is observed, we may consider using ensemble methods, including Bayes optimal classifier, bagging, and boosting, to provide a stabler rule (Mitchell et al., 1997; Freund & Schapire, 1997), each of which has a unique subsampling method. Nevertheless, one disadvantage of the ensemble methods is that they can be challenging to interpret. The resulting set of decision rules might be too complicated for clinical practice as the interpretations for each rule, and the voting weights are unexplainable.

Moreover, the treatment sequence constraints can also come from patient characteristics or stage-wise rewards $(R_{j})$ . For example, if future treatments must consider whether a patient is a responder to current treatment at stage $j$ , all DTR ignoring the responder status at stage $j$ should not be applicable. Further studies are necessary to solve this issue.

Supplementary Material

Supplemental Material 1

NIHMS2110496-supplement-Supplemental_Material_1.pdf^{(52.4KB, pdf)}

Supplemental Material 2

NIHMS2110496-supplement-Supplemental_Material_2.zip^{(21.3KB, zip)}

Web Appendices referenced in Sections 2 and R code for implementing the proposed methods are available with this paper at the Biometrics website on Wiley Online Library. The R code is also on Github at https://github.com/Team-Wang-Lab/Restricted-TRL.

ACKNOWLEDGMENTS

The development of this paper was funded by grants from the National Institutes of Health (R01DA039901, R01MH114203, R01HD095973, R01DA047279, and P50DA054039), and Institute for Education Sciences (R324B210001) to Almirall. The data collection was funded by CSAT/SAMHSA contract #270-07-0191, using data provided by the following contracts: CSAT/SAMHSA #270-97-7011, #270-00-6500, #270-2003-00006 #270-98-7047, #277-00-6500 #270-2007-00004C, and grantees: TI-11317, TI-11321, TI-11323, TI-11324, TI-11894, TI-11892, TI-11422, TI-11423, TI-11424, TI-11432, TI-13344, TI-13354, TI-13356, TI-16400. The authors would like to thank Dr. Kirsten Herold for proofreading the paper.

Funding information

National Institutes of Health P50DA054039 R01DA015697 Center for Substance Abuse Treatment 270-07-0191

DATA AVAILABILITY STATEMENT

The application study data in this paper are not publicly available due to patient privacy and confidentiality issues.

REFERENCES

Bather J. (2000) Decision theory: An introduction to dynamic programming and sequential decisions. Hoboken, New Jersey, USA: John Wiley & Sons, Inc. [Google Scholar]
Dennis ML, Titus JC, White MK, Unsicker JI & Hodgkins D. (2003) Global appraisal of individual needs: Administration guide for the gain and related measures. Bloomington, IL: Chestnut Health Systems. [Google Scholar]
Fidler MC, Sanchez M, Raether B, Weissman NJ, Smith SR, Shanahan WR & Anderson CM (2011) A one-year randomized trial of lorcaserin for weight loss in obese and overweight adults: the blossom trial. The Journal of Clinical Endocrinology & Metabolism, 96, 3067–3077. [DOI] [PubMed] [Google Scholar]
Freund Y. & Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139. [Google Scholar]
Godley MD, Godley SH, Dennis ML, Funk RR & Passetti LL (2007) The effect of assertive continuing care on continuing care linkage,adherenceandabstinencefollowingresidentialtreatment foradolescentswithsubstanceusedisorders. Addiction,102,81–93. [DOI] [PubMed] [Google Scholar]
Hall K, Stewart T, Chang J. & Freeman MK (2016) Characteristics of fda drug recalls: a 30-month analysis. American Journal of Health-System Pharmacy, 73, 235–240. [DOI] [PubMed] [Google Scholar]
Huang X, Choi S, Wang L. & Thall PF (2015) Optimization of multi-stage dynamic treatment regimes utilizing accumulated data. Statistics in Medicine, 34, 3424–3443. [DOI] [PMC free article] [PubMed] [Google Scholar]
Laber E. & Zhao Y. (2015) Tree-based methods for individualized treatment regimes. Biometrika, 102, 501–514. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mitchell TM et al. (1997) Machine learning. New York: McGraw-Hill. [Google Scholar]
Murphy SA (2003) Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 331–355. [Google Scholar]
Murphy SA, van der Laan MJ, Robins JM & Group CPPR (2001) Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96, 1410–1423. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA et al. (2012) Q-learning: a data analysis method for constructing adaptive interventions. Psychological Methods, 17, 478. [DOI] [PMC free article] [PubMed] [Google Scholar]
O’neil PM, Smith SR, Weissman NJ, Fidler MC, Sanchez M, Zhang J. et al. (2012) Randomized placebo-controlled clinical trial of lorcaserin for weight loss in type 2 diabetes mellitus: the BLOOM-DM study. Obesity, 20, 1426–1436. [DOI] [PubMed] [Google Scholar]
Rivera DE, Pew MD & Collins LM (2007) Using engineering control principles to inform the design of adaptive interventions: a conceptual introduction. Drug and Alcohol Dependence, 88, S31–S40. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robins J. (1986) A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393–1512. [Google Scholar]
Robins JM & Hernán MA (2009) Estimation of the causal effects of time-varying exposures. Longitudinal Data Analysis, 553, 599. [Google Scholar]
Schulte PJ, Tsiatis AA, Laber EB & Davidian M. (2014) Qanda-learningmethodsforestimatingoptimaldynamictreatment regimes. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 29, 640. [DOI] [PMC free article] [PubMed] [Google Scholar]
Smith SR, Weissman NJ, Anderson CM, Sanchez M, Chuang E, Stubbe S. et al. (2010) Multicenter, placebo-controlled trial of lorcaserin for weight management. New England Journal of Medicine, 363, 245–256. [DOI] [PubMed] [Google Scholar]
Sobell MB & Sobell LC (2000) Stepped care as a heuristic approach to the treatment of alcohol problems. Journal of Consulting and Clinical Psychology, 68, 573. [PubMed] [Google Scholar]
Sun Y. & Wang L. (2021) Stochastic tree search for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 116, 421–432. [Google Scholar]
Sutton RS & Barto AG (1998) Introduction to reinforcement learning, volume 135. Cambridge: MIT Press. [Google Scholar]
Tao Y. & Wang L. (2017) Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics, 73, 145–155. [DOI] [PubMed] [Google Scholar]
Tao Y, Wang L. & Almirall D. (2018) Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. The Annals of Applied Statistics, 12, 1914. [DOI] [PMC free article] [PubMed] [Google Scholar]
US Food and Drug Administration (2020) FDA requests the withdrawal of the weight-loss drug Belviq, Belviq XR (lorcaserin) from the market. Available at: https://www.fda.gov/drugs/drug-safetyand-availability/fda-requests-withdrawal-weight-loss-drugbelviq-belviq-xr-lorcaserin-market. Accessed 02/06/2022.
US Food and Drug Administration (2021) Teligent Pharma, Inc.’s issues worldwide voluntary recall of lidocaine HCl topical solution 4% due to super potency. Available at: https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts/teligentpharma-incs-issues-worldwide-voluntary-recall-lidocaine-hcltopical-solution-4-due-super. Accessed 02/06/2022.
Wang L, Rotnitzky A, Lin X, Millikan RE & Thall PF (2012) Evaluation of viable dynamic treatment regimes in a sequentially randomized trial of advanced prostate cancer. Journal of the American Statistical Association, 107, 493–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watkins CJ & Dayan P. (1992) Q-learning. Machine Learning, 8, 279–292. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Material 1

NIHMS2110496-supplement-Supplemental_Material_1.pdf^{(52.4KB, pdf)}

Supplemental Material 2

NIHMS2110496-supplement-Supplemental_Material_2.zip^{(21.3KB, zip)}

Data Availability Statement

The application study data in this paper are not publicly available due to patient privacy and confidentiality issues.

[R1] Bather J. (2000) Decision theory: An introduction to dynamic programming and sequential decisions. Hoboken, New Jersey, USA: John Wiley & Sons, Inc. [Google Scholar]

[R2] Dennis ML, Titus JC, White MK, Unsicker JI & Hodgkins D. (2003) Global appraisal of individual needs: Administration guide for the gain and related measures. Bloomington, IL: Chestnut Health Systems. [Google Scholar]

[R3] Fidler MC, Sanchez M, Raether B, Weissman NJ, Smith SR, Shanahan WR & Anderson CM (2011) A one-year randomized trial of lorcaserin for weight loss in obese and overweight adults: the blossom trial. The Journal of Clinical Endocrinology & Metabolism, 96, 3067–3077. [DOI] [PubMed] [Google Scholar]

[R4] Freund Y. & Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139. [Google Scholar]

[R5] Godley MD, Godley SH, Dennis ML, Funk RR & Passetti LL (2007) The effect of assertive continuing care on continuing care linkage,adherenceandabstinencefollowingresidentialtreatment foradolescentswithsubstanceusedisorders. Addiction,102,81–93. [DOI] [PubMed] [Google Scholar]

[R6] Hall K, Stewart T, Chang J. & Freeman MK (2016) Characteristics of fda drug recalls: a 30-month analysis. American Journal of Health-System Pharmacy, 73, 235–240. [DOI] [PubMed] [Google Scholar]

[R7] Huang X, Choi S, Wang L. & Thall PF (2015) Optimization of multi-stage dynamic treatment regimes utilizing accumulated data. Statistics in Medicine, 34, 3424–3443. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Laber E. & Zhao Y. (2015) Tree-based methods for individualized treatment regimes. Biometrika, 102, 501–514. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Mitchell TM et al. (1997) Machine learning. New York: McGraw-Hill. [Google Scholar]

[R10] Murphy SA (2003) Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 331–355. [Google Scholar]

[R11] Murphy SA, van der Laan MJ, Robins JM & Group CPPR (2001) Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96, 1410–1423. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA et al. (2012) Q-learning: a data analysis method for constructing adaptive interventions. Psychological Methods, 17, 478. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] O’neil PM, Smith SR, Weissman NJ, Fidler MC, Sanchez M, Zhang J. et al. (2012) Randomized placebo-controlled clinical trial of lorcaserin for weight loss in type 2 diabetes mellitus: the BLOOM-DM study. Obesity, 20, 1426–1436. [DOI] [PubMed] [Google Scholar]

[R14] Rivera DE, Pew MD & Collins LM (2007) Using engineering control principles to inform the design of adaptive interventions: a conceptual introduction. Drug and Alcohol Dependence, 88, S31–S40. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Robins J. (1986) A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393–1512. [Google Scholar]

[R16] Robins JM & Hernán MA (2009) Estimation of the causal effects of time-varying exposures. Longitudinal Data Analysis, 553, 599. [Google Scholar]

[R17] Schulte PJ, Tsiatis AA, Laber EB & Davidian M. (2014) Qanda-learningmethodsforestimatingoptimaldynamictreatment regimes. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 29, 640. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Smith SR, Weissman NJ, Anderson CM, Sanchez M, Chuang E, Stubbe S. et al. (2010) Multicenter, placebo-controlled trial of lorcaserin for weight management. New England Journal of Medicine, 363, 245–256. [DOI] [PubMed] [Google Scholar]

[R19] Sobell MB & Sobell LC (2000) Stepped care as a heuristic approach to the treatment of alcohol problems. Journal of Consulting and Clinical Psychology, 68, 573. [PubMed] [Google Scholar]

[R20] Sun Y. & Wang L. (2021) Stochastic tree search for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 116, 421–432. [Google Scholar]

[R21] Sutton RS & Barto AG (1998) Introduction to reinforcement learning, volume 135. Cambridge: MIT Press. [Google Scholar]

[R22] Tao Y. & Wang L. (2017) Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics, 73, 145–155. [DOI] [PubMed] [Google Scholar]

[R23] Tao Y, Wang L. & Almirall D. (2018) Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. The Annals of Applied Statistics, 12, 1914. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] US Food and Drug Administration (2020) FDA requests the withdrawal of the weight-loss drug Belviq, Belviq XR (lorcaserin) from the market. Available at: https://www.fda.gov/drugs/drug-safetyand-availability/fda-requests-withdrawal-weight-loss-drugbelviq-belviq-xr-lorcaserin-market. Accessed 02/06/2022.

[R25] US Food and Drug Administration (2021) Teligent Pharma, Inc.’s issues worldwide voluntary recall of lidocaine HCl topical solution 4% due to super potency. Available at: https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts/teligentpharma-incs-issues-worldwide-voluntary-recall-lidocaine-hcltopical-solution-4-due-super. Accessed 02/06/2022.

[R26] Wang L, Rotnitzky A, Lin X, Millikan RE & Thall PF (2012) Evaluation of viable dynamic treatment regimes in a sequentially randomized trial of advanced prostate cancer. Journal of the American Statistical Association, 107, 493–508. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Watkins CJ & Dayan P. (1992) Q-learning. Machine Learning, 8, 279–292. [Google Scholar]

PERMALINK

Estimating tree-based dynamic treatment regimes using observational data with restricted treatment sequences

Nina Zhou

Lu Wang

Daniel Almirall

Abstract

1 |. INTRODUCTION

FIGURE 1.

2 |. METHOD

2.1 |. Notation and setup

2.2 |. Constrained optimization procedure

Proposition 1 (Double Robustness).

2.3 |. Implementing RT-RL

FIGURE 2.

3 |. SIMULATION STUDIES

TABLE 1.

4 |. APPLICATION TO LOC PLACEMENT FOR ADOLESCENT SUBSTANCE USERS

FIGURE 3.

TABLE 2.

5 |. DISCUSSION

Supplementary Material

ACKNOWLEDGMENTS

Funding information

DATA AVAILABILITY STATEMENT

REFERENCES

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Estimating tree-based dynamic treatment regimes using observational data with restricted treatment sequences

Nina Zhou

Lu Wang

Daniel Almirall

Abstract

1 |. INTRODUCTION

FIGURE 1.

2 |. METHOD

2.1 |. Notation and setup

2.2 |. Constrained optimization procedure

Proposition 1 (Double Robustness).

2.3 |. Implementing RT-RL

FIGURE 2.

3 |. SIMULATION STUDIES

TABLE 1.

4 |. APPLICATION TO LOC PLACEMENT FOR ADOLESCENT SUBSTANCE USERS

FIGURE 3.

TABLE 2.

5 |. DISCUSSION

Supplementary Material

ACKNOWLEDGMENTS

Funding information

DATA AVAILABILITY STATEMENT

REFERENCES

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases