
Entropy Learning for Dynamic Treatment Regimes

Binyan Jiang 1, Rui Song 2, Jialiang Li 3, Donglin Zeng 4

Abstract

Estimating optimal individualized treatment rules (ITRs) in single- or multi-stage clinical trials is one key solution to personalized medicine and has received increasing attention in the statistical community. Recent developments suggest that machine learning approaches can significantly improve the estimation over model-based methods. However, proper inference for the estimated ITRs has not been well established for machine learning based approaches. In this paper, we propose an entropy learning approach to estimate the optimal individualized treatment rules. We obtain the asymptotic distributions of the estimated rules and can therefore provide valid inference. The proposed approach is demonstrated to perform well in finite samples through extensive simulation studies. Finally, we analyze data from a multi-stage clinical trial for depression patients. Our results offer novel findings that are otherwise not revealed with existing approaches.

Keywords: Dynamic treatment regime, entropy learning, personalized medicine

1. Introduction

One important goal in personalized medicine is to develop a decision support system that provides adequate management for individual patients with specific diseases. Estimating individualized treatment rules (ITRs) using evidence from single- or multi-stage clinical trials provides the key to developing such a system, and the development of powerful estimation methods has received increasing attention in the statistical community. Early methods for estimating ITRs include Q-learning (Watkins and Dayan, 1992; Murphy, 2005; Chakraborty et al., 2010; Goldberg and Kosorok, 2012; Laber et al., 2014; Song et al., 2015) and A-learning (Robins et al., 2000; Murphy, 2003), where Q-learning models the conditional mean of the outcome given historical covariates and treatment using a well-constructed statistical model, and A-learning models the contrast function that is sufficient for treatment decisions.

Recently, Zhao et al. (2012) discovered that it is possible to cast the estimation of the optimal regime as a weighted classification problem. Thus, Zhao et al. (2012, 2015) proposed outcome weighted learning (OWL) to directly optimize the approximate expected clinical outcome, where the objective function is a hinge loss weighted by individual outcomes. Although OWL has been shown to outperform the model-based Q- and A-learning approaches in numerical studies, and its asymptotic behavior might be established owing to the convexity of the loss (Hjort and Pollard, 2011), there is no valid inference procedure for the parameters in the optimal treatment rules due to the non-differentiability of the hinge loss near the decision boundary, and the minimization operator is more or less heuristic.

In this paper, we propose a class of smooth-loss based outcome weighted learning methods to estimate the optimal ITRs, among which one special case of the proposed losses is a weighted entropy loss (Murphy, 2012). By using continuously differentiable loss functions, we not only maintain the Fisher consistency of the derived treatment rule, but are also able to obtain proper inference for the parameters in the derived rule. We can further quantify the uncertainty of the value function under the estimated treatment rule, which is potentially useful for designing future trials and for comparisons with other non-optimal treatment rules. Numerically, compared with existing inference procedures for model-based approaches, such as the bootstrap approach for Q-learning, our inference procedure requires no tuning parameters and yields more accurate inference in finite-sample numerical studies.

We notice that Bartlett et al. (2006) established profound conceptual work on classification losses in a quite general setting. However, linking such work to recursive or dynamic optimization is not trivial; we work with a logistic loss to achieve this. Luedtke and van der Laan (2016) attempted to unify surrogate loss functions for outcome-dependent learning. Their way of showing validity differs from our derivation: our justification is more intuitive and our computing algorithm is also different. While super learning is a quite general and powerful method, logistic regression can be implemented easily and fits our needs directly. Moreover, asymptotic properties of our estimators are established for conducting proper inference, which is not addressed in the two aforementioned works.

The paper is structured as follows. In Section 2, we introduce the proposed entropy learning method for single- and multi-stage settings. In Section 3, we provide the asymptotic properties of our estimators. In Section 4, simulation studies are conducted to assess the performance of our methods. In Section 5, we apply the entropy learning to the well-known STAR*D study. We conclude the paper in Section 6. Technical proofs are provided in the supplementary materials.

2. Method

2.1. Smooth surrogate loss for outcome weighted learning

To motivate our approach of choosing a smooth surrogate loss for learning the optimal ITRs, we first consider data from a single-stage randomized trial with two treatment arms. Treatment assignment is denoted by $A \in \mathcal{A} = \{-1, 1\}$. The patient's prognostic variables are denoted by a p-dimensional vector X. We use R to denote the observable clinical outcome, also called the reward, and assume that R is positive and bounded from above, with larger values of R being more desirable. The data consist of $\{(X_i, A_i, R_i): i = 1, \ldots, n\}$.

For a given treatment decision rule D, which maps X to {−1, 1}, we denote by $E^{D}$ the expectation with respect to the distribution of (X, A, R) given that A = D(X). Then an optimal treatment rule is a rule that maximizes the value function

$$E^{D}(R) = E\left\{\frac{R\, I(A = D(X))}{A\pi + (1 - A)/2}\right\}, \qquad (2.1)$$

where π = P(A = 1|X). Following Qian and Murphy (2011), it can be shown that the maximization problem is equivalent to the problem of minimizing

$$E\left\{\frac{R\, I(A \neq D(X))}{A\pi + (1 - A)/2}\right\}. \qquad (2.2)$$

The latter is a weighted classification error and can thus be estimated from the observed sample by

$$n^{-1}\sum_{i=1}^{n}\left\{\frac{R_i\, I(A_i \neq D(X_i))}{A_i\pi + (1 - A_i)/2}\right\}. \qquad (2.3)$$

Due to the discontinuity and non-convexity of the 0–1 loss on the right-hand side of (2.2), direct minimization of (2.3) is difficult and parameter inference is infeasible. To address this challenge, the hinge loss from the support vector machine (SVM) was previously proposed as a substitute for the 0–1 loss (Zhao et al., 2012, 2015). However, due to the non-differentiability of the hinge loss, inference remains challenging. This motivates us to seek a smoother surrogate loss function for estimation.

Consider an arbitrary surrogate loss $h(a, y): \{-1, 1\} \times \mathbb{R} \to \mathbb{R}$. Then by replacing the 0–1 loss by this surrogate loss, we estimate the treatment rule by minimizing

$$\mathcal{R}_h(f) = E\left\{\frac{R\, h(A, f(X))}{A\pi + (1 - A)/2}\right\}. \qquad (2.4)$$

To evade the non-convexity, we require that h(a, y) be convex in y. Furthermore, simple algebra gives

$$E\left\{\frac{R}{A\pi + (1 - A)/2}\, h(A, f(X)) \,\Big|\, X = x\right\} = E[R \mid X = x, A = 1]\, h(1, f(x)) + E[R \mid X = x, A = -1]\, h(-1, f(x)) = a_x h(1, f(x)) + b_x h(-1, f(x)),$$

where $a_x = E[R \mid X = x, A = 1]$ and $b_x = E[R \mid X = x, A = -1]$. Hence, for any given x, the minimizer of f(x), denoted by $y_x$, solves the equation

$$a_x h'(1, y) + b_x h'(-1, y) = 0,$$

where $h'(a, y)$ is the first derivative of $h(a, y)$ with respect to y. To ensure that the surrogate loss still leads to the correct optimal rule, which is equivalent to $\operatorname{sgn}(a_x - b_x)$, we require that this solution have the same sign as $a_x - b_x$. On the other hand, since $a_x h'(1, y) + b_x h'(-1, y)$ is non-decreasing in y, we conclude that for $a_x > b_x$, if $a_x h'(1, 0) + b_x h'(-1, 0) \le 0$, then the solution $y_x$ should be positive; while for $a_x < b_x$, if $a_x h'(1, 0) + b_x h'(-1, 0) \ge 0$, then the solution $y_x$ should be negative. In other words, one sufficient condition to ensure Fisher consistency is

$$(a_x - b_x)\left(a_x h'(1, 0) + b_x h'(-1, 0)\right) \le 0.$$

However, since $a_x$ and $b_x$ can be arbitrary non-negative values, this condition holds if and only if

$$h'(1, 0) = -h'(-1, 0) \quad \text{and} \quad h'(1, 0) \le 0.$$

In conclusion, the choice of h(a, y) should satisfy

  • (I) For a = −1 and 1, h(a, y) is twice differentiable and convex in y;

  • (II) h′(1, 0) = −h′(−1, 0) and h′(1, 0) ≤ 0.

There are many loss functions that satisfy the above two conditions. In particular, we can consider loss functions of the form $h(a, y) = -ay + g(y)$. The first condition then holds automatically if g is twice differentiable and convex. Moreover, since $h'(1, 0) = -1 + g'(0)$ and $h'(-1, 0) = 1 + g'(0)$, both parts of the second condition hold if we choose g such that $g'(0) = 0$. One special case is to choose

$$g(y) = 2\log\left(1 + \exp(y)\right) - y,$$

and the corresponding loss function is

$$h(a, y) = -(a + 1)y + 2\log\left(1 + \exp(y)\right),$$

which corresponds to the entropy loss for logistic regression (Figure 1). In our subsequent development, we will focus on using this loss function, although the results apply to any general smooth loss satisfying those two conditions. Correspondingly, (2.4) becomes:

$$\mathcal{R}(f) = E\left\{\frac{R}{A\pi + (1 - A)/2}\left[-0.5(A + 1) f(X) + \log\left(1 + \exp(f(X))\right)\right]\right\}. \qquad (2.5)$$

Figure 1: Comparison of loss functions.
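As a quick numerical check of conditions (I) and (II), the short R sketch below (purely illustrative values; not code from the paper) evaluates the entropy surrogate $h(a, y) = -(a+1)y + 2\log(1+\exp(y))$, confirms $h'(1, 0) = -h'(-1, 0) = -1$, and verifies that the population minimizer $y_x$ shares the sign of $a_x - b_x$ for one hypothetical choice of $(a_x, b_x)$.

```r
# Sketch (illustrative, not from the paper): check conditions (I)-(II) for the
# entropy surrogate h(a, y) = -(a + 1) y + 2 log(1 + exp(y)).
h      <- function(a, y) -(a + 1) * y + 2 * log(1 + exp(y))
hprime <- function(a, y) -(a + 1) + 2 * exp(y) / (1 + exp(y))

hprime(1, 0)    # -1, so h'(1, 0) <= 0
hprime(-1, 0)   #  1, equals -h'(1, 0)

# The minimizer y_x solves a_x h'(1, y) + b_x h'(-1, y) = 0 and should share
# the sign of a_x - b_x; check for one hypothetical pair (a_x, b_x).
a_x <- 2; b_x <- 1
y_x <- uniroot(function(y) a_x * hprime(1, y) + b_x * hprime(-1, y),
               lower = -10, upper = 10)$root
sign(y_x) == sign(a_x - b_x)   # TRUE; here y_x = log(a_x / b_x)
```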

2.2. Learning optimal ITRs using the entropy loss

Now suppose the randomized trial involves T stages, where patients might receive different treatments across the multiple stages. With some abuse of notation, we use $X_t$, $R_t$ and $A_t$ to denote the set of covariates, the clinical outcome and the corresponding treatment at stage t = 1, ⋯, T, and let $S_t = (X_1, A_1, \cdots, X_{t-1}, A_{t-1}, X_t)$ be the history up to stage t.

A dynamic treatment regime (DTR) is a sequence of deterministic decision rules, $d = (d_1, \cdots, d_T)$, where $d_t$ is a map from the space of history information $S_t$, denoted by $\mathcal{S}_t$, to the action space of available treatments $\mathcal{A}_t = \{-1, 1\}$. The optimal DTR maximizes the expected total value function $E^{d}\left(\sum_{t=1}^{T} R_t\right)$, where the expectation is taken with respect to the distribution of $(X_1, A_1, R_1, \cdots, X_T, A_T, R_T)$ given the treatment assignments $A_t = d_t(S_t)$.

DTRs aim to maximize the expected cumulative rewards, and hence the optimal treatment decision at the current stage must depend on subsequent decision rules. This motivates a backward recursive procedure that first estimates the optimal decision rules at future stages, and then the optimal decision rule at the current stage by restricting the analysis to the subjects who have followed the estimated optimal decision rules thereafter. Assume that we observe data $(X_{1i}, A_{1i}, R_{1i}, \cdots, X_{Ti}, A_{Ti}, R_{Ti})$, i = 1, ⋯, n, forming n independent and identically distributed patient trajectories, and let $S_{ti} = (X_{1i}, A_{1i}, \cdots, A_{t-1,i}, X_{ti})$ for i = 1, …, n and 1 ≤ t ≤ T. Denote $\pi(A_t, S_t) = A_t\pi_t + (1 - A_t)/2$, where $\pi_t = P(A_t = 1 \mid S_t)$, for t = T, …, 1. Suppose that we already possess the optimal regimes at stages t + 1, ⋯, T and denote them by $d_{t+1}^*, \ldots, d_T^*$. Then the optimal decision rule at stage t, $d_t^*(S_t)$, should maximize

$$E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j^*(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t = d_t(S_t)\right) \,\Big|\, S_t\right\},$$

where we assume all subjects have followed the optimal DTRs after stage t. Hence, $d_t^*$ is a map from $\mathcal{S}_t$ to {−1, 1} which minimizes

$$E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j^*(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t \neq d_t(S_t)\right) \,\Big|\, S_t\right\}.$$

Following (2.5), we consider the entropy learning framework where the decision function at stage t is given as

$$d_t(S_t) = 2\, I\left\{\left(1 + \exp(-f_t(X_t))\right)^{-1} > 1/2\right\} - 1 = \operatorname{sgn}\left\{f_t(X_t)\right\}, \qquad (2.6)$$

for some function $f_t(\cdot)$. Here, for simplicity, as defined in equation (2.6), the decision rule is assumed to depend on the history information $S_t$ through $X_t$ only. Although $S_t = S_{t-1} \cup \{A_{t-1}, X_t\}$, any elements of $S_{t-1}$ and $A_{t-1}$ can be included as covariates in $X_t$, and hence this assumption is not stringent at all. In particular, our method is still valid when $X_t$ is set to be $S_t$. Given the observed samples, we obtain estimators for the optimal treatments using the following backward procedure.

Step 1. Minimize

$$-\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{R_{Ti}}{\pi(A_{Ti}, S_{Ti})}\left[0.5(A_{Ti}+1)\, f_T(X_{Ti}) - \log\left(1 + \exp\left(f_T(X_{Ti})\right)\right)\right]\right\} \qquad (2.7)$$

to obtain the stage-T optimal treatment regime. This is the same as the single-stage treatment selection procedure. Let $\hat{f}_T$ be the estimator of $f_T$ obtained by minimizing (2.7). For a given $S_T$, the estimated optimal regime is then given by $\hat{d}_T(S_T) = \operatorname{sgn}\left(\hat{f}_T(X_T)\right)$.
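For a linear decision function, minimizing (2.7) is a weighted logistic regression with response $(A_T+1)/2$ and weights $R_T/\pi(A_T, S_T)$, so it can be carried out with standard software. The R sketch below illustrates this on purely synthetic single-stage data; the data-generating model, variable names, and the use of glm with quasibinomial weights are our own illustrative choices, not the authors' code.

```r
# Sketch of Step 1 (stage T; identical to the single-stage case). The data
# below are purely illustrative and are not one of the paper's simulation models.
set.seed(1)
n   <- 400
dat <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1), x3 = runif(n, -1, 1))
dat$A <- sample(c(-1, 1), n, replace = TRUE)               # randomized, P(A = 1 | X) = 0.5
dat$R <- 8 + dat$x3 + 2 * (0.5 - dat$x1 - dat$x2) * dat$A + abs(rnorm(n)) / 10
pi1 <- 0.5

w <- dat$R / (dat$A * pi1 + (1 - dat$A) / 2)               # outcome weights R / pi(A, S)
y <- (dat$A + 1) / 2                                       # recode treatment to {0, 1}

# The weighted logistic fit maximizes the weighted log-likelihood, i.e. it
# minimizes (2.7); quasibinomial avoids warnings about non-integer weights.
fitT  <- glm(y ~ x1 + x2 + x3, data = dat, weights = w, family = quasibinomial)
dhatT <- sign(predict(fitT, newdata = dat))                # estimated rule sgn{(1, x')beta}
```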

Step 2. For t = T − 1, ⋯, 1, we sequentially minimize

$$-\,n^{-1}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\left[0.5(A_{ti}+1)\, f_t(X_{ti}) - \log\left(1 + \exp\left(f_t(X_{ti})\right)\right)\right]\right\}, \qquad (2.8)$$

where $\hat{d}_{t+1}, \ldots, \hat{d}_T$ have been obtained prior to stage t. Let $\hat{f}_t$ be the estimator of $f_t$ obtained by minimizing (2.8). For a given $S_t$, the estimated optimal regime is then given by $\hat{d}_t(S_t) = \operatorname{sgn}\left(\hat{f}_t(X_t)\right)$.
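For a two-stage trial (T = 2), the backward step amounts to refitting a weighted logistic regression at stage 1, with weights that retain only trajectories consistent with the estimated stage-2 rule. A minimal R sketch, assuming a hypothetical data frame `dat2` with stage-1 covariates x11–x13, stage-2 covariates x21–x23, randomized treatments A1, A2 ∈ {−1, 1} with probability 0.5 each, and positive rewards R1, R2:

```r
# Sketch of the backward procedure for T = 2 (illustrative; `dat2` and its
# column names are assumptions, not objects from the paper).
p1 <- 0.5; p2 <- 0.5                                       # known randomization probabilities

# Stage 2: same weighted logistic regression as in Step 1.
w2    <- dat2$R2 / (dat2$A2 * p2 + (1 - dat2$A2) / 2)
y2    <- (dat2$A2 + 1) / 2
fit2  <- glm(y2 ~ x21 + x22 + x23, data = dat2, weights = w2, family = quasibinomial)
dhat2 <- sign(predict(fit2, newdata = dat2))

# Stage 1: cumulative reward, keeping only trajectories that followed dhat2.
w1    <- (dat2$R1 + dat2$R2) * (dat2$A2 == dhat2) /
         ((dat2$A1 * p1 + (1 - dat2$A1) / 2) * (dat2$A2 * p2 + (1 - dat2$A2) / 2))
y1    <- (dat2$A1 + 1) / 2
fit1  <- glm(y1 ~ x11 + x12 + x13, data = dat2, weights = w1, family = quasibinomial)
dhat1 <- sign(predict(fit1, newdata = dat2))
```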

Let $\mathcal{H}_{p_t}$ be the set of all functions from $\mathbb{R}^{p_t}$ to $\mathbb{R}$. As outlined in Section 2.1, the following proposition justifies the validity of our approach.

Proposition 1. Suppose

$$f_t = \arg\max_{f \in \mathcal{H}_{p_t}} E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = \operatorname{sgn}(f_j(X_j))\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\left[0.5(A_t+1)\, f(X_t) - \log\left(1 + \exp\left(f(X_t)\right)\right)\right]\right\}, \qquad (2.9)$$

backwards through t = T, T − 1, …, 1. We have $d_j^*(S_j) = \operatorname{sgn}\left(f_j(X_j)\right)$ for j = 1, …, T.

Let $V_t = E^{(d_t^*, \ldots, d_T^*)}\left(\sum_{i=t}^{T} R_i\right)$ be the maximal expected value function at stage t. After obtaining the estimated decision rules $\hat{d}_T, \ldots, \hat{d}_t$, for simplicity, we estimate $V_t$ by

$$\hat{V}_t = n^{-1}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\, I\left(A_{ti} = \hat{d}_t(S_{ti})\right)\right\}. \qquad (2.10)$$
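Continuing the illustrative two-stage objects above, the plug-in estimate (2.10) of the stage-1 value is simply a weighted mean; a sketch:

```r
# Sketch: inverse-probability-weighted estimate of the stage-1 value (2.10),
# reusing the illustrative objects dat2, dhat1, dhat2, p1, p2 from above.
den     <- (dat2$A1 * p1 + (1 - dat2$A1) / 2) * (dat2$A2 * p2 + (1 - dat2$A2) / 2)
v_terms <- (dat2$R1 + dat2$R2) * (dat2$A2 == dhat2) * (dat2$A1 == dhat1) / den
V1_hat  <- mean(v_terms)
```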

We remark that our results also fit into the more general and robust estimation framework constructed by Zhang et al. (2012) and Zhang et al. (2013).

3. Asymptotic Theory for Linear Decisions

Suppose the stage-t covariates $X_t$ are of dimension $p_t$ for 1 ≤ t ≤ T, and assume that the function $f_t(X_t)$ in (2.7) and (2.8) is of the linear form $f_t(X_t) = (1, X_t^\top)\beta_t$ for some $\beta_t \in \mathbb{R}^{p_t+1}$. The minimizations in (2.7) and (2.8) can then be carried out as weighted logistic regressions. In this section, we establish the asymptotic distributions of the estimated parameters and value functions under this linear decision assumption. We remark that when the true solution is nonlinear, similar to other linear learning rules, our approach can only be understood as finding the best approximation to the true solution of (2.9) within the linear space.

We consider the multi-stage case only, as results for the single-stage case are exactly the same as those for stage T. For the multi-stage case, we denote $X_t^* = (1, X_t^\top)^\top$ and the observations $X_{ti}^* = (1, X_{ti}^\top)^\top$ for t = 1, …, T and i = 1, …, n. The $n \times (p_t + 1)$ design matrix for stage t is then given by $X_{t,1:n} = (X_{t1}^*, \ldots, X_{tn}^*)^\top$. Let $\beta_{t0} = (\beta_{t00}, \beta_{t10}, \ldots, \beta_{tp_t0})^\top$ be the solution of (2.9) at stage t, and let $\hat{\beta}_t = (\hat{\beta}_{t0}, \hat{\beta}_{t1}, \ldots, \hat{\beta}_{tp_t})^\top$ be its estimator obtained by minimizing (2.7) when t = T and (2.8) when t = T − 1, …, 1.

3.1. Parameter estimation

By setting the first derivative of (2.8) to be 0 for stage t where 1 ≤ tT −1, we have

$$0 = \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\left[0.5(A_{ti}+1) - \frac{\exp\left(X_{ti}^{*\top}\beta_t\right)}{1 + \exp\left(X_{ti}^{*\top}\beta_t\right)}\right]\right\} X_{ti}^{*}.$$

The Hessian matrix of the objective function in (2.8) with respect to $\beta_t$ is:

$$H_t(\beta_t) = \frac{1}{n}\, X_{t,1:n}^{\top} D_t(\beta_t)\, X_{t,1:n},$$

where $D_t(\beta_t) = \operatorname{diag}\{d_{t1}, \ldots, d_{tn}\}$ with

$$d_{ti} = \frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})} \cdot \frac{\exp\left(X_{ti}^{*\top}\beta_t\right)}{\left(1 + \exp\left(X_{ti}^{*\top}\beta_t\right)\right)^2}.$$

Since the $R_{ti}$'s are positive, $H_t(\beta_t)$ is positive definite with probability one. Consequently, the objective function in (2.8) is strictly convex, implying the existence and uniqueness of $\hat{\beta}_t$ for t = T − 1, …, 1. This is also true for t = T by a similar argument. To obtain the asymptotic distribution of the estimators, we need the following regularity conditions:

  • (A1) $I_t(\beta_t)$ is finite and positive definite for any $\beta_t \in \mathbb{R}^{p_t+1}$, t = 1, …, T, where
    $$I_t(\beta_t) = E\left[\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)} \cdot \frac{\exp\left(X_t^{*\top}\beta_t\right) X_t^{*} X_t^{*\top}}{\left(1 + \exp\left(X_t^{*\top}\beta_t\right)\right)^2}\right].$$
  • (A2) There exists a constant $B_T$ such that $R_t < B_T$ for t = 1, …, T. In addition, we assume that for each j = 1, …, $p_t$, the covariates $X_{t1j}, \ldots, X_{tnj}$ are i.i.d. random variables with bounded support, where $X_{tij}$ denotes the jth element of $X_{ti}$.

  • (A3) Denote $Y_t = X_t^{*\top}\beta_{t0}$ and let $g_t(y)$ be the density function of $Y_t$ for 1 ≤ t ≤ T. We assume that $y^{-1} g_t(y) \to 0$ as $y \to 0$. In addition, we assume that there exists a small constant b such that for any positive constant C and $\beta \in N_{t,b} := \{\beta : |\beta - \beta_{t0}| < b\}$, $P\left(|X_t^{*\top}\beta| < Cy\right) = O(y)$ as $y \to 0$.

  • (A4) There exist constants $0 < c_{t1} < c_{t2} < 1$ such that $c_{t1} < \pi_t < c_{t2}$ for t = 1, …, T, and $P\left(\prod_{j=1}^{T} I\left(A_j = d_j^*(S_j)\right) = 1\right) > 0$.

Remark 1. By its definition, $I_t(\beta_t)$ is positive semidefinite. In A1 we assume that $I_t(\beta_t)$ is positive definite to ensure that the true optimal treatment rule is unique and estimable. The boundedness assumption A2 can be further relaxed using truncation techniques. Assumption A3 indicates that the probability of $|Y_t| \le Cn^{-1/2}$ is of order $O(n^{-1/2})$. This is in some sense necessary to ensure that the optimal decision is estimable, and it is also essential for establishing asymptotic normality without an additional Bernoulli point mass as in Laber et al. (2014). Assumption A4 ensures that the treatment design is valid, so that the probability of a patient being assigned to the unknown optimal treatments is non-negligible.

Theorem 1. Under assumptions A1–A4, we have, for t = T, …, 1 and any constant κ > 0, that there exists a large enough constant $C_t$ such that

$$P\left(\left|\hat{\beta}_t - \beta_{t0}\right| > C_t\sqrt{\frac{\log n}{n}}\right) = o\left(\sqrt{\frac{\log n}{n}}\right), \qquad (3.1)$$

and, given $X_t^{*}$, for any $x > 1$ and $x = o(\sqrt{n})$, we have

$$P\left(\left|X_t^{*\top}\left(\beta_{t0} - \hat{\beta}_t\right)\right| > \frac{x W_t}{\sqrt{n}} \,\Big|\, X_t^{*}\right) = \left\{1 + O\left(\frac{x^3}{\sqrt{n}}\right)\right\}\Phi(-x) + O\left(\sqrt{\frac{\log n}{n}}\right), \qquad (3.2)$$

where $W_t^2 = \operatorname{Var}\left(\sqrt{n}\, X_t^{*\top}(\beta_{t0} - \hat{\beta}_t)\right)$ and Φ(·) is the cumulative distribution function of the standard normal distribution. In addition, for the ith sample we have

$$E\left|\prod_{j=t}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right) - \prod_{j=t}^{T} I\left(A_{ji} = d_j^*(S_{ji})\right)\right| = o\left(\sqrt{\frac{\log n}{n}}\right). \qquad (3.3)$$

Furthermore, we have

$$\sqrt{n}\, I_t(\beta_{t0})\left(\hat{\beta}_t - \beta_{t0}\right) \stackrel{d}{\longrightarrow} N(0, \Gamma_t), \qquad (3.4)$$

where $\Gamma_t = (\gamma_{tjk})_{1 \le j, k \le p_t+1}$ with

$$\gamma_{tjk} = E\left\{\left[\frac{\left(\sum_{l=t}^{T} R_{li}\right)\prod_{l=t+1}^{T} I\left(A_{li} = d_l(S_{li})\right)}{\prod_{l=t}^{T}\pi(A_{li}, S_{li})}\right]^2 \left[0.5(A_{ti}+1) - \frac{\exp\left(X_{ti}^{*\top}\beta_{t0}\right)}{1 + \exp\left(X_{ti}^{*\top}\beta_{t0}\right)}\right]^2 X_{tij}^{*} X_{tik}^{*}\right\},$$

and $X_{tij}^{*}$ is the jth element of $X_{ti}^{*}$.

Remark 2. We remark that the proof of Theorem 1 is not straightforward since, for stage t < T, the n terms in the summation of the objective function (2.8) are weakly dependent on each other. Note that the estimation errors of the indicator functions in (2.8) might aggregate when the estimators are obtained sequentially. We thus need to show that the estimation errors of these indicator functions are well controlled. By establishing the Bernstein-type concentration inequality (3.1) and the large deviation result (3.2) for the parameter estimation, we obtain the error bound (3.3) for the estimation of these indicator functions. The asymptotic distribution of the estimators can then be established subsequently. Detailed proofs are provided in the supplementary material. On the other hand, from the proofs we can see that the asymptotic results obtained in the above theorem would also hold if other loss functions satisfying the two conditions raised in Section 2.1 were used, with corresponding modifications to Condition (A1) and the covariance matrix.

In practice, we estimate $\Gamma_t$ in Theorem 1 by $\hat{\Gamma}_t = (\hat{\gamma}_{tjk})_{1 \le j, k \le p_t+1}$ with

$$\hat{\gamma}_{tjk} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\left(\sum_{l=t}^{T} R_{li}\right)\prod_{l=t+1}^{T} I\left(A_{li} = \hat{d}_l(S_{li})\right)}{\prod_{l=t}^{T}\pi(A_{li}, S_{li})}\right]^2 \left[0.5(A_{ti}+1) - \frac{\exp\left(X_{ti}^{*\top}\hat{\beta}_t\right)}{1 + \exp\left(X_{ti}^{*\top}\hat{\beta}_t\right)}\right]^2 X_{tij}^{*} X_{tik}^{*}.$$

The covariance matrix of $\sqrt{n}\left(\hat{\beta}_t - \beta_{t0}\right)$ can then be estimated by $\hat{\Sigma}_t = H_t^{-1}(\hat{\beta}_t)\,\hat{\Gamma}_t\, H_t^{-1}(\hat{\beta}_t)$.
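Using the hypothetical single-stage objects `fitT`, `w`, and `y` from the sketch in Section 2.2, the sandwich estimate $\hat{\Sigma}_t = H_t^{-1}(\hat{\beta}_t)\hat{\Gamma}_t H_t^{-1}(\hat{\beta}_t)$ and the resulting Wald confidence intervals can be assembled as follows; this is a sketch of the formulas above, not the authors' implementation.

```r
# Sketch: sandwich covariance H^{-1} Gamma H^{-1} for the illustrative stage-T
# fit `fitT`; `w` and `y` are the weights and 0/1 responses defined earlier.
X <- model.matrix(fitT)                        # n x (p + 1) design, first column 1
p <- fitted(fitT)                              # fitted logistic probabilities
n <- nrow(X)

H <- crossprod(X, X * (w * p * (1 - p))) / n   # empirical Hessian H_t(beta_hat)
G <- crossprod(X, X * (w * (y - p))^2)  / n    # empirical Gamma_t
Sigma_hat <- solve(H) %*% G %*% solve(H)       # covariance of sqrt(n)(beta_hat - beta0)

se <- sqrt(diag(Sigma_hat) / n)                # standard errors of beta_hat
ci <- cbind(coef(fitT) - 1.96 * se, coef(fitT) + 1.96 * se)   # 95% Wald intervals
```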

3.2. Estimating the optimal value function

In this subsection, we establish the asymptotic normality of the estimated maximal expected value function defined in (2.10) when $f_t(X_t)$ is a linear function of the covariates.

Theorem 2. Under the same assumptions of Theorem 1, we have,

$$\sqrt{n}\left(\hat{V}_t - V_t\right) \stackrel{d}{\longrightarrow} N\left(0, \Sigma_{V_t}\right), \qquad t = 1, \ldots, T,$$

where $\hat{V}_t$ is defined in (2.10) and

$$\Sigma_{V_t} = E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t = d_t(S_t)\right)\right\}^2 - \left[E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t = d_t(S_t)\right)\right\}\right]^2.$$

When conducting inference, $\Sigma_{V_t}$ can be estimated by its empirical counterpart:

$$\hat{\Sigma}_{V_t} = \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\, I\left(A_{ti} = \hat{d}_t(S_{ti})\right)\right\}^2 - \left\{\frac{1}{n}\sum_{i=1}^{n}\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\, I\left(A_{ti} = \hat{d}_t(S_{ti})\right)\right\}^2.$$
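A sketch of the corresponding plug-in variance estimate and a Wald-type interval for the stage-1 value, reusing the illustrative `v_terms` and `V1_hat` from Section 2.2:

```r
# Sketch: empirical variance of the value estimate and a 95% interval.
Sigma_V1 <- mean(v_terms^2) - mean(v_terms)^2                  # empirical Sigma_{V_t}
ci_V1    <- V1_hat + c(-1.96, 1.96) * sqrt(Sigma_V1 / length(v_terms))
```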

3.3. Testing Treatment Effects

In practice, treatments at some stages might not be effective for some patients. When the true optimal treatment rule is linear in $X_t$, a non-significant treatment effect at stage t for some 1 ≤ t ≤ T is equivalent to $X_t^{*\top}\beta_{t0} = 0$, where $X_t^* = (1, X_t^\top)^\top$. From Theorem 1 we immediately have that, given $X_t$, $X_t^{*\top}\hat{\beta}_t$ is asymptotically distributed as $N\left(X_t^{*\top}\beta_{t0},\ \frac{1}{n} X_t^{*\top} I_t(\beta_{t0})^{-1}\Gamma_t I_t(\beta_{t0})^{-1} X_t^{*}\right)$. Therefore, we can use $X_t^{*\top}\hat{\beta}_t$ as a test statistic for the significance of treatment effects: for a realization $x_t^*$ and a given significance level α, we reject $H_0: x_t^{*\top}\beta_{t0} = 0$ if $\sqrt{n}\left|\left(x_t^{*\top}\hat{I}_t(\hat{\beta}_t)^{-1}\hat{\Gamma}_t\hat{I}_t(\hat{\beta}_t)^{-1} x_t^{*}\right)^{-1/2} x_t^{*\top}\hat{\beta}_t\right| > \Phi^{-1}(1 - \alpha/2)$, where $\hat{I}_t(\hat{\beta}_t)$ and $\hat{\Gamma}_t$ are empirical estimators of $I_t$ and $\Gamma_t$ evaluated at $\hat{\beta}_t$, and Φ(·) is the cumulative distribution function of the standard normal distribution.
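Reusing the hypothetical objects `fitT`, `Sigma_hat`, and `n` from the sketch in Section 3.1, the test reduces to a one-line Wald statistic; the covariate vector `xstar` below is made up purely for illustration.

```r
# Sketch: Wald test of H0: x*' beta_t0 = 0 at a new covariate vector.
xstar <- c(1, 0.2, -0.5, 0.1)                        # hypothetical (1, x') vector
est   <- sum(xstar * coef(fitT))                     # x*' beta_hat
se_x  <- sqrt(drop(t(xstar) %*% Sigma_hat %*% xstar) / n)
p_val <- 2 * pnorm(-abs(est / se_x))                 # reject H0 if p_val < alpha
```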

Before we proceed to numerical studies, we remark that the theoretical results obtained in this section remain valid when the model is mis-specified. In that case, however, the parameters being estimated are the minimizers of (2.5) over the linear space, rather than the parameters of the optimal decision rules.

4. Simulation Study

We next carry out numerical studies to assess the performance of our proposed methods.

One-stage.

The treatment A is generated uniformly from {−1, 1} and is independent of the prognostic variables X = (x1, …, xp). We set the reward R = Q(X) + T(X, A) + ϵ, where T(X, A) reflects the interaction between treatment and prognostic variables and ϵ is a random variable such that ϵ = |Y|/10, where Y has a standard normal distribution. Such a folded normal error is chosen because R is restricted to be positive. We consider the following models.

Model 1. x1,x2,x3 are generated independently and uniformly in [−1, 1]. We generate the reward R = Q(X) + T(X, A) + ϵ by setting T(X, A) = 3(.4 − x1x2)A, Q(X) = 8 + 2x1x2 + .5x3. In this case, the decision boundary is only determined by x1 and x2.

Model 2. X = (x1,x2,x3) is generated from a multivariate normal distribution with mean zero and covariance matrix Σ = (σij)3×3 where σij = .5|ij| for 1 ≤ i, j ≤ 3. We generate the reward R by setting T(X,A)=(.82x12x2)A, Q(X)=8+2x1x2+.5x3. The decision boundary of this case is also determined by x1 and x2.

Next we consider some multi-stage cases. The treatments At are generated independently and uniformly from {−1, 1} and are independent of the p-dimensional vector of prognostic variables Xt = (xt1, …, xtp), t = 1, …, T. The error ϵ is generated in the same way as in the single-stage case.

Two-stage.

Model 3. Stage 1 outcome R1 is generated as follows: R1 = (1 − 5x11 − 5x12)A1 + 11.1 + .1x11 − .1x12 + .1x13 + ϵ, where x11, x12, x13 are generated independently from a uniform distribution in [−1, 1]. Stage 2 outcome R2 is generated by R2 = .5A1A2 + 3 + (.2 − x21x22)A2 + ϵ, where x2i = x1i, i = 1, 2, 3. In this case the covariates from the two stages are identical.

Model 4. We use the same setting as Model 3 except that we set x2i = .8x1i + .2Ui, i = 1, 2, 3 where Ui is randomly generated from U[−1,1]. In this case covariates from the two stages are different and correlated.

4.1. Estimation and classification performance

We first examine the performance of the estimated coefficient parameters, the corresponding value functions and the classification accuracy.

For stage t, given a sample size n, we repeat the simulation 2000 times and compute the coverage rate $CR_{tj}$, defined as the proportion of replications in which $[\hat{\beta}_{tj} - 1.96\,\hat{\sigma}_{tjj},\ \hat{\beta}_{tj} + 1.96\,\hat{\sigma}_{tjj}]$ covers the true parameter $\beta_{tj}$, for j = 0, …, p, where $\hat{\sigma}_{tjj}$ is the (j, j)th element of $\hat{\Sigma}_t$. $CR_{V_t}$ is defined similarly for the coverage rate of the value function. A validation set with 100,000 observations is simulated to compute the oracle values and assess the performance of our approach.
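One replication of this coverage calculation can be organized as in the sketch below; `simulate_model()`, `fit_entropy()`, and `sandwich_se()` are hypothetical wrappers around a data generator, the weighted logistic fit of Step 1, and the $H_t^{-1}\hat{\Gamma}_t H_t^{-1}$ computation shown earlier, and `beta_true` denotes the oracle parameters obtained from the large validation set.

```r
# Sketch of one Monte Carlo replication for the coverage rates (hypothetical
# helpers: simulate_model(), fit_entropy(), sandwich_se(); beta_true holds the
# oracle parameters computed from a large validation set).
one_rep <- function(n, beta_true) {
  dat <- simulate_model(n)                    # generate one data set
  fit <- fit_entropy(dat)                     # weighted logistic fit (Step 1)
  se  <- sandwich_se(fit, dat)                # sqrt(diag(Sigma_hat) / n)
  abs(coef(fit) - beta_true) <= 1.96 * se     # coverage indicator per coefficient
}
# Coverage rates over 2000 replications:
# rowMeans(replicate(2000, one_rep(200, beta_true)))
```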

We set the sample size to n = 50, 100, 200, 400, and 800. Coverage rates under Models 1–4 are given in Tables 1 and 2. For each replication under each model, we also compute the misclassification rate in each stage. Figure 2 gives the boxplots of the misclassification rates over 2000 replications for all four models.

Table 1:

Coverage rates of the expected value function and coefficient parameters under Models 1 and 2.

Model 1 Model 2
n CRV1 CR10 CR11 CR12 CR13 CRV1 CR10 CR11 CR12 CR13
50 0.927 0.948 0.950 0.938 0.945 0.946 0.944 0.937 0.931 0.924
100 0.936 0.950 0.947 0.949 0.944 0.942 0.947 0.949 0.945 0.940
200 0.942 0.954 0.947 0.955 0.952 0.951 0.950 0.950 0.953 0.947
400 0.940 0.949 0.960 0.954 0.944 0.946 0.963 0.952 0.949 0.933
800 0.944 0.944 0.953 0.947 0.943 0.951 0.955 0.952 0.954 0.943

Table 2:

Coverage rates of the expected value function and coefficient parameters under Models 3 and 4.

Model 3 Stage 1 Stage 2
n CRV1 CR10 CR11 CR12 CR13 CRV2 CR20 CR21 CR22 CR23
50 0.872 0.946 0.937 0.945 0.947 0.912 0.949 0.939 0.951 0.951
100 0.928 0.949 0.956 0.953 0.948 0.941 0.952 0.956 0.954 0.94J
200 0.936 0.947 0.942 0.942 0.951 0.950 0.950 0.946 0.948 0.935
400 0.941 0.943 0.948 0.943 0.950 0.943 0.948 0.952 0.948 0.956
800 0.957 0.944 0.955 0.945 0.941 0.954 0.939 0.951 0.952 0.952
Model 4 Stage 1 Stage 2
50 0.865 0.948 0.944 0.941 0.947 0.908 0.942 0.948 0.942 0.942
100 0.908 0.951 0.939 0.954 0.940 0.942 0.955 0.943 0.947 0.949
200 0.941 0.940 0.943 0.951 0.948 0.948 0.948 0.954 0.954 0.951
400 0.945 0.944 0.946 0.956 0.952 0.948 0.943 0.951 0.947 0.950
800 0.954 0.949 0.946 0.957 0.953 0.951 0.950 0.963 0.952 0.950

Figure 2: Boxplot of misclassification rates over 2000 replications.

From Tables 1 and 2 we observe that the coverage rates are close to the nominal level (95%) and improve as the sample size increases, suggesting that the asymptotic normal approximation for our estimators works well. In particular, the coverage rates for the coefficient parameter estimators are very close to 95% even when the sample size is as small as 50. The boxplots in Figure 2 also indicate that the misclassification rate of the estimated decision rule decreases towards zero as the sample size increases.

Note that the ultimate goal of dynamic treatment regimes is to maximize the value functions. We next compare our Entropy learning with Q-learning and Outcome-weighted learning in terms of value function estimation. Throughout this paper, Q-learning and Outcome-weighted learning are implemented using the R package “DTRlearn”. In addition to Models 1–4, we also consider the following nonlinear cases.

Model 5. x1,x2,x3 are generated independently and uniformly in [−1, 1]. We generate the reward R = Q(X, A) + ϵ with Q(X, A) = [−T(X) (A+1) + 2 log(1 + exp(T(X)))]−1, where T(X) = (x1x2 + 2x1x2).

Model 6. Same as Model 5 except that x1,x2, x3 are discrete variables generated independently and uniformly in {−1, 0, 1}.

Model 7. Stage 1 outcome R1 is generated as follow: R1 = [0.2 − T1(X1)(A1 + 1) + 2 log(1 + T1(X1))]−1 + ϵ where T1(X1)=x11x12+2x132+2x11x12 with x11, x12, x13 generated independently from a uniform distribution in [−1, 1]. Stage 2 outcome R2 is generated by R2 = [0.05 + (1 + A2) (1 + A1)/4 − T2 (X2)(A2 + 1) + 2 log(1 + T2 (X2))]−1 + ϵ where x2i = x1i, i = 1, 2, 3 and T2(X2)=x21x22+2x232+2x21x22.

Model 8. Same as Model 7 except that x11,x12,x13 are discrete variables generated independently and uniformly in {−1, 0, 1}.

For each model, we generate 200 random samples, and the corresponding estimated treatment rules are then used to compute the value function using (2.3) with a validation set of size n = 500,000. The above procedure is repeated 100 times and the results are reported in Table 3.
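As a sketch of this evaluation step, the empirical value of an estimated single-stage rule on a validation sample can be computed as below; `valid` is a hypothetical validation data frame with the same columns as `dat` in the earlier sketch and with P(A = 1) = 0.5, and `beta_hat` is the fitted coefficient vector.

```r
# Sketch: empirical value of an estimated rule on a validation sample (the
# data frame `valid` is hypothetical; beta_hat comes from the earlier fit).
beta_hat <- coef(fitT)
d_hat <- sign(drop(cbind(1, valid$x1, valid$x2, valid$x3) %*% beta_hat))
value <- mean(valid$R * (valid$A == d_hat) / (valid$A * 0.5 + (1 - valid$A) / 2))
```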

Table 3:

Comparison of value functions using Entropy learning (E-learning), Q-learning and Outcome-weighted learning (OW-Learning) under Models 1–8.

Model E-Learning Q-Learning OW-Learning
Model 1 10.2(0.1) 10.3(0.0) 10.3(0.0)
Model 2 9.4(0.1) 9.4(0.0) 9.4(0.0)
Model 3 Stage 2 3.7(0.1) 3.7(0.0) 3.7(0.0)
Model 3 Stage 1 14.5(0.4) 15.0(0.0) 15.0(0.0)
Model 4 Stage 2 3.6(0.1) 3.6(0.0) 3.6(0.00)
Model 4 Stage 1 14.5(0.6) 15.0(0.0) 15.0(0.0)
Model 5 1.8(0.0) 1.7(0.0) 1.8(0.0)
Model 6 4.8(0.1) 4.1(0.1) -(-)
Model 7 Stage 2 1.5(0.0) 1.5(0.0) 1.5(0.0)
Model 7 Stage 1 1.1(0.1) 1.0(0.0) 1.1(0.1)
Model 8 Stage 2 3.0(0.1) 2.8(0.2) -(-)
Model 8 Stage 1 1.9(0.3) 0.9(0.2) -(-)

From Table 3 we note that the value functions of our Entropy learning method are comparable to those of Q-learning and Outcome-weighted learning under Models 1–4. Under Models 5 and 7, where the true treatment regimes are nonlinear, the value functions of Entropy learning and Outcome-weighted learning are very close and appear slightly better than those of Q-learning. However, when we consider discrete covariates in Models 6 and 8, Outcome-weighted learning can hardly produce any result due to a very large condition number in solving a system of equations.

4.2. Testing $X_t^{*\top}\beta_{t0} = 0$

In the dynamic treatment regime literature, the non-regularity condition $P(X_t^{*\top}\beta_{t0} = 0) = 0$ is usually required (for example, in Q-learning) to enable parameter inference. Here we study the performance of testing $X_t^{*\top}\beta_{t0} = 0$ based on our entropy learning approach.

  • Case 1: testing $X^{*\top}\beta_{10} = 0$ under Model 1. Let $X^* = (1, x_1, x_2, x_3)^\top$ be the covariate vector of a new observation and $\beta_{10} = (\beta_{100}, \beta_{110}, \beta_{120}, \beta_{130})^\top$ the true parameters. By setting $x_1 = x_3 = 1$ and $x_2 = -(\beta_{100} + x_1\beta_{110} + x_3\beta_{130})/\beta_{120}$ we have $X^{*\top}\beta_{10} = 0$.

  • Case 2: testing $X_1^{*\top}\beta_{10} = 0$ under Model 4. We set $x_{11} = x_{13} = -1$ and $x_{12} = -(\beta_{100} + x_{11}\beta_{110} + x_{13}\beta_{130})/\beta_{120}$.

We set n = 50, 100, 200, 400. Note that, asymptotically,

$$X_t^{*\top}\hat{\beta}_t \sim N\left(X_t^{*\top}\beta_{t0},\ \frac{1}{n}\, X_t^{*\top} I_t(\beta_{t0})^{-1}\Gamma_t I_t(\beta_{t0})^{-1} X_t^{*}\right).$$

We use $\frac{1}{n}\, X_t^{*\top}\hat{I}_t(\hat{\beta}_t)^{-1}\hat{\Gamma}_t\hat{I}_t(\hat{\beta}_t)^{-1} X_t^{*}$ to estimate the variance of $X_t^{*\top}\hat{\beta}_t$, where $\hat{I}_t$ and $\hat{\Gamma}_t$ are the empirical estimators of $I_t$ and $\Gamma_t$. For each case we run the simulation 1000 times, and for each replication we compute the p-value of the test based on $X_t^{*\top}\hat{\beta}_t$. P-value plots are given in Figures 3 and 4. We can see that the distribution of the p-values is well fitted by the uniform distribution on [0, 1], indicating that our tests perform well in detecting non-significant treatment effects.

Figure 3: P-values of $X^{*\top}\hat{\beta}_1$ under Case 1 over 1000 replications.

Figure 4: P-values of $X_1^{*\top}\hat{\beta}_1$ under Case 2 over 1000 replications.

4.3. Type I error comparison with Q-learning

We next assess the performance of hypothesis tests, since it is often of interest to investigate the significance of coefficient parameters. Note that in Models 3 and 4 we have $\beta_{13} = \beta_{23} = 0$. We therefore compute the type I error for testing $\beta_{13} = 0$ and $\beta_{23} = 0$. In the optimization problems (2.7) and (2.8), the decisions enter through the weights of a weighted negative log-likelihood. Consequently, unlike Q-learning (Zhao et al., 2009), the objective functions for the parameter estimation are continuous, and parameter inference is feasible even without the non-regularity condition. For comparison, we compute the same quantities using the bootstrap scheme for Q-learning. We remark that the $\beta_{tj}$'s under Entropy learning and under Q-learning are generally different, but in Models 3 and 4, $x_{13}$ and $x_{23}$ are not involved in the treatment selection part, and hence the corresponding true coefficients are zero under both entropy learning and Q-learning. The significance level α is set to 0.05 and we consider n = 50, 100, 400, 800. The simulation is repeated 2000 times and the results are given in Table 4. From Table 4 we see that most of the type I errors using entropy learning are closer to the nominal α = 0.05, which indicates that our learning method can be more appropriate for testing the significance of covariates.

Table 4:

Type I error comparison using Entropy learning and Q-learning, where “Elearn” refers to Entropy learning and “Qlearn” refers to Q-learning.

Model 3 Model 4
H0 : β13 = 0 H0 : β23 = 0 H0 : β13 = 0 H0 : β23 = 0
n Elearn Qlearn Elearn Qlearn Elearn Qlearn Elearn Qlearn
50 0.063 0.069 0.050 0.057 0.060 0.054 0.055 0.056
100 0.044 0.063 0.054 0.056 0.044 0.057 0.043 0.055
400 0.049 0.043 0.055 0.043 0.047 0.053 0.047 0.046
800 0.050 0.059 0.044 0.064 0.047 0.053 0.054 0.055

5. Application to STAR*D

We consider a real data example extracted from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study funded by the National Institute of Mental Health. STAR*D is a multisite, prospective, randomized, multistep clinical trial of outpatients with nonpsychotic major depressive disorder; see Rush et al. (2004) and Sinyor et al. (2010) for more study details. The complete trial involved four sequential treatment stages (or levels), and patients were encouraged to participate in the next level of treatment if they failed to achieve remission or adequate reduction in symptoms.

Patients who did not experience a remission of symptoms during the first level of the STAR*D study, in which they initially took the antidepressant citalopram, a selective serotonin reuptake inhibitor (SSRI), for up to 14 weeks, had the option of continuing to level 2 of the trial, where they could explore additional treatment options designed to help them become symptom-free (Rush et al., 2006). Because there was only a single treatment for all patients in level 1, we will not discuss such data in this paper.

Level 2 of the study offered seven different treatments: four options “switched” the study participants from citalopram to a new medication or talk therapy, and three options “augmented” the citalopram treatment by adding a new medication or talk therapy to the citalopram they were already receiving. Data taken from level 2 will be treated as the first-stage observations in this paper, and we define A1 = −1 if the treatment option is a switch and A1 = 1 if the treatment option is an augmentation.

During levels 1 and 2 of the STAR*D trial, which started with 2,876 participants, about half of all patients became symptom-free. The other half were then eligible to enter level 3, where, as in level 2, patients were given the choice of either switching medications or adding another medication to their existing medication (Fava et al., 2006). Data taken from level 3 of this trial will be treated as the second-stage observations in this paper, and we define A2 = −1 if the treatment option is a switch and A2 = 1 if the treatment option is an augmentation.

After removing cases with missing values from the data files, we obtain a sample of 316 patients whose medical information from the two stages is available. Among these 316 patients, 119 and 197 were respectively assigned to the augmentation group and the switch group in Stage 1, and 115 and 201 were respectively assigned to the augmentation group and the switch group in Stage 2. The 16-item Quick Inventory of Depressive Symptomatology-Self-Report (QIDS-SR(16)) scores were obtained at treatment visits for the patients and are considered the primary outcome variable in this paper. To accommodate our model, in which the reward is positive and “the larger the better”, we used R = c − QIDS-SR(16) as the reward at each level, where c is a constant that bounds the empirical QIDS-SR(16) scores. In this study we simply set c = 30 so that all rewards are positive.

Following earlier analysts (e.g., Kuk et al., 2010, 2014), we consider the following set of clinically meaningful covariates: (i) chronic depression indicator: equals 1 if chronic episode > 2 years and 0 otherwise; (ii) gender: male = 0 and female = 1; (iii) patient age (years); (iv) general medical condition (GMC), defined to be 1 for the presence of one or more general medical conditions and 0 otherwise; (v) anxious feature, defined to be 1 if the Hamilton Depression Rating Scale anxiety/somatization factor score is ≥ 7 and 0 otherwise (Fava et al., 2008). In addition, we also consider (vi) week: the number of weeks patients had spent in the corresponding stage when the QIDS-SR(16) scores at exit were determined, and (vii) the baseline QIDS-SR(16) scores at the corresponding stages. Summary statistics of these covariates are given in Table 5.

Table 5:

Summary statistics for the covariates in the STAR*D study: for continuous variables, we report means and standard deviations; for dichotomous variables, we report proportions and standard deviations.

Chronic Gender Age GMC
Stage 1 Stage 2 Stage 1 Stage 2 Stage 1 Stage 2 Stage 1
Switch 0.29 (0.03) 0.29 (0.03) 0.51 (0.04) 0.46 (0.04) 43.99 (0.88) 45.78 (0.84) 0.59 (0.04)
Augmentation 0.26 (0.04) 0.26 (0.04) 0.49 (0.05) 0.57 (0.05) 44.76 (1.05) 41.65 (1.11) 0.56 (0.05)
GMC Anxiety Week QIDS-SR(16)
Stage 2 Stage 1 Stage 2 Stage 1 Stage 2 Stage 1 Stage 2
Switch 0.62 (0.03) 0.76 (0.03) 0.74 (0.03) 9.21 (0.30) 7.48 (0.34) 14.96 (0.29) 14.54 (0.31)
Augmentation 0.51 (0.05) 0.70 (0.04) 0.73 (0.04) 9.64 (0.40) 9.35 (0.46) 13.45 (0.37) 12.77 (0.37)

We applied the methods introduced in this paper to estimate the covariate effects on the optimal treatment allocation for the patients in this study. The fitted results under the entropy learning approach are given in Table 6. From Table 6 we notice that the baseline QIDS-SR(16) score is a significant predictor for determining whether the patient should be treated with the switch option or the augmentation option in both stages. More specifically, given other covariates, if the patient has a higher baseline score, adopting the switch option might yield a better medical outcome. In addition, for the Stage 2 analysis, baseline score, gender, age and the treatment time are all significant in determining the best treatment options. Interestingly, the treatment time is significant with a positive sign, indicating that, given other covariates, treatment augmentation might benefit the patients for a longer term.

Table 6:

Entropy learning for the STAR*D study.

Stage 1 Stage 2
coefficient(sd) p-value coefficient(sd) p-value
Entropy learning
intercept 0.855 (0.987) 0.386 0.452 (0.792) 0.569
chronic −1.231 (0.455) 0.007 0.103 (0.314) 0.742
gender −0.604 (0.340) 0.859 0.702 (0.269) 0.009
age 0.001 (0.016) 0.950 −0.028 (0.012) 0.022
gmc 0.089 (0.359) 0.805 −0.121 (0.274) 0.658
anxious 0.095 (0.373) 0.799 0.235 (0.298) 0.431
week 0.066 (0.036) 0.071 0.089 (0.029) 0.002
qctot −0.084 (0.044) 0.056 −0.111 (0.034) 0.001
A1 - - 0.925 (0.273) 0.001
V^i 59.617 (5.485) - 25.697 (1.325) -

For comparison, using the same sets of covariates, estimation results based on Q-learning are given in Table 7, where the estimated confidence intervals are obtained from the bootstrap procedure. From Table 7, we note that gender is identified as the only important factor for treatment selection at stage 2. Such an existing method may be less powerful than our proposed entropy learning, since it may miss potentially useful markers. Consequently, Q-learning may not be able to achieve the most appropriate treatment allocation using a set of important personalized characteristics identified from a significance study. To compare the performance of the proposed method with Q-learning in terms of the value function, we also compute the estimated mean and standard deviation of the value functions under the fitted regimes obtained by our method and by Q-learning; see the $\hat{V}_t$ values in Tables 6 and 7. We observe larger mean value functions for our entropy learning approach, indicating that the treatment regime obtained by our approach outperforms that of Q-learning on this data set.

Table 7:

Bootstrap confidence intervals of Q-learning for the STAR*D study. Lower: lower bound of the 95% confidence interval; Upper: upper bound of the 95% confidence interval.

Stage 1 Stage 2
coefficient Lower Upper coefficient Lower Upper
Q-learning
intercept 0.99 −4.28 5.50 −2.17 −5.32 0.73
chronic −0.48 −2.31 1.33 −0.63 −1.75 0.48
gender 0.66 −0.80 2.24 1.30 0.37 2.28
age −0.03 −0.09 0.04 0.02 −0.03 0.07
gmc 0.06 −1.48 1.59 0.26 −0.83 1.39
anxious 1.35 −0.32 3.00 0.62 −0.45 1.65
week −0.14 −0.31 0.04 −0.07 −0.16 0.02
qctot −0.06 −0.24 0.14 −0.02 −0.16 0.11
A1 - - - 0.11 −0.44 0.66
V^i 40.34 32.08 48.60 20.54 17.83 23.25

We note from our experience that the entropy learning approach may be incorrectly interpreted by some practitioners. The fitted weighted regression model should not be confused with an ordinary association study resulting from fitting unweighted logistic regression models to the two-stage data (see Table 8). In fact, the significant findings from Table 8 only establish how covariates affect the likelihood of being observed in a treatment, rather than the likelihood of being allocated to the most appropriate treatment.

Table 8:

Ordinary association study for the STAR*D data using logistic regression models.

Stage 1 Stage 2
coefficient(sd) p-value coefficient(sd) p-value
intercept 0.210 (0.740) 0.777 0.182 (0.772) 0.813
chronic −0.183 (0.279) 0.511 0.141 (0.296) 0.635
gender 0.012 (0.242) 0.961 0.537 (0.260) 0.039
age 0.010 (0.011) 0.352 −0.030 (0.012) 0.012
gmc −0.093 (0.259) 0.719 −0.219 (0.275) 0.425
anxious −0.117 (0.275) 0.671 0.096 (0.295) 0.744
week 0.032 (0.029) 0.269 0.098 (0.027) < 0.001
qctot −0.091 (0.030) 0.003 −0.104 (0.031) 0.001
A1 - - 0.851 (0.260) 0.001

Finally, since the original design at level 2 of the STAR*D trial was an equipoise-stratified design, one potential source of confounding could be patients' preference for the strata in the design. A further examination of this issue would be to consider including patients' preference in the treatment estimation strategies, provided we trust that patients' selection of treatment options (between switch and augmentation) within each stratum is random, as was apparently assumed in the original equipoise-stratified design (Sinyor et al., 2010).

6. Discussion

There are many open questions that may follow the development in this paper. First, the linear specification of the treatment allocation rule may be replaced by a nonparametric formulation such as a partly linear model or an additive regression model. The implementation of such methods is now widely available in standard statistical packages. More effort is still needed to establish theoretical properties similar to those in this paper and to retain interpretable results.

Second, to carry out a clinical study and select the best treatment using our approach, it is necessary to evaluate the required sample size at the design stage. Applying the theoretical results obtained in this paper, we may proceed to calculate the total number of subjects for every treatment group. However, more empirical studies on various settings and data distributions would provide stronger support for such sample-size suggestions based on asymptotic results.

Finally, missing values are quite common in multi-stage analyses. Most analysts follow the standard practice of excluding cases with missing observations under the missing-at-random assumption. It is a difficult task to investigate the reasons for missing data, and an even more difficult task to address the problem when data are not missing at random. We encourage more research in this direction.


7. Acknowledgement

We thank the Editor, the Associate Editor and two reviewers for instructive comments. The work was partly supported by Academic Research Funds R-155-000-205-114, R-155-000-195-114 and Tier 2 MOE funds in Singapore MOE2017-T2-2-082: R-155-000-197-112 (Direct cost) and R-155-000-197-113 (IRC).

Footnotes

Supplementary Materials

Title: Supplementary Material for “Entropy Learning for Dynamic Treatment Regimes”. Technical proofs for the propositions and theorems are provided in the supplementary file.

References

  1. Bartlett P, Jordan M & McAuliffe J (2006). Convexity, classification, and risk bounds. J. Am. Statist. Assoc. 101, 138–156.
  2. Chakraborty B, Murphy S & Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 19, 317–343.
  3. Fava M et al. (2006). A comparison of mirtazapine and nortriptyline following two consecutive failed medication treatments for depressed outpatients: a STAR*D report. Am. J. Psychiatry 163, 1161–1172.
  4. Fava M et al. (2008). Difference in treatment outcome in outpatients with anxious versus nonanxious depression: a STAR*D report. Am. J. Psychiatry 165, 342–351.
  5. Goldberg Y & Kosorok M (2012). Q-learning with censored data. Ann. Statist. 40, 529.
  6. Hjort N & Pollard D (2011). Asymptotics for minimisers of convex processes. ArXiv preprint arXiv:1107.3806.
  7. Kuk A, Li J & Rush A (2010). Recursive subsetting to identify patients in the STAR*D: a method to enhance the accuracy of early prediction of treatment outcome and to inform personalized care. J. Clin. Psychiatry 71, 1502–1508.
  8. Kuk A, Li J & Rush A (2014). Variable and threshold selection to control predictive accuracy in logistic regression. J. R. Statist. Soc. C 63, 657–672.
  9. Laber E, Lizotte D, Qian M, Pelham W & Murphy S (2014). Dynamic treatment regimes: technical challenges and applications. Electron. J. Stat. 8, 1225–1272.
  10. Luedtke A & van der Laan M (2016). Super-learning of an optimal dynamic treatment rule. Int. J. Biostat. 12, 305–332.
  11. Murphy K (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  12. Murphy S (2003). Optimal dynamic treatment regimes. J. R. Statist. Soc. B 65, 331–366.
  13. Murphy S (2005). A generalization error for Q-learning. J. Mach. Learn. Res. 6, 1073–1097.
  14. Qian M & Murphy S (2011). Performance guarantees for individualized treatment rules. Ann. Statist. 39, 1180.
  15. Robins J, Hernan M & Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560.
  16. Rush A et al. (2004). Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Control. Clin. Trials 25, 119–142.
  17. Rush A et al. (2006). Bupropion-SR, sertraline, or venlafaxine-XR after failure of SSRIs for depression. N. Engl. J. Med. 354, 1231–1242.
  18. Sinyor M, Schaffer A & Levitt A (2010). The sequenced treatment alternatives to relieve depression (STAR*D) trial: a review. Can. J. Psychiatry 55, 126–135.
  19. Song R, Wang W, Zeng D & Kosorok M (2015). Penalized Q-learning for dynamic treatment regimens. Stat. Sin. 25, 901–920.
  20. Watkins C & Dayan P (1992). Q-learning. Mach. Learn. 8, 279–292.
  21. Zhao Y, Kosorok M & Zeng D (2009). Reinforcement learning design for cancer clinical trials. Stat. Med. 28, 3294–3315.
  22. Zhang B, Tsiatis A, Laber E & Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.
  23. Zhang B, Tsiatis A, Laber E & Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100, 681–694.
  24. Zhao Y, Zeng D, Laber E & Kosorok M (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. J. Am. Stat. Assoc. 110, 583–598.
  25. Zhao Y, Zeng D, Rush A & Kosorok M (2012). Estimating individualized treatment rules using outcome weighted learning. J. Am. Stat. Assoc. 107, 1106–1118.
