
Entropy Learning for Dynamic Treatment Regimes

Binyan Jiang 1, Rui Song 2, Jialiang Li 3, Donglin Zeng 4

Abstract

Estimating optimal individualized treatment rules (ITRs) in single- or multi-stage clinical trials is one key solution to personalized medicine and has received increasing attention in the statistical community. Recent developments suggest that machine learning approaches can significantly improve the estimation over model-based methods. However, proper inference for the estimated ITRs has not been well established for machine learning based approaches. In this paper, we propose an entropy learning approach to estimate the optimal individualized treatment rules. We obtain the asymptotic distributions of the estimated rules and can therefore provide valid inference. The proposed approach is demonstrated to perform well in finite samples through extensive simulation studies. Finally, we analyze data from a multi-stage clinical trial for depression patients. Our results offer novel findings that are otherwise not revealed with existing approaches.

Keywords: Dynamic treatment regime, entropy learning, personalized medicine

1. Introduction

One important goal in personalized medicine is to develop a decision support system that provides adequate management for individual patients with specific diseases. Estimating individualized treatment rules (ITRs) using evidence from single- or multi-stage clinical trials provides the key to developing such a system, and the development of powerful estimation methods has received increasing attention in the statistical community. Early methods for estimating ITRs include Q-learning (Watkins and Dayan, 1992; Murphy, 2005; Chakraborty et al., 2010; Goldberg and Kosorok, 2012; Laber et al., 2014; Song et al., 2015) and A-learning (Robins et al., 2000; Murphy, 2003), where Q-learning models the conditional mean of the outcome given historical covariates and treatment using a well-constructed statistical model, and A-learning models the contrast function that is sufficient for treatment decisions.

Recently, Zhao et al. (2012) discovered that it is possible to cast the estimation of the optimal regime as a weighted classification problem. Thus, Zhao et al. (2012, 2015) proposed outcome weighted learning (OWL) to directly optimize the approximate expected clinical outcome, where the objective function is a hinge loss weighted by individual outcomes. Although OWL has been shown to outperform the model-based Q- and A-learning approaches in numerical studies, and its asymptotic behavior might be established owing to the convexity of the loss (Hjort and Pollard, 2011), there is no valid inference procedure for the parameters in the optimal treatment rules due to the non-differentiability of the hinge loss near the decision boundary, and the minimization operator is more or less heuristic.

In this paper, we propose a class of smooth-loss based outcome weighted learning methods to estimate the optimal ITRs, among which one special case of the proposed losses is a weighted entropy loss (Murphy, 2012). By using continuously differentiable loss functions, we not only maintain the Fisher consistency of the derived treatment rule, but are also able to obtain proper inference for the parameters in the derived rule. We can further quantify the uncertainty of the value function under the estimated treatment rule, which is potentially useful for designing future trials and for comparisons with other non-optimal treatment rules. Numerically, compared with existing inference procedures for model-based approaches, such as the bootstrap approach for Q-learning, our inference procedure requires no tuning parameters and yields more accurate inference in finite-sample numerical studies.

We notice that Bartlett et al. (2006) established profound conceptual work on classification losses in a quite general setting. However, linking such work to recursive or dynamic optimization is not trivial; we work with a logistic loss to achieve this. Luedtke and van der Laan (2016) attempted to unify surrogate loss functions for outcome-dependent learning. Their way of showing validity differs from our derivation: our justification is more intuitive and our computing algorithm is also different. While super learning is a quite general and powerful method, logistic regression can be implemented easily and fits our needs directly. Moreover, asymptotic properties of our estimators are established for conducting proper inference, which is not addressed in the two aforementioned works.

The paper is structured as follows. In Section 2, we introduce the proposed entropy learning method for single- and multi-stage settings. In Section 3, we provide the asymptotic properties of our estimators. In Section 4, simulation studies are conducted to assess the performance of our methods. In Section 5, we apply the entropy learning to the well-known STAR*D study. We conclude the paper in Section 6. Technical proofs are provided in the supplementary materials.

2. Method

2.1. Smooth surrogate loss for outcome weighted learning

To motivate our approach of choosing a smooth surrogate loss for learning the optimal ITRs, we first consider data from a single-stage randomized trial with two treatment arms. Treatment assignment is denoted by $A \in \mathcal{A} = \{-1, 1\}$. The patient's prognostic variables are denoted by a p-dimensional vector X. We use R to denote the observable clinical outcome, also called the reward, and assume that R is positive and bounded from above, with larger values of R being more desirable. The data consist of $\{(X_i, A_i, R_i): i = 1, \ldots, n\}$.

For a given treatment decision rule D, which maps X to {−1, 1}, we denote by $E^{D}$ the expectation with respect to the distribution of (X, A, R) given that A = D(X). Then an optimal treatment rule is a rule that maximizes the value function

$$E^{D}(R) = E\left\{\frac{R\, I(A = D(X))}{A\pi + (1 - A)/2}\right\}, \qquad (2.1)$$

where π = P(A = 1|X). Following Qian and Murphy (2011), it can be shown that the maximization problem is equivalent to the problem of minimizing

$$E\left\{\frac{R\, I(A \neq D(X))}{A\pi + (1 - A)/2}\right\}. \qquad (2.2)$$

The latter is a weighted classification error and can thus be estimated from the observed sample by

$$n^{-1}\sum_{i=1}^{n}\left\{\frac{R_i\, I(A_i \neq D(X_i))}{A_i\pi + (1 - A_i)/2}\right\}. \qquad (2.3)$$

Due to the discontinuity and non-convexity of the 0–1 loss on the right-hand side of (2.2), direct minimization of (2.3) is difficult and parameter inference is infeasible. To address this challenge, the hinge loss from the support vector machine (SVM) was previously proposed as a substitute for the 0–1 loss (Zhao et al., 2012, 2015). However, due to the non-differentiability of the hinge loss, inference remains challenging. This motivates us to seek a smoother surrogate loss function for estimation.

Consider an arbitrary surrogate loss $h(a, y): \{-1, 1\} \times \mathbb{R} \to \mathbb{R}$. Then by replacing the 0–1 loss by this surrogate loss, we estimate the treatment rule by minimizing

$$\mathcal{R}_h(f) = E\left\{\frac{R\, h(A, f(X))}{A\pi + (1 - A)/2}\right\}. \qquad (2.4)$$

To evade the non-convexity, we require that h(a, y) be convex in y. Furthermore, simple algebra gives

$$E\left\{\frac{R}{A\pi + (1 - A)/2}\, h(A, f(X)) \,\Big|\, X = x\right\} = E[R \mid X = x, A = 1]\, h(1, f(x)) + E[R \mid X = x, A = -1]\, h(-1, f(x)) = a_x h(1, f(x)) + b_x h(-1, f(x)),$$

where $a_x = E[R \mid X = x, A = 1]$ and $b_x = E[R \mid X = x, A = -1]$. Hence, for any given x, the minimizer of f(x), denoted by $y_x$, solves the equation

$$a_x h'(1, y) + b_x h'(-1, y) = 0,$$

where $h'(a, y)$ is the first derivative of $h(a, y)$ with respect to y. To ensure that the surrogate loss still leads to the correct optimal rule, which is equivalent to $\operatorname{sgn}(a_x - b_x)$, we require that this solution have the same sign as $a_x - b_x$. On the other hand, since $a_x h'(1, y) + b_x h'(-1, y)$ is non-decreasing in y, we conclude that for $a_x > b_x$, if $a_x h'(1, 0) + b_x h'(-1, 0) \le 0$, then the solution $y_x$ should be positive; while for $a_x < b_x$, if $a_x h'(1, 0) + b_x h'(-1, 0) \ge 0$, then the solution $y_x$ should be negative. In other words, one sufficient condition to ensure Fisher consistency is

$$(a_x - b_x)\left(a_x h'(1, 0) + b_x h'(-1, 0)\right) \le 0.$$

However, since $a_x$ and $b_x$ can be arbitrary non-negative values, this condition holds if and only if

$$h'(1, 0) = -h'(-1, 0) \quad \text{and} \quad h'(1, 0) \le 0.$$

In conclusion, the choice of h(a, y) should satisfy

  • (I) For a = −1 and 1, h(a, y) is twice differentiable and convex in y;

  • (II) h′(1, 0) = −h′(−1, 0) and h′(1, 0) ≤ 0.

There are many loss functions that satisfy the above two conditions. In particular, we can consider loss functions of the form $h(a, y) = -ay + g(y)$. The first condition then holds automatically if g is twice differentiable and convex. Moreover, since $h'(1, 0) = -1 + g'(0)$ and $h'(-1, 0) = 1 + g'(0)$, both parts of the second condition hold if we choose g such that $g'(0) = 0$. One special case is to choose

$$g(y) = 2\log\left(1 + \exp(y)\right) - y,$$

and the corresponding loss function is

$$h(a, y) = -(a + 1)y + 2\log\left(1 + \exp(y)\right),$$

which corresponds to the entropy loss for logistic regression (Figure 1). In our subsequent development, we will focus on using this loss function, although the results apply to any general smooth loss satisfying those two conditions. Correspondingly, (2.4) becomes:

$$\mathcal{R}(f) = E\left\{\frac{R}{A\pi + (1 - A)/2}\left[-0.5(A + 1) f(X) + \log\left(1 + \exp(f(X))\right)\right]\right\}. \qquad (2.5)$$

Figure 1: Comparison of loss functions.
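As a quick numerical check of conditions (I) and (II), the short R sketch below (purely illustrative values; not code from the paper) evaluates the entropy surrogate $h(a, y) = -(a+1)y + 2\log(1+\exp(y))$, confirms $h'(1, 0) = -h'(-1, 0) = -1$, and verifies that the population minimizer $y_x$ shares the sign of $a_x - b_x$ for one hypothetical choice of $(a_x, b_x)$.

```r
# Sketch (illustrative, not from the paper): check conditions (I)-(II) for the
# entropy surrogate h(a, y) = -(a + 1) y + 2 log(1 + exp(y)).
h      <- function(a, y) -(a + 1) * y + 2 * log(1 + exp(y))
hprime <- function(a, y) -(a + 1) + 2 * exp(y) / (1 + exp(y))

hprime(1, 0)    # -1, so h'(1, 0) <= 0
hprime(-1, 0)   #  1, equals -h'(1, 0)

# The minimizer y_x solves a_x h'(1, y) + b_x h'(-1, y) = 0 and should share
# the sign of a_x - b_x; check for one hypothetical pair (a_x, b_x).
a_x <- 2; b_x <- 1
y_x <- uniroot(function(y) a_x * hprime(1, y) + b_x * hprime(-1, y),
               lower = -10, upper = 10)$root
sign(y_x) == sign(a_x - b_x)   # TRUE; here y_x = log(a_x / b_x)
```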

2.2. Learning optimal ITRs using the entropy loss

Now suppose the randomized trial involves T stages, where patients might receive different treatments across the multiple stages. With some abuse of notation, we use $X_t$, $R_t$ and $A_t$ to denote the set of covariates, the clinical outcome and the corresponding treatment at stage t = 1, ⋯, T, and let $S_t = (X_1, A_1, \cdots, X_{t-1}, A_{t-1}, X_t)$ be the history up to stage t.

A dynamic treatment regime (DTR) is a sequence of deterministic decision rules, $d = (d_1, \cdots, d_T)$, where $d_t$ is a map from the space of history information $S_t$, denoted by $\mathcal{S}_t$, to the action space of available treatments $\mathcal{A}_t = \{-1, 1\}$. The optimal DTR maximizes the expected total value function $E^{d}\left(\sum_{t=1}^{T} R_t\right)$, where the expectation is taken with respect to the distribution of $(X_1, A_1, R_1, \cdots, X_T, A_T, R_T)$ given the treatment assignments $A_t = d_t(S_t)$.

DTRs aim to maximize the expected cumulative rewards, and hence the optimal treatment decision at the current stage must depend on subsequent decision rules. This motivates a backward recursive procedure that first estimates the optimal decision rules at future stages, and then the optimal decision rule at the current stage by restricting the analysis to the subjects who have followed the estimated optimal decision rules thereafter. Assume that we observe data $(X_{1i}, A_{1i}, R_{1i}, \cdots, X_{Ti}, A_{Ti}, R_{Ti})$, i = 1, ⋯, n, forming n independent and identically distributed patient trajectories, and let $S_{ti} = (X_{1i}, A_{1i}, \cdots, A_{t-1,i}, X_{ti})$ for i = 1, …, n and 1 ≤ t ≤ T. Denote $\pi(A_t, S_t) = A_t\pi_t + (1 - A_t)/2$, where $\pi_t = P(A_t = 1 \mid S_t)$, for t = T, …, 1. Suppose that we already possess the optimal regimes at stages t + 1, ⋯, T and denote them by $d_{t+1}^*, \ldots, d_T^*$. Then the optimal decision rule at stage t, $d_t^*(S_t)$, should maximize

$$E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j^*(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t = d_t(S_t)\right) \,\Big|\, S_t\right\},$$

where we assume all subjects have followed the optimal DTRs after stage t. Hence, $d_t^*$ is a map from $\mathcal{S}_t$ to {−1, 1} which minimizes

$$E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j^*(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t \neq d_t(S_t)\right) \,\Big|\, S_t\right\}.$$

Following (2.5), we consider the entropy learning framework where the decision function at stage t is given as

$$d_t(S_t) = 2\, I\left\{\left(1 + \exp(-f_t(X_t))\right)^{-1} > 1/2\right\} - 1 = \operatorname{sgn}\left\{f_t(X_t)\right\}, \qquad (2.6)$$

for some function $f_t(\cdot)$. Here, for simplicity, as defined in equation (2.6), the decision rule is assumed to depend on the history information $S_t$ through $X_t$ only. Although $S_t = S_{t-1} \cup \{A_{t-1}, X_t\}$, any elements of $S_{t-1}$ and $A_{t-1}$ can be included as covariates in $X_t$, and hence this assumption is not stringent at all. In particular, our method is still valid when $X_t$ is set to be $S_t$. Given the observed samples, we obtain estimators for the optimal treatments using the following backward procedure.

Step 1. Minimize

$$-\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{R_{Ti}}{\pi(A_{Ti}, S_{Ti})}\left[0.5(A_{Ti}+1)\, f_T(X_{Ti}) - \log\left(1 + \exp\left(f_T(X_{Ti})\right)\right)\right]\right\} \qquad (2.7)$$

to obtain the stage-T optimal treatment regime. This is the same as the single-stage treatment selection procedure. Let $\hat{f}_T$ be the estimator of $f_T$ obtained by minimizing (2.7). For a given $S_T$, the estimated optimal regime is then given by $\hat{d}_T(S_T) = \operatorname{sgn}\left(\hat{f}_T(X_T)\right)$.
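For a linear decision function, minimizing (2.7) is a weighted logistic regression with response $(A_T+1)/2$ and weights $R_T/\pi(A_T, S_T)$, so it can be carried out with standard software. The R sketch below illustrates this on purely synthetic single-stage data; the data-generating model, variable names, and the use of glm with quasibinomial weights are our own illustrative choices, not the authors' code.

```r
# Sketch of Step 1 (stage T; identical to the single-stage case). The data
# below are purely illustrative and are not one of the paper's simulation models.
set.seed(1)
n   <- 400
dat <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1), x3 = runif(n, -1, 1))
dat$A <- sample(c(-1, 1), n, replace = TRUE)               # randomized, P(A = 1 | X) = 0.5
dat$R <- 8 + dat$x3 + 2 * (0.5 - dat$x1 - dat$x2) * dat$A + abs(rnorm(n)) / 10
pi1 <- 0.5

w <- dat$R / (dat$A * pi1 + (1 - dat$A) / 2)               # outcome weights R / pi(A, S)
y <- (dat$A + 1) / 2                                       # recode treatment to {0, 1}

# The weighted logistic fit maximizes the weighted log-likelihood, i.e. it
# minimizes (2.7); quasibinomial avoids warnings about non-integer weights.
fitT  <- glm(y ~ x1 + x2 + x3, data = dat, weights = w, family = quasibinomial)
dhatT <- sign(predict(fitT, newdata = dat))                # estimated rule sgn{(1, x')beta}
```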

Step 2. For t = T − 1, ⋯, 1, we sequentially minimize

$$-\,n^{-1}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\left[0.5(A_{ti}+1)\, f_t(X_{ti}) - \log\left(1 + \exp\left(f_t(X_{ti})\right)\right)\right]\right\}, \qquad (2.8)$$

where $\hat{d}_{t+1}, \ldots, \hat{d}_T$ have been obtained prior to stage t. Let $\hat{f}_t$ be the estimator of $f_t$ obtained by minimizing (2.8). For a given $S_t$, the estimated optimal regime is then given by $\hat{d}_t(S_t) = \operatorname{sgn}\left(\hat{f}_t(X_t)\right)$.
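For a two-stage trial (T = 2), the backward step amounts to refitting a weighted logistic regression at stage 1, with weights that retain only trajectories consistent with the estimated stage-2 rule. A minimal R sketch, assuming a hypothetical data frame `dat2` with stage-1 covariates x11–x13, stage-2 covariates x21–x23, randomized treatments A1, A2 ∈ {−1, 1} with probability 0.5 each, and positive rewards R1, R2:

```r
# Sketch of the backward procedure for T = 2 (illustrative; `dat2` and its
# column names are assumptions, not objects from the paper).
p1 <- 0.5; p2 <- 0.5                                       # known randomization probabilities

# Stage 2: same weighted logistic regression as in Step 1.
w2    <- dat2$R2 / (dat2$A2 * p2 + (1 - dat2$A2) / 2)
y2    <- (dat2$A2 + 1) / 2
fit2  <- glm(y2 ~ x21 + x22 + x23, data = dat2, weights = w2, family = quasibinomial)
dhat2 <- sign(predict(fit2, newdata = dat2))

# Stage 1: cumulative reward, keeping only trajectories that followed dhat2.
w1    <- (dat2$R1 + dat2$R2) * (dat2$A2 == dhat2) /
         ((dat2$A1 * p1 + (1 - dat2$A1) / 2) * (dat2$A2 * p2 + (1 - dat2$A2) / 2))
y1    <- (dat2$A1 + 1) / 2
fit1  <- glm(y1 ~ x11 + x12 + x13, data = dat2, weights = w1, family = quasibinomial)
dhat1 <- sign(predict(fit1, newdata = dat2))
```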

Let $\mathcal{H}_{p_t}$ be the set of all functions from $\mathbb{R}^{p_t}$ to $\mathbb{R}$. As outlined in Section 2.1, the following proposition justifies the validity of our approach.

Proposition 1. Suppose

$$f_t = \arg\max_{f \in \mathcal{H}_{p_t}} E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = \operatorname{sgn}(f_j(X_j))\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\left[0.5(A_t+1)\, f(X_t) - \log\left(1 + \exp\left(f(X_t)\right)\right)\right]\right\}, \qquad (2.9)$$

backwards through t = T, T − 1, …, 1. We have $d_j^*(S_j) = \operatorname{sgn}\left(f_j(X_j)\right)$ for j = 1, …, T.

Let $V_t = E^{(d_t^*, \ldots, d_T^*)}\left(\sum_{i=t}^{T} R_i\right)$ be the maximal expected value function at stage t. After obtaining the estimated decision rules $\hat{d}_T, \ldots, \hat{d}_t$, for simplicity, we estimate $V_t$ by

$$\hat{V}_t = n^{-1}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\, I\left(A_{ti} = \hat{d}_t(S_{ti})\right)\right\}. \qquad (2.10)$$
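Continuing the illustrative two-stage objects above, the plug-in estimate (2.10) of the stage-1 value is simply a weighted mean; a sketch:

```r
# Sketch: inverse-probability-weighted estimate of the stage-1 value (2.10),
# reusing the illustrative objects dat2, dhat1, dhat2, p1, p2 from above.
den     <- (dat2$A1 * p1 + (1 - dat2$A1) / 2) * (dat2$A2 * p2 + (1 - dat2$A2) / 2)
v_terms <- (dat2$R1 + dat2$R2) * (dat2$A2 == dhat2) * (dat2$A1 == dhat1) / den
V1_hat  <- mean(v_terms)
```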

We remark that our results also fit into the more general and robust estimation framework constructed by Zhang et al. (2012) and Zhang et al. (2013).

3. Asymptotic Theory for Linear Decisions

Suppose the stage-t covariates $X_t$ are of dimension $p_t$ for 1 ≤ t ≤ T, and assume that the function $f_t(X_t)$ in (2.7) and (2.8) is of the linear form $f_t(X_t) = (1, X_t^\top)\beta_t$ for some $\beta_t \in \mathbb{R}^{p_t+1}$. The minimizations in (2.7) and (2.8) can then be carried out as weighted logistic regressions. In this section, we establish the asymptotic distributions of the estimated parameters and value functions under this linear decision assumption. We remark that when the true solution is nonlinear, similar to other linear learning rules, our approach can only be understood as finding the best approximation to the true solution of (2.9) within the linear space.

We consider the multi-stage case only, as results for the single-stage case are exactly the same as those for stage T. For the multi-stage case, we denote $X_t^* = (1, X_t^\top)^\top$ and the observations $X_{ti}^* = (1, X_{ti}^\top)^\top$ for t = 1, …, T and i = 1, …, n. The $n \times (p_t + 1)$ design matrix for stage t is then given by $X_{t,1:n} = (X_{t1}^*, \ldots, X_{tn}^*)^\top$. Let $\beta_{t0} = (\beta_{t00}, \beta_{t10}, \ldots, \beta_{tp_t0})^\top$ be the solution of (2.9) at stage t, and let $\hat{\beta}_t = (\hat{\beta}_{t0}, \hat{\beta}_{t1}, \ldots, \hat{\beta}_{tp_t})^\top$ be its estimator obtained by minimizing (2.7) when t = T and (2.8) when t = T − 1, …, 1.

3.1. Parameter estimation

By setting the first derivative of (2.8) to be 0 for stage t where 1 ≤ tT −1, we have

$$0 = \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\left[0.5(A_{ti}+1) - \frac{\exp\left(X_{ti}^{*\top}\beta_t\right)}{1 + \exp\left(X_{ti}^{*\top}\beta_t\right)}\right]\right\} X_{ti}^{*}.$$

The Hessian matrix of the objective function in (2.8) with respect to $\beta_t$ is:

$$H_t(\beta_t) = \frac{1}{n}\, X_{t,1:n}^{\top} D_t(\beta_t)\, X_{t,1:n},$$

where $D_t(\beta_t) = \operatorname{diag}\{d_{t1}, \ldots, d_{tn}\}$ with

$$d_{ti} = \frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})} \cdot \frac{\exp\left(X_{ti}^{*\top}\beta_t\right)}{\left(1 + \exp\left(X_{ti}^{*\top}\beta_t\right)\right)^2}.$$

Since the $R_{ti}$'s are positive, $H_t(\beta_t)$ is positive definite with probability one. Consequently, the objective function in (2.8) is strictly convex, implying the existence and uniqueness of $\hat{\beta}_t$ for t = T − 1, …, 1. This is also true for t = T by a similar argument. To obtain the asymptotic distribution of the estimators, we need the following regularity conditions:

  • (A1) $I_t(\beta_t)$ is finite and positive definite for any $\beta_t \in \mathbb{R}^{p_t+1}$, t = 1, …, T, where
    $$I_t(\beta_t) = E\left[\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)} \cdot \frac{\exp\left(X_t^{*\top}\beta_t\right) X_t^{*} X_t^{*\top}}{\left(1 + \exp\left(X_t^{*\top}\beta_t\right)\right)^2}\right].$$
  • (A2) There exists a constant $B_T$ such that $R_t < B_T$ for t = 1, …, T. In addition, we assume that for each j = 1, …, $p_t$, the covariates $X_{t1j}, \ldots, X_{tnj}$ are i.i.d. random variables with bounded support, where $X_{tij}$ denotes the jth element of $X_{ti}$.

  • (A3) Denote $Y_t = X_t^{*\top}\beta_{t0}$ and let $g_t(y)$ be the density function of $Y_t$ for 1 ≤ t ≤ T. We assume that $y^{-1} g_t(y) \to 0$ as $y \to 0$. In addition, we assume that there exists a small constant b such that for any positive constant C and $\beta \in N_{t,b} := \{\beta : |\beta - \beta_{t0}| < b\}$, $P\left(|X_t^{*\top}\beta| < Cy\right) = O(y)$ as $y \to 0$.

  • (A4) There exist constants $0 < c_{t1} < c_{t2} < 1$ such that $c_{t1} < \pi_t < c_{t2}$ for t = 1, …, T, and $P\left(\prod_{j=1}^{T} I\left(A_j = d_j^*(S_j)\right) = 1\right) > 0$.

Remark 1. By its definition, $I_t(\beta_t)$ is positive semidefinite. In A1 we assume that $I_t(\beta_t)$ is positive definite to ensure that the true optimal treatment rule is unique and estimable. The boundedness assumption A2 can be further relaxed using truncation techniques. Assumption A3 indicates that the probability of $|Y_t| \le Cn^{-1/2}$ is of order $O(n^{-1/2})$. This is in some sense necessary to ensure that the optimal decision is estimable, and it is also essential for establishing asymptotic normality without an additional Bernoulli point mass as in Laber et al. (2014). Assumption A4 ensures that the treatment design is valid, so that the probability of a patient being assigned to the unknown optimal treatments is non-negligible.

Theorem 1. Under assumptions A1–A4, we have, for t = T, …, 1 and any constant κ > 0, that there exists a large enough constant $C_t$ such that

$$P\left(\left|\hat{\beta}_t - \beta_{t0}\right| > C_t\sqrt{\frac{\log n}{n}}\right) = o\left(\sqrt{\frac{\log n}{n}}\right), \qquad (3.1)$$

and, given $X_t^{*}$, for any $x > 1$ and $x = o(\sqrt{n})$, we have

$$P\left(\left|X_t^{*\top}\left(\beta_{t0} - \hat{\beta}_t\right)\right| > \frac{x W_t}{\sqrt{n}} \,\Big|\, X_t^{*}\right) = \left\{1 + O\left(\frac{x^3}{\sqrt{n}}\right)\right\}\Phi(-x) + O\left(\sqrt{\frac{\log n}{n}}\right), \qquad (3.2)$$

where $W_t^2 = \operatorname{Var}\left(\sqrt{n}\, X_t^{*\top}(\beta_{t0} - \hat{\beta}_t)\right)$ and Φ(·) is the cumulative distribution function of the standard normal distribution. In addition, for the ith sample we have

$$E\left|\prod_{j=t}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right) - \prod_{j=t}^{T} I\left(A_{ji} = d_j^*(S_{ji})\right)\right| = o\left(\sqrt{\frac{\log n}{n}}\right). \qquad (3.3)$$

Furthermore, we have

$$\sqrt{n}\, I_t(\beta_{t0})\left(\hat{\beta}_t - \beta_{t0}\right) \stackrel{d}{\longrightarrow} N(0, \Gamma_t), \qquad (3.4)$$

where $\Gamma_t = (\gamma_{tjk})_{1 \le j, k \le p_t+1}$ with

$$\gamma_{tjk} = E\left\{\left[\frac{\left(\sum_{l=t}^{T} R_{li}\right)\prod_{l=t+1}^{T} I\left(A_{li} = d_l(S_{li})\right)}{\prod_{l=t}^{T}\pi(A_{li}, S_{li})}\right]^2 \left[0.5(A_{ti}+1) - \frac{\exp\left(X_{ti}^{*\top}\beta_{t0}\right)}{1 + \exp\left(X_{ti}^{*\top}\beta_{t0}\right)}\right]^2 X_{tij}^{*} X_{tik}^{*}\right\},$$

and $X_{tij}^{*}$ is the jth element of $X_{ti}^{*}$.

Remark 2. We remark that the proof of Theorem 1 is not straightforward since, for stage t < T, the n terms in the summation of the objective function (2.8) are weakly dependent on each other. Note that the estimation errors of the indicator functions in (2.8) might aggregate when the estimators are obtained sequentially. We thus need to show that the estimation errors of these indicator functions are well controlled. By establishing the Bernstein-type concentration inequality (3.1) and the large deviation result (3.2) for the parameter estimation, we obtain the error bound (3.3) for the estimation of these indicator functions. The asymptotic distribution of the estimators can then be established subsequently. Detailed proofs are provided in the supplementary material. On the other hand, from the proofs we can see that the asymptotic results obtained in the above theorem would also hold if other loss functions satisfying the two conditions raised in Section 2.1 were used, with corresponding modifications to Condition (A1) and the covariance matrix.

In practice, we estimate $\Gamma_t$ in Theorem 1 by $\hat{\Gamma}_t = (\hat{\gamma}_{tjk})_{1 \le j, k \le p_t+1}$ with

$$\hat{\gamma}_{tjk} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\left(\sum_{l=t}^{T} R_{li}\right)\prod_{l=t+1}^{T} I\left(A_{li} = \hat{d}_l(S_{li})\right)}{\prod_{l=t}^{T}\pi(A_{li}, S_{li})}\right]^2 \left[0.5(A_{ti}+1) - \frac{\exp\left(X_{ti}^{*\top}\hat{\beta}_t\right)}{1 + \exp\left(X_{ti}^{*\top}\hat{\beta}_t\right)}\right]^2 X_{tij}^{*} X_{tik}^{*}.$$

The covariance matrix of $\sqrt{n}\left(\hat{\beta}_t - \beta_{t0}\right)$ can then be estimated by $\hat{\Sigma}_t = H_t^{-1}(\hat{\beta}_t)\,\hat{\Gamma}_t\, H_t^{-1}(\hat{\beta}_t)$.
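Using the hypothetical single-stage objects `fitT`, `w`, and `y` from the sketch in Section 2.2, the sandwich estimate $\hat{\Sigma}_t = H_t^{-1}(\hat{\beta}_t)\hat{\Gamma}_t H_t^{-1}(\hat{\beta}_t)$ and the resulting Wald confidence intervals can be assembled as follows; this is a sketch of the formulas above, not the authors' implementation.

```r
# Sketch: sandwich covariance H^{-1} Gamma H^{-1} for the illustrative stage-T
# fit `fitT`; `w` and `y` are the weights and 0/1 responses defined earlier.
X <- model.matrix(fitT)                        # n x (p + 1) design, first column 1
p <- fitted(fitT)                              # fitted logistic probabilities
n <- nrow(X)

H <- crossprod(X, X * (w * p * (1 - p))) / n   # empirical Hessian H_t(beta_hat)
G <- crossprod(X, X * (w * (y - p))^2)  / n    # empirical Gamma_t
Sigma_hat <- solve(H) %*% G %*% solve(H)       # covariance of sqrt(n)(beta_hat - beta0)

se <- sqrt(diag(Sigma_hat) / n)                # standard errors of beta_hat
ci <- cbind(coef(fitT) - 1.96 * se, coef(fitT) + 1.96 * se)   # 95% Wald intervals
```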

3.2. Estimating the optimal value function

In this subsection, we establish the asymptotic normality of the estimated maximal expected value function defined in (2.10) when $f_t(X_t)$ is a linear function of the covariates.

Theorem 2. Under the same assumptions of Theorem 1, we have,

$$\sqrt{n}\left(\hat{V}_t - V_t\right) \stackrel{d}{\longrightarrow} N\left(0, \Sigma_{V_t}\right), \qquad t = 1, \ldots, T,$$

where $\hat{V}_t$ is defined in (2.10) and

$$\Sigma_{V_t} = E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t = d_t(S_t)\right)\right\}^2 - \left[E\left\{\frac{\left(\sum_{j=t}^{T} R_j\right)\prod_{j=t+1}^{T} I\left(A_j = d_j(S_j)\right)}{\prod_{j=t}^{T}\pi(A_j, S_j)}\, I\left(A_t = d_t(S_t)\right)\right\}\right]^2.$$

When conducting inference, $\Sigma_{V_t}$ can be estimated by its empirical counterpart:

$$\hat{\Sigma}_{V_t} = \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\, I\left(A_{ti} = \hat{d}_t(S_{ti})\right)\right\}^2 - \left\{\frac{1}{n}\sum_{i=1}^{n}\frac{\left(\sum_{j=t}^{T} R_{ji}\right)\prod_{j=t+1}^{T} I\left(A_{ji} = \hat{d}_j(S_{ji})\right)}{\prod_{j=t}^{T}\pi(A_{ji}, S_{ji})}\, I\left(A_{ti} = \hat{d}_t(S_{ti})\right)\right\}^2.$$
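A sketch of the corresponding plug-in variance estimate and a Wald-type interval for the stage-1 value, reusing the illustrative `v_terms` and `V1_hat` from Section 2.2:

```r
# Sketch: empirical variance of the value estimate and a 95% interval.
Sigma_V1 <- mean(v_terms^2) - mean(v_terms)^2                  # empirical Sigma_{V_t}
ci_V1    <- V1_hat + c(-1.96, 1.96) * sqrt(Sigma_V1 / length(v_terms))
```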

3.3. Testing Treatment Effects

In practice, treatments at some stages might not be effective for some patients. When the true optimal treatment rule is linear in $X_t$, a non-significant treatment effect at stage t for some 1 ≤ t ≤ T is equivalent to $X_t^{*\top}\beta_{t0} = 0$, where $X_t^* = (1, X_t^\top)^\top$. From Theorem 1 we immediately have that, given $X_t$, $X_t^{*\top}\hat{\beta}_t$ is asymptotically distributed as $N\left(X_t^{*\top}\beta_{t0},\ \frac{1}{n} X_t^{*\top} I_t(\beta_{t0})^{-1}\Gamma_t I_t(\beta_{t0})^{-1} X_t^{*}\right)$. Therefore, we can use $X_t^{*\top}\hat{\beta}_t$ as a test statistic for the significance of treatment effects: for a realization $x_t^*$ and a given significance level α, we reject $H_0: x_t^{*\top}\beta_{t0} = 0$ if $\sqrt{n}\left|\left(x_t^{*\top}\hat{I}_t(\hat{\beta}_t)^{-1}\hat{\Gamma}_t\hat{I}_t(\hat{\beta}_t)^{-1} x_t^{*}\right)^{-1/2} x_t^{*\top}\hat{\beta}_t\right| > \Phi^{-1}(1 - \alpha/2)$, where $\hat{I}_t(\hat{\beta}_t)$ and $\hat{\Gamma}_t$ are empirical estimators of $I_t$ and $\Gamma_t$ evaluated at $\hat{\beta}_t$, and Φ(·) is the cumulative distribution function of the standard normal distribution.
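Reusing the hypothetical objects `fitT`, `Sigma_hat`, and `n` from the sketch in Section 3.1, the test reduces to a one-line Wald statistic; the covariate vector `xstar` below is made up purely for illustration.

```r
# Sketch: Wald test of H0: x*' beta_t0 = 0 at a new covariate vector.
xstar <- c(1, 0.2, -0.5, 0.1)                        # hypothetical (1, x') vector
est   <- sum(xstar * coef(fitT))                     # x*' beta_hat
se_x  <- sqrt(drop(t(xstar) %*% Sigma_hat %*% xstar) / n)
p_val <- 2 * pnorm(-abs(est / se_x))                 # reject H0 if p_val < alpha
```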

Before we proceed to numerical studies, we remark that the theoretical results obtained in this section remain valid when the model is mis-specified. In that case, however, the parameters being estimated are the minimizers of (2.5) over the linear space, rather than the parameters of the optimal decision rules.

4. Simulation Study

We next carry out numerical studies to assess the performance of our proposed methods.

One-stage.

The treatment A is generated uniformly from {−1, 1} and is independent of the prognostic variables X = (x1, …, xp). We set the reward R = Q(X) + T(X, A) + ϵ, where T(X, A) reflects the interaction between treatment and prognostic variables and ϵ is a random variable such that ϵ = |Y|/10, where Y has a standard normal distribution. Such a folded normal error is chosen because R is restricted to be positive. We consider the following models.

Model 1. x1,x2,x3 are generated independently and uniformly in [−1, 1]. We generate the reward R = Q(X) + T(X, A) + ϵ by setting T(X, A) = 3(.4 − x1x2)A, Q(X) = 8 + 2x1x2 + .5x3. In this case, the decision boundary is only determined by x1 and x2.

Model 2. X = (x1,x2,x3) is generated from a multivariate normal distribution with mean zero and covariance matrix Σ = (σij)3×3 where σij = .5|ij| for 1 ≤ i, j ≤ 3. We generate the reward R by setting T(X,A)=(.82x12x2)A, Q(X)=8+2x1x2+.5x3. The decision boundary of this case is also determined by x1 and x2.

Next we consider some multi-stage cases. The treatments At are generated independently and uniformly from {−1, 1} and are independent of the p-dimensional vector of prognostic variables Xt = (xt1, …, xtp), t = 1, …, T. The error ϵ is generated in the same way as in the single-stage case.

Two-stage.

Model 3. Stage 1 outcome R1 is generated as follows: R1 = (1 − 5x11 − 5x12)A1 + 11.1 + .1x11 − .1x12 + .1x13 + ϵ, where x11, x12, x13 are generated independently from a uniform distribution in [−1, 1]. Stage 2 outcome R2 is generated by R2 = .5A1A2 + 3 + (.2 − x21x22)A2 + ϵ, where x2i = x1i, i = 1, 2, 3. In this case the covariates from the two stages are identical.

Model 4. We use the same setting as Model 3 except that we set x2i = .8x1i + .2Ui, i = 1, 2, 3 where Ui is randomly generated from U[−1,1]. In this case covariates from the two stages are different and correlated.

4.1. Estimation and classification performance

We first examine the performance of the estimated coefficient parameters, the corresponding value functions and the classification accuracy.

For stage t, given a sample size n, we repeat the simulation 2000 times and compute the coverage rate $CR_{tj}$, defined as the proportion of replications in which $[\hat{\beta}_{tj} - 1.96\,\hat{\sigma}_{tjj},\ \hat{\beta}_{tj} + 1.96\,\hat{\sigma}_{tjj}]$ covers the true parameter $\beta_{tj}$, for j = 0, …, p, where $\hat{\sigma}_{tjj}$ is the (j, j)th element of $\hat{\Sigma}_t$. $CR_{V_t}$ is defined similarly for the coverage rate of the value function. A validation set with 100,000 observations is simulated to compute the oracle values and assess the performance of our approach.
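One replication of this coverage calculation can be organized as in the sketch below; `simulate_model()`, `fit_entropy()`, and `sandwich_se()` are hypothetical wrappers around a data generator, the weighted logistic fit of Step 1, and the $H_t^{-1}\hat{\Gamma}_t H_t^{-1}$ computation shown earlier, and `beta_true` denotes the oracle parameters obtained from the large validation set.

```r
# Sketch of one Monte Carlo replication for the coverage rates (hypothetical
# helpers: simulate_model(), fit_entropy(), sandwich_se(); beta_true holds the
# oracle parameters computed from a large validation set).
one_rep <- function(n, beta_true) {
  dat <- simulate_model(n)                    # generate one data set
  fit <- fit_entropy(dat)                     # weighted logistic fit (Step 1)
  se  <- sandwich_se(fit, dat)                # sqrt(diag(Sigma_hat) / n)
  abs(coef(fit) - beta_true) <= 1.96 * se     # coverage indicator per coefficient
}
# Coverage rates over 2000 replications:
# rowMeans(replicate(2000, one_rep(200, beta_true)))
```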

We set the sample size to n = 50, 100, 200, 400, and 800. Coverage rates under Models 1–4 are given in Tables 1 and 2. For each replication under each model, we also compute the misclassification rate in each stage. Figure 2 gives the boxplots of the misclassification rates over 2000 replications for all four models.

Table 1:

Coverage rates of the expected value function and coefficient parameters under Models 1 and 2.

Model 1 Model 2
n CRV1 CR10 CR11 CR12 CR13 CRV1 CR10 CR11 CR12 CR13
50 0.927 0.948 0.950 0.938 0.945 0.946 0.944 0.937 0.931 0.924
100 0.936 0.950 0.947 0.949 0.944 0.942 0.947 0.949 0.945 0.940
200 0.942 0.954 0.947 0.955 0.952 0.951 0.950 0.950 0.953 0.947
400 0.940 0.949 0.960 0.954 0.944 0.946 0.963 0.952 0.949 0.933
800 0.944 0.944 0.953 0.947 0.943 0.951 0.955 0.952 0.954 0.943

Table 2:

Coverage rates of the expected value function and coefficient parameters under Models 3 and 4.

Model 3 Stage 1 Stage 2
n CRV1 CR10 CR11 CR12 CR13 CRV2 CR20 CR21 CR22 CR23
50 0.872 0.946 0.937 0.945 0.947 0.912 0.949 0.939 0.951 0.951
100 0.928 0.949 0.956 0.953 0.948 0.941 0.952 0.956 0.954 0.94J
200 0.936 0.947 0.942 0.942 0.951 0.950 0.950 0.946 0.948 0.935
400 0.941 0.943 0.948 0.943 0.950 0.943 0.948 0.952 0.948 0.956
800 0.957 0.944 0.955 0.945 0.941 0.954 0.939 0.951 0.952 0.952
Model 4 Stage 1 Stage 2
50 0.865 0.948 0.944 0.941 0.947 0.908 0.942 0.948 0.942 0.942
100 0.908 0.951 0.939 0.954 0.940 0.942 0.955 0.943 0.947 0.949
200 0.941 0.940 0.943 0.951 0.948 0.948 0.948 0.954 0.954 0.951
400 0.945 0.944 0.946 0.956 0.952 0.948 0.943 0.951 0.947 0.950
800 0.954 0.949 0.946 0.957 0.953 0.951 0.950 0.963 0.952 0.950

Figure 2: Boxplot of misclassification rates over 2000 replications.

From Tables 1 and 2 we observe that the coverage rates are close to the nominal level (95%) and improve as the sample size increases, suggesting that the asymptotic normal approximation for our estimators works well. In particular, the coverage rates for the coefficient parameter estimators are very close to 95% even when the sample size is as small as 50. The boxplots in Figure 2 also indicate that the misclassification rate of the estimated decision rule decreases towards zero as the sample size increases.

Note that the ultimate goal of dynamic treatment regimes is to maximize the value functions. We next compare our Entropy learning with Q-learning and Outcome-weighted learning in terms of value function estimation. Throughout this paper, Q-learning and Outcome-weighted learning are implemented using the R package “DTRlearn”. In addition to Models 1–4, we also consider the following nonlinear cases.

Model 5. x1,x2,x3 are generated independently and uniformly in [−1, 1]. We generate the reward R = Q(X, A) + ϵ with Q(X, A) = [−T(X) (A+1) + 2 log(1 + exp(T(X)))]−1, where T(X) = (x1x2 + 2x1x2).

Model 6. Same as Model 5 except that x1,x2, x3 are discrete variables generated independently and uniformly in {−1, 0, 1}.

Model 7. Stage 1 outcome R1 is generated as follow: R1 = [0.2 − T1(X1)(A1 + 1) + 2 log(1 + T1(X1))]−1 + ϵ where T1(X1)=x11x12+2x132+2x11x12 with x11, x12, x13 generated independently from a uniform distribution in [−1, 1]. Stage 2 outcome R2 is generated by R2 = [0.05 + (1 + A2) (1 + A1)/4 − T2 (X2)(A2 + 1) + 2 log(1 + T2 (X2))]−1 + ϵ where x2i = x1i, i = 1, 2, 3 and T2(X2)=x21x22+2x232+2x21x22.

Model 8. Same as Model 7 except that x11,x12,x13 are discrete variables generated independently and uniformly in {−1, 0, 1}.

For each model, we generate 200 random samples, and the corresponding estimated treatment rules are then used to compute the value function using (2.3) with a validation set of size n = 500,000. The above procedure is repeated 100 times and the results are reported in Table 3.
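As a sketch of this evaluation step, the empirical value of an estimated single-stage rule on a validation sample can be computed as below; `valid` is a hypothetical validation data frame with the same columns as `dat` in the earlier sketch and with P(A = 1) = 0.5, and `beta_hat` is the fitted coefficient vector.

```r
# Sketch: empirical value of an estimated rule on a validation sample (the
# data frame `valid` is hypothetical; beta_hat comes from the earlier fit).
beta_hat <- coef(fitT)
d_hat <- sign(drop(cbind(1, valid$x1, valid$x2, valid$x3) %*% beta_hat))
value <- mean(valid$R * (valid$A == d_hat) / (valid$A * 0.5 + (1 - valid$A) / 2))
```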

Table 3:

Comparison of value functions using Entropy learning (E-learning), Q-learning and Outcome-weighted learning (OW-Learning) under Models 1–8.

Model E-Learning Q-Learning OW-Learning
Model 1 10.2(0.1) 10.3(0.0) 10.3(0.0)
Model 2 9.4(0.1) 9.4(0.0) 9.4(0.0)
Model 3 Stage 2 3.7(0.1) 3.7(0.0) 3.7(0.0)
Model 3 Stage 1 14.5(0.4) 15.0(0.0) 15.0(0.0)
Model 4 Stage 2 3.6(0.1) 3.6(0.0) 3.6(0.00)
Model 4 Stage 1 14.5(0.6) 15.0(0.0) 15.0(0.0)
Model 5 1.8(0.0) 1.7(0.0) 1.8(0.0)
Model 6 4.8(0.1) 4.1(0.1) -(-)
Model 7 Stage 2 1.5(0.0) 1.5(0.0) 1.5(0.0)
Model 7 Stage 1 1.1(0.1) 1.0(0.0) 1.1(0.1)
Model 8 Stage 2 3.0(0.1) 2.8(0.2) -(-)
Model 8 Stage 1 1.9(0.3) 0.9(0.2) -(-)

From Table 3 we note that the value functions of our Entropy learning method are comparable to those of Q-learning and Outcome-weighted learning under Models 1–4. Under Models 5 and 7, where the true treatment regimes are nonlinear, the value functions of Entropy learning and Outcome-weighted learning are very close and appear slightly better than those of Q-learning. However, when we consider discrete covariates in Models 6 and 8, Outcome-weighted learning can hardly produce any result due to a very large condition number in solving a system of equations.

4.2. Testing $X_t^{*\top}\beta_{t0} = 0$

In the dynamic treatment regime literature, the non-regularity condition $P(X_t^{*\top}\beta_{t0} = 0) = 0$ is usually required (for example, in Q-learning) to enable parameter inference. Here we study the performance of testing $X_t^{*\top}\beta_{t0} = 0$ based on our entropy learning approach.

  • Case 1: testing $X^{*\top}\beta_{10} = 0$ under Model 1. Let $X^* = (1, x_1, x_2, x_3)^\top$ be the covariate vector of a new observation and $\beta_{10} = (\beta_{100}, \beta_{110}, \beta_{120}, \beta_{130})^\top$ the true parameters. By setting $x_1 = x_3 = 1$ and $x_2 = -(\beta_{100} + x_1\beta_{110} + x_3\beta_{130})/\beta_{120}$ we have $X^{*\top}\beta_{10} = 0$.

  • Case 2: testing $X_1^{*\top}\beta_{10} = 0$ under Model 4. We set $x_{11} = x_{13} = -1$ and $x_{12} = -(\beta_{100} + x_{11}\beta_{110} + x_{13}\beta_{130})/\beta_{120}$.

We set n = 50, 100, 200, 400. Note that, asymptotically,

$$X_t^{*\top}\hat{\beta}_t \sim N\left(X_t^{*\top}\beta_{t0},\ \frac{1}{n}\, X_t^{*\top} I_t(\beta_{t0})^{-1}\Gamma_t I_t(\beta_{t0})^{-1} X_t^{*}\right).$$

We use $\frac{1}{n}\, X_t^{*\top}\hat{I}_t(\hat{\beta}_t)^{-1}\hat{\Gamma}_t\hat{I}_t(\hat{\beta}_t)^{-1} X_t^{*}$ to estimate the variance of $X_t^{*\top}\hat{\beta}_t$, where $\hat{I}_t$ and $\hat{\Gamma}_t$ are the empirical estimators of $I_t$ and $\Gamma_t$. For each case we run the simulation 1000 times, and for each replication we compute the p-value of the test based on $X_t^{*\top}\hat{\beta}_t$. P-value plots are given in Figures 3 and 4. We can see that the distribution of the p-values is well fitted by the uniform distribution on [0, 1], indicating that our tests perform well in detecting non-significant treatment effects.

Figure 3: P-values of $X^{*\top}\hat{\beta}_1$ under Case 1 over 1000 replications.

Figure 4: P-values of $X_1^{*\top}\hat{\beta}_1$ under Case 2 over 1000 replications.

4.3. Type I error comparison with Q-learning

We next assess the performance of hypothesis tests, since it is often of interest to investigate the significance of coefficient parameters. Note that in Models 3 and 4 we have $\beta_{13} = \beta_{23} = 0$. We therefore compute the type I error for testing $\beta_{13} = 0$ and $\beta_{23} = 0$. In the optimization problems (2.7) and (2.8), the decisions enter through the weights of a weighted negative log-likelihood. Consequently, unlike Q-learning (Zhao et al., 2009), the objective functions for the parameter estimation are continuous, and parameter inference is feasible even without the non-regularity condition. For comparison, we compute the same quantities using the bootstrap scheme for Q-learning. We remark that the $\beta_{tj}$'s under Entropy learning and under Q-learning are generally different, but in Models 3 and 4, $x_{13}$ and $x_{23}$ are not involved in the treatment selection part, and hence the corresponding true coefficients are zero under both entropy learning and Q-learning. The significance level α is set to 0.05 and we consider n = 50, 100, 400, 800. The simulation is repeated 2000 times and the results are given in Table 4. From Table 4 we see that most of the type I errors using entropy learning are closer to the nominal α = 0.05, which indicates that our learning method can be more appropriate for testing the significance of covariates.

Table 4:

Type I error comparison using Entropy learning and Q-learning, where “Elearn” refers to Entropy learning and “Qlearn” refers to Q-learning.

Model 3 Model 4
H0 : β13 = 0 H0 : β23 = 0 H0 : β13 = 0 H0 : β23 = 0
n Elearn Qlearn Elearn Qlearn Elearn Qlearn Elearn Qlearn
50 0.063 0.069 0.050 0.057 0.060 0.054 0.055 0.056
100 0.044 0.063 0.054 0.056 0.044 0.057 0.043 0.055
400 0.049 0.043 0.055 0.043 0.047 0.053 0.047 0.046
800 0.050 0.059 0.044 0.064 0.047 0.053 0.054 0.055

5. Application to STAR*D

We consider a real data example extracted from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study funded by the National Institute of Mental Health. STAR*D is a multisite, prospective, randomized, multistep clinical trial of outpatients with nonpsychotic major depressive disorder; see Rush et al. (2004) and Sinyor et al. (2010) for more study details. The complete trial involved four sequential treatment stages (or levels), and patients were encouraged to participate in the next level of treatment if they failed to achieve remission or adequate reduction in symptoms.

Patients who did not experience a remission of symptoms during the first level of the STAR*D study, in which they initially took the antidepressant citalopram, a selective serotonin reuptake inhibitor (SSRI), for up to 14 weeks, had the option of continuing to level 2 of the trial, where they could explore additional treatment options designed to help them become symptom-free (Rush et al., 2006). Because there was only a single treatment for all patients in level 1, we will not discuss such data in this paper.

Level 2 of the study offered seven different treatments: four options “switched” the study participants from citalopram to a new medication or talk therapy, and three options “augmented” the citalopram treatment by adding a new medication or talk therapy to the citalopram they were already receiving. Data taken from level 2 will be treated as the first-stage observations in this paper, and we define A1 = −1 if the treatment option is a switch and A1 = 1 if the treatment option is an augmentation.

During levels 1 and 2 of the STAR*D trial, which started with 2,876 participants, about half of all patients became symptom-free. The other half were then eligible to enter level 3, where, as in level 2, patients were given the choice of either switching medications or adding another medication to their existing medication (Fava et al., 2006). Data taken from level 3 of this trial will be treated as the second-stage observations in this paper, and we define A2 = −1 if the treatment option is a switch and A2 = 1 if the treatment option is an augmentation.

After removing cases with missing values from the data files, we obtain a sample of 316 patients whose medical information from the two stages is available. Among these 316 patients, 119 and 197 were respectively assigned to the augmentation group and the switch group in Stage 1, and 115 and 201 were respectively assigned to the augmentation group and the switch group in Stage 2. The 16-item Quick Inventory of Depressive Symptomatology-Self-Report (QIDS-SR(16)) scores were obtained at treatment visits for the patients and are considered the primary outcome variable in this paper. To accommodate our model, in which the reward is positive and “the larger the better”, we used R = c − QIDS-SR(16) as the reward at each level, where c is a constant that bounds the empirical QIDS-SR(16) scores. In this study we simply set c = 30 so that all rewards are positive.

Following earlier analysts (e.g., Kuk et al., 2010, 2014), we consider the following set of clinically meaningful covariates: (i) chronic depression indicator: equals 1 if chronic episode > 2 years and 0 otherwise; (ii) gender: male = 0 and female = 1; (iii) patient age (years); (iv) general medical condition (GMC), defined to be 1 for the presence of one or more general medical conditions and 0 otherwise; (v) anxious feature, defined to be 1 if the Hamilton Depression Rating Scale anxiety/somatization factor score is ≥ 7 and 0 otherwise (Fava et al., 2008). In addition, we also consider (vi) week: the number of weeks patients had spent in the corresponding stage when the QIDS-SR(16) scores at exit were determined, and (vii) the baseline QIDS-SR(16) scores at the corresponding stages. Summary statistics of these covariates are given in Table 5.

Table 5:

Summary statistics for the covariates in the STAR*D study: for continuous variables, we report means and standard deviations; for dichotomous variables, we report proportions and standard deviations.

Chronic Gender Age GMC
Stage 1 Stage 2 Stage 1 Stage 2 Stage 1 Stage 2 Stage 1
Switch 0.29 (0.03) 0.29 (0.03) 0.51 (0.04) 0.46 (0.04) 43.99 (0.88) 45.78 (0.84) 0.59 (0.04)
Augmentation 0.26 (0.04) 0.26 (0.04) 0.49 (0.05) 0.57 (0.05) 44.76 (1.05) 41.65 (1.11) 0.56 (0.05)
GMC Anxiety Week QIDS-SR(16)
Stage 2 Stage 1 Stage 2 Stage 1 Stage 2 Stage 1 Stage 2
Switch 0.62 (0.03) 0.76 (0.03) 0.74 (0.03) 9.21 (0.30) 7.48 (0.34) 14.96 (0.29) 14.54 (0.31)
Augmentation 0.51 (0.05) 0.70 (0.04) 0.73 (0.04) 9.64 (0.40) 9.35 (0.46) 13.45 (0.37) 12.77 (0.37)

We applied the methods introduced in this paper to estimate the covariate effects on the optimal treatment allocation for the patients in this study. The fitted results under the entropy learning approach are given in Table 6. From Table 6 we notice that the baseline QIDS-SR(16) score is a significant predictor for determining whether the patient should be treated with the switch option or the augmentation option in both stages. More specifically, given other covariates, if the patient has a higher baseline score, adopting the switch option might yield a better medical outcome. In addition, for the Stage 2 analysis, baseline score, gender, age and the treatment time are all significant in determining the best treatment options. Interestingly, the treatment time is significant with a positive sign, indicating that, given other covariates, treatment augmentation might benefit the patients for a longer term.

Table 6:

Entropy learning for the STAR*D study.

Stage 1 Stage 2
coefficient(sd) p-value coefficient(sd) p-value
Entropy learning
intercept 0.855 (0.987) 0.386 0.452 (0.792) 0.569
chronic −1.231 (0.455) 0.007 0.103 (0.314) 0.742
gender −0.604 (0.340) 0.859 0.702 (0.269) 0.009
age 0.001 (0.016) 0.950 −0.028 (0.012) 0.022
gmc 0.089 (0.359) 0.805 −0.121 (0.274) 0.658
anxious 0.095 (0.373) 0.799 0.235 (0.298) 0.431
week 0.066 (0.036) 0.071 0.089 (0.029) 0.002
qctot −0.084 (0.044) 0.056 −0.111 (0.034) 0.001
A1 - - 0.925 (0.273) 0.001
V^i 59.617 (5.485) - 25.697 (1.325) -

For comparison, using the same sets of covariates, estimation results based on Q-learning are given in Table 7, where the estimated confidence intervals are obtained from the bootstrap procedure. From Table 7, we note that gender is identified as the only important factor for treatment selection at stage 2. Such an existing method may be less powerful than our proposed entropy learning, since it may miss potentially useful markers. Consequently, Q-learning may not be able to achieve the most appropriate treatment allocation using a set of important personalized characteristics identified from a significance study. To compare the performance of the proposed method with Q-learning in terms of the value function, we also compute the estimated mean and standard deviation of the value functions under the fitted regimes obtained by our method and by Q-learning; see the $\hat{V}_t$ values in Tables 6 and 7. We observe larger mean value functions for our entropy learning approach, indicating that the treatment regime obtained by our approach outperforms that of Q-learning on this data set.

Table 7:

Bootstrap confidence intervals of Q-learning for the STAR*D study. Lower: lower bound of the 95% confidence interval; Upper: upper bound of the 95% confidence interval.

Stage 1 Stage 2
coefficient Lower Upper coefficient Lower Upper
Q-learning
intercept 0.99 −4.28 5.50 −2.17 −5.32 0.73
chronic −0.48 −2.31 1.33 −0.63 −1.75 0.48
gender 0.66 −0.80 2.24 1.30 0.37 2.28
age −0.03 −0.09 0.04 0.02 −0.03 0.07
gmc 0.06 −1.48 1.59 0.26 −0.83 1.39
anxious 1.35 −0.32 3.00 0.62 −0.45 1.65
week −0.14 −0.31 0.04 −0.07 −0.16 0.02
qctot −0.06 −0.24 0.14 −0.02 −0.16 0.11
A1 - - - 0.11 −0.44 0.66
V^i 40.34 32.08 48.60 20.54 17.83 23.25

We note from our experience that the entropy learning approach may be incorrectly interpreted by some practitioners. The fitted weighted regression model should not be confused with an ordinary association study resulting from fitting unweighted logistic regression models to the two-stage data (see Table 8). In fact, the significant findings from Table 8 only establish how covariates affect the likelihood of being observed in a treatment, rather than the likelihood of being allocated to the most appropriate treatment.

Table 8:

Ordinary association study for the STAR*D data using logistic regression models.

Stage 1 Stage 2
coefficient(sd) p-value coefficient(sd) p-value
intercept 0.210 (0.740) 0.777 0.182 (0.772) 0.813
chronic −0.183 (0.279) 0.511 0.141 (0.296) 0.635
gender 0.012 (0.242) 0.961 0.537 (0.260) 0.039
age 0.010 (0.011) 0.352 −0.030 (0.012) 0.012
gmc −0.093 (0.259) 0.719 −0.219 (0.275) 0.425
anxious −0.117 (0.275) 0.671 0.096 (0.295) 0.744
week 0.032 (0.029) 0.269 0.098 (0.027) < 0.001
qctot −0.091 (0.030) 0.003 −0.104 (0.031) 0.001
A1 - - 0.851 (0.260) 0.001

Finally, since the original design at level 2 of the STAR*D trial was an equipoise-stratified design, one potential source of confounding could be patients' preference for the strata in the design. A further examination of this issue would be to consider including patients' preference in the treatment estimation strategies, provided we trust that patients' selection of treatment options (between switch and augmentation) within each stratum is random, as was apparently assumed in the original equipoise-stratified design (Sinyor et al., 2010).

6. Discussion

There are many open questions that may follow the development in this paper. First, the linear specification of the treatment allocation rule may be replaced by a nonparametric formulation such as a partly linear model or an additive regression model. The implementation of such methods is now widely available in standard statistical packages. More effort is still needed to establish theoretical properties similar to those in this paper and to retain interpretable results.

Second, to carry out a clinical study and select the best treatment using our approach, it is necessary to evaluate the required sample size at the design stage. Applying the theoretical results obtained in this paper, we may proceed to calculate the total number of subjects for every treatment group. However, more empirical studies on various settings and data distributions would provide stronger support for such sample-size suggestions based on asymptotic results.

Finally, missing values are quite common in multi-stage analyses. Most analysts follow the standard practice of excluding cases with missing observations under the missing-at-random assumption. It is a difficult task to investigate the reasons for missing data, and an even more difficult task to address the problem when data are not missing at random. We encourage more research in this direction.


7. Acknowledgement

We thank the Editor, the Associate Editor and two reviewers for instructive comments. The work was partly supported by Academic Research Funds R-155-000-205-114, R-155-000-195-114 and Tier 2 MOE funds in Singapore MOE2017-T2-2-082: R-155-000-197-112 (Direct cost) and R-155-000-197-113 (IRC).

Footnotes

Supplementary Materials

Title: Supplementary Material for “Entropy Learning for Dynamic Treatment Regimes”. Technical proofs for the propositions and theorems are provided in the supplementary file.

References

  1. Bartlett P, Jordan M & McAuliffe J (2006). Convexity, classification, and risk bounds. J. Am. Statist. Assoc. 101, 138–156.
  2. Chakraborty B, Murphy S & Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 19, 317–343.
  3. Fava M et al. (2006). A comparison of mirtazapine and nortriptyline following two consecutive failed medication treatments for depressed outpatients: a STAR*D report. Am. J. Psychiatry 163, 1161–1172.
  4. Fava M et al. (2008). Difference in treatment outcome in outpatients with anxious versus nonanxious depression: a STAR*D report. Am. J. Psychiatry 165, 342–351.
  5. Goldberg Y & Kosorok M (2012). Q-learning with censored data. Ann. Statist. 40, 529.
  6. Hjort N & Pollard D (2011). Asymptotics for minimisers of convex processes. ArXiv preprint arXiv:1107.3806.
  7. Kuk A, Li J & Rush A (2010). Recursive subsetting to identify patients in the STAR*D: a method to enhance the accuracy of early prediction of treatment outcome and to inform personalized care. J. Clin. Psychiatry 71, 1502–1508.
  8. Kuk A, Li J & Rush A (2014). Variable and threshold selection to control predictive accuracy in logistic regression. J. R. Statist. Soc. C 63, 657–672.
  9. Laber E, Lizotte D, Qian M, Pelham W & Murphy S (2014). Dynamic treatment regimes: technical challenges and applications. Electron. J. Stat. 8, 1225–1272.
  10. Luedtke A & van der Laan M (2016). Super-learning of an optimal dynamic treatment rule. Int. J. Biostat. 12, 305–332.
  11. Murphy K (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  12. Murphy S (2003). Optimal dynamic treatment regimes. J. R. Statist. Soc. B 65, 331–366.
  13. Murphy S (2005). A generalization error for Q-learning. J. Mach. Learn. Res. 6, 1073–1097.
  14. Qian M & Murphy S (2011). Performance guarantees for individualized treatment rules. Ann. Statist. 39, 1180.
  15. Robins J, Hernan M & Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560.
  16. Rush A et al. (2004). Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Control. Clin. Trials 25, 119–142.
  17. Rush A et al. (2006). Bupropion-SR, sertraline, or venlafaxine-XR after failure of SSRIs for depression. N. Engl. J. Med. 354, 1231–1242.
  18. Sinyor M, Schaffer A & Levitt A (2010). The sequenced treatment alternatives to relieve depression (STAR*D) trial: a review. Can. J. Psychiatry 55, 126–135.
  19. Song R, Wang W, Zeng D & Kosorok M (2015). Penalized Q-learning for dynamic treatment regimens. Stat. Sin. 25, 901–920.
  20. Watkins C & Dayan P (1992). Q-learning. Mach. Learn. 8, 279–292.
  21. Zhao Y, Kosorok M & Zeng D (2009). Reinforcement learning design for cancer clinical trials. Stat. Med. 28, 3294–3315.
  22. Zhang B, Tsiatis A, Laber E & Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.
  23. Zhang B, Tsiatis A, Laber E & Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100, 681–694.
  24. Zhao Y, Zeng D, Laber E & Kosorok M (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. J. Am. Stat. Assoc. 110, 583–598.
  25. Zhao Y, Zeng D, Rush A & Kosorok M (2012). Estimating individualized treatment rules using outcome weighted learning. J. Am. Stat. Assoc. 107, 1106–1118.
