Author manuscript; available in PMC 2016 Jun 1. Published in final edited form as: Biometrics. 2014 Oct 29;71(2):450–459. doi:10.1111/biom.12258

Sequential Multiple Assignment Randomized Trial (SMART) with Adaptive Randomization for Quality Improvement in Depression Treatment Program

Ying Kuen Cheung 1, Bibhas Chakraborty 1,2, Karina W Davidson 3
PMCID: PMC4429017  NIHMSID: NIHMS680616  PMID: 25354029

Summary

An implementation study is an important tool for deploying state-of-the-art treatments from clinical efficacy studies into a treatment program, with the dual goals of learning about the effectiveness of the treatments and improving the quality of care for patients enrolled in the program. In this article, we deal with the design of a treatment program of dynamic treatment regimens (DTRs) for patients with depression after acute coronary syndrome. We introduce a novel adaptive randomization scheme for a sequential multiple assignment randomized trial of DTRs. Our approach adapts the randomization probabilities to favor treatment sequences with comparatively superior Q-functions, as used in Q-learning. The proposed approach addresses three main concerns of an implementation study: it allows incorporation of historical data or opinions, it includes randomization for learning purposes, and it aims to improve care via adaptation throughout the program. We demonstrate how to apply our method to design a depression treatment program using data from a previous study. By simulation, we illustrate that the historical inputs are important for the program performance as measured by the expected outcomes of the enrollees, but we also show that the adaptive randomization scheme is able to compensate for poorly specified historical inputs by improving patient outcomes within a reasonable horizon. The simulation results also confirm that the proposed design allows efficient learning of the treatments by alleviating the curse of dimensionality.

Keywords: Behavioral intervention, Dynamic treatment regimen, Implementation research, Problem-solving therapy, Play-the-winner, Q-learning

1. Introduction

It is common to conduct implementation studies to facilitate the uptake of new treatments, particularly in the area of behavioral intervention for chronic conditions. An implementation study generally aims to adapt existing treatments in a care setting or treatment program so as to improve the quality of care for patients enrolled in the program while producing knowledge about the treatments (Cheung and Duan, 2014). We are motivated by the need to implement effective depression interventions in the clinical care of patients suffering from acute coronary syndrome. Even though numerous treatment modules, including medications and problem-solving therapy (PST), are available, depression management for these patients remains poor due to suboptimal treatment administration (Thombs et al., 2008). As the management of depression may involve multiple treatment components given over a period of time, a successful intervention is likely a direct result of administering each component or their combination in an optimal sequence, possibly based on intermediate outcomes, with the objective of maximizing the eventual health outcome. Thus, the optimal intervention is potentially a dynamic treatment regimen (DTR; Murphy, 2003); see Chakraborty, Laber, and Zhao (2013) for an example of a DTR in a depression study setting.

In an attempt to improve care of patients with depression after acute coronary syndrome, Davidson et al. (2013) compared a centralized depression care approach with standard care in the Comparison of Depression Interventions after Acute Coronary Syndrome (CODIACS) Vanguard trial. The trial adopted the stepped care approach, whereby initial treatments in the treatment arm were chosen based on patient preference or standard care, and were then "stepped" based on intermediate symptoms. As a result, patients in CODIACS received different treatment sequences. This would allow estimation of an optimal DTR using reinforcement learning techniques such as Q-learning (Watkins, 1989; Murphy, 2005a); and in principle, the results could be transitioned to implementation in a depression treatment program. The validity of such an analysis, however, depends on the untestable assumption of ignorable treatment (Rubin, 1974; Robins, 1997), which asserts that the assigned treatment is independent of the potential future outcomes conditional on the subject history. Therefore, deploying a stepped care approach in a depression treatment program does not necessarily lead to unbiased learning about the optimal treatments.

An alternative to the stepped care approach is the sequential multiple assignment randomized trial (SMART) strategy, whereby a patient is initially randomized to a treatment and is re-randomized at a subsequent stage based on intermediate outcomes (Thall, Millikan, and Sung, 2000; Lavori and Dawson, 2004). By virtue of randomization, the assumption of ignorable treatment holds. In this article, we propose using SMART-type designs for implementation studies within a depression treatment program. A regular SMART design randomizes subjects to the available treatment options according to pre-specified probabilities, often without regard to the likely benefits of the treatments. In particular, a typical strategy aims to achieve equal sample sizes across the possible treatment sequences (Murphy, 2005b). This approach in theory maximizes the comparative power for comparing two treatment sequences, but it is at odds with the objective of an implementation study. In addition, in order to cover all possible branches of treatment sequences, a SMART design may suffer from the "curse of dimensionality". In drug trials such as the CATIE trial for schizophrenia (Schneider et al., 2001) and the STAR*D trial for depression (Rush et al., 2004), the curse of dimensionality can be alleviated by restricting attention to intervention options that are ethically and scientifically feasible, hence reducing the number of possible sequences to be tested. For example, it is quite common to use the play-the-winner strategy, whereby a patient with a positive intermediate outcome stays on the same treatment; in cancer therapy, the frontline treatment is continued upon a response or stable disease (Thall et al., 2007). While reasonable for administering medications, playing the winner is not necessarily the ethical or optimal approach for behavioral interventions such as PST.

Motivated by these practical considerations, we propose applying SMART with adaptive randomization (AR) for implementation research within a depression treatment program. The use of AR is increasingly common in cancer trials of non-dynamic treatments (e.g., Cheung et al., 2006; Thall and Wathen, 2007; Barker et al., 2009). The idea is to use the outcome data from patients treated previously in a study to unbalance the randomization probabilities in favor of the empirically superior treatments for future study subjects. This is appealing for the purpose of quality improvement in a care program. In this article, the term "adaptive" refers to the fact that treatment decisions are adapted between patients, rather than within patients. A non-adaptive SMART allows evaluation of DTRs that adapt treatments within patients, but does not adapt between patients. While few SMARTs consider between-patient adaptation, Lee et al. (2014) recently applied ε-greedy randomization in a dose-finding trial to adaptively assign doses based on the posterior expectations of Q-function-type utilities. Like Lee et al. (2014), our AR strategy is also based on the Q-functions in Q-learning. Section 2 reviews Q-learning and introduces the proposed SMART with adaptive randomization (SMART-AR) design. In Section 3, we design a SMART-AR for a depression treatment program using the CODIACS data, and discuss design calibration. Section 4 presents simulation results comparing different designs and assessing the robustness of the methods. Some concluding remarks are given in Section 5.

2. Methods

2.1. Q-learning: Review and Notation

Let $A_{ti}$ denote the treatment given to patient $i$ at stage $t$, and let $\mathcal{S}_t$ denote the set of treatment options at stage $t$, so that $A_{ti} \in \mathcal{S}_t$. Let $J_t$ be the number of treatment options at stage $t$. The objective of Q-learning is to identify the optimal decision $d_t(h_t) \in \mathcal{S}_t$ for the stage-$t$ intervention, given the patient history $H_{ti} = h_t$ just prior to the intervention, so as to maximize the mean of the eventual health outcome $Y_i$. In this article, we focus on Q-learning for a two-stage DTR, for which we define the Q-functions $Q_2(h_2, a_2) = E(Y_i \mid H_{2i} = h_2, A_{2i} = a_2)$ and $Q_1(h_1, a_1) = E\{\max_{a_2 \in \mathcal{S}_2} Q_2(H_{2i}, a_2) \mid H_{1i} = h_1, A_{1i} = a_1\}$. If the Q-functions were known, we could use backward induction to evaluate the optimal DTR that maximizes the expected outcome; that is, for patient $i$, $d_t(h_{ti}) = \arg\max_{a_t \in \mathcal{S}_t} Q_t(h_{ti}, a_t)$ for $t = 2, 1$.

In practice, however, the true Q-functions are unknown and must be estimated from the data. Q-learning postulates a working model for the stage-$t$ Q-function using linear regression: $Q_t(h_t, a_t; \theta_t) = \theta_{t0} + \theta_{t1}^T h_t + \theta_{t2}^T a_t + \theta_{t3}^T h_t a_t$, where the design matrix comprising $h_{ti}$ and $a_{ti}$ can be properly coded with continuous or dummy variables. For stage 2, the regression parameter $\theta_2$ can be estimated by least squares: $\hat\theta_2 = \arg\min_\theta \sum_{i=1}^n \{Y_i - Q_2(H_{2i}, A_{2i}; \theta)\}^2$. For stage 1, we define for patient $i$ a pseudo-outcome $\hat Y_i = \max_{a_2 \in \mathcal{S}_2} Q_2(H_{2i}, a_2; \hat\theta_2)$ as a proxy for the quantity under expectation in the definition of $Q_1(h_1, a_1)$, and estimate $\theta_1$ by least squares based on the pseudo-outcomes: $\hat\theta_1 = \arg\min_\theta \sum_{i=1}^n \{\hat Y_i - Q_1(H_{1i}, A_{1i}; \theta)\}^2$. With the model-based estimates of the Q-functions, we can apply backward induction to obtain an estimate of the optimal DTR: $\hat d_t(h_{ti}) = \arg\max_{a_t \in \mathcal{S}_t} Q_t(h_{ti}, a_t; \hat\theta_t)$ for $t = 2, 1$.
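As a concrete illustration, the following minimal R sketch carries out the two-stage recursion on simulated data with binary actions at both stages; the generating model, sample size, and variable names are illustrative assumptions and not the CODIACS data or the authors' supplementary code.

set.seed(1)
n  <- 200
a1 <- rbinom(n, 1, 0.5)   # stage-1 treatment
r  <- rbinom(n, 1, 0.5)   # intermediate response
a2 <- rbinom(n, 1, 0.5)   # stage-2 treatment
y  <- 2 + 5*a1 + 8*a2 - 12*a1*a2 + 7*r + rnorm(n, sd = 7)   # final outcome

## Stage 2: regress Y on the history (a1, r) and the action a2
fit2 <- lm(y ~ a1*a2 + r*a2)
q2 <- function(a1, r, a2) predict(fit2, newdata = data.frame(a1 = a1, r = r, a2 = a2))

## Pseudo-outcome: predicted outcome under the best stage-2 action
yhat <- pmax(q2(a1, r, 0), q2(a1, r, 1))

## Stage 1: regress the pseudo-outcome on the stage-1 action
fit1 <- lm(yhat ~ a1)
q1 <- function(a1) predict(fit1, newdata = data.frame(a1 = a1))

## Backward induction: estimated optimal decision rules
d2hat <- function(a1, r) as.numeric(q2(a1, r, 1) > q2(a1, r, 0))
d1hat <- as.numeric(q1(1) > q1(0))

Here d2hat(a1, r) and d1hat give the estimated optimal stage-2 rule and stage-1 decision, respectively.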

2.2. SMART with Adaptive Randomization (SMART-AR)

If the optimal DTR $d^*$ were known for the population served by a treatment program, it would be natural to treat each enrolled patient with $d^*$. In most situations, we will need to learn about $d^*$ using the data from patients in the program, and hence some randomization of the treatment sequences is needed. In a typical SMART design, the $i$th patient enrolled in a treatment program is assigned a stage-$t$ intervention $a$ with probability $\pi_t(a \mid h_{ti})$, where the function $\pi_t(\cdot \mid h_t)$ is pre-specified such that $\sum_{a \in \mathcal{S}_t} \pi_t(a \mid h_t) = 1$ for each given $h_t$. For example, by setting $\pi_t(a \mid h_t) = 1/J_t$ for all $a \in \mathcal{S}_t$, we aim to achieve balance between treatments.

The idea of AR is to determine the function $\pi_t(\cdot \mid h_t)$ using the data of patients enrolled in the study. Our AR criterion is based on the fact that Q-learning aims to maximize $Q_t(h_t, a; \hat\theta_t)$ for given $h_t$. Precisely, let $n(i)$ denote the number of patients whose final outcome has been evaluated just prior to the enrollment of patient $i$. It is likely that $n(i) < i - 1$, as patients are enrolled in the program in a staggered fashion. For patient $i$, we randomly assign an intervention based on an initial set of probabilities $\{\pi_t^0(a \mid h_{ti})\}$ if $n(i) < N_{\min}$ for some pre-specified $N_{\min}$. The set $\pi_t^0$ may be chosen based on data from previous studies (see Section 3.1) and can thus be viewed as historical randomization probabilities. Once there are at least $N_{\min}$ patients with complete evaluation, that is, $n(i) \ge N_{\min}$, we update the randomization probabilities using the data of the first $n(i)$ patients, as follows: For patient $i$ at stage $t$, define for treatment $a$

$$\hat\rho_t(a \mid h_{ti}) = \exp\left\{ \frac{Q_t(h_{ti}, a; \hat\theta_t)}{\hat\sigma_t} \log b \right\}, \qquad (1)$$

for some pre-specified base $b \ge 1$. The least squares estimate $\hat\theta_t$ in (1) is evaluated using the data of the first $n(i)$ patients, and $\hat\sigma_t^2$ is the mean squared error due to $\hat\theta_t$:

$$\hat\sigma_1^2 = \frac{\sum_{k=1}^{n(i)} \{\hat Y_k - Q_1(H_{1k}, A_{1k}; \hat\theta_1)\}^2}{n(i) - \dim(\theta_1)} \quad \text{and} \quad \hat\sigma_2^2 = \frac{\sum_{k=1}^{n(i)} \{Y_k - Q_2(H_{2k}, A_{2k}; \hat\theta_2)\}^2}{n(i) - \dim(\theta_2)}$$

where $\dim(\theta)$ denotes the dimension of the vector $\theta$. The empirical randomization probability for treatment $a$ is then calculated by normalizing $\hat\rho_t$, that is,

$$\hat\pi_t(a \mid h_{ti}) = \frac{\hat\rho_t(a \mid h_{ti})}{\sum_{a' \in \mathcal{S}_t} \hat\rho_t(a' \mid h_{ti})} = \frac{\exp\{\hat\Delta_{ti}(a) \log b\}}{\sum_{a' \in \mathcal{S}_t} \exp\{\hat\Delta_{ti}(a') \log b\}} \qquad (2)$$

where $\hat\Delta_{ti}(a) = \{Q_t(h_{ti}, a; \hat\theta_t) - Q_t(h_{ti}, \hat a_{ti}^w; \hat\theta_t)\}/\hat\sigma_t$ and $\hat a_{ti}^w$ is the estimated worst action given $h_{ti}$. To allow input from historical data or perspectives, we propose using a weighted average of the historical and empirical randomization probabilities on the logarithmic scale: Define

$$\tilde\rho_t(a \mid h_{ti}) = \exp\left\{ \lambda_n^{b-1} \log \pi_t^0(a \mid h_{ti}) + \left(1 - \lambda_n^{b-1}\right) \log \hat\pi_t(a \mid h_{ti}) \right\} \qquad (3)$$

where $\lambda_n \in [0, 1]$ goes to zero as $n = n(i)$ grows. The randomization probability for treatment $a$ at stage $t$ given $h_{ti}$ is obtained by normalizing $\tilde\rho_t$: $\tilde\pi_t(a \mid h_{ti}) = \tilde\rho_t(a \mid h_{ti}) / \sum_{a' \in \mathcal{S}_t} \tilde\rho_t(a' \mid h_{ti})$. Thus, under (3), the historical $\pi_t^0$ influences the randomization probabilities even after AR is in effect, although its contribution goes to zero as $n$ increases when $b > 1$.
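The following R sketch assembles (1)–(3) into a single randomization-probability update. It assumes the fitted Q-values for every action at the current history, the stage-specific root mean squared error, the historical probabilities, and the current value of λ_n have already been computed; the function and argument names, and the example values, are illustrative.

ar_prob <- function(qvals, sigma.hat, pi0, b, lambda) {
  ## (1)-(2): empirical probabilities proportional to exp{(Q/sigma) log b};
  ## subtracting the worst action's Q-value cancels in the normalization
  rho.hat <- exp(qvals / sigma.hat * log(b))
  pi.hat  <- rho.hat / sum(rho.hat)
  ## (3): geometric weighting of historical and empirical probabilities,
  ## with weight lambda^(b-1) on the historical pi0
  w       <- lambda^(b - 1)
  rho.til <- exp(w * log(pi0) + (1 - w) * log(pi.hat))
  rho.til / sum(rho.til)
}

## Example: two options with estimated Q-values 12 and 8, root-MSE 6.7,
## historical probabilities (0.6, 0.4), base b = 10, and lambda = 0.5
ar_prob(qvals = c(12, 8), sigma.hat = 6.7, pi0 = c(0.6, 0.4), b = 10, lambda = 0.5)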

2.3. Design Parameters

In addition to $\{\pi_t^0\}$, a SMART-AR design requires specifying $b$, $N_{\min}$, and $\{\lambda_n\}$. The base $b$ indicates how "greedy" the AR scheme is. When $b = 1$, the trial reduces to a non-adaptive SMART using the historical $\pi_t^0$ for allocation. When $b > 1$, the allocation tends to favor the treatments with larger $Q_t(h_{ti}, a; \hat\theta_t)$. Heuristically, we may view $\hat\Delta_{ti}(a)$ as an empirical version of the effect size $\Delta_{ti}(a) = \{Q_t(h_{ti}, a) - Q_t(h_{ti}, a^w)\}/\sigma_t$ between $a$ and $a^w$, where $\sigma_t$ is the standard deviation of the (pseudo-)outcome and $a^w$ is the worst action based on the true Q-function; hence $\hat\pi_t$ is expected to be close to (2) with $\hat\Delta_{ti}(a)$ replaced by $\Delta_{ti}(a)$. For example, when there are two possible actions $\mathcal{S}_t = \{a, a^w\}$, under a moderate effect size $\Delta_{ti}(a) = 0.5$ per Cohen (1988), the empirical $\hat\pi_t(a \mid h_{ti})$ approximates $b^{1/2}/(1 + b^{1/2})$, which equals 0.59 when $b = 2$, and 0.91 when $b = 100$. In this example, setting $b$ from 2 to 100 spans a reasonably wide range of "greediness" for a moderate effect size.

The minimum sample size $N_{\min}$ indicates how early AR comes into effect. The choice of $N_{\min}$ naturally depends on $\dim(\theta_t)$, so that the Q-functions can be reliably estimated. Based on our experience with linear regression models, one should set $N_{\min} \ge 3 \times \max_t \dim(\theta_t)$.

While there are many possible choices of $\lambda_n$, we consider $\lambda_n = \tau^{1/(b-1)} N_{\min}/n$ for $b > 1$ and $\tau \in [0, 1]$. The value of $\tau$ attenuates the greediness of AR via the weight $\lambda_n^{b-1}$ given to $\pi_t^0$, which equals $\tau$ when AR first comes into effect (i.e., when $n = N_{\min}$). Generally, a larger value of $\tau$ leads to more attenuation, whereas $\tau = 0$ implies no attenuation, so that the SMART-AR uses the empirical randomization probabilities for allocation, that is, $\tilde\pi_t \equiv \hat\pi_t$. Together with $b$ and $N_{\min}$, the attenuation parameter $\tau$ allows a wide range of specifications of a SMART-AR design. Additional properties of this form of $\lambda_n$ and the design parameters are given in the web supplement. Calibration of these design parameters is further discussed in Section 3.2.
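As a quick numerical check of this choice, the small R sketch below computes the weight $\lambda_n^{b-1}$ placed on $\pi_t^0$ in (3); the inputs are illustrative.

hist_weight <- function(n, b, Nmin, tau) {
  lambda <- tau^(1 / (b - 1)) * Nmin / n
  lambda^(b - 1)   # weight given to the historical pi_t^0 in (3)
}
hist_weight(n = 30,  b = 10, Nmin = 30, tau = 0.75)   # equals tau when n = Nmin
hist_weight(n = 100, b = 10, Nmin = 30, tau = 0.75)   # decays toward 0 as n grows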

3. Application: Design of a Depression Treatment Program

3.1. Analysis of Historical Data: Specification of $\pi_t^0$

In this section, we describe a SMART-AR design for a depression treatment program with the aim of reducing depression symptoms 6 months after enrollment in the program. Specifically, the final outcome $Y$ is defined as the 6-month reduction in Beck Depression Inventory (BDI). We consider two treatment options, medication or PST, that are to be given at baseline and possibly modified during the step period (6 to 8 weeks after the initial treatment). Specifically, let $A_{ti}$ denote the indicator of patient $i$ receiving PST at stage $t$ for $t = 1, 2$; that is, $\mathcal{S}_t = \{0, 1\}$ for $t = 1, 2$, with 0 indicating medication and 1 indicating PST. After the step period, each patient is evaluated with an intermediate BDI; let $R_i$ denote the indicator that the intermediate BDI decreased by at least 3 units during the step period for patient $i$. Thus, the longitudinal trajectory of patient $i$ is given by $(A_{1i}, R_i, A_{2i}, Y_i)$, so that there are two stages of decisions with $H_{1i} = \emptyset$ and $H_{2i} = (A_{1i}, R_i)$.

We set up the initial randomization probabilities $\pi_t^0$ in the SMART-AR using the data of CODIACS, in which information about $(A_{ti}, R_i, Y_i)$ is available for 108 subjects, with 56 receiving no PST and 52 receiving PST at baseline. We apply Q-learning to the data with

$$Q_2(h_{2i}, a_{2i}; \theta_2) = \beta_0 + \beta_1 a_{1i} + \beta_2 a_{2i} + \beta_3 a_{1i} a_{2i} + \gamma_1 r_i + \gamma_2 r_i (1 - a_{1i}) a_{2i} + \gamma_3 r_i a_{1i} (1 - a_{2i}) \qquad (4)$$

for stage 2, where $\theta_2 = (\beta_0, \beta_1, \beta_2, \beta_3, \gamma_1, \gamma_2, \gamma_3)$; and $Q_1(h_{1i}, a_{1i}; \theta_1) = Q_1(a_{1i}; \theta_1) = \alpha_0 + \alpha_1 a_{1i}$ for stage 1, where $\theta_1 = (\alpha_0, \alpha_1)$. The Q-function (4) is parameterized so that negative values of $\gamma_2$ and $\gamma_3$ support play-the-winner for medication and PST, respectively, as the initial treatment. The least squares estimate $\hat\gamma_2 = -13$ ($P = 0.03$) is in line with the expectation that a patient showing response to initial medication should continue with the medication. On the other hand, the least squares estimate $\hat\gamma_3 = 6.5$ with $P = 0.27$ is ambivalent about whether playing the winner is optimal for PST. As PST aims to improve a patient's mental health by teaching them how to systematically solve self-identified psychological problems, it is conceivable that additional PST sessions may not be beneficial once the patient acquires the skills.

Table 1 summarizes the results of Q-learning, which estimates $\hat d_1 = 1$ and $\hat d_2(h_{2i}) = \hat d_2(1, r_i) = 0$ for $r_i = 0, 1$ as the optimal decisions. In other words, the optimal sequence is non-dynamic in that it starts with PST and switches to medication regardless of the intermediate response. It is instructive also to look at the results for patients starting with medication ($a_{1i} = 0$): the optimal follow-up decision in stage 2 is to switch to PST for patients who do not respond ($r_i = 0$) and to stay on medication for those who do ($r_i = 1$). This analysis supports playing the winner with medication as the initial treatment. To evaluate the robustness of the analysis, we also performed Q-learning under a saturated Q-function

Table 1.

Summary of Q-learning of the CODIACS data. The probabilities $\hat\pi_t^{\mathrm{CODIACS}}$ are calculated according to (2) with $b = 2$, and are used as the historical randomization probabilities $\pi_t^0$ in the SMART-AR for the depression treatment program.

Stage 1:

  a1    Q1(a1; θ̂1)    π̂^CODIACS(a1)
  0     10.2           0.33
  1     15.4           0.67

Stage 2:

  (a1, r)    a2    Q2(a1, r, a2; θ̂2)    π̂^CODIACS(a2 | a1, r)
  (0, 0)     0      2.2                  0.30
  (0, 0)     1     10.5                  0.70
  (0, 1)     0     10.0                  0.62
  (0, 1)     1      5.2                  0.38
  (1, 0)     0      7.8                  0.60
  (1, 0)     1      4.0                  0.40
  (1, 1)     0     22.0                  0.74
  (1, 1)     1     11.7                  0.26

Mean squared errors due to θ̂t are σ̂1² = 24.6 and σ̂2² = 45.2.

$$Q_2^{\mathrm{Sat}}(h_{2i}, a_{2i}) = \beta_0 + \beta_1 a_{1i} + \beta_2 a_{2i} + \beta_3 a_{1i} a_{2i} + \gamma_1 r_i + \gamma_2 r_i a_{2i} + \gamma_3 r_i a_{1i} + \gamma_4 r_i a_{1i} a_{2i} \qquad (5)$$

for stage 2. Note that (5) reduces to (4) when $\gamma_4 = -(\gamma_2 + \gamma_3)$. Applying Q-learning with $Q_2^{\mathrm{Sat}}$ resulted in the same optimal sequence ($\hat d_1 = 1$, $\hat d_2 = 0$) as that with (4); this was equivalent to choosing the strategy with the maximum marginal means. Also, the Q-functions are estimated with very similar values, with $\hat Q_1(0) = 10.7$ and $\hat Q_1(1) = 15.4$ (cf. Table 1). This indicates that model (4) is adequate, while being slightly more parsimonious than the saturated model.

Based on the estimated Q-functions in Table 1, we evaluated $\{\hat\pi_t^{\mathrm{CODIACS}}(a \mid h_{ti})\}$ according to (2) with $b = 2$. The entire set of $\hat\pi_t^{\mathrm{CODIACS}}$ is also given in Table 1; these values would in turn be used as the historical randomization probabilities $\pi_t^0$ in the SMART-AR for the depression treatment program. We used $b = 2$ in the calculation of $\hat\pi_t^{\mathrm{CODIACS}}$ so as to avoid extreme initial randomization probabilities in the SMART-AR.
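As a check on the computation in (2), the following R snippet reproduces the stage-1 probabilities in Table 1 from the reported estimates $Q_1(0) = 10.2$, $Q_1(1) = 15.4$, and $\hat\sigma_1^2 = 24.6$ with $b = 2$.

b      <- 2
q1     <- c(10.2, 15.4)   # estimated Q1(a1) for a1 = 0, 1
sigma1 <- sqrt(24.6)      # root mean squared error at stage 1
rho    <- exp(q1 / sigma1 * log(b))
round(rho / sum(rho), 2)  # 0.33 and 0.67, matching Table 1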

3.2. Design Calibration

We take the general calibration approach of Lee and Cheung (2009, 2011) to determine the SMART-AR design parameters $(b, N_{\min}, \tau)$. First, we iterate the design parameter values over a three-dimensional grid. Specifically, in light of the discussion in Section 2.3, we considered $b = 2, e, 5, 10, 20, 100$; $N_{\min} = 20, 30, 40, 50$; and $\tau = 0, 0.25, 0.50, 0.75, 1.00$.

Second, for each triplet $(b, N_{\min}, \tau)$, we simulate SMART-AR under a set of pre-specified calibration scenarios. Specifically, the outcomes $Y$ in the simulations were generated as normal with mean specified according to (5) and variance 45; the parameter values of Scenarios 1–4 in Table 2 were used as the calibration scenarios, and these values were chosen so that the analysis model (4) was correct. The optimal DTR $d^*$ and the worst DTR $d^w$ and their values are also given in Table 2; the value of a DTR $d$ is defined as $V(d) = E_d(Y)$, the expected outcome $Y$ under $d$. The intermediate responses ($R$) were generated as Bernoulli with $\Pr(R = 1 \mid A_1 = 0) = 0.52$ and $\Pr(R = 1 \mid A_1 = 1) = 0.54$ based on the results in CODIACS. In each simulated trial, inter-enrollment times were simulated according to a Poisson process with a rate of four patients per month based on our clinical expectation.
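For concreteness, the following R sketch generates one patient trajectory under this calibration mechanism; the θ values shown are those of Scenario 1 in Table 2, and the stage-2 assignment here is a balanced placeholder rather than the AR rule.

gen_patient <- function(a1) {
  th <- c(b0 = 2.2, b1 = 5.6, b2 = 8.3, b3 = -12,
          g1 = 7.7, g2 = -13, g3 = 6.5, g4 = 6.6)
  r  <- rbinom(1, 1, if (a1 == 0) 0.52 else 0.54)   # intermediate response
  a2 <- rbinom(1, 1, 0.5)                           # placeholder stage-2 assignment
  mu <- unname(th["b0"] + th["b1"]*a1 + th["b2"]*a2 + th["b3"]*a1*a2 +
               th["g1"]*r + th["g2"]*r*a2 + th["g3"]*r*a1 + th["g4"]*r*a1*a2)
  c(a1 = a1, r = r, a2 = a2, y = rnorm(1, mean = mu, sd = sqrt(45)))
}
gen_patient(a1 = 1)   # one simulated trajectory (A1, R, A2, Y)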

Table 2.

Parameter values for model (5) used in simulation, and the corresponding optimal DTR $d^*$ and worst DTR $d^w$. Scenarios 1–4, under which (4) is correct, are used as calibration scenarios for the SMART-AR design. The analysis model (4) is misspecified under Scenarios 5 and 6.

  Variable      Scenario 1   Scenario 2   Scenario 3   Scenario 4   Scenario 5   Scenario 6
  β0                2.2          2.2          2.2          2.2          1.3          2.2
  β1                5.6          5.6          5.6          5.6          6.5          5.6
  β2                8.3          8.3          8.3          8.3          9.2          8.3
  β3              -12           -6.1        -12           -6.1        -12          -12
  γ1                7.7          7.7          7.7          7.7          9.6          7.7
  γ2              -13           -6.5        -13           -6.5        -15          -13
  γ3                6.5          6.5         -6.5         -6.5          4.6         -6.5
  γ4                6.6          0.1         19.5         13            6.4        -19.5

  Optimal DTR
  d*1               1            1            0            1            1            0
  d*2(d*1, 0)       0            1            1            1            0            1
  d*2(d*1, 1)       0            0            0            1            0            0
  V(d*)            15.5         16.5         10.2         14.2         15.5         10.2

  Worst DTR
  dw1               0            0            0            0            0            1
  dw2(dw1, 0)       0            0            0            0            0            1
  dw2(dw1, 1)       1            0            1            0            1            1
  V(dw)             3.8          6.2          3.8          6.2          3.3        -13

Third, we determine the "optimal" design based on performance measures averaged over the calibration scenarios. Specifically, we considered the program performance in the first 100 enrollees; for each simulated trial, we evaluated the adjusted value of the estimated optimal DTR $\hat d$ based on the first 100 patients and the adjusted average patient outcome, defined as:

$$AV(\hat d) = \frac{V(\hat d) - V(d^w)}{V(d^*) - V(d^w)} \quad \text{and} \quad APO_{100} = \frac{\bar Y_{100} - V(d^w)}{V(d^*) - V(d^w)} \qquad (6)$$

where $\bar Y_{100}$ denotes the mean BDI reduction of the first 100 patients. For each design under each scenario, we estimated $E\{AV(\hat d)\}$, $\mathrm{var}\{AV(\hat d)\}$, $E(APO_{100})$, and $\mathrm{var}(APO_{100})$ based on 1,000 simulation replicates. These performance metrics were then averaged across the calibration scenarios. The value and patient outcome in (6) were standardized against $d^*$ and $d^w$ so that the quantities averaged across scenarios take values on the same range (0, 1).
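A minimal R sketch of these standardized metrics is given below; it assumes the scenario's true values V(d*) and V(dw) are known (e.g., 15.5 and 3.8 in Scenario 1), and that the value of the estimated DTR and the mean outcome of the first 100 enrollees have been obtained from a simulated trial. The inputs are illustrative.

adjusted_metrics <- function(v.dhat, ybar100, v.opt, v.worst) {
  c(AV     = (v.dhat  - v.worst) / (v.opt - v.worst),   # adjusted value of dhat
    APO100 = (ybar100 - v.worst) / (v.opt - v.worst))   # adjusted average patient outcome
}
adjusted_metrics(v.dhat = 15.0, ybar100 = 11.2, v.opt = 15.5, v.worst = 3.8)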

The right panel of Figure 1 plots $E(APO_{100})$, which is a measure of quality improvement in a program, under the design parameters considered. Overall, the average patient outcome increases with large $b$, small $N_{\min}$, and small $\tau$; that is, a greedy AR scheme tends to treat more patients with the better regimens on average. However, it is important to note that a greedy AR is also associated with greater $\mathrm{var}(APO_{100})$; see the web supplement. The increase in variability is the price of adapting to data, and is in line with AR for non-dynamic treatments. The left panel of Figure 1 plots $\mathrm{var}\{AV(\hat d)\}$, which gives an indication of the efficiency of $\hat d$ subsequent to a SMART-AR. It appears that $\mathrm{var}\{AV(\hat d)\}$ increases with extreme values of $b$ and when $N_{\min} = 50$. A large $\tau$ increases variability (reduces efficiency) with non-greedy AR (e.g., $b = 2$, $N_{\min} = 50$), but improves efficiency with greedy AR (e.g., $b = 20$, $N_{\min} = 20$). This suggests that an efficient $\hat d$ is a result of non-extreme greediness in a SMART-AR. To account for both efficiency and quality improvement, our calibration strategy aims to find the triplet $(b, N_{\min}, \tau)$ that maximizes $E(APO_{100})$ among those with $\mathrm{var}\{AV(\hat d)\}$ exceeding the minimum $\mathrm{var}\{AV(\hat d)\}$ by no more than 5%. This yielded $b = 10$, $N_{\min} = 30$, and $\tau = 0.75$. We emphasize that the goal of calibration is to obtain a reasonable set of design parameters for a wide range of scenarios, while an "optimal" design is specifically tied to the choice of the calibration scenarios and the optimality criteria. It is thus reassuring to note that $E\{AV(\hat d)\}$ spans a narrow range under the design parameters considered (see web supplement); that is, the SMART-AR allows learning of the optimal DTR over a wide range of design parameters.

Figure 1. Calibration of SMART-AR over a grid of $b$, $N_{\min}$, and $\tau$: average $\mathrm{var}\{AV(\hat d)\}$ in the left panel and average $E(APO_{100})$ in the right panel. The dark dashed and solid lines respectively indicate the minimum and maximum values attained by all design parameters considered.

4. Design Comparison

Simulations were performed to compare several SMART designs. Specifically, we considered the following: (A) SMART-ARopt: the optimal SMART-AR with $b = 10$, $N_{\min} = 30$, and $\tau = 0.75$; (B) SMART-AR1: SMART-AR with $b = 1$, which amounts to a non-adaptive SMART using $\hat\pi_t^{\mathrm{CODIACS}}$ throughout; (C) SMART-B: non-adaptive SMART using balanced randomization, that is, $\pi_t \equiv 1/J_t$; (D) SMART-PTW: non-adaptive SMART with play-the-winner, by which patients responding to the initial treatment stay on the same treatment, that is, $\pi_2(a_2 \mid a_1, r = 1) = 1$ for $a_2 = a_1$ and $0$ for $a_2 \ne a_1$, and $\pi_t = 0.5$ otherwise; (E) SMART-PTWm: non-adaptive SMART with play-the-winner for medication only, that is, $\pi_2(0 \mid a_1 = 0, r = 1) = 1$ and $\pi_2(1 \mid a_1 = 0, r = 1) = 0$, and $\pi_t = 0.5$ otherwise. While SMART-ARopt was calibrated with respect to Scenarios 1–4 in Table 2, we also evaluated the designs under scenarios in which the analysis model (4) is misspecified, namely Scenarios 5 and 6, to assess robustness. In addition, while we considered a patient accrual rate of four per month in the calibration, we also examined the methods under different accrual rates.

Figure 2 plots the expected BDI reduction of a patient at each given enrollment number up to the 100th enrollee; the smoothed curves were obtained using locally weighted regression. Generally, the expected patient outcome improves over time under SMART-ARopt in all scenarios, but remains constant throughout a non-adaptive SMART. Since faster patient accrual implies that more patients are enrolled before adaptation comes into effect, it delays the improvement under SMART-ARopt. In Scenario 1, the true parameter values are chosen based on the regression analysis of the CODIACS data; therefore, SMART-AR1 is expected to perform well when compared to the other non-adaptive SMART designs. Even in this case, the use of AR still provides improvements over time: by the 100th enrollee, the expected BDI reduction improves by about 2.5 units. Scenarios 2, 5, and 6 produce qualitatively similar results, and it is interesting to note that AR yields improvements in Scenarios 5 and 6 even though the analysis model is misspecified. In Scenarios 3 and 4, play-the-winner tends to produce better BDI reduction than the other non-adaptive programs. However, SMART-ARopt is able to compensate for the initial deficit and surpasses SMART-PTW by the 100th enrollee, even under an accrual rate that doubles our original expectation.

Figure 2. Program performance (average BDI reduction) of SMART-AR under various accrual rates. The SMART-AR is applied with $b = 10$, $N_{\min} = 30$, and $\tau = 0.75$. The non-adaptive SMART-AR1 is indicated by 'o', SMART-B by '+', SMART-PTW by 'p', and SMART-PTWm by 'm'. The dark solid line at the top of each panel corresponds to the performance of the true optimal non-randomized DTR $d^*$.

SMART-B is outperformed by some of the other non-adaptive schemes in all scenarios. This is not surprising: by aiming to balance sample sizes among all possible treatment sequences, a balanced design assigns patients to the worst sequence $d^w$, which will likely preclude it from being an optimal program. To further examine the use of balanced randomization as an initial randomization scheme in a SMART-AR, we ran simulations of SMART-ARopt using the initial randomization probabilities $\pi_t^0 \equiv 1/J_t$. Figure 3 shows that AR is able to correct for a poor choice of the initial probabilities. Under Scenario 1, for example, the program starting with balanced randomization improves by over 4 units in BDI change over the 25-month period under SMART-ARopt. (To put the magnitude of improvement in perspective, the centralized care arm in CODIACS improved upon standard care by 3 units on the BDI scale.) Having said that, this simulation study indicates the importance of a reliable choice of $\pi_t^0$ based on historical data, as far as treating the early enrollees is concerned.

Figure 3. Program performance (average BDI reduction) of SMART-AR with different initial randomization probabilities. The SMART-AR is applied with $b = 10$, $N_{\min} = 30$, and $\tau = 0.75$ under an accrual rate of 4 per month. The non-adaptive SMART-AR1 is indicated by 'o', SMART-B by '+', SMART-PTW by 'p', and SMART-PTWm by 'm'. The dark solid line at the top of each panel corresponds to the performance of the true optimal non-randomized DTR $d^*$.

To examine the learning ability of the designs, Table 3(a) gives some properties of $\hat d$ evaluated based on model (4) using the first 100 enrollees. Under SMART-PTWm, the parameter $\gamma_2$ is not estimable because $R(1 - A_1)A_2$ is completely confounded with the main effects $A_1$, $R$, and $A_2$; thus, there are no Q-learning results for this design (nor for SMART-PTW, for the same reason). SMART-ARopt, SMART-AR1, and SMART-B have similar accuracy in terms of the probability of correctly estimating $d^*$ and the adjusted value $E\{AV(\hat d)\}$. On the other hand, SMART-ARopt yields a more efficient estimator, with smaller variability in its value than SMART-B. At first glance, this may seem implausible, because balanced randomization maximizes the comparative power between two treatment sequences. However, since our goal is not to compare all possible sequences, but rather to identify the optimal one among the good ones, AR allocates resources to the promising sequences, thus maximizing the resolution of the relevant comparisons. This simulation shows that incorporating AR into a SMART not only allows learning, but also improves learning relative to the non-adaptive designs. As the accrual rate increases, SMART-ARopt converges to SMART-AR1, and thus $\mathrm{var}\{AV(\hat d)\}$ under SMART-ARopt rises and approaches that of SMART-AR1.

Table 3.

Properties of $\hat d$ under SMART-B, SMART-AR1, and SMART-ARopt using analysis models (4) and (5) under different patient accrual rates. The accrual rates (4, 6, and 8 per month) apply to SMART-ARopt; SMART-B (Bal.) and SMART-AR1 are non-adaptive.

(a) Q-learning using model (4)

                       Bal.    AR1    ARopt
  Accrual rate:                        4      6      8
  Scenario 1
    Pr(d̂ = d*)        0.91   0.94   0.95   0.95   0.95
    E{AV(d̂)}          0.98   0.99   0.99   0.99   0.99
    var{AV(d̂)} × 10³  3.05   2.31   1.32   1.67   1.98
  Scenario 2
    Pr(d̂ = d*)        0.75   0.78   0.80   0.77   0.77
    E{AV(d̂)}          0.97   0.97   0.98   0.97   0.97
    var{AV(d̂)} × 10³  4.61   5.26   2.56   2.90   4.11
  Scenario 3
    Pr(d̂ = d*)        0.53   0.51   0.51   0.53   0.53
    E{AV(d̂)}          0.95   0.95   0.96   0.96   0.95
    var{AV(d̂)} × 10³  8.47  11.00   7.38   7.24  10.74
  Scenario 4
    Pr(d̂ = d*)        0.75   0.76   0.79   0.75   0.74
    E{AV(d̂)}          0.95   0.95   0.96   0.95   0.95
    var{AV(d̂)} × 10³  9.79  13.83   7.97  11.14  12.09
  Scenario 5
    Pr(d̂ = d*)        0.91   0.91   0.91   0.91   0.91
    E{AV(d̂)}          0.99   0.99   0.99   0.99   0.99
    var{AV(d̂)} × 10³  1.85   1.73   1.23   1.55   1.66
  Scenario 6
    Pr(d̂ = d*)        0.01   0.00   0.16   0.10   0.05
    E{AV(d̂)}          0.84   0.77   0.92   0.90   0.86
    var{AV(d̂)} × 10³  9.14   5.23   2.52   4.20   4.61

(b) Q-learning using model (5)

                       Bal.    AR1    ARopt
  Accrual rate:                        4      6      8
  Scenario 1
    Pr(d̂ = d*)        0.90   0.93   0.94   0.94   0.94
    E{AV(d̂)}          0.98   0.99   0.99   0.99   0.99
    var{AV(d̂)} × 10³  3.49   2.54   1.54   2.25   2.19
  Scenario 2
    Pr(d̂ = d*)        0.73   0.76   0.78   0.77   0.76
    E{AV(d̂)}          0.96   0.97   0.97   0.97   0.97
    var{AV(d̂)} × 10³  5.40   6.15   3.74   3.50   5.41
  Scenario 3
    Pr(d̂ = d*)        0.51   0.50   0.52   0.52   0.52
    E{AV(d̂)}          0.95   0.94   0.96   0.96   0.95
    var{AV(d̂)} × 10³  8.26  12.75   8.15   7.64  12.02
  Scenario 4
    Pr(d̂ = d*)        0.74   0.74   0.76   0.74   0.73
    E{AV(d̂)}          0.95   0.94   0.95   0.94   0.94
    var{AV(d̂)} × 10³ 10.68  16.57  11.00  13.59  14.75
  Scenario 5
    Pr(d̂ = d*)        0.82   0.85   0.88   0.86   0.86
    E{AV(d̂)}          0.98   0.98   0.99   0.98   0.98
    var{AV(d̂)} × 10³  3.18   2.48   2.29   2.26   2.27
  Scenario 6
    Pr(d̂ = d*)        0.78   0.76   0.76   0.78   0.77
    E{AV(d̂)}          0.98   0.98   0.98   0.98   0.98
    var{AV(d̂)} × 10³  1.26   1.80   1.41   1.33   1.42

Table 3(b) shows the Q-learning results using the saturated model (5), which is in effect nonparametric. In most scenarios, using model (4) yields smaller $\mathrm{var}\{AV(\hat d)\}$ than using (5), even at times when (4) is incorrect; cf. Scenario 5. In Scenario 6, the nonparametric analysis is advantageous in terms of the probability of correctly estimating $d^*$. Its advantage is less pronounced in terms of $E\{AV(\hat d)\}$, as Q-learning using model (4) often leads to selecting the second-best DTR, which is not far worse than $d^*$. Thus, there seems to be an intrinsic robustness of Q-learning using a misspecified model in terms of the adjusted value. In addition, under an accrual rate of 4 per month, Q-learning using (4) in SMART-ARopt has better $E\{AV(\hat d)\}$ and $\mathrm{var}\{AV(\hat d)\}$ than in the non-adaptive SMARTs. This suggests that AR further enhances the robustness of Q-learning.

5. Discussion

In this article, we have introduced a novel synthesis of Q-learning and adaptive randomization for designing a SMART-AR, and described how to initiate a depression treatment program based on the CODIACS study. In our application, we defined the intermediate response by a BDI reduction of at least 3 units. One could also classify a response into more than two categories, for example, worsening (< 0), no improvement (0–2), and clinically significant improvement (≥ 3), and apply different randomization probabilities for each category. Such design enrichment needs to be coupled with a larger model for the Q-function. Generally, with $C$ response categories in a two-stage DTR, $\dim(\theta_2)$ is of the order $J_1 \times C \times J_2$; the actual number of parameters depends on how many interaction terms are postulated, and it is important to include interactions in the search for the optimal DTR. Therefore, when a program starts and data are few, it is necessary to consider "simple" designs as in Section 3. As enrollment grows, one can enrich the design to account for more information by incorporating interaction effects with patient covariates, refining and increasing treatment options (such as different classes of medications), redefining the intermediate response, and adopting more than two stages of treatment. The proposed AR scheme accommodates such enrichment to the extent that Q-learning is feasible given the sample size. In addition, we have demonstrated in our simulation study that the use of AR improves the performance and robustness of Q-learning by allocating patients away from treatment sequences that are not promising. Similar observations have been noted for adaptive procedures for selection among multiple non-dynamic treatments (e.g., Cheung, 2008). In the context of DTRs, due to the curse of dimensionality, this advantage offered by AR is potentially enormous.

Since the weighted average (3) combines the historical and the empirical inputs on the probability scale, it can easily be applied with other empirical randomization schemes. For example, the empirical component (2) may be replaced with an ε-optimal criterion as in Lee et al. (2014), who adopt a Bayesian approach and estimate the Q-function using the full likelihood. Our approach, while less formal than a fully Bayesian approach, offers flexibility: it can be applied with any other reinforcement learning technique that has an explicit objective function, and can easily be extended to deal with a broader set of problems, such as when the outcome is binary (e.g., Moodie, Dean, and Sun, 2013). To use either approach, however, it is critical to properly calibrate the design parameters: $(b, N_{\min}, \tau)$ in our method, and an ε-sequence and the prior distribution for the Bayesian model in Lee et al. (2014).

As with any sequential method, the advantages of AR diminish with fast patient accrual, as demonstrated in our simulation: a SMART-AR converges to a non-adaptive SMART as more patients are enrolled before adaptation begins. The non-adaptive SMART can thus be used to provide a lower bound on the performance of SMART-AR. This underlines the crucial role of the historical input $\pi_t^0$. This point should be read in light of the nature of implementation research, in which the SMART-AR is used as a dissemination tool to deploy existing treatments to a community. Thus, historical perspective is often available and useful, and may prove more beneficial than applying balanced randomization initially.

Supplementary Material

Supp Material

Footnotes

Supplementary Materials: The R code used to perform the simulations in Section 3 is available with this paper as a web supplement at the Biometrics website on Wiley Online Library. The web-based supplementary materials also contain additional properties of the AR design mentioned in Section 2.3 and the full calibration results of Section 3.2.

References

  1. Barker AD, Sigman CC, Kelloff GJ, Hylton NM, Berry DA, Esserman LJ. I-SPY 2: An adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clinical Pharmacology & Therapeutics. 2009;86:97–100. doi:10.1038/clpt.2009.68
  2. Chakraborty B, Laber EB, Zhao Y. Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme. Biometrics. 2013;69:714–723. doi:10.1111/biom.12052
  3. Cheung YK. Simple sequential boundaries for treatment selection in multi-armed randomized clinical trials with a control. Biometrics. 2008;64:940–949. doi:10.1111/j.1541-0420.2007.00929.x
  4. Cheung K, Duan N. Design of implementation studies for quality improvement programs: an effectiveness-cost-effectiveness framework. American Journal of Public Health. 2014;104:e23–30. doi:10.2105/AJPH.2013.301579
  5. Cheung YK, Inoue LYT, Wathen JK, Thall PF. Continuous Bayesian adaptive randomization based on event times with covariates. Statistics in Medicine. 2006;25:55–70. doi:10.1002/sim.2247
  6. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Lawrence Erlbaum Associates; 1988.
  7. Davidson KW, Bigger JT, Burg MM, et al. Centralized, stepped, patient preference-based treatment of patients with post-acute coronary syndrome depression: CODIACS vanguard randomized controlled trial. JAMA Internal Medicine. 2013;173:997–1004. doi:10.1001/jamainternmed.2013.915
  8. Lavori PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clinical Trials. 2004;1:9–20. doi:10.1191/1740774s04cn002oa
  9. Lee J, Thall PF, Ji Y, Müller P. Bayesian dose-finding in two treatment cycles based on the joint utility of efficacy and toxicity. Journal of the American Statistical Association. 2014. doi:10.1080/01621459.2014.926815
  10. Lee SM, Cheung YK. Model calibration in the continual reassessment method. Clinical Trials. 2009;6:227–238. doi:10.1177/1740774509105076
  11. Lee SM, Cheung YK. Calibration of prior variance in the Bayesian continual reassessment method. Statistics in Medicine. 2011;30:2081–2089. doi:10.1002/sim.4139
  12. Moodie EEM, Dean N, Sun YR. Q-learning: Flexible learning about useful utilities. Statistics in Biosciences. 2013. doi:10.1007/s12561-013-9103-z
  13. Murphy SA. Optimal dynamic treatment regimes (with discussion). Journal of the Royal Statistical Society, Series B. 2003;65:331–366.
  14. Murphy SA. A generalization error for Q-learning. Journal of Machine Learning Research. 2005a;6:1073–1097.
  15. Murphy SA. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005b;24:1455–1481. doi:10.1002/sim.2022
  16. Robins J. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent Variable Modeling and Applications to Causality. New York: Springer; 1997. pp. 69–117.
  17. Rubin D. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66:688–701.
  18. Rush AJ, Fava M, Wisniewski SR. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials. 2004;25:119–142. doi:10.1016/s0197-2456(03)00112-0
  19. Schneider LS, Tariot PN, Lyketsos CG, et al. National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE): Alzheimer disease trial methodology. American Journal of Geriatric Psychiatry. 2001;9:346–360.
  20. Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;19:1011–1028.
  21. Thall PF, Wathen JK. Practical Bayesian adaptive randomisation in clinical trials. European Journal of Cancer. 2007;43:859–866. doi:10.1016/j.ejca.2007.01.006
  22. Thall PF, Wooten LH, Logothetis CJ, Millikan RE, Tannir NM. Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Statistics in Medicine. 2007;26:4687–4702. doi:10.1002/sim.2894
  23. Thombs BD, de Jonge P, Coyne JC, Whooley MA, Frasure-Smith N, Mitchell AJ, Zuidersma M, Eze-Nliam C, Lima BB, Smith CG, Soderlund K, Ziegelstein RC. Depression screening and patient outcomes in cardiovascular care. Journal of the American Medical Association. 2008;300:2161–2171. doi:10.1001/jama.2008.667
  24. Watkins CJCH. Learning from Delayed Rewards. PhD dissertation. Cambridge University; 1989.
