Abstract
Dynamic treatment regimes are an emerging and important methodological area in health research, particularly in the management of chronic health conditions. This paradigm encompasses the ideological shift in research from the acute care model to the chronic care model. It allows individualization of treatment (type, dosage, timing) at each stage of intervention. Constructing evidence-based dynamic treatment regimes requires implementation of cutting-edge design and analysis tools. Here I briefly discuss some of these modern tools, namely the sequential multiple assignment randomized trial (SMART) design and a regression-based analysis approach called Q-learning.
Chronic disorders are among today's most pressing public health issues, in both the American1 and global2 arenas. For example, widely prevailing conditions such as hypertension, obesity, diabetes, nicotine addiction, alcohol and drug abuse, HIV infection, and depression are all chronic. In many cases, effective long-term care of patients with these chronic conditions requires ongoing medical intervention following the chronic care model,1,3 rather than the more traditional acute care model. Some of the key features of health care emphasized by the chronic care model are individualization of care according to patient needs, optimization of patient outcomes through a series of interventions, and health services based on evidence (as opposed to expert opinion only).
First, in this paradigm clinicians treat patients in multiple stages, individualizing treatment type or dosage according to ongoing measures of patient response, adherence, burden, side effects, and preference. Second, instead of determining a single course of treatment (static treatment), clinicians sequentially make decisions about what to do next to optimize patient outcomes given their case history (dynamic treatment). The primary motivations for considering sequences of treatments are high interpatient variability in response to treatment, probability of relapse, presence or emergence of comorbid conditions, time-varying severity of side effects, and reduction of costs and burden when intensive treatment is unnecessary.4
Third, although there exist traditional practice guidelines for clinicians that are primarily based on expert opinions, the chronic care model advocates that these regimes be more objective and evidence based. In fact, Wagner et al. described the chronic care model as “a synthesis of evidence-based system changes intended as a guide to quality improvement and disease management activities.”3(p69)
In this context, dynamic treatment regimes (DTRs) offer a way to operationalize the sequential decision-making process involved in adaptive clinical practice and thereby a potential way to improve it. Formally, a DTR is a sequence of decision rules, 1 per stage of intervention. Each decision rule takes a patient's individual characteristics and treatment history observed up to a given stage as input and offers a recommended treatment at that stage (recommendations can include treatment type, dosage, and timing). Conceptually, a DTR can be viewed as a decision support system, which is 1 of the 6 elements of the chronic care model.3
DTRs are developed to define the sequence of treatments that will result in the most favorable clinical outcome possible. A DTR is optimal if it optimizes the mean long-term outcome (e.g., the outcome observed at the end of the final stage of treatment). A concrete example of a DTR, originally described by Murphy,5 can serve as an illustration.
ADDICTION MANAGEMENT EXAMPLE
Consider a simple addiction management study involving alcohol-dependent participants, with only 2 stages of decision: choosing the initial treatment and choosing the secondary treatment. Initially the clinician may prescribe either an opiate antagonist (naltrexone) or cognitive–behavioral therapy (CBT). Participants are classified as treatment responders or nonresponders according to their level of heavy drinking in the subsequent 2 months while they are on initial treatment. If a participant is a nonresponder to naltrexone, the clinician must decide whether to switch to CBT or augment naltrexone with CBT and an enhanced motivational program (i.e., enhanced motivation + CBT + naltrexone). If a participant is a nonresponder to CBT, the clinician must decide whether to switch to naltrexone or augment CBT with naltrexone and an enhanced motivational program (enhanced motivation + CBT + naltrexone).
Responders to the initial treatment can be assigned either to telephone monitoring or to telephone counseling and monitoring. In this study setup, researchers can formulate a DTR that results in the highest percentage of days abstinent over a 1-year period. This DTR would consist of 2 decision rules: in the first rule, pretreatment information (e.g., level of addiction) is used to select the initial treatment, and in the second rule intermediate outcomes (e.g., adherence to initial treatment, self-management skill level, number of heavy drinking days while on initial treatment) can be used to choose the secondary treatment.
A natural question that arises at this point is how to actually develop evidence-based DTRs. One simple yet rigorous approach is to initially specify the decision rules in terms of certain unknown parameters and then to use patient data to estimate them. For example, in the addiction management study example, a sensible stage 1 decision rule is to prescribe naltrexone if the participant's baseline level of addiction (e.g., number of heavy drinking days in a prespecified period) is greater than a threshold value (ψ), and to prescribe CBT otherwise (ψ is an unknown parameter). If the decision rule is to be truly evidence based, ψ needs to be estimated from data in a principled way (rather than specified by expert opinion).
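To make the form of such a rule concrete, the threshold rule just described can be sketched in a few lines of code (the variable names and the placeholder threshold of 10 days are hypothetical; in practice, ψ would be estimated from data, as discussed next):

```python
# Illustrative sketch only: the threshold psi is a placeholder here;
# an evidence-based rule would estimate it from trial data.
def stage1_rule(heavy_drinking_days: float, psi: float = 10.0) -> str:
    """Stage 1 decision rule: prescribe naltrexone if the baseline
    number of heavy drinking days exceeds the threshold psi,
    and cognitive-behavioral therapy (CBT) otherwise."""
    return "naltrexone" if heavy_drinking_days > psi else "CBT"

print(stage1_rule(14))  # heavier baseline drinking
print(stage1_rule(5))   # lighter baseline drinking
```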
The data needed to develop an optimal DTR (e.g., to estimate ψ) can be obtained from either observational studies or randomized studies. It is well known that estimates based on observational data are often subject to selection or confounding bias, and hence randomized data are preferable, allowing more accurate estimations and stronger statistical inferences. This is especially important when dealing with DTRs, given that these biases can compound over stages.
One crucial point to note here is that developing DTRs is an exploratory (developmental) procedure rather than a confirmatory procedure. Randomized controlled trials are typically the gold standard for evaluating or confirming the efficacy of a newly developed intervention but not for developing the intervention per se. Thus, generating meaningful data for developing optimal DTRs is beyond the scope of the usual confirmatory randomized controlled trial; special design considerations are required. A special class of designs, sequential multiple assignment randomized trial (SMART) designs,5–7 is well suited for developing optimal DTRs.
SEQUENTIAL MULTIPLE ASSIGNMENT RANDOMIZED TRIAL DESIGNS
SMART designs involve an initial randomization of patients to possible treatment options, followed by rerandomizations at each subsequent stage of some or all of the patients to another treatment available at that stage. The rerandomizations at each subsequent stage may depend on information collected after previous treatments but prior to new treatment assignment (e.g., how well the patient responded to the previous treatment). Thus, even though a participant is randomized more than once, ethical constraints are not violated.
Examples of the use of the SMART design (or its precursors) include the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) for treatment of Alzheimer's disease,8 the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial,9 2-stage cancer trials,10,11 2-stage smoking cessation trials,12,13 and a 2-stage trial designed to reduce mood and neurovegetative symptoms among patients with malignant melanoma.14 SMART designs attempt to conform to the clinical practice procedures used in treating chronic disorders but retain the well-known virtues of randomization (compared with observational studies). Figure 1 presents a schematic of a SMART design for the addiction management example.
FIGURE 1.
Sequential multiple assignment randomized trial (SMART) design schematic for the addiction management example.
Note. CBT = cognitive−behavioral therapy; EM = enhanced motivation; NTX = naltrexone; R = randomization; TM = telephone monitoring; TMC = telephone counseling and monitoring.
A competing approach to determining an optimal DTR could be to conduct separate randomized controlled trials for separate stages, find the optimal treatment at each stage on the basis of the data from these trials, and then combine these stagewise optimal treatments to create a DTR. However, this design strategy is myopic and may often result in a suboptimal DTR.6 Many treatments have effects that do not emerge until after the intermediate outcome (e.g., response to initial treatment) has been measured, such as enhancing the impact of a future treatment or alleviating long-term side effects that would otherwise prevent a patient from using an alternative useful treatment in the future. SMART designs are capable of addressing these issues.
This point can be further elucidated with the addiction management example. Suppose telephone counseling and monitoring is more effective than telephone monitoring alone among CBT responders (i.e., the participant learns to use counseling during CBT and thus is able to take advantage of the counseling offered). Individuals who received naltrexone during the initial treatment would not have learned to use counseling, and thus among responders to naltrexone the addition of counseling would not improve abstinence levels relative to monitoring alone. If an individual is a CBT responder, it is best to offer telephone counseling and monitoring as the secondary treatment. If the individual is a naltrexone responder, however, it is best to offer the less expensive telephone monitoring as the secondary treatment.
In summary, even if CBT and naltrexone result in the same proportion of responders (or even if CBT appears less effective at the initial stage), CBT may be the best initial treatment as part of the DTR owing to the enhanced effect of telephone counseling and monitoring when it is preceded by CBT. This issue, often referred to as a delayed effect, is important to consider when determining an optimal DTR. Furthermore, even though the results of the initial stage may indicate that 1 treatment is less effective than another, the former treatment may elicit diagnostic information that allows the investigator to better match the subsequent treatment to each participant and thus improve the primary outcome. Also, participants who enroll and remain in single-stage trials may be different from those who enroll and remain in a SMART study (such cohort effects have been discussed by Murphy et al.15).
As is the case with any study, sample size calculation is a crucial part of SMART design. In a SMART study, one can investigate multiple research questions, both concerning entire DTRs and concerning certain components thereof. In the case of powering a SMART design, however, the investigator needs to choose a primary research question (primary hypothesis) and calculate sample size on the basis of that question. In addition, 1 or more secondary research questions can be investigated. Although SMART provides unbiased answers (free from confounding) to these secondary questions by virtue of randomization, it does not necessarily have the desired power to address such secondary hypotheses.
Hypotheses concerning components of a DTR in the addiction management study example are as follows: after control for secondary treatments, the initial naltrexone treatment will result in the same mean outcome (number of days abstinent) as the initial CBT treatment, and among nonresponders to the initial treatment (naltrexone or CBT), a switch to the other treatment (CBT or naltrexone, as the case may be) will be as effective as treatment augmentation (enhanced motivation + CBT + naltrexone; Figure 1). For the first primary hypothesis, the sample size formula is the same as that for a 2-group comparison. For the second primary hypothesis, the sample size formula is the same as that for a 2-group comparison of nonresponders, and it is thus a function of the nonresponse rate to initial treatment.
An example hypothesis concerning the entire DTR is that 2 DTRs differing in initial treatment will have the same mean outcome. In this case, the formula for the estimated mean outcome of a DTR given by Murphy5 is used in calculating the sample size. Oetting et al.16 provided sample size formulas for each of these choices of primary hypothesis under different working assumptions; Feng and Wahed17 and Li and Murphy18 provided sample size formulas for time-to-event (survival) outcomes. However, there are still open questions relating to sample size issues in a SMART design that warrant further research.
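As a rough sketch of the logic behind powering the second primary hypothesis (comparing the 2 secondary treatments among nonresponders), the standard 2-group formula can be inflated by the anticipated nonresponse rate. The effect size, standard deviation, and nonresponse rate below are hypothetical placeholders, and the formulas of Oetting et al.16 differ in their working assumptions; this is only meant to show why the required enrollment is a function of the nonresponse rate:

```python
from math import ceil
from statistics import NormalDist

def two_group_n(delta, sigma, alpha=0.05, power=0.80):
    """Standard per-group sample size for comparing two means
    (normal approximation): 2 * (z_{1-alpha/2} + z_power)^2 * sigma^2 / delta^2."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)
    zb = z.inv_cdf(power)
    return ceil(2 * (za + zb) ** 2 * sigma ** 2 / delta ** 2)

def smart_total_n(delta, sigma, nonresponse_rate, alpha=0.05, power=0.80):
    """Total SMART enrollment so that the expected number of nonresponders
    supports the 2-group comparison: the 2-group total is divided by
    the anticipated nonresponse rate."""
    per_group = two_group_n(delta, sigma, alpha, power)
    return ceil(2 * per_group / nonresponse_rate)

# Hypothetical inputs: detect a 5-point difference in percentage of days
# abstinent (SD 15) among nonresponders, assuming 50% nonresponse.
print(two_group_n(delta=5, sigma=15))
print(smart_total_n(delta=5, sigma=15, nonresponse_rate=0.5))
```

Note how halving the nonresponse rate doubles the required enrollment, because only nonresponders contribute to this comparison.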
ANALYSIS OF SEQUENTIAL MULTIPLE ASSIGNMENT RANDOMIZED TRIAL DATA
For clarity, I consider SMARTs with only 2 stages of treatment, as in the addiction management example (however, the analysis method described subsequently can be generalized to more stages as well). Longitudinal data on a single patient participating in a 2-stage SMART study are given by the trajectory (O1,A1,O2,A2,Y), where O1 denotes all of the pretreatment information (possibly a vector) at the beginning of stage 1, A1 is the treatment assigned at stage 1, O2 denotes all of the intermediate observations (possibly a vector) made on the patient prior to treatment at the beginning of stage 2, A2 is the treatment assigned at stage 2, and Y is the primary outcome (either an end-of-the-study observation or a summary of observations made throughout the study).
Assume throughout that a higher value of Y is favorable. For example, in the addiction management study, O1 could be baseline level of addiction, A1 could be either naltrexone or CBT, O2 could be responder–nonresponder status (based on number of heavy drinking days during treatment), and A2 could be 1 of the 2 treatment possibilities available depending on A1 and O2 (Figure 1). Finally, the primary outcome Y could be the percentage of days abstinent over the study period.
For simplicity, consider binary treatments at each stage, coded −1 or 1, and continuous Y. Suppose that data are available for n patients, where the data trajectory for the ith patient is given by (O1i, A1i, O2i, A2i, Yi), i = 1, …, n.
To develop an analysis strategy for a 2-stage SMART design, one can initially conceptualize a simple, single-stage study with data trajectory (O1,A1,Y). To make treatment recommendations for different subgroups of patients (i.e., patients with different baseline information O1), one needs to understand the behavior of the conditional mean outcome E(Y |O1,A1) as a function of O1 and A1. A reasonable approach to doing so is to consider a regression model of Y on O1 and A1, keeping an interaction term between O1 and A1 in the model:
E(Y | O1, A1) = β^T O1 + (ψ^T O1)A1.

The superscript T in this expression denotes the vector transpose, and the vector O1 includes the scalar "1" so that the model has an intercept term. The first part of the model (i.e., β^T O1), although relevant for predicting the outcome Y, is not relevant for making a treatment decision (because this part remains the same for either choice of A1). It is the second part, (ψ^T O1)A1, that governs the treatment decision. The parameters in this model can be estimated via the usual least squares method, with β̂ and ψ̂ denoting the estimates. Then the estimated optimal treatment decision rule recommends the treatment that, for a given value of O1, maximizes the estimated conditional mean outcome. In other words, this rule recommends the treatment A1 = 1 if the quantity ψ̂^T O1 > 0 and recommends the treatment A1 = −1 otherwise. If O1 is a single variable (in which case O1 = (1, O11)^T and ψ̂^T O1 = ψ̂0 + ψ̂1O11), such as the number of heavy drinking days in a prespecified period prior to receipt of treatment, the rule amounts to prescribing A1 = 1 if O11 > −ψ̂0/ψ̂1 (assuming ψ̂1 > 0) and A1 = −1 otherwise. In the addiction management example (considering only a single stage of intervention of naltrexone vs CBT), if naltrexone is denoted by A1 = 1 and CBT by A1 = −1, then the estimated decision rule prescribes naltrexone if the number of heavy drinking days is greater than the threshold value −ψ̂0/ψ̂1 and prescribes CBT otherwise. Thus, using the regression approach, one can specify a very intuitive decision rule (as discussed in the addiction management example).
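The single-stage regression and the resulting threshold rule can be illustrated with simulated data (the data-generating model and all numerical values below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated single-stage data: O1 = baseline heavy drinking days,
# A1 randomized to -1 (CBT) or 1 (naltrexone).
O1 = rng.integers(0, 30, size=n).astype(float)
A1 = rng.choice([-1.0, 1.0], size=n)

# Hypothetical true model: naltrexone helps heavier baseline drinkers
# (interaction coefficient 0.3 > 0, main treatment effect -3 < 0),
# so the true threshold is 3 / 0.3 = 10 heavy drinking days.
Y = 50.0 - 0.5 * O1 + (-3.0 + 0.3 * O1) * A1 + rng.normal(0, 5, size=n)

# Design matrix [1, O1, A1, O1*A1]; least squares gives
# (beta0, beta1) for prediction and (psi0, psi1) for the decision.
X = np.column_stack([np.ones(n), O1, A1, O1 * A1])
beta0, beta1, psi0, psi1 = np.linalg.lstsq(X, Y, rcond=None)[0]

# Estimated rule: prescribe A1 = 1 iff psi0 + psi1 * O1 > 0,
# i.e., iff O1 exceeds -psi0/psi1 (psi1 > 0 here).
threshold = -psi0 / psi1
print(f"estimated threshold: {threshold:.1f} heavy drinking days")
```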
With this background, one can think about the analysis strategy for the 2-stage SMART with a patient's data trajectory given by (O1,A1,O2,A2,Y). A natural extension of the approach just outlined would be to model the conditional mean outcome E(Y |O1,A1,O2,A2) and run an all-at-once regression analysis. Unfortunately, this is not a good option because of the possibility of bias in the estimation of the stage 1 treatment effect arising as a consequence of what is known as Berkson's paradox.5,19 This phenomenon can be explained with the help of the addiction management example.
Suppose that there is an unobserved variable (U) that affects a patient's ability to respond to treatment. For simplicity, conceptualize U as the stability in a patient's life (U = 1 if stable and U = 0 otherwise). U can be expected to be positively correlated with the intermediate outcome responder–nonresponder status (O2) as well as with Y, percentage of days abstinent. Suppose that the initial treatments have differing effects on responder–nonresponder status (O2). Because the initial treatment assignment is randomized, U and A1 should be uncorrelated. However, there will be a conditional correlation between U and A1, given nonresponse to the initial treatment (i.e., given O2).
Intuitively, a nonresponder who received the better initial treatment is more likely to have an unstable life (U = 0). This phenomenon is called Berkson's paradox. Note that when running a regression of Y on (O1,A1,O2,A2), one conditions on O2. Conditionally on O2, the unobserved variable U and A1 will be correlated. This correlation, coupled with the correlation between U and Y, will induce a spurious (noncausal) correlation between A1 and Y (even though A1 is randomized). As a consequence, the stage 1 treatment effect will be estimated with bias.
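This collider phenomenon is easy to demonstrate by simulation (a sketch with invented parameters): here A1 has no true effect on Y at all, yet conditioning on O2 in the regression induces a spurious nonzero A1 coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# U: unobserved life stability; A1: randomized initial treatment.
U = rng.binomial(1, 0.5, size=n).astype(float)
A1 = rng.choice([-1.0, 1.0], size=n)

# Response (O2 = 1) is more likely with a stable life (U = 1) and with
# the (hypothetically) more effective initial treatment (A1 = 1).
O2 = rng.binomial(1, 0.2 + 0.4 * U + 0.1 * A1).astype(float)

# Y depends on U but NOT on A1: the true stage 1 effect is zero.
Y = 50.0 + 10.0 * U + rng.normal(0, 5, size=n)

def coef_A1(design, y):
    """Least squares coefficient of A1 (second column of the design)."""
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

# Unconditional regression of Y on A1: estimate near the true value 0.
unadjusted = coef_A1(np.column_stack([np.ones(n), A1]), Y)
# Conditioning on the collider O2 induces a spurious A1 effect.
adjusted = coef_A1(np.column_stack([np.ones(n), A1, O2]), Y)
print(f"A1 coefficient, not conditioning on O2: {unadjusted:.3f}")
print(f"A1 coefficient, conditioning on O2:     {adjusted:.3f}")
```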
Figure 2 (where O1 and A2 are excluded to simplify the diagram) may help provide a visual understanding of the situation. The direct (solid) arrow from A1 to Y is the true stage 1 treatment effect that the researcher needs to estimate. However, because U is unobserved and thus not included in the regression, the spurious effect arising from Berkson's paradox, represented by the dotted path from A1 to Y via O2 and U, contaminates the true effect of A1. Thus, combining the stage 1 and stage 2 variables in a single regression is potentially problematic. However, any method that proceeds in a stage-by-stage manner does not suffer from this problem. One such method is Q-learning, originally developed in computer science20,21 and later adapted to statistics.12,22
FIGURE 2.
Diagram displaying the spurious effect between A1 and Y as a consequence of Berkson's paradox.
Note. A1 = treatment assigned at stage 1; O2 = observations at beginning of stage 2; U = unobserved variable; Y = primary outcome.
Q-LEARNING
A useful first step is to formally define the case history of a patient at any given stage of intervention. At stage 1, the history (H1) simply consists of the baseline information (O1). At stage 2, the history (H2) consists of the baseline information (O1), the stage 1 treatment (A1), and stage 2 pretreatment variables (intermediate outcomes, O2). In general, a patient's case history consists of the collection of all previous and current observations and all previous treatments (for purposes of analysis, one can work with a lower-dimensional summary statistic of the entire history). The next step is to formally define what are called the Q-functions (a term coined in the computer science literature, where Q denotes “quality of treatment”) for the 2 stages as follows:
Q2(H2, A2) = E(Y | H2, A2),
Q1(H1, A1) = E(max_{a2} Q2(H2, a2) | H1, A1).

Note that although Q2 is the conditional mean function as in the simple single-stage study, the definition of Q1 is slightly more involved. It represents the conditional expectation of an unobserved pseudo-outcome given stage 1 history and treatment. The pseudo-outcome, max_{a2} Q2(H2, a2), represents a patient's best possible mean outcome if she or he is given the best treatment at stage 2.
Also note that Q1 is defined as a function of Q2, which means that Q2 needs to be specified first. In other words, the definitions of Q-functions move backward in time. Recall that in managing chronic disorders, the goal is to optimize the long-term mean outcome rather than any immediate (intermediate) outcome so that delayed effects can be addressed. Thus, unless long-term outcomes are modeled first (and thereby optimal future treatments are determined), one cannot make optimal decisions in the current stage. This intuition justifies the backward movement of the definition of Q-functions. This general approach is known as backward induction or dynamic programming.23 If the 2 Q-functions were known (e.g., as when the true multivariate distribution of the data is known), then the optimal DTR, for example d = (d1, d2), would be given by
dj(Hj) = arg max_{aj} Qj(Hj, aj), j = 1, 2.

In this equation, dj is a decision rule at the jth stage; for a patient with history Hj, this decision rule recommends the treatment that maximizes the Q-function at the jth stage (j = 1,2). In practice, however, the true Q-functions are rarely known and hence must be "learned" (estimated) from the data. This process is called Q-learning. Note that Q-functions are conditional expectations, and hence they can be estimated via a regression approach. Consider linear regression models for the Q-functions. The Q-function at stage j (j = 1,2) can be modeled as
Qj(Hj, Aj; βj, ψj) = βj^T Hj + (ψj^T Hj)Aj,

where the vector Hj includes the scalar "1" so that the model has an intercept term. The first term on the right side, βj^T Hj, does not change with treatment. The second term, (ψj^T Hj)Aj, denotes the interaction between history and treatment and is crucial for decision making. The Q-learning algorithm proceeds as follows:
1. Estimate the stage 2 parameters (β2, ψ2) by regressing Yi on (H2i, A2i), i = 1, …, n, using the model for Q2 via least squares; label the estimates (β̂2, ψ̂2).
2. Construct (the sample version of) the stage 1 pseudo-outcome: Ŷ1i = max_{a2} Q2(H2i, a2; β̂2, ψ̂2), i = 1, …, n.
3. Estimate the stage 1 parameters (β1, ψ1) by regressing Ŷ1i on (H1i, A1i), i = 1, …, n, using the model for Q1 via least squares; label the estimates (β̂1, ψ̂1).
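These 3 steps can be sketched on simulated data as follows (the generative model, the choice of histories, and all parameter values are invented for illustration; the stage 2 effect here is driven by the A1O2 interaction, mirroring the delayed effect discussed earlier):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Simulated 2-stage SMART data (all generative values invented).
O1 = rng.normal(0, 1, size=n)                    # baseline severity
A1 = rng.choice([-1.0, 1.0], size=n)             # stage 1 treatment
O2 = rng.binomial(1, 0.5, size=n).astype(float)  # responder status
A2 = rng.choice([-1.0, 1.0], size=n)             # stage 2 treatment
# Outcome: the stage 2 treatment effect depends on A1*O2.
Y = 10.0 + O1 + 0.5 * A1 + (A1 * O2) * A2 + rng.normal(0, 1, size=n)

def fit_q(H, A, y):
    """Least squares for the linear Q-model beta'H + (psi'H)A;
    returns (beta, psi), each of length H.shape[1]."""
    X = np.column_stack([H, H * A[:, None]])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    p = H.shape[1]
    return coef[:p], coef[p:]

# Step 1: stage 2 regression of Y on (H2, A2).
H2 = np.column_stack([np.ones(n), O1, A1, O2, A1 * O2])
beta2, psi2 = fit_q(H2, A2, Y)

# Step 2: stage 1 pseudo-outcome = max over a2 of the fitted Q2;
# for binary a2 this equals beta2'H2 + |psi2'H2|.
Y1 = H2 @ beta2 + np.abs(H2 @ psi2)

# Step 3: stage 1 regression of the pseudo-outcome on (H1, A1).
H1 = np.column_stack([np.ones(n), O1])
beta1, psi1 = fit_q(H1, A1, Y1)

# Only the A1*O2 component of psi2 should be materially nonzero,
# recovering the decision structure "A2 = 1 iff A1*O2 > 0".
print("psi2 (for 1, O1, A1, O2, A1*O2):", np.round(psi2, 2))
print("psi1 (for 1, O1):", np.round(psi1, 2))
```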
Once the 2 Q-functions have been estimated, the estimated optimal treatment decision rules are given by
d̂j(Hj) = arg max_{aj} Qj(Hj, aj; β̂j, ψ̂j), j = 1, 2.

That is, for a patient with history Hj, the estimated optimal decision rules recommend the treatments that maximize the estimated Q-functions. Because Aj is binary (coded −1 or 1) and Qj is specified by the linear model above, it follows that Aj = 1 maximizes Qj if ψ̂j^T Hj > 0, and Aj = −1 maximizes Qj otherwise.
A rule such as that just described is quite realistic and, again, can be illustrated with the addiction management example. It was noted in the addiction management example how a stage 1 rule, such as to prescribe naltrexone if the number of heavy drinking days is greater than a threshold and prescribe CBT otherwise, falls under this framework. Now focus on stage 2. O2 = 1 denotes responders to initial treatment, and O2 = 0 denotes nonresponders. For any patient, there are only 2 treatment options at stage 2 that depend on case history. More precisely, for responders to initial treatment (i.e., for patients with O2 = 1), A2 is either telephone monitoring (1) or telephone counseling and monitoring (−1). For nonresponders to naltrexone (i.e., for patients with A1 = 1, O2 = 0), A2 is either CBT (1) or enhanced motivation + CBT + naltrexone (−1). For nonresponders to CBT (i.e., for patients with A1 = −1, O2 = 0), A2 is either naltrexone (1) or enhanced motivation + CBT + naltrexone (−1; Figure 1).
Now suppose that the optimal stage 2 treatments are telephone monitoring (A2 = 1) for responders to naltrexone (A1 = 1, O2 = 1), telephone counseling and monitoring (A2 = −1) for responders to CBT (A1 = −1, O2 = 1), and enhanced motivation + CBT + naltrexone (A2 = −1) for all nonresponders (O2 = 0). Then the optimal rule can be compactly stated as "prescribe A2 = 1 if A1O2 > 0, and prescribe A2 = −1 otherwise." Thus, the interaction term A1O2 serves as the only relevant history for decision making at stage 2 (i.e., H2 = A1O2), and the corresponding parameter ψ2 is simply equal to 1. Applying Q-learning can lead investigators to such a rule.
In summary, Q-learning involves 2 separate regression analyses, 1 per stage, instead of an all-at-once regression of Y on (O1,A1,O2,A2). As a result, any spurious path between the outcome Y and the stage 1 treatment A1 via the unmeasured confounding variables is broken, and thus Berkson's paradox is avoided. This method can easily be generalized to more than 2 stages (i.e., to studies in which patients are randomized to treatments more than twice). Implementation of the algorithm is relatively easy as well; standard software can be used. Chakraborty et al.12 provided an application of Q-learning (the basic version discussed here as well as an improved version) to analyze a 2-stage smoking cessation trial13 with the goal of determining the optimal DTR (2-stage behavioral intervention).
DISCUSSION AND CONCLUSIONS
Dynamic treatment regimes offer an important methodological framework for developing evidence-based interventions to manage chronic disorders. As mentioned earlier, this framework encompasses the ideological shift in research from the acute care model to the chronic care model. Furthermore, instead of adhering to the old-school “one-size-fits-all” principle of health intervention, it allows individualization of treatment type, dosage, and timing at each intervention stage. The methodology for constructing optimal evidence-based DTRs has developed relatively recently at the interface of statistics and computer science. Hence, most public health researchers are unfamiliar with these new developments. I have attempted to briefly present the key concepts and tools needed to construct an optimal DTR.
This article is in no way intended to provide a complete review of all of the methods available for analyzing data from SMART designs (and longitudinal observational studies) and thereby constructing optimal DTRs. There exist other methods, of course, such as the likelihood-based methods developed by Thall et al.10,24,25 and the semiparametric methods developed by Murphy26 and Robins.27 Moodie et al.28 offered a useful discussion of DTRs, including the connection between different semiparametric methods of estimation and their application to observational data. Nonetheless, Q-learning is probably the simplest method and one that I hope will appeal to a wider readership. I believe that the DTR paradigm has great promise for application in a variety of medical and public health research projects, in particular the management of chronic diseases.
Acknowledgments
This work was supported by the Department of Biostatistics, Columbia University.
I acknowledge the statistical editor, Roger Vaughan, and the anonymous referees for useful comments on the article.
Human Participant Protection
Because of the conceptual nature of this article, no protocol approval was necessary.
References
1. Chronic Conditions: Making the Case for Ongoing Care: September 2004 Update. Baltimore, MD: Partnership for Solutions, Johns Hopkins University; 2004.
2. The World Health Report 1997: Conquering Suffering, Enriching Humanity. Geneva, Switzerland: World Health Organization; 1997.
3. Wagner EH, Austin BT, Davis C, Hindmarsh M, Schaefer J, Bonomi A. Improving chronic illness care: translating evidence into action. Health Aff (Millwood). 2001;20(6):64–78.
4. Collins LM, Murphy SA, Bierman KL. A conceptual framework for adaptive preventive interventions. Prev Sci. 2004;5(3):185–196.
5. Murphy SA. An experimental design for the development of adaptive treatment strategies. Stat Med. 2005;24(10):1455–1481.
6. Lavori PW. A design for testing clinical strategies: biased adaptive within-subject randomization. J R Stat Soc Ser A. 2000;163(1):29–38.
7. Lavori PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clin Trials. 2004;1(1):9–20.
8. Schneider LS, Tariot PN, Lyketsos CG, et al. National Institute of Mental Health Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE): Alzheimer disease trial methodology. Am J Geriatr Psychiatry. 2001;9(4):346–360.
9. Rush AJ, Fava M, Wisniewski SR, et al. Sequenced Treatment Alternatives to Relieve Depression (STAR*D): rationale and design. Control Clin Trials. 2004;25(1):119–142.
10. Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Stat Med. 2000;19(8):1011–1028.
11. Wahed AS, Tsiatis AA. Optimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomized designs in clinical trials. Biometrics. 2004;60(1):124–133.
12. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19(3):317–343.
13. Strecher VJ, McClure JB, Alexander GL, et al. Web-based smoking cessation programs: results of a randomized trial. Am J Prev Med. 2008;34(5):373–381.
14. Auyeung SF, Long Q, Royster EB, et al. Sequential multiple-assignment randomized trial design of neurobehavioral treatment for patients with metastatic malignant melanoma undergoing high-dose interferon-alpha therapy. Clin Trials. 2009;6(5):480–490.
15. Murphy SA, Lynch KG, Oslin D, McKay JR, TenHave T. Developing adaptive treatment strategies in substance abuse research. Drug Alcohol Depend. 2007;88(suppl 2):S24–S30.
16. Oetting AI, Levy JA, Weiss RD, Murphy SA. Statistical methodology for a SMART design in the development of adaptive treatment strategies. In: Shrout PE, ed. Causality and Psychopathology: Finding the Determinants of Disorders and Their Cures. New York, NY: Oxford University Press; 2010.
17. Feng W, Wahed AS. Sample size for two-stage studies with maintenance therapy. Stat Med. 2009;28(15):2028–2041.
18. Li Z, Murphy SA. Sample Size Calculation for Comparing Two-Stage Treatment Strategies With Censored Data. Ann Arbor, MI: Dept of Statistics, University of Michigan; 2009.
19. Gail MH, Benichou J, eds. Encyclopedia of Epidemiologic Methods. Chichester, England: John Wiley & Sons Inc; 2000.
20. Watkins CJCH. Learning From Delayed Rewards [PhD dissertation]. Cambridge, England: University of Cambridge; 1989.
21. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
22. Murphy SA. A generalization error for Q-learning. J Mach Learn Res. 2005;6:1073–1097.
23. Bellman RE. Dynamic Programming. Princeton, NJ: Princeton University Press; 1957.
24. Thall PF, Sung HG, Estey EH. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. J Am Stat Assoc. 2002;97(457):29–39.
25. Thall PF, Wooten LH, Logothetis CJ, Millikan RE, Tannir NM. Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Stat Med. 2007;26(26):4687–4702.
26. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc Ser B. 2003;65(2):331–355.
27. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, eds. Proceedings of the Second Seattle Symposium in Biostatistics. New York, NY: Springer; 2004:189–326.
28. Moodie EE, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63(2):447–455.