American Journal of Epidemiology. 2017 Jun 22;186(2):160–172. doi: 10.1093/aje/kwx027

Tools for the Precision Medicine Era: How to Develop Highly Personalized Treatment Recommendations From Cohort and Registry Data Using Q-Learning

Elizabeth F Krakow, Michael Hemmer, Tao Wang, Brent Logan, Mukta Arora, Stephen Spellman, Daniel Couriel, Amin Alousi, Joseph Pidala, Michael Last, Silvy Lachance, Erica E M Moodie
PMCID: PMC6664807  PMID: 28472335

Abstract

Q-learning is a method of reinforcement learning that employs backwards stagewise estimation to identify sequences of actions that maximize some long-term reward. The method can be applied to sequential multiple-assignment randomized trials to develop personalized adaptive treatment strategies (ATSs)—longitudinal practice guidelines highly tailored to time-varying attributes of individual patients. Sometimes, the basis for choosing which ATSs to include in a sequential multiple-assignment randomized trial (or randomized controlled trial) may be inadequate. Nonrandomized data sources may inform the initial design of ATSs, which could later be prospectively validated. In this paper, we illustrate challenges involved in using nonrandomized data for this purpose with a case study from the Center for International Blood and Marrow Transplant Research registry (1995–2007) aimed at 1) determining whether the sequence of therapeutic classes used in graft-versus-host disease prophylaxis and in refractory graft-versus-host disease is associated with improved survival and 2) identifying donor and patient factors with which to guide individualized immunosuppressant selections over time. We discuss how to communicate the potential benefit derived from following an ATS at the population and subgroup levels and how to evaluate its robustness to modeling assumptions. This worked example may serve as a model for developing ATSs from registries and cohorts in oncology and other fields requiring sequential treatment decisions.

Keywords: adaptive treatment strategies, dynamic treatment regimes, graft-versus-host disease, machine learning, personalized medicine, prediction, Q-learning, registry data


A central problem in medical practice is that clinical trials testing one agent against some alternative might not elucidate the “best” treatment for a particular patient, because 1) treatment effects may vary within subgroups that may not have been considered in the trial; 2) the effect of patients’ time-varying characteristics might dwarf the initial effect of the baseline subgroup to which they belonged; and 3) there may be delayed synergism or antagonism between the treatment examined in the trial and heterogeneous treatments not included in the data capture or analysis (1). Therefore, it is challenging to apply traditional clinical trial methodology to the question, “How can we optimize the sequence of specific treatments for specific patients?”

An adaptive treatment strategy (ATS), also called a “dynamic treatment regime,” is a set of decision rules for allocating treatment where input includes fixed and time-varying patient characteristics and treatment history and output is a highly individualized recommendation made each time a treatment decision is needed. Elsewhere, we and others have described sequential multiple-assignment randomized trial (SMART) methodology, which is uniquely suited to developing ATSs, with an emphasis on potential applications in the oncology field (1, 2).

In many situations, information needed to study ATSs in a randomized trial may not be readily available. Sample size calculation for SMARTs is a matter of active research (1). In the study of rare diseases, SMARTs may be infeasible because of the required sample size or the difficulty involved in coordinating a study across many institutions. Thus, exploratory analysis of nonrandomized data sources, such as registries and cohort studies, and reanalysis of multistage randomized controlled trials (RCTs) might be useful for devising ATSs and for planning practical SMARTs.

Our objective in this paper is to illustrate how one might use nonrandomized data from registries or cohorts to propose ATSs and how to evaluate the robustness of an estimated “optimal” ATS. To accomplish this, we took the example of sequential prevention and treatment of acute graft-versus-host disease (GVHD), a frequent complication of allogeneic hematopoietic cell transplantation (AHCT), and tested the feasibility of implementing Q-learning algorithms with Center for International Blood and Marrow Transplant Research (CIBMTR) registry data, with the goal of proposing an ATS for immunosuppressive management that would maximize disease-free survival 2 years post-AHCT (3).

MOTIVATING EXAMPLE: GVHD PREVENTION AND TREATMENT

In AHCT, immunosuppressive therapeutics are often administered sequentially to prevent and (if needed) treat GVHD. Their delayed effects, potential synergy or antagonism, and appropriateness given the complex, evolving characteristics of individual patients remain poorly characterized. Moreover, because of limited monetary resources, logistical challenges, and the heterogeneity (yet relative scarcity) of patients who develop the most severe posttransplant complications, such as steroid-refractory GVHD, there is a dearth of large RCTs for guiding practice; registry data have likewise proven difficult to exploit for developing “precision medicine” approaches. Therefore, the long-standing problem of how best to prevent and treat acute GVHD is an important question to address by applying novel approaches to the study of registry data.

CIBMTR registry studies influence AHCT practice worldwide. The CIBMTR's history, its current structure, and the information collected on its Comprehensive Report Forms are described elsewhere (4).

METHODS: Q-LEARNING

Q-learning is a reinforcement learning technique used to find the optimal sequence of actions from time-varying sets of possible actions. Although other frequentist (5–7), Bayesian (8, 9), and semiparametric (10, 11) approaches to estimating optimal ATSs have been developed, Q-learning is appealing because of its simplicity: It proceeds via a sequence of predictions (often performed using regression) and hence relies on familiar analytical approaches, accounting for time-dependent confounding at each treatment stage by recursively modeling the outcome on the covariates measured prior to the treatment allocation at that stage. That is, for each patient at each decision point, starting with the final decision and then working backwards in time, it assigns a value to the expected outcome resulting from each possible pairing of 1) a treatment choice with 2) the observations made on that patient up to the given decision point. See Table 1 for the required assumptions. Q-learning using observational data and discrete outcomes is described in detail elsewhere (3, 12).
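The backward recursion described above can be sketched on simulated data. The following is a minimal illustration, not the authors' implementation (which was done in R): it uses a small hand-rolled logistic regression (iteratively reweighted least squares, which also accepts the fractional stage-1 pseudo-outcome), invented variable names, and a toy data-generating model in which the optimal second-stage action depends on the sign of a tailoring variable O2.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via iteratively reweighted least squares.
    Accepts fractional outcomes in [0, 1], as needed when regressing
    the stage-1 pseudo-outcome (a predicted probability)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = expit(X @ beta)
        w = mu * (1.0 - mu) + 1e-9           # IRLS working weights
        z = X @ beta + (y - mu) / w          # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

rng = np.random.default_rng(0)
n = 2000
O1 = rng.normal(size=n)                      # baseline tailoring variable
A1 = rng.integers(0, 2, size=n)              # stage 1 treatment (0/1)
O2 = 0.5 * O1 + rng.normal(size=n)           # interval 2 tailoring variable
A2 = rng.integers(0, 2, size=n)              # stage 2 treatment (0/1)
# Toy truth: A2 helps when O2 > 0 and harms when O2 < 0
Y = rng.binomial(1, expit(0.3 * O1 + 0.2 * A1 + 1.5 * A2 * O2))

# Stage 2: model the outcome given history, A2, and the A2 x O2 interaction
X2 = np.column_stack([np.ones(n), O1, A1, O2, A2, A2 * O2])
b2 = fit_logistic(X2, Y)

# Pseudo-outcome: predicted success probability under the better A2 option
X2_a0, X2_a1 = X2.copy(), X2.copy()
X2_a0[:, 4], X2_a0[:, 5] = 0.0, 0.0          # set a2 = 0
X2_a1[:, 4], X2_a1[:, 5] = 1.0, O2           # set a2 = 1
Y_tilde = np.maximum(expit(X2_a0 @ b2), expit(X2_a1 @ b2))

# Stage 1: regress the pseudo-outcome on baseline variables, A1, A1 x O1
X1 = np.column_stack([np.ones(n), O1, A1, A1 * O1])
b1 = fit_logistic(X1, Y_tilde)

# Estimated rules: choose the action whose predicted Q-value is larger
a2_opt = (b2[4] + b2[5] * O2 > 0).astype(int)
a1_opt = (b1[2] + b1[3] * O1 > 0).astype(int)
```

The same backward pass extends to more than 2 stages by repeating the pseudo-outcome construction at each earlier stage.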

Table 1.

Assumptions Required for Estimating an Adaptive Treatment Strategya

Each assumption below is common to all of these estimation methods; the description is followed by examples of potential violations.

Stable unit treatment value: A subject's outcome is not influenced by other subjects' treatment allocation (33). Potential violations: vaccination trials (due to “herd immunity”); intersubject contact in support groups.

Only 1 version of treatment: There are not multiple versions of treatment that vary in effectiveness (33). Potential violations: clinician experience with a new treatment improves outcomes for similarly allocated patients over time; different clinicians implement what is ostensibly the same treatment very differently (e.g., variations in surgical skill, titrating to different plasma levels of a given drug, or employing different supportive care measures).

Sequential ignorability/no unmeasured confounders: Conditional on patient history, treatment received at stage j is independent of future potential observations, treatments, and outcome. Potential violations: patients who are more likely to die are more often assigned the more intense treatment option, where propensity to die cannot be accurately measured and so cannot be accurately controlled for; missing confounder information.

Censoring is noninformative: Conditional on patient history and measured covariates, the potential outcomes of censored subjects follow the same distribution as those of fully followed subjects. Potential violation: patients who are lost to follow-up or ineligible for further treatment decisions are fundamentally different from those who remain in the study cohort over time.

Feasibility/positivity: Sufficient numbers of both treated and untreated subjects must exist at every level of the treatment and covariate history, at each decision stage. Potential violations: theoretical (the study design prohibits certain subjects from receiving a particular treatment); practical (few patients with a given covariate pattern are included in the study sample or are assigned a particular treatment).

Time ordering: Treatment precedes outcome. Potential violation: the severity of the subject's condition is not remeasured immediately prior to initiation of the next treatment, so an improvement that happened prior to that next treatment is later falsely attributed to that treatment.

Correct specification of all models: Models must be correctly specified to obtain unbiased and efficient parameter estimates. Potential violations: failure to choose the correct functional form; failure to include an important tailoring variableb.

a Elaboration on these assumptions is presented elsewhere (34, 35).

b Realize, however, that “perfection is the enemy of the good.” While including all necessary tailoring variables is vital to estimating the true optimal treatment strategy, there might also be value in restricting the included variables to those that are convenient and cost-effective to measure in routine clinical practice.

Briefly, considering 2 sequential decisions and a binary endpoint, longitudinal data on a single patient are given by the trajectory $(C_1, O_1, A_1, C_2, O_2, A_2, Y)$, where $C_j$ and $O_j$ ($j = 1, 2$) denote the sets of covariates measured prior to treatment beginning at the $j$th interval and $Y \in \{0, 1\}$ is the outcome observed at the end of interval 2. $A_j$ is the set of actions (treatments), one of which could be assigned at the $j$th interval subsequent to observing $O_j$. Possible actions within the set are represented by $a_{kj}$. The current work uses a 2-level choice (i.e., $A_j \in \{0, 1\}$). Covariates represented by $O_j$ interact with treatment (i.e., are treatment effect modifiers) and are called tailoring or prescriptive variables; they directly determine the optimal choice of $a_{kj}$ from $A_j$. Those represented by $C_j$ do not interact with treatment but either confound the relationship between treatment and outcome or are risk factors for the outcome (note that risk factors are required for correct outcome model specification). Any variable included in $O_j$ may also be a confounding variable, that is, a shared independent cause of both treatment and outcome. The history at each interval is defined as $H_1 = (C_1, O_1)$ and $H_2 = (C_1, O_1, A_1, C_2, O_2)$. To estimate the optimal ATS for a discrete outcome in a random sample of size $n$ drawn from a larger population (indexed by $i = 1, \ldots, n$), the first step is estimating the parameters $\hat{\beta}_2$ and $\hat{\psi}_2$ of the conditional mean model for the outcome, $f(E[Y \mid H_{2i}, A_{2i}]) = Q_2(H_{2i}, A_{2i}; \beta_2, \psi_2)$, at the second decision point ($j = 2$) using generalized linear model regression with a strictly increasing link function $f(\cdot)$, such as a log or logistic transformation; the increasing link function ensures unique maximization of the Q-function. The second step is to estimate the optimal interval 2 rule by substitution, such that $\hat{a}_2^{opt}(h_2) = \arg\max_{a_2} Q_2(h_2, a_2; \hat{\beta}_2, \hat{\psi}_2)$. Note, then, that

$$Q_2(H_{2i}, \hat{a}_2^{opt}) = \begin{cases} \hat{\beta}_2^T H_{2,\beta} & \text{if } \hat{a}_2^{opt}(h_2) = 0 \\ \hat{\beta}_2^T H_{2,\beta} + \hat{\psi}_2^T H_{2,\psi} & \text{otherwise,} \end{cases}$$

where $H_{2,\beta} = (C_1, O_1, A_1, C_2)$ and $H_{2,\psi} = O_2$ are subsets of the stage 2 history vector, $H_2$. It is standard in the ATS literature to assume that greater values of the outcome are more desirable, so we have coded our outcome as 1 for disease-free survival and 0 for death or relapse. Had we chosen a more conventional coding, the optimal treatment would have been that which minimized the Q-function.

Having estimated the best stage-2 treatment, the “pseudo-outcome” $\tilde{Y}_{1i}$ for interval 1 is generated for each patient $i$ in the $Q_2$ model by maximizing the logit of the probability of success in the second interval under the patient's own optimal $A_2$ treatment, which might be the treatment actually administered or the counterfactual option, and setting

$$\tilde{Y}_{1i} = f^{-1}\big(Q_2(H_{2i}, \hat{a}_2^{opt}(h_2); \hat{\beta}_2, \hat{\psi}_2)\big) = \operatorname{expit}\big(\hat{\beta}_2^T H^0_{2,i} + \hat{\psi}_2^T H^1_{2,i}\, \hat{a}_2^{opt}(h_2)\big),$$

where $H^1_2 = O_2$ is the set of second-interval variables that modify the effect of treatment $a$ and $H^0_2 \subseteq H_2$ is the set of all other variables needed to predict the outcome. $\tilde{Y}_{1i}$ is not the binary predicted outcome under optimal treatment but rather the probability of the outcome given optimal treatment. The pseudo-outcome thus provides a way of ensuring that the first-stage treatment analysis compares patients fairly: Rather than take the observed outcome under the observed second-stage treatments (which vary across patients), we use the predicted probability of the positive outcome under the (estimated) best treatment. Thus, the stage 1 regression seeks to determine the relationship between stage 1 treatment and the final outcome measured sometime after the administration of stage 2 treatment, as if we could force all patients to follow the same stage-2 treatment rule. In a continuous outcome setting, the pseudo-outcome has a direct interpretation in terms of an individual-level counterfactual outcome; in the discrete outcome setting, the pseudo-outcome can be viewed as an expected counterfactual (10, 13).

The interval 1 parameters $(\hat{\beta}_1, \hat{\psi}_1)$ are now estimated, using a generalized linear model to regress the pseudo-outcome $\tilde{Y}_{1i}$ on all baseline/confounding variables, all interval 1 treatments, and all interactions between interval 1 treatments and potential tailoring variables. Finally, the optimal first-stage decision rule is found by calculating, patient by patient, which $A_1$ treatment would yield the best outcome, using the regression parameter estimates $(\hat{\beta}_1, \hat{\psi}_1)$, with the result that $\hat{a}_1^{opt}(h_1) = \arg\max_{a_1} Q_1(h_1, a_1; \hat{\beta}_1, \hat{\psi}_1)$.

Note that $\hat{\psi}_j^T H^1_{j,i} = \psi_0 + \psi_1 O_1 + \psi_2 O_2 + \cdots + \psi_p O_p$ may be positive for some values of $(o_1, o_2, \ldots, o_p)$ and zero or negative for others. The consequence is that the net relationship of multiple covariates with the outcome can be accounted for simultaneously and used in aggregate to tailor treatment to the given patient.
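As a toy numeric illustration of this sign logic, with hypothetical coefficients and tailoring variables (not estimates from the registry analysis): two tailoring variables can individually pull in opposite directions, and the recommendation follows the sign of the net linear combination.

```python
import numpy as np

# Hypothetical interaction (psi) coefficients on the logit scale:
# [treatment main effect, treatment x CMV-positive, treatment x cord graft]
psi = np.array([0.4, -0.5, -0.6])

# Two patients, coded as [1, cmv_positive, cord_graft]
patients = np.array([[1, 0, 0],    # CMV-negative, bone marrow graft
                     [1, 1, 1]])   # CMV-positive, umbilical cord graft

blip = patients @ psi              # net treatment effect for each patient
a_opt = (blip > 0).astype(int)     # recommend the treatment only if net > 0
# blip is about [0.4, -0.7], so a_opt = [1, 0]: treat the first patient only
```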

Tailoring variables

Typically, subject matter knowledge is used to select a parsimonious set of tailoring variables, Oj, that are likely to modify the relationship between treatment and the outcome and that are available in routine clinical practice. If many variables are included in Oj, the model could become unstable; variable selection approaches (e.g., see Gunter et al. (14)) could be used to eliminate unnecessary variables, although such approaches have not been widely adopted because of their computational complexity.

Expected results

For each patient, Q-learning generates an estimate of aˆjopt so that, for example, if the patient did not receive optimal stage 1 treatment, the stage 2 decision rule could still be applied. The algorithm also estimates the probability of response (e.g., survival) under the optimal ATS on a patient-specific level. Importantly, the pseudo-outcomes under the different treatment options can be compared and aggregated across patients to obtain population-level estimates.

CASE STUDY ANALYSIS

Patient selection and treatment definition

The CIBMTR selected 11,141 patients who had undergone AHCT between 1995 and 2007 and whose records had been retrieved for a study of chronic GVHD. Our study was approved by the CIBMTR Committee on Graft-versus-Host Disease and the McGill University Research Ethics Board. Eligible diagnoses included acute myeloid leukemia and myelodysplastic syndrome; 9,563 patients were retained for analysis. Web Appendix 1 (sections 1 and 2), available at https://academic.oup.com/aje, provides details of patient selection and the analysis.

Our analysis focused on 2 stages of treatment: 1) GVHD prophylaxis and 2) second-line, or salvage, treatment for persons who developed GVHD and experienced unsuccessful first-line treatment. The goal was to maximize each patient's probability of 2-year disease-free survival (2y-DFS), meaning freedom from death or cancer relapse, measured from the day of graft infusion. We did not attempt to learn about tailoring first-line therapy because there was little variability in first-line GVHD treatment (92% received corticosteroids and 85% received a calcineurin inhibitor). Over 30 different agents administered for GVHD prophylaxis or salvage treatment were assigned to one of 2 groups: “nonspecific, highly T-cell lymphodepleting” (NHTL) therapeutics or “standard” therapeutics. Thus, the treatment options considered are (A1; if GVHD that requires salvage, then A2), where A1 is either NHTL prophylaxis or standard prophylaxis and A2 is either NHTL salvage or standard salvage therapy.

Model development and fitting

The first interval (Q1) was defined as the time from graft infusion to diagnosis of steroid-refractory acute GVHD or last follow-up, whichever occurred first. The second interval (Q2) was defined as the time from diagnosis of steroid-refractory acute GVHD to last follow-up.

In developing the logistic regression models, interactions of treatment with graft source, conditioning intensity, disease status, human leukocyte antigen match, and donor relationship were deemed a priori important to include in the model. Some initial model exploration was performed to determine, for example, whether polynomial transformations were needed and which potential confounders would be retained in the set Cj (see Web Appendix 1, section 6.3).

Missing data were addressed using inverse-probability-of-censoring weights for those individuals who were lost to follow-up and multiple imputation by chained equations for all other missingness.
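The inverse-probability-of-censoring weighting step can be sketched as follows on simulated data. The censoring model, covariate, and coefficients here are invented for illustration (the actual censoring models are described in Web Appendix 2): subjects who remain under observation are up-weighted by the inverse of their estimated probability of complete follow-up.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                        # covariate predicting dropout
p_followed = expit(2.0 - 0.8 * x)             # true prob of complete follow-up
followed = rng.binomial(1, p_followed)        # 1 = fully observed

# Fit the censoring model (logistic regression via IRLS)
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    mu = expit(X @ beta)
    w = mu * (1.0 - mu) + 1e-9
    z = X @ beta + (followed - mu) / w
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

# Fully observed subjects carry weight 1 / P(observed); censored subjects
# contribute nothing to the weighted outcome regression
p_hat = expit(X @ beta)
ipcw = np.where(followed == 1, 1.0 / p_hat, 0.0)
```

The weights then enter the stage-specific outcome regressions as observation weights, so that the fully observed subjects stand in for the censored ones with similar covariates.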

Figure 1 shows the treatment-and-response trajectories experienced by patients in the cohort. The Q2 model included 36 patients who were lost to follow-up after receiving acute GVHD salvage treatment but before the 2-year mark; inverse-probability-of-censoring weights were calculated to account for their absence from the Q2 model. The weighted Q2 model was then fitted using data from fully observed salvage patients (those who relapsed or died before 2 years or who were alive beyond 2 years), with each patient's observed 2y-DFS status under the salvage treatment actually received (NHTL or standard) as the outcome.

Figure 1.


Treatment-and-response trajectories experienced by transplant patients (n = 9,563) from the Center for International Blood and Marrow Transplant Research registry, 1995–2007. Part A shows the observed trajectories for allogeneic transplant patients who received nonspecific highly T-cell lymphodepleting (NHTL) types of prophylaxis against graft-versus-host disease (GVHD). Part B shows the observed trajectories for the patients who received standard GVHD prophylaxis. In each panel, the A1 band contains patients eligible for the Q1 model, while the A2 band contains patients eligible for the Q2 model (i.e., only steroid-refractory acute GVHD patients). Between A1 and A2, boxes with dashed borders represent patients who were censored during the Q1 interval for model-fitting purposes because they did not die and did not go on to A2 (due to either being lost to follow-up (LTFU) or never developing steroid-refractory GVHD) or because they were treated for chronic GVHD (cGVHD) before acute GVHD (aGVHD), making them ineligible for A2. After A2, dashed borders represent patients who were censored during the Q2 interval for model-fitting purposes because they were LTFU after receiving salvage treatment. Pseudo-outcomes were predicted for all salvage patients (A2 band) using the Q2 model and for all cohort patients (A1 band) using the Q1 model, using inverse-probability-of-censoring weights to accommodate LTFU and Q2-ineligible patients.

Next, using that weighted model, we predicted pseudo-outcomes for all patients who required salvage treatment. The pseudo-outcomes fall within the interval [0, 1] and can be interpreted as the probability that a given patient will survive for 2 years disease-free given his/her baseline covariates and observed prophylaxis, assuming optimal salvage treatment. Each patient's Q1 outcome was assigned as follows: 1) for fully observed patients who did not require salvage treatment, the observed outcome (alive = 1; deceased/relapsed = 0) was used; 2) for patients who required salvage treatment (whether followed through 2 years or lost to follow-up before 2 years), the larger of the 2 pseudo-outcomes from Q2 (the outcome under NHTL salvage and the outcome under standard salvage), that is, the predicted outcome under optimal interval 2 treatment, $Q_2(H_{2i}, \hat{a}_2^{opt})$, was used.

The Q1 model was fitted by means of regression with inverse-probability-of-censoring weights and used to predict the probability of 2y-DFS for each patient in the cohort under NHTL and under standard prophylaxis to determine which prophylaxis treatment led to the higher survival probability for each patient. All work was performed using R, versions 3.0.1–3.1.2 (R Foundation for Statistical Computing, Vienna, Austria) (15). Confidence intervals were based on percentiles computed using a nonparametric bootstrap with 500 resamples. Web Appendix 2, section 3.1, shows assessment of model fit.
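The percentile bootstrap used for the confidence intervals can be sketched as below. The data are simulated, and the estimator is a placeholder sample mean; in the actual analysis, each resample would rerun the full Q-learning fit.

```python
import numpy as np

rng = np.random.default_rng(2)
dfs = rng.binomial(1, 0.42, size=1500)       # simulated 2y-DFS indicators

def estimate(sample):
    # Placeholder estimator; the real analysis refits the Q-learning
    # models on each bootstrap resample
    return sample.mean()

# 500 nonparametric bootstrap resamples, as in the paper
boot = np.array([estimate(rng.choice(dfs, size=dfs.size, replace=True))
                 for _ in range(500)])
lo, hi = np.percentile(boot, [2.5, 97.5])    # percentile 95% CI
```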

RESULTS

Censoring

Type of GVHD prophylaxis or salvage did not significantly predict censoring. Several factors were associated with the probability of being censored prior to the second interval; fewer factors were associated with censoring in the second interval. The censoring weights ranged from approximately 1.0 to 4.5 in each interval (cf. Web Appendix 2, section 2.3).

Adaptive treatment strategies

Tables 2 and 3 show the results of Q-learning applied to the decision of whether to use NHTL therapy for prophylaxis and/or for salvage in multivariable analysis. The relative risks of 2y-DFS suggest that NHTL prophylaxis is detrimental at the population level (Table 2; relative risk = 0.88, 95% confidence interval (CI): 0.76, 0.98) but cannot confirm that NHTL salvage treatment is also detrimental (Table 3; relative risk = 0.80, 95% CI: 0.36, 1.10). Coefficients for most candidate tailoring variables are not significantly different from 0. Notable exceptions are that NHTL prophylaxis appears detrimental in the setting of cytomegalovirus (CMV) positivity and umbilical cord graft sources, as well as in reduced-intensity/nonmyeloablative transplantation, perhaps because it increases serious infections in the former settings (16–18) and increases the incidence of cancer relapse in the latter (19–21). However, considering only the point estimates, certain combinations of values of potential tailoring variables “tip the balance” in favor of using NHTL prophylaxis or NHTL salvage. In this way, the algorithm could identify patients who might benefit more from one intervention than another (Figure 2).

Table 2.

Predictors of 2-Year Disease-Free Survival Among Patients Undergoing Allogeneic Hematopoietic Cell Transplantation, According to the Type of Graft-Versus-Host Disease Prophylaxis Received, Center for International Blood and Marrow Transplant Research Registry, 1995–2007a

Characteristic Estimate Within Reference Category NHTL vs. Standard Prophylaxis, Given the Characteristicb
RR 95% CI RR 95% CI
Type of prophylaxis
 Standard 1 Referent NA NA
 NHTL 0.88 0.76, 0.98 NA NA
Recipient's age, years
 <10 1 Referent 0.88 0.76, 0.98
 10–39 0.93 0.89, 0.99 NT NT
 ≥40 0.84 0.79, 0.89 NT NT
Karnofsky/Lansky performance status at time of transplant, %
 ≥80 1 Referent 0.88 0.76, 0.98
 <80 0.77 0.69, 0.84 NT NT
Disease status at time of transplant
 Early 1 Referent 0.88 0.76, 0.98
 Intermediate 0.93 0.88, 0.97 0.90 0.76, 1.04
 Advanced 0.53 0.48, 0.59 0.87 0.66, 1.10
Donor relationship
 Related 1 Referent 0.88 0.76, 0.98
 Unrelated 0.92 0.88, 0.96 0.99 0.87, 1.10
HLA match
 Well-matched 1 Referent 0.88 0.76, 0.98
 Partially matched 0.90 0.84, 0.95 0.92 0.75, 1.06
 Mismatched 0.84 0.75, 0.93 0.81 0.63, 1.00
Cytomegalovirus status
 Negative-negative 1 Referent 0.88 0.76, 0.98
 Donor or recipient positive 0.95 0.92, 0.99 0.83 0.72, 0.94
Sex match (donor-recipient)
 Male-male, female-female, or male-female 1 Referent 0.88 0.76, 0.98
 Female-male 1.01 0.97, 1.04 NT NT
Graft source
 Bone marrow 1 Referent 0.88 0.76, 0.98
 Peripheral blood 1.01 0.98, 1.05 0.94 0.84, 1.03
 Umbilical cord 1.08 0.99, 1.17 0.82 0.67, 0.96
Conditioning intensity
 Myeloablative 1 Referent 0.88 0.76, 0.98
 RIC/nonmyeloablative 0.93 0.88, 0.99 0.79 0.64, 0.94
Total-body irradiation
 No 1 Referent 0.88 0.76, 0.98
 Yes 0.99 0.95, 1.02 0.88 0.79, 1.05
Other components of GVHD prophylaxis
 Absence of corticosteroid prophylaxis 1 Referent 0.88 0.76, 0.98
 Corticosteroids (not for nausea) 0.96 0.92, 1.00 NT NT

Abbreviations: CI, confidence interval; GVHD, graft-versus-host disease; HLA, human leukocyte antigen; NA, not applicable; NHTL, nonspecific, highly T-cell lymphodepleting; NT, not tested; RIC, reduced-intensity conditioning; RR, relative risk.

a The left-hand section of the table refers to the “main effects” terms, while the right-hand section refers to the combined main effect of NHTL and the interaction with each patient characteristic (covariate level).

b This section shows the combined parameter estimates for tailoring variables with NHTL prophylaxis—that is, the result of the main effects and of interactions between the characteristic and the class of prophylaxis treatment, calculated as RR=expit(βˆ0+βˆn+(ψˆ0+ψˆn)a)/expit(βˆ0+βˆn), where βˆ0 is the model intercept, βˆn is the coefficient for the main effect of the candidate tailoring variable, ψˆ0 is the coefficient for the main effect of NHTL prophylaxis, ψˆn is the coefficient for the interaction of NHTL prophylaxis with the candidate tailoring variable, and action a=1 for NHTL prophylaxis while a=0 for standard prophylaxis; n indexes tailoring variables.
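The footnote's relative-risk formula can be computed directly. The coefficients below are hypothetical placeholders, not values from Table 2; the function simply evaluates the ratio of success probabilities with and without NHTL at a given covariate level.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def rr_nhtl(b0, bn, psi0, psin):
    """RR of 2y-DFS for NHTL (a = 1) vs. standard (a = 0) at a given
    covariate level, following the footnote formula:
    RR = expit(b0 + bn + (psi0 + psin)) / expit(b0 + bn)."""
    return expit(b0 + bn + psi0 + psin) / expit(b0 + bn)

# Hypothetical coefficients: intercept, covariate main effect, NHTL main
# effect, and NHTL x covariate interaction
rr = rr_nhtl(b0=0.2, bn=-0.1, psi0=-0.3, psin=-0.2)   # about 0.76
```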

Table 3.

Predictors of 2-Year Disease-Free Survival Among Transplant Patients Proceeding to Acute Graft-Versus-Host Disease Salvage Treatment, Center for International Blood and Marrow Transplant Research Registry, 1995–2007a,b

Characteristic Estimate Within Reference Category NHTL vs. Standard Salvage, Given the Characteristicc
RR 95% CI RR 95% CI
Type of salvage
 Standard 1 Referent NA NA
 NHTL 0.80 0.36, 1.10 NA NA
Recipient's age, years
 <10 1 Referent 0.80 0.36, 1.10
 10–39 0.79 0.68, 0.89 NT NT
 ≥40 0.66 0.53, 0.79 NT NT
Karnofsky/Lansky performance status at time of transplant, %
 ≥80 1 Referent 0.80 0.36, 1.10
 <80 0.80 0.49, 0.95 NT NT
Disease status at time of transplant
 Early 1 Referent 0.80 0.36, 1.10
 Intermediate 0.97 0.90, 1.04 0.96 0.47, 1.26
 Advanced 0.77 0.62, 0.90 0.89 0.26, 1.39
Donor relationship
 Related 1 Referent 0.80 0.36, 1.10
 Unrelated 0.75 0.61, 0.87 1.13 0.53, 1.64
HLA match
 Well-matched 1 Referent 0.80 0.36, 1.10
 Partially matched 1.01 0.93, 1.07 0.48 0.11, 0.86
 Mismatched 0.86 0.64, 0.99 0.76 0.22, 1.38
Cytomegalovirus status
 Negative-negative 1 Referent 0.80 0.36, 1.10
 Donor or recipient positive 0.97 0.90, 1.02 0.68 0.24, 1.05
Graft source
 Bone marrow 1 Referent 0.80 0.36, 1.10
 Peripheral blood 1.01 0.94, 1.07 0.77 0.29, 1.09
 Umbilical cord 1.08 0.96, 1.19 0.66 0.04, 1.10
Conditioning intensity
 Myeloablative 1 Referent 0.80 0.36, 1.10
 RIC/nonmyeloablative 1.02 0.94, 1.08 0.88 0.34, 1.14
Total-body irradiation
 No 1 Referent 0.80 0.36, 1.10
 Yes 1.06 1.01, 1.13 NT NT
GVHD prophylaxis
 Standard prophylaxis 1 Referent 0.80 0.36, 1.10
 Corticosteroids (not for nausea) 0.90 0.77, 0.99 0.73 0.16, 1.19
 NHTL prophylaxis 1.08 1.02, 1.15 0.86 0.39, 1.08
Time from graft infusion to acute GVHD, months
 <1 1 Referent 0.80 0.36, 1.10
 ≥1 0.97 0.89, 1.03 0.93 0.44, 1.20
Grade of acute GVHD at “start” of treatment  (maximum grade on index form)
 I or II 1 Referent 0.80 0.36, 1.10
 III or IV 0.73 0.59, 0.86 0.48 0.11, 1.02
Use of ≥4 immunosuppressors to treat  acute GVHD on index form
 No 1 Referent 0.80 0.36, 1.10
 Yes 0.94 0.82, 1.02 0.65 0.17, 1.09

Abbreviations: CI, confidence interval; GVHD, graft-versus-host disease; HLA, human leukocyte antigen; NA, not applicable; NHTL, nonspecific, highly T-cell lymphodepleting; NT, not tested; RIC, reduced-intensity conditioning; RR, relative risk.

a The left-hand section of the table refers to the “main effects” terms, while the right-hand section refers to the combined main effect of NHTL and the interaction with each patient characteristic (covariate level).

b Patients who developed chronic GVHD requiring systemic treatment before developing acute GVHD were censored from this analysis, as described in Figure 1.

c This section shows the combined parameter estimates for tailoring variables with NHTL salvage—that is, the result of the main effects and of interactions between the characteristic and the class of salvage treatment, calculated as RR=expit(βˆ0+βˆn+(ψˆ0+ψˆn)a)/expit(βˆ0+βˆn), where βˆ0 is the model intercept, βˆn is the coefficient for the main effect of the candidate tailoring variable, ψˆ0 is the coefficient for the main effect of NHTL salvage, ψˆn is the coefficient for the interaction of NHTL salvage with the candidate tailoring variable, and action a=1 for NHTL salvage while a=0 for standard salvage; n indexes tailoring variables.

Figure 2.


Difference between the predicted probabilities of 2-year disease-free survival (2y-DFS) among transplant patients with the use of nonspecific highly T-cell lymphodepleting (NHTL) therapeutics versus standard prophylaxis treatment and with NHTL versus standard salvage treatment, Center for International Blood and Marrow Transplant Research registry, 1995–2007. The vertical axis displays the predicted probability of 2y-DFS with NHTL treatment minus that with standard treatment. Hence, positive values favor NHTL treatment and negative values favor standard treatment. The black plotline indicates superiority of NHTL treatment, while the gray plotline indicates superiority of standard treatment. Part A displays the results for the prophylaxis decision, and part B displays the results for the salvage decision. For 1,120 (12%) patients in the prophylaxis cohort and 606 (42%) salvage patients, the predicted difference in 2y-DFS was more than 10% with the different options (shown by the horizontal guide lines).

The Q1 model predicted that 4,766 patients (50%) would have a higher probability of 2y-DFS with NHTL prophylaxis and 4,797 (50%) would fare better with standard prophylaxis (Figure 2A). These proportions were stable (49%–50%) across models employing somewhat different predictors and different parameterizations of the predictors. The magnitude of the benefit was modest. Among patients predicted to benefit more from standard prophylaxis, the median absolute difference in 2y-DFS probability predicted under standard prophylaxis versus NHTL prophylaxis was 5% (range, nearly 0% to 16%). Likewise, among patients predicted to benefit more from NHTL prophylaxis, the median absolute difference in 2y-DFS probability predicted under NHTL prophylaxis versus standard prophylaxis was 4% (range, nearly 0% to 16%). A total of 4,322 (45%) patients were predicted to have 2y-DFS differences greater than or equal to 5%, and 1,120 (12%) were predicted to have 2y-DFS differences greater than or equal to 10%, with alternative prophylaxis options.

For the 1,411 patients requiring salvage treatment for acute GVHD, the Q2 model predicted that 506 patients (36%) would have a higher probability of 2y-DFS with NHTL salvage and 905 (64%) would fare better with standard salvage (Figure 2B). These proportions were fairly stable (33%–37% favoring NHTL) across models employing different predictors and different parameterizations of the predictors. Although the interaction section of Table 3 (last 2 columns) suggests that NHTL is uniformly detrimental (recall that the relative risks represent the relative risk of survival), when the patient-specific linear combinations of main and interaction terms are calculated, NHTL can trump standard salvage treatment for some patients. The contrast between predicted 2y-DFS under the alternative salvage choices was often substantial (though variability was greater because of the smaller sample size). Among patients predicted to benefit more from standard salvage, the absolute difference in 2y-DFS probability predicted under standard salvage versus NHTL salvage was a median of 9% (range, nearly 0% to 57%). Similarly, among patients predicted to benefit more from NHTL salvage, the absolute difference in 2y-DFS probability predicted under NHTL salvage versus standard salvage was a median of 7% (range, nearly 0% to 41%). For 898 (63%) patients, the predicted 2y-DFS with alternative salvage options differed by at least 5%, and for 606 (42%) patients, it differed by at least 10%.

Population-level survival under the “best” treatment

Figure 3 compares the anticipated proportion of patients achieving 2y-DFS if given the ATS-recommended prophylaxis and salvage treatment compared with the results of several fixed regimes, where everyone receives 1) NHTL prophylaxis and, if needed, NHTL salvage; 2) standard prophylaxis and, if needed, standard salvage; 3) standard prophylaxis and, if needed, NHTL salvage; and 4) NHTL prophylaxis and, if needed, standard salvage.
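The comparison underlying Figure 3 amounts to averaging predicted outcomes over the population under each policy: a fixed regime's value is the mean predicted 2y-DFS with everyone assigned that option, whereas the ATS value is the mean predicted 2y-DFS when each patient receives whichever option the model favors for him or her. A toy sketch with invented predicted probabilities (not the study's fitted values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical model-predicted 2y-DFS probabilities for each patient
p_standard = rng.uniform(0.30, 0.50, n)   # under standard prophylaxis
p_nhtl = rng.uniform(0.30, 0.50, n)       # under NHTL prophylaxis

value_standard = p_standard.mean()        # fixed regime: everyone standard
value_nhtl = p_nhtl.mean()                # fixed regime: everyone NHTL
value_ats = np.maximum(p_standard, p_nhtl).mean()  # follow the model's pick

print(value_standard, value_nhtl, value_ats)
```

By construction, the in-sample ATS value can never fall below the better fixed regime; this is one reason the modest gain in Figure 3 must be confirmed in independent data, as noted below.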

Figure 3.

Estimated population 2-year disease-free survival among transplant patients that would be observed if a particular acute graft-versus-host disease (GVHD) prophylaxis and salvage treatment strategy were applied to the entire population, Center for International Blood and Marrow Transplant Research registry, 1995–2007. From left to right: Patient-specific prophylaxis and GVHD salvage treatment determined by the adaptive treatment strategy (ATS) estimated via the Q-learning algorithm; universal nonspecific highly T-cell lymphodepleting (NHTL) prophylaxis and salvage treatment; universal NHTL prophylaxis and standard salvage treatment; universal standard prophylaxis and standard salvage treatment; and universal standard prophylaxis and NHTL salvage treatment. The error bars represent 95% confidence intervals. The ATS achieves a slightly higher probability of 2-year disease-free survival than the alternative policies. Specifically, the risk differences (in favor of survival) are 3.9% (95% confidence interval (CI): 2.0, 6.1) for the ATS compared with a policy of universal NHTL prophylaxis and salvage treatment, 4.8% (95% CI: 2.8, 7.2) for the ATS compared with universal NHTL prophylaxis and standard salvage treatment, 4.1% (95% CI: 2.5, 5.9) for the ATS compared with universal standard prophylaxis and standard salvage treatment, and 3.5% (95% CI: 2.0, 5.1) for the ATS compared with universal standard prophylaxis with NHTL salvage treatment.

These results suggest that the personalized ATS affords slightly superior 2y-DFS for the population (41.97%, 95% CI: 41.16, 44.22) compared with the best fixed strategy, employing standard prophylaxis with NHTL salvage (39.43%, 95% CI: 37.95, 40.35). These population-level estimates are computed in the same data set as that used to build the models; an independent validation cohort would be needed to confirm these findings.

Stability of personalized recommendations

The ATS was derived following averaging across multiple imputed data sets. The stability of the patient-specific treatment recommendations across the different imputations provides limited reassurance about the robustness of these recommendations to small changes in covariate values (Web Appendix 2, section 3.2). Only 4% of patients received a conflicting prophylaxis recommendation in at least 1 of the 5 imputations, and 6% received a conflicting salvage treatment recommendation in at least 1 of the 5 imputations.
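The stability check described above reduces to asking, per patient, whether all imputations yield the same recommendation. A sketch with simulated 0/1 recommendations (the array values are invented; the logic is the check itself):

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_imputations = 200, 5

# Hypothetical 0/1 recommendations (1 = NHTL) per patient per imputation
recs = rng.integers(0, 2, size=(n_patients, n_imputations))
# Force most patients to agree across imputations, as observed in the study
recs[:190] = recs[:190, [0]]

# A patient "conflicts" if the 5 imputations do not all agree
conflicting = recs.min(axis=1) != recs.max(axis=1)
print(f"{conflicting.mean():.0%} of patients had a conflicting recommendation")
```

The reported 4% (prophylaxis) and 6% (salvage) conflict rates were computed in this spirit across the 5 imputed data sets.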

Post-hoc consideration of sample size

Despite the very large registry available to us, post-hoc analyses suggested that the wide confidence intervals reported here may be attributed to lack of power, as the detection of interactions—which is what we seek in estimating an ATS—requires very large samples (Web Appendix 2, section 3.3).

DISCUSSION

To develop an ATS with observational data, we suggest a multistep process (Appendix Table 1). This work illustrates how this process can yield highly personalized treatment recommendations. For example, our model predicted that a 9-year-old, CMV-negative male recipient of an unrelated female, partially matched, CMV-negative peripheral blood graft for treatment of advanced acute myeloid leukemia after cyclophosphamide plus total-body irradiation myeloablative conditioning would have a 16% greater chance of 2y-DFS if given NHTL prophylaxis. A 20-year-old CMV-positive female recipient of a human leukocyte antigen-identical, CMV-negative sibling peripheral blood graft for treatment of early acute myeloid leukemia after reduced-intensity busulfan-fludarabine was predicted to have superior 2y-DFS with standard prophylaxis (45%) as compared with NHTL prophylaxis (33%).

Our goal was to introduce readers to the concept of ATSs and one of the methods used to estimate such treatment sequences, Q-learning. Q-learning is easily implemented and flexible, adaptable to continuous (22, 23), discrete (12), and time-to-event (24) outcomes, as well as continuous or multicategory exposures. Analysis is not restricted to regression: Q-learning can be implemented using any form of prediction (e.g., trees).

Many alternatives to Q-learning are available. G-computation (whether Bayesian (25, 26) or frequentist (27) in implementation) is conceptually straightforward, but it requires estimation of the joint distribution of all tailoring and confounding variables at each interval; for this reason, the approach has not found favor in practice, as misspecification of any of these potentially high-dimensional models can lead to incorrect conclusions (10). Semiparametric approaches include g-estimation (10), dynamic marginal structural models (28–30), and dynamic weighted ordinary least squares (31); these approaches offer advantages, as each can be implemented in a doubly robust manner. Of these, g-estimation is perhaps the most difficult to understand; however, an R package has been developed with which to implement g-estimation and dynamic weighted ordinary least squares with an arbitrary number of treatment intervals for either binary or continuous (dose) treatments (32). In contrast, the dynamic marginal structural model approach and related approaches rely on familiar techniques, such as inverse weighting, but they require careful preprocessing of the data prior to analysis and are difficult to implement with a large or varying number of tailoring variables.

This article is not intended to propose a definitive ATS for acute GVHD prophylaxis and treatment. To do so would require a more recent cohort treated with contemporary practices and measurement of key confounders that are currently missing from CIBMTR Comprehensive Report Forms (cf. Web Appendix 2, section 3.4). That caveat aside, the estimated increased probability of survival from following the optimal ATS to prevent and treat GVHD is large enough (≥10%) for a substantial fraction of patients to justify further research in this area, even though, on a population level, following the optimal ATS compared with the “worst” policy of “NHTL prophylaxis/standard salvage for everyone” is predicted to improve 2y-DFS by approximately 4% (41.97% 2y-DFS vs. 37.83% 2y-DFS). A 2-arm RCT randomizing 1:1 between the ATS and this “worst” policy with 80% power to detect such a difference (type I error rate 0.05, pooled standard deviation 0.016) would require approximately 4,500 patients. Web Appendix 2, section 3.3, offers further guidance on power/sample size, taking into account tailoring variables. Simulations can also provide a useful tool for understanding the needed statistical power and potential gains from ATSs. However, neither RCTs nor large SMARTs are easily executable in the AHCT field, as most transplant centers perform few allogeneic transplants (from <25 to about 100) annually. Happily, answering relevant questions is feasible in the registry setting; the CIBMTR currently includes more than 425,000 patients, banks more than 42,000 paired patient-donor blood samples, and curates thousands of patients annually. An ATS estimated from registry data could be validated to some extent through additional observational studies. For example, results from institutions that implemented the proposed ATS could be compared with those from institutions that did not, provided there exist sufficient numbers of institutions whose practices are compatible with the proposed ATS. 
Alternatively, institutions might implement the ATS for a year (or more), withdraw it for a year (or more), and then reimplement it for a year (or more). If outcomes are better for the patients who were treated in the years when the algorithm was in place, and that benefit disappeared in the years when the algorithm was withdrawn, that would serve as circumstantial evidence of the ATS's value. One logistical difficulty with such an approach is that a patient should always be treated according to his or her initial treatment strategy, such that if treatment began in an “ATS year” and continued into a “withdrawn year,” the ATS should continue to be followed for that patient. That is, the ATS needs to be available to “study” patients at all times.
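The figure of approximately 4,500 patients quoted above is consistent with the standard two-proportion sample-size formula applied to the 41.97% versus 37.83% contrast. A stdlib-only check (the formula is the usual normal-approximation one; the proportions are taken from the text):

```python
from math import ceil
from statistics import NormalDist

p1, p2 = 0.4197, 0.3783          # 2y-DFS under the ATS vs. the "worst" fixed policy
alpha, power = 0.05, 0.80

z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96, two-sided type I error
z_b = NormalDist().inv_cdf(power)           # ~0.84

# Per-arm sample size for a two-sided two-proportion comparison
n_per_arm = ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
                 / (p1 - p2) ** 2)
print(2 * n_per_arm)   # total enrollment, on the order of 4,400 patients
```

The total lands near the approximately 4,500 patients cited, with the small difference attributable to rounding and to the exact variance formulation used.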

The originality of this work lies in having used registry data in combination with a highly innovative analytical approach. In reality, clinicians engage in sequential treatment decision-making over time based on accruing observations about the individual patient; providing data-driven evidence to support this form of decision-making is critical. Like the CIBMTR data, clinical practice encompasses far more diverse patient characteristics than are found in the typically homogeneous patient populations on which RCTs generally focus. Finally, in clinical practice, there are usually many possible treatments at each decision point, but RCTs usually limit these possibilities sharply and thus fail to capture the complexity of real-life clinical decisions. Q-learning and similar approaches can be harnessed to make decisions in the face of uncertainty more rationally by making the relationship between a combination of patient-specific characteristics and the outcome explicit and formalizing the decision-making process in an evidence-based manner.

Supplementary Material

Web Material

ACKNOWLEDGMENTS

Author affiliations: Department of Epidemiology, Biostatistics and Occupational Health, Faculty of Medicine, McGill University, Montreal, Quebec, Canada (Elizabeth F. Krakow, Erica E.M. Moodie); Division of Hematology and Oncology, Department of Medicine, Hôpital Maisonneuve-Rosemont Research Center, University of Montreal, Montreal, Quebec, Canada (Elizabeth F. Krakow, Silvy Lachance); Center for International Blood and Marrow Transplant Research, Department of Medicine, Medical College of Wisconsin, Milwaukee, Wisconsin (Michael Hemmer, Tao Wang, Brent Logan); Division of Biostatistics, Institute for Health and Society, Medical College of Wisconsin, Milwaukee, Wisconsin (Tao Wang, Brent Logan); Division of Hematology, Oncology, and Transplantation, Department of Medicine, University of Minnesota Medical Center, Minneapolis, Minnesota (Mukta Arora); Center for International Blood and Marrow Transplant Research, National Marrow Donor Program/Be the Match, Minneapolis, Minnesota (Stephen Spellman); Division of Hematology/BMT, Department of Internal Medicine, University of Utah Health Care and Huntsman Cancer Center, University of Utah, Salt Lake City, Utah (Daniel Couriel); Division of Cancer Medicine, Department of Stem Cell Transplantation, University of Texas M.D. Anderson Cancer Center, Houston, Texas (Amin Alousi); Department of Blood and Marrow Transplantation, Moffitt Cancer Center, Tampa, Florida (Joseph Pidala); and US Department of Defense, Fort Meade, Maryland (Michael Last).

E.F.K. was supported by the Cole Foundation. E.E.M.M. was supported by the Fonds de recherche du Québec–Santé.

The Center for International Blood and Marrow Transplant Research is supported primarily by public health service grant/cooperative agreement 5U24CA076518 from the National Cancer Institute (NCI), the National Heart, Lung and Blood Institute (NHLBI), and the National Institute of Allergy and Infectious Diseases; grant/cooperative agreement 4U10HL069294 from the NHLBI and NCI; contract HHSH250201200016C with the Health Resources and Services Administration (HRSA/DHHS); grants N00014-17-1-2388 and N0014-17-1-2850 from the Office of Naval Research; and grants from the following: Actinium Pharmaceuticals, Inc.; Amgen, Inc.; Amneal Biosciences; Angiocrine Bioscience, Inc.; anonymous donation to the Medical College of Wisconsin; Astellas Pharma US; Atara Biotherapeutics, Inc.; Be the Match Foundation; bluebird bio, Inc.; Bristol-Myers Squibb Oncology; Celgene Corporation; Cerus Corporation; Chimerix, Inc.; Fred Hutchinson Cancer Research Center; Gamida Cell, Ltd.; Gilead Sciences, Inc.; HistoGenetics, Inc.; Immucor; Incyte Corporation; Janssen Scientific Affairs, LLC; Jazz Pharmaceuticals, Inc.; Juno Therapeutics; Karyopharm Therapeutics, Inc.; Kite Pharma, Inc.; Medac, GmbH; MedImmune; The Medical College of Wisconsin; Mediware; Merck & Co., Inc.; Mesoblast; MesoScale Diagnostics, Inc.; Millennium, the Takeda Oncology Co.; Miltenyi Biotec, Inc.; National Marrow Donor Program; Neovii Biotech NA, Inc.; Novartis Pharmaceuticals Corporation; Otsuka Pharmaceutical Co., Ltd. (Japan); PCORI; Pfizer, Inc.; Pharmacyclics, LLC; PIRCHE AG; Sanofi Genzyme; Seattle Genetics; Shire; Spectrum Pharmaceuticals, Inc.; St. Baldrick's Foundation; Sunesis Pharmaceuticals, Inc.; Swedish Orphan Biovitrum, Inc.; Takeda Oncology; Telomere Diagnostics, Inc.; and the University of Minnesota.

We thank the Audiovisual Service of Hôpital Maisonneuve-Rosemont and Helen Crawford of the Fred Hutchinson Cancer Research Center for assistance with figure preparation.

The views expressed in this article do not reflect the official policy or position of the National Institutes of Health, the Department of the Navy, the Department of Defense, HRSA, or any other agency of the US Government.

Conflict of interest: none declared.

Appendix Table 1.

Steps for Transparently Developing an Adaptive Treatment Strategy With Q-Learning Using Nonrandomized Data

Steps
  1. Preanalysis steps
    a. Establish clear inclusion/exclusion criteria. If a training and validation cohort will be used, set aside the validation cohort subjects.
    b. Determine the choice and form of the model(s) and of the variables to be entered into the model.a
    c. Determine how missing data will be handled (e.g., multiple imputation by chained equations).
    d. Determine how patients who are lost to follow-up or are ineligible to “enter” the models for specific treatment decisions are handled (e.g., inverse-probability-of-censoring weights).
  2. Key steps in a Q-learning analysis
    a. Estimate the parameters $\hat{\beta}_k$ and $\hat{\psi}_k$ of the conditional mean model for the outcome, $f(E[Y \mid H_{ki}, A_{ki}]) = Q_k(H_{ki}, A_{ki}; \beta_k, \psi_k)$, at the last decision point (j = k).
    b. Estimate the optimal interval-k rule by substitution, so that $\hat{a}_k^{\mathrm{opt}}(h_k) = \arg\max_{a_k} Q_k(h_k, a_k; \hat{\beta}_k, \hat{\psi}_k)$.
    c. Predict the “pseudo-outcomes” of interval j = k − 1 by assuming that each patient i will receive optimal treatment for interval j = k.
    d. Estimate the interval-(k − 1) parameters $\hat{\beta}_{k-1}$ and $\hat{\psi}_{k-1}$ by regressing the pseudo-outcomes from step “c” on the interval (k − 1) treatments, potential tailoring variables, and confounding variables.
    e. Estimate the optimal interval-(k − 1) rule by substitution, so that $\hat{a}_{k-1}^{\mathrm{opt}}(h_{k-1}) = \arg\max_{a_{k-1}} Q_{k-1}(h_{k-1}, a_{k-1}; \hat{\beta}_{k-1}, \hat{\psi}_{k-1})$.
    f. Continue working backwards as in steps “c”–“e” until the rule for interval j = 1 has been specified. For example, the next step would be to predict the “pseudo-outcomes” of interval j = k − 2 by assuming that each patient i will receive optimal treatment for intervals j = {k, k − 1}.
  3. Validation and sensitivity
    a. Assess model fit within the current data set (e.g., by comparing predicted outcomes with observed outcomes, by cross-validation, or by using independent validation data).
    b. Assess the stability of recommendations (e.g., describe the stability of personalized recommendations across sets of patients who share similar salient covariates or, if multiple imputations were used, across imputations).
    c. Evaluate whether the study is adequately powered.
  4. Clinical translation
    a. Assess whether the projected magnitude of benefit (and harms or costs) from implementing the ATS at the population level, and for subgroups within the population, would justify moving forward with its implementation. Consider the incremental benefit over current approaches.
    b. If the ATS seems robust as judged by step 3 and worthwhile as judged by step 4a, test it prospectively in a SMART or RCT.b

Abbreviations: ATS, adaptive treatment strategy; RCT, randomized controlled trial; SMART, sequential multiple assignment randomized trial.

a In basic science, we feel more confident when the same result is reached through multiple methods—for example, if the importance of a cellular pathway to a particular disease is manifested by overexpression of its component genes as well as its component proteins. Similarly, if the same treatment recommendations are reached through different models (or even different machine learning methods), we might feel more confident about the results.

b If a SMART or RCT is not feasible, an ecological study could be conducted. By ecological study, we mean that results from institutions that implemented the proposed ATS could be compared with those from institutions that did not. (See main text for elaboration.) Alternatively, an institution might choose to implement the ATS for a year (or more), withdraw it for a year (or more), and then reimplement it for a year (or more). If outcomes were better for the patients who were treated in the years when the algorithm was in place, and that benefit disappeared in the year the algorithm was withdrawn, that finding would serve as circumstantial evidence that the ATS was of value.
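The backward recursion in part 2 of Appendix Table 1 can be sketched in a few lines for two decision points with binary treatments and linear Q-functions. The simulation below is entirely invented (it omits the study's censoring weights and imputation; variable names and data-generating values are hypothetical), but the steps map directly onto steps “a”–“e” of the table:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated histories and binary treatments at two decision points
h1 = rng.normal(size=n)                      # interval-1 tailoring variable
a1 = rng.integers(0, 2, size=n)              # interval-1 treatment
h2 = rng.normal(size=n) + 0.5 * a1           # interval-2 covariate (depends on a1)
a2 = rng.integers(0, 2, size=n)              # interval-2 treatment

# Invented outcome: treatment helps only when the tailoring variable is positive
y = h1 + a1 * h1 + a2 * h2 + rng.normal(scale=0.5, size=n)

def ols(X, y):
    """Least-squares coefficients for the Q-function regressions."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step a: regress Y on interval-2 history, treatment, and their interaction
X2 = np.column_stack([np.ones(n), h2, a2, a2 * h2])
beta2 = ols(X2, y)

# Step b: optimal interval-2 rule: treat iff the estimated treatment "blip" > 0
blip2 = beta2[2] + beta2[3] * h2
a2_opt = (blip2 > 0).astype(int)

# Step c: pseudo-outcomes, assuming optimal interval-2 treatment for everyone
y_pseudo = y + (a2_opt - a2) * blip2

# Step d: regress pseudo-outcomes on interval-1 history and treatment
X1 = np.column_stack([np.ones(n), h1, a1, a1 * h1])
beta1 = ols(X1, y_pseudo)

# Step e: optimal interval-1 rule by the same substitution
blip1 = beta1[2] + beta1[3] * h1
a1_opt = (blip1 > 0).astype(int)

print("interval-2 rule: treat iff h2 >", -beta2[2] / beta2[3])
print("interval-1 rule: treat iff h1 >", -beta1[2] / beta1[3])
```

With more decision points, steps “c”–“e” simply repeat backwards, exactly as step “f” of the table describes.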

REFERENCES

  • 1. Kidwell KM. SMART designs in cancer research: past, present, and future. Clin Trials. 2014;11(4):445–456. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Krakow EF, Moodie EE. Tools for the precision medicine era: how to develop personalized treatment strategies using SMARTs In: El Naqa I, ed. A Guide to Outcome Modeling in Radiotherapy and Oncology: Listening to the Data. Abingdon, United Kingdom: Taylor & Francis. In press. [Google Scholar]
  • 3. Moodie EE, Chakraborty B, Kramer MS. Q-learning for estimating optimal dynamic treatment rules from observational data. Can J Stat. 2012;40(4):629–645. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Horowitz M. The role of registries in facilitating clinical research in BMT: examples from the Center for International Blood and Marrow Transplant Research. Bone Marrow Transplant. 2008;42(suppl 1):S1–S2. [DOI] [PubMed] [Google Scholar]
  • 5. Bertsekas DP, Tsitsiklis JN. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific; 1996. [Google Scholar]
  • 6. Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Stat Med. 2000;19(8):1011–1028. [DOI] [PubMed] [Google Scholar]
  • 7. Bellman R. Dynamic Programming. Mineola, NY: Dover Publications; 2003. [Google Scholar]
  • 8. Thall PF, Sung HG, Estey EH. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. J Am Stat Assoc. 2002;97(457):29–39. [Google Scholar]
  • 9. Zajonc T. Bayesian inference for dynamic treatment regimes: mobility, equity, and efficiency in student tracking. J Am Stat Assoc. 2012;107(497):80–92. [Google Scholar]
  • 10. Robins J. Optimal structural nested models for optimal sequential decisions In: Lin DY, Heagerty PJ, eds. Proceedings of the Second Seattle Symposium on Biostatistics. New York, NY: Springer Publishing Company; 2004. [Google Scholar]
  • 11. Murphy SA. Optimal dynamic treatment regimes [with discussion]. J R Stat Soc Series B Stat Methodol. 2003;65(2):331–366. [Google Scholar]
  • 12. Moodie EE, Dean N, Sun YR. Q-learning: flexible learning about useful utilities. Stat Biosci. 2014;6(2):223–243. [Google Scholar]
  • 13. Moodie EE, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63(2):447–455. [DOI] [PubMed] [Google Scholar]
  • 14. Gunter L, Zhu J, Murphy SA. Variable selection for qualitative interactions. Stat Methodol. 2011;1(8):42–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008. [Google Scholar]
  • 16. Chakrabarti S, Mackinnon S, Chopra R, et al. High incidence of cytomegalovirus infection after nonmyeloablative stem cell transplantation: potential role of Campath-1H in delaying immune reconstitution. Blood. 2002;99(12):4357–4363. [DOI] [PubMed] [Google Scholar]
  • 17. Chakrabarti S, Mautner V, Osman H, et al. Adenovirus infections following allogeneic stem cell transplantation: incidence and outcome in relation to graft manipulation, immunosuppression, and immune recovery. Blood. 2002;100(5):1619–1627. [DOI] [PubMed] [Google Scholar]
  • 18. Brunstein CG, Weisdorf DJ, DeFor T, et al. Marked increased risk of Epstein-Barr virus-related complications with the addition of antithymocyte globulin to a nonmyeloablative conditioning prior to unrelated umbilical cord blood transplantation. Blood. 2006;108(8):2874–2880. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Siddiqi T, Blaise D. Does antithymocyte globulin have a place in reduced-intensity conditioning for allogeneic hematopoietic stem cell transplantation. Hematology Am Soc Hematol Educ Program. 2012;2012(1):246–250. [DOI] [PubMed] [Google Scholar]
  • 20. Soiffer RJ, Lerademacher J, Ho V, et al. Impact of immune modulation with anti-T-cell antibodies on the outcome of reduced-intensity allogeneic hematopoietic stem cell transplantation for hematologic malignancies. Blood. 2011;117(25):6963–6970. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Tauro S, Craddock C, Peggs K, et al. Allogeneic stem-cell transplantation using a reduced-intensity conditioning regimen has the capacity to produce durable remissions and long-term disease-free survival in patients with high-risk acute myeloid leukemia and myelodysplasia. J Clin Oncol. 2005;23(36):9387–9393. [DOI] [PubMed] [Google Scholar]
  • 22. Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat Methods Med Res. 2010;19(3):317–343. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Murphy SA. A generalization error for Q-learning. J Mach Learn Res. 2005;6:1073–1097. [PMC free article] [PubMed] [Google Scholar]
  • 24. Huang X, Ning J. Analysis of multi-stage treatments for recurrent diseases. Stat Med. 2012;31(24):2805–2821. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Arjas E, Saarela O. Optimal dynamic regimes: presenting a case for predictive inference. Int J Biostat. 2010;6(2):Article 10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Saarela O, Arjas E, Stephens DA, Moodie EE. Predictive Bayesian inference and dynamic treatment regimes. Biom J. 2015;57(6):941–958. [DOI] [PubMed] [Google Scholar]
  • 27. Chakraborty B, Moodie EE. G-computation: parametric estimation of optimal DTRs In: Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine. New York, NY: Springer Publishing Company; 2013:101–112. [Google Scholar]
  • 28. van der Laan MJ, Petersen ML. Causal effect models for realistic individualized treatment and intention to treat rules. Int J Biostat. 2007;3(1):Article 3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Orellana L, Rotnitzky A, Robins JM. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content. Int J Biostat. 2010;6(2):Article 8. [PubMed] [Google Scholar]
  • 30. Orellana L, Rotnitzky A, Robins JM. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part II: proofs of results. Int J Biostat. 2010;6(2):Article 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Wallace MP, Moodie EE. Doubly-robust dynamic treatment regimen estimation via weighted least squares. Biometrics. 2015;71(3):636–644. [DOI] [PubMed] [Google Scholar]
  • 32. Wallace M, Moodie EE, Stephens DA. DTRreg: DTR estimation and inference via g-estimation, dynamic WOLS, and Q-learning. (R package “DTRreg,” version 1.1). https://cran.r-project.org/web/packages/DTRreg/index.html. Published March 9, 2016. Accessed July 11, 2016.
  • 33. VanderWeele TJ, Hernán MA. Causal inference under multiple versions of treatment. J Causal Inference. 2013;1(1):1–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Chakraborty B, Moodie EE. The data: observational studies and sequentially randomized trials In: Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine. New York, NY: Springer Publishing Company; 2013:9–30. [Google Scholar]
  • 35. Kidwell KM. DTRs and SMARTs: definitions, designs, and applications In: Kosorok MR, Moodie EE, eds. Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2016:10–12. [Google Scholar]
