Abstract
The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best possible healthcare for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient’s health status in real-time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, most existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show that the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes.
Keywords: Markov decision processes, Precision medicine, Reinforcement learning, Type 1 diabetes
1. Introduction
The use of mobile devices in clinical care, called mobile health (mHealth), provides an effective and scalable platform to assist patients in managing their illness (Free et al., 2013; Steinhubl et al., 2013). Advantages of mHealth interventions include real-time communication between a patient and their health-care provider as well as systems for delivering training, teaching, and social support (Kumar et al., 2013). Mobile technologies can also be used to collect rich longitudinal data to estimate optimal dynamic treatment regimes and to deliver treatment that is deeply tailored to each individual patient. We propose a new estimator of an optimal treatment regime that is suitable for use with longitudinal data collected in mHealth applications.
A dynamic treatment regime provides a framework to administer individualized treatment over time through a series of decision rules. Dynamic treatment regimes have been well-studied in the statistical and biomedical literature (Murphy, 2003; Robins, 2004; Moodie et al., 2007; Kosorok and Moodie, 2015; Chakraborty and Moodie, 2013) and furthermore, statistical considerations in mHealth have been studied by, for example, Liao et al. (2016) and Klasnja et al. (2015). Although mobile technology has been successfully utilized in clinical areas such as diabetes (Quinn et al., 2011; Maahs et al., 2012), smoking cessation (Ali et al., 2012), and obesity (Bexelius et al., 2010), mHealth poses some unique challenges that preclude direct application of existing methodologies for dynamic treatment regimes. For example, mHealth applications typically have no definite time horizon in the sense that treatment decisions are made continually throughout the life of the patient with no fixed time point for the final treatment decision; estimation of an optimal treatment strategy must utilize data collected over a much shorter time period than that over which treatment would be applied in practice; the momentary signal may be weak and may not directly measure the outcome of interest; and estimation of optimal treatment strategies must be done online as data accumulate.
This work is motivated in part by our involvement in a study of mHealth as a management tool for type 1 diabetes. Type 1 diabetes is an autoimmune disease wherein the pancreas produces insufficient levels of insulin, a hormone needed to regulate blood glucose concentration. Patients with type 1 diabetes are continually engaged in management activities including monitoring glucose levels, timing and dosing insulin injections, and regulating diet and physical activity. Increased glucose monitoring and attention to self-management facilitate more frequent treatment adjustments and have been shown to improve patient outcomes (Levine et al., 2001; Haller et al., 2004; Ziegler et al., 2011). Thus, patient outcomes have the potential to be improved by diabetes management tools which are deeply tailored to the continually evolving health status of each patient. Mobile technologies can be used to collect data on physical activity, glucose, and insulin at a fine granularity in an outpatient setting (Maahs et al., 2012). There is great potential for using these data to create comprehensive and accessible mHealth interventions for clinical use. We envision application of this work for use before the artificial pancreas (Weinzimer et al., 2008; Kowalski, 2015; Bergenstal et al., 2016) becomes widely available.
In our motivating example as well as other mHealth applications, the goal is to treat a chronic disease over the long term. However, data will typically be collected over a short time period. Because the time frame of data collection is much shorter than the time frame of application, standard methods for longitudinal data analysis, such as generalized estimating equations or mixed models, cannot be used. We assume that the data collected in the field consist of a sample from a stationary Markov process, which allows us to estimate a dynamic treatment regime that will lead to good outcomes over the long term using data collected over a much shorter time period.
The sequential decision making process can be modeled as a Markov decision process (Puterman, 2014) and the optimal treatment regime can be estimated using reinforcement learning algorithms such as Q-learning (Murphy, 2005; Zhao et al., 2009; Tang and Kosorok, 2012; Schulte et al., 2014). Ertefaie (2014) proposed a variant of greedy gradient Q-learning (GGQ) to estimate optimal dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010). In GGQ, the form of the estimated Q-function dictates the form of the estimated optimal treatment regime. Thus, one must choose between a parsimonious model for the Q-function at the risk of model misspecification or a complex Q-function that yields unintelligible treatment regimes. Furthermore, GGQ requires modeling a non-smooth function of the data, which creates complications (Laber et al., 2014; Linn et al., 2017). Applications of mHealth require methods that can both estimate a policy from a fixed sample of retrospective data (offline estimation) and estimate a policy that is updated as data accumulate (online estimation). Online estimation has been given considerable attention in the field of reinforcement learning, particularly in engineering applications (Doya, 2000; Kober and Peters, 2012), with special consideration given to algorithms that provide fast updates in situations where new data accumulate multiple times per second. In our mHealth examples, treatment decisions may be made multiple times per day or even every hour or minute; however, rapid updating of the estimated policy at the scale of engineering applications is not needed. Therefore, we are able to focus on different aspects of the problem without needing to ensure very fast estimation. We propose an alternative estimation method for infinite horizon dynamic treatment regimes that is suited to mHealth applications. Our approach, which we call V-learning, involves estimating the optimal policy among a prespecified class of policies (Zhang et al., 2012, 2013). It requires minimal assumptions about the data-generating process and permits estimating a randomized decision rule that can be implemented online as data accumulate.
In Section 2, we describe the setup and present our method for offline estimation using data from a micro-randomized trial or observational study. In Section 3, we extend our method for application to online estimation with accumulating data. Theoretical results, including consistency and asymptotic normality of the proposed estimators, are presented in Section 4. We compare the proposed method to GGQ using simulated data in Section 5. A case study using data from patients with type 1 diabetes is presented in Section 6 and we conclude with a discussion in Section 7. Proofs of technical results are in the Appendix.
2. Offline estimation from observational data
We assume that the available data comprise n independent, identically distributed trajectories (S1, A1, S2, A2, …, ST, AT, ST+1), where St denotes a summary of patient information collected up to and including time t, At denotes the treatment assigned at time t, and T denotes the (possibly random) patient follow-up time. In the motivating example of type 1 diabetes, St could contain a patient’s blood glucose, dietary intake, and physical activity in the hour leading up to time t and At could denote an indicator that an insulin injection is taken at time t. We assume that the data-generating model is a time-homogeneous Markov process so that St+1 ╨ (At−1, St−1, …, A1, S1) | (At, St) and the conditional density p(st+1|at, st) is the same for all t ≥ 1. Let Lt ∈ {0, 1} denote an indicator that the patient is still in follow-up at time t, i.e., Lt = 1 if the patient is being followed at time t and zero otherwise. We assume that Lt is contained in St so that P(Lt+1 = 1| At, St,…, A1, S1) = P(Lt+1 = 1|At, St). It is not necessary for time points to be evenly spaced or homogeneous across patients. For example, we can define the decision times as those times at which data are observed and include time since the previous observation in St. Defining the decision times in this way allows us to effectively handle intermittent missing data, and thus we can assume that Lt = 0 implies Lt+1 = 0 with probability one. Furthermore, we assume a known utility function, u, so that Ut = u(St+1, At, St) measures the ‘goodness’ of choosing treatment At in state St and subsequently transitioning to state St+1. In our motivating example, the utility at time t could be a measure of how infrequently the patient’s blood glucose concentration deviates from the optimal range over the hour preceding and following time t. The goal is to select treatments to maximize expected cumulative utility; treatment selection is formalized using a treatment regime (Schulte et al., 2014; Kosorok and Moodie, 2015) and the utility associated with any regime is defined using potential outcomes (Rubin, 1978).
Let denote the space of probability distributions over . A treatment regime in this context is a function so that, under π, a decision maker presented with state St= st at time t will select action with probability π(at; st). Define , and . The set of potential outcomes is
where is the potential state and is the potential follow-up status at time t under treatment sequence . Thus, the potential utility at time t is . For any π, define to be a sequence of independent, −valued stochastic processes indexed by dom St such that . The potential follow-up time under π is
where . The potential utility under π at time t is
where . Thus, utility is set to zero after a patient is lost to follow-up. However, in certain situations, utility may be constructed so as to take a negative value at the time point when the patient is lost to follow-up, e.g., if the patient discontinues treatment because of a negative effect associated with the intervention. Define the state-value function (Sutton and Barto, 1998), where γ ∈ (0,1) is a fixed constant that captures the trade-off between short- and long-term outcomes. For any distribution on dom S1, define the value function with respect to reference distribution as ; throughout, we assume that this reference distribution is fixed. The reference distribution can be thought of as a distribution of initial states and we estimate it from the data in the implementation in Sections 5 and 6. For a prespecified class of regimes, Π, the optimal regime, , satisfies for all π ∈ Π. The goal is to estimate using data collected from n patients, where patient i is followed for Ti time points, i = 1, …, n. Thus, Ti represents the number of treatment decisions made for patient i in the observed data; however, because the observed data are assumed to come from a time-homogeneous Markov chain, the estimated policy could be applied in a population indefinitely.
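In this notation, one standard way to write the discounted state-value function and its average under the reference distribution is (a hedged rendering, since the displayed formulas above are not reproduced exactly):

$$
V(\pi, s) = \sum_{t \ge 1} \gamma^{\,t-1}\, E\left\{ U_t^{*}(\pi) \mid S_1 = s \right\},
\qquad
V(\pi, \mathcal{R}) = \int V(\pi, s)\, d\mathcal{R}(s),
$$

where $U_t^{*}(\pi)$ denotes the potential utility at time $t$ under $\pi$ and $\mathcal{R}$ denotes the reference distribution over initial states.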
To construct an estimator of , we make a series of assumptions that connect the potential outcomes in W* with the data-generating model.
Assumption 1. Strong ignorability, At ╨ W* | St for all t.
Assumption 2. Consistency, for all t and .
Assumption 3. Positivity, there exists c0 > 0 so that P(At = at|St = st) ≥ c0 for all , and all t.
In addition, we implicitly assume that there is no interference among the experimental units. These assumptions are common in the context of estimating dynamic treatment regimes (Robins, 2004; Hernan and Robins, 2010; Schulte et al., 2014). Assumption 1 implies that there are no unmeasured confounders and assumptions 1 and 3 hold by construction in a micro-randomized trial (Klasnja et al., 2015; Liao et al., 2016).
Let μt(at; st) = P(At = at|St = st) for each t ≥ 1. In a micro-randomized trial, μt(at; st) is a known randomization probability; in an observational study, it must be estimated from the data. The following lemma characterizes the state-value function of an arbitrary regime, π, in terms of the data-generating model (see also Lemma 4.1 of Murphy et al., 2001). A proof is provided in the appendix.
Lemma 2.1. Let π denote an arbitrary regime and γ ∈ (0, 1) a discount factor. Then, under assumptions 1–3 and provided interchange of the sum and integration is justified, the state-value function of π at st is
(1)
The preceding result will form the basis for an estimating equation for the state-value function. Write the right-hand side of (1) as
from which it follows that
Subsequently, for any function ψ defined on dom St, the state-value function satisfies
(2)
which is an importance-weighted variant of the well-known Bellman optimality equation (Sutton and Barto, 1998).
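In the notation above, equation (2) can be written as the importance-weighted temporal-difference condition (a hedged reconstruction from the surrounding derivation):

$$
E\!\left[ \frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\,
\Big\{ U_t + \gamma\, V(\pi, S_{t+1}) - V(\pi, S_t) \Big\}\, \psi(S_t) \right] = 0,
$$

which holds for any function $\psi$ on dom $S_t$; when the data-generating policy coincides with $\pi$, the importance weight equals one and the display reduces to a standard Bellman-type evaluation equation.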
Let V(π, s; θπ) denote a model for V(π, s) indexed by . We assume that the map θπ ⟼ V(π, s;θπ) is differentiable everywhere for each fixed s and π. Let denote the gradient of V(π, s; θπ) and define
(3)
Given a positive definite matrix and penalty function , define , where λn is a tuning parameter. Subsequently, is the estimated state-value function under π in state s. Thus, given a reference distribution, , the estimated value of a regime, π, is and the estimated optimal regime is . The idea of V-learning is to use estimating equation (3) to estimate the value of any policy and maximize estimated value over a class of policies; we will discuss strategies for this maximization below.
V-learning requires a parametric class of policies. Assuming that there are K possible treatments, a1, … , aK, we can define a parametric class of policies as follows. Define π(aj; s, β) = exp(s⊺βj)/{1 + exp(s⊺β1) + ⋯ + exp(s⊺βK−1)} for j = 1, … , K − 1, and π(aK; s, β) = 1/{1 + exp(s⊺β1) + ⋯ + exp(s⊺βK−1)}. This defines a class of randomized policies, Π, parametrized by β = (β1⊺, … , βK−1⊺)⊺, where βk is a vector of parameters for the k-th treatment. Under a policy in this class defined by β, actions are selected stochastically according to the probabilities π(aj; s, β), j = 1, … , K. In the case of a binary treatment, a policy in this class reduces to π(1; s, β) = exp(s⊺β)/ {1 + exp(s⊺β)} and π(0; s, β) = 1/ {1 + exp(s⊺β)} for a p × 1 vector β. This class of policies is used in the implementation in Sections 5 and 6.
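To make the construction concrete, a minimal sketch of this policy class in R is given below; function and object names are illustrative, not taken from the authors’ implementation.

```r
# Multinomial logit ("softmax") policy class over K treatments, with the K-th
# treatment as the reference category; beta stacks the K - 1 coefficient vectors.
policy_probs <- function(s, beta_matrix) {
  # s: state vector of length p; beta_matrix: p x (K - 1) coefficient matrix
  linpred <- c(crossprod(beta_matrix, s), 0)  # linear predictors; reference = 0
  exp(linpred) / sum(exp(linpred))            # pi(a_1; s, beta), ..., pi(a_K; s, beta)
}

# Binary-treatment special case: reduces to the logistic form given above
s <- c(1, 0.5, -0.2)                     # hypothetical state (first entry an intercept)
beta <- matrix(c(0.3, -1.0, 2.0), ncol = 1)
policy_probs(s, beta)                    # returns c(P(A = 1), P(A = 0))
```

Drawing an action under a policy in this class then amounts to sampling from these probabilities, e.g., `sample(actions, 1, prob = policy_probs(s, beta))`.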
V-learning also requires a class of models for the state value function indexed by a parameter, θπ. We use a basis function approximation (Hastie et al., 2009; Long et al., 2010). Let Φ = (ϕ1, … , ϕq)⊺ be a vector of prespecified basis functions and let V(π, s; θπ) = Φ(s)⊺θπ. Under this working model,
(4)
Computational efficiency is gained from the linearity of V(π, s; θπ) in θπ; flexibility can be achieved through the choice of Φ. We recommend Gaussian basis functions as they offer the greatest flexibility of the three classes we consider. However, along with Gaussian basis functions, we also examine the performance of V-learning using linear and polynomial basis functions in Sections 5 and 6 as these offer reasonable alternatives.
The algorithm for V-learning is given in Algorithm 1 below. The algorithm can be terminated when the change between successive iterates falls below some small threshold. The update in step 6 can be achieved with a variety of existing optimization methods. In our implementation, we use the BFGS algorithm (Dai, 2002) as implemented in the optim function in R software (R Core Team, 2016). Because the objective function is not necessarily convex, care must be taken when selecting the starting point, β1 (step 2). In our implementation, we use simulated annealing as implemented in the optim function in R to find an appropriate starting point.
Algorithm 1: V-learning.

1. Initialize a class of policies, Π, and a model, V(π, s; θπ);
2. Set k = 1 and initialize β1 to a starting value in the parameter space;
3. while not converged do
4.   Estimate θ̂ under the policy indexed by βk;
5.   Evaluate the estimated value of the policy indexed by βk;
6.   Set βk+1 = βk + αk∇k for some step size, αk, where ∇k is the gradient of the estimated value with respect to β;
7. end
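To illustrate steps 4 and 5, the sketch below computes θ̂ and the estimated value for a fixed policy under the linear working model V(π, s; θπ) = Φ(s)⊺θπ, taking Ω to be the identity and using an L2 penalty. It assumes that the elided estimating equation (3) is the empirical analogue of (2) with ψ = Φ, so that Λn(π, θ) is affine in θ and the penalized minimizer has a ridge-type closed form; all names are illustrative rather than the authors’ code.

```r
# One policy-evaluation step of V-learning (hedged sketch; see assumptions above).
estimate_value <- function(Phi_t, Phi_t1, U, pi_probs, mu_probs, Phi_ref,
                           gamma = 0.9, lambda = 0.01) {
  # Phi_t, Phi_t1 : N x q basis matrices at times t and t + 1 (rows = person-times)
  # U             : utilities U_t; pi_probs / mu_probs : importance weights
  # Phi_ref       : basis matrix for draws from the reference distribution
  w <- pi_probs / mu_probs
  N <- nrow(Phi_t)
  A <- crossprod(Phi_t * w, Phi_t - gamma * Phi_t1) / N  # so Lambda_n(theta) = b - A theta
  b <- colSums(Phi_t * (w * U)) / N
  theta <- solve(crossprod(A) + lambda * diag(ncol(Phi_t)), crossprod(A, b))
  list(theta = theta,                                    # estimated theta for this policy
       value = mean(Phi_ref %*% theta))                  # estimated value of the policy
}
```

The outer loop (step 6) would then adjust β in the direction that increases this value or, as in Section 5, maximize it over β using simulated annealing followed by BFGS via optim.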
2.1. Greedy gradient Q-learning
Here we briefly discuss an existing method for infinite horizon dynamic treatment regimes, which will be used for comparison in the simulation studies in Section 5.
Ertefaie (2014) introduced greedy gradient Q-learning (GGQ) for estimating dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010; Murphy et al., 2016).
Define . The Bellman optimality equation (Sutton and Barto, 1998) is
(5)
Let Q(s, a; ηopt) be a parametric model for Qopt(s, a) indexed by . In our implementation, we model Q(s, a; ηopt) as a linear function with interactions between all state variables and treatment. The Bellman optimality equation motivates the estimating equation
(6)
For a positive definite matrix, Ω, we estimate ηopt using . The estimated optimal policy in state s selects the action maximizing the estimated Q-function. This optimization problem is non-convex and non-differentiable in ηopt. However, it can be solved with a generalization of the greedy gradient Q-learning algorithm of Maei et al. (2010), and hence is referred to as GGQ by Ertefaie (2014) and in the following.
The performance of GGQ has been demonstrated in the context of chronic diseases with large sample sizes and a moderate number of time points. However, in mHealth applications, it is common to have small sample sizes and a large number of time points, with decisions occurring at a fine granularity. In GGQ, the estimated policy depends directly on the form of the estimated Q-function and, therefore, depends on modeling the transition probabilities of the data-generating process. Furthermore, estimating equation (6) contains a non-smooth max operator, which makes estimation difficult without large amounts of data (Laber et al., 2014; Linn et al., 2017). V-learning requires models only for the policy and the value function, rather than for the data-generating process, and it directly maximizes the estimated value over a class of policies, thereby avoiding the non-smooth max operator in the estimating equation (compare equations (3) and (6)); these attributes may prove advantageous in mHealth settings.
3. Online estimation from accumulating data
Suppose we have accumulating data , where and represent the state and action for patient i = 1, … , n at time t ≥ 1. At each time t, we estimate an optimal policy in a class, Π, using data collected up to time t, take actions according to the estimated optimal policy, and estimate a new policy using the resulting states. Let be the estimated policy at time t, i.e., is estimated after observing state St+1 and before taking action At+1. If Π is a class of randomized policies, we can select an action for a patient presenting with St+1 = st+1 according to , i.e., we draw At+1 according to the distribution . If a class of deterministic policies is of interest, we can inject some randomness into to facilitate exploration. One way to do this is an ϵ-greedy strategy (Sutton and Barto, 1998), which selects the estimated optimal action with probability 1 – ϵ and otherwise samples equally from all other actions. Because an ϵ-greedy strategy can be used to introduce randomness into a deterministic policy, we can assume a class of randomized policies.
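For example, a minimal ϵ-greedy step for a deterministic recommendation might look as follows (illustrative names and values):

```r
# Select the estimated optimal action with probability 1 - epsilon; otherwise
# sample uniformly from the remaining actions to encourage exploration.
epsilon_greedy <- function(recommended, actions, epsilon = 0.1) {
  others <- setdiff(actions, recommended)
  if (runif(1) < epsilon) others[sample.int(length(others), 1)] else recommended
}

epsilon_greedy(recommended = "insulin", actions = c("insulin", "no action"), epsilon = 0.1)
```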
At each time t ≥ 1, let where Ω, λn, and are as defined in Section 2 and
(7)
with some initial randomized policy. We note that estimating equation (7) is similar to (3), except that the previously estimated policies replace μt as the data-generating policy. Given the estimator of the value of π at time t, , the estimated optimal policy at time t is . In practice, we may choose to update the policy in batches rather than at every time point. An alternative way to encourage exploration through the action space is to choose for some sequence αt ≥ 0, where is a measure of uncertainty in . An example of this is upper confidence bound sampling, or UCB (Lai and Robbins, 1985).
In some settings, when the data-generating process may vary across patients, it may be desirable to allow each patient to follow an individualized policy that is estimated using only that patient’s data. Suppose that n patients are followed for an initial T1 time points after which the policy is estimated. Then, suppose that patient i follows this initial policy until time T2, when a policy is estimated using only the states and actions observed for patient i. This procedure is then carried out until time TK for some fixed K with each patient following their own individual policy which is adapted to match the individual over time. We may also choose to adapt the randomness of the policy at each estimation. For example, we could select ϵ1 > ϵ2 > … > ϵK and, following estimation k, have patient i follow their own individualized policy with probability 1 – ϵk and the initial policy with probability ϵk. In this way, patients become more likely to follow their own individualized policy and less likely to follow the initial policy over time, reflecting increasing confidence in the individualized policy as more data become available. The same class of policies and model for the state value function can be used as in Section 2.
4. Theoretical results
In this section, we establish asymptotic properties of and for offline estimation. Because the proposed online estimation procedure involves performing offline estimation repeatedly in smaller batches, our theoretical results apply to each smaller batch as the number of observations increases. Developing more general theory for online estimation is an interesting topic for future research. Throughout, we assume assumptions 1–3 from Section 2.
Let . Thus, we consider the special case where the penalty function is the squared Euclidean norm of θπ. We will assume that λn = oP(n−1/2). All of our results hold for any positive definite matrix, Ω. Assume the working model for the state value function introduced in Section 2, i.e., . For fixed π, denote the true , i.e., . Let so that . Define , where denotes the empirical measure of the observed data. Let be a parametric class of policies and let where .
Our main results are summarized in Theorems 4.2 and 4.3 below. Because each patient trajectory is a stationary Markov chain, we need to use asymptotic theory based on stationary processes; consequently, some of the required technical conditions are more difficult to verify than those for i.i.d. data. Define the bracketing integral for a class of functions, , by where the bracketing number for , , is the number of Lr(P) ϵ-brackets needed such that each element of is contained in at least one bracket (see Chapter 2 of Kosorok, 2008). For any stationary sequence of possibly dependent random variables, {Xt}t≥1, let be the σ-field generated by Xb, …, Xc and define . We say that the chain {Xt}t≥1 is absolutely regular if ζ(k) → 0 as k → ∞ (also called β-mixing in Chapter 11 of Kosorok, 2008). We make the following assumptions.
Assumption 4. There exists a 2 < ρ < ∞ such that
1. , and .
2. The sequence {(St, At)}t≥1 is absolutely regular with .
3. The bracketing integral of the class of policies satisfies J[]{∞, Π, L3ρ(P)} < ∞.
Assumption 5. There exists some c1 > 0 such that
for all
Assumption 6. The map has a unique and well separated maximum over β in the interior of ; let β0 denote the maximizer.
Assumption 7. The following condition holds: as δ ↓0.
Remark 4.1. Assumption 4 requires certain finite moments and that the dependence between observations on the same patient vanishes as observations become further apart. In Lemma 8.2 in the appendix, we verify part 3 of assumption 4 and assumption 7 for the class of policies introduced in Section 2. However, note that the theory holds for any class of policies satisfying the given assumptions, not just the class considered here. Assumption 5 is needed to show the existence of a unique uniformly over Π and can be verified empirically by checking that certain data-dependent matrices are invertible. Assumption 6 requires that the true optimal decision in each state is unique (see assumption A.8 of Ertefaie, 2014) and is a standard assumption in M-estimation (see Chapter 14 of Kosorok, 2008). Assumption 7 requires smoothness on the class of policies.
The main results of this section are stated below. Theorem 4.2 states that there exists a unique solution to uniformly over Π and that the estimator converges weakly to a mean zero Gaussian process in .
Theorem 4.2. Under the given assumptions, the following hold.
For all π ∈ Π, there exists such that has a zero at . Moreover, and .
- Let be a tight, mean zero Gaussian process indexed by Π with covariance where
and
Then, in . Let be as defined in part 2. Then, in .
Theorem 4.3 below gives us that the estimated optimal policy converges in probability to the true optimal policy over Π and that the estimated value of the estimated optimal policy converges to the true value of the estimated optimal policy.
Theorem 4.3. Under the given assumptions, the following hold.
Let and . We have that .
Let and β0 be defined as in part 1. Then, .
Let Then, .
- A consistent estimator for is
where
and
Proofs of the above results are in the Appendix along with a result on bracketing entropy that is needed for the proof of Theorem 4.2 and a proof that the class of policies introduced above satisfies the necessary bracketing integral assumption.
5. Simulation experiments
In this section, we examine the performance of V-learning on simulated data. Section 5.1 contains results for offline estimation and Section 5.2 contains results for online estimation. All simulation results are averaged across 50 replications.
5.1. Offline simulations
Our implementation of V-learning follows the setup in Section 2. Maximizing the estimated value over the class of policies is done using a combination of simulated annealing and the BFGS algorithm as implemented in the optim function in R software (R Core Team, 2016). We note that the estimated value is differentiable in π, thereby avoiding some of the computational complexity of GGQ. However, the objective is not necessarily convex. In order to avoid local maxima, simulated annealing with 1000 function evaluations is used to find a neighborhood of the maximum; this solution is then used as the starting value for the BFGS algorithm.
We use the class of policies introduced in Section 2. Although we maximize the value over a class of randomized policies, the true optimal policy is deterministic. To prevent the coefficients of the estimated policy from diverging to infinity, we add an L2 penalty when maximizing over β. To prevent overfitting, we use an L2 penalty when computing , i.e., . We let Ω be the identity matrix. Simulation results with alternate choices of the penalty function and Ω are given in the appendix. Tuning parameters can be used to control the amount of randomness in the estimated policy. For example, increasing the penalty used when maximizing over β is one way to encourage exploration through the action space because β = 0 defines a policy where each action is selected with equal probability.
We consider three different models for the state-value function: (i) linear; (ii) second degree polynomial; and (iii) Gaussian radial basis functions (RBF). The Gaussian RBF is . We use τ = 0.25 and κ = 0, 0.25, 0.5, 0.75, 1 to create a basis of functions and apply this basis to the state variables after scaling them to be between 0 and 1. Each model also implicitly contains an intercept.
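The kernel expression itself is not reproduced above; the sketch below assumes the common Gaussian form exp{−(x − κ)²/(2τ²)}, applied to each state variable after rescaling it to [0, 1] with centers κ ∈ {0, 0.25, 0.5, 0.75, 1} and τ = 0.25. This is illustrative code, not the authors’ implementation.

```r
# Gaussian radial basis expansion of a single state variable scaled to [0, 1].
rbf_basis <- function(x, centers = c(0, 0.25, 0.5, 0.75, 1), tau = 0.25) {
  sapply(centers, function(kappa) exp(-(x - kappa)^2 / (2 * tau^2)))
}
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Example: build Phi(s) for a two-dimensional state (intercept added explicitly here)
S <- cbind(s1 = rnorm(100), s2 = rnorm(100))     # hypothetical states
Phi <- cbind(1, rbf_basis(scale01(S[, 1])), rbf_basis(scale01(S[, 2])))
dim(Phi)   # 100 rows, 1 + 2 * 5 = 11 columns
```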
We begin with the following simple generative model. Let the two-dimensional state vector be , i = 1, … , n, t = 1, … , T. We initiate the state variables as independent standard normal random variables and let them evolve according to and , where the treatment indicator takes values in {0, 1} and the error terms are independent N(0, 1/4) random variables. Define the utility function by . At each time t, we must make a decision to treat or not, with the goal of maximizing the components of S while treating as few times as possible. Treatment has a positive effect on S1 and a negative effect on S2. We generate treatments from a Bernoulli distribution with mean 1/2. In estimation, we assume that the generating model for treatment is known, as would be the case in a micro-randomized trial.
We generate samples of n patients with T time points per patient from the given generative model after an initial burn-in period of 50 time points. The burn-in period ensures that our simulated data are sampled from an approximately stationary distribution. We estimate policies using V-learning with three different types of basis functions and GGQ. After estimating optimal policies, we simulate 100 patients following each estimated policy for 100 time points and take the mean utility under each policy as an estimate of the value of that policy. Estimated values are found in Table 1 with Monte Carlo standard errors along with the observed value. Recall that larger values are better. The policies estimated using V-learning produce better outcomes than the observational policy and the policy estimated using GGQ. V-learning produces the best outcomes using Gaussian basis functions. In the Appendix, we give results for the same simulation settings given in Table 1 for alternate choices of the penalty function and Ω. Table 8 presents results with the LASSO penalty when Ω is the identity matrix, Table 9 presents results with the L2 penalty when , and Table 10 presents results with the LASSO penalty when .
Table 1:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ | Observed |
|---|---|---|---|---|---|---|
| 25 | 24 | 0.118 (0.0892) | 0.091 (0.0825) | 0.110 (0.0979) | 0.014 (0.0311) | −0.005 |
| 25 | 36 | 0.108 (0.0914) | 0.115 (0.0911) | 0.112 (0.0919) | 0.029 (0.0280) | −0.004 |
| 25 | 48 | 0.106 (0.0705) | 0.071 (0.0974) | 0.103 (0.0757) | 0.031 (0.0350) | 0.000 |
| 50 | 24 | 0.124 (0.0813) | 0.109 (0.1045) | 0.118 (0.0879) | 0.016 (0.0355) | −0.005 |
| 50 | 36 | 0.126 (0.0818) | 0.134 (0.0878) | 0.136 (0.0704) | 0.027 (0.0276) | 0.003 |
| 50 | 48 | 0.101 (0.0732) | 0.109 (0.0767) | 0.115 (0.0763) | 0.020 (0.0245) | 0.000 |
| 100 | 24 | 0.117 (0.0895) | 0.135 (0.0973) | 0.140 (0.0866) | 0.019 (0.0257) | 0.011 |
| 100 | 36 | 0.113 (0.0853) | 0.105 (0.1033) | 0.139 (0.0828) | 0.021 (0.0312) | 0.012 |
| 100 | 48 | 0.111 (0.0762) | 0.143 (0.0853) | 0.114 (0.0699) | 0.031 (0.0306) | −0.001 |
Next, we simulate cohorts of patients with type 1 diabetes to mimic the mHealth study of Maahs et al. (2012). Maahs et al. (2012) followed a small sample of youths with type 1 diabetes and recorded data at a fine granularity using mobile devices. Blood glucose levels were tracked in real time using continuous glucose monitoring, physical activity was measured continuously using accelerometers, and insulin injections were logged by an insulin pump. Dietary data were recorded by 24-hour recall over phone interviews.
In our simulation study, we divide each day of follow-up into 60-minute intervals. Thus, for one day of follow-up, we observe T = 24 time points per simulated patient and a treatment decision is made every hour. Our hypothetical mHealth study is designed to estimate an optimal dynamic treatment regime for the timing of insulin injections based on patient blood glucose, physical activity, and dietary intake with the goal of controlling future blood glucose as close as possible to the optimal range. To this end, we define the utility at time t as a weighted sum of hypo- and hyperglycemic episodes in the 60 minutes preceding and following time t. Weights are −3 when glucose ≤ 70 (hypoglycemic), −2 when glucose > 150 (hyperglycemic), −1 when 70 < glucose ≤ 80 or 120 < glucose ≤ 150 (borderline hypo- and hyperglycemic), and 0 when 80 < glucose ≤ 120 (normal glycemia). Utility at each time point ranges from −6 to 0 with larger utilities (closer to 0) being more preferable. For example, a patient who presents with an average blood glucose of 155 mg/dL over time interval t–1, takes an action to correct their hyperglycemia, and presents with an average blood glucose of 145 mg/dL over time interval t would receive a utility of Ut = −3. Weights were chosen to reflect the relative clinical consequences of high and low blood glucose. For example, acute hypoglycemia, characterized by blood glucose levels below 70 mg/dL, is an emergency situation that can result in coma or death.
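As a sketch, the per-interval weight implied by these thresholds can be coded as follows (our reading of the weighting scheme; the utility at time t then sums this weight over the intervals before and after t):

```r
# Glycemic weight for the average blood glucose (mg/dL) over one 60-minute interval.
glycemic_weight <- function(glucose) {
  if (glucose <= 70) -3                          # hypoglycemic
  else if (glucose > 150) -2                     # hyperglycemic
  else if (glucose <= 80 || glucose > 120) -1    # borderline hypo- or hyperglycemic
  else 0                                         # normal glycemia: (80, 120]
}

sapply(c(65, 95, 130, 160), glycemic_weight)     # -3  0 -1 -2
# Worked example from the text: intervals with averages 155 and 145 mg/dL give -2 + -1 = -3.
```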
Simulated data are generated as follows. At each time point, patients are randomly chosen to receive an insulin injection with probability 0.3, consume food with probability 0.2, partake in mild physical activity with probability 0.4, and partake in moderate physical activity with probability 0.2. Grams of food intake and counts of physical activity are generated from normal distributions with parameters estimated from the data of Maahs et al. (2012). Initial blood glucose level for each patient is drawn from a normal distribution with mean 100 and standard deviation 25. Define the covariates for patient i collected at time t by , where is average blood glucose level, is total dietary intake, and is total counts of physical activity as would be measured by an accelerometer. Glucose levels evolve according to
(8)
where Int is an indicator of an insulin injection received at time t and e ~ N(0, σ2). We use the parameter vector α = (α1, … , α7)⊺ = (0.9, 0.1, 0.1, −0.01, −0.01, −2, −4)⊺, μ = 100, and σ = 5.5 based on a linear model fit to the data of Maahs et al. (2012). The known lag-time in the effect of insulin is reflected by α6 = −2 and α7 = −4. Selecting α1 < 1 ensures the existence of a stationary distribution.
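The display for equation (8) is not reproduced above, so the sketch below encodes one plausible reading of it: glucose mean-reverts toward μ = 100 at rate α1 and responds to current and lagged diet, activity, and insulin through α2, …, α7, with N(0, σ²) noise. The functional form and the variable names are assumptions for illustration, not the authors’ exact specification.

```r
# A hedged sketch of the glucose transition model in equation (8); see the
# assumptions stated above. All argument names are illustrative.
simulate_glucose_step <- function(gl, di, di_lag, ex, ex_lag, ins, ins_lag,
                                  alpha = c(0.9, 0.1, 0.1, -0.01, -0.01, -2, -4),
                                  mu = 100, sigma = 5.5) {
  (1 - alpha[1]) * mu +                    # mean reversion toward mu = 100 (alpha_1 < 1)
    alpha[1] * gl +                        # current average blood glucose
    alpha[2] * di  + alpha[3] * di_lag +   # dietary intake (current, lagged)
    alpha[4] * ex  + alpha[5] * ex_lag +   # physical activity (current, lagged)
    alpha[6] * ins + alpha[7] * ins_lag +  # insulin indicators (current, lagged)
    rnorm(1, mean = 0, sd = sigma)         # e ~ N(0, sigma^2)
}

# Example transition for a hypothetical patient-hour
simulate_glucose_step(gl = 150, di = 20, di_lag = 0, ex = 500, ex_lag = 200,
                      ins = 1, ins_lag = 0)
```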
We define the state vector for patient i at time t to contain average blood glucose, total dietary intake, and total physical activity measured over previous time intervals; we include blood glucose and physical activity for the previous two time intervals and dietary intake for the previous four time intervals. Let n denote the number of patients and T the number of time points per patient. Our choices for n and T are based on what is feasible for an mHealth outpatient study (dietary data were collected on two days by Maahs et al., 2012). For each replication, the optimal treatment regime is estimated with V-learning using three different types of basis functions and GGQ. The generative model for insulin treatment is not assumed to be known and we estimate it using logistic regression. We record mean outcomes in an independent sample of 100 patients followed for 100 time points with treatments generated according to each estimated optimal regime. Simulation results (estimated values under each regime and Monte Carlo standard errors along with observed values) are found in Table 2. Again, V-learning with Gaussian basis functions performs the best out of all methods, generally producing large values and small standard errors. V-learning with the linear model underperforms, and GGQ underperforms still more.
Table 2:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ | Observed |
|---|---|---|---|---|---|---|
| 25 | 24 | −2.716 (1.2015) | −2.335 (0.9818) | −2.018 (1.2011) | −3.870 (0.9225) | −2.316 |
| 25 | 36 | −2.700 (1.2395) | −2.077 (1.0481) | −1.760 (0.8468) | −3.644 (0.8745) | −2.261 |
| 25 | 48 | −2.496 (1.1986) | −2.236 (1.1978) | −1.751 (0.9887) | −2.405 (1.1025) | −2.365 |
| 50 | 24 | −2.545 (1.1865) | −2.069 (1.0395) | −1.605 (0.8064) | −3.368 (1.0186) | −2.263 |
| 50 | 36 | −2.644 (1.1719) | −2.004 (0.9074) | −1.778 (0.8496) | −3.099 (0.9722) | −2.336 |
| 50 | 48 | −2.469 (1.1635) | −2.073 (0.9870) | −2.102 (1.2078) | −2.528 (0.9571) | −2.308 |
| 100 | 24 | −2.350 (1.1171) | −2.128 (1.0520) | −1.612 (0.7203) | −3.272 (0.8636) | −2.299 |
| 100 | 36 | −2.547 (1.1852) | −2.116 (0.8518) | −1.672 (0.8643) | −3.232 (0.7951) | −2.321 |
| 100 | 48 | −2.401 (1.0643) | −2.204 (1.0400) | −1.494 (0.5413) | −2.820 (0.8442) | −2.351 |
5.2. Online simulations
In practice, it may be useful for patients to follow a dynamic treatment regime that is updated as new data are collected. Here we consider a hypothetical study wherein n patients are followed for an initial period of T′ time points, an optimal policy is estimated, and patients are followed for an additional T – T′ time points with the estimated optimal policy being continuously updated. At each time point, t ≥ T′, actions are taken according to the most recently estimated policy. Recall that V-learning produces a randomized decision rule from which to sample actions at each time point. When selecting an action based on a GGQ policy, we incorporate an ϵ-greedy strategy by selecting the action recommended by the estimated policy with probability 1 – ϵ and otherwise randomly selecting one of the other actions. At the tth estimation, we use ϵ = 0.5^t, allowing ϵ to decrease over time to reflect increasing confidence in the estimated policy. A burn-in period of 50 time points is discarded to ensure that we are sampling from a stationary distribution. We estimate the first policy after 12 time points and a new policy is estimated every 6 time points thereafter. After T time points, we estimate the value as the average utility over all patients and all time points after the initial period.
Table 3 presents mean outcomes under policies estimated online using data generated according to the simple two covariate generative model introduced at the beginning of Section 5.1. There is some variability across n and T regarding which type of basis function is best, but V-learning with a polynomial basis generally produces the best outcomes. GGQ performs well in large samples.
Table 3:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ |
|---|---|---|---|---|---|
| 25 | 24 | 0.0053 | 0.0149 | −0.0100 | −0.0081 |
| 25 | 36 | 0.0525 | 0.0665 | 0.0310 | 0.0160 |
| 25 | 48 | 0.0649 | 0.0722 | 0.0416 | 0.0493 |
| 50 | 24 | 0.0164 | 0.0117 | 0.0037 | 0.0058 |
| 50 | 36 | 0.0926 | 0.0791 | 0.0666 | 0.0227 |
| 50 | 48 | 0.1014 | 0.0894 | 0.0512 | 0.0434 |
| 100 | 24 | 0.0036 | −0.0157 | 0.0200 | 0.0239 |
| 100 | 36 | 0.0766 | 0.0626 | 0.0907 | 0.0540 |
| 100 | 48 | 0.0728 | 0.0781 | 0.0608 | 0.0818 |
Next, we study the performance of online V-learning in simulated mHealth studies of type 1 diabetes by following the generative model described in (8). Mean outcomes are found in Table 4. Gaussian V-learning performs the best out of all methods. Across all variants of V-learning, outcomes improve with increased follow-up time.
Table 4:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ |
|---|---|---|---|---|---|
| 25 | 24 | −2.3887 | −1.9713 | −1.8860 | −3.2027 |
| 25 | 36 | −2.3784 | −2.1535 | −1.7857 | −3.5127 |
| 25 | 48 | −2.2190 | −2.0679 | −1.6999 | −3.2280 |
| 50 | 24 | −2.3405 | −2.2313 | −1.7761 | −2.8976 |
| 50 | 36 | −2.2829 | −2.0922 | −1.6016 | −3.1589 |
| 50 | 48 | −2.1587 | −1.9669 | −1.5948 | −2.8729 |
| 100 | 24 | −2.3229 | −2.2295 | −1.9138 | −3.0865 |
| 100 | 36 | −2.2927 | −2.1608 | −1.9030 | −3.3483 |
| 100 | 48 | −2.2096 | −2.0454 | −1.8252 | −2.9428 |
Finally, we consider online simulations using individualized policies as outlined at the end of Section 3. Consider the simple two-covariate generative model introduced above but let state variables evolve according to and where μi is a subject-specific term drawn uniformly between 0.4 and 0.9. Including μi ensures that the optimal policy differs across patients. Table 5 contains mean outcomes for online simulation where a universal policy is estimated using data from all patients and where individualized policies are estimated using only a single patient’s data. Because data are generated in such a way that the optimal policy varies across patients, individualized policies achieve better outcomes than universal policies.
Table 5:
| n | T | Universal policy | Patient-specific policy |
|---|---|---|---|
| 25 | 24 | 0.0282 | 0.1813 |
| 25 | 36 | 0.1025 | 0.1700 |
| 25 | 48 | 0.0977 | 0.1944 |
| 50 | 24 | 0.0164 | 0.2771 |
| 50 | 36 | 0.0768 | 0.2617 |
| 50 | 48 | 0.0752 | 0.3038 |
| 100 | 24 | 0.0160 | 0.4230 |
| 100 | 36 | 0.0960 | 0.2970 |
| 100 | 48 | 0.1140 | 0.3197 |
6. Case study: Type 1 diabetes
Machine learning is currently under consideration in type 1 diabetes through studies to build and test a “closed loop” system that joins continuous blood glucose monitoring and subcutaneous insulin infusion through an underlying algorithm. Known as the artificial pancreas, this technology has been shown to be safe in preliminary studies and is making headway from small hospital-based safety studies to large-scale outpatient effectiveness studies (Ly et al., 2014, 2015). Despite the success of the artificial pancreas, the rate of uptake may be limited and widespread use may not occur for many years (Kowalski, 2015). The proposed method may be useful for implementing mHealth interventions for use alongside the artificial pancreas or before it is widely available.
Studies have shown that data on food intake and physical activity to inform optimal decision making can be collected in an inpatient setting (see, e.g., Cobry et al., 2010; Wolever and Mullan, 2011). However, Maahs et al. (2012) demonstrated that rich data on the effect of food intake and physical activity can be collected in an outpatient setting using mobile technology. Here, we apply the proposed methodology to the observational data collected by Maahs et al. (2012).
The full data consist of N = 31 patients with type 1 diabetes, aged 12–18. Glucose levels were monitored using continuous glucose monitoring and physical activity tracked using accelerometers for five days. Dietary data were self-reported by the patient in telephone-based interviews for two days. Patients were treated using either an insulin pump or multiple daily insulin injections. We use data on a subset of n = 14 patients treated with an insulin pump for whom full follow-up is available on days when dietary information was recorded. This represents 28 patient-days of data, with which we use V-learning to estimate an optimal treatment policy. An advantage of mHealth is the ability to collect data passively, limiting the amount of missing data. There is no intermittent missingness in this data set.
The setup closely follows the simulation experiments in Section 5.1. Patient state at each time, t, is taken to be average glucose level and total counts of physical activity over the two previous 60-minute intervals and total food intake in grams over the four previous 60-minute intervals. The goal is to learn a policy to determine when to administer insulin injections based on prior blood glucose, dietary intake, and physical activity. The utility at time t is a weighted sum of glycemic events over the 60 minutes preceding and following time t with weights defined in Section 5.1. A treatment regime with large value will minimize the number of hypo- and hyperglycemic episodes weighted to reflect the clinical importance of each. We note that because V(π, s; θπ) is linear in θπ, we can evaluate the estimated value of a policy using only the mean of Φ(S) under the reference distribution. This mean was estimated from the data. Because we cannot simulate data following a given policy to estimate its value, we report the parametric value estimate . Interpreting the parametric value estimate is difficult because of the effect the discount factor has on estimated value. We cannot compare parametric value estimates to mean outcomes observed in the data. Instead, we use the analogous parametric estimate as an estimate of value under the observational policy.
We estimate optimal treatment strategies for two different action spaces. In the first, the only decision made at each time is whether or not to administer an insulin injection, i.e., the action space contains a single binary action. In the second, the action space contains all possible combinations of insulin injection, physical activity, and food intake. This corresponds to a hypothetical mHealth intervention where insulin injections are administered via an insulin pump and suggestions for physical activity and food intake are administered via a mobile app.
Table 6 contains parametric value estimates for policies estimated using V-learning for the two action spaces outlined above with different basis functions and discount factors. These results indicate that improvements in glycemic control can come from personalized and dynamic treatment strategies that account for food intake and physical activity. Improvement results from a dynamic insulin regimen (binary action space), and, in most cases, further improvement results from a comprehensive mHealth intervention including suggestions for diet and exercise delivered via mobile app in addition to insulin therapy (multiple action space). When considering multiple actions, the policy estimated using a polynomial basis and γ = 0.7 achieves a 64% increase in value and the policy estimated using a Gaussian basis and γ = 0.8 achieves a 68% increase in value over the observational policy. Although the small sample size is a weakness of this study, these results nonetheless represent a substantial improvement in value.
Table 6:
| Action space | Basis | γ = 0.7 | γ = 0.8 | γ = 0.9 |
|---|---|---|---|---|
| Binary | Linear | −6.20 | −9.35 | −15.99 |
| Binary | Polynomial | −3.91 | −9.03 | −17.50 |
| Binary | Gaussian | −3.44 | −13.09 | −25.52 |
| Multiple | Linear | −6.47 | −9.92 | −0.49 |
| Multiple | Polynomial | −2.44 | −6.80 | −14.48 |
| Multiple | Gaussian | −8.45 | −3.58 | −21.18 |
| Observational policy | | −6.77 | −11.28 | −21.79 |
Finally, we use an example hyperglycemic patient to illustrate how an estimated policy would be applied in practice. One patient in the data presented at a specific time with an average blood glucose of 229 mg/dL over the previous hour and an average blood glucose of 283 mg/dL over the hour before that. The policy estimated with γ = 0.7 and a polynomial basis recommends each action according to the probabilities in Table 7. Because this patient presented with blood glucose levels that are higher than the optimal range, the policy recommends actions that would lower the patient’s blood glucose levels, assigning a probability of 0.79 to insulin and a probability of 0.21 to insulin combined with activity.
Table 7:
| Action | Probability |
|---|---|
| No action | < 0.0001 |
| Physical activity | < 0.0001 |
| Food intake | < 0.0001 |
| Food and activity | < 0.0001 |
| Insulin | 0.7856 |
| Insulin and activity | 0.2143 |
| Insulin and food | 0.0002 |
| Insulin, food, and activity | < 0.0001 |
7. Conclusion
The emergence of mHealth has provided great potential for the estimation and implementation of dynamic treatment regimes. Mobile technologies can be used both in the collection of rich longitudinal data to inform decision making and in the delivery of deeply tailored interventions. The proposed method, V-learning, addresses a number of challenges associated with estimating dynamic treatment regimes in mHealth applications. V-learning directly estimates a policy which maximizes the value over a class of policies and requires minimal assumptions on the data-generating process. Furthermore, V-learning permits estimation of a randomized decision rule which can be used in place of existing strategies (e.g., ϵ-greedy) to encourage exploration in online estimation. A randomized decision rule can also provide patients with multiple treatment options. Estimation of an optimal policy for different populations can be handled through the use of different reference distributions.
V-learning and mobile technologies have the potential to improve patient outcomes in a variety of clinical areas. We have demonstrated, for example, that the proposed method can be used to estimate treatment regimes to reduce the number of hypo- and hyperglycemic episodes in patients with type 1 diabetes. The proposed method could also be useful for other mHealth applications as well as applications outside of mHealth. For example, V-learning could be used to estimate dynamic treatment regimes for chronic illnesses using electronic health records data. Future research in this area may include increasing flexibility through use of a semiparametric model for the state-value function. Alternatively, nonlinear models for the state-value function may be informed by underlying theory or mathematical models of the system of interest. Data-driven selection of tuning parameters for the proposed method may help to improve performance. Developing theory for alternative penalty functions, such as the LASSO penalty, is another important step. Accounting for patient availability and feasibility of a sequence of treatments can be done by setting constraints on the class of policies. This will ensure that the resulting mHealth intervention can be implemented and that the recommended decisions are consistent with domain knowledge.
It would also be worthwhile to generalize our asymptotic results to permit nonstationarity. While we believe that stationarity is generally a reasonable assumption for moderate stretches of time—including when we do online estimation using moderately large batches of observations wherein both the patient dynamics and treatment policy remain approximately constant over each batch—stationarity would not in general hold when either the patient dynamics or treatment policy change more rapidly. For example, online estimation with treatment policy changes after each observation could induce nonstationarity. We conjecture that our asymptotic results will continue to hold in this setting, as our simulation studies in Section 5.2 seem to indicate.
8. Acknowledgments
We thank the editor, associate editor, and reviewers for helpful comments which led to a significantly improved paper.
Appendix
Proofs
Proof of Lemma 2.1. Let π be an arbitrary policy and γ ∈ (0, 1) a fixed constant. Suppose we observe a state St = st at time t and let be the sequence of actions resulting in St = st, i.e., . Let be a potential sequence of actions taken from time t to time t + k. We have that
where we let π(at; st) = 0 for all at and st whenever t > T*(π). The last equality uses the consistency and strong ignorability assumptions.
Proof of Theorem 4.2. Proof of part 1: We first note that must solve
or
which is equivalent to where . We have that
by assumption 3, part 1 of assumption 4 and the Cauchy-Schwarz inequality. Let be arbitrary and note that
by the Cauchy-Schwarz inequality, where u⊗2 = uu⊺. This implies that
where we simplify notation by defining and . We have that
by the Cauchy-Schwarz inequality, the fact that by time-homogeneity, and part 1 of assumption 4. Also, A ≥ A − B and A − B ≥ c1∥c∥2 by assumption 5. Thus,
which finally implies that w1(π) is invertible and thus is well-defined uniformly over π ∈ Π. Using the fact that c⊺w1(π)c ≥ k0∥c∥2 for a constant k0 > 0, we can show that for some constant k1 > 0, where ∥ · ∥ is the usual matrix norm when applied to a matrix. Therefore, . Finally, it follows from assumptions 5 and 7 that .
Proof of part 2: Define
Let G be an envelope for , for example . By part 1 of assumption 4, . Part 4 of Lemma 8.1 below gives us that is Donsker. Since Π satisfies J[]{∞, Π, L3ρ(P)} < ∞, we have that
satisfies by parts 1 and 2 of Lemma 8.1 below. Moreover, F(at, st) = ∥Φ(st)∥·∥Φ(st)−γΦ(st+1)∥/μt(at; st) is an envelope for with < ∞ by assumption 3 and part 1 of assumption 4. Thus, is Donsker. Let
Similar arguments yield that is Donsker.
Now, let and . Let . We have that . Thus,
where oP(1) doesn’t depend on π, because uniformly over π ∈ Π by assumption 3 and part 1 of assumption 4, by part 1 of this theorem, and because λn = oP(n−1/2). Using arguments similar to those in the previous paragraph, one can show that is Donsker, where B* is any finite collection of elements of . By part 1 of this theorem, there exists a bounded, closed set B0 such that for all π ∈ Π. Let . Note that
where R* = OP(1) by the Donsker property of and R* doesn’t depend on π. Thus, is stochastically equicontinuous on B0. Combined with the Donsker property of for arbitrary B*, we have that the class is Donsker. Using Slutsky’s Theorem, Theorem 11.24 of Kosorok (2008), the fact that is Glivenko-Cantelli, and the fact that , we have that , in where is a mean zero Gaussian process indexed by Π with covariance .
Proof of part 3: We have that
in by Slutsky’s Theorem.
Proof of Theorem 4.3. Proof of part 1: Following part 3 of Theorem 4.2, we have that . Combining this with the unique and well separated maximum condition (assumption 6), continuity of in β, and Theorem 2.12 of Kosorok (2008), yields the result in part 1. Part 2 follows from part 1 of this theorem, part 1 of Theorem 4.2, and the continuous mapping theorem. Part 3 follows from parts 2 and 3 of Theorem 4.2. The proof of part 4 follows standard arguments. ☐
Lemma 8.1. Let and be function classes with respective envelopes F and G. Let . For any 1 ≤ r, s1, s2 ≤ ∞ with ,
.
.
For any .
If is a finite class, , where denotes the cardinality of .
Proof of Lemma 8.1. Proof of part 1: Let 1 ≤ r, s1, s2 ≤ ∞ with and let (ℓF, uF) and (ℓG, uG) be and ϵ-brackets, respectively. Choose ℓF, ≤ f1, f2 ≤ uF and ℓG ≤ g1, g2 ≤ uG and consider the bracket for any f2g2 defined by f1g1 ± (F|uG − ℓG| + G|uF − ℓF|). Note that f1g1+F|uG−ℓG|+G|uF−ℓF|−f2g2 ≥ F|uG−ℓG|+G|uF−ℓF|−F|g1−g2|−G|f1−f2| ≥ 0, because f2g2 − f1g1 = f2g2 + f2g1 − f2g1 − f1g1 ≤ F|g1 − g2| + G|f1 − f2|. Similarly, f2g2 + F|uG − ℓG| + G|uF − ℓF| − f1g1 ≥ 0. Thus, these brackets hold all f2g2 for f2 ∈ (ℓF, uF) and g2 ∈ (ℓG, uG). Now, ∥F|uG − ℓG| + G|uF − ℓF |∥r ≤ ∥F∥rs1ϵ + ∥G∥rs2ϵ by Minkowski’s inequality and Hölder’s inequality, and it follows that
Next we note that
and thus
The proof of part 2 follows from Lemma 9.25 part (i) of Kosorok (2008) after a change of variables. Proof of part 3: First note that
whence it follows that
where the second inequality uses the fact that a + b ≤ 2ab for all a, b ≥ 1.
Proof of part 4: If is finite, then . Thus,
which completes the proof.
Lemma 8.2. Define the class of functions
for a compact set and 2 ≤ J < ∞ where a = (a1, …, aJ)⊺. Then, there exists a b0 < ∞ such that for any 1 ≤ r ≤ ∞, , which is finite whenever ∥S∥r < ∞. Furthermore, .
Proof of Lemma 8.2. For , define and because is compact. By the mean value theorem, for any , there exists a point on the line segment between and such that
which implies that
(9)
It follows from equation (9) that assumption 7 holds for this particular class of policies. Now, by Theorem 9.23 of Kosorok (2008). Furthermore, , and thus
which proves the result.
Additional simulation results
Table 8:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ | Observed |
|---|---|---|---|---|---|---|
| 25 | 24 | 0.123 (0.0773) | 0.117 (0.1067) | 0.128 (0.0960) | 0.025 (0.0330) | −0.005 |
| 25 | 36 | 0.117 (0.0900) | 0.120 (0.0933) | 0.138 (0.0992) | 0.030 (0.0341) | −0.004 |
| 25 | 48 | 0.122 (0.0782) | 0.103 (0.1002) | 0.141 (0.0878) | 0.028 (0.0301) | 0.000 |
| 50 | 24 | 0.109 (0.0727) | 0.122 (0.0954) | 0.153 (0.0692) | 0.028 (0.0321) | −0.005 |
| 50 | 36 | 0.137 (0.0782) | 0.127 (0.1061) | 0.141 (0.0816) | 0.024 (0.0285) | 0.003 |
| 50 | 48 | 0.110 (0.0761) | 0.127 (0.0860) | 0.147 (0.0778) | 0.029 (0.0347) | 0.000 |
| 100 | 24 | 0.125 (0.0802) | 0.129 (0.0854) | 0.164 (0.0609) | 0.027 (0.0289) | −0.001 |
| 100 | 36 | 0.151 (0.0739) | 0.148 (0.0822) | 0.131 (0.0897) | 0.025 (0.0356) | −0.002 |
| 100 | 48 | 0.131 (0.0726) | 0.132 (0.0814) | 0.169 (0.0666) | 0.030 (0.0325) | −0.001 |
Table 9:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ | Observed |
|---|---|---|---|---|---|---|
| 25 | 24 | 0.136 (0.0862) | 0.118 (0.0925) | 0.153 (0.0785) | 0.022 (0.0272) | −0.005 |
| 25 | 36 | 0.147 (0.0768) | 0.132 (0.1057) | 0.124 (0.0909) | 0.026 (0.0355) | −0.004 |
| 25 | 48 | 0.128 (0.0897) | 0.146 (0.0826) | 0.116 (0.1067) | 0.020 (0.0317) | 0.000 |
| 50 | 24 | 0.113 (0.0954) | 0.129 (0.0964) | 0.123 (0.1052) | 0.027 (0.0275) | −0.005 |
| 50 | 36 | 0.116 (0.0973) | 0.149 (0.0940) | 0.152 (0.0798) | 0.029 (0.0289) | 0.003 |
| 50 | 48 | 0.109 (0.0899) | 0.132 (0.0998) | 0.124 (0.0932) | 0.024 (0.0285) | 0.000 |
| 100 | 24 | 0.167 (0.0652) | 0.155 (0.0743) | 0.144 (0.0897) | 0.025 (0.0291) | −0.002 |
| 100 | 36 | 0.167 (0.0731) | 0.153 (0.0988) | 0.155 (0.0851) | 0.027 (0.0311) | −0.002 |
| 100 | 48 | 0.137 (0.0868) | 0.175 (0.0615) | 0.148 (0.0978) | 0.026 (0.0332) | −0.001 |
Table 10:
| n | T | Linear VL | Polynomial VL | Gaussian VL | GGQ | Observed |
|---|---|---|---|---|---|---|
| 25 | 24 | 0.123 (0.0750) | 0.123 (0.0912) | 0.140 (0.0930) | 0.025 (0.0301) | −0.005 |
| 25 | 36 | 0.139 (0.0780) | 0.110 (0.1013) | 0.138 (0.0813) | 0.024 (0.0344) | −0.004 |
| 25 | 48 | 0.135 (0.0690) | 0.110 (0.1177) | 0.143 (0.0718) | 0.023 (0.0282) | 0.000 |
| 50 | 24 | 0.118 (0.0705) | 0.124 (0.0994) | 0.137 (0.0802) | 0.030 (0.0287) | −0.006 |
| 50 | 36 | 0.117 (0.0827) | 0.123 (0.0972) | 0.121 (0.0804) | 0.030 (0.0292) | 0.003 |
| 50 | 48 | 0.128 (0.0807) | 0.113 (0.1085) | 0.137 (0.0921) | 0.023 (0.0282) | 0.000 |
| 100 | 24 | 0.131 (0.0563) | 0.123 (0.1015) | 0.167 (0.0472) | 0.029 (0.0295) | −0.001 |
| 100 | 36 | 0.132 (0.0735) | 0.148 (0.0851) | 0.161 (0.0670) | 0.029 (0.0334) | −0.002 |
| 100 | 48 | 0.149 (0.0612) | 0.137 (0.1003) | 0.156 (0.0687) | 0.023 (0.0267) | −0.001 |
Contributor Information
Daniel J. Luckett, Department of Biostatistics, University of North Carolina at Chapel Hill.
Eric B. Laber, Department of Statistics, North Carolina State University.
Anna R. Kahkoska, Department of Nutrition, University of North Carolina at Chapel Hill.
David M. Maahs, Department of Pediatrics, Stanford University.
Elizabeth Mayer-Davis, Department of Nutrition, University of North Carolina at Chapel Hill.
Michael R. Kosorok, Department of Biostatistics, University of North Carolina at Chapel Hill.
References
- Ali AA, Hossain SM, Hovsepian K, Rahman MM, Plarre K, and Kumar S (2012). mPuff: Automated detection of cigarette smoking puffs from respiration measurements. In Proceedings of the 11th International Conference on Information Processing in Sensor Networks, pp. 269–280. ACM.
- Bergenstal RM, Garg S, Weinzimer SA, Buckingham BA, Bode BW, Tamborlane WV, and Kaufman FR (2016). Safety of a hybrid closed-loop insulin delivery system in patients with type 1 diabetes. Journal of the American Medical Association 316 (13), 1407–1408.
- Bexelius C, Löf M, Sandin S, Lagerros YT, Forsum E, and Litton J-E (2010). Measures of physical activity using cell phones: Validation using criterion methods. Journal of Medical Internet Research 12 (1), e2.
- Chakraborty B and Moodie EE (2013). Statistical Methods for Dynamic Treatment Regimes. Springer.
- Cobry E, McFann K, Messer L, Gage V, VanderWel B, Horton L, and Chase HP (2010). Timing of meal insulin boluses to achieve optimal postprandial glycemic control in patients with type 1 diabetes. Diabetes Technology & Therapeutics 12 (3), 173–177.
- Dai Y-H (2002). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization 13 (3), 693–701.
- Doya K (2000). Reinforcement learning in continuous time and space. Neural Computation 12 (1), 219–245.
- Ertefaie A (2014). Constructing dynamic treatment regimes in infinite-horizon settings. arXiv preprint arXiv:1406.0764.
- Free C, Phillips G, Watson L, Galli L, Felix L, Edwards P, Patel V, and Haines A (2013). The effectiveness of mobile-health technologies to improve health care service delivery processes: A systematic review and meta-analysis. PLoS Medicine 10 (1), e1001363.
- Haller MJ, Stalvey MS, and Silverstein JH (2004). Predictors of control of diabetes: Monitoring may be the key. The Journal of Pediatrics 144 (5), 660–661.
- Hastie T, Tibshirani R, and Friedman JH (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2 ed.). New York: Springer.
- Hernan MA and Robins JM (2010). Causal Inference. Boca Raton, FL: CRC Press.
- Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A, and Murphy SA (2015). Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34 (S), 1220.
- Kober J and Peters J (2012). Reinforcement learning in robotics: A survey. In Reinforcement Learning, pp. 579–610. Springer.
- Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. New York: Springer.
- Kosorok MR and Moodie EE (2015). Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine, Volume 21. SIAM.
- Kowalski A (2015). Pathway to artificial pancreas systems revisited: Moving downstream. Diabetes Care 38 (6), 1036–1043.
- Kumar S, Nilsen WJ, Abernethy A, Atienza A, Patrick K, Pavel M, Riley WT, Shar A, Spring B, Spruijt-Metz D, et al. (2013). Mobile health technology evaluation: The mHealth evidence workshop. American Journal of Preventive Medicine 45 (2), 228–236.
- Laber EB, Linn KA, and Stefanski LA (2014). Interactive model building for Q-learning. Biometrika 101 (4), 831.
- Lai TL and Robbins H (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), 4–22.
- Levine B-S, Anderson BJ, Butler DA, Antisdel JE, Brackett J, and Laffel LM (2001). Predictors of glycemic control and short-term adverse outcomes in youth with type 1 diabetes. The Journal of Pediatrics 139 (2), 197–203.
- Liao P, Klasnja P, Tewari A, and Murphy SA (2016). Sample size calculations for micro-randomized trials in mHealth. Statistics in Medicine 35 (12), 1944–1971.
- Linn KA, Laber EB, and Stefanski LA (2017). Interactive Q-learning for quantiles. Journal of the American Statistical Association 112 (518), 638–649.
- Long N, Gianola D, Rosa GJ, Weigel KA, Kranis A, and Gonzalez-Recio O (2010). Radial basis function regression methods for predicting quantitative traits using SNP markers. Genetics Research 92 (3), 209–225.
- Ly TT, Breton MD, Keith-Hynes P, De Salvo D, Clinton P, Benassi K, Mize B, Chernavvsky D, Place J, Wilson DM, et al. (2014). Overnight glucose control with an automated, unified safety system in children and adolescents with type 1 diabetes at diabetes camp. Diabetes Care 37 (8), 2310–2316.
- Ly TT, Roy A, Grosman B, Shin J, Campbell A, Monirabbasi S, Liang B, von Eyben R, Shanmugham S, Clinton P, et al. (2015). Day and night closed-loop control using the integrated Medtronic hybrid closed-loop system in type 1 diabetes at diabetes camp. Diabetes Care 38 (7), 1205–1211.
- Maahs DM, Mayer-Davis E, Bishop FK, Wang L, Mangan M, and McMurray RG (2012). Outpatient assessment of determinants of glucose excursions in adolescents with type 1 diabetes: Proof of concept. Diabetes Technology & Therapeutics 14 (8), 658–664.
- Maei HR, Szepesvári C, Bhatnagar S, and Sutton RS (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 719–726.
- Moodie EE, Richardson TS, and Stephens DA (2007). Demystifying optimal dynamic treatment regimes. Biometrics 63 (2), 447–455.
- Murphy S, Deng Y, Laber E, Maei H, Sutton R, and Witkiewitz K (2016). A batch, off-policy, actor-critic algorithm for optimizing the average reward. arXiv preprint arXiv:1607.05047.
- Murphy SA (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B 65 (2), 331–355.
- Murphy SA (2005). A generalization error for Q-learning. Journal of Machine Learning Research 6 (Jul), 1073–1097.
- Murphy SA, van der Laan MJ, and Robins JM (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96 (456), 1410–1423.
- Puterman ML (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
- Quinn CC, Shardell MD, Terrin ML, Barr EA, Ballew SH, and Gruber-Baldini AL (2011). Cluster-randomized trial of a mobile phone personalized behavioral intervention for blood glucose control. Diabetes Care 34 (9), 1934–1942.
- R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Robins JM (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium on Biostatistics, pp. 189–326. Springer.
- Rubin D (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics 6 (1), 34–58.
- Schulte PJ, Tsiatis AA, Laber EB, and Davidian M (2014). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science 29 (4), 640–661.
- Steinhubl SR, Muse ED, and Topol EJ (2013). Can mobile health technologies transform health care? Journal of the American Medical Association 310 (22), 2395–2396.
- Sutton R and Barto A (1998). Reinforcement Learning: An Introduction. The MIT Press.
- Tang Y and Kosorok MR (2012). Developing adaptive personalized therapy for cystic fibrosis using reinforcement learning. The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series, Working Paper 30.
- Weinzimer SA, Steil GM, Swan KL, Dziura J, Kurtz N, and Tamborlane WV (2008). Fully automated closed-loop insulin delivery versus semiautomated hybrid control in pediatric patients with type 1 diabetes using an artificial pancreas. Diabetes Care 31 (5), 934–939.
- Wolever T and Mullan Y (2011). Sugars and fat have different effects on postprandial glucose responses in normal and type 1 diabetic subjects. Nutrition, Metabolism and Cardiovascular Diseases 21 (9), 719–725.
- Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 (4), 1010–1018.
- Zhang B, Tsiatis AA, Laber EB, and Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 (3), 681–694.
- Zhao Y, Kosorok MR, and Zeng D (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28 (26), 3294–3315.
- Ziegler R, Heidtmann B, Hilgard D, Hofer S, Rosenbauer J, and Holl R (2011). Frequency of SMBG correlates with HbA1c and acute complications in children and adolescents with type 1 diabetes. Pediatric Diabetes 12 (1), 11–17.