
Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning

Daniel J Luckett 1,*, Eric B Laber 2, Anna R Kahkoska 3, David M Maahs 4, Elizabeth Mayer-Davis 5, Michael R Kosorok 6

Abstract

The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best possible health care for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient's health status in real-time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, most existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show that the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes.

Keywords: Markov decision processes, Precision medicine, Reinforcement learning, Type 1 diabetes

1. Introduction

The use of mobile devices in clinical care, called mobile health (mHealth), provides an effective and scalable platform to assist patients in managing their illness (Free et al., 2013; Steinhubl et al., 2013). Advantages of mHealth interventions include real-time communication between a patient and their health-care provider as well as systems for delivering training, teaching, and social support (Kumar et al., 2013). Mobile technologies can also be used to collect rich longitudinal data to estimate optimal dynamic treatment regimes and to deliver treatment that is deeply tailored to each individual patient. We propose a new estimator of an optimal treatment regime that is suitable for use with longitudinal data collected in mHealth applications.

A dynamic treatment regime provides a framework to administer individualized treatment over time through a series of decision rules. Dynamic treatment regimes have been well-studied in the statistical and biomedical literature (Murphy, 2003; Robins, 2004; Moodie et al., 2007; Kosorok and Moodie, 2015; Chakraborty and Moodie, 2013) and furthermore, statistical considerations in mHealth have been studied by, for example, Liao et al. (2016) and Klasnja et al. (2015). Although mobile technology has been successfully utilized in clinical areas such as diabetes (Quinn et al., 2011; Maahs et al., 2012), smoking cessation (Ali et al., 2012), and obesity (Bexelius et al., 2010), mHealth poses some unique challenges that preclude direct application of existing methodologies for dynamic treatment regimes. For example, mHealth applications typically have no definite time horizon in the sense that treatment decisions are made continually throughout the life of the patient with no fixed time point for the final treatment decision; estimation of an optimal treatment strategy must utilize data collected over a much shorter time period than that over which treatment would be applied in practice; the momentary signal may be weak and may not directly measure the outcome of interest; and estimation of optimal treatment strategies must be done online as data accumulate.

This work is motivated in part by our involvement in a study of mHealth as a management tool for type 1 diabetes. Type 1 diabetes is an autoimmune disease wherein the pancreas produces insufficient levels of insulin, a hormone needed to regulate blood glucose concentration. Patients with type 1 diabetes are continually engaged in management activities including monitoring glucose levels, timing and dosing insulin injections, and regulating diet and physical activity. Increased glucose monitoring and attention to self-management facilitate more frequent treatment adjustments and have been shown to improve patient outcomes (Levine et al., 2001; Haller et al., 2004; Ziegler et al., 2011). Thus, patient outcomes have the potential to be improved by diabetes management tools which are deeply tailored to the continually evolving health status of each patient. Mobile technologies can be used to collect data on physical activity, glucose, and insulin at a fine granularity in an outpatient setting (Maahs et al., 2012). There is great potential for using these data to create comprehensive and accessible mHealth interventions for clinical use. We envision application of this work for use before the artificial pancreas (Weinzimer et al., 2008; Kowalski, 2015; Bergenstal et al., 2016) becomes widely available.

In our motivating example as well as other mHealth applications, the goal is to treat a chronic disease over the long term. However, data will typically be collected over a short time period. Because the time frame of data collection is much shorter than the time frame of application, standard methods for longitudinal data analysis, such as generalized estimating equations or mixed models, cannot be used. We assume that the data collected in the field consist of a sample from a stationary Markov process, which allows us to estimate a dynamic treatment regime that will lead to good outcomes over the long term using data collected over a much shorter time period.

The sequential decision making process can be modeled as a Markov decision process (Puterman, 2014) and the optimal treatment regime can be estimated using reinforcement learning algorithms such as Q-learning (Murphy, 2005; Zhao et al., 2009; Tang and Kosorok, 2012; Schulte et al., 2014). Ertefaie (2014) proposed a variant of greedy gradient Q-learning (GGQ) to estimate optimal dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010). In GGQ, the form of the estimated Q-function dictates the form of the estimated optimal treatment regime. Thus, one must choose between a parsimonious model for the Q-function at the risk of model misspecification or a complex Q-function that yields unintelligible treatment regimes. Furthermore, GGQ requires modeling a non-smooth function of the data, which creates complications (Laber et al., 2014; Linn et al., 2017). Applications of mHealth require methods that can both estimate a policy from a fixed sample of retrospective data (offline estimation) and estimate a policy that is updated as data accumulate (online estimation). Online estimation has been given considerable attention in the field of reinforcement learning, particularly in engineering applications (Doya, 2000; Kober and Peters, 2012), with special consideration given to algorithms that provide fast updates in situations where new data accumulate multiple times per second. In our mHealth examples, treatment decisions may be made multiple times per day or even every hour or minute; however, rapid updating of the estimated policy at the scale of engineering applications is not needed. Therefore, we are able to focus on different aspects of the problem without needing to ensure very fast estimation. We propose an alternative estimation method for infinite horizon dynamic treatment regimes that is suited to mHealth applications. Our approach, which we call V-learning, involves estimating the optimal policy among a prespecified class of policies (Zhang et al., 2012, 2013). It requires minimal assumptions about the data-generating process and permits estimating a randomized decision rule that can be implemented online as data accumulate.

In Section 2, we describe the setup and present our method for offline estimation using data from a micro-randomized trial or observational study. In Section 3, we extend our method for application to online estimation with accumulating data. Theoretical results, including consistency and asymptotic normality of the proposed estimators, are presented in Section 4. We compare the proposed method to GGQ using simulated data in Section 5. A case study using data from patients with type 1 diabetes is presented in Section 6 and we conclude with a discussion in Section 7. Proofs of technical results are in the Appendix.

2. Offline estimation from observational data

We assume that the available data are $\{(S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT_i}, A_{iT_i}, S_{iT_i+1})\}_{i=1}^n$, which comprise $n$ independent, identically distributed trajectories $(S_1, A_1, S_2, \ldots, S_T, A_T, S_{T+1})$, where: $S_t \in \mathbb{R}^p$ denotes a summary of patient information collected up to and including time $t$; $A_t \in \mathcal{A}$ denotes the treatment assigned at time $t$; and $T \in \mathbb{Z}^+$ denotes the (possibly random) patient follow-up time. In the motivating example of type 1 diabetes, $S_t$ could contain a patient's blood glucose, dietary intake, and physical activity in the hour leading up to time $t$ and $A_t$ could denote an indicator that an insulin injection is taken at time $t$. We assume that the data-generating model is a time-homogeneous Markov process so that $S_{t+1} \perp (A_{t-1}, S_{t-1}, \ldots, A_1, S_1) \mid (A_t, S_t)$ and the conditional density $p(s_{t+1} \mid a_t, s_t)$ is the same for all $t \geq 1$. Let $L_t \in \{0, 1\}$ denote an indicator that the patient is still in follow-up at time $t$, i.e., $L_t = 1$ if the patient is being followed at time $t$ and zero otherwise. We assume that $L_t$ is contained in $S_t$ so that $P(L_{t+1} = 1 \mid A_t, S_t, \ldots, A_1, S_1) = P(L_{t+1} = 1 \mid A_t, S_t)$. It is not necessary for time points to be evenly spaced or homogeneous across patients. For example, we can define the decision times as those times at which data are observed and include time since the previous observation in $S_t$. Defining the decision times in this way allows us to effectively handle intermittent missing data, and thus we can assume that $L_t = 0$ implies $L_{t+1} = 0$ with probability one. Furthermore, we assume a known utility function $u: \mathbb{R}^p \times \mathcal{A} \times \mathbb{R}^p \to \mathbb{R}$ so that $U_t = u(S_{t+1}, A_t, S_t)$ measures the 'goodness' of choosing treatment $A_t$ in state $S_t$ and subsequently transitioning to state $S_{t+1}$. In our motivating example, the utility at time $t$ could be a measure of how infrequently the patient's blood glucose concentration deviates from the optimal range over the hour preceding and following time $t$. The goal is to select treatments to maximize expected cumulative utility; treatment selection is formalized using a treatment regime (Schulte et al., 2014; Kosorok and Moodie, 2015) and the utility associated with any regime is defined using potential outcomes (Rubin, 1978).

Let $\mathcal{B}(\mathcal{A})$ denote the space of probability distributions over $\mathcal{A}$. A treatment regime in this context is a function $\pi: \operatorname{dom} S_t \to \mathcal{B}(\mathcal{A})$ so that, under $\pi$, a decision maker presented with state $S_t = s_t$ at time $t$ will select action $a_t \in \mathcal{A}$ with probability $\pi(a_t; s_t)$. Define $\bar{a}_t = (a_1, \ldots, a_t) \in \mathcal{A}^t$ and $\bar{a} = (a_1, a_2, \ldots) \in \mathcal{A}^\infty$. The set of potential outcomes is

$$W^* = \left\{S_1, S_2^*(a_1), \ldots, S_{T^*(\bar{a})}^*\left(\bar{a}_{T^*(\bar{a})-1}\right) : T^*(\bar{a}) = \inf\{t \geq 1 : L_t^*(\bar{a}_{t-1}) = 0\},\ \bar{a} \in \mathcal{A}^\infty\right\},$$

where $S_t^*(\bar{a}_{t-1})$ is the potential state and $L_t^*(\bar{a}_{t-1})$ is the potential follow-up status at time $t$ under treatment sequence $\bar{a}_{t-1}$. Thus, the potential utility at time $t$ is $U_t^*(\bar{a}_t) = u\{S_{t+1}^*(\bar{a}_t), a_t, S_t^*(\bar{a}_{t-1})\}$. For any $\pi$, define $\{\xi_{\pi t}(\cdot)\}_{t \geq 1}$ to be a sequence of independent, $\mathcal{A}$-valued stochastic processes indexed by $\operatorname{dom} S_t$ such that $P\{\xi_{\pi t}(s_t) = a_t\} = \pi(a_t; s_t)$. The potential follow-up time under $\pi$ is

$$T^*(\pi) = \sum_{t \geq 1} \sum_{\bar{a}_t \in \mathcal{A}^t} t\, 1\left\{\sup_{\underline{a}_{t+1}} T^*(\bar{a}_t, \underline{a}_{t+1}) = t\right\} \prod_{\upsilon=1}^{t} 1\left[\xi_{\pi \upsilon}\{S_\upsilon^*(\bar{a}_{\upsilon-1})\} = a_\upsilon\right],$$

where $\underline{a}_{t+1} = (a_{t+1}, a_{t+2}, \ldots)$. The potential utility under $\pi$ at time $t$ is

$$U_t^*(\pi) = \begin{cases} \displaystyle\sum_{\bar{a}_t \in \mathcal{A}^t} U_t^*(\bar{a}_t) \prod_{\upsilon=1}^{t} 1\left[\xi_{\pi \upsilon}\{S_\upsilon^*(\bar{a}_{\upsilon-1})\} = a_\upsilon\right], & \text{if } T^*(\pi) \geq t, \\ 0, & \text{otherwise}, \end{cases}$$

where $S_1^*(\bar{a}_0) = S_1$. Thus, utility is set to zero after a patient is lost to follow-up. However, in certain situations, utility may be constructed so as to take a negative value at the time point when the patient is lost to follow-up, e.g., if the patient discontinues treatment because of a negative effect associated with the intervention. Define the state-value function $V(\pi, s_t) = E\{\sum_{k \geq 0} \gamma^k U_{t+k}^*(\pi) \mid S_t = s_t\}$ (Sutton and Barto, 1998), where $\gamma \in (0,1)$ is a fixed constant that captures the trade-off between short- and long-term outcomes. For any distribution $R$ on $\operatorname{dom} S_1$, define the value function with respect to reference distribution $R$ as $V_R(\pi) = \int V(\pi, s)\, dR(s)$; throughout, we assume that this reference distribution is fixed. The reference distribution can be thought of as a distribution of initial states and we estimate it from the data in the implementation in Sections 5 and 6. For a prespecified class of regimes, $\Pi$, the optimal regime, $\pi_R^{opt} \in \Pi$, satisfies $V_R(\pi_R^{opt}) \geq V_R(\pi)$ for all $\pi \in \Pi$. The goal is to estimate $\pi_R^{opt}$ using data collected from $n$ patients, where patient $i$ is followed for $T_i$ time points, $i = 1, \ldots, n$. Thus, $T_i$ represents the number of treatment decisions made for patient $i$ in the observed data; however, because the observed data are assumed to come from a time-homogeneous Markov chain, the estimated policy could be applied in a population indefinitely.

To construct an estimator of $\pi_R^{opt}$, we make a series of assumptions that connect the potential outcomes in $W^*$ with the data-generating model.

Assumption 1. Strong ignorability, $A_t \perp W^* \mid S_t$ for all $t$.

Assumption 2. Consistency, $S_t = S_t^*(\bar{A}_{t-1})$ for all $t$ and $T = T^*(\bar{A})$.

Assumption 3. Positivity, there exists $c_0 > 0$ so that $P(A_t = a_t \mid S_t = s_t) \geq c_0$ for all $a_t \in \mathcal{A}$, $s_t \in \operatorname{dom} S_t$, and all $t$.

In addition, we implicitly assume that there is no interference among the experimental units. These assumptions are common in the context of estimating dynamic treatment regimes (Robins, 2004; Hernan and Robins, 2010; Schulte et al., 2014). Assumption 1 implies that there are no unmeasured confounders and assumptions 1 and 3 hold by construction in a micro-randomized trial (Klasnja et al., 2015; Liao et al., 2016).

Let μt(at; st) = P(At = at|St = st) for each t ≥ 1. In a micro-randomized trial, μt(at; st) is a known randomization probability; in an observational study, it must be estimated from the data. The following lemma characterizes VR(π) for any regime, π, in terms of the data-generating model (see also Lemma 4.1 of Murphy et al., 2001). A proof is provided in the appendix.

Lemma 2.1. Let π denote an arbitrary regime and γ ∈ (0, 1) a discount factor. Then, under assumptions 1–3 and provided interchange of the sum and integration is justified, the state-value function of π at st is

$$V(\pi, s_t) = \sum_{k \geq 0} E\left[\gamma^k U_{t+k} \left\{\prod_{\upsilon=0}^{k} \frac{\pi(A_{\upsilon+t}; S_{\upsilon+t})}{\mu_{\upsilon+t}(A_{\upsilon+t}; S_{\upsilon+t})}\right\} \,\middle|\, S_t = s_t\right]. \tag{1}$$

The preceding result will form the basis for an estimating equation for VR(π). Write the right hand side of (1) as

$$V(\pi, S_t) = E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\left(U_t + \gamma \sum_{k \geq 0} E\left[\gamma^k U_{t+k+1}\left\{\prod_{\upsilon=0}^{k} \frac{\pi(A_{\upsilon+t+1}; S_{\upsilon+t+1})}{\mu_{\upsilon+t+1}(A_{\upsilon+t+1}; S_{\upsilon+t+1})}\right\} \,\middle|\, S_{t+1}\right]\right) \,\middle|\, S_t\right\} = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\{U_t + \gamma V(\pi, S_{t+1})\} \,\middle|\, S_t\right],$$

from which it follows that

$$0 = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\{U_t + \gamma V(\pi, S_{t+1}) - V(\pi, S_t)\} \,\middle|\, S_t\right].$$

Subsequently, for any function ψ defined on dom St, the state-value function satisfies

$$0 = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\{U_t + \gamma V(\pi, S_{t+1}) - V(\pi, S_t)\}\psi(S_t)\right], \tag{2}$$

which is an importance-weighted variant of the well-known Bellman optimality equation (Sutton and Barto, 1998).

Let $V(\pi, s; \theta_\pi)$ denote a model for $V(\pi, s)$ indexed by $\theta_\pi \in \Theta \subseteq \mathbb{R}^q$. We assume that the map $\theta_\pi \mapsto V(\pi, s; \theta_\pi)$ is differentiable everywhere for each fixed $s$ and $\pi$. Let $\nabla_{\theta_\pi} V(\pi, s; \theta_\pi)$ denote the gradient of $V(\pi, s; \theta_\pi)$ with respect to $\theta_\pi$ and define

$$\Lambda_n(\pi, \theta_\pi) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T_i} \frac{\pi(A_{it}; S_{it})}{\mu_t(A_{it}; S_{it})}\left\{U_{it} + \gamma V(\pi, S_{it+1}; \theta_\pi) - V(\pi, S_{it}; \theta_\pi)\right\}\nabla_{\theta_\pi} V(\pi, S_{it}; \theta_\pi). \tag{3}$$

Given a positive definite matrix $\Omega \in \mathbb{R}^{q \times q}$ and penalty function $\mathcal{P}: \mathbb{R}^q \to \mathbb{R}^+$, define $\hat{\theta}_n^\pi = \arg\min_{\theta_\pi \in \Theta}\{\Lambda_n(\pi, \theta_\pi)^\top \Omega \Lambda_n(\pi, \theta_\pi) + \lambda_n \mathcal{P}(\theta_\pi)\}$, where $\lambda_n$ is a tuning parameter. Subsequently, $V(\pi, s; \hat{\theta}_n^\pi)$ is the estimated state-value function under $\pi$ in state $s$. Thus, given a reference distribution, $R$, the estimated value of a regime, $\pi$, is $\hat{V}_{n,R}(\pi) = \int V(\pi, s; \hat{\theta}_n^\pi)\, dR(s)$ and the estimated optimal regime is $\hat{\pi}_n = \arg\max_{\pi \in \Pi} \hat{V}_{n,R}(\pi)$. The idea of V-learning is to use estimating equation (3) to estimate the value of any policy and maximize estimated value over a class of policies; we will discuss strategies for this maximization below.

V-learning requires a parametric class of policies. Assuming that there are $K$ possible treatments, $a_1, \ldots, a_K$, we can define a parametric class of policies as follows. Define $\pi(a_j; s, \beta) = \exp(s^\top\beta_j)/\{1 + \sum_{k=1}^{K-1}\exp(s^\top\beta_k)\}$ for $j = 1, \ldots, K - 1$, and $\pi(a_K; s, \beta) = 1/\{1 + \sum_{k=1}^{K-1}\exp(s^\top\beta_k)\}$. This defines a class of randomized policies, $\Pi$, parametrized by $\beta = (\beta_1^\top, \ldots, \beta_{K-1}^\top)^\top$, where $\beta_k$ is a vector of parameters for the $k$-th treatment. Under a policy in this class defined by $\beta$, actions are selected stochastically according to the probabilities $\pi(a_j; s, \beta)$, $j = 1, \ldots, K$. In the case of a binary treatment, a policy in this class reduces to $\pi(1; s, \beta) = \exp(s^\top\beta)/\{1 + \exp(s^\top\beta)\}$ and $\pi(0; s, \beta) = 1/\{1 + \exp(s^\top\beta)\}$ for a $p \times 1$ vector $\beta$. This class of policies is used in the implementation in Sections 5 and 6.
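To make the policy class concrete, here is a minimal R sketch of these class probabilities (our own illustration, not the authors' code; `policy_probs`, `beta`, and `s` are hypothetical names, and the last action is taken as the softmax reference category):

```r
# Softmax (multinomial logit) policy pi(a_j; s, beta) over K actions.
# beta: p x (K - 1) matrix (or vector when K = 2); s: state vector of length p.
# For K = 2 this reduces to the logistic form pi(1; s, beta) = expit(s' beta).
policy_probs <- function(s, beta) {
  eta <- c(crossprod(beta, s), 0)                   # linear scores; reference action gets 0
  exp(eta - max(eta)) / sum(exp(eta - max(eta)))    # numerically stable softmax
}

# Example: binary treatment (K = 2) with a two-dimensional state
beta <- c(0.5, -1.0)                                # p = 2, single coefficient vector
policy_probs(s = c(1.2, -0.3), beta)                # probabilities for (a_1, a_2)
```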

V-learning also requires a class of models for the state-value function indexed by a parameter, $\theta_\pi$. We use a basis function approximation (Hastie et al., 2009; Long et al., 2010). Let $\Phi = (\phi_1, \ldots, \phi_q)$ be a vector of prespecified basis functions and let $\Phi(s_{it}) = \{\phi_1(s_{it}), \ldots, \phi_q(s_{it})\}^\top$. Let $V(\pi, s_{it}; \theta_\pi) = \Phi(s_{it})^\top\theta_\pi$. Under this working model,

$$\Lambda_n(\pi, \theta_\pi) = \left[n^{-1}\sum_{i=1}^{n}\sum_{t=1}^{T_i} \frac{\pi(A_{it}; S_{it})}{\mu_t(A_{it}; S_{it})}\left\{\gamma\Phi(S_{it})\Phi(S_{it+1})^\top - \Phi(S_{it})\Phi(S_{it})^\top\right\}\right]\theta_\pi + n^{-1}\sum_{i=1}^{n}\sum_{t=1}^{T_i}\left\{\frac{\pi(A_{it}; S_{it})}{\mu_t(A_{it}; S_{it})}U_{it}\Phi(S_{it})\right\}. \tag{4}$$

Computational efficiency is gained from the linearity of V(π,sit;θπ) in θπ; flexibility can be achieved through the choice of Φ. We recommend Gaussian basis functions as they offer the highest degree of flexibility. However, along with Gaussian basis functions, we also examine the performance of V-learning using linear and polynomial basis functions in Sections 5 and 6 as these offer reasonable alternatives.
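Because (4) is linear in $\theta_\pi$, the objective $\Lambda_n(\pi, \theta_\pi)^\top\Omega\Lambda_n(\pi, \theta_\pi) + \lambda_n\mathcal{P}(\theta_\pi)$ has a closed-form minimizer under the $L_2$ penalty used later in Sections 4 and 5. A minimal R sketch under that choice (our code; the matrix and vector names are hypothetical, and rows index person-time observations):

```r
# Closed-form ridge solution for theta under the linear working model
# V(pi, s; theta) = Phi(s)' theta, for which Lambda_n(pi, theta) = M theta + b.
# Phi_now, Phi_next: basis matrices at S_it and S_it+1 (rows = person-time points);
# U: utilities; w: importance weights pi(A_it; S_it) / mu_t(A_it; S_it).
vlearn_theta <- function(Phi_now, Phi_next, U, w, gamma,
                         lambda = 1e-2, Omega = diag(ncol(Phi_now))) {
  n_obs <- nrow(Phi_now)        # scaling by rows rather than by n only rescales lambda
  M <- crossprod(Phi_now * w, gamma * Phi_next - Phi_now) / n_obs
  b <- colSums(Phi_now * (w * U)) / n_obs
  # minimizer of (M theta + b)' Omega (M theta + b) + lambda * theta' theta
  solve(crossprod(M, Omega %*% M) + lambda * diag(ncol(Phi_now)),
        -crossprod(M, Omega %*% b))
}
```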

The algorithm for V-learning is given in Algorithm 1 below. The algorithm can be terminated when $\|\beta^{k} - \beta^{k-1}\|$ is below some small threshold. The update in step 6 can be achieved with a variety of existing optimization methods. In our implementation, we use the BFGS algorithm (Dai, 2002) as implemented in the optim function in R software (R Core Team, 2016). Because the objective function is not necessarily convex, care must be taken when selecting the starting point, $\beta^1$ (step 2). In our implementation, we use simulated annealing as implemented in the optim function in R to find an appropriate starting point.

Algorithm 1: V-learning.
1 Initialize a class of policies, $\Pi = \{\pi_\beta : \beta \in B\}$, and a model, $V(\pi, s; \theta_\pi)$;
2 Set $k = 1$ and initialize $\beta^1$ to a starting value in $B$;
3 while Not converged do
4  Estimate $\hat{\theta}_n^{\pi_{\beta^k}} = \arg\min_{\theta_{\pi_{\beta^k}} \in \Theta}\{\Lambda_n(\pi_{\beta^k}, \theta_{\pi_{\beta^k}})^\top \Omega \Lambda_n(\pi_{\beta^k}, \theta_{\pi_{\beta^k}}) + \lambda_n \mathcal{P}(\theta_{\pi_{\beta^k}})\}$;
5  Evaluate $\hat{V}_{n,R}(\pi_{\beta^k}) = \int V(\pi_{\beta^k}, s; \hat{\theta}_n^{\pi_{\beta^k}})\, dR(s)$;
6  Set $\beta^{k+1} = \beta^k + \alpha_k \nabla_{\beta^k}\hat{V}_{n,R}(\pi_{\beta^k})$ for some step size, $\alpha_k$, where $\nabla_{\beta^k}\hat{V}_{n,R}(\pi_{\beta^k})$ is the gradient of $\hat{V}_{n,R}(\pi_{\beta^k})$ with respect to $\beta^k$;
7 end
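A compact R sketch of this loop, with the explicit gradient step in line 6 replaced by a call to `optim` (as in our implementation); it assumes the hypothetical helpers `policy_probs` and `vlearn_theta` sketched above, actions coded 1, ..., K with action K as the softmax reference, and `Phi_R` a basis matrix evaluated at draws from the reference distribution R:

```r
# Estimated value of pi_beta: importance weights, closed-form theta, then the
# average of Phi(s)' theta over draws from the reference distribution R.
vhat <- function(beta, S_now, A, mu, Phi_now, Phi_next, U, gamma, Phi_R) {
  w <- vapply(seq_len(nrow(S_now)),
              function(i) policy_probs(S_now[i, ], beta)[A[i]] / mu[i],
              numeric(1))
  theta <- vlearn_theta(Phi_now, Phi_next, U, w, gamma)
  mean(Phi_R %*% theta)
}

# Maximize vhat over beta (BFGS after a simulated-annealing start, as in the text):
# beta_hat <- optim(beta_start, vhat, method = "BFGS", control = list(fnscale = -1),
#                   S_now = S_now, A = A, mu = mu, Phi_now = Phi_now,
#                   Phi_next = Phi_next, U = U, gamma = 0.9, Phi_R = Phi_R)$par
```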

2.1. Greedy gradient Q-learning

Here we briefly discuss an existing method for infinite horizon dynamic treatment regimes, which will be used for comparison in the simulation studies in Section 5.

Ertefaie (2014) introduced greedy gradient Q-learning (GGQ) for estimating dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010; Murphy et al., 2016).

Define $Q^\pi(s_t, a_t) = E\{\sum_{k \geq 0}\gamma^k U_{t+k}^*(\pi) \mid S_t = s_t, A_t = a_t\}$. The Bellman optimality equation (Sutton and Barto, 1998) is

$$Q^{opt}(s_t, a_t) = E\left\{U_t + \gamma\max_{a \in \mathcal{A}} Q^{opt}(S_{t+1}, a) \,\middle|\, S_t = s_t, A_t = a_t\right\}. \tag{5}$$

Let $Q(s, a; \eta^{opt})$ be a parametric model for $Q^{opt}(s, a)$ indexed by $\eta^{opt} \in \mathcal{H} \subseteq \mathbb{R}^q$. In our implementation, we model $Q(s, a; \eta^{opt})$ as a linear function with interactions between all state variables and treatment. The Bellman optimality equation motivates the estimating equation

$$D_n(\eta^{opt}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T_i}\left\{U_{it} + \gamma\max_{a \in \mathcal{A}} Q(S_{it+1}, a; \eta^{opt}) - Q(S_{it}, A_{it}; \eta^{opt})\right\}\nabla_{\eta^{opt}} Q(S_{it}, A_{it}; \eta^{opt}). \tag{6}$$

For a positive definite matrix, $\Omega$, we estimate $\eta^{opt}$ using $\hat{\eta}_n^{opt} = \arg\min_{\eta \in \mathcal{H}} D_n(\eta)^\top\Omega D_n(\eta)$. The estimated optimal policy in state $s$ selects action $\hat{\pi}_n(s) = \arg\max_{a \in \mathcal{A}} Q(s, a; \hat{\eta}_n^{opt})$. This optimization problem is non-convex and non-differentiable in $\eta^{opt}$. However, it can be solved with a generalization of the greedy gradient Q-learning algorithm of Maei et al. (2010), and hence is referred to as GGQ by Ertefaie (2014) and in the following.
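To make the comparison concrete, a minimal R sketch of the GGQ objective $D_n(\eta)^\top\Omega D_n(\eta)$ with $\Omega = I$, for a linear Q-function with state-by-treatment interactions and a binary action (our own illustration; a direct numerical minimization is shown rather than the greedy gradient algorithm of Maei et al., 2010, and all names are hypothetical):

```r
# Linear Q-function with treatment interactions: Q(s, a; eta) = eta' x(s, a),
# where x(s, a) = (1, s, a, a * s) and a is binary.
q_features <- function(s, a) c(1, s, a, a * s)

ggq_objective <- function(eta, S_now, S_next, A, U, gamma) {
  p <- ncol(S_now)
  X <- t(vapply(seq_len(nrow(S_now)),
                function(i) q_features(S_now[i, ], A[i]), numeric(2 * p + 2)))
  q_obs  <- drop(X %*% eta)
  q_next <- vapply(seq_len(nrow(S_next)), function(i) {
    max(sum(eta * q_features(S_next[i, ], 0)),
        sum(eta * q_features(S_next[i, ], 1)))      # non-smooth max over actions
  }, numeric(1))
  D_n <- colMeans((U + gamma * q_next - q_obs) * X) # estimating equation (6)
  sum(D_n^2)                                        # D_n' Omega D_n with Omega = I
}
# eta_hat <- optim(rep(0, 2 * ncol(S_now) + 2), ggq_objective,
#                  S_now = S_now, S_next = S_next, A = A, U = U, gamma = 0.9)$par
```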

The performance of GGQ has been demonstrated in the context of chronic diseases with large sample sizes and a moderate number of time points. However, in mHealth applications, it is common to have small sample sizes and a large number of time points, with decisions occurring at a fine granularity. In GGQ, the estimated policy depends directly on $\hat{\eta}_n^{opt}$ and, therefore, depends on modeling the transition probabilities of the data-generating process. Furthermore, estimating equation (6) contains a non-smooth max operator, which makes estimation difficult without large amounts of data (Laber et al., 2014; Linn et al., 2017). V-learning only requires modeling the policy and the value function rather than the data-generating process and directly maximizes estimated value over a class of policies, thereby avoiding the non-smooth max operator in the estimating equation (compare equations (3) and (6)); these attributes may prove advantageous in mHealth settings.

3. Online estimation from accumulating data

Suppose we have accumulating data $\{(S_{i1}, A_{i1}, S_{i2}, \ldots)\}_{i=1}^n$, where $S_{it}$ and $A_{it}$ represent the state and action for patient $i = 1, \ldots, n$ at time $t \geq 1$. At each time $t$, we estimate an optimal policy in a class, $\Pi$, using data collected up to time $t$, take actions according to the estimated optimal policy, and estimate a new policy using the resulting states. Let $\hat{\pi}_n^t$ be the estimated policy at time $t$, i.e., $\hat{\pi}_n^t$ is estimated after observing state $S_{t+1}$ and before taking action $A_{t+1}$. If $\Pi$ is a class of randomized policies, we can select an action for a patient presenting with $S_{t+1} = s_{t+1}$ according to $\hat{\pi}_n^t(\cdot\,; s_{t+1})$, i.e., we draw $A_{t+1}$ according to the distribution $P(A_{t+1} = a) = \hat{\pi}_n^t(a; s_{t+1})$. If a class of deterministic policies is of interest, we can inject some randomness into $\hat{\pi}_n^t$ to facilitate exploration. One way to do this is an $\epsilon$-greedy strategy (Sutton and Barto, 1998), which selects the estimated optimal action with probability $1 - \epsilon$ and otherwise samples equally from all other actions. Because an $\epsilon$-greedy strategy can be used to introduce randomness into a deterministic policy, we can assume a class of randomized policies.
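A minimal R sketch of both action-selection strategies (our own illustration; `policy_probs` is the hypothetical softmax helper from Section 2 and `d` stands for a generic deterministic rule):

```r
# Draw A_{t+1} from the estimated randomized policy pi-hat_n^t(.; s_{t+1}).
draw_action <- function(s_next, beta_hat, actions = c(0, 1)) {
  sample(actions, size = 1, prob = policy_probs(s_next, beta_hat))
}

# Epsilon-greedy wrapper for a deterministic rule d(s): take d(s) with
# probability 1 - eps, otherwise sample uniformly from the remaining actions.
eps_greedy <- function(s_next, d, eps, actions = c(0, 1)) {
  a_star <- d(s_next)
  others <- setdiff(actions, a_star)
  if (runif(1) < 1 - eps) a_star else others[sample.int(length(others), 1)]
}
```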

At each time $t \geq 1$, let $\hat{\theta}_{n,t}^\pi = \arg\min_{\theta_\pi \in \Theta}\{\Lambda_{n,t}(\pi, \theta_\pi)^\top\Omega\Lambda_{n,t}(\pi, \theta_\pi) + \lambda_n\mathcal{P}(\theta_\pi)\}$, where $\Omega$, $\lambda_n$, and $\mathcal{P}$ are as defined in Section 2 and

$$\Lambda_{n,t}(\pi, \theta_\pi) = \frac{1}{n}\sum_{i=1}^{n}\sum_{\upsilon=1}^{t} \frac{\pi(A_{i\upsilon}; S_{i\upsilon})}{\hat{\pi}_n^{\upsilon-1}(A_{i\upsilon}; S_{i\upsilon})}\left\{U_{i\upsilon} + \gamma V(\pi, S_{i\upsilon+1}; \theta_\pi) - V(\pi, S_{i\upsilon}; \theta_\pi)\right\}\nabla_{\theta_\pi} V(\pi, S_{i\upsilon}; \theta_\pi) \tag{7}$$

with $\hat{\pi}_n^0$ some initial randomized policy. We note that estimating equation (7) is similar to (3), except that $\hat{\pi}_n^{\upsilon-1}$ replaces $\mu_\upsilon$ as the data-generating policy. Given the estimator of the value of $\pi$ at time $t$, $\hat{V}_{n,R,t}(\pi) = \int V(\pi, s; \hat{\theta}_{n,t}^\pi)\, dR(s)$, the estimated optimal policy at time $t$ is $\hat{\pi}_n^t = \arg\max_{\pi \in \Pi}\hat{V}_{n,R,t}(\pi)$. In practice, we may choose to update the policy in batches rather than at every time point. An alternative way to encourage exploration through the action space is to choose $\hat{\pi}_n^t = \arg\max_{\pi \in \Pi}\{\hat{V}_{n,R,t}(\pi) + \alpha_t\hat{\psi}_t(\pi)\}$ for some sequence $\alpha_t \downarrow 0$, where $\hat{\psi}_t(\pi)$ is a measure of uncertainty in $\hat{V}_{n,R,t}(\pi)$. An example of this is upper confidence bound sampling, or UCB (Lai and Robbins, 1985).

In some settings, when the data-generating process may vary across patients, it may be desirable to allow each patient to follow an individualized policy that is estimated using only that patient's data. Suppose that $n$ patients are followed for an initial $T_1$ time points, after which the policy $\hat{\pi}_n^1$ is estimated. Then, suppose that patient $i$ follows $\hat{\pi}_n^1$ until time $T_2$, when a policy $\hat{\pi}_i^2$ is estimated using only the states and actions observed for patient $i$. This procedure is then carried out until time $T_K$ for some fixed $K$, with each patient following their own individual policy which is adapted to match the individual over time. We may also choose to adapt the randomness of the policy at each estimation. For example, we could select $\epsilon_1 > \epsilon_2 > \cdots > \epsilon_K$ and, following estimation $k$, have patient $i$ follow policy $\hat{\pi}_i^k$ with probability $1 - \epsilon_k$ and policy $\hat{\pi}_n^1$ with probability $\epsilon_k$. In this way, patients become more likely to follow their own individualized policy and less likely to follow the initial policy over time, reflecting increasing confidence in the individualized policy as more data become available. The same class of policies and model for the state-value function can be used as in Section 2.

4. Theoretical results

In this section, we establish asymptotic properties of $\hat{\theta}_n^\pi$ and $\hat{\pi}_n$ for offline estimation. Because the proposed online estimation procedure involves performing offline estimation repeatedly in smaller batches, our theoretical results apply to each smaller batch as the number of observations increases. Developing more general theory for online estimation is an interesting topic for future research. Throughout, we assume assumptions 1–3 from Section 2.

Let $\hat{\theta}_n^\pi = \arg\min_{\theta_\pi \in \Theta}\{\Lambda_n(\pi, \theta_\pi)^\top\Omega\Lambda_n(\pi, \theta_\pi) + \lambda_n\theta_\pi^\top\theta_\pi\}$. Thus, we consider the special case where the penalty function is the squared Euclidean norm of $\theta_\pi$. We will assume that $\lambda_n = o_P(n^{-1/2})$. All of our results hold for any positive definite matrix, $\Omega$. Assume the working model for the state-value function introduced in Section 2, i.e., $V(\pi, s_{it}; \theta_\pi) = \Phi(s_{it})^\top\theta_\pi$. For fixed $\pi$, denote the true $\theta_\pi$ by $\theta_0^\pi$, i.e., $V(\pi, s) = \Phi(s)^\top\theta_0^\pi$. Let $\nu = \int\Phi(s)\, dR(s)$ so that $V_R(\pi) = \nu^\top\theta_0^\pi$. Define $\hat{V}_{n,\hat{R}}(\pi) = \{\mathbb{E}_n\Phi(S)\}^\top\hat{\theta}_n^\pi$, where $\mathbb{E}_n$ denotes the empirical measure of the observed data. Let $\Pi = \{\pi_\beta : \beta \in B\}$ be a parametric class of policies and let $\hat{\pi}_n = \pi_{\hat{\beta}_n}$, where $\hat{\beta}_n = \arg\max_{\beta \in B}\hat{V}_{n,\hat{R}}(\pi_\beta)$.

Our main results are summarized in Theorems 4.2 and 4.3 below. Because each patient trajectory is a stationary Markov chain, we need to use asymptotic theory based on stationary processes; consequently, some of the required technical conditions are more difficult to verify than those for i.i.d. data. Define the bracketing integral for a class of functions, $\mathcal{F}$, by $J_{[]}\{\delta, \mathcal{F}, L_r(P)\} = \int_0^\delta\sqrt{\log N_{[]}\{\epsilon, \mathcal{F}, L_r(P)\}}\, d\epsilon$, where the bracketing number for $\mathcal{F}$, $N_{[]}\{\epsilon, \mathcal{F}, L_r(P)\}$, is the number of $L_r(P)$ $\epsilon$-brackets needed such that each element of $\mathcal{F}$ is contained in at least one bracket (see Chapter 2 of Kosorok, 2008). For any stationary sequence of possibly dependent random variables, $\{X_t\}_{t \geq 1}$, let $\mathcal{M}_b^c$ be the $\sigma$-field generated by $X_b, \ldots, X_c$ and define $\zeta(k) = E[\sup_{m \geq 1}\{|P(B \mid \mathcal{M}_1^m) - P(B)| : B \in \mathcal{M}_{m+k}^\infty\}]$. We say that the chain $\{X_t\}_{t \geq 1}$ is absolutely regular if $\zeta(k) \to 0$ as $k \to \infty$ (also called $\beta$-mixing in Chapter 11 of Kosorok, 2008). We make the following assumptions.

Assumption 4. There exists a $2 < \rho < \infty$ such that

  1. $E|U_t|^{3\rho} < \infty$, $E\|\Phi(S_t)\|^{3\rho} < \infty$, and $E\|S_t\|^{3\rho} < \infty$.

  2. The sequence $\{(S_t, A_t)\}_{t \geq 1}$ is absolutely regular with $\sum_{k=1}^{\infty} k^{2/(\rho-2)}\zeta(k) < \infty$.

  3. The bracketing integral of the class of policies satisfies $J_{[]}\{\infty, \Pi, L_{3\rho}(P)\} < \infty$.

Assumption 5. There exists some $c_1 > 0$ such that

$$\inf_{\pi \in \Pi} c^\top E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\left\{\Phi(S_t)\Phi(S_t)^\top - \gamma^2\Phi(S_{t+1})\Phi(S_{t+1})^\top\right\}\right]c \geq c_1\|c\|^2$$

for all $c \in \mathbb{R}^q$.

Assumption 6. The map $\beta \mapsto V_R(\pi_\beta)$ has a unique and well-separated maximum over $\beta$ in the interior of $B$; let $\beta_0$ denote the maximizer.

Assumption 7. The following condition holds: $\sup_{\|\beta_1 - \beta_2\| \leq \delta} E|\pi_{\beta_1}(A; S) - \pi_{\beta_2}(A; S)| \to 0$ as $\delta \downarrow 0$.

Remark 4.1. Assumption 4 requires certain finite moments and that the dependence between observations on the same patient vanishes as observations become further apart. In Lemma 8.2 in the appendix, we verify part 3 of assumption 4 and assumption 7 for the class of policies introduced in Section 2. However, note that the theory holds for any class of policies satisfying the given assumptions, not just the class considered here. Assumption 5 is needed to show the existence of a unique $\theta_0^\pi$ uniformly over $\Pi$ and can be verified empirically by checking that certain data-dependent matrices are invertible. Assumption 6 requires that the true optimal decision in each state is unique (see assumption A.8 of Ertefaie, 2014) and is a standard assumption in M-estimation (see chapter 14 of Kosorok, 2008). Assumption 7 requires smoothness on the class of policies.

The main results of this section are stated below. Theorem 4.2 states that there exists a unique solution to $0 = E\Lambda_n(\pi, \theta_\pi)$ uniformly over $\Pi$ and that the estimator $\hat{\theta}_n^\pi$ converges weakly to a mean zero Gaussian process in $\ell^\infty(\Pi)$.

Theorem 4.2. Under the given assumptions, the following hold.

  1. For all $\pi \in \Pi$, there exists a $\theta_0^\pi \in \mathbb{R}^q$ such that $E\Lambda_n(\pi, \theta_\pi)$ has a zero at $\theta_\pi = \theta_0^\pi$. Moreover, $\sup_{\pi \in \Pi}\|\theta_0^\pi\| < \infty$ and $\sup_{\|\beta_1 - \beta_2\| \leq \delta}\|\theta_0^{\pi_{\beta_1}} - \theta_0^{\pi_{\beta_2}}\| \to 0$ as $\delta \downarrow 0$.

  2. Let $G(\pi)$ be a tight, mean zero Gaussian process indexed by $\Pi$ with covariance $E\{G(\pi_1)G(\pi_2)^\top\} = w_1(\pi_1)^{-1}w_0(\pi_1, \pi_2)\{w_1(\pi_2)^{-1}\}^\top$, where
    $$w_0(\pi_1, \pi_2) = E\left[\frac{\pi_1(A_t; S_t)\pi_2(A_t; S_t)}{\mu_t(A_t; S_t)^2}\left\{U_t + \gamma\Phi(S_{t+1})^\top\theta_0^{\pi_1} - \Phi(S_t)^\top\theta_0^{\pi_1}\right\}\left\{U_t + \gamma\Phi(S_{t+1})^\top\theta_0^{\pi_2} - \Phi(S_t)^\top\theta_0^{\pi_2}\right\}\Phi(S_t)\Phi(S_t)^\top\right]$$
    and
    $$w_1(\pi) = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\Phi(S_t)\{\Phi(S_t) - \gamma\Phi(S_{t+1})\}^\top\right].$$
    Then, $\sqrt{n}(\hat{\theta}_n^\pi - \theta_0^\pi) \rightsquigarrow G(\pi)$ in $\ell^\infty(\Pi)$.
  3. Let $G(\pi)$ be as defined in part 2. Then, $\sqrt{n}\{\hat{V}_{n,\hat{R}}(\pi) - V_R(\pi)\} \rightsquigarrow \nu^\top G(\pi)$ in $\ell^\infty(\Pi)$.

Theorem 4.3 below gives us that the estimated optimal policy converges in probability to the true optimal policy over Π and that the estimated value of the estimated optimal policy converges to the true value of the estimated optimal policy.

Theorem 4.3. Under the given assumptions, the following hold.

  1. Let $\hat{\beta}_n = \arg\max_{\beta \in B}\hat{V}_{n,\hat{R}}(\pi_\beta)$ and $\beta_0 = \arg\max_{\beta \in B}V_R(\pi_\beta)$. We have that $\|\hat{\beta}_n - \beta_0\| \to_P 0$.

  2. Let $\hat{\beta}_n$ and $\beta_0$ be defined as in part 1. Then, $|V_R(\pi_{\hat{\beta}_n}) - V_R(\pi_{\beta_0})| \to_P 0$.

  3. Let $\sigma_0^2 = \nu^\top w_1(\pi_{\beta_0})^{-1}w_0(\pi_{\beta_0}, \pi_{\beta_0})\{w_1(\pi_{\beta_0})^{-1}\}^\top\nu$. Then, $\sqrt{n}\{\hat{V}_{n,\hat{R}}(\pi_{\hat{\beta}_n}) - V_R(\pi_{\hat{\beta}_n})\} \rightsquigarrow N(0, \sigma_0^2)$.

  4. A consistent estimator for $\sigma_0^2$ is
    $$\hat{\sigma}_n^2 = \{\mathbb{E}_n\Phi(S_t)\}^\top\hat{w}_1(\pi_{\hat{\beta}_n})^{-1}\hat{w}_0(\pi_{\hat{\beta}_n}, \pi_{\hat{\beta}_n})\{\hat{w}_1(\pi_{\hat{\beta}_n})^{-1}\}^\top\{\mathbb{E}_n\Phi(S_t)\},$$
    where
    $$\hat{w}_0(\pi_1, \pi_2) = \mathbb{E}_n\left[\frac{\pi_1(A_t; S_t)\pi_2(A_t; S_t)}{\mu_t(A_t; S_t)^2}\left\{U_t + \gamma\Phi(S_{t+1})^\top\hat{\theta}_n^{\pi_1} - \Phi(S_t)^\top\hat{\theta}_n^{\pi_1}\right\}\left\{U_t + \gamma\Phi(S_{t+1})^\top\hat{\theta}_n^{\pi_2} - \Phi(S_t)^\top\hat{\theta}_n^{\pi_2}\right\}\Phi(S_t)\Phi(S_t)^\top\right]$$
    and
    $$\hat{w}_1(\pi) = \mathbb{E}_n\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\Phi(S_t)\{\Phi(S_t) - \gamma\Phi(S_{t+1})\}^\top\right].$$

Proofs of the above results are in the Appendix along with a result on bracketing entropy that is needed for the proof of Theorem 4.2 and a proof that the class of policies introduced above satisfies the necessary bracketing integral assumption.

5. Simulation experiments

In this section, we examine the performance of V-learning on simulated data. Section 5.1 contains results for offline estimation and Section 5.2 contains results for online estimation. All simulation results are averaged across 50 replications.

5.1. Offline simulations

Our implementation of V-learning follows the setup in Section 2. Maximizing $\hat{V}_{n,R}(\pi)$ is done using a combination of simulated annealing and the BFGS algorithm as implemented in the optim function in R software (R Core Team, 2016). We note that $\hat{V}_{n,R}(\pi)$ is differentiable in $\pi$, thereby avoiding some of the computational complexity of GGQ. However, the objective is not necessarily convex. In order to avoid local maxima, simulated annealing with 1000 function evaluations is used to find a neighborhood of the maximum; this solution is then used as the starting value for the BFGS algorithm.

We use the class of policies introduced in Section 2. Although we maximize the value over a class of randomized policies, the true optimal policy is deterministic. To prevent the coefficients of $\hat{\beta}_n$ from diverging to infinity, we add an $L_2$ penalty when maximizing over $\beta$. To prevent overfitting, we use an $L_2$ penalty when computing $\hat{\theta}_n^\pi$, i.e., $\mathcal{P}(\theta_\pi) = \theta_\pi^\top\theta_\pi$. We let $\Omega$ be the identity matrix. Simulation results with alternate choices of $\mathcal{P}$ and $\Omega$ are given in the appendix. Tuning parameters can be used to control the amount of randomness in the estimated policy. For example, increasing the penalty when computing $\hat{\beta}_n$ is one way to encourage exploration through the action space because $\beta = 0$ defines a policy where each action is selected with equal probability.

We consider three different models for the state-value function: (i) linear; (ii) second degree polynomial; and (iii) Gaussian radial basis functions (RBF). The Gaussian RBF is $\phi(x; \kappa, \tau^2) = \exp\{-(x - \kappa)^2/(2\tau^2)\}$. We use $\tau = 0.25$ and $\kappa = 0, 0.25, 0.5, 0.75, 1$ to create a basis of functions and apply this basis to the state variables after scaling them to be between 0 and 1. Each model also implicitly contains an intercept.
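A short R sketch of this basis construction (our code; it assumes the state matrix has already been rescaled column-wise to [0, 1]):

```r
# Gaussian RBF phi(x; kappa, tau^2) = exp{-(x - kappa)^2 / (2 tau^2)} applied
# coordinate-wise, with centers kappa in {0, 0.25, 0.5, 0.75, 1} and tau = 0.25.
rbf_basis <- function(S, centers = seq(0, 1, by = 0.25), tau = 0.25) {
  feats <- lapply(seq_len(ncol(S)), function(j)
    sapply(centers, function(kappa) exp(-(S[, j] - kappa)^2 / (2 * tau^2))))
  cbind(1, do.call(cbind, feats))       # implicit intercept plus p * 5 RBF features
}
```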

We begin with the following simple generative model. Let the two-dimensional state vector be $S_{it} = (S_{i,1t}, S_{i,2t})^\top$, $i = 1, \ldots, n$, $t = 1, \ldots, T$. We initiate the state variables as independent standard normal random variables and let them evolve according to $S_{i,1t} = (3/4)(2A_{it-1} - 1)S_{i,1t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{1t}$ and $S_{i,2t} = (3/4)(1 - 2A_{it-1})S_{i,2t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{2t}$, where $A_{it}$ takes values in $\{0, 1\}$ and $\epsilon_{1t}$ and $\epsilon_{2t}$ are independent $N(0, 1/4)$ random variables. Define the utility function by $U_{it} = u(S_{it+1}, A_{it}, S_{it}) = 2S_{i,1t+1} + S_{i,2t+1} - (1/4)(2A_{it} - 1)$. At each time $t$, we must make a decision to treat or not with the goal of maximizing the components of $S$ while treating as few times as possible. Treatment has a positive effect on $S_1$ and a negative effect on $S_2$. We generate $A_{it}$ from a Bernoulli distribution with mean 1/2. In estimation, we assume that the generating model for treatment is known, as would be the case in a micro-randomized trial.
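A minimal R sketch of this generative model (our code; it simulates one patient under the observational Bernoulli(1/2) treatment mechanism and returns states, actions, and utilities after a burn-in period):

```r
# One trajectory from the two-covariate generative model: S1 and S2 evolve with
# opposite treatment effects, A_t ~ Bernoulli(1/2), and
# U_t = 2 * S1_{t+1} + S2_{t+1} - (1/4) * (2 * A_t - 1).
sim_trajectory <- function(T_obs, burn_in = 50) {
  total <- T_obs + burn_in + 1
  S <- matrix(0, nrow = total, ncol = 2)
  S[1, ] <- rnorm(2)                                 # independent standard normal start
  A <- rbinom(total, 1, 0.5)                         # behavior policy mu_t = 1/2
  for (t in 2:total) {
    S[t, 1] <- 0.75 * (2 * A[t - 1] - 1) * S[t - 1, 1] +
               0.25 * S[t - 1, 1] * S[t - 1, 2] + rnorm(1, sd = 0.5)   # Var = 1/4
    S[t, 2] <- 0.75 * (1 - 2 * A[t - 1]) * S[t - 1, 2] +
               0.25 * S[t - 1, 1] * S[t - 1, 2] + rnorm(1, sd = 0.5)
  }
  keep <- (burn_in + 1):(total - 1)
  U <- 2 * S[keep + 1, 1] + S[keep + 1, 2] - 0.25 * (2 * A[keep] - 1)
  list(S_now = S[keep, ], S_next = S[keep + 1, ], A = A[keep], U = U)
}
```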

We generate samples of $n$ patients with $T$ time points per patient from the given generative model after an initial burn-in period of 50 time points. The burn-in period ensures that our simulated data are sampled from an approximately stationary distribution. We estimate policies using V-learning with three different types of basis functions and GGQ. After estimating optimal policies, we simulate 100 patients following each estimated policy for 100 time points and take the mean utility under each policy as an estimate of the value of that policy. Estimated values are found in Table 1 with Monte Carlo standard errors along with the observed value. Recall that larger values are better. The policies estimated using V-learning produce better outcomes than the observational policy and the policy estimated using GGQ. V-learning produces the best outcomes using Gaussian basis functions. In the Appendix, we give results for the same simulation settings given in Table 1 for alternate choices of $\mathcal{P}$ and $\Omega$. Table 8 presents results with the LASSO penalty when $\Omega$ is the identity matrix, Table 9 presents results with the $L_2$ penalty when $\Omega = \{\mathbb{E}_n S_t S_t^\top\}^{-1}$, and Table 10 presents results with the LASSO penalty when $\Omega = \{\mathbb{E}_n S_t S_t^\top\}^{-1}$.

Table 1:

Monte Carlo value estimates for offline simulations with γ = 0.9.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.118 (0.0892) 0.091 (0.0825) 0.110 (0.0979) 0.014 (0.0311) −0.005
36 0.108 (0.0914) 0.115 (0.0911) 0.112 (0.0919) 0.029 (0.0280) −0.004
48 0.106 (0.0705) 0.071 (0.0974) 0.103 (0.0757) 0.031 (0.0350) 0.000
50 24 0.124 (0.0813) 0.109 (0.1045) 0.118 (0.0879) 0.016 (0.0355) −0.005
36 0.126 (0.0818) 0.134 (0.0878) 0.136 (0.0704) 0.027 (0.0276) 0.003
48 0.101 (0.0732) 0.109 (0.0767) 0.115 (0.0763) 0.020 (0.0245) 0.000
100 24 0.117 (0.0895) 0.135 (0.0973) 0.140 (0.0866) 0.019 (0.0257) 0.011
36 0.113 (0.0853) 0.105 (0.1033) 0.139 (0.0828) 0.021 (0.0312) 0.012
48 0.111 (0.0762) 0.143 (0.0853) 0.114 (0.0699) 0.031 (0.0306) −0.001

Next, we simulate cohorts of patients with type 1 diabetes to mimic the mHealth study of Maahs et al. (2012). Maahs et al. (2012) followed a small sample of youths with type 1 diabetes and recorded data at a fine granularity using mobile devices. Blood glucose levels were tracked in real time using continuous glucose monitoring, physical activity was measured continuously using accelerometers, and insulin injections were logged by an insulin pump. Dietary data were recorded by 24-hour recall over phone interviews.

In our simulation study, we divide each day of follow-up into 60 minute intervals. Thus, for one day of follow-up, we observe $T = 24$ time points per simulated patient and a treatment decision is made every hour. Our hypothetical mHealth study is designed to estimate an optimal dynamic treatment regime for the timing of insulin injections based on patient blood glucose, physical activity, and dietary intake with the goal of controlling future blood glucose as close as possible to the optimal range. To this end, we define the utility at time $t$ as a weighted sum of hypo- and hyperglycemic episodes in the 60 minutes preceding and following time $t$. Weights are −3 when glucose ≤ 70 (hypoglycemic), −2 when glucose > 150 (hyperglycemic), −1 when 70 < glucose ≤ 80 or 120 < glucose ≤ 150 (borderline hypo- and hyperglycemic), and 0 when 80 < glucose ≤ 120 (normal glycemia). Utility at each time point ranges from −6 to 0 with larger utilities (closer to 0) being more preferable. For example, a patient who presents with an average blood glucose of 155 mg/dL over time interval $t - 1$, takes an action to correct their hyperglycemia, and presents with an average blood glucose of 145 mg/dL over time interval $t$ would receive a utility of $U_t = -3$. Weights were chosen to reflect the relative clinical consequences of high and low blood glucose. For example, acute hypoglycemia, characterized by blood glucose levels below 70 mg/dL, is an emergency situation that can result in coma or death.
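A minimal R sketch of this utility (our code; helper names are hypothetical), which reproduces the worked example in this paragraph:

```r
# Weight for an average blood glucose value (mg/dL) over a 60-minute interval.
glucose_weight <- function(g) {
  if (g <= 70) -3                       # hypoglycemic
  else if (g > 150) -2                  # hyperglycemic
  else if (g <= 80 || g > 120) -1       # borderline hypo- or hyperglycemic
  else 0                                # normal glycemia, 80 < g <= 120
}

# Utility at time t: weighted sum over the hour preceding and the hour following t.
utility_t <- function(glucose_prev, glucose_next) {
  glucose_weight(glucose_prev) + glucose_weight(glucose_next)
}

utility_t(155, 145)                     # the example in the text: -2 + (-1) = -3
```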

Simulated data are generated as follows. At each time point, patients are randomly chosen to receive an insulin injection with probability 0.3, consume food with probability 0.2, partake in mild physical activity with probability 0.4, and partake in moderate physical activity with probability 0.2. Grams of food intake and counts of physical activity are generated from normal distributions with parameters estimated from the data of Maahs et al. (2012). Initial blood glucose level for each patient is drawn from a normal distribution with mean 100 and standard deviation 25. Define the covariates for patient $i$ collected at time $t$ by $(\mathrm{Gl}_{it}, \mathrm{Di}_{it}, \mathrm{Ex}_{it})$, where $\mathrm{Gl}_{it}$ is average blood glucose level, $\mathrm{Di}_{it}$ is total dietary intake, and $\mathrm{Ex}_{it}$ is total counts of physical activity as would be measured by an accelerometer. Glucose levels evolve according to

$$\mathrm{Gl}_t = \mu(1 - \alpha_1) + \alpha_1\mathrm{Gl}_{t-1} + \alpha_2\mathrm{Di}_{t-1} + \alpha_3\mathrm{Di}_{t-2} + \alpha_4\mathrm{Ex}_{t-1} + \alpha_5\mathrm{Ex}_{t-2} + \alpha_6\mathrm{In}_{t-1} + \alpha_7\mathrm{In}_{t-2} + e, \tag{8}$$

where $\mathrm{In}_t$ is an indicator of an insulin injection received at time $t$ and $e \sim N(0, \sigma^2)$. We use the parameter vector $\alpha = (\alpha_1, \ldots, \alpha_7) = (0.9, 0.1, 0.1, -0.01, -0.01, -2, -4)$, $\mu = 100$, and $\sigma = 5.5$ based on a linear model fit to the data of Maahs et al. (2012). The known lag-time in the effect of insulin is reflected by $\alpha_6 = -2$ and $\alpha_7 = -4$. Selecting $\alpha_1 < 1$ ensures the existence of a stationary distribution.
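A minimal R sketch of one glucose transition under (8) with these parameter values (our code; lagged dietary intake, activity counts, and insulin indicators are passed in directly):

```r
# One-step update Gl_t from equation (8), with e ~ N(0, sigma^2).
next_glucose <- function(gl_lag1, di_lag1, di_lag2, ex_lag1, ex_lag2,
                         in_lag1, in_lag2,
                         alpha = c(0.9, 0.1, 0.1, -0.01, -0.01, -2, -4),
                         mu = 100, sigma = 5.5) {
  mu * (1 - alpha[1]) + alpha[1] * gl_lag1 +
    alpha[2] * di_lag1 + alpha[3] * di_lag2 +
    alpha[4] * ex_lag1 + alpha[5] * ex_lag2 +
    alpha[6] * in_lag1 + alpha[7] * in_lag2 +
    rnorm(1, sd = sigma)
}

# Example: no recent food, activity, or insulin keeps glucose near mu = 100.
next_glucose(100, 0, 0, 0, 0, 0, 0)
```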

We define the state vector for patient $i$ at time $t$ to contain average blood glucose, total dietary intake, and total physical activity measured over previous time intervals; we include blood glucose and physical activity for the previous two time intervals and dietary intake for the previous four time intervals. Let $n$ denote the number of patients and $T$ denote the number of time points per patient. Our choices for $n$ and $T$ are based on what is feasible for an mHealth outpatient study (dietary data were collected on two days by Maahs et al., 2012). For each replication, the optimal treatment regime is estimated with V-learning using three different types of basis functions and GGQ. The generative model for insulin treatment is not assumed to be known and we estimate it using logistic regression. We record mean outcomes in an independent sample of 100 patients followed for 100 time points with treatments generated according to each estimated optimal regime. Simulation results (estimated values under each regime and Monte Carlo standard errors along with observed values) are found in Table 2. Again, V-learning with Gaussian basis functions performs the best out of all methods, generally producing large values and small standard errors. V-learning with the linear model underperforms, and GGQ underperforms even more.

Table 2:

Monte Carlo value estimates for simulated T1D cohorts with γ = 0.9.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 −2.716 (1.2015) −2.335 (0.9818) −2.018 (1.2011) −3.870 (0.9225) −2.316
36 −2.700 (1.2395) −2.077 (1.0481) −1.760 (0.8468) −3.644 (0.8745) −2.261
48 −2.496 (1.1986) −2.236 (1.1978) −1.751 (0.9887) −2.405 (1.1025) −2.365
50 24 −2.545 (1.1865) −2.069 (1.0395) −1.605 (0.8064) −3.368 (1.0186) −2.263
36 −2.644 (1.1719) −2.004 (0.9074) −1.778 (0.8496) −3.099 (0.9722) −2.336
48 −2.469 (1.1635) −2.073 (0.9870) −2.102 (1.2078) −2.528 (0.9571) −2.308
100 24 −2.350 (1.1171) −2.128 (1.0520) −1.612 (0.7203) −3.272 (0.8636) −2.299
36 −2.547 (1.1852) −2.116 (0.8518) −1.672 (0.8643) −3.232 (0.7951) −2.321
48 −2.401 (1.0643) −2.204 (1.0400) −1.494 (0.5413) −2.820 (0.8442) −2.351

5.2. Online simulations

In practice, it may be useful for patients to follow a dynamic treatment regime that is updated as new data are collected. Here we consider a hypothetical study wherein $n$ patients are followed for an initial period of $T'$ time points, an optimal policy is estimated, and patients are followed for an additional $T - T'$ time points with the estimated optimal policy being continuously updated. At each time point $t \geq T'$, actions are taken according to the most recently estimated policy. Recall that V-learning produces a randomized decision rule from which to sample actions at each time point. When selecting an action based on a GGQ policy, we incorporate an $\epsilon$-greedy strategy by selecting the action recommended by the estimated policy with probability $1 - \epsilon$ and otherwise randomly selecting one of the other actions. At the $t$th estimation, we use $\epsilon = 0.5^t$, allowing $\epsilon$ to decrease over time to reflect increasing confidence in the estimated policy. A burn-in period of 50 time points is discarded to ensure that we are sampling from a stationary distribution. We estimate the first policy after 12 time points and a new policy is estimated every 6 time points thereafter. After $T$ time points, we estimate the value as the average utility over all patients and all time points after the initial period.

Table 3 presents mean outcomes under policies estimated online using data generated according to the simple two covariate generative model introduced at the beginning of Section 5.1. There is some variability across n and T regarding which type of basis function is best, but V-learning with a polynomial basis generally produces the best outcomes. GGQ performs well in large samples.

Table 3:

Value estimates for online simulations with γ = 0.9.

n T Linear VL Polynomial VL Gaussian VL GGQ
25 24 0.0053 0.0149 −0.0100 −0.0081
36 0.0525 0.0665 0.0310 0.0160
48 0.0649 0.0722 0.0416 0.0493
50 24 0.0164 0.0117 0.0037 0.0058
36 0.0926 0.0791 0.0666 0.0227
48 0.1014 0.0894 0.0512 0.0434
100 24 0.0036 −0.0157 0.0200 0.0239
36 0.0766 0.0626 0.0907 0.0540
48 0.0728 0.0781 0.0608 0.0818

Next, we study the performance of online V-learning in simulated mHealth studies of type 1 diabetes by following the generative model described in (8). Mean outcomes are found in Table 4. Gaussian V-learning performs the best out of all methods. Across all variants of V-learning, outcomes improve with increased follow-up time.

Table 4:

Value estimates for online estimation of simulated T1D cohorts with γ = 0.9

n T Linear VL Polynomial VL Gaussian VL GGQ
25 24 −2.3887 −1.9713 −1.8860 −3.2027
36 −2.3784 −2.1535 −1.7857 −3.5127
48 −2.2190 −2.0679 −1.6999 −3.2280
50 24 −2.3405 −2.2313 −1.7761 −2.8976
36 −2.2829 −2.0922 −1.6016 −3.1589
48 −2.1587 −1.9669 −1.5948 −2.8729
100 24 −2.3229 −2.2295 −1.9138 −3.0865
36 −2.2927 −2.1608 −1.9030 −3.3483
48 −2.2096 −2.0454 −1.8252 −2.9428

Finally, we consider online simulations using individualized policies as outlined at the end of Section 3. Consider the simple two covariate generative model introduced above, but let state variables evolve according to $S_{i,1t} = \mu_i(2A_{it-1} - 1)S_{i,1t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{1t}$ and $S_{i,2t} = \mu_i(1 - 2A_{it-1})S_{i,2t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{2t}$, where $\mu_i$ is a subject-specific term drawn uniformly between 0.4 and 0.9. Including $\mu_i$ ensures that the optimal policy differs across patients. Table 5 contains mean outcomes for online simulation where a universal policy is estimated using data from all patients and where individualized policies are estimated using only a single patient's data. Because data are generated in such a way that the optimal policy varies across patients, individualized policies achieve better outcomes than universal policies.

Table 5:

Value estimates for online V-learning simulations with universal and patient-specific policies when γ = 0.9.

n T Universal policy Patient-specific policy
25 24 0.0282 0.1813
36 0.1025 0.1700
48 0.0977 0.1944
50 24 0.0164 0.2771
36 0.0768 0.2617
48 0.0752 0.3038
100 24 0.0160 0.4230
36 0.0960 0.2970
48 0.1140 0.3197

6. Case study: Type 1 diabetes

Machine learning is currently under consideration in type 1 diabetes through studies to build and test a “closed loop” system that joins continuous blood glucose monitoring and subcutaneous insulin infusion through an underlying algorithm. Known as the artificial pancreas, this technology has been shown to be safe in preliminary studies and is making headway from small hospital-based safety studies to large-scale outpatient effectiveness studies (Ly et al., 2014, 2015). Despite the success of the artificial pancreas, the rate of uptake may be limited and widespread use may not occur for many years (Kowalski, 2015). The proposed method may be useful for implementing mHealth interventions for use alongside the artificial pancreas or before it is widely available.

Studies have shown that data on food intake and physical activity to inform optimal decision making can be collected in an inpatient setting (see, e.g., Cobry et al., 2010; Wolever and Mullan, 2011). However, Maahs et al. (2012) demonstrated that rich data on the effect of food intake and physical activity can be collected in an outpatient setting using mobile technology. Here, we apply the proposed methodology to the observational data collected by Maahs et al. (2012).

The full data consist of N = 31 patients with type 1 diabetes, aged 12–18. Glucose levels were monitored using continuous glucose monitoring and physical activity tracked using accelerometers for five days. Dietary data were self-reported by the patient in telephone-based interviews for two days. Patients were treated using either an insulin pump or multiple daily insulin injections. We use data on a subset of n = 14 patients treated with an insulin pump for whom full follow-up is available on days when dietary information was recorded. This represents 28 patient-days of data, with which we use V-learning to estimate an optimal treatment policy. An advantage of mHealth is the ability to collect data passively, limiting the amount of missing data. There is no intermittent missingness in this data set.

The setup closely follows the simulation experiments in Section 5.1. Patient state at each time, $t$, is taken to be average glucose level and total counts of physical activity over the two previous 60 minute intervals and total food intake in grams over the four previous 60 minute intervals. The goal is to learn a policy to determine when to administer insulin injections based on prior blood glucose, dietary intake, and physical activity. The utility at time $t$ is a weighted sum of glycemic events over the 60 minutes preceding and following time $t$ with weights defined in Section 5.1. A treatment regime with large value will minimize the number of hypo- and hyperglycemic episodes weighted to reflect the clinical importance of each. We note that because $V(\pi, s; \theta_\pi)$ is linear in $\theta_\pi$, we can evaluate $\hat{V}_{n,\hat{R}}(\pi)$ with only the mean of $\Phi(S)$ under $R$. These were estimated from the data. Because we cannot simulate data following a given policy to estimate its value, we report the parametric value estimate $\hat{V}_{n,\hat{R}}(\hat{\pi}_n)$. Interpreting the parametric value estimate is difficult because of the effect the discount factor has on estimated value. We cannot compare parametric value estimates to mean outcomes observed in the data. Instead, we use $\mathbb{E}_n\sum_{t \geq 0}\gamma^t U_t$ as an estimate of value under the observational policy.

We estimate optimal treatment strategies for two different action spaces. In the first, the only decision made at each time is whether or not to administer an insulin injection, i.e., the action space contains a single binary action. In the second, the action space contains all possible combinations of insulin injection, physical activity, and food intake. This corresponds to a hypothetical mHealth intervention where insulin injections are administered via an insulin pump and suggestions for physical activity and food intake are administered via a mobile app.

Table 6 contains parametric value estimates for policies estimated using V-learning for the two action spaces outlined above with different basis functions and discount factors. These results indicate that improvements in glycemic control can come from personalized and dynamic treatment strategies that account for food intake and physical activity. Improvement results from a dynamic insulin regimen (binary action space), and, in most cases, further improvement results from a comprehensive mHealth intervention including suggestions for diet and exercise delivered via mobile app in addition to insulin therapy (multiple action space). When considering multiple actions, the policy estimated using a polynomial basis and γ = 0.7 achieves a 64% increase in value and the policy estimated using a Gaussian basis and γ = 0.8 achieves a 68% increase in value over the observational policy. Although the small sample size is a weakness of this study, these results represent a significant improvement in value despite the sample size.

Table 6:

Parametric value estimates for V-learning applied to type 1 diabetes data.

Action space Basis γ = 0.7 γ = 0.8 γ = 0.9
Binary Linear −6.20 −9.35 −15.99
Polynomial −3.91 −9.03 −17.50
Gaussian −3.44 −13.09 −25.52
Multiple Linear −6.47 −9.92 −0.49
Polynomial −2.44 −6.80 −14.48
Gaussian −8.45 −3.58 −21.18
Observational policy −6.77 −11.28 −21.79

Finally, we use an example hyperglycemic patient to illustrate how an estimated policy would be applied in practice. One patient in the data presented at a specific time with an average blood glucose of 229 mg/dL over the previous hour and an average blood glucose of 283 mg/dL over the hour before that. The policy estimated with γ = 0.7 and a polynomial basis recommends each action according to the probabilities in Table 7. Because this patient presented with blood glucose levels that are higher than the optimal range, the policy recommends actions that would lower the patient's blood glucose levels, assigning a probability of 0.79 to insulin and a probability of 0.21 to insulin combined with activity.

Table 7:

Probabilities for each action as recommended by estimated policy for one example patient.

Action Probability
No action < 0.0001
Physical activity < 0.0001
Food intake < 0.0001
Food and activity < 0.0001
Insulin 0.7856
Insulin and activity 0.2143
Insulin and food 0.0002
Insulin, food, and activity < 0.0001

7. Conclusion

The emergence of mHealth has provided great potential for the estimation and implementation of dynamic treatment regimes. Mobile technologies can be used both in the collection of rich longitudinal data to inform decision making and in the delivery of deeply tailored interventions. The proposed method, V-learning, addresses a number of challenges associated with estimating dynamic treatment regimes in mHealth applications. V-learning directly estimates a policy which maximizes the value over a class of policies and requires minimal assumptions on the data-generating process. Furthermore, V-learning permits estimation of a randomized decision rule which can be used in place of existing strategies (e.g., ϵ-greedy) to encourage exploration in online estimation. A randomized decision rule can also provide patients with multiple treatment options. Estimation of an optimal policy for different populations can be handled through the use of different reference distributions.

V-learning and mobile technologies have the potential to improve patient outcomes in a variety of clinical areas. We have demonstrated, for example, that the proposed method can be used to estimate treatment regimes to reduce the number of hypo- and hyperglycemic episodes in patients with type 1 diabetes. The proposed method could also be useful for other mHealth applications as well as applications outside of mHealth. For example, V-learning could be used to estimate dynamic treatment regimes for chronic illnesses using electronic health records data. Future research in this area may include increasing flexibility through use of a semiparametric model for the state-value function. Alternatively, nonlinear models for the state-value function may be informed by underlying theory or mathematical models of the system of interest. Data-driven selection of tuning parameters for the proposed method may help to improve performance. Developing theory for alternative penalty functions, such as the LASSO penalty, is another important step. Accounting for patient availability and feasibility of a sequence of treatments can be done by setting constraints on the class of policies. This will ensure that the resulting mHealth intervention is able to be implemented and that the recommended decisions are consistent with domain knowledge.

It would also be worthwhile to generalize our asymptotic results to permit nonstationarity. While we believe that stationarity is generally a reasonable assumption for moderate stretches of time—including when we do online estimation using moderately large batches of observations wherein both the patient dynamics and treatment policy remain approximately constant over each batch—stationarity would not in general hold when either the patient dynamics or treatment policy change more rapidly. For example, online estimation with treatment policy changes after each observation could induce nonstationarity. We conjecture that our asymptotic results will continue to hold in this setting, as our simulation studies in Section 5.2 seem to indicate.

8. Acknowledgments

We thank the editor, associate editor, and reviewers for helpful comments which led to a significantly improved paper.

Appendix

Proofs

Proof of Lemma 2.1. Let $\pi$ be an arbitrary policy and $\gamma \in (0, 1)$ a fixed constant. Suppose we observe a state $S_t = s_t$ at time $t$ and let $\bar{a}_{t-1} = (a_1, \ldots, a_{t-1})$ be the sequence of actions resulting in $S_t = s_t$, i.e., $S_t^*(\bar{a}_{t-1}) = s_t$. Let $\bar{a}^{k+1} = (a_t, \ldots, a_{t+k}) \in \mathcal{A}^{k+1}$ be a potential sequence of actions taken from time $t$ to time $t + k$. We have that

$$\begin{aligned} V(\pi, s_t) &= \sum_{k \geq 0}\gamma^k E\{U_{t+k}^*(\pi) \mid S_t = s_t\} \\ &= \sum_{k \geq 0}\gamma^k E\left(\sum_{\bar{a}_{t+k} \in \mathcal{A}^{t+k}} U_{t+k}^*(\bar{a}_{t+k}) \prod_{v=t}^{t+k} 1\left[\xi_{\pi v}\{S_v^*(\bar{a}_{v-1})\} = a_v\right] \,\middle|\, S_t = s_t\right) \\ &= \sum_{k \geq 0}\gamma^k \sum_{\bar{a}^{k+1} \in \mathcal{A}^{k+1}} U_{t+k}^*(\bar{a}_{t-1}, \bar{a}^{k+1})\left\{\prod_{v=t}^{t+k} E\left(1\left[\xi_{\pi v}\{S_v^*(\bar{a}_{v-1})\} = a_v\right] \,\middle|\, S_t = s_t\right)\right\} \\ &= \sum_{k \geq 0}\gamma^k \sum_{\bar{a}^{k+1} \in \mathcal{A}^{k+1}} U_{t+k}^*(\bar{a}_{t-1}, \bar{a}^{k+1})\prod_{v=t}^{t+k}\pi\{a_v; S_v^*(\bar{a}_{v-1})\}\prod_{v=t}^{t+k}\frac{\mu_v\{a_v; S_v^*(\bar{a}_{v-1})\}}{\mu_v\{a_v; S_v^*(\bar{a}_{v-1})\}} \\ &= \sum_{k \geq 0}\gamma^k E\left[U_{t+k}\left\{\prod_{v=0}^{k}\frac{\pi(A_{t+v}; S_{t+v})}{\mu_{t+v}(A_{t+v}; S_{t+v})}\right\} \,\middle|\, S_t = s_t\right], \end{aligned}$$

where we let π(at; st) = 0 for all at and st whenever t > T*(π). The last equality uses the consistency and strong ignorability assumptions.

Proof of Theorem 4.2. Proof of part 1: We first note that $\theta_0^\pi$ must solve

$$0 = E\left(\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\left[U_t + \{\gamma\Phi(S_{t+1}) - \Phi(S_t)\}^\top\theta_\pi\right]\Phi(S_t)\right),$$

or

$$E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\Phi(S_t)\{\Phi(S_t) - \gamma\Phi(S_{t+1})\}^\top\right]\theta_\pi = E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}U_t\Phi(S_t)\right\},$$

which is equivalent to $w_1(\pi)\theta_\pi = w_2(\pi)$, where $w_2(\pi) = E\{\pi(A_t; S_t)\mu_t(A_t; S_t)^{-1}U_t\Phi(S_t)\}$. We have that

$$\left\|E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}U_t\Phi(S_t)\right\}\right\| \leq c_0^{-1}\left(E|U_t|^2\right)^{1/2}\left(E\|\Phi(S_t)\|^2\right)^{1/2} < \infty,$$

by assumption 3, part 1 of assumption 4, and the Cauchy-Schwarz inequality. Let $c \in \mathbb{R}^q$ be arbitrary and note that

$$E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)\Phi(S_{t+1})^\top c\right\} \leq \left[E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)^{\otimes 2}c\right\}E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_{t+1})^{\otimes 2}c\right\}\right]^{1/2},$$

by the Cauchy-Schwarz inequality, where $u^{\otimes 2} = uu^\top$. This implies that

$$c^\top w_1(\pi)c \geq E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)^{\otimes 2}c\right\} - E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)^{\otimes 2}c\right\}^{1/2}E\left\{\gamma^2\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_{t+1})^{\otimes 2}c\right\}^{1/2} = A - A^{1/2}B^{1/2} = A^{1/2}\left(A^{1/2} - B^{1/2}\right) = \frac{A^{1/2}(A - B)}{A^{1/2} + B^{1/2}},$$

where we simplify notation by defining $A = E\{\pi(A_t; S_t)\mu_t(A_t; S_t)^{-1}c^\top\Phi(S_t)^{\otimes 2}c\}$ and $B = E\{\gamma^2\pi(A_t; S_t)\mu_t(A_t; S_t)^{-1}c^\top\Phi(S_{t+1})^{\otimes 2}c\}$. We have that

$$A^{1/2} + B^{1/2} \leq c_0^{-1/2}\|c\|\left\{E\|\Phi(S_t)\|^2\right\}^{1/2} + c_0^{-1/2}\|c\|\left\{E\|\Phi(S_{t+1})\|^2\right\}^{1/2} = 2c_0^{-1/2}\|c\|\left\{E\|\Phi(S_t)\|^2\right\}^{1/2} < \infty,$$

by the Cauchy-Schwarz inequality, the fact that $E\|\Phi(S_t)\|^2 = E\|\Phi(S_{t+1})\|^2$ by time-homogeneity, and part 1 of assumption 4. Also, $A \geq A - B$ and $A - B \geq c_1\|c\|^2$ by assumption 5. Thus,

$$A - A^{1/2}B^{1/2} \geq \frac{c_1^{3/2}\|c\|^3}{2c_0^{-1/2}\|c\|\left\{E\|\Phi(S_t)\|^2\right\}^{1/2}} = \frac{c_0^{1/2}c_1^{3/2}\|c\|^2}{2\left\{E\|\Phi(S_t)\|^2\right\}^{1/2}},$$

which finally implies that $w_1(\pi)$ is invertible and thus $\theta_0^\pi = w_1(\pi)^{-1}w_2(\pi)$ is well-defined uniformly over $\pi \in \Pi$. Using the fact that $c^\top w_1(\pi)c \geq k_0\|c\|^2$ for a constant $k_0 > 0$, we can show that $\|w_1(\pi)^{-1}\| \leq k_1^{-1}$ for some constant $k_1 > 0$, where $\|\cdot\|$ is the usual matrix norm when applied to a matrix. Therefore, $\|\theta_0^\pi\| \leq k_1^{-1}\|w_2(\pi)\| \leq c_0^{-1}k_1^{-1}\{E(U_t)^2\}^{1/2}\{E\|\Phi(S_t)\|^2\}^{1/2} < \infty$. Finally, it follows from assumptions 5 and 7 that $\sup_{\|\beta_1 - \beta_2\| \leq \delta}\|\theta_0^{\pi_{\beta_1}} - \theta_0^{\pi_{\beta_2}}\| \to 0$ as $\delta \downarrow 0$.

Proof of part 2: Define

$$\mathcal{G} = \left\{\Phi(s_t)\Phi(s_t)^\top/\mu_t(a_t; s_t),\ \gamma\Phi(s_t)\Phi(s_{t+1})^\top/\mu_t(a_t; s_t),\ u_t\Phi(s_t)/\mu_t(a_t; s_t)\right\}.$$

Let $G$ be an envelope for $\mathcal{G}$, for example $G(s_{t+1}, a_t, s_t) = \max_{g \in \mathcal{G}}\|g(s_{t+1}, a_t, s_t)\|$. By part 1 of assumption 4, $EG^{3\rho} < \infty$. Part 4 of Lemma 8.1 below gives us that $\mathcal{G}$ is Donsker. Since $\Pi$ satisfies $J_{[]}\{\infty, \Pi, L_{3\rho}(P)\} < \infty$, we have that

$$\mathcal{F}_1 = \left\{\Omega^{1/2}\frac{\pi(a_t; s_t)}{\mu_t(a_t; s_t)}\Phi(s_t)\{\Phi(s_t) - \gamma\Phi(s_{t+1})\}^\top : \pi \in \Pi\right\}$$

satisfies $J_{[]}\{\infty, \mathcal{F}_1, L_{3\rho}(P)\} < \infty$ by parts 1 and 2 of Lemma 8.1 below. Moreover, $F(s_{t+1}, a_t, s_t) = \|\Phi(s_t)\|\cdot\|\Phi(s_t) - \gamma\Phi(s_{t+1})\|/\mu_t(a_t; s_t)$ is an envelope for $\mathcal{F}_1$ with $EF^{3\rho} < \infty$ by assumption 3 and part 1 of assumption 4. Thus, $\mathcal{F}_1$ is Donsker. Let

$$\mathcal{F}_2 = \left\{\Omega^{1/2}\frac{\pi(a_t; s_t)}{\mu_t(a_t; s_t)}u_t\Phi(s_t) : \pi \in \Pi\right\}.$$

Similar arguments yield that F2 is Donsker.

Now, let A^(π)={Enf1π:f1πF1} and B^(π)={Enf2π:f2πF2}. Let A^(π)=A^(π)+λnA^(π). We have that θ^nπ=A^(π)1B^(π). Thus,

n(θ^nπθ0π)=n{A^(π)1B^(π)A^(π)1A^(π)θ0π}+oP(1)=A^(π)1n{B^(π)A^(π)θ0π}+oP(1)=A^(π)1n{B^(π)A^(π)θ0π}+A^(π)1n{A^(π)A^(π)}θ0π+oP(1)=A^(π)1n{B^(π)A^(π)θ0π}+oP(1)

where oP(1) doesn’t depend on π, because A^(π)1Pw1(π)1< uniformly over π ∈ Π by assumption 3 and part 1 of assumption 4, supπΠθ0π< by part 1 of this theorem, and n{A^(π)A^(π)}=nλnA^(π)=oP(1) because λn = oP(n−1/2). Using arguments similar to those in the previous paragraph, one can show that F3={f2πf1πθ:f1πF1,f2πF2,πΠ,θB} is Donsker, where B* is any finite collection of elements of q. By part 1 of this theorem, there exists a bounded, closed set B0 such that θ0πB0 for all π ∈ Π. Let Gn(π,θ)=n(EnE)(f2πf1πθ). Note that

supπΠGn(π,θ1)Gn(π,θ2)supπΠn(EnE)f1πθ1θ2Rθ1θ2,

where R* = OP(1) by the Donsker property of F1 and R* doesn’t depend on π. Thus, Gn(π,θ) is stochastically equicontinuous on B0. Combined with the Donsker property of F3 for arbitrary B*, we have that the class F4={f2πf1πθ:f1πF1,f2πF2,πΠ,θB0} is Donsker. Using Slutsky’s Theorem, Theorem 11.24 of Kosorok (2008), the fact that F1 is Glivenko-Cantelli, and the fact that θ0π=(Ef1π)1Ef2π, we have that n(θ^0πθ0π)=A^(π)1Gn(π,θ0π)w1(π)1G0(π), in (Π) where G0(π) is a mean zero Gaussian process indexed by Π with covariance E{G0(π1)G0(π2)}=w0(π1,π2).

Proof of part 3: We have that

$$
\sqrt{n}\left\{ \hat{V}_{n,\hat{R}}(\pi) - V_R(\pi) \right\} = \sqrt{n}\, \mathbb{E}_n \Phi(S_t)^{\mathsf{T}}\left( \hat{\theta}_n^\pi - \theta_0^\pi \right) \rightsquigarrow \nu^{\mathsf{T}} w_1(\pi)^{-1}\mathbb{G}_0(\pi)
$$

in $\ell^\infty(\Pi)$ by Slutsky's theorem.
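
Continuing the hypothetical sketch above (same caveats: illustrative names only, Ω taken as the identity), the plug-in value estimate is simply the average of $\Phi(s)^{\mathsf{T}}\hat{\theta}_n^\pi$ over states drawn from the reference distribution, and a policy search over a parametric class $\{\pi_\beta : \beta \in B\}$ maximizes this quantity over β:

```python
import numpy as np

def value_estimate(Phi_ref, theta_hat):
    """Plug-in value estimate: mean of Phi(s)^T theta_hat over reference states s."""
    return float(np.mean(np.asarray(Phi_ref) @ np.asarray(theta_hat)))

# Hypothetical policy search over a grid of candidate beta values, where
# policy_prob(beta, S, A) would return pi_beta(A_t; S_t) for the observed data:
# best_beta = max(beta_grid, key=lambda b: value_estimate(
#     Phi_ref, solve_theta(Phi, Phi_next, U, policy_prob(b, S, A), mu_prob)))
```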

Proof of Theorem 4.3. Proof of part 1: Following part 3 of Theorem 4.2, we have that $\sup_{\beta \in B}\left| \hat{V}_{n,\hat{R}}(\pi_\beta) - V_R(\pi_\beta) \right| \to_P 0$. Combining this with the unique and well-separated maximum condition (assumption 6), continuity of $V_R(\pi_\beta)$ in β, and Theorem 2.12 of Kosorok (2008) yields the result in part 1. Part 2 follows from part 1 of this theorem, part 1 of Theorem 4.2, and the continuous mapping theorem. Part 3 follows from parts 2 and 3 of Theorem 4.2. The proof of part 4 follows standard arguments. ☐

Lemma 8.1. Let $\mathcal{F}$ and $\mathcal{G}$ be function classes with respective envelopes $F$ and $G$. Let $\|F\|_u = (E|F|^u)^{1/u}$. For any $1 \le r, s_1, s_2 \le \infty$ with $s_1^{-1} + s_2^{-1} = 1$:

  1. $J_{[]}\{\infty, \mathcal{F}\mathcal{G}, L_r(P)\} \le 2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)\left[ J_{[]}\{\infty, \mathcal{F}, L_{rs_1}(P)\} + J_{[]}\{\infty, \mathcal{G}, L_{rs_2}(P)\} \right]$.

  2. $J_{[]}\{\infty, \mathcal{F} + \mathcal{G}, L_r(P)\} \le 2\left[ J_{[]}\{\infty, \mathcal{F}, L_r(P)\} + J_{[]}\{\infty, \mathcal{G}, L_r(P)\} \right]$.

  3. For any $0 < r \le \infty$, $J_{[]}\{\infty, \mathcal{F} \cup \mathcal{G}, L_r(P)\} \le 2\sqrt{\log 2}\left( \|F\|_r + \|G\|_r \right) + J_{[]}\{\infty, \mathcal{F}, L_r(P)\} + J_{[]}\{\infty, \mathcal{G}, L_r(P)\}$.

  4. If $\mathcal{G}$ is a finite class, $J_{[]}\{\infty, \mathcal{G}, L_r(P)\} \le 2\|G\|_r \sqrt{\log|\mathcal{G}|}$, where $|\mathcal{G}|$ denotes the cardinality of $\mathcal{G}$.

Proof of Lemma 8.1. Proof of part 1: Let $1 \le r, s_1, s_2 \le \infty$ with $s_1^{-1} + s_2^{-1} = 1$ and let $(\ell_F, u_F)$ and $(\ell_G, u_G)$ be $L_{rs_1}(P)$ and $L_{rs_2}(P)$ $\epsilon$-brackets, respectively. Choose $\ell_F \le f_1, f_2 \le u_F$ and $\ell_G \le g_1, g_2 \le u_G$ and consider the bracket for $f_2 g_2$ defined by $f_1 g_1 \pm \left( F|u_G - \ell_G| + G|u_F - \ell_F| \right)$. Note that $f_1 g_1 + F|u_G - \ell_G| + G|u_F - \ell_F| - f_2 g_2 \ge F|u_G - \ell_G| + G|u_F - \ell_F| - F|g_1 - g_2| - G|f_1 - f_2| \ge 0$, because $f_2 g_2 - f_1 g_1 = f_2 g_2 - f_2 g_1 + f_2 g_1 - f_1 g_1 \le F|g_1 - g_2| + G|f_1 - f_2|$. Similarly, $f_2 g_2 + F|u_G - \ell_G| + G|u_F - \ell_F| - f_1 g_1 \ge 0$. Thus, these brackets hold all products $f_2 g_2$ for $f_2 \in (\ell_F, u_F)$ and $g_2 \in (\ell_G, u_G)$. Now, $\left\| F|u_G - \ell_G| + G|u_F - \ell_F| \right\|_r \le \|F\|_{rs_1}\epsilon + \|G\|_{rs_2}\epsilon$ by Minkowski's inequality and Hölder's inequality, and it follows that

$$
N_{[]}\left\{ 2\epsilon\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right), \mathcal{F}\mathcal{G}, L_r(P) \right\} \le N_{[]}\left\{ \epsilon, \mathcal{F}, L_{rs_1}(P) \right\} N_{[]}\left\{ \epsilon, \mathcal{G}, L_{rs_2}(P) \right\}.
$$

Next we note that

$$
N_{[]}\left\{ \epsilon, \mathcal{F}\mathcal{G}, L_r(P) \right\} \le N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{F}, L_{rs_1}(P) \right\} N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{G}, L_{rs_2}(P) \right\}
$$

and thus

$$
\begin{aligned}
J_{[]}\left\{ \infty, \mathcal{F}\mathcal{G}, L_r(P) \right\} &\le \int_0^{2\|F\|_{rs_1}\|G\|_{rs_2}} \sqrt{\log N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{F}, L_{rs_1}(P) \right\}}\, d\epsilon \\
&\quad + \int_0^{2\|F\|_{rs_1}\|G\|_{rs_2}} \sqrt{\log N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{G}, L_{rs_2}(P) \right\}}\, d\epsilon \\
&\le 2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)\left[ J_{[]}\left\{ \infty, \mathcal{F}, L_{rs_1}(P) \right\} + J_{[]}\left\{ \infty, \mathcal{G}, L_{rs_2}(P) \right\} \right].
\end{aligned}
$$

The proof of part 2 follows from Lemma 9.25, part (i), of Kosorok (2008) after a change of variables. Proof of part 3: First note that

$$
N_{[]}\left\{ \epsilon, \mathcal{F} \cup \mathcal{G}, L_r(P) \right\} \le N_{[]}\left\{ \epsilon, \mathcal{F}, L_r(P) \right\} + N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\},
$$

whence it follows that

$$
\begin{aligned}
J_{[]}\left\{ \infty, \mathcal{F} \cup \mathcal{G}, L_r(P) \right\} &= \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log N_{[]}\left\{ \epsilon, \mathcal{F} \cup \mathcal{G}, L_r(P) \right\}}\, d\epsilon \\
&\le \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log\left[ N_{[]}\left\{ \epsilon, \mathcal{F}, L_r(P) \right\} + N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\} \right]}\, d\epsilon \\
&\le \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log 2 + \log N_{[]}\left\{ \epsilon, \mathcal{F}, L_r(P) \right\} + \log N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\}}\, d\epsilon \\
&\le \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log 2}\, d\epsilon + J_{[]}\left\{ \infty, \mathcal{F}, L_r(P) \right\} + J_{[]}\left\{ \infty, \mathcal{G}, L_r(P) \right\},
\end{aligned}
$$

where the second inequality uses the fact that a + b ≤ 2ab for all a, b ≥ 1.

Proof of part 4: If $\mathcal{G}$ is finite, then $N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\} \le |\mathcal{G}|$. Thus,

$$
J_{[]}\left\{ \infty, \mathcal{G}, L_r(P) \right\} = \int_0^{2\|G\|_r} \sqrt{\log N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\}}\, d\epsilon \le \int_0^{2\|G\|_r} \sqrt{\log|\mathcal{G}|}\, d\epsilon = 2\|G\|_r\sqrt{\log|\mathcal{G}|},
$$

which completes the proof.

Lemma 8.2. Define the class of functions

$$
\Pi = \left\{ \pi_{\tilde\beta}(a; s) = \frac{a_J + \sum_{j=1}^{J-1} a_j \exp(s^{\mathsf{T}}\beta_j)}{1 + \sum_{j=1}^{J-1} \exp(s^{\mathsf{T}}\beta_j)} : \tilde\beta = (\beta_1, \ldots, \beta_{J-1}),\ \tilde\beta \in B \subset \mathbb{R}^{p(J-1)} \right\}
$$

for a compact set $B$ and $2 \le J < \infty$, where $a = (a_1, \ldots, a_J)$. Then there exists $b_0 < \infty$ such that, for any $1 \le r \le \infty$, $J_{[]}\{\infty, \Pi, L_r(P)\} \le b_0\|S\|_r\sqrt{p(J-1)\pi}$, which is finite whenever $\|S\|_r < \infty$. Furthermore, $\sup_{\|\tilde\beta_1 - \tilde\beta_2\| \le \delta} E\left| \pi_{\tilde\beta_1}(A; S) - \pi_{\tilde\beta_2}(A; S) \right| \to 0$ as $\delta \to 0$.

Proof of Lemma 8.2. For $\tilde\beta_1, \tilde\beta_2 \in B$, define $d(\tilde\beta_1, \tilde\beta_2) = \max_{1 \le j \le J-1}\|\tilde\beta_{1j} - \tilde\beta_{2j}\|$ and $b_0 = \sup_{\tilde\beta_1, \tilde\beta_2 \in B}\|\tilde\beta_1 - \tilde\beta_2\| < \infty$, which is finite because $B$ is compact. By the mean value theorem, for any $\tilde\beta_1, \tilde\beta_2 \in B$, there exists a point $\tilde\beta^*$ on the line segment between $\tilde\beta_1$ and $\tilde\beta_2$ such that

$$
\pi_{\tilde\beta_1}(a; s) - \pi_{\tilde\beta_2}(a; s) = \frac{1}{1 + \sum_{j=1}^{J-1}\exp(s^{\mathsf{T}}\tilde\beta^*_j)}\left[ \sum_{j=1}^{J-1}\left\{ a_j - \pi_{\tilde\beta^*}(a; s) \right\}\exp(s^{\mathsf{T}}\tilde\beta^*_j)\, s^{\mathsf{T}}(\tilde\beta_{1j} - \tilde\beta_{2j}) \right],
$$

which implies that

$$
\left| \pi_{\tilde\beta_1}(a; s) - \pi_{\tilde\beta_2}(a; s) \right| \le \|s\|\, d(\tilde\beta_1, \tilde\beta_2). \tag{9}
$$

It follows from equation (9) that assumption 7 holds for this particular class of policies. Now, $N_{[]}\{2\epsilon\|S\|_r, \Pi, L_r(P)\} \le N(\epsilon, B, d)$ by Theorem 9.23 of Kosorok (2008). Furthermore, $N(\epsilon, B, d) \le \max\{(b_0/\epsilon)^{p(J-1)}, 1\}$, and thus

$$
\begin{aligned}
J_{[]}\left\{ \infty, \Pi, L_r(P) \right\} &\le 2\|S\|_r \int_0^{b_0} \sqrt{p(J-1)\left\{ \log b_0 + \log(1/\epsilon) \right\}}\, d\epsilon \\
&\le 2\|S\|_r b_0 \sqrt{p(J-1)} \int_0^1 \sqrt{\log(1/\epsilon)}\, d\epsilon \\
&= 2\|S\|_r b_0 \sqrt{p(J-1)} \int_0^\infty u^{1/2}\exp(-u)\, du = \|S\|_r b_0 \sqrt{p(J-1)\pi},
\end{aligned}
$$

which proves the result.
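
As a concrete illustration of this policy class, the following Python sketch (hypothetical dimensions, parameter values, and function names) evaluates π_β̃(a; s) in the multinomial logit form above and numerically spot-checks the Lipschitz bound in equation (9) on random draws.

```python
import numpy as np

rng = np.random.default_rng(1)
p, J = 3, 4                                   # state dimension and number of actions (illustrative)

def pi_beta(beta, a_idx, s):
    """Multinomial logit policy: beta has shape (J-1, p); a_idx in {0, ..., J-1}."""
    logits = s @ beta.T                       # (J-1,) vector of s^T beta_j
    denom = 1.0 + np.exp(logits).sum()
    probs = np.append(np.exp(logits), 1.0) / denom   # last entry is the reference action a_J
    return probs[a_idx]

# Numerical check of |pi_{b1} - pi_{b2}| <= ||s|| * max_j ||b1_j - b2_j|| from equation (9).
for _ in range(1000):
    s = rng.normal(size=p)
    b1, b2 = rng.normal(size=(J - 1, p)), rng.normal(size=(J - 1, p))
    a = rng.integers(J)
    lhs = abs(pi_beta(b1, a, s) - pi_beta(b2, a, s))
    rhs = np.linalg.norm(s) * np.max(np.linalg.norm(b1 - b2, axis=1))
    assert lhs <= rhs + 1e-12
```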

Additional simulation results

Table 8:

Monte Carlo value estimates for offline simulations with γ = 0.9, where P is the LASSO penalty and Ω is the identity.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.123 (0.0773) 0.117 (0.1067) 0.128 (0.0960) 0.025 (0.0330) −0.005
36 0.117 (0.0900) 0.120 (0.0933) 0.138 (0.0992) 0.030 (0.0341) −0.004
48 0.122 (0.0782) 0.103 (0.1002) 0.141 (0.0878) 0.028 (0.0301) 0.000
50 24 0.109 (0.0727) 0.122 (0.0954) 0.153 (0.0692) 0.028 (0.0321) −0.005
36 0.137 (0.0782) 0.127 (0.1061) 0.141 (0.0816) 0.024 (0.0285) 0.003
48 0.110 (0.0761) 0.127 (0.0860) 0.147 (0.0778) 0.029 (0.0347) 0.000
100 24 0.125 (0.0802) 0.129 (0.0854) 0.164 (0.0609) 0.027 (0.0289) −0.001
36 0.151 (0.0739) 0.148 (0.0822) 0.131 (0.0897) 0.025 (0.0356) −0.002
48 0.131 (0.0726) 0.132 (0.0814) 0.169 (0.0666) 0.030 (0.0325) −0.001

Table 9:

Monte Carlo value estimates for offline simulations with γ = 0.9, where P is the Euclidean norm penalty and Ω is the inverse Fisher information.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.136 (0.0862) 0.118 (0.0925) 0.153 (0.0785) 0.022 (0.0272) −0.005
36 0.147 (0.0768) 0.132 (0.1057) 0.124 (0.0909) 0.026 (0.0355) −0.004
48 0.128 (0.0897) 0.146 (0.0826) 0.116 (0.1067) 0.020 (0.0317) 0.000
50 24 0.113 (0.0954) 0.129 (0.0964) 0.123 (0.1052) 0.027 (0.0275) −0.005
36 0.116 (0.0973) 0.149 (0.0940) 0.152 (0.0798) 0.029 (0.0289) 0.003
48 0.109 (0.0899) 0.132 (0.0998) 0.124 (0.0932) 0.024 (0.0285) 0.000
100 24 0.167 (0.0652) 0.155 (0.0743) 0.144 (0.0897) 0.025 (0.0291) −0.002
36 0.167 (0.0731) 0.153 (0.0988) 0.155 (0.0851) 0.027 (0.0311) −0.002
48 0.137 (0.0868) 0.175 (0.0615) 0.148 (0.0978) 0.026 (0.0332) −0.001

Table 10:

Monte Carlo value estimates for offline simulations with γ = 0.9, where P is the LASSO penalty and Ω is the inverse Fisher information.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.123 (0.0750) 0.123 (0.0912) 0.140 (0.0930) 0.025 (0.0301) −0.005
36 0.139 (0.0780) 0.110 (0.1013) 0.138 (0.0813) 0.024 (0.0344) −0.004
48 0.135 (0.0690) 0.110 (0.1177) 0.143 (0.0718) 0.023 (0.0282) 0.000
50 24 0.118 (0.0705) 0.124 (0.0994) 0.137 (0.0802) 0.030 (0.0287) −0.006
36 0.117 (0.0827) 0.123 (0.0972) 0.121 (0.0804) 0.030 (0.0292) 0.003
48 0.128 (0.0807) 0.113 (0.1085) 0.137 (0.0921) 0.023 (0.0282) 0.000
100 24 0.131 (0.0563) 0.123 (0.1015) 0.167 (0.0472) 0.029 (0.0295) −0.001
36 0.132 (0.0735) 0.148 (0.0851) 0.161 (0.0670) 0.029 (0.0334) −0.002
48 0.149 (0.0612) 0.137 (0.1003) 0.156 (0.0687) 0.023 (0.0267) −0.001

Contributor Information

Daniel J. Luckett, Department of Biostatistics, University of North Carolina at Chapel Hill.

Eric B. Laber, Department of Statistics, North Carolina State University.

Anna R. Kahkoska, Department of Nutrition, University of North Carolina at Chapel Hill.

David M. Maahs, Department of Pediatrics, Stanford University.

Elizabeth Mayer-Davis, Department of Nutrition, University of North Carolina at Chapel Hill.

Michael R. Kosorok, Department of Biostatistics, University of North Carolina at Chapel Hill.

References

  1. Ali AA, Hossain SM, Hovsepian K, Rahman MM, Plarre K, and Kumar S (2012). mPuff: Automated detection of cigarette smoking puffs from respiration measurements. In Proceedings of the 11th International Conference on Information Processing in Sensor Networks, pp. 269–280. ACM.
  2. Bergenstal RM, Garg S, Weinzimer SA, Buckingham BA, Bode BW, Tamborlane WV, and Kaufman FR (2016). Safety of a hybrid closed-loop insulin delivery system in patients with type 1 diabetes. Journal of the American Medical Association 316 (13), 1407–1408.
  3. Bexelius C, Löf M, Sandin S, Lagerros YT, Forsum E, and Litton J-E (2010). Measures of physical activity using cell phones: Validation using criterion methods. Journal of Medical Internet Research 12 (1), e2.
  4. Chakraborty B and Moodie EE (2013). Statistical Methods for Dynamic Treatment Regimes. Springer.
  5. Cobry E, McFann K, Messer L, Gage V, VanderWel B, Horton L, and Chase HP (2010). Timing of meal insulin boluses to achieve optimal postprandial glycemic control in patients with type 1 diabetes. Diabetes Technology & Therapeutics 12 (3), 173–177.
  6. Dai Y-H (2002). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization 13 (3), 693–701.
  7. Doya K (2000). Reinforcement learning in continuous time and space. Neural Computation 12 (1), 219–245.
  8. Ertefaie A (2014). Constructing dynamic treatment regimes in infinite-horizon settings. arXiv preprint arXiv:1406.0764.
  9. Free C, Phillips G, Watson L, Galli L, Felix L, Edwards P, Patel V, and Haines A (2013). The effectiveness of mobile-health technologies to improve health care service delivery processes: A systematic review and meta-analysis. PLoS Medicine 10 (1), e1001363.
  10. Haller MJ, Stalvey MS, and Silverstein JH (2004). Predictors of control of diabetes: Monitoring may be the key. The Journal of Pediatrics 144 (5), 660–661.
  11. Hastie T, Tibshirani R, and Friedman JH (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
  12. Hernan MA and Robins JM (2010). Causal Inference. Boca Raton, FL: CRC Press.
  13. Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A, and Murphy SA (2015). Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34 (S), 1220.
  14. Kober J and Peters J (2012). Reinforcement learning in robotics: A survey. In Reinforcement Learning, pp. 579–610. Springer.
  15. Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. New York: Springer.
  16. Kosorok MR and Moodie EE (2015). Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine, Volume 21. SIAM.
  17. Kowalski A (2015). Pathway to artificial pancreas systems revisited: Moving downstream. Diabetes Care 38 (6), 1036–1043.
  18. Kumar S, Nilsen WJ, Abernethy A, Atienza A, Patrick K, Pavel M, Riley WT, Shar A, Spring B, Spruijt-Metz D, et al. (2013). Mobile health technology evaluation: The mHealth evidence workshop. American Journal of Preventive Medicine 45 (2), 228–236.
  19. Laber EB, Linn KA, and Stefanski LA (2014). Interactive model building for Q-learning. Biometrika 101 (4), 831.
  20. Lai TL and Robbins H (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), 4–22.
  21. Levine B-S, Anderson BJ, Butler DA, Antisdel JE, Brackett J, and Laffel LM (2001). Predictors of glycemic control and short-term adverse outcomes in youth with type 1 diabetes. The Journal of Pediatrics 139 (2), 197–203.
  22. Liao P, Klasnja P, Tewari A, and Murphy SA (2016). Sample size calculations for micro-randomized trials in mHealth. Statistics in Medicine 35 (12), 1944–1971.
  23. Linn KA, Laber EB, and Stefanski LA (2017). Interactive Q-learning for quantiles. Journal of the American Statistical Association 112 (518), 638–649.
  24. Long N, Gianola D, Rosa GJ, Weigel KA, Kranis A, and Gonzalez-Recio O (2010). Radial basis function regression methods for predicting quantitative traits using SNP markers. Genetics Research 92 (3), 209–225.
  25. Ly TT, Breton MD, Keith-Hynes P, De Salvo D, Clinton P, Benassi K, Mize B, Chernavvsky D, Place J, Wilson DM, et al. (2014). Overnight glucose control with an automated, unified safety system in children and adolescents with type 1 diabetes at diabetes camp. Diabetes Care 37 (8), 2310–2316.
  26. Ly TT, Roy A, Grosman B, Shin J, Campbell A, Monirabbasi S, Liang B, von Eyben R, Shanmugham S, Clinton P, et al. (2015). Day and night closed-loop control using the integrated Medtronic hybrid closed-loop system in type 1 diabetes at diabetes camp. Diabetes Care 38 (7), 1205–1211.
  27. Maahs DM, Mayer-Davis E, Bishop FK, Wang L, Mangan M, and McMurray RG (2012). Outpatient assessment of determinants of glucose excursions in adolescents with type 1 diabetes: Proof of concept. Diabetes Technology & Therapeutics 14 (8), 658–664.
  28. Maei HR, Szepesvári C, Bhatnagar S, and Sutton RS (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 719–726.
  29. Moodie EE, Richardson TS, and Stephens DA (2007). Demystifying optimal dynamic treatment regimes. Biometrics 63 (2), 447–455.
  30. Murphy S, Deng Y, Laber E, Maei H, Sutton R, and Witkiewitz K (2016). A batch, off-policy, actor-critic algorithm for optimizing the average reward. arXiv preprint arXiv:1607.05047.
  31. Murphy SA (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B 65 (2), 331–355.
  32. Murphy SA (2005). A generalization error for Q-learning. Journal of Machine Learning Research 6 (Jul), 1073–1097.
  33. Murphy SA, van der Laan MJ, and Robins JM (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96 (456), 1410–1423.
  34. Puterman ML (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  35. Quinn CC, Shardell MD, Terrin ML, Barr EA, Ballew SH, and Gruber-Baldini AL (2011). Cluster-randomized trial of a mobile phone personalized behavioral intervention for blood glucose control. Diabetes Care 34 (9), 1934–1942.
  36. R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  37. Robins JM (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium on Biostatistics, pp. 189–326. Springer.
  38. Rubin D (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics 6 (1), 34–58.
  39. Schulte PJ, Tsiatis AA, Laber EB, and Davidian M (2014). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science 29 (4), 640–661.
  40. Steinhubl SR, Muse ED, and Topol EJ (2013). Can mobile health technologies transform health care? Journal of the American Medical Association 310 (22), 2395–2396.
  41. Sutton R and Barto A (1998). Reinforcement Learning: An Introduction. The MIT Press.
  42. Tang Y and Kosorok MR (2012). Developing adaptive personalized therapy for cystic fibrosis using reinforcement learning. The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series, Working Paper 30.
  43. Weinzimer SA, Steil GM, Swan KL, Dziura J, Kurtz N, and Tamborlane WV (2008). Fully automated closed-loop insulin delivery versus semiautomated hybrid control in pediatric patients with type 1 diabetes using an artificial pancreas. Diabetes Care 31 (5), 934–939.
  44. Wolever T and Mullan Y (2011). Sugars and fat have different effects on postprandial glucose responses in normal and type 1 diabetic subjects. Nutrition, Metabolism and Cardiovascular Diseases 21 (9), 719–725.
  45. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 (4), 1010–1018.
  46. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 (3), 681–694.
  47. Zhao Y, Kosorok MR, and Zeng D (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28 (26), 3294–3315.
  48. Ziegler R, Heidtmann B, Hilgard D, Hofer S, Rosenbauer J, and Holl R (2011). Frequency of SMBG correlates with HbA1c and acute complications in children and adolescents with type 1 diabetes. Pediatric Diabetes 12 (1), 11–17.
