
Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning

Daniel J Luckett 1,*, Eric B Laber 2, Anna R Kahkoska 3, David M Maahs 4, Elizabeth Mayer-Davis 5, Michael R Kosorok 6

Abstract

The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best possible health care for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient's health status in real-time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, most existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show that the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes.

Keywords: Markov decision processes, Precision medicine, Reinforcement learning, Type 1 diabetes

1. Introduction

The use of mobile devices in clinical care, called mobile health (mHealth), provides an effective and scalable platform to assist patients in managing their illness (Free et al., 2013; Steinhubl et al., 2013). Advantages of mHealth interventions include real-time communication between a patient and their health-care provider as well as systems for delivering training, teaching, and social support (Kumar et al., 2013). Mobile technologies can also be used to collect rich longitudinal data to estimate optimal dynamic treatment regimes and to deliver treatment that is deeply tailored to each individual patient. We propose a new estimator of an optimal treatment regime that is suitable for use with longitudinal data collected in mHealth applications.

A dynamic treatment regime provides a framework to administer individualized treatment over time through a series of decision rules. Dynamic treatment regimes have been well-studied in the statistical and biomedical literature (Murphy, 2003; Robins, 2004; Moodie et al., 2007; Kosorok and Moodie, 2015; Chakraborty and Moodie, 2013) and furthermore, statistical considerations in mHealth have been studied by, for example, Liao et al. (2016) and Klasnja et al. (2015). Although mobile technology has been successfully utilized in clinical areas such as diabetes (Quinn et al., 2011; Maahs et al., 2012), smoking cessation (Ali et al., 2012), and obesity (Bexelius et al., 2010), mHealth poses some unique challenges that preclude direct application of existing methodologies for dynamic treatment regimes. For example, mHealth applications typically have no definite time horizon in the sense that treatment decisions are made continually throughout the life of the patient with no fixed time point for the final treatment decision; estimation of an optimal treatment strategy must utilize data collected over a much shorter time period than that over which treatment would be applied in practice; the momentary signal may be weak and may not directly measure the outcome of interest; and estimation of optimal treatment strategies must be done online as data accumulate.

This work is motivated in part by our involvement in a study of mHealth as a management tool for type 1 diabetes. Type 1 diabetes is an autoimmune disease wherein the pancreas produces insufficient levels of insulin, a hormone needed to regulate blood glucose concentration. Patients with type 1 diabetes are continually engaged in management activities including monitoring glucose levels, timing and dosing insulin injections, and regulating diet and physical activity. Increased glucose monitoring and attention to self-management facilitate more frequent treatment adjustments and have been shown to improve patient outcomes (Levine et al., 2001; Haller et al., 2004; Ziegler et al., 2011). Thus, patient outcomes have the potential to be improved by diabetes management tools which are deeply tailored to the continually evolving health status of each patient. Mobile technologies can be used to collect data on physical activity, glucose, and insulin at a fine granularity in an outpatient setting (Maahs et al., 2012). There is great potential for using these data to create comprehensive and accessible mHealth interventions for clinical use. We envision application of this work for use before the artificial pancreas (Weinzimer et al., 2008; Kowalski, 2015; Bergenstal et al., 2016) becomes widely available.

In our motivating example as well as other mHealth applications, the goal is to treat a chronic disease over the long term. However, data will typically be collected over a short time period. Because the time frame of data collection is much shorter than the time frame of application, standard methods for longitudinal data analysis, such as generalized estimating equations or mixed models, cannot be used. We assume that the data collected in the field consist of a sample from a stationary Markov process, which allows us to estimate a dynamic treatment regime that will lead to good outcomes over the long term using data collected over a much shorter time period.

The sequential decision making process can be modeled as a Markov decision process (Puterman, 2014) and the optimal treatment regime can be estimated using reinforcement learning algorithms such as Q-learning (Murphy, 2005; Zhao et al., 2009; Tang and Kosorok, 2012; Schulte et al., 2014). Ertefaie (2014) proposed a variant of greedy gradient Q-learning (GGQ) to estimate optimal dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010). In GGQ, the form of the estimated Q-function dictates the form of the estimated optimal treatment regime. Thus, one must choose between a parsimonious model for the Q-function at the risk of model misspecification or a complex Q-function that yields unintelligible treatment regimes. Furthermore, GGQ requires modeling a non-smooth function of the data, which creates complications (Laber et al., 2014; Linn et al., 2017). Applications of mHealth require methods that can both estimate a policy from a fixed sample of retrospective data (offline estimation) and estimate a policy that is updated as data accumulate (online estimation). Online estimation has been given considerable attention in the field of reinforcement learning, particularly in engineering applications (Doya, 2000; Kober and Peters, 2012), with special consideration given to algorithms that provide fast updates in situations where new data accumulate multiple times per second. In our mHealth examples, treatment decisions may be made multiple times per day or even every hour or minute; however, rapid updating of the estimated policy at the scale of engineering applications is not needed. Therefore, we are able to focus on different aspects of the problem without needing to ensure very fast estimation. We propose an alternative estimation method for infinite horizon dynamic treatment regimes that is suited to mHealth applications. Our approach, which we call V-learning, involves estimating the optimal policy among a prespecified class of policies (Zhang et al., 2012, 2013). It requires minimal assumptions about the data-generating process and permits estimating a randomized decision rule that can be implemented online as data accumulate.

In Section 2, we describe the setup and present our method for offline estimation using data from a micro-randomized trial or observational study. In Section 3, we extend our method for application to online estimation with accumulating data. Theoretical results, including consistency and asymptotic normality of the proposed estimators, are presented in Section 4. We compare the proposed method to GGQ using simulated data in Section 5. A case study using data from patients with type 1 diabetes is presented in Section 6 and we conclude with a discussion in Section 7. Proofs of technical results are in the Appendix.

2. Offline estimation from observational data

We assume that the available data are $\{(S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT_i}, A_{iT_i}, S_{iT_i+1})\}_{i=1}^n$, which comprise $n$ independent, identically distributed trajectories $(S_1, A_1, S_2, \ldots, S_T, A_T, S_{T+1})$, where: $S_t \in \mathbb{R}^p$ denotes a summary of patient information collected up to and including time $t$; $A_t \in \mathcal{A}$ denotes the treatment assigned at time $t$; and $T \in \mathbb{Z}^+$ denotes the (possibly random) patient follow-up time. In the motivating example of type 1 diabetes, $S_t$ could contain a patient's blood glucose, dietary intake, and physical activity in the hour leading up to time $t$ and $A_t$ could denote an indicator that an insulin injection is taken at time $t$. We assume that the data-generating model is a time-homogeneous Markov process so that $S_{t+1} \perp (A_{t-1}, S_{t-1}, \ldots, A_1, S_1) \mid (A_t, S_t)$ and the conditional density $p(s_{t+1} \mid a_t, s_t)$ is the same for all $t \geq 1$. Let $L_t \in \{0, 1\}$ denote an indicator that the patient is still in follow-up at time $t$, i.e., $L_t = 1$ if the patient is being followed at time $t$ and zero otherwise. We assume that $L_t$ is contained in $S_t$ so that $P(L_{t+1} = 1 \mid A_t, S_t, \ldots, A_1, S_1) = P(L_{t+1} = 1 \mid A_t, S_t)$. It is not necessary for time points to be evenly spaced or homogeneous across patients. For example, we can define the decision times as those times at which data are observed and include time since the previous observation in $S_t$. Defining the decision times in this way allows us to effectively handle intermittent missing data, and thus we can assume that $L_t = 0$ implies $L_{t+1} = 0$ with probability one. Furthermore, we assume a known utility function $u: \mathbb{R}^p \times \mathcal{A} \times \mathbb{R}^p \to \mathbb{R}$ so that $U_t = u(S_{t+1}, A_t, S_t)$ measures the 'goodness' of choosing treatment $A_t$ in state $S_t$ and subsequently transitioning to state $S_{t+1}$. In our motivating example, the utility at time $t$ could be a measure of how infrequently the patient's blood glucose concentration deviates from the optimal range over the hour preceding and following time $t$. The goal is to select treatments to maximize expected cumulative utility; treatment selection is formalized using a treatment regime (Schulte et al., 2014; Kosorok and Moodie, 2015) and the utility associated with any regime is defined using potential outcomes (Rubin, 1978).

Let $\mathcal{B}(\mathcal{A})$ denote the space of probability distributions over $\mathcal{A}$. A treatment regime in this context is a function $\pi: \operatorname{dom} S_t \to \mathcal{B}(\mathcal{A})$ so that, under $\pi$, a decision maker presented with state $S_t = s_t$ at time $t$ will select action $a_t \in \mathcal{A}$ with probability $\pi(a_t; s_t)$. Define $\bar{a}_t = (a_1, \ldots, a_t) \in \mathcal{A}^t$ and $\bar{a} = (a_1, a_2, \ldots) \in \mathcal{A}^\infty$. The set of potential outcomes is

$$W^* = \left\{S_1, S_2^*(a_1), \ldots, S_{T^*(\bar{a})}^*\left(\bar{a}_{T^*(\bar{a})-1}\right) : T^*(\bar{a}) = \inf\{t \geq 1 : L_t^*(\bar{a}_{t-1}) = 0\},\ \bar{a} \in \mathcal{A}^\infty\right\},$$

where $S_t^*(\bar{a}_{t-1})$ is the potential state and $L_t^*(\bar{a}_{t-1})$ is the potential follow-up status at time $t$ under treatment sequence $\bar{a}_{t-1}$. Thus, the potential utility at time $t$ is $U_t^*(\bar{a}_t) = u\{S_{t+1}^*(\bar{a}_t), a_t, S_t^*(\bar{a}_{t-1})\}$. For any $\pi$, define $\{\xi_{\pi t}(\cdot)\}_{t \geq 1}$ to be a sequence of independent, $\mathcal{A}$-valued stochastic processes indexed by $\operatorname{dom} S_t$ such that $P\{\xi_{\pi t}(s_t) = a_t\} = \pi(a_t; s_t)$. The potential follow-up time under $\pi$ is

$$T^*(\pi) = \sum_{t \geq 1} \sum_{\bar{a}_t \in \mathcal{A}^t} t\, 1\left\{\sup_{\underline{a}_{t+1}} T^*(\bar{a}_t, \underline{a}_{t+1}) = t\right\} \prod_{\upsilon=1}^{t} 1\left[\xi_{\pi \upsilon}\{S_\upsilon^*(\bar{a}_{\upsilon-1})\} = a_\upsilon\right],$$

where $\underline{a}_{t+1} = (a_{t+1}, a_{t+2}, \ldots)$. The potential utility under $\pi$ at time $t$ is

$$U_t^*(\pi) = \begin{cases} \displaystyle\sum_{\bar{a}_t \in \mathcal{A}^t} U_t^*(\bar{a}_t) \prod_{\upsilon=1}^{t} 1\left[\xi_{\pi \upsilon}\{S_\upsilon^*(\bar{a}_{\upsilon-1})\} = a_\upsilon\right], & \text{if } T^*(\pi) \geq t, \\ 0, & \text{otherwise}, \end{cases}$$

where $S_1^*(\bar{a}_0) = S_1$. Thus, utility is set to zero after a patient is lost to follow-up. However, in certain situations, utility may be constructed so as to take a negative value at the time point when the patient is lost to follow-up, e.g., if the patient discontinues treatment because of a negative effect associated with the intervention. Define the state-value function $V(\pi, s_t) = E\{\sum_{k \geq 0} \gamma^k U_{t+k}^*(\pi) \mid S_t = s_t\}$ (Sutton and Barto, 1998), where $\gamma \in (0,1)$ is a fixed constant that captures the trade-off between short- and long-term outcomes. For any distribution $R$ on $\operatorname{dom} S_1$, define the value function with respect to reference distribution $R$ as $V_R(\pi) = \int V(\pi, s)\, dR(s)$; throughout, we assume that this reference distribution is fixed. The reference distribution can be thought of as a distribution of initial states and we estimate it from the data in the implementation in Sections 5 and 6. For a prespecified class of regimes, $\Pi$, the optimal regime, $\pi_R^{opt} \in \Pi$, satisfies $V_R(\pi_R^{opt}) \geq V_R(\pi)$ for all $\pi \in \Pi$. The goal is to estimate $\pi_R^{opt}$ using data collected from $n$ patients, where patient $i$ is followed for $T_i$ time points, $i = 1, \ldots, n$. Thus, $T_i$ represents the number of treatment decisions made for patient $i$ in the observed data; however, because the observed data are assumed to come from a time-homogeneous Markov chain, the estimated policy could be applied in a population indefinitely.

To construct an estimator of $\pi_R^{opt}$, we make a series of assumptions that connect the potential outcomes in $W^*$ with the data-generating model.

Assumption 1. Strong ignorability, $A_t \perp W^* \mid S_t$ for all $t$.

Assumption 2. Consistency, $S_t = S_t^*(\bar{A}_{t-1})$ for all $t$ and $T = T^*(\bar{A})$.

Assumption 3. Positivity, there exists $c_0 > 0$ so that $P(A_t = a_t \mid S_t = s_t) \geq c_0$ for all $a_t \in \mathcal{A}$, $s_t \in \operatorname{dom} S_t$, and all $t$.

In addition, we implicitly assume that there is no interference among the experimental units. These assumptions are common in the context of estimating dynamic treatment regimes (Robins, 2004; Hernan and Robins, 2010; Schulte et al., 2014). Assumption 1 implies that there are no unmeasured confounders and assumptions 1 and 3 hold by construction in a micro-randomized trial (Klasnja et al., 2015; Liao et al., 2016).

Let μt(at; st) = P(At = at|St = st) for each t ≥ 1. In a micro-randomized trial, μt(at; st) is a known randomization probability; in an observational study, it must be estimated from the data. The following lemma characterizes VR(π) for any regime, π, in terms of the data-generating model (see also Lemma 4.1 of Murphy et al., 2001). A proof is provided in the appendix.

Lemma 2.1. Let π denote an arbitrary regime and γ ∈ (0, 1) a discount factor. Then, under assumptions 1–3 and provided interchange of the sum and integration is justified, the state-value function of π at st is

$$V(\pi, s_t) = \sum_{k \geq 0} E\left[\gamma^k U_{t+k} \left\{\prod_{\upsilon=0}^{k} \frac{\pi(A_{\upsilon+t}; S_{\upsilon+t})}{\mu_{\upsilon+t}(A_{\upsilon+t}; S_{\upsilon+t})}\right\} \,\middle|\, S_t = s_t\right]. \tag{1}$$

The preceding result will form the basis for an estimating equation for VR(π). Write the right hand side of (1) as

$$V(\pi, S_t) = E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\left(U_t + \gamma \sum_{k \geq 0} E\left[\gamma^k U_{t+k+1}\left\{\prod_{\upsilon=0}^{k} \frac{\pi(A_{\upsilon+t+1}; S_{\upsilon+t+1})}{\mu_{\upsilon+t+1}(A_{\upsilon+t+1}; S_{\upsilon+t+1})}\right\} \,\middle|\, S_{t+1}\right]\right) \,\middle|\, S_t\right\} = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\{U_t + \gamma V(\pi, S_{t+1})\} \,\middle|\, S_t\right],$$

from which it follows that

$$0 = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\{U_t + \gamma V(\pi, S_{t+1}) - V(\pi, S_t)\} \,\middle|\, S_t\right].$$

Subsequently, for any function ψ defined on dom St, the state-value function satisfies

$$0 = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\{U_t + \gamma V(\pi, S_{t+1}) - V(\pi, S_t)\}\psi(S_t)\right], \tag{2}$$

which is an importance-weighted variant of the well-known Bellman optimality equation (Sutton and Barto, 1998).

Let $V(\pi, s; \theta_\pi)$ denote a model for $V(\pi, s)$ indexed by $\theta_\pi \in \Theta \subseteq \mathbb{R}^q$. We assume that the map $\theta_\pi \mapsto V(\pi, s; \theta_\pi)$ is differentiable everywhere for each fixed $s$ and $\pi$. Let $\nabla_{\theta_\pi} V(\pi, s; \theta_\pi)$ denote the gradient of $V(\pi, s; \theta_\pi)$ with respect to $\theta_\pi$ and define

$$\Lambda_n(\pi, \theta_\pi) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T_i} \frac{\pi(A_{it}; S_{it})}{\mu_t(A_{it}; S_{it})}\left\{U_{it} + \gamma V(\pi, S_{it+1}; \theta_\pi) - V(\pi, S_{it}; \theta_\pi)\right\}\nabla_{\theta_\pi} V(\pi, S_{it}; \theta_\pi). \tag{3}$$

Given a positive definite matrix $\Omega \in \mathbb{R}^{q \times q}$ and penalty function $\mathcal{P}: \mathbb{R}^q \to \mathbb{R}^+$, define $\hat{\theta}_n^\pi = \arg\min_{\theta_\pi \in \Theta}\{\Lambda_n(\pi, \theta_\pi)^\top \Omega \Lambda_n(\pi, \theta_\pi) + \lambda_n \mathcal{P}(\theta_\pi)\}$, where $\lambda_n$ is a tuning parameter. Subsequently, $V(\pi, s; \hat{\theta}_n^\pi)$ is the estimated state-value function under $\pi$ in state $s$. Thus, given a reference distribution, $R$, the estimated value of a regime, $\pi$, is $\hat{V}_{n,R}(\pi) = \int V(\pi, s; \hat{\theta}_n^\pi)\, dR(s)$ and the estimated optimal regime is $\hat{\pi}_n = \arg\max_{\pi \in \Pi} \hat{V}_{n,R}(\pi)$. The idea of V-learning is to use estimating equation (3) to estimate the value of any policy and maximize estimated value over a class of policies; we will discuss strategies for this maximization below.

V-learning requires a parametric class of policies. Assuming that there are $K$ possible treatments, $a_1, \ldots, a_K$, we can define a parametric class of policies as follows. Define $\pi(a_j; s, \beta) = \exp(s^\top\beta_j)/\{1 + \sum_{k=1}^{K-1}\exp(s^\top\beta_k)\}$ for $j = 1, \ldots, K - 1$, and $\pi(a_K; s, \beta) = 1/\{1 + \sum_{k=1}^{K-1}\exp(s^\top\beta_k)\}$. This defines a class of randomized policies, $\Pi$, parametrized by $\beta = (\beta_1^\top, \ldots, \beta_{K-1}^\top)^\top$, where $\beta_k$ is a vector of parameters for the $k$-th treatment. Under a policy in this class defined by $\beta$, actions are selected stochastically according to the probabilities $\pi(a_j; s, \beta)$, $j = 1, \ldots, K$. In the case of a binary treatment, a policy in this class reduces to $\pi(1; s, \beta) = \exp(s^\top\beta)/\{1 + \exp(s^\top\beta)\}$ and $\pi(0; s, \beta) = 1/\{1 + \exp(s^\top\beta)\}$ for a $p \times 1$ vector $\beta$. This class of policies is used in the implementation in Sections 5 and 6.
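To make the policy class concrete, here is a minimal R sketch of these class probabilities (our own illustration, not the authors' code; `policy_probs`, `beta`, and `s` are hypothetical names, and the last action is taken as the softmax reference category):

```r
# Softmax (multinomial logit) policy pi(a_j; s, beta) over K actions.
# beta: p x (K - 1) matrix (or vector when K = 2); s: state vector of length p.
# For K = 2 this reduces to the logistic form pi(1; s, beta) = expit(s' beta).
policy_probs <- function(s, beta) {
  eta <- c(crossprod(beta, s), 0)                   # linear scores; reference action gets 0
  exp(eta - max(eta)) / sum(exp(eta - max(eta)))    # numerically stable softmax
}

# Example: binary treatment (K = 2) with a two-dimensional state
beta <- c(0.5, -1.0)                                # p = 2, single coefficient vector
policy_probs(s = c(1.2, -0.3), beta)                # probabilities for (a_1, a_2)
```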

V-learning also requires a class of models for the state-value function indexed by a parameter, $\theta_\pi$. We use a basis function approximation (Hastie et al., 2009; Long et al., 2010). Let $\Phi = (\phi_1, \ldots, \phi_q)$ be a vector of prespecified basis functions and let $\Phi(s_{it}) = \{\phi_1(s_{it}), \ldots, \phi_q(s_{it})\}^\top$. Let $V(\pi, s_{it}; \theta_\pi) = \Phi(s_{it})^\top\theta_\pi$. Under this working model,

$$\Lambda_n(\pi, \theta_\pi) = \left[n^{-1}\sum_{i=1}^{n}\sum_{t=1}^{T_i} \frac{\pi(A_{it}; S_{it})}{\mu_t(A_{it}; S_{it})}\left\{\gamma\Phi(S_{it})\Phi(S_{it+1})^\top - \Phi(S_{it})\Phi(S_{it})^\top\right\}\right]\theta_\pi + n^{-1}\sum_{i=1}^{n}\sum_{t=1}^{T_i}\left\{\frac{\pi(A_{it}; S_{it})}{\mu_t(A_{it}; S_{it})}U_{it}\Phi(S_{it})\right\}. \tag{4}$$

Computational efficiency is gained from the linearity of V(π,sit;θπ) in θπ; flexibility can be achieved through the choice of Φ. We recommend Gaussian basis functions as they offer the highest degree of flexibility. However, along with Gaussian basis functions, we also examine the performance of V-learning using linear and polynomial basis functions in Sections 5 and 6 as these offer reasonable alternatives.
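Because (4) is linear in $\theta_\pi$, the objective $\Lambda_n(\pi, \theta_\pi)^\top\Omega\Lambda_n(\pi, \theta_\pi) + \lambda_n\mathcal{P}(\theta_\pi)$ has a closed-form minimizer under the $L_2$ penalty used later in Sections 4 and 5. A minimal R sketch under that choice (our code; the matrix and vector names are hypothetical, and rows index person-time observations):

```r
# Closed-form ridge solution for theta under the linear working model
# V(pi, s; theta) = Phi(s)' theta, for which Lambda_n(pi, theta) = M theta + b.
# Phi_now, Phi_next: basis matrices at S_it and S_it+1 (rows = person-time points);
# U: utilities; w: importance weights pi(A_it; S_it) / mu_t(A_it; S_it).
vlearn_theta <- function(Phi_now, Phi_next, U, w, gamma,
                         lambda = 1e-2, Omega = diag(ncol(Phi_now))) {
  n_obs <- nrow(Phi_now)        # scaling by rows rather than by n only rescales lambda
  M <- crossprod(Phi_now * w, gamma * Phi_next - Phi_now) / n_obs
  b <- colSums(Phi_now * (w * U)) / n_obs
  # minimizer of (M theta + b)' Omega (M theta + b) + lambda * theta' theta
  solve(crossprod(M, Omega %*% M) + lambda * diag(ncol(Phi_now)),
        -crossprod(M, Omega %*% b))
}
```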

The algorithm for V-learning is given in Algorithm 1 below. The algorithm can be terminated when $\|\beta^{k} - \beta^{k-1}\|$ is below some small threshold. The update in step 6 can be achieved with a variety of existing optimization methods. In our implementation, we use the BFGS algorithm (Dai, 2002) as implemented in the optim function in R software (R Core Team, 2016). Because the objective function is not necessarily convex, care must be taken when selecting the starting point, $\beta^1$ (step 2). In our implementation, we use simulated annealing as implemented in the optim function in R to find an appropriate starting point.

Algorithm 1: V-learning.
1 Initialize a class of policies, $\Pi = \{\pi_\beta : \beta \in B\}$, and a model, $V(\pi, s; \theta_\pi)$;
2 Set $k = 1$ and initialize $\beta^1$ to a starting value in $B$;
3 while Not converged do
4  Estimate $\hat{\theta}_n^{\pi_{\beta^k}} = \arg\min_{\theta_{\pi_{\beta^k}} \in \Theta}\{\Lambda_n(\pi_{\beta^k}, \theta_{\pi_{\beta^k}})^\top \Omega \Lambda_n(\pi_{\beta^k}, \theta_{\pi_{\beta^k}}) + \lambda_n \mathcal{P}(\theta_{\pi_{\beta^k}})\}$;
5  Evaluate $\hat{V}_{n,R}(\pi_{\beta^k}) = \int V(\pi_{\beta^k}, s; \hat{\theta}_n^{\pi_{\beta^k}})\, dR(s)$;
6  Set $\beta^{k+1} = \beta^k + \alpha_k \nabla_{\beta^k}\hat{V}_{n,R}(\pi_{\beta^k})$ for some step size, $\alpha_k$, where $\nabla_{\beta^k}\hat{V}_{n,R}(\pi_{\beta^k})$ is the gradient of $\hat{V}_{n,R}(\pi_{\beta^k})$ with respect to $\beta^k$;
7 end
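A compact R sketch of this loop, with the explicit gradient step in line 6 replaced by a call to `optim` (as in our implementation); it assumes the hypothetical helpers `policy_probs` and `vlearn_theta` sketched above, actions coded 1, ..., K with action K as the softmax reference, and `Phi_R` a basis matrix evaluated at draws from the reference distribution R:

```r
# Estimated value of pi_beta: importance weights, closed-form theta, then the
# average of Phi(s)' theta over draws from the reference distribution R.
vhat <- function(beta, S_now, A, mu, Phi_now, Phi_next, U, gamma, Phi_R) {
  w <- vapply(seq_len(nrow(S_now)),
              function(i) policy_probs(S_now[i, ], beta)[A[i]] / mu[i],
              numeric(1))
  theta <- vlearn_theta(Phi_now, Phi_next, U, w, gamma)
  mean(Phi_R %*% theta)
}

# Maximize vhat over beta (BFGS after a simulated-annealing start, as in the text):
# beta_hat <- optim(beta_start, vhat, method = "BFGS", control = list(fnscale = -1),
#                   S_now = S_now, A = A, mu = mu, Phi_now = Phi_now,
#                   Phi_next = Phi_next, U = U, gamma = 0.9, Phi_R = Phi_R)$par
```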

2.1. Greedy gradient Q-learning

Here we briefly discuss an existing method for infinite horizon dynamic treatment regimes, which will be used for comparison in the simulation studies in Section 5.

Ertefaie (2014) introduced greedy gradient Q-learning (GGQ) for estimating dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010; Murphy et al., 2016).

Define $Q^\pi(s_t, a_t) = E\{\sum_{k \geq 0}\gamma^k U_{t+k}^*(\pi) \mid S_t = s_t, A_t = a_t\}$. The Bellman optimality equation (Sutton and Barto, 1998) is

$$Q^{opt}(s_t, a_t) = E\left\{U_t + \gamma\max_{a \in \mathcal{A}} Q^{opt}(S_{t+1}, a) \,\middle|\, S_t = s_t, A_t = a_t\right\}. \tag{5}$$

Let $Q(s, a; \eta^{opt})$ be a parametric model for $Q^{opt}(s, a)$ indexed by $\eta^{opt} \in \mathcal{H} \subseteq \mathbb{R}^q$. In our implementation, we model $Q(s, a; \eta^{opt})$ as a linear function with interactions between all state variables and treatment. The Bellman optimality equation motivates the estimating equation

$$D_n(\eta^{opt}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T_i}\left\{U_{it} + \gamma\max_{a \in \mathcal{A}} Q(S_{it+1}, a; \eta^{opt}) - Q(S_{it}, A_{it}; \eta^{opt})\right\}\nabla_{\eta^{opt}} Q(S_{it}, A_{it}; \eta^{opt}). \tag{6}$$

For a positive definite matrix, $\Omega$, we estimate $\eta^{opt}$ using $\hat{\eta}_n^{opt} = \arg\min_{\eta \in \mathcal{H}} D_n(\eta)^\top\Omega D_n(\eta)$. The estimated optimal policy in state $s$ selects action $\hat{\pi}_n(s) = \arg\max_{a \in \mathcal{A}} Q(s, a; \hat{\eta}_n^{opt})$. This optimization problem is non-convex and non-differentiable in $\eta^{opt}$. However, it can be solved with a generalization of the greedy gradient Q-learning algorithm of Maei et al. (2010), and hence is referred to as GGQ by Ertefaie (2014) and in the following.
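To make the comparison concrete, a minimal R sketch of the GGQ objective $D_n(\eta)^\top\Omega D_n(\eta)$ with $\Omega = I$, for a linear Q-function with state-by-treatment interactions and a binary action (our own illustration; a direct numerical minimization is shown rather than the greedy gradient algorithm of Maei et al., 2010, and all names are hypothetical):

```r
# Linear Q-function with treatment interactions: Q(s, a; eta) = eta' x(s, a),
# where x(s, a) = (1, s, a, a * s) and a is binary.
q_features <- function(s, a) c(1, s, a, a * s)

ggq_objective <- function(eta, S_now, S_next, A, U, gamma) {
  p <- ncol(S_now)
  X <- t(vapply(seq_len(nrow(S_now)),
                function(i) q_features(S_now[i, ], A[i]), numeric(2 * p + 2)))
  q_obs  <- drop(X %*% eta)
  q_next <- vapply(seq_len(nrow(S_next)), function(i) {
    max(sum(eta * q_features(S_next[i, ], 0)),
        sum(eta * q_features(S_next[i, ], 1)))      # non-smooth max over actions
  }, numeric(1))
  D_n <- colMeans((U + gamma * q_next - q_obs) * X) # estimating equation (6)
  sum(D_n^2)                                        # D_n' Omega D_n with Omega = I
}
# eta_hat <- optim(rep(0, 2 * ncol(S_now) + 2), ggq_objective,
#                  S_now = S_now, S_next = S_next, A = A, U = U, gamma = 0.9)$par
```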

The performance of GGQ has been demonstrated in the context of chronic diseases with large sample sizes and a moderate number of time points. However, in mHealth applications, it is common to have small sample sizes and a large number of time points, with decisions occurring at a fine granularity. In GGQ, the estimated policy depends directly on $\hat{\eta}_n^{opt}$ and, therefore, depends on modeling the transition probabilities of the data-generating process. Furthermore, estimating equation (6) contains a non-smooth max operator, which makes estimation difficult without large amounts of data (Laber et al., 2014; Linn et al., 2017). V-learning only requires modeling the policy and the value function rather than the data-generating process and directly maximizes estimated value over a class of policies, thereby avoiding the non-smooth max operator in the estimating equation (compare equations (3) and (6)); these attributes may prove advantageous in mHealth settings.

3. Online estimation from accumulating data

Suppose we have accumulating data $\{(S_{i1}, A_{i1}, S_{i2}, \ldots)\}_{i=1}^n$, where $S_{it}$ and $A_{it}$ represent the state and action for patient $i = 1, \ldots, n$ at time $t \geq 1$. At each time $t$, we estimate an optimal policy in a class, $\Pi$, using data collected up to time $t$, take actions according to the estimated optimal policy, and estimate a new policy using the resulting states. Let $\hat{\pi}_n^t$ be the estimated policy at time $t$, i.e., $\hat{\pi}_n^t$ is estimated after observing state $S_{t+1}$ and before taking action $A_{t+1}$. If $\Pi$ is a class of randomized policies, we can select an action for a patient presenting with $S_{t+1} = s_{t+1}$ according to $\hat{\pi}_n^t(\cdot\,; s_{t+1})$, i.e., we draw $A_{t+1}$ according to the distribution $P(A_{t+1} = a) = \hat{\pi}_n^t(a; s_{t+1})$. If a class of deterministic policies is of interest, we can inject some randomness into $\hat{\pi}_n^t$ to facilitate exploration. One way to do this is an $\epsilon$-greedy strategy (Sutton and Barto, 1998), which selects the estimated optimal action with probability $1 - \epsilon$ and otherwise samples equally from all other actions. Because an $\epsilon$-greedy strategy can be used to introduce randomness into a deterministic policy, we can assume a class of randomized policies.
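A minimal R sketch of both action-selection strategies (our own illustration; `policy_probs` is the hypothetical softmax helper from Section 2 and `d` stands for a generic deterministic rule):

```r
# Draw A_{t+1} from the estimated randomized policy pi-hat_n^t(.; s_{t+1}).
draw_action <- function(s_next, beta_hat, actions = c(0, 1)) {
  sample(actions, size = 1, prob = policy_probs(s_next, beta_hat))
}

# Epsilon-greedy wrapper for a deterministic rule d(s): take d(s) with
# probability 1 - eps, otherwise sample uniformly from the remaining actions.
eps_greedy <- function(s_next, d, eps, actions = c(0, 1)) {
  a_star <- d(s_next)
  others <- setdiff(actions, a_star)
  if (runif(1) < 1 - eps) a_star else others[sample.int(length(others), 1)]
}
```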

At each time $t \geq 1$, let $\hat{\theta}_{n,t}^\pi = \arg\min_{\theta_\pi \in \Theta}\{\Lambda_{n,t}(\pi, \theta_\pi)^\top\Omega\Lambda_{n,t}(\pi, \theta_\pi) + \lambda_n\mathcal{P}(\theta_\pi)\}$, where $\Omega$, $\lambda_n$, and $\mathcal{P}$ are as defined in Section 2 and

$$\Lambda_{n,t}(\pi, \theta_\pi) = \frac{1}{n}\sum_{i=1}^{n}\sum_{\upsilon=1}^{t} \frac{\pi(A_{i\upsilon}; S_{i\upsilon})}{\hat{\pi}_n^{\upsilon-1}(A_{i\upsilon}; S_{i\upsilon})}\left\{U_{i\upsilon} + \gamma V(\pi, S_{i\upsilon+1}; \theta_\pi) - V(\pi, S_{i\upsilon}; \theta_\pi)\right\}\nabla_{\theta_\pi} V(\pi, S_{i\upsilon}; \theta_\pi) \tag{7}$$

with $\hat{\pi}_n^0$ some initial randomized policy. We note that estimating equation (7) is similar to (3), except that $\hat{\pi}_n^{\upsilon-1}$ replaces $\mu_\upsilon$ as the data-generating policy. Given the estimator of the value of $\pi$ at time $t$, $\hat{V}_{n,R,t}(\pi) = \int V(\pi, s; \hat{\theta}_{n,t}^\pi)\, dR(s)$, the estimated optimal policy at time $t$ is $\hat{\pi}_n^t = \arg\max_{\pi \in \Pi}\hat{V}_{n,R,t}(\pi)$. In practice, we may choose to update the policy in batches rather than at every time point. An alternative way to encourage exploration through the action space is to choose $\hat{\pi}_n^t = \arg\max_{\pi \in \Pi}\{\hat{V}_{n,R,t}(\pi) + \alpha_t\hat{\psi}_t(\pi)\}$ for some sequence $\alpha_t \downarrow 0$, where $\hat{\psi}_t(\pi)$ is a measure of uncertainty in $\hat{V}_{n,R,t}(\pi)$. An example of this is upper confidence bound sampling, or UCB (Lai and Robbins, 1985).

In some settings, when the data-generating process may vary across patients, it may be desirable to allow each patient to follow an individualized policy that is estimated using only that patient's data. Suppose that $n$ patients are followed for an initial $T_1$ time points, after which the policy $\hat{\pi}_n^1$ is estimated. Then, suppose that patient $i$ follows $\hat{\pi}_n^1$ until time $T_2$, when a policy $\hat{\pi}_i^2$ is estimated using only the states and actions observed for patient $i$. This procedure is then carried out until time $T_K$ for some fixed $K$, with each patient following their own individual policy which is adapted to match the individual over time. We may also choose to adapt the randomness of the policy at each estimation. For example, we could select $\epsilon_1 > \epsilon_2 > \cdots > \epsilon_K$ and, following estimation $k$, have patient $i$ follow policy $\hat{\pi}_i^k$ with probability $1 - \epsilon_k$ and policy $\hat{\pi}_n^1$ with probability $\epsilon_k$. In this way, patients become more likely to follow their own individualized policy and less likely to follow the initial policy over time, reflecting increasing confidence in the individualized policy as more data become available. The same class of policies and model for the state-value function can be used as in Section 2.

4. Theoretical results

In this section, we establish asymptotic properties of $\hat{\theta}_n^\pi$ and $\hat{\pi}_n$ for offline estimation. Because the proposed online estimation procedure involves performing offline estimation repeatedly in smaller batches, our theoretical results apply to each smaller batch as the number of observations increases. Developing more general theory for online estimation is an interesting topic for future research. Throughout, we assume assumptions 1–3 from Section 2.

Let $\hat{\theta}_n^\pi = \arg\min_{\theta_\pi \in \Theta}\{\Lambda_n(\pi, \theta_\pi)^\top\Omega\Lambda_n(\pi, \theta_\pi) + \lambda_n\theta_\pi^\top\theta_\pi\}$. Thus, we consider the special case where the penalty function is the squared Euclidean norm of $\theta_\pi$. We will assume that $\lambda_n = o_P(n^{-1/2})$. All of our results hold for any positive definite matrix, $\Omega$. Assume the working model for the state-value function introduced in Section 2, i.e., $V(\pi, s_{it}; \theta_\pi) = \Phi(s_{it})^\top\theta_\pi$. For fixed $\pi$, denote the true $\theta_\pi$ by $\theta_0^\pi$, i.e., $V(\pi, s) = \Phi(s)^\top\theta_0^\pi$. Let $\nu = \int\Phi(s)\, dR(s)$ so that $V_R(\pi) = \nu^\top\theta_0^\pi$. Define $\hat{V}_{n,\hat{R}}(\pi) = \{\mathbb{E}_n\Phi(S)\}^\top\hat{\theta}_n^\pi$, where $\mathbb{E}_n$ denotes the empirical measure of the observed data. Let $\Pi = \{\pi_\beta : \beta \in B\}$ be a parametric class of policies and let $\hat{\pi}_n = \pi_{\hat{\beta}_n}$, where $\hat{\beta}_n = \arg\max_{\beta \in B}\hat{V}_{n,\hat{R}}(\pi_\beta)$.

Our main results are summarized in Theorems 4.2 and 4.3 below. Because each patient trajectory is a stationary Markov chain, we need to use asymptotic theory based on stationary processes; consequently, some of the required technical conditions are more difficult to verify than those for i.i.d. data. Define the bracketing integral for a class of functions, $\mathcal{F}$, by $J_{[]}\{\delta, \mathcal{F}, L_r(P)\} = \int_0^\delta\sqrt{\log N_{[]}\{\epsilon, \mathcal{F}, L_r(P)\}}\, d\epsilon$, where the bracketing number for $\mathcal{F}$, $N_{[]}\{\epsilon, \mathcal{F}, L_r(P)\}$, is the number of $L_r(P)$ $\epsilon$-brackets needed such that each element of $\mathcal{F}$ is contained in at least one bracket (see Chapter 2 of Kosorok, 2008). For any stationary sequence of possibly dependent random variables, $\{X_t\}_{t \geq 1}$, let $\mathcal{M}_b^c$ be the $\sigma$-field generated by $X_b, \ldots, X_c$ and define $\zeta(k) = E[\sup_{m \geq 1}\{|P(B \mid \mathcal{M}_1^m) - P(B)| : B \in \mathcal{M}_{m+k}^\infty\}]$. We say that the chain $\{X_t\}_{t \geq 1}$ is absolutely regular if $\zeta(k) \to 0$ as $k \to \infty$ (also called $\beta$-mixing in Chapter 11 of Kosorok, 2008). We make the following assumptions.

Assumption 4. There exists a $2 < \rho < \infty$ such that

  1. $E|U_t|^{3\rho} < \infty$, $E\|\Phi(S_t)\|^{3\rho} < \infty$, and $E\|S_t\|^{3\rho} < \infty$.

  2. The sequence $\{(S_t, A_t)\}_{t \geq 1}$ is absolutely regular with $\sum_{k=1}^{\infty} k^{2/(\rho-2)}\zeta(k) < \infty$.

  3. The bracketing integral of the class of policies satisfies $J_{[]}\{\infty, \Pi, L_{3\rho}(P)\} < \infty$.

Assumption 5. There exists some $c_1 > 0$ such that

$$\inf_{\pi \in \Pi} c^\top E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\left\{\Phi(S_t)\Phi(S_t)^\top - \gamma^2\Phi(S_{t+1})\Phi(S_{t+1})^\top\right\}\right]c \geq c_1\|c\|^2$$

for all $c \in \mathbb{R}^q$.

Assumption 6. The map $\beta \mapsto V_R(\pi_\beta)$ has a unique and well-separated maximum over $\beta$ in the interior of $B$; let $\beta_0$ denote the maximizer.

Assumption 7. The following condition holds: $\sup_{\|\beta_1 - \beta_2\| \leq \delta} E|\pi_{\beta_1}(A; S) - \pi_{\beta_2}(A; S)| \to 0$ as $\delta \downarrow 0$.

Remark 4.1. Assumption 4 requires certain finite moments and that the dependence between observations on the same patient vanishes as observations become further apart. In Lemma 8.2 in the appendix, we verify part 3 of assumption 4 and assumption 7 for the class of policies introduced in Section 2. However, note that the theory holds for any class of policies satisfying the given assumptions, not just the class considered here. Assumption 5 is needed to show the existence of a unique $\theta_0^\pi$ uniformly over $\Pi$ and can be verified empirically by checking that certain data-dependent matrices are invertible. Assumption 6 requires that the true optimal decision in each state is unique (see assumption A.8 of Ertefaie, 2014) and is a standard assumption in M-estimation (see chapter 14 of Kosorok, 2008). Assumption 7 requires smoothness on the class of policies.

The main results of this section are stated below. Theorem 4.2 states that there exists a unique solution to $0 = E\Lambda_n(\pi, \theta_\pi)$ uniformly over $\Pi$ and that the estimator $\hat{\theta}_n^\pi$ converges weakly to a mean zero Gaussian process in $\ell^\infty(\Pi)$.

Theorem 4.2. Under the given assumptions, the following hold.

  1. For all $\pi \in \Pi$, there exists a $\theta_0^\pi \in \mathbb{R}^q$ such that $E\Lambda_n(\pi, \theta_\pi)$ has a zero at $\theta_\pi = \theta_0^\pi$. Moreover, $\sup_{\pi \in \Pi}\|\theta_0^\pi\| < \infty$ and $\sup_{\|\beta_1 - \beta_2\| \leq \delta}\|\theta_0^{\pi_{\beta_1}} - \theta_0^{\pi_{\beta_2}}\| \to 0$ as $\delta \downarrow 0$.

  2. Let $G(\pi)$ be a tight, mean zero Gaussian process indexed by $\Pi$ with covariance $E\{G(\pi_1)G(\pi_2)^\top\} = w_1(\pi_1)^{-1}w_0(\pi_1, \pi_2)\{w_1(\pi_2)^{-1}\}^\top$, where
    $$w_0(\pi_1, \pi_2) = E\left[\frac{\pi_1(A_t; S_t)\pi_2(A_t; S_t)}{\mu_t(A_t; S_t)^2}\left\{U_t + \gamma\Phi(S_{t+1})^\top\theta_0^{\pi_1} - \Phi(S_t)^\top\theta_0^{\pi_1}\right\}\left\{U_t + \gamma\Phi(S_{t+1})^\top\theta_0^{\pi_2} - \Phi(S_t)^\top\theta_0^{\pi_2}\right\}\Phi(S_t)\Phi(S_t)^\top\right]$$
    and
    $$w_1(\pi) = E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\Phi(S_t)\{\Phi(S_t) - \gamma\Phi(S_{t+1})\}^\top\right].$$
    Then, $\sqrt{n}(\hat{\theta}_n^\pi - \theta_0^\pi) \rightsquigarrow G(\pi)$ in $\ell^\infty(\Pi)$.
  3. Let $G(\pi)$ be as defined in part 2. Then, $\sqrt{n}\{\hat{V}_{n,\hat{R}}(\pi) - V_R(\pi)\} \rightsquigarrow \nu^\top G(\pi)$ in $\ell^\infty(\Pi)$.

Theorem 4.3 below gives us that the estimated optimal policy converges in probability to the true optimal policy over Π and that the estimated value of the estimated optimal policy converges to the true value of the estimated optimal policy.

Theorem 4.3. Under the given assumptions, the following hold.

  1. Let $\hat{\beta}_n = \arg\max_{\beta \in B}\hat{V}_{n,\hat{R}}(\pi_\beta)$ and $\beta_0 = \arg\max_{\beta \in B}V_R(\pi_\beta)$. We have that $\|\hat{\beta}_n - \beta_0\| \to_P 0$.

  2. Let $\hat{\beta}_n$ and $\beta_0$ be defined as in part 1. Then, $|V_R(\pi_{\hat{\beta}_n}) - V_R(\pi_{\beta_0})| \to_P 0$.

  3. Let $\sigma_0^2 = \nu^\top w_1(\pi_{\beta_0})^{-1}w_0(\pi_{\beta_0}, \pi_{\beta_0})\{w_1(\pi_{\beta_0})^{-1}\}^\top\nu$. Then, $\sqrt{n}\{\hat{V}_{n,\hat{R}}(\pi_{\hat{\beta}_n}) - V_R(\pi_{\hat{\beta}_n})\} \rightsquigarrow N(0, \sigma_0^2)$.

  4. A consistent estimator for $\sigma_0^2$ is
    $$\hat{\sigma}_n^2 = \{\mathbb{E}_n\Phi(S_t)\}^\top\hat{w}_1(\pi_{\hat{\beta}_n})^{-1}\hat{w}_0(\pi_{\hat{\beta}_n}, \pi_{\hat{\beta}_n})\{\hat{w}_1(\pi_{\hat{\beta}_n})^{-1}\}^\top\{\mathbb{E}_n\Phi(S_t)\},$$
    where
    $$\hat{w}_0(\pi_1, \pi_2) = \mathbb{E}_n\left[\frac{\pi_1(A_t; S_t)\pi_2(A_t; S_t)}{\mu_t(A_t; S_t)^2}\left\{U_t + \gamma\Phi(S_{t+1})^\top\hat{\theta}_n^{\pi_1} - \Phi(S_t)^\top\hat{\theta}_n^{\pi_1}\right\}\left\{U_t + \gamma\Phi(S_{t+1})^\top\hat{\theta}_n^{\pi_2} - \Phi(S_t)^\top\hat{\theta}_n^{\pi_2}\right\}\Phi(S_t)\Phi(S_t)^\top\right]$$
    and
    $$\hat{w}_1(\pi) = \mathbb{E}_n\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\Phi(S_t)\{\Phi(S_t) - \gamma\Phi(S_{t+1})\}^\top\right].$$

Proofs of the above results are in the Appendix along with a result on bracketing entropy that is needed for the proof of Theorem 4.2 and a proof that the class of policies introduced above satisfies the necessary bracketing integral assumption.

5. Simulation experiments

In this section, we examine the performance of V-learning on simulated data. Section 5.1 contains results for offline estimation and Section 5.2 contains results for online estimation. All simulation results are averaged across 50 replications.

5.1. Offline simulations

Our implementation of V-learning follows the setup in Section 2. Maximizing $\hat{V}_{n,R}(\pi)$ is done using a combination of simulated annealing and the BFGS algorithm as implemented in the optim function in R software (R Core Team, 2016). We note that $\hat{V}_{n,R}(\pi)$ is differentiable in $\pi$, thereby avoiding some of the computational complexity of GGQ. However, the objective is not necessarily convex. In order to avoid local maxima, simulated annealing with 1000 function evaluations is used to find a neighborhood of the maximum; this solution is then used as the starting value for the BFGS algorithm.

We use the class of policies introduced in Section 2. Although we maximize the value over a class of randomized policies, the true optimal policy is deterministic. To prevent the coefficients of $\hat{\beta}_n$ from diverging to infinity, we add an $L_2$ penalty when maximizing over $\beta$. To prevent overfitting, we use an $L_2$ penalty when computing $\hat{\theta}_n^\pi$, i.e., $\mathcal{P}(\theta_\pi) = \theta_\pi^\top\theta_\pi$. We let $\Omega$ be the identity matrix. Simulation results with alternate choices of $\mathcal{P}$ and $\Omega$ are given in the appendix. Tuning parameters can be used to control the amount of randomness in the estimated policy. For example, increasing the penalty when computing $\hat{\beta}_n$ is one way to encourage exploration through the action space because $\beta = 0$ defines a policy where each action is selected with equal probability.

We consider three different models for the state-value function: (i) linear; (ii) second degree polynomial; and (iii) Gaussian radial basis functions (RBF). The Gaussian RBF is $\phi(x; \kappa, \tau^2) = \exp\{-(x - \kappa)^2/(2\tau^2)\}$. We use $\tau = 0.25$ and $\kappa = 0, 0.25, 0.5, 0.75, 1$ to create a basis of functions and apply this basis to the state variables after scaling them to be between 0 and 1. Each model also implicitly contains an intercept.
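A short R sketch of this basis construction (our code; it assumes the state matrix has already been rescaled column-wise to [0, 1]):

```r
# Gaussian RBF phi(x; kappa, tau^2) = exp{-(x - kappa)^2 / (2 tau^2)} applied
# coordinate-wise, with centers kappa in {0, 0.25, 0.5, 0.75, 1} and tau = 0.25.
rbf_basis <- function(S, centers = seq(0, 1, by = 0.25), tau = 0.25) {
  feats <- lapply(seq_len(ncol(S)), function(j)
    sapply(centers, function(kappa) exp(-(S[, j] - kappa)^2 / (2 * tau^2))))
  cbind(1, do.call(cbind, feats))       # implicit intercept plus p * 5 RBF features
}
```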

We begin with the following simple generative model. Let the two-dimensional state vector be $S_{it} = (S_{i,1t}, S_{i,2t})^\top$, $i = 1, \ldots, n$, $t = 1, \ldots, T$. We initiate the state variables as independent standard normal random variables and let them evolve according to $S_{i,1t} = (3/4)(2A_{it-1} - 1)S_{i,1t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{1t}$ and $S_{i,2t} = (3/4)(1 - 2A_{it-1})S_{i,2t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{2t}$, where $A_{it}$ takes values in $\{0, 1\}$ and $\epsilon_{1t}$ and $\epsilon_{2t}$ are independent $N(0, 1/4)$ random variables. Define the utility function by $U_{it} = u(S_{it+1}, A_{it}, S_{it}) = 2S_{i,1t+1} + S_{i,2t+1} - (1/4)(2A_{it} - 1)$. At each time $t$, we must make a decision to treat or not with the goal of maximizing the components of $S$ while treating as few times as possible. Treatment has a positive effect on $S_1$ and a negative effect on $S_2$. We generate $A_{it}$ from a Bernoulli distribution with mean 1/2. In estimation, we assume that the generating model for treatment is known, as would be the case in a micro-randomized trial.
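A minimal R sketch of this generative model (our code; it simulates one patient under the observational Bernoulli(1/2) treatment mechanism and returns states, actions, and utilities after a burn-in period):

```r
# One trajectory from the two-covariate generative model: S1 and S2 evolve with
# opposite treatment effects, A_t ~ Bernoulli(1/2), and
# U_t = 2 * S1_{t+1} + S2_{t+1} - (1/4) * (2 * A_t - 1).
sim_trajectory <- function(T_obs, burn_in = 50) {
  total <- T_obs + burn_in + 1
  S <- matrix(0, nrow = total, ncol = 2)
  S[1, ] <- rnorm(2)                                 # independent standard normal start
  A <- rbinom(total, 1, 0.5)                         # behavior policy mu_t = 1/2
  for (t in 2:total) {
    S[t, 1] <- 0.75 * (2 * A[t - 1] - 1) * S[t - 1, 1] +
               0.25 * S[t - 1, 1] * S[t - 1, 2] + rnorm(1, sd = 0.5)   # Var = 1/4
    S[t, 2] <- 0.75 * (1 - 2 * A[t - 1]) * S[t - 1, 2] +
               0.25 * S[t - 1, 1] * S[t - 1, 2] + rnorm(1, sd = 0.5)
  }
  keep <- (burn_in + 1):(total - 1)
  U <- 2 * S[keep + 1, 1] + S[keep + 1, 2] - 0.25 * (2 * A[keep] - 1)
  list(S_now = S[keep, ], S_next = S[keep + 1, ], A = A[keep], U = U)
}
```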

We generate samples of $n$ patients with $T$ time points per patient from the given generative model after an initial burn-in period of 50 time points. The burn-in period ensures that our simulated data are sampled from an approximately stationary distribution. We estimate policies using V-learning with three different types of basis functions and GGQ. After estimating optimal policies, we simulate 100 patients following each estimated policy for 100 time points and take the mean utility under each policy as an estimate of the value of that policy. Estimated values are found in Table 1 with Monte Carlo standard errors along with the observed value. Recall that larger values are better. The policies estimated using V-learning produce better outcomes than the observational policy and the policy estimated using GGQ. V-learning produces the best outcomes using Gaussian basis functions. In the Appendix, we give results for the same simulation settings given in Table 1 for alternate choices of $\mathcal{P}$ and $\Omega$. Table 8 presents results with the LASSO penalty when $\Omega$ is the identity matrix, Table 9 presents results with the $L_2$ penalty when $\Omega = \{\mathbb{E}_n S_t S_t^\top\}^{-1}$, and Table 10 presents results with the LASSO penalty when $\Omega = \{\mathbb{E}_n S_t S_t^\top\}^{-1}$.

Table 1:

Monte Carlo value estimates for offline simulations with γ = 0.9.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.118 (0.0892) 0.091 (0.0825) 0.110 (0.0979) 0.014 (0.0311) −0.005
36 0.108 (0.0914) 0.115 (0.0911) 0.112 (0.0919) 0.029 (0.0280) −0.004
48 0.106 (0.0705) 0.071 (0.0974) 0.103 (0.0757) 0.031 (0.0350) 0.000
50 24 0.124 (0.0813) 0.109 (0.1045) 0.118 (0.0879) 0.016 (0.0355) −0.005
36 0.126 (0.0818) 0.134 (0.0878) 0.136 (0.0704) 0.027 (0.0276) 0.003
48 0.101 (0.0732) 0.109 (0.0767) 0.115 (0.0763) 0.020 (0.0245) 0.000
100 24 0.117 (0.0895) 0.135 (0.0973) 0.140 (0.0866) 0.019 (0.0257) 0.011
36 0.113 (0.0853) 0.105 (0.1033) 0.139 (0.0828) 0.021 (0.0312) 0.012
48 0.111 (0.0762) 0.143 (0.0853) 0.114 (0.0699) 0.031 (0.0306) −0.001

Next, we simulate cohorts of patients with type 1 diabetes to mimic the mHealth study of Maahs et al. (2012). Maahs et al. (2012) followed a small sample of youths with type 1 diabetes and recorded data at a fine granularity using mobile devices. Blood glucose levels were tracked in real time using continuous glucose monitoring, physical activity was measured continuously using accelerometers, and insulin injections were logged by an insulin pump. Dietary data were recorded by 24-hour recall over phone interviews.

In our simulation study, we divide each day of follow-up into 60 minute intervals. Thus, for one day of follow-up, we observe $T = 24$ time points per simulated patient and a treatment decision is made every hour. Our hypothetical mHealth study is designed to estimate an optimal dynamic treatment regime for the timing of insulin injections based on patient blood glucose, physical activity, and dietary intake with the goal of controlling future blood glucose as close as possible to the optimal range. To this end, we define the utility at time $t$ as a weighted sum of hypo- and hyperglycemic episodes in the 60 minutes preceding and following time $t$. Weights are −3 when glucose ≤ 70 (hypoglycemic), −2 when glucose > 150 (hyperglycemic), −1 when 70 < glucose ≤ 80 or 120 < glucose ≤ 150 (borderline hypo- and hyperglycemic), and 0 when 80 < glucose ≤ 120 (normal glycemia). Utility at each time point ranges from −6 to 0 with larger utilities (closer to 0) being more preferable. For example, a patient who presents with an average blood glucose of 155 mg/dL over time interval $t - 1$, takes an action to correct their hyperglycemia, and presents with an average blood glucose of 145 mg/dL over time interval $t$ would receive a utility of $U_t = -3$. Weights were chosen to reflect the relative clinical consequences of high and low blood glucose. For example, acute hypoglycemia, characterized by blood glucose levels below 70 mg/dL, is an emergency situation that can result in coma or death.
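A minimal R sketch of this utility (our code; helper names are hypothetical), which reproduces the worked example in this paragraph:

```r
# Weight for an average blood glucose value (mg/dL) over a 60-minute interval.
glucose_weight <- function(g) {
  if (g <= 70) -3                       # hypoglycemic
  else if (g > 150) -2                  # hyperglycemic
  else if (g <= 80 || g > 120) -1       # borderline hypo- or hyperglycemic
  else 0                                # normal glycemia, 80 < g <= 120
}

# Utility at time t: weighted sum over the hour preceding and the hour following t.
utility_t <- function(glucose_prev, glucose_next) {
  glucose_weight(glucose_prev) + glucose_weight(glucose_next)
}

utility_t(155, 145)                     # the example in the text: -2 + (-1) = -3
```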

Simulated data are generated as follows. At each time point, patients are randomly chosen to receive an insulin injection with probability 0.3, consume food with probability 0.2, partake in mild physical activity with probability 0.4, and partake in moderate physical activity with probability 0.2. Grams of food intake and counts of physical activity are generated from normal distributions with parameters estimated from the data of Maahs et al. (2012). Initial blood glucose level for each patient is drawn from a normal distribution with mean 100 and standard deviation 25. Define the covariates for patient $i$ collected at time $t$ by $(\mathrm{Gl}_{it}, \mathrm{Di}_{it}, \mathrm{Ex}_{it})$, where $\mathrm{Gl}_{it}$ is average blood glucose level, $\mathrm{Di}_{it}$ is total dietary intake, and $\mathrm{Ex}_{it}$ is total counts of physical activity as would be measured by an accelerometer. Glucose levels evolve according to

$$\mathrm{Gl}_t = \mu(1 - \alpha_1) + \alpha_1\mathrm{Gl}_{t-1} + \alpha_2\mathrm{Di}_{t-1} + \alpha_3\mathrm{Di}_{t-2} + \alpha_4\mathrm{Ex}_{t-1} + \alpha_5\mathrm{Ex}_{t-2} + \alpha_6\mathrm{In}_{t-1} + \alpha_7\mathrm{In}_{t-2} + e, \tag{8}$$

where $\mathrm{In}_t$ is an indicator of an insulin injection received at time $t$ and $e \sim N(0, \sigma^2)$. We use the parameter vector $\alpha = (\alpha_1, \ldots, \alpha_7) = (0.9, 0.1, 0.1, -0.01, -0.01, -2, -4)$, $\mu = 100$, and $\sigma = 5.5$ based on a linear model fit to the data of Maahs et al. (2012). The known lag-time in the effect of insulin is reflected by $\alpha_6 = -2$ and $\alpha_7 = -4$. Selecting $\alpha_1 < 1$ ensures the existence of a stationary distribution.
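A minimal R sketch of one glucose transition under (8) with these parameter values (our code; lagged dietary intake, activity counts, and insulin indicators are passed in directly):

```r
# One-step update Gl_t from equation (8), with e ~ N(0, sigma^2).
next_glucose <- function(gl_lag1, di_lag1, di_lag2, ex_lag1, ex_lag2,
                         in_lag1, in_lag2,
                         alpha = c(0.9, 0.1, 0.1, -0.01, -0.01, -2, -4),
                         mu = 100, sigma = 5.5) {
  mu * (1 - alpha[1]) + alpha[1] * gl_lag1 +
    alpha[2] * di_lag1 + alpha[3] * di_lag2 +
    alpha[4] * ex_lag1 + alpha[5] * ex_lag2 +
    alpha[6] * in_lag1 + alpha[7] * in_lag2 +
    rnorm(1, sd = sigma)
}

# Example: no recent food, activity, or insulin keeps glucose near mu = 100.
next_glucose(100, 0, 0, 0, 0, 0, 0)
```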

We define the state vector for patient $i$ at time $t$ to contain average blood glucose, total dietary intake, and total physical activity measured over previous time intervals; we include blood glucose and physical activity for the previous two time intervals and dietary intake for the previous four time intervals. Let $n$ denote the number of patients and $T$ denote the number of time points per patient. Our choices for $n$ and $T$ are based on what is feasible for an mHealth outpatient study (dietary data were collected on two days by Maahs et al., 2012). For each replication, the optimal treatment regime is estimated with V-learning using three different types of basis functions and GGQ. The generative model for insulin treatment is not assumed to be known and we estimate it using logistic regression. We record mean outcomes in an independent sample of 100 patients followed for 100 time points with treatments generated according to each estimated optimal regime. Simulation results (estimated values under each regime and Monte Carlo standard errors along with observed values) are found in Table 2. Again, V-learning with Gaussian basis functions performs the best out of all methods, generally producing large values and small standard errors. V-learning with the linear model underperforms, and GGQ underperforms even more.

Table 2:

Monte Carlo value estimates for simulated T1D cohorts with γ = 0.9.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 −2.716 (1.2015) −2.335 (0.9818) −2.018 (1.2011) −3.870 (0.9225) −2.316
36 −2.700 (1.2395) −2.077 (1.0481) −1.760 (0.8468) −3.644 (0.8745) −2.261
48 −2.496 (1.1986) −2.236 (1.1978) −1.751 (0.9887) −2.405 (1.1025) −2.365
50 24 −2.545 (1.1865) −2.069 (1.0395) −1.605 (0.8064) −3.368 (1.0186) −2.263
36 −2.644 (1.1719) −2.004 (0.9074) −1.778 (0.8496) −3.099 (0.9722) −2.336
48 −2.469 (1.1635) −2.073 (0.9870) −2.102 (1.2078) −2.528 (0.9571) −2.308
100 24 −2.350 (1.1171) −2.128 (1.0520) −1.612 (0.7203) −3.272 (0.8636) −2.299
36 −2.547 (1.1852) −2.116 (0.8518) −1.672 (0.8643) −3.232 (0.7951) −2.321
48 −2.401 (1.0643) −2.204 (1.0400) −1.494 (0.5413) −2.820 (0.8442) −2.351

5.2. Online simulations

In practice, it may be useful for patients to follow a dynamic treatment regime that is updated as new data are collected. Here we consider a hypothetical study wherein $n$ patients are followed for an initial period of $T'$ time points, an optimal policy is estimated, and patients are followed for an additional $T - T'$ time points with the estimated optimal policy being continuously updated. At each time point $t \geq T'$, actions are taken according to the most recently estimated policy. Recall that V-learning produces a randomized decision rule from which to sample actions at each time point. When selecting an action based on a GGQ policy, we incorporate an $\epsilon$-greedy strategy by selecting the action recommended by the estimated policy with probability $1 - \epsilon$ and otherwise randomly selecting one of the other actions. At the $t$th estimation, we use $\epsilon = 0.5^t$, allowing $\epsilon$ to decrease over time to reflect increasing confidence in the estimated policy. A burn-in period of 50 time points is discarded to ensure that we are sampling from a stationary distribution. We estimate the first policy after 12 time points and a new policy is estimated every 6 time points thereafter. After $T$ time points, we estimate the value as the average utility over all patients and all time points after the initial period.

Table 3 presents mean outcomes under policies estimated online using data generated according to the simple two covariate generative model introduced at the beginning of Section 5.1. There is some variability across n and T regarding which type of basis function is best, but V-learning with a polynomial basis generally produces the best outcomes. GGQ performs well in large samples.

Table 3:

Value estimates for online simulations with γ = 0.9.

n T Linear VL Polynomial VL Gaussian VL GGQ
25 24 0.0053 0.0149 −0.0100 −0.0081
36 0.0525 0.0665 0.0310 0.0160
48 0.0649 0.0722 0.0416 0.0493
50 24 0.0164 0.0117 0.0037 0.0058
36 0.0926 0.0791 0.0666 0.0227
48 0.1014 0.0894 0.0512 0.0434
100 24 0.0036 −0.0157 0.0200 0.0239
36 0.0766 0.0626 0.0907 0.0540
48 0.0728 0.0781 0.0608 0.0818

Next, we study the performance of online V-learning in simulated mHealth studies of type 1 diabetes by following the generative model described in (8). Mean outcomes are found in Table 4. Gaussian V-learning performs the best out of all methods. Across all variants of V-learning, outcomes improve with increased follow-up time.

Table 4:

Value estimates for online estimation of simulated T1D cohorts with γ = 0.9

n T Linear VL Polynomial VL Gaussian VL GGQ
25 24 −2.3887 −1.9713 −1.8860 −3.2027
36 −2.3784 −2.1535 −1.7857 −3.5127
48 −2.2190 −2.0679 −1.6999 −3.2280
50 24 −2.3405 −2.2313 −1.7761 −2.8976
36 −2.2829 −2.0922 −1.6016 −3.1589
48 −2.1587 −1.9669 −1.5948 −2.8729
100 24 −2.3229 −2.2295 −1.9138 −3.0865
36 −2.2927 −2.1608 −1.9030 −3.3483
48 −2.2096 −2.0454 −1.8252 −2.9428

Finally, we consider online simulations using individualized policies as outlined at the end of Section 3. Consider the simple two covariate generative model introduced above, but let state variables evolve according to $S_{i,1t} = \mu_i(2A_{it-1} - 1)S_{i,1t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{1t}$ and $S_{i,2t} = \mu_i(1 - 2A_{it-1})S_{i,2t-1} + (1/4)S_{i,1t-1}S_{i,2t-1} + \epsilon_{2t}$, where $\mu_i$ is a subject-specific term drawn uniformly between 0.4 and 0.9. Including $\mu_i$ ensures that the optimal policy differs across patients. Table 5 contains mean outcomes for online simulation where a universal policy is estimated using data from all patients and where individualized policies are estimated using only a single patient's data. Because data are generated in such a way that the optimal policy varies across patients, individualized policies achieve better outcomes than universal policies.

Table 5:

Value estimates for online V-learning simulations with universal and patient-specific policies when γ = 0.9.

n T Universal policy Patient-specific policy
25 24 0.0282 0.1813
36 0.1025 0.1700
48 0.0977 0.1944
50 24 0.0164 0.2771
36 0.0768 0.2617
48 0.0752 0.3038
100 24 0.0160 0.4230
36 0.0960 0.2970
48 0.1140 0.3197

6. Case study: Type 1 diabetes

Machine learning is currently under consideration in type 1 diabetes through studies to build and test a “closed loop” system that joins continuous blood glucose monitoring and subcutaneous insulin infusion through an underlying algorithm. Known as the artificial pancreas, this technology has been shown to be safe in preliminary studies and is making headway from small hospital-based safety studies to large-scale outpatient effectiveness studies (Ly et al., 2014, 2015). Despite the success of the artificial pancreas, the rate of uptake may be limited and widespread use may not occur for many years (Kowalski, 2015). The proposed method may be useful for implementing mHealth interventions for use alongside the artificial pancreas or before it is widely available.

Studies have shown that data on food intake and physical activity to inform optimal decision making can be collected in an inpatient setting (see, e.g., Cobry et al., 2010; Wolever and Mullan, 2011). However, Maahs et al. (2012) demonstrated that rich data on the effect of food intake and physical activity can be collected in an outpatient setting using mobile technology. Here, we apply the proposed methodology to the observational data collected by Maahs et al. (2012).

The full data consist of N = 31 patients with type 1 diabetes, aged 12–18. Glucose levels were monitored using continuous glucose monitoring and physical activity tracked using accelerometers for five days. Dietary data were self-reported by the patient in telephone-based interviews for two days. Patients were treated using either an insulin pump or multiple daily insulin injections. We use data on a subset of n = 14 patients treated with an insulin pump for whom full follow-up is available on days when dietary information was recorded. This represents 28 patient-days of data, with which we use V-learning to estimate an optimal treatment policy. An advantage of mHealth is the ability to collect data passively, limiting the amount of missing data. There is no intermittent missingness in this data set.

The setup closely follows the simulation experiments in Section 5.1. Patient state at each time, $t$, is taken to be average glucose level and total counts of physical activity over the two previous 60 minute intervals and total food intake in grams over the four previous 60 minute intervals. The goal is to learn a policy to determine when to administer insulin injections based on prior blood glucose, dietary intake, and physical activity. The utility at time $t$ is a weighted sum of glycemic events over the 60 minutes preceding and following time $t$ with weights defined in Section 5.1. A treatment regime with large value will minimize the number of hypo- and hyperglycemic episodes weighted to reflect the clinical importance of each. We note that because $V(\pi, s; \theta_\pi)$ is linear in $\theta_\pi$, we can evaluate $\hat{V}_{n,\hat{R}}(\pi)$ with only the mean of $\Phi(S)$ under $R$. These were estimated from the data. Because we cannot simulate data following a given policy to estimate its value, we report the parametric value estimate $\hat{V}_{n,\hat{R}}(\hat{\pi}_n)$. Interpreting the parametric value estimate is difficult because of the effect the discount factor has on estimated value. We cannot compare parametric value estimates to mean outcomes observed in the data. Instead, we use $\mathbb{E}_n\sum_{t \geq 0}\gamma^t U_t$ as an estimate of value under the observational policy.

We estimate optimal treatment strategies for two different action spaces. In the first, the only decision made at each time is whether or not to administer an insulin injection, i.e., the action space contains a single binary action. In the second, the action space contains all possible combinations of insulin injection, physical activity, and food intake. This corresponds to a hypothetical mHealth intervention where insulin injections are administered via an insulin pump and suggestions for physical activity and food intake are administered via a mobile app.

Table 6 contains parametric value estimates for policies estimated using V-learning for the two action spaces outlined above with different basis functions and discount factors. These results indicate that improvements in glycemic control can come from personalized and dynamic treatment strategies that account for food intake and physical activity. Improvement results from a dynamic insulin regimen (binary action space), and, in most cases, further improvement results from a comprehensive mHealth intervention including suggestions for diet and exercise delivered via mobile app in addition to insulin therapy (multiple action space). When considering multiple actions, the policy estimated using a polynomial basis and γ = 0.7 achieves a 64% increase in value and the policy estimated using a Gaussian basis and γ = 0.8 achieves a 68% increase in value over the observational policy. Although the small sample size is a weakness of this study, these results represent a significant improvement in value despite the sample size.

Table 6:

Parametric value estimates for V-learning applied to type 1 diabetes data.

Action space Basis γ = 0.7 γ = 0.8 γ = 0.9
Binary Linear −6.20 −9.35 −15.99
Polynomial −3.91 −9.03 −17.50
Gaussian −3.44 −13.09 −25.52
Multiple Linear −6.47 −9.92 −0.49
Polynomial −2.44 −6.80 −14.48
Gaussian −8.45 −3.58 −21.18
Observational policy −6.77 −11.28 −21.79

Finally, we use an example hyperglycemic patient to illustrate how an estimated policy would be applied in practice. One patient in the data presented at a specific time with an average blood glucose of 229 mg/dL over the previous hour and an average blood glucose of 283 mg/dL over the hour before that. The policy estimated with γ = 0.7 and a polynomial basis recommends each action according to the probabilities in Table 7. Because this patient presented with blood glucose levels that are higher than the optimal range, the policy recommends actions that would lower the patient's blood glucose levels, assigning a probability of 0.79 to insulin and a probability of 0.21 to insulin combined with activity.

Table 7:

Probabilities for each action as recommended by estimated policy for one example patient.

Action Probability
No action < 0.0001
Physical activity < 0.0001
Food intake < 0.0001
Food and activity < 0.0001
Insulin 0.7856
Insulin and activity 0.2143
Insulin and food 0.0002
Insulin, food, and activity < 0.0001

7. Conclusion

The emergence of mHealth has provided great potential for the estimation and implementation of dynamic treatment regimes. Mobile technologies can be used both in the collection of rich longitudinal data to inform decision making and in the delivery of deeply tailored interventions. The proposed method, V-learning, addresses a number of challenges associated with estimating dynamic treatment regimes in mHealth applications. V-learning directly estimates a policy which maximizes the value over a class of policies and requires minimal assumptions on the data-generating process. Furthermore, V-learning permits estimation of a randomized decision rule which can be used in place of existing strategies (e.g., ϵ-greedy) to encourage exploration in online estimation. A randomized decision rule can also provide patients with multiple treatment options. Estimation of an optimal policy for different populations can be handled through the use of different reference distributions.

V-learning and mobile technologies have the potential to improve patient outcomes in a variety of clinical areas. We have demonstrated, for example, that the proposed method can be used to estimate treatment regimes to reduce the number of hypo- and hyperglycemic episodes in patients with type 1 diabetes. The proposed method could also be useful for other mHealth applications as well as applications outside of mHealth. For example, V-learning could be used to estimate dynamic treatment regimes for chronic illnesses using electronic health records data. Future research in this area may include increasing flexibility through use of a semiparametric model for the state-value function. Alternatively, nonlinear models for the state-value function may be informed by underlying theory or mathematical models of the system of interest. Data-driven selection of tuning parameters for the proposed method may help to improve performance. Developing theory for alternative penalty functions, such as the LASSO penalty, is another important step. Accounting for patient availability and feasibility of a sequence of treatments can be done by setting constraints on the class of policies. This will ensure that the resulting mHealth intervention is able to be implemented and that the recommended decisions are consistent with domain knowledge.

It would also be worthwhile to generalize our asymptotic results to permit nonstationarity. While we believe that stationarity is generally a reasonable assumption for moderate stretches of time—including when we do online estimation using moderately large batches of observations wherein both the patient dynamics and treatment policy remain approximately constant over each batch—stationarity would not in general hold when either the patient dynamics or treatment policy change more rapidly. For example, online estimation with treatment policy changes after each observation could induce nonstationarity. We conjecture that our asymptotic results will continue to hold in this setting, as our simulation studies in Section 5.2 seem to indicate.

8. Acknowledgments

We thank the editor, associate editor, and reviewers for helpful comments which led to a significantly improved paper.

Appendix

Proofs

Proof of Lemma 2.1. Let $\pi$ be an arbitrary policy and $\gamma \in (0, 1)$ a fixed constant. Suppose we observe a state $S_t = s_t$ at time $t$ and let $\bar{a}_{t-1} = (a_1, \ldots, a_{t-1})$ be the sequence of actions resulting in $S_t = s_t$, i.e., $S_t^*(\bar{a}_{t-1}) = s_t$. Let $\bar{a}^{k+1} = (a_t, \ldots, a_{t+k}) \in \mathcal{A}^{k+1}$ be a potential sequence of actions taken from time $t$ to time $t + k$. We have that

$$\begin{aligned} V(\pi, s_t) &= \sum_{k \geq 0}\gamma^k E\{U_{t+k}^*(\pi) \mid S_t = s_t\} \\ &= \sum_{k \geq 0}\gamma^k E\left(\sum_{\bar{a}_{t+k} \in \mathcal{A}^{t+k}} U_{t+k}^*(\bar{a}_{t+k}) \prod_{v=t}^{t+k} 1\left[\xi_{\pi v}\{S_v^*(\bar{a}_{v-1})\} = a_v\right] \,\middle|\, S_t = s_t\right) \\ &= \sum_{k \geq 0}\gamma^k \sum_{\bar{a}^{k+1} \in \mathcal{A}^{k+1}} U_{t+k}^*(\bar{a}_{t-1}, \bar{a}^{k+1})\left\{\prod_{v=t}^{t+k} E\left(1\left[\xi_{\pi v}\{S_v^*(\bar{a}_{v-1})\} = a_v\right] \,\middle|\, S_t = s_t\right)\right\} \\ &= \sum_{k \geq 0}\gamma^k \sum_{\bar{a}^{k+1} \in \mathcal{A}^{k+1}} U_{t+k}^*(\bar{a}_{t-1}, \bar{a}^{k+1})\prod_{v=t}^{t+k}\pi\{a_v; S_v^*(\bar{a}_{v-1})\}\prod_{v=t}^{t+k}\frac{\mu_v\{a_v; S_v^*(\bar{a}_{v-1})\}}{\mu_v\{a_v; S_v^*(\bar{a}_{v-1})\}} \\ &= \sum_{k \geq 0}\gamma^k E\left[U_{t+k}\left\{\prod_{v=0}^{k}\frac{\pi(A_{t+v}; S_{t+v})}{\mu_{t+v}(A_{t+v}; S_{t+v})}\right\} \,\middle|\, S_t = s_t\right], \end{aligned}$$

where we let π(at; st) = 0 for all at and st whenever t > T*(π). The last equality uses the consistency and strong ignorability assumptions.

Proof of Theorem 4.2. Proof of part 1: We first note that $\theta_0^\pi$ must solve

$$0 = E\left(\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\left[U_t + \{\gamma\Phi(S_{t+1}) - \Phi(S_t)\}^\top\theta_\pi\right]\Phi(S_t)\right),$$

or

$$E\left[\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}\Phi(S_t)\{\Phi(S_t) - \gamma\Phi(S_{t+1})\}^\top\right]\theta_\pi = E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}U_t\Phi(S_t)\right\},$$

which is equivalent to $w_1(\pi)\theta_\pi = w_2(\pi)$, where $w_2(\pi) = E\{\pi(A_t; S_t)\mu_t(A_t; S_t)^{-1}U_t\Phi(S_t)\}$. We have that

$$\left\|E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}U_t\Phi(S_t)\right\}\right\| \leq c_0^{-1}\left(E|U_t|^2\right)^{1/2}\left(E\|\Phi(S_t)\|^2\right)^{1/2} < \infty,$$

by assumption 3, part 1 of assumption 4, and the Cauchy-Schwarz inequality. Let $c \in \mathbb{R}^q$ be arbitrary and note that

$$E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)\Phi(S_{t+1})^\top c\right\} \leq \left[E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)^{\otimes 2}c\right\}E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_{t+1})^{\otimes 2}c\right\}\right]^{1/2},$$

by the Cauchy-Schwarz inequality, where $u^{\otimes 2} = uu^\top$. This implies that

$$c^\top w_1(\pi)c \geq E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)^{\otimes 2}c\right\} - E\left\{\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_t)^{\otimes 2}c\right\}^{1/2}E\left\{\gamma^2\frac{\pi(A_t; S_t)}{\mu_t(A_t; S_t)}c^\top\Phi(S_{t+1})^{\otimes 2}c\right\}^{1/2} = A - A^{1/2}B^{1/2} = A^{1/2}\left(A^{1/2} - B^{1/2}\right) = \frac{A^{1/2}(A - B)}{A^{1/2} + B^{1/2}},$$

where we simplify notation by defining $A = E\{\pi(A_t; S_t)\mu_t(A_t; S_t)^{-1}c^\top\Phi(S_t)^{\otimes 2}c\}$ and $B = E\{\gamma^2\pi(A_t; S_t)\mu_t(A_t; S_t)^{-1}c^\top\Phi(S_{t+1})^{\otimes 2}c\}$. We have that

$$A^{1/2} + B^{1/2} \leq c_0^{-1/2}\|c\|\left\{E\|\Phi(S_t)\|^2\right\}^{1/2} + c_0^{-1/2}\|c\|\left\{E\|\Phi(S_{t+1})\|^2\right\}^{1/2} = 2c_0^{-1/2}\|c\|\left\{E\|\Phi(S_t)\|^2\right\}^{1/2} < \infty,$$

by the Cauchy-Schwarz inequality, the fact that $E\|\Phi(S_t)\|^2 = E\|\Phi(S_{t+1})\|^2$ by time-homogeneity, and part 1 of assumption 4. Also, $A \geq A - B$ and $A - B \geq c_1\|c\|^2$ by assumption 5. Thus,

$$A - A^{1/2}B^{1/2} \geq \frac{c_1^{3/2}\|c\|^3}{2c_0^{-1/2}\|c\|\left\{E\|\Phi(S_t)\|^2\right\}^{1/2}} = \frac{c_0^{1/2}c_1^{3/2}\|c\|^2}{2\left\{E\|\Phi(S_t)\|^2\right\}^{1/2}},$$

which finally implies that $w_1(\pi)$ is invertible and thus $\theta_0^\pi = w_1(\pi)^{-1}w_2(\pi)$ is well-defined uniformly over $\pi \in \Pi$. Using the fact that $c^\top w_1(\pi)c \geq k_0\|c\|^2$ for a constant $k_0 > 0$, we can show that $\|w_1(\pi)^{-1}\| \leq k_1^{-1}$ for some constant $k_1 > 0$, where $\|\cdot\|$ is the usual matrix norm when applied to a matrix. Therefore, $\|\theta_0^\pi\| \leq k_1^{-1}\|w_2(\pi)\| \leq c_0^{-1}k_1^{-1}\{E(U_t)^2\}^{1/2}\{E\|\Phi(S_t)\|^2\}^{1/2} < \infty$. Finally, it follows from assumptions 5 and 7 that $\sup_{\|\beta_1 - \beta_2\| \leq \delta}\|\theta_0^{\pi_{\beta_1}} - \theta_0^{\pi_{\beta_2}}\| \to 0$ as $\delta \downarrow 0$.

Proof of part 2: Define

$$\mathcal{G} = \left\{\Phi(s_t)\Phi(s_t)^\top/\mu_t(a_t; s_t),\ \gamma\Phi(s_t)\Phi(s_{t+1})^\top/\mu_t(a_t; s_t),\ u_t\Phi(s_t)/\mu_t(a_t; s_t)\right\}.$$

Let $G$ be an envelope for $\mathcal{G}$, for example $G(s_{t+1}, a_t, s_t) = \max_{g \in \mathcal{G}}\|g(s_{t+1}, a_t, s_t)\|$. By part 1 of assumption 4, $EG^{3\rho} < \infty$. Part 4 of Lemma 8.1 below gives us that $\mathcal{G}$ is Donsker. Since $\Pi$ satisfies $J_{[]}\{\infty, \Pi, L_{3\rho}(P)\} < \infty$, we have that

$$\mathcal{F}_1 = \left\{\Omega^{1/2}\frac{\pi(a_t; s_t)}{\mu_t(a_t; s_t)}\Phi(s_t)\{\Phi(s_t) - \gamma\Phi(s_{t+1})\}^\top : \pi \in \Pi\right\}$$

satisfies $J_{[]}\{\infty, \mathcal{F}_1, L_{3\rho}(P)\} < \infty$ by parts 1 and 2 of Lemma 8.1 below. Moreover, $F(s_{t+1}, a_t, s_t) = \|\Phi(s_t)\|\cdot\|\Phi(s_t) - \gamma\Phi(s_{t+1})\|/\mu_t(a_t; s_t)$ is an envelope for $\mathcal{F}_1$ with $EF^{3\rho} < \infty$ by assumption 3 and part 1 of assumption 4. Thus, $\mathcal{F}_1$ is Donsker. Let

$$\mathcal{F}_2 = \left\{\Omega^{1/2}\frac{\pi(a_t; s_t)}{\mu_t(a_t; s_t)}u_t\Phi(s_t) : \pi \in \Pi\right\}.$$

Similar arguments yield that F2 is Donsker.

Now, let A^(π)={Enf1π:f1πF1} and B^(π)={Enf2π:f2πF2}. Let A^(π)=A^(π)+λnA^(π). We have that θ^nπ=A^(π)1B^(π). Thus,

n(θ^nπθ0π)=n{A^(π)1B^(π)A^(π)1A^(π)θ0π}+oP(1)=A^(π)1n{B^(π)A^(π)θ0π}+oP(1)=A^(π)1n{B^(π)A^(π)θ0π}+A^(π)1n{A^(π)A^(π)}θ0π+oP(1)=A^(π)1n{B^(π)A^(π)θ0π}+oP(1)

where oP(1) doesn’t depend on π, because A^(π)1Pw1(π)1< uniformly over π ∈ Π by assumption 3 and part 1 of assumption 4, supπΠθ0π< by part 1 of this theorem, and n{A^(π)A^(π)}=nλnA^(π)=oP(1) because λn = oP(n−1/2). Using arguments similar to those in the previous paragraph, one can show that F3={f2πf1πθ:f1πF1,f2πF2,πΠ,θB} is Donsker, where B* is any finite collection of elements of q. By part 1 of this theorem, there exists a bounded, closed set B0 such that θ0πB0 for all π ∈ Π. Let Gn(π,θ)=n(EnE)(f2πf1πθ). Note that

supπΠGn(π,θ1)Gn(π,θ2)supπΠn(EnE)f1πθ1θ2Rθ1θ2,

where R* = OP(1) by the Donsker property of F1 and R* doesn’t depend on π. Thus, Gn(π,θ) is stochastically equicontinuous on B0. Combined with the Donsker property of F3 for arbitrary B*, we have that the class F4={f2πf1πθ:f1πF1,f2πF2,πΠ,θB0} is Donsker. Using Slutsky’s Theorem, Theorem 11.24 of Kosorok (2008), the fact that F1 is Glivenko-Cantelli, and the fact that θ0π=(Ef1π)1Ef2π, we have that n(θ^0πθ0π)=A^(π)1Gn(π,θ0π)w1(π)1G0(π), in (Π) where G0(π) is a mean zero Gaussian process indexed by Π with covariance E{G0(π1)G0(π2)}=w0(π1,π2).

Proof of part 3: We have that

$$
\sqrt{n}\left\{ \hat{V}_{n,\hat{R}}(\pi) - V_R(\pi) \right\} = \sqrt{n}\, \mathbb{E}_n \Phi(S_t)^{\mathsf{T}}\left( \hat{\theta}_n^\pi - \theta_0^\pi \right) \rightsquigarrow \nu^{\mathsf{T}} w_1(\pi)^{-1}\mathbb{G}_0(\pi)
$$

in $\ell^\infty(\Pi)$ by Slutsky's theorem.
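
Continuing the hypothetical sketch above (same caveats: illustrative names only, Ω taken as the identity), the plug-in value estimate is simply the average of $\Phi(s)^{\mathsf{T}}\hat{\theta}_n^\pi$ over states drawn from the reference distribution, and a policy search over a parametric class $\{\pi_\beta : \beta \in B\}$ maximizes this quantity over β:

```python
import numpy as np

def value_estimate(Phi_ref, theta_hat):
    """Plug-in value estimate: mean of Phi(s)^T theta_hat over reference states s."""
    return float(np.mean(np.asarray(Phi_ref) @ np.asarray(theta_hat)))

# Hypothetical policy search over a grid of candidate beta values, where
# policy_prob(beta, S, A) would return pi_beta(A_t; S_t) for the observed data:
# best_beta = max(beta_grid, key=lambda b: value_estimate(
#     Phi_ref, solve_theta(Phi, Phi_next, U, policy_prob(b, S, A), mu_prob)))
```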

Proof of Theorem 4.3. Proof of part 1: Following part 3 of Theorem 4.2, we have that $\sup_{\beta \in B}\left| \hat{V}_{n,\hat{R}}(\pi_\beta) - V_R(\pi_\beta) \right| \to_P 0$. Combining this with the unique and well-separated maximum condition (assumption 6), continuity of $V_R(\pi_\beta)$ in β, and Theorem 2.12 of Kosorok (2008) yields the result in part 1. Part 2 follows from part 1 of this theorem, part 1 of Theorem 4.2, and the continuous mapping theorem. Part 3 follows from parts 2 and 3 of Theorem 4.2. The proof of part 4 follows standard arguments. ☐

Lemma 8.1. Let $\mathcal{F}$ and $\mathcal{G}$ be function classes with respective envelopes $F$ and $G$. Let $\|F\|_u = (E|F|^u)^{1/u}$. For any $1 \le r, s_1, s_2 \le \infty$ with $s_1^{-1} + s_2^{-1} = 1$:

  1. $J_{[]}\{\infty, \mathcal{F}\mathcal{G}, L_r(P)\} \le 2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)\left[ J_{[]}\{\infty, \mathcal{F}, L_{rs_1}(P)\} + J_{[]}\{\infty, \mathcal{G}, L_{rs_2}(P)\} \right]$.

  2. $J_{[]}\{\infty, \mathcal{F} + \mathcal{G}, L_r(P)\} \le 2\left[ J_{[]}\{\infty, \mathcal{F}, L_r(P)\} + J_{[]}\{\infty, \mathcal{G}, L_r(P)\} \right]$.

  3. For any $0 < r \le \infty$, $J_{[]}\{\infty, \mathcal{F} \cup \mathcal{G}, L_r(P)\} \le 2\sqrt{\log 2}\left( \|F\|_r + \|G\|_r \right) + J_{[]}\{\infty, \mathcal{F}, L_r(P)\} + J_{[]}\{\infty, \mathcal{G}, L_r(P)\}$.

  4. If $\mathcal{G}$ is a finite class, $J_{[]}\{\infty, \mathcal{G}, L_r(P)\} \le 2\|G\|_r \sqrt{\log|\mathcal{G}|}$, where $|\mathcal{G}|$ denotes the cardinality of $\mathcal{G}$.

Proof of Lemma 8.1. Proof of part 1: Let $1 \le r, s_1, s_2 \le \infty$ with $s_1^{-1} + s_2^{-1} = 1$ and let $(\ell_F, u_F)$ and $(\ell_G, u_G)$ be $L_{rs_1}(P)$ and $L_{rs_2}(P)$ $\epsilon$-brackets, respectively. Choose $\ell_F \le f_1, f_2 \le u_F$ and $\ell_G \le g_1, g_2 \le u_G$ and consider the bracket for $f_2 g_2$ defined by $f_1 g_1 \pm \left( F|u_G - \ell_G| + G|u_F - \ell_F| \right)$. Note that $f_1 g_1 + F|u_G - \ell_G| + G|u_F - \ell_F| - f_2 g_2 \ge F|u_G - \ell_G| + G|u_F - \ell_F| - F|g_1 - g_2| - G|f_1 - f_2| \ge 0$, because $f_2 g_2 - f_1 g_1 = f_2 g_2 - f_2 g_1 + f_2 g_1 - f_1 g_1 \le F|g_1 - g_2| + G|f_1 - f_2|$. Similarly, $f_2 g_2 + F|u_G - \ell_G| + G|u_F - \ell_F| - f_1 g_1 \ge 0$. Thus, these brackets hold all products $f_2 g_2$ for $f_2 \in (\ell_F, u_F)$ and $g_2 \in (\ell_G, u_G)$. Now, $\left\| F|u_G - \ell_G| + G|u_F - \ell_F| \right\|_r \le \|F\|_{rs_1}\epsilon + \|G\|_{rs_2}\epsilon$ by Minkowski's inequality and Hölder's inequality, and it follows that

$$
N_{[]}\left\{ 2\epsilon\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right), \mathcal{F}\mathcal{G}, L_r(P) \right\} \le N_{[]}\left\{ \epsilon, \mathcal{F}, L_{rs_1}(P) \right\} N_{[]}\left\{ \epsilon, \mathcal{G}, L_{rs_2}(P) \right\}.
$$

Next we note that

$$
N_{[]}\left\{ \epsilon, \mathcal{F}\mathcal{G}, L_r(P) \right\} \le N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{F}, L_{rs_1}(P) \right\} N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{G}, L_{rs_2}(P) \right\}
$$

and thus

$$
\begin{aligned}
J_{[]}\left\{ \infty, \mathcal{F}\mathcal{G}, L_r(P) \right\} &\le \int_0^{2\|F\|_{rs_1}\|G\|_{rs_2}} \sqrt{\log N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{F}, L_{rs_1}(P) \right\}}\, d\epsilon \\
&\quad + \int_0^{2\|F\|_{rs_1}\|G\|_{rs_2}} \sqrt{\log N_{[]}\left\{ \frac{\epsilon}{2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)}, \mathcal{G}, L_{rs_2}(P) \right\}}\, d\epsilon \\
&\le 2\left( \|F\|_{rs_1} + \|G\|_{rs_2} \right)\left[ J_{[]}\left\{ \infty, \mathcal{F}, L_{rs_1}(P) \right\} + J_{[]}\left\{ \infty, \mathcal{G}, L_{rs_2}(P) \right\} \right].
\end{aligned}
$$

The proof of part 2 follows from Lemma 9.25, part (i), of Kosorok (2008) after a change of variables. Proof of part 3: First note that

$$
N_{[]}\left\{ \epsilon, \mathcal{F} \cup \mathcal{G}, L_r(P) \right\} \le N_{[]}\left\{ \epsilon, \mathcal{F}, L_r(P) \right\} + N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\},
$$

whence it follows that

$$
\begin{aligned}
J_{[]}\left\{ \infty, \mathcal{F} \cup \mathcal{G}, L_r(P) \right\} &= \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log N_{[]}\left\{ \epsilon, \mathcal{F} \cup \mathcal{G}, L_r(P) \right\}}\, d\epsilon \\
&\le \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log\left[ N_{[]}\left\{ \epsilon, \mathcal{F}, L_r(P) \right\} + N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\} \right]}\, d\epsilon \\
&\le \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log 2 + \log N_{[]}\left\{ \epsilon, \mathcal{F}, L_r(P) \right\} + \log N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\}}\, d\epsilon \\
&\le \int_0^{2(\|F\|_r + \|G\|_r)} \sqrt{\log 2}\, d\epsilon + J_{[]}\left\{ \infty, \mathcal{F}, L_r(P) \right\} + J_{[]}\left\{ \infty, \mathcal{G}, L_r(P) \right\},
\end{aligned}
$$

where the second inequality uses the fact that a + b ≤ 2ab for all a, b ≥ 1.

Proof of part 4: If $\mathcal{G}$ is finite, then $N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\} \le |\mathcal{G}|$. Thus,

$$
J_{[]}\left\{ \infty, \mathcal{G}, L_r(P) \right\} = \int_0^{2\|G\|_r} \sqrt{\log N_{[]}\left\{ \epsilon, \mathcal{G}, L_r(P) \right\}}\, d\epsilon \le \int_0^{2\|G\|_r} \sqrt{\log|\mathcal{G}|}\, d\epsilon = 2\|G\|_r\sqrt{\log|\mathcal{G}|},
$$

which completes the proof.

Lemma 8.2. Define the class of functions

$$
\Pi = \left\{ \pi_{\tilde\beta}(a; s) = \frac{a_J + \sum_{j=1}^{J-1} a_j \exp(s^{\mathsf{T}}\beta_j)}{1 + \sum_{j=1}^{J-1} \exp(s^{\mathsf{T}}\beta_j)} : \tilde\beta = (\beta_1, \ldots, \beta_{J-1}),\ \tilde\beta \in B \subset \mathbb{R}^{p(J-1)} \right\}
$$

for a compact set $B$ and $2 \le J < \infty$, where $a = (a_1, \ldots, a_J)$. Then there exists $b_0 < \infty$ such that, for any $1 \le r \le \infty$, $J_{[]}\{\infty, \Pi, L_r(P)\} \le b_0\|S\|_r\sqrt{p(J-1)\pi}$, which is finite whenever $\|S\|_r < \infty$. Furthermore, $\sup_{\|\tilde\beta_1 - \tilde\beta_2\| \le \delta} E\left| \pi_{\tilde\beta_1}(A; S) - \pi_{\tilde\beta_2}(A; S) \right| \to 0$ as $\delta \to 0$.

Proof of Lemma 8.2. For $\tilde\beta_1, \tilde\beta_2 \in B$, define $d(\tilde\beta_1, \tilde\beta_2) = \max_{1 \le j \le J-1}\|\tilde\beta_{1j} - \tilde\beta_{2j}\|$ and $b_0 = \sup_{\tilde\beta_1, \tilde\beta_2 \in B}\|\tilde\beta_1 - \tilde\beta_2\| < \infty$, which is finite because $B$ is compact. By the mean value theorem, for any $\tilde\beta_1, \tilde\beta_2 \in B$, there exists a point $\tilde\beta^*$ on the line segment between $\tilde\beta_1$ and $\tilde\beta_2$ such that

$$
\pi_{\tilde\beta_1}(a; s) - \pi_{\tilde\beta_2}(a; s) = \frac{1}{1 + \sum_{j=1}^{J-1}\exp(s^{\mathsf{T}}\tilde\beta^*_j)}\left[ \sum_{j=1}^{J-1}\left\{ a_j - \pi_{\tilde\beta^*}(a; s) \right\}\exp(s^{\mathsf{T}}\tilde\beta^*_j)\, s^{\mathsf{T}}(\tilde\beta_{1j} - \tilde\beta_{2j}) \right],
$$

which implies that

$$
\left| \pi_{\tilde\beta_1}(a; s) - \pi_{\tilde\beta_2}(a; s) \right| \le \|s\|\, d(\tilde\beta_1, \tilde\beta_2). \tag{9}
$$

It follows from equation (9) that assumption 7 holds for this particular class of policies. Now, $N_{[]}\{2\epsilon\|S\|_r, \Pi, L_r(P)\} \le N(\epsilon, B, d)$ by Theorem 9.23 of Kosorok (2008). Furthermore, $N(\epsilon, B, d) \le \max\{(b_0/\epsilon)^{p(J-1)}, 1\}$, and thus

$$
\begin{aligned}
J_{[]}\left\{ \infty, \Pi, L_r(P) \right\} &\le 2\|S\|_r \int_0^{b_0} \sqrt{p(J-1)\left\{ \log b_0 + \log(1/\epsilon) \right\}}\, d\epsilon \\
&\le 2\|S\|_r b_0 \sqrt{p(J-1)} \int_0^1 \sqrt{\log(1/\epsilon)}\, d\epsilon \\
&= 2\|S\|_r b_0 \sqrt{p(J-1)} \int_0^\infty u^{1/2}\exp(-u)\, du = \|S\|_r b_0 \sqrt{p(J-1)\pi},
\end{aligned}
$$

which proves the result.
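
As a concrete illustration of this policy class, the following Python sketch (hypothetical dimensions, parameter values, and function names) evaluates π_β̃(a; s) in the multinomial logit form above and numerically spot-checks the Lipschitz bound in equation (9) on random draws.

```python
import numpy as np

rng = np.random.default_rng(1)
p, J = 3, 4                                   # state dimension and number of actions (illustrative)

def pi_beta(beta, a_idx, s):
    """Multinomial logit policy: beta has shape (J-1, p); a_idx in {0, ..., J-1}."""
    logits = s @ beta.T                       # (J-1,) vector of s^T beta_j
    denom = 1.0 + np.exp(logits).sum()
    probs = np.append(np.exp(logits), 1.0) / denom   # last entry is the reference action a_J
    return probs[a_idx]

# Numerical check of |pi_{b1} - pi_{b2}| <= ||s|| * max_j ||b1_j - b2_j|| from equation (9).
for _ in range(1000):
    s = rng.normal(size=p)
    b1, b2 = rng.normal(size=(J - 1, p)), rng.normal(size=(J - 1, p))
    a = rng.integers(J)
    lhs = abs(pi_beta(b1, a, s) - pi_beta(b2, a, s))
    rhs = np.linalg.norm(s) * np.max(np.linalg.norm(b1 - b2, axis=1))
    assert lhs <= rhs + 1e-12
```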

Additional simulation results

Table 8:

Monte Carlo value estimates for offline simulations with γ = 0.9, where P is the LASSO penalty and Ω is the identity.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.123 (0.0773) 0.117 (0.1067) 0.128 (0.0960) 0.025 (0.0330) −0.005
36 0.117 (0.0900) 0.120 (0.0933) 0.138 (0.0992) 0.030 (0.0341) −0.004
48 0.122 (0.0782) 0.103 (0.1002) 0.141 (0.0878) 0.028 (0.0301) 0.000
50 24 0.109 (0.0727) 0.122 (0.0954) 0.153 (0.0692) 0.028 (0.0321) −0.005
36 0.137 (0.0782) 0.127 (0.1061) 0.141 (0.0816) 0.024 (0.0285) 0.003
48 0.110 (0.0761) 0.127 (0.0860) 0.147 (0.0778) 0.029 (0.0347) 0.000
100 24 0.125 (0.0802) 0.129 (0.0854) 0.164 (0.0609) 0.027 (0.0289) −0.001
36 0.151 (0.0739) 0.148 (0.0822) 0.131 (0.0897) 0.025 (0.0356) −0.002
48 0.131 (0.0726) 0.132 (0.0814) 0.169 (0.0666) 0.030 (0.0325) −0.001

Table 9:

Monte Carlo value estimates for offline simulations with γ = 0.9, where P is the Euclidean norm penalty and Ω is the inverse Fisher information.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.136 (0.0862) 0.118 (0.0925) 0.153 (0.0785) 0.022 (0.0272) −0.005
36 0.147 (0.0768) 0.132 (0.1057) 0.124 (0.0909) 0.026 (0.0355) −0.004
48 0.128 (0.0897) 0.146 (0.0826) 0.116 (0.1067) 0.020 (0.0317) 0.000
50 24 0.113 (0.0954) 0.129 (0.0964) 0.123 (0.1052) 0.027 (0.0275) −0.005
36 0.116 (0.0973) 0.149 (0.0940) 0.152 (0.0798) 0.029 (0.0289) 0.003
48 0.109 (0.0899) 0.132 (0.0998) 0.124 (0.0932) 0.024 (0.0285) 0.000
100 24 0.167 (0.0652) 0.155 (0.0743) 0.144 (0.0897) 0.025 (0.0291) −0.002
36 0.167 (0.0731) 0.153 (0.0988) 0.155 (0.0851) 0.027 (0.0311) −0.002
48 0.137 (0.0868) 0.175 (0.0615) 0.148 (0.0978) 0.026 (0.0332) −0.001

Table 10:

Monte Carlo value estimates for offline simulations with γ = 0.9, where P is the LASSO penalty and Ω is the inverse Fisher information.

n T Linear VL Polynomial VL Gaussian VL GGQ Observed
25 24 0.123 (0.0750) 0.123 (0.0912) 0.140 (0.0930) 0.025 (0.0301) −0.005
36 0.139 (0.0780) 0.110 (0.1013) 0.138 (0.0813) 0.024 (0.0344) −0.004
48 0.135 (0.0690) 0.110 (0.1177) 0.143 (0.0718) 0.023 (0.0282) 0.000
50 24 0.118 (0.0705) 0.124 (0.0994) 0.137 (0.0802) 0.030 (0.0287) −0.006
36 0.117 (0.0827) 0.123 (0.0972) 0.121 (0.0804) 0.030 (0.0292) 0.003
48 0.128 (0.0807) 0.113 (0.1085) 0.137 (0.0921) 0.023 (0.0282) 0.000
100 24 0.131 (0.0563) 0.123 (0.1015) 0.167 (0.0472) 0.029 (0.0295) −0.001
36 0.132 (0.0735) 0.148 (0.0851) 0.161 (0.0670) 0.029 (0.0334) −0.002
48 0.149 (0.0612) 0.137 (0.1003) 0.156 (0.0687) 0.023 (0.0267) −0.001

Contributor Information

Daniel J. Luckett, Department of Biostatistics, University of North Carolina at Chapel Hill.

Eric B. Laber, Department of Statistics, North Carolina State University.

Anna R. Kahkoska, Department of Nutrition, University of North Carolina at Chapel Hill.

David M. Maahs, Department of Pediatrics, Stanford University.

Elizabeth Mayer-Davis, Department of Nutrition, University of North Carolina at Chapel Hill.

Michael R. Kosorok, Department of Biostatistics, University of North Carolina at Chapel Hill.

References

  1. Ali AA, Hossain SM, Hovsepian K, Rahman MM, Plarre K, and Kumar S (2012). mPuff: Automated detection of cigarette smoking puffs from respiration measurements. In Proceedings of the 11th International Conference on Information Processing in Sensor Networks, pp. 269–280. ACM.
  2. Bergenstal RM, Garg S, Weinzimer SA, Buckingham BA, Bode BW, Tamborlane WV, and Kaufman FR (2016). Safety of a hybrid closed-loop insulin delivery system in patients with type 1 diabetes. Journal of the American Medical Association 316 (13), 1407–1408.
  3. Bexelius C, Löf M, Sandin S, Lagerros YT, Forsum E, and Litton J-E (2010). Measures of physical activity using cell phones: Validation using criterion methods. Journal of Medical Internet Research 12 (1), e2.
  4. Chakraborty B and Moodie EE (2013). Statistical Methods for Dynamic Treatment Regimes. Springer.
  5. Cobry E, McFann K, Messer L, Gage V, VanderWel B, Horton L, and Chase HP (2010). Timing of meal insulin boluses to achieve optimal postprandial glycemic control in patients with type 1 diabetes. Diabetes Technology & Therapeutics 12 (3), 173–177.
  6. Dai Y-H (2002). Convergence properties of the BFGS algorithm. SIAM Journal on Optimization 13 (3), 693–701.
  7. Doya K (2000). Reinforcement learning in continuous time and space. Neural Computation 12 (1), 219–245.
  8. Ertefaie A (2014). Constructing dynamic treatment regimes in infinite-horizon settings. arXiv preprint arXiv:1406.0764.
  9. Free C, Phillips G, Watson L, Galli L, Felix L, Edwards P, Patel V, and Haines A (2013). The effectiveness of mobile-health technologies to improve health care service delivery processes: A systematic review and meta-analysis. PLoS Medicine 10 (1), e1001363.
  10. Haller MJ, Stalvey MS, and Silverstein JH (2004). Predictors of control of diabetes: Monitoring may be the key. The Journal of Pediatrics 144 (5), 660–661.
  11. Hastie T, Tibshirani R, and Friedman JH (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
  12. Hernan MA and Robins JM (2010). Causal Inference. Boca Raton, FL: CRC Press.
  13. Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A, and Murphy SA (2015). Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34 (S), 1220.
  14. Kober J and Peters J (2012). Reinforcement learning in robotics: A survey. In Reinforcement Learning, pp. 579–610. Springer.
  15. Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. New York: Springer.
  16. Kosorok MR and Moodie EE (2015). Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine, Volume 21. SIAM.
  17. Kowalski A (2015). Pathway to artificial pancreas systems revisited: Moving downstream. Diabetes Care 38 (6), 1036–1043.
  18. Kumar S, Nilsen WJ, Abernethy A, Atienza A, Patrick K, Pavel M, Riley WT, Shar A, Spring B, Spruijt-Metz D, et al. (2013). Mobile health technology evaluation: The mHealth evidence workshop. American Journal of Preventive Medicine 45 (2), 228–236.
  19. Laber EB, Linn KA, and Stefanski LA (2014). Interactive model building for Q-learning. Biometrika 101 (4), 831.
  20. Lai TL and Robbins H (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), 4–22.
  21. Levine B-S, Anderson BJ, Butler DA, Antisdel JE, Brackett J, and Laffel LM (2001). Predictors of glycemic control and short-term adverse outcomes in youth with type 1 diabetes. The Journal of Pediatrics 139 (2), 197–203.
  22. Liao P, Klasnja P, Tewari A, and Murphy SA (2016). Sample size calculations for micro-randomized trials in mHealth. Statistics in Medicine 35 (12), 1944–1971.
  23. Linn KA, Laber EB, and Stefanski LA (2017). Interactive Q-learning for quantiles. Journal of the American Statistical Association 112 (518), 638–649.
  24. Long N, Gianola D, Rosa GJ, Weigel KA, Kranis A, and Gonzalez-Recio O (2010). Radial basis function regression methods for predicting quantitative traits using SNP markers. Genetics Research 92 (3), 209–225.
  25. Ly TT, Breton MD, Keith-Hynes P, De Salvo D, Clinton P, Benassi K, Mize B, Chernavvsky D, Place J, Wilson DM, et al. (2014). Overnight glucose control with an automated, unified safety system in children and adolescents with type 1 diabetes at diabetes camp. Diabetes Care 37 (8), 2310–2316.
  26. Ly TT, Roy A, Grosman B, Shin J, Campbell A, Monirabbasi S, Liang B, von Eyben R, Shanmugham S, Clinton P, et al. (2015). Day and night closed-loop control using the integrated Medtronic hybrid closed-loop system in type 1 diabetes at diabetes camp. Diabetes Care 38 (7), 1205–1211.
  27. Maahs DM, Mayer-Davis E, Bishop FK, Wang L, Mangan M, and McMurray RG (2012). Outpatient assessment of determinants of glucose excursions in adolescents with type 1 diabetes: Proof of concept. Diabetes Technology & Therapeutics 14 (8), 658–664.
  28. Maei HR, Szepesvári C, Bhatnagar S, and Sutton RS (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 719–726.
  29. Moodie EE, Richardson TS, and Stephens DA (2007). Demystifying optimal dynamic treatment regimes. Biometrics 63 (2), 447–455.
  30. Murphy S, Deng Y, Laber E, Maei H, Sutton R, and Witkiewitz K (2016). A batch, off-policy, actor-critic algorithm for optimizing the average reward. arXiv preprint arXiv:1607.05047.
  31. Murphy SA (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B 65 (2), 331–355.
  32. Murphy SA (2005). A generalization error for Q-learning. Journal of Machine Learning Research 6 (Jul), 1073–1097.
  33. Murphy SA, van der Laan MJ, and Robins JM (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96 (456), 1410–1423.
  34. Puterman ML (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  35. Quinn CC, Shardell MD, Terrin ML, Barr EA, Ballew SH, and Gruber-Baldini AL (2011). Cluster-randomized trial of a mobile phone personalized behavioral intervention for blood glucose control. Diabetes Care 34 (9), 1934–1942.
  36. R Core Team (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  37. Robins JM (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium on Biostatistics, pp. 189–326. Springer.
  38. Rubin D (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics 6 (1), 34–58.
  39. Schulte PJ, Tsiatis AA, Laber EB, and Davidian M (2014). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science 29 (4), 640–661.
  40. Steinhubl SR, Muse ED, and Topol EJ (2013). Can mobile health technologies transform health care? Journal of the American Medical Association 310 (22), 2395–2396.
  41. Sutton R and Barto A (1998). Reinforcement Learning: An Introduction. The MIT Press.
  42. Tang Y and Kosorok MR (2012). Developing adaptive personalized therapy for cystic fibrosis using reinforcement learning. The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series, Working Paper 30.
  43. Weinzimer SA, Steil GM, Swan KL, Dziura J, Kurtz N, and Tamborlane WV (2008). Fully automated closed-loop insulin delivery versus semiautomated hybrid control in pediatric patients with type 1 diabetes using an artificial pancreas. Diabetes Care 31 (5), 934–939.
  44. Wolever T and Mullan Y (2011). Sugars and fat have different effects on postprandial glucose responses in normal and type 1 diabetic subjects. Nutrition, Metabolism and Cardiovascular Diseases 21 (9), 719–725.
  45. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 (4), 1010–1018.
  46. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 (3), 681–694.
  47. Zhao Y, Kosorok MR, and Zeng D (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28 (26), 3294–3315.
  48. Ziegler R, Heidtmann B, Hilgard D, Hofer S, Rosenbauer J, and Holl R (2011). Frequency of SMBG correlates with HbA1c and acute complications in children and adolescents with type 1 diabetes. Pediatric Diabetes 12 (1), 11–17.
