Author manuscript; available in PMC: 2021 Oct 6.
Published in final edited form as: Mach Learn. 2021 Jun 21;110(9):2685–2727. doi: 10.1007/s10994-021-05995-8

IntelligentPooling: Practical Thompson Sampling for mHealth

Sabina Tomkins 1, Peng Liao 2, Predrag Klasnja 3, Susan Murphy 4
PMCID: PMC8494236  NIHMSID: NIHMS1718065  PMID: 34621105

Abstract

In mobile health (mHealth), smart devices deliver behavioral treatments repeatedly over time to a user with the goal of helping the user adopt and maintain healthy behaviors. Reinforcement learning appears ideal for learning how to optimally make these sequential treatment decisions. However, significant challenges must be overcome before reinforcement learning can be effectively deployed in a mobile healthcare setting. In this work we are concerned with the following challenges: 1) individuals who are in the same context can exhibit differential response to treatments, 2) only a limited amount of data is available for learning on any one individual, and 3) responses to treatment are non-stationary. To address these challenges we generalize Thompson-Sampling bandit algorithms to develop IntelligentPooling. IntelligentPooling learns personalized treatment policies, thus addressing challenge one. To address the second challenge, IntelligentPooling updates each user’s degree of personalization while making use of available data on other users to speed up learning. Lastly, IntelligentPooling allows responsivity to vary as a function of a user’s time since beginning treatment, thus addressing challenge three.

1. Introduction

Mobile health (mHealth) applications deliver treatments in users’ everyday lives to support healthy behaviors. These mHealth applications offer an opportunity to impact health across a diverse range of domains, from substance use [46], to disease self-management [25], to physical inactivity [15]. For example, to help users increase their physical activity, an mHealth application might send walking suggestions at the times and in the contexts (e.g. current location or recent physical activity) when a user is likely to be able to pursue the suggestions. A goal of mHealth applications is to provide treatments in contexts in which users need support while avoiding over-treatment. Over-treatment can lead to user disengagement [41]; for example, users might ignore treatments or even delete the application. Consequently, the goal is to be able to learn an optimal policy for when and how to intervene for each user and context without over-treating.

Contextual bandit algorithms appear ideal for this task. Contextual bandit algorithms have been successful in a range of application settings from news recommendations [34] to education [43]. However, as we discuss below, many challenges remain to adapt contextual bandit algorithms for mHealth settings. Thompson sampling offers an attractive framework for addressing these challenges. In their seminal work [3], Agrawal and Goyal show that Thompson sampling for contextual bandits, which works well in practice, can also achieve strong theoretical guarantees. In our work, we propose a Thompson sampling contextual bandit algorithm which introduces a mixed effects structure for the weights on the feature vector, an algorithm we call IntelligentPooling. We demonstrate empirically that IntelligentPooling has many advantages. We also derive a high-probability regret bound for our approach which achieves similar regret to [3]. Unlike [3], our regret bound depends on the variance components introduced by the mixed effects structure which is at the center of our approach.

1.1. Challenges

There are significant challenges to learning optimal policies in mHealth. This work primarily addresses the challenge of learning personalized user policies from limited data. Contextual bandit algorithms can be viewed as algorithms that use the user’s context to adapt treatment. While this approach can have advantages compared to ignoring the user’s context, it fails to address that users can respond differentially to treatments even when they appear to be in the same context. This occurs since sensors on smart devices are unlikely to record all aspects of a user’s context that affect their health behaviors. For example, the context may not include social constraints on the user (e.g., care-giving responsibilities), which may influence the user’s ability to be active. Thus, algorithms that can learn from the differential responsiveness to treatment are desirable. This motivates the need for an algorithm that not only incorporates contextual information, but that can also learn personalized policies. A natural first approach would be to use the algorithm separately for each user, but the algorithm is likely to learn very slowly if data on a user is sparse and/or noisy. However, typically in mHealth studies multiple users are using the application at any given time. Thus an algorithm that pools data over users intelligently so as to speed up learning of personalized policies is desirable.

An additional challenge is non-stationary responses to treatment (e.g., a non-stationary reward function). For example, in the beginning of a study, a user might be excited to receive a treatment; however, after a few weeks this excitement can wane. This motivates the need for algorithms that can learn time-varying treatment policies.

1.2. Contributions

We develop IntelligentPooling, a type of Thompson sampling contextual bandit algorithm specifically designed to overcome the above challenges. Our main contributions are:

  • IntelligentPooling: A Thompson sampling contextual bandit algorithm for rapid personalization in limited data settings. This algorithm employs classical random effects in the reward function [32, 47] and empirical Bayes [10, 39] to adaptively adjust the degree to which policies are personalized to each user. We present an analysis of this adaptivity in Section 3.5 showing that IntelligentPooling can learn to personalize to a user as a function of the observed variance in the treatment effect both between and within users.

  • A high probability regret bound for IntelligentPooling.

  • An empirical evaluation of IntelligentPooling in a simulation environment constructed from mHealth data. IntelligentPooling not only achieves 26% lower regret than a state-of-the-art approach, it is also better able to adapt to the degree of heterogeneity present in a population than that approach.

  • Feasibility of IntelligentPooling from a pilot study in a live clinical trial. We demonstrate that IntelligentPooling can be executed in a real-time online environment and show preliminary evidence of this method’s effectiveness.

  • We show how to modify IntelligentPooling to learn in non-stationary environments.

Next, in Section 2 we discuss relevant related work. In Section 3 we present IntelligentPooling and provide a high-probability regret bound for this algorithm. We then describe how we use historical data to construct a simulation environment and evaluate our approach against state-of-the-art in Section 4. Next, in Section 5 we introduce the feasibility study and provide preliminary evidence into the benefits of this approach. We then discuss how to extend this work to include time-varying effects in Section 6. Finally, we discuss the limitations with our approach in Section 7 before concluding.

2. Related Work

To put the proposed work in a broader healthcare perspective, an overview of similar work in mHealth is provided by Section 2.1. Next, we discuss the extent to which reinforcement learning/bandit algorithms have been deployed in mHealth settings (Section 2.1). IntelligentPooling has similarities with several modeling approaches; here we discuss the most relevant: multi-task learning, meta-learning, Gaussian processes for Thompson Sampling contextual bandits, and time-delayed bandits. These topics are discussed in Sections 2.2 through 2.4.

2.1. Connections to Bandit algorithms in mHealth

Bandit algorithms in mHealth have typically used one of two approaches. The first approach is person specific, that is, an algorithm is deployed separately on each user, such as in [45], [26], [21] and [37]. This approach makes sense when users are highly heterogeneous, that is, their optimal policies differ greatly one from another. However, this approach can present challenges for policy learning when data is scarce and/or noisy, as in our motivating example of encouraging activity in an mHealth study where only a few decision time-points occur each day (see Xia [59] for an empirical evaluation of the shortcomings of Thompson sampling for personalized contextual bandits in mHealth settings). The second approach completely pools users’ data, that is one algorithm is used on all users so as to learn a common treatment policy both in bandit algorithms [42, 60], and in full reinforcement learning algorithms [14, 62]. This second approach can potentially learn quickly but may result in poor performance if there is large heterogeneity between users. We compare to these two approaches empirically as they not only represent state-of-the-art in practice, they also represent two intuitive theoretical extremes.

In IntelligentPooling we strike a balance between these two extremes, adjusting the degree of pooling to the degree that users are similarly responsive. When users are heterogeneous, IntelligentPooling achieves lower regret than the second approach while learning more quickly than the first approach. When users are homogeneous our method performs as well as the second approach.

2.2. Connections to multi-task learning and meta-learning

Following original work on non-pooled linear contextual bandits [3], researchers have proposed pooling data in a variety of ways. For example, Deshmukh et al. [17] proposed pooling data from different arms of a single bandit problem. Li and Kar [35] used context-sensitive clustering to produce aggregate reward estimates for the bandit algorithm. More relevant to this work are multi-task Gaussian Process (GP) models, e.g., [6, 33, 56]; however, these have been proposed in the prediction as opposed to the reinforcement learning setting. The Gang of Bandits approach [11], which is a generalization of the original LinUCB algorithm for a single task [34], has been shown to be successful when there is prior knowledge on the similarities between users. For example, a known social network graph might provide a mechanism for pooling. It was later extended to the Horde of Bandits in [55], which used Thompson Sampling, allowing the algorithm to deal with a large number of tasks.

Each of the multi-task approaches introduces some concept of similarity between users. The extent to which a given user’s data contributes to another user’s policy is some function of this similarity measure. This is fundamentally different from the approach taken in IntelligentPooling. Rather than determining the extent to which any two users are similar, IntelligentPooling determines the extent to which a given user’s reward function parameters differ from parameters in a population (average over all users) reward function. This approach has the advantage of requiring fewer hyper-parameters, as we do not need to learn a similarity function between users. Instead of a pairwise similarity function it is as if we are learning a similarity between each user and the population average. In the limited data setting, we expect this simpler model to be advantageous.

In meta-learning, one exploits shared structure across tasks to improve performance on new tasks. IntelligentPooling thus shares similarities with meta-learning for reinforcement learning [19, 20, 24, 40, 51, 63]. At a high level, one can view our method as a form of meta-learning where the population-level parameters are learned from all available data and each user’s parameters represent deviations from the shared parameters. However, while meta-learning might require a large collection of source tasks, we demonstrate the efficacy of our approach on data on the small scale found in clinical mHealth studies.

2.3. Connections to Gaussian process models for Thompson sampling contextual bandits

IntelligentPooling is based on a Bayesian mixed effects model of the reward, which is similar to using a Gaussian Process (GP) model with a simple form of the kernel. GP models have been used for multi-armed bandits [5, 8, 13, 16, 18, 53, 57], and for contextual bandits [31, 34]. However, the above approaches do not structure the way in which the pooling of data across users occurs. IntelligentPooling uses a mixed effects GP model to pool across users in a structured manner. Although mixed effects GP models have been previously used for off-line data analysis [38, 52], to the best of our knowledge they have not been previously used in the online decision making setting considered in this work.

2.4. Connection to non-stationary linear bandits

There is a growing literature investigating how to adapt linear bandit algorithms to changing environments. A common approach is for the learning algorithm to differentially weight data across time. Differential weighting is used by both Russac et al. [48] (using a LinUCB algorithm) and Kim and Tewari [27] (using perturbation-based algorithms). Cheung et al. [12] use a linear moving window to estimate the parameters in the reward function, and Zhao et al. [61] restart the algorithm at regular intervals, discarding the prior data. Similarly, Bogunovic et al. [5], using GP-based UCB algorithms, accommodate non-stationarity by both restarting and using an autoregressive model for the reward function. Kim and Tewari [28] analyze the non-stationary setting with randomized exploration.

IntelligentPooling allows for non-stationary reward functions by the use of time-varying random effects. The correlation between the time-varying random effects induces a weighted estimator whereby more weight is put on the recently collected samples, similar to the discounted estimators in [48] and [27]. In contrast to existing approaches, IntelligentPooling considers both individual and time-specific variation.

3. Intelligent Pooling

IntelligentPooling is a generalization of a Thompson sampling contextual bandit for learning personalized treatment policies. We first outline the components of IntelligentPooling and then introduce the problem definition in Section 3.2. As our approach offers a natural alternative to two commonly used approaches, we begin by describing these simpler methods in Section 3.3. We introduce our method in Section 3.4.

3.1. Overview

The central component of IntelligentPooling is a Bayesian model for the reward function. In particular, IntelligentPooling uses a Gaussian mixed effects linear model for the reward function. Mixed effects models are widely used across the health and behavioral sciences to model the variation in the linear model parameters across users [32, 47] and within a user across time. Use of these models enhances the ability of domain scientists to inform and critique the model used in IntelligentPooling. The properties and pitfalls of these models are well understood; see [44] for an application of a mixed effects model in mHealth. IntelligentPooling uses Bayesian inference for the mixed effects model. As discussed in Section 2.3, a Bayesian mixed effects linear model is a GP model with a simple kernel. This facilitates increasing the flexibility of the model for the reward function, given sufficient data.

Furthermore, IntelligentPooling uses Thompson sampling [54], also known as posterior sampling [49], to select actions. At each decision point, the parameters in the model for the reward function are sampled from their posterior distribution, thus inducing exploration over the action space [50]. These sampled parameters are then used to form an estimated reward function and the action with the highest estimated reward is selected.

The hyper-parameters (e.g., the variance of the random effects) control the extent of pooling across users and across decision times. The right amount of pooling depends on the heterogeneity among users and the non-stationarity, which is often difficult to pre-specify. Unlike other bandit algorithms in which the hyper-parameters are set at the beginning [11, 17, 55], IntelligentPooling includes a procedure for updating the hyper-parameters online. In particular, empirical Bayes [9] is used to update the hyper-parameters in the online setting, as more data becomes available.

3.2. Problem formulation

Consider an mHealth study which will recruit a total of $N$ users.$^{1}$ Let $i \in [N] = \{1, \dots, N\}$ be a user index. For each user, we use $k \in \{1, 2, \dots\}$ to index decision times, i.e., times at which a treatment could be provided. Denote by $S_{i,k}$ the states/contexts at the $k$th decision time of user $i$. For simplicity, we focus on the case where the action is binary, i.e., $A_{i,k} \in \{0, 1\}$. The algorithm can be easily generalized to cases with more than two actions. After the action $A_{i,k}$ is chosen, the reward $R_{i,k}$ is observed. Throughout the remainder of the paper, $S$, $A$ and $R$ are random variables and we use lowercase ($s$, $a$ and $r$) to refer to a realization of these random variables.

Below we consider a simpler setting where the parameters in the reward are assumed time-stationary. We discuss how to generalize the algorithm to the non-stationary setting in Section 6. The goal is to learn personalized treatment policies for each of the N users. We treat this as N contextual bandit problems as the reward function may differ between users. In mHealth settings this might occur due to the inability of sensors to record users’ entire contexts. Section 3.3 reviews two approaches for using Thompson Sampling [2] and Section 3.4 presents IntelligentPooling, our approach for learning the treatment policy for any specific user.

3.3. Two Thompson Sampling instantiations

First, consider learning the treatment policy separately per person. We refer to this approach as Person-Specific. At each decision time $k$, we would like to select a treatment $A_{i,k} \in \{0, 1\}$ based on the context $S_{i,k}$. We model the reward $R_{i,k}$ by a Bayesian linear regression model: for user $i$ and time $k$,

$$R_{i,k} = \phi(S_{i,k}, A_{i,k})^\top w_i + \epsilon_{i,k}, \tag{1}$$

where $\phi(s, a)$ is a pre-specified mapping from a context $s$ and treatment $a$ (e.g., those described in Section 4.2), $w_i$ is a vector of weights which we will learn, and $\epsilon_{i,k} \sim N(0, \sigma_\epsilon^2)$ is the error term. The weight vectors $\{w_i\}$ are assumed independent across users and to follow a common prior distribution $w_i \sim N(\mu_w, \Sigma_w)$. See Fig. 1 for a graphical representation of this approach.

Figure 1:

Consider a setting with two users. Here we show the relationship between select random variables in our model: $R_{i,k}$, the reward for user $i$ at decision time $k$; $\sigma^2_{\epsilon_{i,k}}$, the noise for user $i$ at time $k$; and $w_i$, the latent weight vector for user $i$. In Person-Specific we see that each user’s parameters are independent. Only the prior parameter values are shared; all else is updated independently.

Now at the $k$th decision time with the context $S_{i,k} = s$, Person-Specific selects the treatment $A_{i,k} = 1$ with probability

$$\pi_{i,k} = \Pr\{\phi(s, 1)^\top \tilde{w}_{i,k} > \phi(s, 0)^\top \tilde{w}_{i,k}\}, \tag{2}$$

where $\tilde{w}_{i,k}$ follows the posterior distribution of the parameters $w_i$ in the model (1) given the user’s history up to the current decision time $k$. We emphasize that in this formulation the posterior distribution of $w_i$ is formed based on each user’s own data.
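Because the posterior in this model is Gaussian, the probability in Eqn. 2 can be evaluated without Monte Carlo sampling: the difference $(\phi(s,1) - \phi(s,0))^\top \tilde{w}$ is itself a scalar Gaussian. The following is a minimal sketch of that calculation under this assumption; the function names, the two-dimensional feature vectors, and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def action_probability(w_hat, Sigma, phi1, phi0):
    """Pr{phi1^T w > phi0^T w} for w ~ N(w_hat, Sigma).

    With d = phi1 - phi0, the scalar d^T w is Gaussian with mean d^T w_hat
    and variance d^T Sigma d, so the probability is a normal CDF evaluated
    at mean / standard deviation.
    """
    d = [a - b for a, b in zip(phi1, phi0)]
    mean = sum(di * wi for di, wi in zip(d, w_hat))
    var = sum(d[r] * Sigma[r][c] * d[c]
              for r in range(len(d)) for c in range(len(d)))
    if var <= 0.0:  # degenerate posterior: the comparison is deterministic
        return 1.0 if mean > 0.0 else 0.0
    return normal_cdf(mean / math.sqrt(var))

# Hypothetical features: an intercept and a treatment indicator.
w_hat = [0.5, 0.2]                      # assumed posterior mean
Sigma = [[1.0, 0.0], [0.0, 0.04]]       # assumed posterior covariance
pi = action_probability(w_hat, Sigma, phi1=[1.0, 1.0], phi0=[1.0, 0.0])
print(round(pi, 4))
```

In this example only the treatment coordinate differs between the two feature vectors, so the probability depends on the posterior of that single weight (mean 0.2, standard deviation 0.2).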

The opposite approach is to learn a common bandit model for all users. In this approach, the reward model is a single Bayesian regression model with no individual-level parameters:

$$R_{i,k} = \phi(S_{i,k}, A_{i,k})^\top w + \epsilon_{i,k}, \tag{3}$$

where the common parameter vector $w$ follows the prior distribution $w \sim N(\mu_w, \Sigma_w)$. See Fig. 2 for the graphical representation of this approach. We then use the posterior distribution of the weight vector $w$ to sample treatments for each user. Here the posterior is calculated based on the available data from all users observed up to and including time $k$. This approach, which we refer to as Complete, may suffer from high bias when there is significant heterogeneity among users.

Figure 2:

Consider a setting with two users. Here we show the relationship between select random variables in our model: $R_{i,k}$, the reward for user $i$ at decision time $k$; $\epsilon_k$, the noise at time $k$; and $w_{\mathrm{pop}}$, the latent weight vector. In Complete we see that each user’s parameters are the same. With each parameter update the weight vector for every user is also updated.

3.4. Intelligent pooling across bandit problems

IntelligentPooling is an alternative to the two approaches mentioned above. Specifically, in IntelligentPooling data is pooled across users in an adaptive way, i.e., when there is strong homogeneity observed in the current data, the algorithm will pool more from others than when there is strong heterogeneity.

Model specification

We model the reward associated with taking action $A_{i,k}$ for user $i$ at decision time $k$ by the linear model (1). Unlike Person-Specific, where the person-specific weight vectors $\{w_i, i \in [N]\}$ are assumed to be independent of each other, IntelligentPooling imposes structure on the $w_i$’s, in particular, a random-effects structure [32, 47]:

$$w_i = w_{\mathrm{pop}} + u_i, \tag{4}$$

where $w_{\mathrm{pop}}$ is a population-level parameter and $u_i$ is a random effect that represents the person-specific deviation from $w_{\mathrm{pop}}$ for user $i$. The extent to which the posterior means for $w_{\mathrm{pop}}$ and $u_i$ are based on user $i$’s data relative to the population depends on the variances of the random effects (for a stylized example of this see Section 3.5). In Section 6 we show how we can modify this structure to include time-specific parameters, or a time-specific random effect. A graphical representation for IntelligentPooling is shown in Fig. 3.

Figure 3:

Consider a setting with two users. Here we show the relationship between select random variables in our model: $R_{i,k}$, the reward for user $i$ at decision time $k$; $\epsilon_{i,k}$, the noise for user $i$ at time $k$; $w_{\mathrm{pop}}$, the latent weight vector; and $u_i$, the random effect for user $i$. In IntelligentPooling we see that some parameters ($w_{\mathrm{pop}}$) are shared across the population while others ($u_i$) are user specific.

We assume the prior on $w_{\mathrm{pop}}$ is Gaussian with prior mean $\mu_w$ and variance $\Sigma_w$. The random effect $u_i$ is also assumed to be Gaussian with mean $0$ and covariance $\Sigma_u$. Furthermore, we assume $u_i \perp u_j$ for $i \neq j$ and $w_{\mathrm{pop}} \perp \{u_i\}$. The prior parameters $\mu_w$, $\Sigma_w$ as well as the variance of the random effect $\Sigma_u$, and the residual variance $\sigma_\epsilon^2$ are hyper-parameters. In (4), there is a random effect $u_i$ on each element of $w_i$. In practice, one can use domain knowledge to specify which of the parameters should include random effects; this will be the case in the feasibility study described in Section 5. Conditioned on the latent variables ($w_{\mathrm{pop}}$, $u_i$), as well as the current context and action, the expected reward is

$$E[R_{i,k} \mid w_{\mathrm{pop}}, u_i, S_{i,k} = s, A_{i,k} = a] = \phi(s, a)^\top (w_{\mathrm{pop}} + u_i).$$

Model connections to Gaussian Processes

Under the Gaussian assumption on the distribution of the reward and prior, the Bayesian linear model of the reward (1) together with the random effect model (4) can be viewed as an example of a Gaussian Process with a special kernel (see Eqn. 5). We use this connection to derive the posterior distribution and facilitate the hyper-parameter selection. An additional advantage of viewing the Bayesian mixed effects model as a Gaussian Process model is that we can now flexibly redesign our reward model simply by introducing new kernel functions. Here, we assume a linear model with person-specific random effects. In Section 6 we discuss a generalization to time-specific random effects. Additionally, one could adopt non-linear kernels and incorporate more complex structures on the reward function.
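To make this connection concrete, the covariance structure implied by Eqns. 1 and 4 can be worked out directly. The short derivation below treats the contexts and actions as fixed inputs and uses only the independence assumptions stated above; the indices $j$ and $l$ are introduced here for illustration.

```latex
\begin{align*}
\mathrm{Cov}(w_i, w_j)
  &= \mathrm{Cov}(w_{\mathrm{pop}} + u_i,\; w_{\mathrm{pop}} + u_j)
   = \Sigma_w + 1\{i = j\}\,\Sigma_u, \\
\mathrm{Cov}(R_{i,k}, R_{j,l})
  &= \phi(S_{i,k}, A_{i,k})^\top
     \left(\Sigma_w + 1\{i = j\}\,\Sigma_u\right)
     \phi(S_{j,l}, A_{j,l})
     + 1\{i = j,\ k = l\}\,\sigma_\epsilon^2 .
\end{align*}
```

Two rewards from different users are therefore correlated only through the shared population term $\Sigma_w$, while two rewards from the same user are additionally correlated through $\Sigma_u$.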

Posterior distribution of the weights on the feature vector

In the setting where both the prior and the linear model for the reward follow a Gaussian distribution, the posterior distribution of $w_i$ follows a Gaussian distribution and there are analytic expressions for these updates, as shown in [58]. Below we provide the explicit formula of the posterior distribution based on the connection to a Gaussian Process regression. Suppose at the time of updating the posterior distribution, the available data collected from all current users is $D$, where $D$ consists of $n$ tuples of state, action, reward and user index $x = (s, a, r, i)$. The mixed effects model (Eqns. 1 and 4) induces a kernel function $K$. For any two tuples in $D$, e.g., $x_l = (s_l, a_l, r_l, i_l)$, $l = 1, 2$,

$$K(x_1, x_2) = \phi(s_1, a_1)^\top \left(\Sigma_w + 1\{i_1 = i_2\}\Sigma_u\right) \phi(s_2, a_2). \tag{5}$$

Note that the above kernel depends on $\Sigma_w$ and $\Sigma_u$ (the latter is one of the hyper-parameters that will be updated using the empirical Bayes approach; see below).

The kernel matrix K is of size n × n and each element is the kernel value between two tuples in D. The posterior mean and variance of wi given the currently available data D can be calculated by

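$$\hat{w}_i = \mu_w + M_i^\top (K + \sigma_\epsilon^2 I_n)^{-1} \tilde{R}_n, \qquad \Sigma_i = \Sigma_w + \Sigma_u - M_i^\top (K + \sigma_\epsilon^2 I_n)^{-1} M_i, \tag{6}$$

where $\tilde{R}_n$ is the vector of the rewards centered by the prior means, i.e., each element corresponds to a tuple $(s, a, r, j)$ in $D$ and is given by $r - \phi(s, a)^\top \mu_w$, and $M_i$ is a matrix of size $n$ by $p$ (recall $p$ is the length of $w_i$), with each row corresponding to a tuple $(s, a, r, j)$ in $D$ and given by $\phi(s, a)^\top (\Sigma_w + 1\{j = i\}\Sigma_u)$.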
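The kernel in Eqn. 5 and the posterior update in Eqn. 6 translate fairly directly into matrix operations. The following NumPy sketch is one possible way to write them, not the authors' implementation; the function name, the array layout (one row of features per tuple in $D$), and the toy data are assumptions made for illustration.

```python
import numpy as np

def posterior_for_user(i, Phi, rewards, users, mu_w, Sigma_w, Sigma_u, sigma_eps2):
    """Posterior mean and covariance of w_i following Eqn. 6.

    Phi:     (n, p) array, row l is phi(s_l, a_l)
    rewards: (n,) array of observed rewards
    users:   (n,) integer array of user indices, one per tuple
    """
    # Indicator matrix 1{i_l = i_m} for every pair of tuples.
    same = (users[:, None] == users[None, :]).astype(float)
    # Kernel matrix of Eqn. 5, split into population and user-specific parts.
    K = Phi @ Sigma_w @ Phi.T + same * (Phi @ Sigma_u @ Phi.T)
    # Row l of M_i is phi_l^T (Sigma_w + 1{i_l = i} Sigma_u).
    own = (users == i).astype(float)[:, None]
    M = Phi @ Sigma_w + own * (Phi @ Sigma_u)
    R_tilde = rewards - Phi @ mu_w            # rewards centered by prior means
    G = K + sigma_eps2 * np.eye(len(rewards))
    w_hat = mu_w + M.T @ np.linalg.solve(G, R_tilde)
    Sigma_i = Sigma_w + Sigma_u - M.T @ np.linalg.solve(G, M)
    return w_hat, Sigma_i

# Hypothetical data: three observations from user 0 and two from user 1.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
rewards = np.array([0.2, 1.1, 0.9, -0.1, 0.3])
users = np.array([0, 0, 0, 1, 1])
mu_w, Sigma_w, Sigma_u = np.zeros(2), np.eye(2), 0.5 * np.eye(2)
w_hat_0, Sigma_0 = posterior_for_user(0, Phi, rewards, users,
                                      mu_w, Sigma_w, Sigma_u, sigma_eps2=1.0)
```

When all data come from a single user, this computation should agree with ordinary conjugate Bayesian linear regression under the prior $N(\mu_w, \Sigma_w + \Sigma_u)$, which gives a convenient sanity check.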

Treatment selection

To select a treatment for user $i$ at the $k$th decision time, we use the posterior distribution of $w_i$ formed at the most recent update time $T$. That is, for the context $S_{i,k}$ of user $i$ at the $k$th decision time, IntelligentPooling selects the treatment $A_{i,k} = 1$ with the probability calculated by the same formula as in (2) but with a different posterior distribution, as discussed above.

Setting hyper-parameter values

Recall that the algorithm requires the hyper-parameters $\mu_w$, $\Sigma_w$, $\Sigma_u$, and $\sigma_\epsilon^2$. The prior mean $\mu_w$ and variance $\Sigma_w$ of the population parameter $w_{\mathrm{pop}}$ can be set according to previous data or domain knowledge (see Section 5 for a discussion on how the prior distribution is set in the feasibility study). As we mention in Section 3.1, the variance components in the mixed effects model impact how the users pool the data from others (see Section 3.5 for a discussion) and might be difficult to pre-specify. IntelligentPooling uses, at the update times, the empirical Bayes [9] approach to choose/update $\lambda = (\Sigma_u, \sigma_\epsilon^2)$ based on the currently available data. To be more specific, suppose at the time of updating the hyper-parameters, the available data is $D$. We choose $\lambda$ to maximize $l(\lambda \mid D)$, the marginal log-likelihood of the observed reward, marginalized over the population parameters $w_{\mathrm{pop}}$ and the random effects $u_i$. The marginal log-likelihood $l(\lambda \mid D)$ can be expressed as

$$l(\lambda \mid D) = -\frac{1}{2}\left\{\tilde{R}_n^\top \left[K(\lambda) + \sigma_\epsilon^2 I_n\right]^{-1} \tilde{R}_n + \log\det\left[K(\lambda) + \sigma_\epsilon^2 I_n\right] + n \log(2\pi)\right\}, \tag{7}$$

where $K(\lambda)$ is the kernel matrix as a function of parameters $\lambda = (\Sigma_u, \sigma_\epsilon^2)$. The above optimization can be efficiently solved using existing Gaussian Process regression packages; see Section 4.2 for more details.
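As a rough illustration of the empirical Bayes step, the sketch below evaluates Eqn. 7 with NumPy and picks the best hyper-parameters from a small grid. The grid search is a stand-in chosen for simplicity and is not the gradient-based optimization that Gaussian Process packages typically perform; restricting $\Sigma_u$ to the form $\sigma_u^2 I$ is a further simplifying assumption, and all names and data are hypothetical.

```python
import numpy as np

def marginal_log_likelihood(Phi, rewards, users, mu_w, Sigma_w, Sigma_u, sigma_eps2):
    """Evaluate Eqn. 7 for given hyper-parameters (Sigma_u, sigma_eps2)."""
    n = len(rewards)
    same = (users[:, None] == users[None, :]).astype(float)
    K = Phi @ Sigma_w @ Phi.T + same * (Phi @ Sigma_u @ Phi.T)   # Eqn. 5
    G = K + sigma_eps2 * np.eye(n)
    R_tilde = rewards - Phi @ mu_w
    _, logdet = np.linalg.slogdet(G)
    quad = R_tilde @ np.linalg.solve(G, R_tilde)
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

def empirical_bayes_grid(Phi, rewards, users, mu_w, Sigma_w,
                         sigma_u2_grid, sigma_eps2_grid):
    """Grid search over (sigma_u^2, sigma_eps^2), assuming Sigma_u = sigma_u^2 * I.

    Returns (best log-likelihood, best sigma_u^2, best sigma_eps^2).
    """
    p = Phi.shape[1]
    best = None
    for su2 in sigma_u2_grid:
        for se2 in sigma_eps2_grid:
            ll = marginal_log_likelihood(Phi, rewards, users, mu_w,
                                         Sigma_w, su2 * np.eye(p), se2)
            if best is None or ll > best[0]:
                best = (ll, su2, se2)
    return best

# Hypothetical data for two users.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
rewards = np.array([0.2, 1.1, 0.9, -0.1, 0.3])
users = np.array([0, 0, 0, 1, 1])
best_ll, best_su2, best_se2 = empirical_bayes_grid(
    Phi, rewards, users, np.zeros(2), np.eye(2),
    sigma_u2_grid=[0.01, 0.1, 1.0], sigma_eps2_grid=[0.25, 1.0])
```

A small selected $\sigma_u^2$ corresponds to heavier pooling across users, while a large value moves each user toward their own estimate.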

Algorithm 1.

IntelligentPooling

1: Let $T$ be a set of all times at which the algorithm might deliver a treatment or perform a parameter update.
2: Set $\hat{w}_{i,0} = \mu_w$, $\Sigma_{i,0} = \Sigma_w + \Sigma_u$ for all $i$ and $D = \{\}$.
3: for all $t \in T$ do
4:   if $t$ is a decision time then
5:     Receive user index $i$ and decision time index $k$
6:     Collect state variable $S_{i,k}$
7:     Calculate randomization probability $\pi_{i,k} = \Pr_{\tilde{w} \sim N(\hat{w}_i, \Sigma_i)}\{\phi(S_{i,k}, 1)^\top \tilde{w} > \phi(S_{i,k}, 0)^\top \tilde{w}\}$
8:     Sample treatment $A_{i,k} \sim \mathrm{Bern}(\pi_{i,k})$
9:     Collect reward $R_{i,k}$
10:    $D \leftarrow D \cup \{(S_{i,k}, A_{i,k}, R_{i,k}, i)\}$
11:   end if
12:   if $t$ is an update time then
13:     Update the hyper-parameters: $\hat{\lambda} = \arg\max_{\lambda} l(\lambda \mid D)$ in Eqn. 7
14:     Update the posterior mean and covariance $\hat{w}_i, \Sigma_i$ for all $i$ in $D$ by Eqn. 6 with $\hat{\lambda}$
15:   end if
16: end for
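The loop in Algorithm 1 can be sketched end to end on simulated data. The code below is an illustrative sketch only: the feature map `phi`, the simulated ground-truth weights, the interleaving of users, the update schedule, and the restriction $\Sigma_u = \sigma_u^2 I$ are all assumptions, and the hyper-parameter update in step 13 is replaced by a small grid search rather than a Gaussian Process package.

```python
import math
import numpy as np

def gram(Phi, users, Sw, Su, s2):
    """Kernel matrix of Eqn. 5 plus the noise term sigma_eps^2 * I."""
    same = (users[:, None] == users[None, :]).astype(float)
    return Phi @ Sw @ Phi.T + same * (Phi @ Su @ Phi.T) + s2 * np.eye(len(users))

def log_lik(Phi, r, users, mu_w, Sw, Su, s2):
    """Marginal log-likelihood of Eqn. 7."""
    G, Rt = gram(Phi, users, Sw, Su, s2), r - Phi @ mu_w
    return -0.5 * (Rt @ np.linalg.solve(G, Rt) + np.linalg.slogdet(G)[1]
                   + len(r) * math.log(2 * math.pi))

def posterior(i, Phi, r, users, mu_w, Sw, Su, s2):
    """Posterior mean and covariance of w_i from Eqn. 6."""
    G = gram(Phi, users, Sw, Su, s2)
    M = Phi @ Sw + (users == i).astype(float)[:, None] * (Phi @ Su)
    return (mu_w + M.T @ np.linalg.solve(G, r - Phi @ mu_w),
            Sw + Su - M.T @ np.linalg.solve(G, M))

def phi(s, a):
    """Hypothetical feature map: intercept, context, treatment, interaction."""
    return np.array([1.0, s, a, a * s])

def run(N=3, K=30, update_every=10, seed=0):
    rng = np.random.default_rng(seed)
    p = 4
    mu_w, Sw = np.zeros(p), np.eye(p)
    grid = [(su2, se2) for su2 in (0.01, 0.1, 1.0) for se2 in (0.25, 1.0)]
    su2, se2 = grid[0]
    # Simulated ground truth (assumption): w_i = w_pop + u_i.
    w_true = rng.normal(size=p) + 0.3 * rng.normal(size=(N, p))
    post = {i: (mu_w, Sw + su2 * np.eye(p)) for i in range(N)}      # step 2
    feats, rews, ids, probs = [], [], [], []
    for k in range(K):                       # decision times, interleaved over users
        for i in range(N):
            s = rng.normal()                                        # step 6
            w_hat, Sig = post[i]
            d = phi(s, 1) - phi(s, 0)
            sd = math.sqrt(max(d @ Sig @ d, 1e-12))
            # Step 7 in closed form: normal CDF of mean / sd of d^T w.
            pi = 0.5 * (1 + math.erf((d @ w_hat) / (sd * math.sqrt(2))))
            a = int(rng.random() < pi)                              # step 8
            x = phi(s, a)
            rews.append(x @ w_true[i] + 0.5 * rng.normal())         # step 9
            feats.append(x); ids.append(i); probs.append(pi)        # step 10
        if (k + 1) % update_every == 0:                             # update time
            Phi, r, u = np.array(feats), np.array(rews), np.array(ids)
            su2, se2 = max(grid, key=lambda g: log_lik(             # step 13
                Phi, r, u, mu_w, Sw, g[0] * np.eye(p), g[1]))
            post = {i: posterior(i, Phi, r, u, mu_w, Sw,            # step 14
                                 su2 * np.eye(p), se2) for i in range(N)}
    return post, (su2, se2), probs

post, hyper, probs = run()
```

Between update times every user keeps the posterior from the most recent update, which mirrors the separation of decision times and update times in Algorithm 1.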

3.5. Intuition for the use of random effects

IntelligentPooling uses random effects to adaptively pool users’ data based on the degree to which users exhibit heterogeneous rewards. That is, the person-specific random effect should outweigh the population term if users are highly heterogeneous. If users are highly homogeneous, the person-specific random effect should be outweighed by the population term. The amount of pooling is controlled by the hyper-parameters, e.g., the variance components of the random effects.

To gain intuition, we consider a simple setting where the feature vector $\phi$ in the reward model (Eqn. 1) is one-dimensional (i.e., $p = 1$) and there are only two users (i.e., $i = 1, 2$). Denote the prior distributions of population parameter $w_{\mathrm{pop}}$ by $N(0, \sigma_w^2)$ and the random effect $u_i$ by $N(0, \sigma_u^2)$. Below we investigate how the hyper-parameter (e.g., $\sigma_u^2$ in this simple case) impacts the posterior distribution.

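Let $k_i$ be the number of decision times of user $i$ at an updating time. In this simple setting, the posterior mean $\hat{w}_1$ of $w_1$ can be calculated explicitly:

$$\hat{w}_1 = \frac{\left[\delta\gamma + (1 - \gamma^2) C_2\right] Y_1 + \delta\gamma^2 Y_2}{(1 - \gamma^2) C_1 C_2 + \delta\gamma (C_1 + C_2) + (\delta\gamma)^2},$$

where, for $i = 1, 2$, $C_i = \sum_{k=1}^{k_i} \phi(S_{i,k}, A_{i,k})^2$, $Y_i = \sum_{k=1}^{k_i} \phi(S_{i,k}, A_{i,k}) R_{i,k}$, $\gamma = \sigma_w^2 / (\sigma_w^2 + \sigma_u^2)$ and $\delta = \sigma_\epsilon^2 / \sigma_w^2$. Similarly, the posterior mean of $w_2$ is given by

$$\hat{w}_2 = \frac{\left[\delta\gamma + (1 - \gamma^2) C_1\right] Y_2 + \delta\gamma^2 Y_1}{(1 - \gamma^2) C_1 C_2 + \delta\gamma (C_1 + C_2) + (\delta\gamma)^2}.$$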

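When $\sigma_u^2 \to 0$ (i.e., the variance of the random effect goes to 0), we have $\gamma \to 1$ and both posterior means $(\hat{w}_1, \hat{w}_2)$ approach the posterior mean under Complete (Eqn. 3) using prior $N(0, \sigma_w^2)$:

$$\hat{w}_1, \hat{w}_2 \to \frac{Y_1 + Y_2}{C_1 + C_2 + \delta}.$$

Alternatively, when $\sigma_u^2 \to \infty$, we have $\gamma \to 0$ and the posterior means $(\hat{w}_1, \hat{w}_2)$ each approach their respective posterior means under Person-Specific (Eqn. 1) using a non-informative prior:

$$\hat{w}_1 \to \frac{Y_1}{C_1}, \qquad \hat{w}_2 \to \frac{Y_2}{C_2}.$$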

Fig. 4 illustrates that when $\gamma$ goes from 0 to 1, the posterior mean $\hat{w}_i$ smoothly transitions from the Person-Specific estimates to the population estimates.

Figure 4:

The posterior mean $\hat{w}_1$ of $w_1$. As the variance of the random effect $\sigma_u^2$ decreases, $\gamma$ increases and the posterior mean approaches the population-informed estimate (Complete) and departs from the person-specific estimate (Person-Specific).
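The two limits discussed in this section can be checked numerically from the closed-form posterior means. The sketch below implements those expressions for the two-user, one-dimensional case; the sufficient statistics $C_1, C_2, Y_1, Y_2$ and the variance values are hypothetical numbers chosen for illustration.

```python
def pooled_posterior_means(C1, C2, Y1, Y2, sigma_w2, sigma_u2, sigma_eps2):
    """Closed-form posterior means (w1_hat, w2_hat) for p = 1 and two users."""
    gamma = sigma_w2 / (sigma_w2 + sigma_u2)
    delta = sigma_eps2 / sigma_w2
    denom = ((1 - gamma ** 2) * C1 * C2
             + delta * gamma * (C1 + C2)
             + (delta * gamma) ** 2)
    w1 = ((delta * gamma + (1 - gamma ** 2) * C2) * Y1 + delta * gamma ** 2 * Y2) / denom
    w2 = ((delta * gamma + (1 - gamma ** 2) * C1) * Y2 + delta * gamma ** 2 * Y1) / denom
    return w1, w2

# Hypothetical sufficient statistics for the two users.
C1, C2, Y1, Y2 = 10.0, 4.0, 12.0, 1.0

# sigma_u^2 close to 0 (gamma close to 1): both means approach the Complete estimate.
near_complete = pooled_posterior_means(C1, C2, Y1, Y2, 1.0, 1e-9, 1.0)
complete = (Y1 + Y2) / (C1 + C2 + 1.0)     # delta = 1 for these values

# sigma_u^2 very large (gamma close to 0): each mean approaches its Person-Specific estimate.
near_person = pooled_posterior_means(C1, C2, Y1, Y2, 1.0, 1e9, 1.0)
person = (Y1 / C1, Y2 / C2)
```

Intermediate values of `sigma_u2` give estimates between these two extremes, which is the adaptive pooling behavior described above.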

3.6. Regret

We prove a regret bound for a modification of IntelligentPooling similar to that in [2, 55] in a simplified setting. Further details are provided in Appendix A. Let $d$ be the length of the weight vector $w_i$ in the Bayesian mixed effects model of the reward in Eqn. 1. Recall that $\Sigma_w$ is the prior covariance of the weight vector $w_{\mathrm{pop}}$, $\Sigma_u$ is the covariance of the random effect $u_i$ and $\sigma_\epsilon^2$ is the variance of the error term. Let $K_i$ be the number of decision times for user $i$ up to a given calendar time and $T = \sum_{i=1}^{N} K_i$ be the total number of decision times encountered by all $N$ users in the study up to the calendar time. We define the regret of the algorithm after $T$ decision times by $R(T) = \sum_{i=1}^{N} \sum_{k=1}^{K_i} \left[\max_a \phi(S_{i,k}, a)^\top w_i - \phi(S_{i,k}, A_{i,k})^\top w_i\right]$.
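In a simulation, where the true weight vectors are known, this regret can be computed directly from the logged decisions. The following is a small sketch for the binary-action case; the function signature, the feature map, and the example weights are hypothetical.

```python
def regret(decisions, phi, weights):
    """Cumulative regret over logged decisions with binary actions.

    decisions: iterable of (user, state, action) triples
    phi:       feature map phi(state, action) returning a list of floats
    weights:   dict mapping each user to their true weight vector w_i
    """
    def value(user, state, action):
        # Expected reward phi(s, a)^T w_i for this user.
        return sum(x * w for x, w in zip(phi(state, action), weights[user]))

    total = 0.0
    for user, state, action in decisions:
        best = max(value(user, state, 0), value(user, state, 1))
        total += best - value(user, state, action)
    return total

# Hypothetical feature map and a single user whose treatment effect is always 1.
def example_phi(s, a):
    return [1.0, s, float(a), a * s]

example_weights = {0: [0.0, 0.0, 1.0, 0.0]}
example_regret = regret([(0, 0.5, 0), (0, 0.5, 1)], example_phi, example_weights)
```

Here the first decision forgoes a treatment effect of 1 and the second is optimal, so the two decisions together incur a regret of 1.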

Theorem 1

With probability 1 − δ, where δ ∈ (0, 1) the total regret of the modified Thompson Sampling with IntelligentPooling after T total number of decision times is:

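$$R(T) = \tilde{O}\left(dN\sqrt{T}\,\sqrt{\log\left(\frac{\mathrm{Tr}(\Sigma_w) + \mathrm{Tr}(\Sigma_u) + \mathrm{Tr}(\Sigma_u^{-1})}{d} + \frac{T}{\sigma_\epsilon^2 dN}\right)}\,\log\frac{1}{\delta}\right)$$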

Remark

Observe that, up to logarithmic terms, this regret bound is $\tilde{O}(dN\sqrt{T})$. Recall that [55] introduces a similar regret bound for a Thompson Sampling algorithm which utilizes user-similarity information. The bound from [55], $\tilde{O}(dN\sqrt{T/\lambda})$, additionally depends on a hyper-parameter $\lambda$ that is not included in our model. In [55], $\lambda$ controls the strength of prior user-similarity information. Instead of introducing a hyper-parameter, our model follows a mixed effects Bayesian structure which allows user similarities (as expressed in the extent to which users’ data is pooled) to be updated with new data. Thus, in certain regimes of hyper-parameter $\lambda$, IntelligentPooling will incur much smaller regret, as demonstrated empirically in Section 4.3.

4. Experiments

This work was conducted to prepare for deployment of IntelligentPooling in a live trial. Thus, to evaluate IntelligentPooling we construct a simulation environment from a precursor trial, HeartStepsV1 [29]. This simulation allows us to evaluate the proposed algorithm under various settings that may arise in implementation. For example, heterogeneity in the observed rewards may be due to unknown subgroups across which users’ reward functions differ. Alternatively, this heterogeneity may vary across users in a more continuous manner. We consider both scenarios in simulated trials. In Sections 4.1-4.3 we evaluate the performance of IntelligentPooling against baselines and a state-of-the-art algorithm. In Section 5 we assess feasibility of IntelligentPooling in a pilot deployment in a clinical trial.

4.1. Simulation environment

HeartStepsV1 was a 6-week micro-randomized trial of an Android-based physical activity intervention with 41 sedentary adults. The intervention consisted of two push interventions: planning and contextually-tailored activity suggestions. Activity suggestions acted as action cues and were designed to provide users with actionable options for engaging in short bouts of activity in their current situation. The content of the suggestions was tailored based on the users’ location, weather, time of day, and day of the week. For each individual, on each day of the study, the HeartSteps system randomized whether or not to send an activity suggestion five times a day. The intended outcome of the suggestions—the proximal outcome used to evaluate their efficacy—was the step count in the 30 minutes following suggestion randomization.

HeartStepsV1 data was used to construct all features within the environment, and to guide choices such as how often to update the feature values. Recall that Si,k and Ri,k denote the context features and reward of user i at the kth decision time. The reward is the log step counts in the thirty minutes immediately following a decision time. In HeartStepsV1 three treatment actions were considered: Ai,k = 1 corresponded to a smartphone notification containing an activity suggestion designed to take 3 minutes to perform, Ai,k = 0 corresponded to a smartphone notification containing an anti-sedentary message designed to take approximately 30 seconds to perform and Ai,k = −1 corresponded to not sending a message. However, in the simulation only the actions 1,0 are considered. Fig. 5 describes the simulation while Table 1 describes context features and rewards. Each context feature in Table 1 was constructed from HeartStepsV1 data. For example, we found that in HeartStepsV1 data splitting participants’ prior 30 minute step count into the two categories of high or low best explained the reward. Additional details about this process are included in Section D.

Figure 5: Contextual features for a simulated user are composed of both general environmental features (such as time of day) and individual features (such as location). At decision times a simulated user receives a message determined by the current treatment policy. Periodically this policy is updated according to a learning algorithm which outputs a new posterior distribution for each user.

Table 1: The value used in encoding each feature is shown in parentheses. For example, cold (0) indicates that cold is coded as a 0 wherever this feature is used. A user’s state is described as $S_{i,k}$ = {1, time of day, day of the week, preceding activity level, location}.

State (S) features

| Name | Value | User specific |
| --- | --- | --- |
| Time of day | Morning, 9:00 to 15:00 (0) or Afternoon, 15:00 to 21:00 (1) | No |
| Day of the week | Weekday (0) or Weekend (1) | No |
| Temperature | Cold (0) or Hot (1) | No |
| Preceding activity level | Low (0) or High (1) | Yes |
| Location | Other (0) or Home/work (1) | Yes |
| Intercept | 1 | Yes |

Reward

| Name | Value | User specific |
| --- | --- | --- |
| Step count | Continuous on log scale | Yes |

The temperature and location are updated throughout a simulated day according to probabilistic transition functions constructed from HeartStepsV1. The step counts for a simulated user are generated from participants in HeartStepsV1 as follows. We construct a one-hot feature vector containing the group-ID of a participant, the time of day, the day of the week, the temperature, the preceding activity level, and the location. Then for each possible realization of the one-hot encoding we calculate the empirical mean and empirical standard deviation of all step counts observed in HeartStepsV1. The corresponding empirical mean and empirical standard deviation from HeartStepsV1 form $\mu_{S_{i,k}}$ and $\sigma_{S_{i,k}}$ respectively. At each 30 minute window, if a treatment is not delivered, step counts are generated according to

$R_{i,k} = \mathcal{N}(\mu_{S_{i,k}}, \sigma^2_{S_{i,k}})$. (8)

Heterogeneity

This model, which we denote Heterogeneity, allows us to compare the performance of the approaches under different levels of population heterogeneity. The step count after a decision time is a modification of Eqn. 8 to reflect the interaction between context and treatment on the reward and heterogeneity in treatment effect. Let $\beta$ be a vector of coefficients of $S_{i,k}$ which weigh the relative contributions of the entries of $S_{i,k}$ that interact with treatment on the reward. The magnitudes of the entries of $\beta$ are set using HeartStepsV1. Step counts ($R_{i,k}$) are generated as

$R_{i,k} = \mathcal{N}(\mu_{S_{i,k}}, \sigma^2_{S_{i,k}}) + A_{i,k}(S_{i,k}^T \beta_i + Z_i)$. (9)

The inclusion of $Z_i$ will allow us to evaluate the relative performance of each approach under different levels of population heterogeneity. Let $\beta_i^l$ be the entry in $\beta_i$ corresponding to the location term for the ith user. We consider three scenarios (shown in Table 6) to generate $Z_i$, the person-specific effect, and $\beta_i^l$, the location-dependent effect. The performance of each algorithm under each scenario will be analyzed in Section 4.3. In the smooth scenario, $\sigma$ is equal to the standard deviation of the observed treatment effects $[f(S_{i,k})\beta : S_{i,k} \in \text{HeartStepsV1}]$. The settings for all $Z_i$ and $\beta_i^l$ terms are discussed in Section D.

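Table 6: Settings for Z in three cases of homogeneous, bimodal and smoothly varying populations.

| Homogeneous | Bi-modal | Smooth |
| --- | --- | --- |
| $Z_i = 0$, $\beta_i^l = 0$ | ($Z_i$, $\beta_i^l$) = (0.1, 0.l) if $i \in$ group one; (0.3, 0.l) if $i \in$ group two | $Z_i \sim \mathcal{N}(0, 0.35)$, $\beta_i^l \sim \mathcal{N}(0, 0.1)$ |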

In the bi-modal scenario each simulated user is assigned a base-activity level: low-activity users (group 1) or high-activity users (group 2). When a simulated user joins the trial they are placed into either group one or two with equal probability. Whether or not it is optimal to send a treatment (an activity suggestion) for user i at their kth decision time depends both on their context, and on the values of $z_1, \beta_1^l$ and $z_2, \beta_2^l$. The values of $z_1, \beta_1^l$ and $z_2, \beta_2^l$ are set so that for all users in group 1, it is optimal to send a treatment under 75% of the contexts they will experience. Yet for all users in group 2, it is only optimal to send a treatment under 25% of the contexts they will experience. Group membership is not known to any of the algorithms. The settings for all values in Table 6 are included in Section D.
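The Heterogeneity generative model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' simulator: the function names are hypothetical, the bi-modal values passed as `bimodal_values` are made-up placeholders rather than the settings from Section D, and the second argument of the normal distributions in the smooth scenario is treated here as a standard deviation.

```python
import random


def sample_user_effects(scenario, rng,
                        bimodal_values=((0.2, 0.1), (-0.2, -0.1)),
                        sd_z=0.35, sd_loc=0.1):
    """Draw (Z_i, beta_i^l, group) for one simulated user.

    bimodal_values holds hypothetical placeholder pairs (z, beta_l), one per group.
    sd_z and sd_loc are assumed to be standard deviations.
    """
    if scenario == "homogeneous":
        return 0.0, 0.0, None
    if scenario == "bimodal":
        group = rng.choice([0, 1])  # groups are assigned with equal probability
        z, beta_loc = bimodal_values[group]
        return z, beta_loc, group
    if scenario == "smooth":
        return rng.gauss(0.0, sd_z), rng.gauss(0.0, sd_loc), None
    raise ValueError(f"unknown scenario: {scenario}")


def simulate_reward(mu, sigma, action, state, beta, z, rng):
    """Baseline draw plus, when action = 1, the treatment effect S^T beta + Z."""
    baseline = rng.gauss(mu, sigma)
    effect = sum(s * b for s, b in zip(state, beta)) + z
    return baseline + action * effect
```

When the action is 0 the reward reduces to the baseline draw, and when it is 1 the context-dependent effect and the person-specific term are added on top.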

4.2. Model for the reward function in IntelligentPooling

In Section 3 we introduced the feature vector $\phi(S_{i,k}, A_{i,k}) \in \mathbb{R}^p$. This vector is used in the model for the reward and transforms a user’s contextual state variables $S_{i,k}$ and the action $A_{i,k}$ as follows:

$\phi(S_{i,k}, A_{i,k})^T = (S_{i,k}^T, \pi_{i,k} S_{i,k}^T, (A_{i,k} - \pi_{i,k}) S_{i,k}^T)$, (10)

where $S_{i,k}$ = {1, time of day, day of the week, preceding activity level, location}. Recall that the bandit algorithms produce $\pi_{i,k}$, which is the probability that $A_{i,k} = 1$. The inclusion of the term $(A_{i,k} - \pi_{i,k}) S_{i,k}$ is motivated by [7, 23, 36], who demonstrated that action-centering can protect against mis-specification in the baseline effect (e.g., the expected reward under the action 0). In HeartStepsV1 we observed that users varied in their overall responsivity and that a user’s location was related to their responsivity. In the simulation, we therefore place person-specific random effects on four parameters in the reward model (i.e., the coefficients of terms in S involving the intercept and location).
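As a minimal sketch of the action-centered feature construction, the function below stacks the state, the state scaled by the randomization probability, and the state scaled by the centered action. The function name and the example values are illustrative assumptions.

```python
def action_centered_features(state, action, pi):
    """Return phi(S, A) = (S, pi * S, (A - pi) * S) as one flat list."""
    if action not in (0, 1):
        raise ValueError("action must be 0 or 1")
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must be a probability")
    baseline = list(state)
    pi_terms = [pi * s for s in state]
    centered = [(action - pi) * s for s in state]
    return baseline + pi_terms + centered


# Example state: intercept, time of day, day of week, prior activity, location.
state = [1, 0, 1, 1, 0]
phi = action_centered_features(state, action=1, pi=0.25)
```

With a five-entry state the feature vector has 15 entries. The last block is positive when the action is taken and negative otherwise, and its expectation over the action is zero given the state, which is the property that action-centering relies on.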

Finally, we constrain the randomization probability to be within [0.1, 0.8] to ensure continual learning. The update time for the hyper-parameters is set to be every 7 days. All approaches are implemented in Python and we implement GP regression with the software package GPyTorch [22].
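The clipping of the randomization probability can be illustrated as follows. The sketch assumes that the posterior of the treatment effect at the current state is Gaussian with a given mean and variance, and that the unclipped probability is the posterior probability that this effect is positive. Both are assumptions made here for illustration, and the function name is hypothetical.

```python
import math


def treatment_probability(post_mean, post_var, lower=0.1, upper=0.8):
    """Posterior probability that the treatment effect is positive, clipped to [lower, upper]."""
    if post_var <= 0:
        raise ValueError("posterior variance must be positive")
    z = post_mean / math.sqrt(post_var)
    prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z
    return min(upper, max(lower, prob))
```

Even when the posterior is very confident in either direction, the clipped probability stays inside [0.1, 0.8], so both actions keep being explored.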

4.3. Simulation results

In this section, we compare the use of a mixed effects model for the reward function in IntelligentPooling to two standard methods used in mHealth, Complete and Person-Specific from Section 3.3. Recall that IntelligentPooling includes Person-Specific random effects, as described in Eqn. 14. In Person-Specific, all users are assumed to be different and there is no pooling of data. In Complete, we treat all users the same and learn one set of parameters across the entire population.

Additionally, to assess IntelligentPooling’s ability to pool across users we compare our approach to Gang of Bandits [11], which we refer to as GangOB. As this model requires a relational graph between users, we construct a graph using the generative model (9) and Table 6, connecting users according to each of the three settings: homogeneous, bi-modal and smooth. For example, with knowledge of the generative model users can be connected to other users as a function of their $Z_i$ terms. As we will not have true access to the underlying generative model in a real-life setting, we distort the true graph to reflect this incomplete knowledge. That is, we add ties to dissimilar users at 50% of the strength of the ties between similar users.
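One way to picture the distorted graph in the bi-modal setting is the sketch below. It assumes that "similar" means sharing a group label and that ties between similar users have weight 1. The text does not specify either detail, so the function and its parameters are hypothetical.

```python
def distorted_graph(groups, similar_weight=1.0, dissimilar_ratio=0.5):
    """Weighted adjacency matrix: full-strength ties within a group,
    ties at dissimilar_ratio of that strength across groups, no self-ties."""
    n = len(groups)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if groups[i] == groups[j]:
                weights[i][j] = similar_weight
            else:
                weights[i][j] = dissimilar_ratio * similar_weight
    return weights


# Three users: the first two share a group, the third does not.
W = distorted_graph([0, 0, 1])
```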

From the generative model (9), the optimal action for user i at the kth decision time is $a^*_{i,k} = 1\{S_{i,k}^T \beta^*_i + Z_i \geq 0\}$. The regret is

$\text{regret}_{i,k} = |S_{i,k}^T \beta^*_i + Z_i| \, 1\{a^*_{i,k} \neq A_{i,k}\}$ (11)

where $\beta^*_i$ is the optimal $\beta$ for the ith user.
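The per-decision regret can be computed directly from the generative parameters, as in the sketch below. The function names are illustrative, and ties at a zero treatment effect are resolved in favor of action 1, a choice that does not change the regret because the regret is zero there either way.

```python
def treatment_effect(state, beta, z):
    """S^T beta + Z for one user at one decision time."""
    return sum(s * b for s, b in zip(state, beta)) + z


def optimal_action(state, beta, z):
    """Treat exactly when the treatment effect is non-negative."""
    return 1 if treatment_effect(state, beta, z) >= 0 else 0


def instantaneous_regret(state, beta, z, action):
    """Absolute treatment effect when the chosen action is not optimal, else zero."""
    if action == optimal_action(state, beta, z):
        return 0.0
    return abs(treatment_effect(state, beta, z))
```

For example, with state `[1, 1]`, coefficients `[0.2, -0.5]` and person-specific term `0.1`, the treatment effect is about -0.2, so sending a treatment incurs regret of about 0.2 and withholding it incurs none.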

In these simulations each trial has 32 users. Each user remains in the trial for 10 weeks and the entire length of the trial is 15 weeks, where the last cohort joins in week six. The number of users who join each week is a function of the recruitment rate observed in HeartStepsV1. In all settings we run 50 simulated trials.

First, Fig. 6 provides the regret averaged across all users across 50 simulated trials where the reward distribution follows (9) for each of the Table 6 categories. Fig. 6 plots the average regret over all users in their nth week in the trial, e.g. in their first week, their second week, etc. In the bi-modal setting there are two groups, where all users in group one have a positive response to treatment when experiencing their typical context, while the users in group two have a negative response to treatment under their typical context. An optimal policy would learn to typically send treatments to users in the first group, and to typically not send them to users in the second. To evaluate each algorithm’s ability to learn this distinction we show the fraction of time each group received a message in Table 3.

Figure 6: Heterogeneity generative model. Regret averaged across all users for each week in the trial, i.e. average regret of all users in their first week of the trial.

Table 3: The fraction of time that messages were sent to users in each group. Recall at each decision time either an activity suggestion or anti-sedentary message is sent. For group one it is typically optimal to send an activity suggestion, while for group two it is typically optimal to send an anti-sedentary message. Here, IntelligentPooling is best able to learn this dynamic.

| Algorithm | Group one (optimal policy = send activity suggestion) | Group two (optimal policy = send anti-sedentary message) |
| --- | --- | --- |
| Complete | 0.49 | 0.46 |
| Person-Specific | 0.65 | 0.49 |
| GangOB | 0.57 | 0.35 |
| IntelligentPooling | 0.59 | 0.36 |

The relative performance of the approaches depends on the heterogeneity of the population. When the population is very homogeneous Complete excels, while its performance suffers as heterogeneity increases. Person-Specific is able to personalize; as shown by Table 3, it can differentiate between individuals. However, it learns slowly and can only approach the performance of Complete in the smooth setting of Table 6 where users differ the most in their response to treatment. Both IntelligentPooling and GangOB are more adaptive than either Complete or Person-Specific. GangOB consistently outperforms Person-Specific and achieves lower regret than Complete in some settings. In the homogeneous setting we see that GangOB can utilize social information more effectively than Person-Specific does, while in the smooth setting it can adapt to individual differences more effectively than Complete. Yet, IntelligentPooling demonstrates stronger and swifter adaptability than does GangOB, consistently achieving lower regret at quicker rates. Finally, the algorithms differ in their suitability for real-world applications, especially when data is limited. GangOB requires reliable values for hyper-parameters and can depend on fixed knowledge about relationships between users. IntelligentPooling can learn how to pool between individuals over time and without prior knowledge.

5. IntelligentPooling Feasibility Study

The simulated experiments provide insights into the potential of this approach for a live deployment. As we see reasonable performance in the simulated setting, we now discuss an initial pilot deployment of IntelligentPooling in a real-life physical activity clinical trial.

5.1. Feasibility Study Design

The feasibility study of IntelligentPooling involves 10 participants added to a larger 90-day clinical trial of HeartSteps v2, an mHealth physical activity intervention. The purpose of the larger clinical trial is to optimize the intervention for individuals with Stage 1 hypertension. Study participants with Stage 1 hypertension were recruited from Kaiser Permanente Washington in Seattle, Washington. The study was approved by the institutional review board of the Kaiser Permanente Washington Health Research Institute (under number 1257484–14).

HeartSteps v2 is a cross-platform mHealth application that incorporates several intervention components, including weekly activity goals, feedback on goal progress, planning, motivational messages, prompts to interrupt sedentary behavior, and—most relevant to this paper—actionable, contextually-tailored suggestions for individuals to perform a short physical activity (suggesting, roughly, a 3 to 5 minute walk). In this study physical activity is tracked with a commercial wristband tracker, the Fitbit Versa smart watch.

In this version of the intervention, activity suggestions are randomized five times per day for each participant on each day of the 90-day trial. These decision times are specified by each user at the start of the study, and they roughly correspond to the participant’s typical morning commute, lunch time, mid-afternoon, evening commute, and after dinner periods. The treatment options for activity suggestions are binary: at a decision time, the system can either send or not send a notification with an activity suggestion. When provided, the content of the suggestion is tailored to current sensor data (location, weather, time of day, and day of the week). Examples of these suggestions are provided in [30]. At a decision time, activity suggestions are randomized only if the system considers that the user is available for the intervention—i.e., that it is appropriate to intervene at that time (see Figure 8 for criteria used to determine if it is appropriate to send an activity suggestion at a decision time). Subject to these availability criteria, IntelligentPooling determines whether to send a suggestion at each decision time. The posterior distribution was updated once per day, prior to the beginning of each day. Fig. 7 provides a schematic of the feasibility study.

Figure 8: Availability criteria

Figure 7: Setup of FeasibilityStudy. Users can receive treatments up to five times a day during the 90 days. Users enter the trial asynchronously.

The feasibility study included the second set of 10 participants in the trial of HeartSteps v2, following the initial 10 enrolled participants. IntelligentPooling (Algorithm 1) is deployed for each of the second set of 10 participants. At each decision time for these 10 participants, IntelligentPooling uses all data up to that decision time (i.e. from the initial ten participants as well as from the subsequent ten participants). Thus the feasibility study allows us to assess performance of IntelligentPooling after the beginning of a study instead of the performance at the beginning of the study (when there is little data) or the performance at the end of the study (when there is a large amount of data and the algorithm can be expected to perform well).

In the feasibility study, the features used in the reward model were selected to be predictive of the baseline reward and/or the treatment effect, based on the data analysis of HeartStepsV1; see Section 6.2 in [37] for details. All features used in the reward model are shown in Table 4. The feature engagement represents the extent to which a user engages with the mHealth application, measured as a function of how many screen views are made within the application within a day. The feature dosage represents the extent to which a user has received treatments (activity suggestions). This feature increases and decreases depending on the number of activity suggestions recently received. The feature location refers to whether a user is at home or work (encoded as a 1) or somewhere else (encoded as a 0). The temperature feature value is set according to the temperature at a user’s current location (based on phone GPS). The variation feature value is set according to the variation in step count in the hour around that decision point over the prior seven-day period. As before we construct a feature vector $\phi$; however, here we only use selected terms to estimate the treatment effect. Here,

$\phi(S_{i,k}, A_{i,k})^T = (S_{i,k}^T, \pi_{i,k} S_{i,k}'^T, (A_{i,k} - \pi_{i,k}) S_{i,k}'^T)$, (12)

where $S_{i,k}$ = {1, temperature, yesterday’s step count, preceding activity level, step variation, engagement, dosage, location} and $S'_{i,k}$ = {1, step variation, engagement, dosage, location} is a subset of $S_{i,k}$.

Table 4: State feature descriptions for FeasibilityStudy.

State features

| Name | Value | User specific | Included in treatment effect |
| --- | --- | --- | --- |
| Temperature | Continuous | Yes | No |
| Yesterday’s step count | Continuous | Yes | No |
| Prior 30-minute step count | Continuous | Yes | No |
| Step variation level | Discrete | Yes | Yes |
| Engagement with mobile application | Discrete | Yes | Yes |
| Dosage | Continuous | Yes | Yes |
| Location | Discrete | Yes | Yes |
| Intercept | 1 | Yes | Yes |

Reward

| Name | Value | User specific | Included in treatment effect |
| --- | --- | --- | --- |
| Step count | Continuous on log scale | Yes | NA |

We provide a full description of these features in Section E. The prior distribution was also constructed based on HeartStepsV1; see Section 6.3 in [37] for more details. As this feasibility study only includes a small number of users, a simple model with only two person-specific random effects, one on the intercept term in S and one on the intercept term in S’ (Eqn. 12), was deployed.

Here we discuss how much data we have to personalize the policy to each user. Recall the 10 users only receive interventions when they meet the availability criteria outlined in Fig. 8, thus we find that in practice we have a limited number of decision points to learn a personalized policy from. In the case of perfect availability, we would have at most 450 decision points per person. However due to the criteria in Fig. 8, the algorithm is used with only approximately 23% of each user’s decision points. Pooling users’ data allows us to learn more rapidly. On the day that the first pooled user joined the feasibility study there were 107 data points from the first set of 10 users.
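The arithmetic behind these figures is short enough to spell out. The variable names are illustrative, and the final quantity is only what the stated 23% implies, not a number reported in the study.

```python
days_in_trial = 90
decision_times_per_day = 5
available_fraction = 0.23  # approximate share of decision points that met the availability criteria

# Upper bound on decision points per user under perfect availability.
max_decision_points = days_in_trial * decision_times_per_day

# Rough number of decision points per user at which the algorithm was actually used.
usable_decision_points = available_fraction * max_decision_points
```

This reproduces the 450 decision points per person and implies roughly 100 usable decision points per user over the whole trial, which illustrates why pooling across users matters.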

The 10 users received an average of 0.20 (±0.015) messages a day. The average log step count in the 30-minute window after a suggestion was sent was 4.47, while it was 3.65 in the 30-minute windows after suggestions were not sent. Fig. 9 shows the entire history of treatment selection probabilities for all of the users who received treatment according to IntelligentPooling. We see that the treatment probabilities tended to be low, though they covered the whole range of possible values.

Figure 9: We see that IntelligentPooling covers the full range of treatment selection probabilities. The tendency seems to be to send with a lower rather than higher probability.

We would like to assess the ability of IntelligentPooling to personalize and learn quickly. To do so we perform an analysis of the learning algorithms of IntelligentPooling, Complete and Person-Specific on batch data containing tuples of (S, A, R). Note that the actions in this batch data were selected by IntelligentPooling; however, here we are not interested in the action selection components of each algorithm but instead in their ability to learn the posterior distribution of the weights on the feature vector.

Personalization

By comparing how the decisions to treat under IntelligentPooling differ from those under Complete, we gather preliminary evidence concerning whether IntelligentPooling personalizes to users. Fig. 10 shows the posterior mean of the coefficient of the Ai,k term in the estimation of the treatment effect, for all users in the feasibility study on the 90th day after the last user joined the study. We show this term not only for IntelligentPooling but also for Complete and Person-Specific. We see that for some users this coefficient is below zero while for others it is above. While the terms under IntelligentPooling differ from Complete they do not vary as much as those learned by Person-Specific. Yet, crucially, the variance is much lower for these terms.

Figure 10: Posterior mean and standard deviation of the coefficient of Ai,k in Eqn. 12 for all users in the feasibility study.

Fig. 11 displays the posterior mean of the coefficient of the Ai,k term in the estimation of the treatment effect. This coefficient represents the overall effect of treatment on one of the users, User A. During the prior 7 days User A had not experienced much variation in activity at this time and the user’s engagement is low. Note that the treatment appears to have a positive effect on a different user, User B, in this context whereas on User A there is little evidence of a positive effect. If Complete had been used to determine treatment, User A might have been over-treated.

Figure 11: Posterior mean of the coefficient of Ai,k in Eqn. 12 for users A and B in the feasibility study.

Speed of policy learning

We consider the speed at which IntelligentPooling diverges from the prior, relative to the speed of divergence for Person-Specific. Fig. 12 provides the Euclidean distance between the learned posterior and prior parameter vectors (averaged across the data from the 10 users at each time). From Fig. 12 we see that Person-Specific hardly varies over time in contrast to IntelligentPooling and Complete, which suggests that Person-Specific learns more slowly.
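The quantity described here, the Euclidean distance between posterior and prior parameter vectors averaged over users, can be sketched as below. The function names are hypothetical, and the inputs stand for posterior mean vectors that a fitted model would supply.

```python
import math


def distance_from_prior(posterior_mean, prior_mean):
    """Euclidean distance between one posterior mean vector and the prior mean."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(posterior_mean, prior_mean)))


def mean_distance_from_prior(posterior_means, prior_mean):
    """Average the per-user distances at a single point in time."""
    distances = [distance_from_prior(m, prior_mean) for m in posterior_means]
    return sum(distances) / len(distances)
```

A method whose average distance stays near zero over time has moved little from its prior, which is the pattern the text attributes to Person-Specific.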

Figure 12: Mean squared distance of the posterior mean from prior mean of the coefficients of Ai,k

In conclusion IntelligentPooling was found to be feasible in this study. In particular the algorithm was operationally stable within the computational environment of the study, produced decision probabilities in a timely manner, and did not adversely impact the functioning of the overall mHealth intervention application. Overall, IntelligentPooling produced treatment selection probabilities which covered the full range of available probabilities, though treatments tended to be sent with a low probability.

6. Non-stationary environments

An additional challenge in mHealth settings is that users’ response to treatment can vary over time. To address this challenge we show that our underlying model can be extended to include time-varying random effects. This allows each policy to be aware of how a user’s response to treatment might vary over time. We propose a new simulation to evaluate this approach and show that IntelligentPooling achieves state-of-the-art regret, adjusting to non-stationarity even as user populations vary from heterogeneous to homogeneous.

6.1. Time-varying random effect

In addition to user-specific random effects we extend our model to include time-specific random effects. Consider the Bayesian mixed effects model with Person-Specific and time-varying effects: for user i at the kth decision time,

$R_{i,k} = \phi(S_{i,k}, A_{i,k})^T w_{i,k} + \epsilon_{i,k}$. (13)

In addition, we impose the following additive structure on the parameters $w_{i,k}$:

$w_{i,k} = w_{pop} + u_i + v_k$, (14)

where $w_{pop}$ is the population-level parameter, $u_i$ represents the person-specific deviation from $w_{pop}$ for user i and $v_k$ is the time-varying random effect allowing $w_{i,k}$ to vary with time in the study.

The prior terms for this model are as introduced in Section 3.4. Additionally, $v_k$ has mean 0 and covariance $D_v$. The covariance between two relative decision times in the trial is $\mathrm{Cov}(v_k, v_{k'}) = \rho(k, k') D_v$, where $\rho(k, k') = \exp(-\mathrm{dist}(k, k')^2 / \sigma_\rho)$ for a distance function dist, and $\theta_{pop} \perp \{u_i\} \perp \{v_k\}$. There is no change to Algorithm 1 except that now the algorithm would select the action based on the posterior distribution of $w_{i,k}$, which depends on both the user and time in the study.
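The time-varying covariance can be illustrated with a small sketch. It assumes a squared-exponential form with a negative exponent and takes the distance function to be the absolute difference between decision-time indices. Both choices are assumptions for illustration, and the function names are hypothetical.

```python
import math


def rho(k, k_prime, sigma_rho):
    """Correlation between the time effects at decision times k and k_prime."""
    dist = abs(k - k_prime)  # assumed distance function
    return math.exp(-(dist ** 2) / sigma_rho)


def time_effect_covariance(k, k_prime, D_v, sigma_rho):
    """Cov(v_k, v_k') = rho(k, k') * D_v, with D_v given as a list of rows."""
    r = rho(k, k_prime, sigma_rho)
    return [[r * entry for entry in row] for row in D_v]
```

Under these assumptions, decision times that are close together share nearly the same time effect, while the correlation decays toward zero as they move apart, at a rate controlled by `sigma_rho`.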

6.2. Experiments

We now modify our original simulation environment so that users’ responses will vary over time. To do so we introduce the generative model Disengagement. This generative model captures the phenomenon of disengagement. That is, as users are increasingly exposed to treatment over time they can become less responsive. This model adds a further term to (9), $A_{i,k} X_w^T \beta_w$, where $X_w$ is defined as follows. Let $w_{i,k}$ be the highest number of weeks user i has completed at time k; $X_w$ encodes a user’s current week in a trial, $X_w = [1\{w_{i,k} = 0\}, \ldots, 1\{w_{i,k} = 11\}]$. We set $\beta_w$ such that the longer a user has been in treatment, the less they respond to a treatment message. When a simulated user is at a decision time the user will receive a treatment message according to whichever RL policy is being run through the simulation.
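The week indicator and the extra Disengagement term can be sketched as follows. The function names are illustrative, and any coefficient vector passed as `beta_w` in an example is a hypothetical stand-in rather than a value used in the paper.

```python
def week_indicator(weeks_completed, n_weeks=12):
    """One-hot vector X_w with a 1 in the position of the user's current week."""
    if not 0 <= weeks_completed < n_weeks:
        raise ValueError("weeks_completed is outside the encoded range")
    return [1 if w == weeks_completed else 0 for w in range(n_weeks)]


def disengagement_term(action, weeks_completed, beta_w):
    """Extra reward term A * X_w^T beta_w added to the generative model."""
    x_w = week_indicator(weeks_completed, len(beta_w))
    return action * sum(x * b for x, b in zip(x_w, beta_w))
```

With a coefficient vector that decreases with the week index, for example `[0.0, -0.1, -0.2]`, a treated user in their third week receives an additional effect of -0.2, while an untreated user receives none.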

In order to evaluate the effectiveness of our time-varying model we compare to Time-Varying Gaussian Process Thompson Sampling (TV-GP) [5]. This approach incorporates temporal information for non-stationary environments and was shown to be competitive with stationary models. To compare this method to IntelligentPooling we use a linear kernel for the spatial component. We then modify Eqn. 6 to compute the posterior distribution by removing the random effects and modifying the kernel (Eqn. 5) to include the temporal terms introduced in [5].

Fig. 13 provides the regret averaged across all users across 50 simulated trials where the reward distribution follows generative model Disengagement. As before, Fig. 13 plots the average regret over all users in their nth week in the trial, e.g. in their first week, their second week, etc. In Disengagement, the time-specific response to treatment is set so that a negative response to treatment is introduced in the seventh week of the trial.

Figure 13: Disengagement generative model. Regret averaged across all users for each week in the trial, i.e. average regret of all users in their first week of the trial.

In the Disengagement condition as users become increasingly less responsive to treatment good policies should learn to treat less. Thus, Table 5 provides the average number of times a treatment is sent in the last week of the trial for both the first and last cohort. We expect that a policy which learns not to treat will treat less often in the last week of the last cohort than in the last week of the first cohort.

Table 5: Average fraction of times treatment was sent (action=1), over 50 simulations (generative model Heterogeneity with homogeneous Zh setting).

| Algorithm | Cohort One Week 10 | Cohort Six Week 10 |
| --- | --- | --- |
| Complete | 0.62 | 0.44 |
| Person-Specific | 0.76 | 0.59 |
| HordeOB- | 0.50 | 0.57 |
| TV-GP | 0.64 | 0.31 |
| IntelligentPooling | 0.30 | 0.06 |

7. Limitations

A significant limitation with this work is that our pilot study involved a small number of participants. Our results from this work must be considered with caution as preliminary evidence towards the feasibility of deploying IntelligentPooling, and bandit algorithms in general, in mHealth settings. Moreover, we cannot claim to provide generalizable evidence that this algorithm can improve health outcomes; for this larger studies with more participants must be run. We offer our findings as motivation for such future work.

Our proposed model is designed to overcome the challenges faced when learning personalized policies in limited data settings. As such, if data was abundant our model would likely have limited effectiveness compared to more complex models. For example, a more complex model could allow us to pool between users as a function of their similarity. Our current model instead determines the extent to which a given user deviates from the population and does not consider between-user similarities. A limitation with our current understanding of mHealth is that it is unclear what a good similarity measure would be. We leave the question of designing a data-efficient algorithm for learning such a measure as future work.

A component of IntelligentPooling is the use of empirical Bayes to update the model hyper-parameters. Here, we used an approximate procedure. However, with our model it is possible to produce exact updates in a streaming fashion and we are currently developing such an approach.

Ideally, we would evaluate IntelligentPooling against all other approaches in a clinical trial setting. However, here we only demonstrated the feasibility of our approach on a limited number of users and did not have the resources to similarly test the other approaches. To overcome this limitation we constructed a realistic simulation environment so that we could evaluate on different populations without the costly investment of designing multiple arms of a real-life trial. While the simulated experiments and the feasibility study together demonstrate the practicality of our approach, in future work one might deploy all potential approaches in simultaneous live trials.

Finally, IntelligentPooling can incorporate a time-specific random effect to capture the phenomenon of responsivity changing over the course of a study. There is much to be improved with this model. For example, the first cohort in a study will not have prior cohorts to learn from, and the final cohort will have the greatest amount of data to benefit from. Other models might treat different cohorts with greater equality. Furthermore, this representation does not incorporate alternative temporal information, such as continually shifting weather patterns, where temperatures might change slowly and gradually alter one’s desire to exercise outside.

8. Conclusion

When data on individuals is limited a natural tension exists between personalizing (a choice which can introduce variance) and pooling (a choice which can introduce bias). In this work we have introduced a novel algorithm for personalized reinforcement learning, IntelligentPooling, that presents a principled mechanism for balancing this tension. We demonstrate the practicality of our approach in the setting of mHealth. In simulation we achieve improvements of 26% over a state-of-the-art method, while in a live clinical trial we show that our approach shows promise of personalization on even a limited number of users. We view adaptive pooling as a first step in addressing the trade-offs between personalization and pooling. The question of how to quantify the benefits and risks for individual users is an open direction for future work.

Supplementary Material

1

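Table 2: Settings for Z in three cases of homogeneous, bimodal and smoothly varying populations.

| Homogeneous | Bi-modal | Smooth |
| --- | --- | --- |
| $Z_i = 0$, $\beta_i^l = 0$ | $(Z_i, \beta_i^l) = (z_1, \beta_1^l)$ if $i \in$ group one; $(z_2, \beta_2^l)$ if $i \in$ group two | $Z_i \sim \mathcal{N}(0, \sigma^2)$, $\beta_i^l \sim \mathcal{N}(0, \sigma_l^2)$ |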

Acknowledgements

This material is based upon work supported by: NIH/NIAAA R01AA23187, NIH/NIDA P50DA039838, NIH/NIBIB U54EB020404 and NIH/NCI U01CA229437. The views expressed in this article are those of the authors and do not necessarily reflect the official position of the National Institutes of Health, or any other part of the U.S. Department of Health and Human Services.

Footnotes

Institutional Review Board Approval

The HeartSteps study discussed here was approved by the Kaiser Permanente Washington Region Institutional Review Board under IRB number 1257484-14.

1

More generally, one can consider the setting where users become known to an algorithm over time. For example, users may open or delete accounts on an online shopping platform.

Contributor Information

Sabina Tomkins, Stanford University.

Peng Liao, Harvard University.

Predrag Klasnja, University of Michigan.

Susan Murphy, Harvard University.

References

  • [1].Abeille M, Lazaric A, et al. (2017) Linear thompson sampling revisited. Electronic Journal of Statistics 11(2):5165–5197
  • [2].Agrawal S, Goyal N (2012) Analysis of thompson sampling for the multi-armed bandit problem. In: Conference on Learning Theory, pp 39–1
  • [3].Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. In: International Conference on Machine Learning, pp 127–135
  • [4].Pedregosa F (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
  • [5].Bogunovic I, Scarlett J, Cevher V (2016) Time-varying Gaussian process bandit optimization. In: Artificial Intelligence and Statistics, pp 314–323
  • [6].Bonilla EV, Chai KM, Williams C (2008) Multi-task Gaussian process prediction. In: Advances in neural information processing systems, pp 153–160
  • [7].Boruvka A, Almirall D, Witkiewitz K, Murphy SA (2018) Assessing time-varying causal effect moderation in mobile health. Journal of the American Statistical Association 113(523):1112–1121
  • [8].Brochu E, Hoffman MW, de Freitas N (2010) Portfolio allocation for Bayesian optimization. arXiv preprint arXiv:1009.5419
  • [9].Carlin BP, Louis TA (2010) Bayes and empirical Bayes methods for data analysis. Chapman and Hall/CRC
  • [10].Casella G (1985) An introduction to empirical Bayes data analysis. The American Statistician 39(2):83–87
  • [11].Cesa-Bianchi N, Gentile C, Zappella G (2013) A gang of bandits. In: Advances in Neural Information Processing Systems, pp 737–745
  • [12].Cheung WC, Simchi-Levi D, Zhu R (2018) Learning to optimize under non-stationarity. arXiv preprint arXiv:1810.03024
  • [13].Chowdhury SR, Gopalan A (2017) On kernelized multi-armed bandits. In: International Conference on Machine Learning, vol 70, pp 844–853
  • [14].Clarke S, Jaimes LG, Labrador MA (2017) mstress: A mobile recommender system for just-in-time interventions for stress. In: Consumer Communications & Networking Conference, pp 1–5
  • [15].Consolvo S, McDonald DW, Toscos T, Chen MY, Froehlich J, Harrison B, Klasnja P, LaMarca A, LeGrand L, Libby R, et al. (2008) Activity sensing in the wild: a field trial of ubifit garden. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 1797–1806
  • [16].Desautels T, Krause A, Burdick JW (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research 15(1):3873–3923
  • [17].Deshmukh AA, Dogan U, Scott C (2017) Multi-task learning for contextual bandits. In: Advances in Neural Information Processing Systems, pp 4848–4856
  • [18].Djolonga J, Krause A, Cevher V (2013) High-dimensional gaussian process bandits. In: Advances in Neural Information Processing Systems, pp 1025–1033
  • [19].Finn C, Xu K, Levine S (2018) Probabilistic model-agnostic meta-learning. In: Advances in Neural Information Processing Systems, pp 9516–9527
  • [20].Finn C, Rajeswaran A, Kakade S, Levine S (2019) Online meta-learning. arXiv preprint arXiv:1902.08438
  • [21].Forman EM, Kerrigan SG, Butryn ML, Juarascio AS, Manasse SM, Ontañón S, Dallal DH, Crochiere RJ, Moskow D (2018) Can the artificial intelligence technique of reinforcement learning use continuously-monitored digital data to optimize treatment for weight loss? Journal of behavioral medicine 42(2):276–290
  • [22].Gardner J, Pleiss G, Weinberger KQ, Bindel D, Wilson AG (2018) Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In: Advances in Neural Information Processing Systems, pp 7576–7586
  • [23].Greenewald K, Tewari A, Murphy S, Klasnja P (2017) Action centered contextual bandits. In: Advances in neural information processing systems, pp 5977–5985
  • [24].Gupta A, Mendonca R, Liu Y, Abbeel P, Levine S (2018) Meta-reinforcement learning of structured exploration strategies. In: Advances in Neural Information Processing Systems, pp 5302–5311
  • [25].Hamine S, Gerth-Guyette E, Faulx D, Green BB, Ginsburg AS (2015) Impact of mhealth chronic disease management on treatment adherence and patient outcomes: a systematic review. Journal of medical Internet research 17(2):e52
  • [26].Jaimes LG, Llofriu M, Raij A (2016) Preventer, a selection mechanism for just-in-time preventive interventions. IEEE Transactions on Affective Computing 7(3):243–257
  • [27].Kim B, Tewari A (2019) Near-optimal oracle-efficient algorithms for stationary and non-stationary stochastic linear bandits. arXiv preprint arXiv:1912.05695
  • [28].Kim B, Tewari A (2020) Randomized exploration for non-stationary stochastic linear bandits. In: Conference on Uncertainty in Artificial Intelligence, pp 71–80
  • [29].Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A, Murphy SA (2015) Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34(S):1220
  • [30].Klasnja P, Smith S, Seewald NJ, Lee A, Hall K, Luers B, Hekler EB, Murphy SA (2018) Efficacy of Contextually Tailored Suggestions for Physical Activity: A Micro-randomized Optimization Trial of HeartSteps. Annals of Behavioral Medicine 53(6):573–582
  • [31].Krause A, Ong CS (2011) Contextual gaussian process bandit optimization. In: Advances in Neural Information Processing Systems, pp 2447–2455
  • [32].Laird NM, Ware JH, et al. (1982) Random-effects models for longitudinal data. Biometrics 38(4):963–974
  • [33].Lawrence ND, Platt JC (2004) Learning to learn with the informative vector machine. In: International conference on Machine learning, p 65
  • [34].Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the Conference on World wide web, pp 661–670
  • [35].Li S, Kar P (2015) Context-aware bandits. arXiv preprint arXiv:1510.03164
  • [36].Liao P, Klasnja P, Tewari A, Murphy SA (2016) Sample size calculations for micro-randomized trials in mhealth. Statistics in medicine 35(12):1944–1971
  • [37].Liao P, Greenewald K, Klasnja P, Murphy S (2020) Personalized heart-steps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the Conference on Interactive, Mobile, Wearable and Ubiquitous Technologies 4(1):1–22
  • [38].Luo L, Yao Y, Gao F, Zhao C (2018) Mixed-effects Gaussian process modeling approach with application in injection molding processes. Journal of Process Control 62:37–43
  • [39].Morris CN (1983) Parametric empirical Bayes inference: theory and applications. Journal of the American statistical Association 78(381):47–55
  • [40].Nagabandi A, Finn C, Levine S (2018) Deep online learning via meta-learning: Continual adaptation for model-based rl. arXiv preprint arXiv:1812.07671
  • [41].Nahum-Shani I, Smith SN, Spring BJ, Collins LM, Witkiewitz K, Tewari A, Murphy SA (2017) Just-in-time adaptive interventions (JITAIs) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine 52(6)
  • [42].Paredes P, Gilad-Bachrach R, Czerwinski M, Roseway A, Rowan K, Hernandez J (2014) Poptherapy: Coping with stress through pop-culture. In: Conference on Pervasive Computing Technologies for Healthcare, pp 109–117
  • [43].Qi Y, Wu Q, Wang H, Tang J, Sun M (2018) Bandit learning with implicit feedback. In: Advances in Neural Information Processing Systems, vol 31, pp 7276–7286
  • [44].Qian T, Klasnja P, Murphy SA (2019) Linear mixed models under endogeneity: modeling sequential treatment effects with application to a mobile health study. arXiv preprint arXiv:1902.10861
  • [45].Rabbi M, Aung MH, Zhang M, Choudhury T (2015) Mybehavior: automatic personalized health feedback from user behaviors and preferences using smartphones. In: Proceedings of the Conference on Pervasive and Ubiquitous Computing, pp 707–718
  • [46].Rabbi M, Philyaw-Kotov M, Lee J, Mansour A, Dent L, Wang X, Cunningham R, Bonar E, Nahum-Shani I, Klasnja P, et al. (2017) SARA: a mobile app to engage users in health data collection. In: Joint Conference on Pervasive and Ubiquitous Computing and the International Symposium on Wearable Computers, pp 781–789
  • [47].Raudenbush SW, Bryk AS (2002) Hierarchical linear models: Applications and data analysis methods, vol 1
  • [48].Russac Y, Vernade C, Cappé O (2019) Weighted linear bandits for non-stationary environments. In: Advances in Neural Information Processing Systems, pp 12017–12026
  • [49].Russo D, Van Roy B (2014) Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4):1221–1243
  • [50].Russo DJ, Roy BV, Kazerouni A, Osband I, Wen Z (2018) A tutorial on thompson sampling. Foundations and Trends in Machine Learning 11(1):1–96, URL 10.1561/2200000070
  • [51].Sæmundsson S, Hofmann K, Deisenroth MP (2018) Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551
  • [52].Shi J, Wang B, Will E, West R (2012) Mixed-effects Gaussian process functional regression models with application to dose–response curve prediction. Statistics in medicine 31(26):3165–3177
  • [53].Srinivas N, Krause A, Kakade SM, Seeger M (2009) Gaussian process optimization in the bandit setting: No regret and experimental design. In: International Conference on Machine Learning, pp 1015–1022
  • [54].Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294
  • [55].Vaswani S, Schmidt M, Lakshmanan L (2017) Horde of bandits using Gaussian Markov random fields. In: Artificial Intelligence and Statistics, pp 690–699
  • [56].Wang Y, Khardon R (2012) Nonparametric Bayesian mixed-effect model: A sparse Gaussian process approach. arXiv preprint arXiv:1211.6653
  • [57].Wang Z, Zhou B, Jegelka S (2016) Optimization as estimation with Gaussian processes in bandit settings. In: Artificial Intelligence and Statistics, pp 1022–1031
  • [58].Williams CK, Rasmussen CE (2006) Gaussian processes for machine learning, vol 2. MIT press; Cambridge, MA
  • [59].Xia I (2018) The price of personalization: An application of contextual bandits to mobile health. Senior thesis
  • [60].Yom-Tov E, Feraru G, Kozdoba M, Mannor S, Tennenholtz M, Hochberg I (2017) Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of medical Internet research 19(10):e338
  • [61].Zhao P, Zhang L, Jiang Y, Zhou ZH (2020) A simple approach for non-stationary linear bandits. In: Proceedings of the Conference on Artificial Intelligence and Statistics, pp 746–755
  • [62].Zhou M, Mintz Y, Fukuoka Y, Goldberg K, Flowers E, Kaminsky P, Castillejo A, Aswani A (2018) Personalizing mobile fitness apps using reinforcement learning. In: CEUR workshop proceedings, vol 2068
  • [63].Zintgraf LM, Shiarlis K, Kurin V, Hofmann K, Whiteson S (2019) CAML: Fast context adaptation via meta-learning. In: International Conference on Machine Learning, pp 7693–7702
