Author manuscript; available in PMC: 2024 Nov 1.
Published in final edited form as: Proc Mach Learn Res. 2024 May;238:2215–2223.

Online learning in bandits with predicted context

Yongyi Guo*, Ziping Xu, Susan Murphy
PMCID: PMC11501084  NIHMSID: NIHMS1985116  PMID: 39464505

Abstract

We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.

1. Introduction

Contextual bandits (Auer, 2002; Langford and Zhang, 2007) represent a classical sequential decision-making problem where an agent aims to maximize cumulative reward based on context information. At each round t, the agent observes a context and must choose one of K available actions based on both the current context and previous observations. Once the agent selects an action, she observes the associated reward, which is then used to refine future decision-making. Contextual bandits are typical examples of reinforcement learning problems where a balance between exploring new actions and exploiting previously acquired information is necessary to achieve optimal long-term rewards. It has numerous real-world applications including personalized recommendation systems (Li et al., 2010; Bouneffouf et al., 2012), healthcare (Yom-Tov et al., 2017; Liao et al., 2020), and online education (Liu et al., 2014; Shaikh et al., 2019).

Despite the extensive existing literature on contextual bandits, in many real-world applications the agent never observes the context exactly. One common reason is that the true context for decision-making can only be detected or learned approximately from observable auxiliary data. For instance, consider the Sense2Stop mobile health study, in which the context is whether the individual is currently physiologically stressed (Battalio et al., 2021). A complex predictor of current stress was constructed and validated based on multiple prior studies (Cohen et al., 1985; Sarker et al., 2016, 2017). This predictor was tuned to each user in Sense2Stop prior to their quit-smoking attempt; then, following the attempt to quit smoking, the predictor takes high-dimensional sensor data on the user as input each minute and outputs a continuous likelihood of stress for use by the decision-making algorithm. In many such health-intervention applications, models using validated predictions as contexts are preferred to models using raw sensor data, both because of the high noise in these settings and because the resulting decision rules are interpretable and can be critiqued by domain experts. The second reason why the context is not observed exactly is measurement error. Contextual variables, such as user preferences in online advertising, content attributes in recommendation systems, and patient conditions in clinical trials, are prone to noisy measurement. This introduces an additional level of uncertainty that must be accounted for when making decisions.

Motivated by the above, we consider the linear contextual bandit problem where at each round, the agent only has access to a noisy observation of the true context. Moreover, the agent has limited knowledge about the underlying distribution of this noisy observation, as in many practical applications (e.g. the above-mentioned ones). This is especially the case when this ‘observation’ is the output of a complex machine learning algorithm. We only assume that the noisy observation is unbiased, its variance is known or can be estimated, and we put no other essential restrictions on its distribution. In healthcare applications, the estimated error variance for context variables can often be derived from data in prior studies (e.g. pilot studies). This setting is intrinsically difficult for two main reasons: First, when estimating the reward model, the agent needs to take into account the misalignment between the noisy context observation and the reward which depends on the true context. Second, even if the reward model is known, the agent may suffer from making bad decisions at each round because of the inaccurate context.

Our contributions.

We present the first online algorithm MEB (Measurement Error Bandit) with sublinear regret in this setting under mild conditions. MEB achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret compared to a standard benchmark and $\tilde{\mathcal{O}}(T^{1/2})$ regret compared to a clipped benchmark with a minimum exploration probability, which is common in many applications (Yang et al., 2020a; Yao et al., 2021). MEB is based on a novel approach to model estimation which removes the systematic bias caused by the noisy context observation. The estimator is inspired by the measurement error literature in statistics (Carroll et al., 1995; Fuller, 2009): we extend this classical method with additional tools in the online decision-making setting because the policy depends on the measurement error.

1.1. Related work

Our work complements several lines of literature in contextual bandits, as listed below.

Latent contextual bandit.

In the latent contextual bandit literature (Zhou and Brunskill, 2016; Sen et al., 2017; Hong et al., 2020a,b; Xu et al., 2021; Nelson et al., 2022; Galozy and Nowaczyk, 2023), the reward is typically modeled as jointly depending on the latent state, the context, and the action. Several works (Zhou and Brunskill, 2016; Hong et al., 2020a,b; Galozy and Nowaczyk, 2023) assume no direct relation between the latent state and the context while positing a parametric reward model. For example, Hong et al. (2020a) assume the latent state $s\in\mathcal{S}$ is unknown but constant over time, while Hong et al. (2020b) assume that the latent state evolves through a Markov chain. Xu et al. (2021) set a specific context as well as a latent feature for each action, and model the reward as depending on them through a generalized linear model. Different from the aforementioned studies, we specify that the observed context is a noisy version of the latent context (which aligns with the applications we are addressing), and then leverage this structure to design the online algorithm.

In another line of work, Sen et al. (2017) and Nelson et al. (2022) consider contextual bandits with a latent confounder, where the observed context influences the reward through a latent confounder variable that ranges within a small discrete set. Our setting is distinct from these works in that the latent context can span an infinite (or even continuous) space.

Bandit with noisy context.

In bandits with noisy context (Yun et al., 2017; Kirschner and Krause, 2019; Yang et al., 2020b; Lin and Moothedath, 2022; Lin et al., 2022), the agent has access to a noisy version of the true context and/or some knowledge of the distribution of the noisy observation. Yun et al. (2017); Park and Faradonbeh (2022); Jose and Moothedath (2024) consider settings where the joint distribution of the true context and the noisy observed context is known (up to a parameter). Other works, such as Kirschner and Krause (2019); Yang et al. (2020b); Lin et al. (2022), assume that the agent knows the exact distribution of the context at each time and observes no context. By assuming a linear reward model, Kirschner and Krause (2019) transform the problem into a linear contextual bandit and obtain $\tilde{\mathcal{O}}(d\sqrt{T})$ regret compared to the policy maximizing the expected reward over the context distribution. Yang et al. (2020b); Lin et al. (2022); Lin and Moothedath (2022) consider variants of the problem such as multiple linear feedbacks, multiple agents, and delayed observation of the exact context. Compared to these works, we consider a practical but more challenging setting where, besides an unbiased noisy observation (prediction) of each context, the agent only knows second-moment information about its distribution. This does not transform into a standard linear contextual bandit as in Kirschner and Krause (2019).

Bandit with inaccurate/corrupted context.

These works consider the setting where the context is simply inaccurate (without randomness), or is corrupted and cannot be recovered. In Yang and Ren (2021a,b), at each round, only an inaccurate context is available to the decision-maker, and the exact context is revealed after the action is taken. In Bouneffouf (2020); Galozy et al. (2020), each context $x_t$ is completely corrupted with some probability and the corruption cannot be recovered. In Ding et al. (2022), the context is attacked by an adversarial agent. Because these works focus more on adversarial settings for the context observations, applying their regret bounds to our setting generally results in linear regret. For example, in Yang and Ren (2021a), the regret of Thompson sampling is $\tilde{\mathcal{O}}\big(d\sqrt{T}+d\sum_{t\in[T]}\|\hat{x}_t-x_t\|_2\big)$, where $\|\hat{x}_t-x_t\|_2$ is the error of the inaccurate context. As is typical in our setting, $\|\hat{x}_t-x_t\|_2$ is non-vanishing over time, so the second term is linear in $T$. Given the applications we consider, we can exploit the stochastic nature of the noisy context observations in our algorithm to achieve improved performance.

1.2. Notations

Throughout this paper, we use $[n]$ to represent the set $\{1,2,\ldots,n\}$ for $n\in\mathbb{N}^+$. For $a,b\in\mathbb{R}$, let $a\wedge b$ denote the minimum of $a$ and $b$. Given $d\in\mathbb{N}^+$, $I_d$ denotes the $d\times d$ identity matrix, and $\mathbf{1}_d$ denotes the $d$-dimensional vector with 1 in each entry. For a vector $v\in\mathbb{R}^d$, denote $\|v\|_2$ as its $\ell_2$ norm. For a matrix $M\in\mathbb{R}^{m\times n}$, denote $\|M\|_2$ as its operator norm. The notation $\mathcal{O}(X)$ refers to a quantity that is upper bounded by $X$ up to constant multiplicative factors, while $\tilde{\mathcal{O}}(X)$ refers to a quantity that is upper bounded by $X$ up to poly-log factors.

2. Measurement error adjustment to bandit with noisy context

2.1. Problem setting

We consider a linear contextual bandit with context space $\mathcal{X}\subseteq\mathbb{R}^d$ and binary action space $\mathcal{A}=\{0,1\}$. Let $T$ be the time horizon. As discussed above, we consider the setting where at each time $t\in[T]$, the agent only observes a noisy version $\tilde{x}_t$ of the context instead of the true underlying context $x_t$. Thus, at time $t$, the observation $o_t$ only contains $(\tilde{x}_t, a_t, r_t)$, where $a_t$ is the action and $r_t$ is the corresponding reward. We further assume that $\tilde{x}_t = x_t + \epsilon_t$, where the error $\epsilon_t$ is independent of the history $\mathcal{H}_{t-1} := \{o_\tau\}_{\tau\le t-1}$, with $\mathbb{E}[\epsilon_t]=0$ and $\mathrm{Var}(\epsilon_t)=\Sigma_{e,t}$. Here, the subscript 'e' means 'error'. Initially we assume that $\{\Sigma_{e,t}\}_{t\ge1}$ is known. In Section 2.4, we consider the setting where only estimators of $\{\Sigma_{e,t}\}_{t\ge1}$ are available. There is no restriction that the distribution of $\{\epsilon_t\}_{t\ge1}$ belongs to any known (parametric) family. The reward is $r_t = \langle\theta^*_{a_t}, x_t\rangle + \eta_t$, where $\mathbb{E}[\eta_t\mid\mathcal{H}_{t-1},\epsilon_t,a_t]=0$ and $\{\theta^*_a\}_{a\in\mathcal{A}}$ are the unknown parameters. It is worth noting that besides the policy, all the randomness here comes from the reward noise $\eta_t$ and the context error $\epsilon_t$. We treat $\{x_t\}_{t\ge1}$ as fixed throughout but unknown to the algorithm (unlike Yun et al. (2017), we do not assume the $x_t$ are i.i.d.). Our goal at each time $t$ is to design a policy $\pi_t(\cdot\mid\mathcal{H}_{t-1},\tilde{x}_t)\in\Delta(\mathcal{A})$ given the past history $\mathcal{H}_{t-1}$ and the current observed noisy context $\tilde{x}_t$, so that the agent can maximize the reward by taking action $a_t\sim\pi_t(\cdot\mid\mathcal{H}_{t-1},\tilde{x}_t)$.
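To make the observation model concrete, the following minimal sketch simulates one round of $\tilde{x}_t = x_t + \epsilon_t$ and $r_t = \langle\theta^*_{a_t}, x_t\rangle + \eta_t$. The variable names, dimensions, and distributional choices here are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = {0: rng.normal(size=d), 1: rng.normal(size=d)}  # unknown reward parameters
Sigma_e = 0.25 * np.eye(d)                                   # known error variance Sigma_{e,t}

def observe_round(x_t, policy, history):
    """One round: the agent sees only a noisy context, acts, and receives a reward."""
    eps_t = rng.multivariate_normal(np.zeros(d), Sigma_e)    # context error, E[eps_t] = 0
    x_tilde = x_t + eps_t                                    # what the agent actually observes
    a_t = policy(x_tilde, history)                           # action chosen from the noisy context
    eta_t = rng.normal(scale=0.1)                            # reward noise
    r_t = theta_star[a_t] @ x_t + eta_t                      # reward depends on the TRUE context
    return x_tilde, a_t, r_t
```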

If $\Sigma_{e,t}$ is non-vanishing, standard contextual bandit algorithms are generally sub-optimal. To see this, notice that $r_t = \langle\theta^*_{a_t},x_t\rangle + \eta_t = \langle\theta^*_{a_t},\tilde{x}_t\rangle + \eta_t - \langle\theta^*_{a_t},\epsilon_t\rangle$. This means the error in the reward $r_t$ after observing the noisy context is $\eta_t - \langle\theta^*_{a_t},\epsilon_t\rangle$, where $\epsilon_t$ and $\tilde{x}_t$ are dependent. Thus, $\mathbb{E}[r_t\mid\tilde{x}_t,a_t]\neq\langle\theta^*_{a_t},\tilde{x}_t\rangle$. This is in contrast to the standard linear bandit setting, where given the true context $x_t$, $\mathbb{E}[r_t\mid x_t,a_t]=\langle\theta^*_{a_t},x_t\rangle$, which ensures the sublinear regret of classical bandit algorithms such as UCB and Thompson sampling. Therefore, it is necessary to design an online algorithm that adjusts for the errors $\{\epsilon_t\}_{t\ge1}$. We assume that the context, parameters, and reward are bounded, as below.

Assumption 2.1 (Boundedness). $\forall t\in[T]$, $\|\tilde{x}_t\|_2\le1$; there exists a positive constant $R_\theta$ such that $\forall a\in\{0,1\}$, $\|\theta^*_a\|_2\le R_\theta$; there exists a positive constant $R$ such that $\forall t\in[T]$, $|r_t|\le R$.

For any policy $\pi=\{\pi_t\}_t$, we define the (standard) cumulative regret as

$$\mathrm{Regret}(T;\pi^*)=\sum_{t\in[T]}\Big(\mathbb{E}_{a\sim\pi^*_t}\langle\theta^*_a,x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta^*_a,x_t\rangle\Big), \tag{2.1}$$

where

$$\pi^*_t(a)=\begin{cases}1, & \text{if } a=a^*_t:=\arg\max_{a'}\langle\theta^*_{a'},x_t\rangle,\\ 0, & \text{otherwise.}\end{cases} \tag{2.2}$$

We denote the standard benchmark policy $\pi^*=\{\pi^*_t\}_t$. This is summarized in the setting below.

Setting 1. (Standard setting) We aim to minimize $\mathrm{Regret}(T;\pi^*)$ among the class of all policies.

In many applications, including clinical trials, it is desirable to design the policy under the constraint that each action is sampled with a minimum probability $p_0$. One reason for maintaining exploration is that we can update and re-optimize the policy for future users to allow for potential non-stationarity (Yang et al., 2020a). Additionally, maintaining exploration is important for after-study analysis (Yao et al., 2021), especially when the goal of the analysis is not pre-specified prior to collecting data with the online algorithm. In these situations, it is desirable to consider only the policies that always maintain an exploration probability of $p_0>0$ for each arm, and to compare the performance to the clipped benchmark policy $\bar{\pi}^*_t$:

$$\bar{\pi}^*_t(a)=\begin{cases}1-p_0, & \text{if } a=a^*_t,\\ p_0, & \text{otherwise.}\end{cases} \tag{2.3}$$

This is summarized in the setting below.

Setting 2. (Clipped policy setting) We minimize

$$\mathrm{Regret}(T;\bar{\pi}^*)=\sum_{t\in[T]}\Big(\mathbb{E}_{a\sim\bar{\pi}^*_t}\langle\theta^*_a,x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta^*_a,x_t\rangle\Big)

among the class of policies that explore any action with probability at least $p_0$.

In this work, we will provide policies with sublinear regret guarantees in both settings.

2.2. Estimation using weighted measurement error adjustment

In this section, we focus on learning the reward model parameters $\{\theta^*_a\}_{a\in\mathcal{A}}$ with data $\mathcal{H}_t$ after a policy $\{\pi_\tau(\cdot\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})\}_{\tau\in[t]}$ has been executed up to time $t\in[T]$. Learning a consistent model is important in many bandit algorithms for achieving low regret (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013). As we shall see in Section 2.3, consistent estimation of $\{\theta^*_a\}_{a\in\mathcal{A}}$ plays an essential role in controlling the regret of our proposed algorithm.

Inconsistency of the regularized least-squares (RLS) estimator.

UCB and Thompson sampling, the two classical bandit algorithms, both achieve sublinear regret based on the consistency of the estimator $\hat{\theta}_{a,\mathrm{RLS}}(t)=\big(\lambda I+\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}x_\tau x_\tau^\top\big)^{-1}\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}x_\tau r_\tau$ under certain norms. When $\tilde{x}_\tau=x_\tau+\epsilon_\tau$ is observed instead of $x_\tau$, the RLS estimator becomes $\hat{\theta}_{a,\mathrm{RLSCE}}(t)=\big(\lambda I+\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}\tilde{x}_\tau\tilde{x}_\tau^\top\big)^{-1}\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}\tilde{x}_\tau r_\tau$. Here 'RLSCE' means the RLS estimator with contextual error. However, when $\Sigma_{e,\tau}=\mathrm{Var}(\epsilon_\tau)$ is non-vanishing, $\hat{\theta}_{a,\mathrm{RLSCE}}(t)$ is generally no longer consistent, which may lead to bad decision-making (see Appendix B for details). In the simple case where $\{(x_\tau,\epsilon_\tau,\eta_\tau)\}_{\tau\in[t]}$ are i.i.d. and there is no action (i.e. set $a_\tau\equiv0$), the inconsistency of $\hat{\theta}_{a,\mathrm{RLSCE}}(t)$ is studied in the measurement error model literature in statistics (Fuller, 2009; Carroll et al., 1995), and is known as attenuation.

A naive measurement error adjustment.

A measurement error model is a type of regression model designed to accommodate inaccuracies in the measurement of regressors (i.e., instead of observing $x_t$, we observe $x_t+\epsilon_t$ where $\epsilon_t$ is a noise term with zero mean). As conventional regression techniques yield inconsistent estimators, measurement error models rectify this issue with adjustments to the estimator that account for these errors. In the current context, when we want to estimate $\theta^*_a$ from the history $\mathcal{H}_t$, $\{\tilde{x}_\tau\}_{\tau\in[t]}$ can be viewed as regressors 'measured with error', while $\{r_\tau\}_{\tau\in[t]}$ are dependent variables. If $\{(\epsilon_\tau,\eta_\tau)\}_{\tau\in[t]}$ are i.i.d., $\Sigma_{e,\tau}\equiv\Sigma_e$, and there is no action (i.e. set $a_\tau\equiv0$), then $\hat{\theta}_{0,\mathrm{me}}(t):=\big(\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top-\Sigma_e\big)^{-1}\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau r_\tau$ is a consistent estimator of $\theta^*_0$. When multiple actions are present, a naive generalization of the above estimator, $\hat{\theta}_{a,\mathrm{me}}(t)$, is

$$\Big(\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}\big(\tilde{x}_\tau\tilde{x}_\tau^\top-\Sigma_e\big)\Big)^{-1}\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}\tilde{x}_\tau r_\tau. \tag{2.4}$$

Unfortunately, $\hat{\theta}_{a,\mathrm{me}}(t)$ is inconsistent in the multiple-action setting, even if the policy $\{\pi_\tau\}_{\tau\in[t]}$ is stationary and not adaptive to the history. This difference is essentially due to the interaction between the policy and the measurement error: in the classical measurement error (no action) setting, $\mathbb{E}[\tilde{x}_\tau\tilde{x}_\tau^\top]=x_\tau x_\tau^\top+\Sigma_e$, so $\frac{1}{t}\sum_{\tau\in[t]}(\tilde{x}_\tau\tilde{x}_\tau^\top-\Sigma_e)$ concentrates around its expectation $\frac{1}{t}\sum_{\tau\in[t]}x_\tau x_\tau^\top$. Likewise, $\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau r_\tau$ concentrates around $\frac{1}{t}\sum_{\tau\in[t]}x_\tau x_\tau^\top\theta^*_0$. Combining the two parts thus yields a consistent estimator of $\theta^*_0$. In our setting with multiple actions, however, the agent picks the action $a_\tau$ based on $\tilde{x}_\tau$, so only certain values of $\tilde{x}_\tau$ lead to $a_\tau=a$. Therefore, for those $\tau\in[t]$ when we pick $a_\tau=a$, it is more likely that $\tilde{x}_\tau$ falls in certain regions depending on the policy, and we should not expect $\mathbb{E}[\tilde{x}_\tau\tilde{x}_\tau^\top\mid a_\tau=a]=x_\tau x_\tau^\top+\Sigma_e$ anymore. In other words, the policy creates a complicated dependence between $\tilde{x}_\tau\tilde{x}_\tau^\top$ and $\mathbb{1}\{a_\tau=a\}$ for each $\tau$, which changes the limit of $\frac{1}{t}\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}(\tilde{x}_\tau\tilde{x}_\tau^\top-\Sigma_e)$ (and similarly of $\frac{1}{t}\sum_{\tau\in[t]}\mathbb{1}\{a_\tau=a\}\tilde{x}_\tau r_\tau$). This leads to the inconsistency of the naive estimator (see Appendix B for a concrete example). In Section 3, we provide examples showing that (2.4) not only deviates from the true parameters, but also leads to suboptimal decision-making.

Our proposed estimator.

Inspired by the above observations, for $\pi_\tau(a\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})$ positive, we construct the following estimator of $\theta^*_a$ given $\mathcal{H}_t$, which corrects (2.4) using importance weights:

$$\hat{\theta}_a(t):=\Big(\hat{\Sigma}_{\tilde{x},a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,\Sigma_{e,\tau}\Big)^{-1}\hat{\Sigma}_{\tilde{x},r,a}(t), \tag{2.5}$$

where

$$\hat{\Sigma}_{\tilde{x},a}(t)=\frac{1}{t}\sum_{\tau\in[t]}\frac{\pi^{nd}_\tau(a_\tau)}{\pi_\tau(a_\tau\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})}\mathbb{1}\{a_\tau=a\}\,\tilde{x}_\tau\tilde{x}_\tau^\top,\qquad \hat{\Sigma}_{\tilde{x},r,a}(t)=\frac{1}{t}\sum_{\tau\in[t]}\frac{\pi^{nd}_\tau(a_\tau)}{\pi_\tau(a_\tau\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})}\mathbb{1}\{a_\tau=a\}\,\tilde{x}_\tau r_\tau.$$

Here, $\{\pi^{nd}_\tau(\cdot)\}_{\tau\in[t]}$ is a pre-specified policy (which does not depend on $\{\tilde{x}_\tau\}_\tau$ or $\mathcal{H}_{\tau-1}$) that can be chosen by the algorithm. We only require the following:

Assumption 2.2. Let $w_t=\sqrt{d}\,x_t$ and $\tilde{w}_t=\sqrt{d}\,\tilde{x}_t$ be scaled versions of $x_t$ and $\tilde{x}_t$. Then there exist positive constants $\xi,\nu,\lambda_0$ such that (i) $\forall u\in\mathbb{S}^{d-1}$, $\mathbb{E}[\langle u,\tilde{w}_t\rangle^4]\le\xi$; (ii) $\mathbb{E}[\tilde{w}_t\tilde{w}_t^\top]\preceq\nu I_d$; and (iii) $\forall t\le T$, $a\in\{0,1\}$, $\frac{1}{t}\sum_{\tau\le t}\pi^{nd}_\tau(a)\,w_\tau w_\tau^\top\succeq\lambda_0 I_d$.

Remark 2.1. In Assumption 2.2, (i) and (ii) are standard moment assumptions, and (iii) is mild. Even restricted to the choice $\pi^{nd}_\tau(a)\equiv1/2$, under mild conditions the assumption is satisfied, with high probability, by deterministic $\{x_\tau\}_{\tau\ge1}$ or by stochastic $\{x_\tau\}_{\tau\ge1}$ such as an i.i.d. sequence, a weakly dependent stationary time series (e.g. a multivariate ARMA process (Fan and Yao, 2017)), or a sequence with periodicity/seasonality (see Appendix D for details).

The theorem below gives a high-probability upper bound on $\|\hat{\theta}_a(t)-\theta^*_a\|_2$ (proof in Appendix D).

Theorem 2.1. For any $t\in[T]$, denote $q_t:=\inf_{\tau\le t,a\in\{0,1\}}\pi_\tau(a\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})$. Then under Assumptions 2.1 and 2.2, there exist absolute constants $C,C_1$ such that, as long as $q_t\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 t},\ \frac{\xi(d+\log t)}{\lambda_0^2 t}\right\}$, with probability at least $1-8t^{-2}$, for all $a\in\{0,1\}$, $\|\hat{\theta}_a(t)-\theta^*_a\|_2$ is upper bounded by

$$\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}. \tag{2.6}$$

Remark 2.2.

Unlike the existing literature on off-policy learning in contextual bandits (e.g. Wang et al. (2017); Zhan et al. (2021); Zhang et al. (2021); Bibaut et al. (2021)), the role of the importance weights here is to correct the dependence of a policy on the observed noisy context with error. The proof idea can be generalized to a large class of off-policy method-of-moment estimators, which might be of independent interest (see Appendix D).
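As a concrete illustration of estimator (2.5), the following sketch computes the importance-weighted, measurement-error-adjusted estimate from logged tuples $(\tilde{x}_\tau, a_\tau, r_\tau, \pi_\tau(a_\tau\mid\cdot))$ with $\pi^{nd}_\tau(a)\equiv1/2$ and a constant $\Sigma_e$. This is our own assumption-laden reading of the formula, not the authors' code; the small ridge term is added purely for numerical stability.

```python
import numpy as np

def meb_estimator(x_tilde, actions, rewards, probs, Sigma_e, a, pi_nd=0.5, ridge=1e-8):
    """Estimator (2.5) for arm `a`.

    x_tilde: (t, d) observed noisy contexts
    actions: (t,) actions taken
    rewards: (t,) observed rewards
    probs:   (t,) probability with which the taken action was sampled
    """
    t, d = x_tilde.shape
    w = (pi_nd / probs) * (actions == a)       # weights: 1{a_tau = a} * pi_nd / pi_tau(a_tau | ...)
    Sigma_xx = (x_tilde.T * w) @ x_tilde / t   # weighted second moment of noisy contexts
    Sigma_xr = (x_tilde.T * w) @ rewards / t   # weighted cross moment with rewards
    adjust = pi_nd * Sigma_e                   # (1/t) * sum_tau pi_nd(a) * Sigma_e for constant Sigma_e
    A = Sigma_xx - adjust + ridge * np.eye(d)  # measurement-error adjustment of the Gram matrix
    return np.linalg.solve(A, Sigma_xr)
```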

2.3. MEB: Online bandit algorithm with measurement error adjustment

We propose MEB (Measurement Error Bandit), an online bandit algorithm with measurement error adjustment based on the estimator (2.5). The algorithm is presented in Algorithm 1 and is designed for the binary-action setting, although it can be generalized to the case with multiple actions (see Appendix C). For $t\le T_0$, the algorithm is in a warm-up stage and can pick any policy such that there is a minimum sampling probability $p_0(t)$ for each action (here $p_0(t)\in(0,\tfrac12]$). For instance, the algorithm can do pure exploration with $\pi_t(a\mid\tilde{x}_t,\mathcal{H}_{t-1})\equiv\tfrac12$. For $t>T_0$, given the noisy context $\tilde{x}_t$, the algorithm computes the best action $\tilde{a}_t$ according to $\{\hat{\theta}_a(t-1)\}_{a\in\{0,1\}}$ calculated from (2.5). Then, it samples $\tilde{a}_t$ with probability $1-p_0(t)$ and keeps an exploration probability of $p_0(t)$ for the other action. In practice, we can often set $p_0(t)$ to be monotonically decreasing in $t$, in which case $q_t=\inf_{\tau\le t,a\in\{0,1\}}\pi_\tau(a\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})=p_0(t)$ for all $t\in[T]$.
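A minimal sketch of this online loop is given below. The simulator interface `env.reset()`/`env.step(a)`, the warm-up policy, and the per-round refitting schedule are our own illustrative assumptions, and `meb_estimator` refers to the hypothetical helper sketched at the end of Section 2.2.

```python
import numpy as np

def run_meb(env, T, T0, p0, Sigma_e, d, rng):
    """env.reset() -> first noisy context; env.step(a) -> (next noisy context, reward)."""
    logs = {"x": [], "a": [], "r": [], "p": []}
    theta_hat = {0: np.zeros(d), 1: np.zeros(d)}
    x_tilde = env.reset()
    for t in range(1, T + 1):
        if t <= T0:                                   # warm-up: pure exploration
            probs = np.array([0.5, 0.5])
        else:                                         # greedy w.r.t. estimates, clipped by p0(t)
            greedy = int(theta_hat[1] @ x_tilde > theta_hat[0] @ x_tilde)
            probs = np.full(2, p0(t))
            probs[greedy] = 1.0 - p0(t)
        a = rng.choice(2, p=probs)
        x_next, r = env.step(a)
        logs["x"].append(x_tilde); logs["a"].append(a); logs["r"].append(r); logs["p"].append(probs[a])
        if t > T0:                                    # refit both arms from all logged data
            X, A, R, P = (np.array(logs[k]) for k in ("x", "a", "r", "p"))
            for arm in (0, 1):
                theta_hat[arm] = meb_estimator(X, A, R, P, Sigma_e, arm)
        x_tilde = x_next
    return theta_hat
```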

Before presenting the regret analysis, we should first note that our problem is harder than a standard contextual bandit: $x_t$ is unknown, and only $\tilde{x}_t$ is observed. Thus, even if $\{\theta^*_a\}_{a\in\{0,1\}}$ were known, we may still perform suboptimally if $\tilde{x}_t$ is so far from $x_t$ that it leads to a different optimal action. Example 2.1 below shows that in general, we cannot avoid linear regret.

Example 2.1. Let $d=1$ and $(\theta^*_1,\theta^*_0)=(1,-1)$. $\{x_t\}_{t\in[T]}$ are drawn i.i.d. from $\{\pm0.2\}$ with equal probability. $P(\tilde{x}_t=1\mid x_t=0.2)=0.6$, $P(\tilde{x}_t=-1\mid x_t=0.2)=0.4$; $P(\tilde{x}_t=1\mid x_t=-0.2)=0.4$, $P(\tilde{x}_t=-1\mid x_t=-0.2)=0.6$. Intuitively, even if we know $\{\theta^*_a\}_{a\in\{0,1\}}$, there is still a constant probability at each time $t$ that we cannot make the right choice because $\tilde{x}_t$ and $x_t$ have different signs and $x_t$ is never known (details in Appendix E). This results in an $\Omega(T)$ regret.

Fortunately, in practice, we expect that the errors $\{\epsilon_t\}_{t\in[T]}$ are relatively 'small' in the sense that the optimal action (given $\{\theta^*_a\}_{a\in\{0,1\}}$) is not affected. Specifically, we assume the following:

Assumption 2.3. There exists a constant $\rho\in(0,1)$ such that $\forall t\in[T]$, $|\langle\delta_\theta,\epsilon_t\rangle|\le\rho\,|\langle\delta_\theta,x_t\rangle|$ almost surely. Here $\delta_\theta:=\theta^*_1-\theta^*_0$.

Assumption 2.3 ensures that the perturbation of the suboptimality gap between the two arms caused by $\epsilon_t$ is controlled by the true suboptimality gap. In this way, given $\{\theta^*_a\}_{a\in\{0,1\}}$, the optimal action based on $\tilde{x}_t$ will not deviate from that based on $x_t$. As a special case, this assumption is satisfied if $\forall t$, $|\langle\delta_\theta,x_t\rangle|\ge B_{e,t}\|\delta_\theta\|_2/\rho$, where $B_{e,t}$ is an upper bound on $\|\epsilon_t\|_2$. Assumption 2.3 can be further weakened to the inequalities holding with high probability (see Appendix E). Note that Assumption 2.3 only guarantees that the optimal action is not affected by $\{\epsilon_t\}_{t\in[T]}$ given $\{\theta^*_a\}_{a\in\{0,1\}}$. To achieve sublinear regret, $\{\theta^*_a\}_{a\in\{0,1\}}$ still needs to be well estimated. Thus, even under Assumption 2.3, classical bandit algorithms such as UCB may still suffer linear regret because of the inconsistent estimator $\hat{\theta}_{a,\mathrm{RLSCE}}(t)$ (see Appendix B for a concrete example).

We first prove the following theorem, which states that the regret of MEB can be directly controlled by the estimation error. In fact, this theorem holds regardless of the form or quality of the estimation procedure (i.e. in line 7 of Algorithm 1). The proof is in Appendix E.

Theorem 2.2. Let Assumptions 2.1 and 2.3 hold.

  1. For the standard setting, Algorithm 1 outputs a policy with $\mathrm{Regret}(T;\pi^*)$ no more than
    $$2T_0R_\theta+\frac{2}{1-\rho}\sum_{t=T_0+1}^{T}\Big(p_0(t)R_\theta+\max_{a\in\{0,1\}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2\Big).$$
  2. For the clipped policy setting, Algorithm 1 with the choice of $p_0(t)\equiv p_0$ outputs a policy with $\mathrm{Regret}(T;\bar{\pi}^*)$ no more than
    $$2T_0R_\theta+\frac{2(1-2p_0)}{1-\rho}\sum_{t=T_0+1}^{T}\max_{a\in\{0,1\}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2.$$

The following corollary provides regret guarantees of MEB by combining Theorem 2.1 and 2.2 (proof in Appendix E).

Corollary 2.1. Let Assumptions 2.1 to 2.3 hold. There exist universal constants $C,C'$ such that:

  1. For the standard setting, if $T\ge C\max\{(1+1/\lambda_0)^{9/4}(d+\log T)^3,\ (\xi/\lambda_0)^{9/4}(d+\log T)^{4/3}\}$, then with probability at least $1-\frac{16}{T}$, Algorithm 1 with the choice of $T_0=2dT^{2/3}$ and $p_0(t)=\min\{\tfrac12,t^{-1/3}\}$ outputs a policy with $\mathrm{Regret}(T;\pi^*)$ no more than
    $$C'dT^{2/3}\left(\frac{R_\theta}{1-\rho}+\frac{(\nu+\xi+1)(R+R_\theta)}{(1-\rho)\lambda_0}\sqrt{1+\frac{\log T}{d}}\right).$$
  2. For the clipped policy setting, if $T\ge C\max\{(d+\log T)^2/(\lambda_0 p_0)^2,\ (\xi^2/\lambda_0^4)(1+\log T/d)^2\}$, then with probability at least $1-\frac{16}{T}$, Algorithm 1 with the choice of $T_0=2\sqrt{dT}$ and $p_0(t)\equiv p_0$ outputs a policy with $\mathrm{Regret}(T;\bar{\pi}^*)$ no more than
    $$C'd\sqrt{T}\left(R_\theta+\frac{(\nu+\xi+1)(1-2p_0)(R+R_\theta)}{p_0(1-\rho)\lambda_0}\sqrt{1+\frac{\log T}{d}}\right).$$

Ignoring other factors, the regret upper bound is of order $\tilde{\mathcal{O}}(dT^{2/3})$ for the standard setting and $\tilde{\mathcal{O}}(d\sqrt{T})$ for the clipped policy setting, in terms of the horizon $T$ and the dimension $d$.

In certain scenarios (e.g. when $d$ is large), it is desirable to save computational resources by updating the estimates of $\{\theta^*_a\}_{a\in\{0,1\}}$ less frequently in Algorithm 1. Fortunately, low regret guarantees can still be achieved: suppose the agent only updates the estimators according to (2.5) at selected time points $t\in\mathcal{S}\subseteq[T]$ (in line 7), and otherwise simply makes decisions based on the most recently updated estimators. In Appendix E, we show that the update times can be very infrequent, such as $\{n^k\}_{k\in\mathbb{N}^+}$ or $\{n^2:n\in\mathbb{N}^+\}$, while still achieving the same rate of regret upper bound as in Corollary 2.1.

2.4. MEB given estimated error variance

In practice, the agent might not have perfect knowledge of $\Sigma_{e,t}$, the variance of the error $\epsilon_t$. In this section, we discuss the situation where at each time $t$, the agent does not know $\Sigma_{e,t}$ and only has a (potentially adaptive) estimator $\hat{\Sigma}_{e,t}$ of $\Sigma_{e,t}$. This estimator may be derived from auxiliary data or outside knowledge. In this case, in Algorithm 1, we need to replace the estimator (2.5) with the following estimator for decision-making (i.e. in line 7 of Algorithm 1):

$$\tilde{\theta}_a(t):=\Big(\hat{\Sigma}_{\tilde{x},a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,\hat{\Sigma}_{e,\tau}\Big)^{-1}\hat{\Sigma}_{\tilde{x},r,a}(t). \tag{2.7}$$

In Appendix F, we show that with this modification, the additional regret of Algorithm 1 is controlled by

$$d\sum_{t=T_0+1}^{T}\max_{a\in\{0,1\}}\|\Delta_t(a)\|_2$$

up to a constant depending on the assumptions. Here, for each $t$, $\Delta_t(a):=\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\big(\hat{\Sigma}_{e,\tau}-\Sigma_{e,\tau}\big)$ is the weighted average of the estimation errors $\{\hat{\Sigma}_{e,\tau}-\Sigma_{e,\tau}\}_{\tau\in[t]}$. In practice, it is reasonable to assume that $\Delta_t(a)$ is small enough not to significantly affect the overall regret: for example, suppose the agent gathers more auxiliary data over time so that $d\|\hat{\Sigma}_{e,t}-\Sigma_{e,t}\|_2\le\sqrt{d/t}$; then the additional regret term will be $\mathcal{O}(\sqrt{dT})$ up to a constant depending on the assumptions.
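In code, the only change relative to the sketch of (2.5) given at the end of Section 2.2 is that the adjustment term uses the weighted average of the per-round estimates $\hat{\Sigma}_{e,\tau}$. The helper below is a hypothetical illustration of ours, reusing the same argument layout.

```python
import numpy as np

def meb_estimator_plugin(x_tilde, actions, rewards, probs, Sigma_e_hats, a, pi_nd=0.5, ridge=1e-8):
    """Estimator (2.7): identical to (2.5) except Sigma_{e,tau} is replaced by its estimate.

    Sigma_e_hats: (t, d, d) array of per-round estimates of the error variance.
    """
    t, d = x_tilde.shape
    w = (pi_nd / probs) * (actions == a)
    Sigma_xx = (x_tilde.T * w) @ x_tilde / t
    Sigma_xr = (x_tilde.T * w) @ rewards / t
    adjust = pi_nd * np.mean(Sigma_e_hats, axis=0)   # (1/t) * sum_tau pi_nd(a) * Sigma_e_hat_tau
    return np.linalg.solve(Sigma_xx - adjust + ridge * np.eye(d), Sigma_xr)
```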

3. Simulation results

In this section, we complement our theoretical analyses with simulation results on a synthetic environment with artificial noise and reward models, as well as a simulation environment based on the real dataset HeartSteps V1 (Klasnja et al., 2018).

Compared algorithms.

In both simulation environments, we compare the following algorithms: Thompson sampling (TS) with normal priors (Russo et al., 2018), Linear Upper Confidence Bound (UCB) approach (Chu et al., 2011), MEB (Algorithm 1), and MEB-naive (MEB plugged in with the naive measurement error estimator (2.4) instead of (2.5)). See Appendix A for a detailed description of the algorithms.

3.1. Synthetic environment

We first test our algorithms in a synthetic environment. We consider a contextual bandit environment with $d=5$ and $T=50000$. In the reward model, we set $\theta^*_0=(5,6,4,6,4)$, $\theta^*_1=(6,5,5,5,5)$, and $\eta_t$ drawn i.i.d. from $\mathcal{N}(0,\sigma_\eta^2)$. Let $\{x_t\}_{t\in[T]}$ be independently sampled from $\mathcal{N}(\mu_x,I_d)$, where $\mu_x=\mathbf{1}_d$. We further set $\Sigma_{e,t}\equiv\Sigma_e:=I_d/4$ and consider independent $\{\epsilon_t\}_{t\in[T]}$ drawn from a normal distribution with mean zero and covariance $\Sigma_e$. We independently generate bandit data $n_{\exp}=100$ times, and compare the candidate algorithms in terms of estimation quality and cumulative regret with a moderate exploration probability $p_0=0.2$.
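A sketch of how such a synthetic environment could be generated is given below; only the stated parameter values come from the text, while the sampling code and function name are our assumptions.

```python
import numpy as np

def make_synthetic_env(T=50000, d=5, sigma_eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = {0: np.array([5., 6., 4., 6., 4.]), 1: np.array([6., 5., 5., 5., 5.])}
    Sigma_e = np.eye(d) / 4.0
    x = rng.normal(loc=1.0, scale=1.0, size=(T, d))                 # x_t ~ N(1_d, I_d)
    x_tilde = x + rng.multivariate_normal(np.zeros(d), Sigma_e, size=T)  # noisy contexts
    def reward(t, a):
        return theta[a] @ x[t] + rng.normal(scale=sigma_eta)        # r_t = <theta_a, x_t> + eta_t
    return x, x_tilde, reward, Sigma_e
```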

3.2. HeartSteps V1 simulation environment

We also construct a simulation environment based on the HeartSteps dataset. HeartSteps is a physical activity mobile health application whose primary goal is to help users prevent negative health outcomes and adopt and maintain healthy behaviors, for example, a higher physical activity level. HeartSteps V1 is a 42-day mobile health trial (Dempsey et al., 2015; Klasnja et al., 2015; Liao et al., 2016) in which participants are provided a Fitbit tracker and a mobile phone application. One of the intervention components is a contextually-tailored physical activity suggestion that may be delivered at any of five user-specified times during each day. The delivery times are roughly separated by 2.5 hours.

Construction of the simulated environment.

We follow the simulation setups in Liao et al. (2020). The true context at time $t$ is denoted by $x_t$ with three main components, $x_t=(I_t,Z_t,B_t)$. Here, $I_t$ is an indicator of whether an intervention ($A_t=1$) is feasible (e.g. $I_t=0$ when the participant is driving a car, a situation where the suggestion should not be sent). $Z_t$ contains additional features at time $t$. $B_t$ is the true treatment burden, which is a function of the participant's treatment history; specifically, $B_{t+1}=\lambda B_t+\mathbb{1}\{A_t=1\}$. We assume that $\{I_t\}_{t\in[T]}$ and $\{Z_t\}_{t\in[T]}$ are sampled i.i.d. from the empirical distribution of the HeartSteps V1 dataset, and $\{B_t\}_{t\in[T]}$ is given by the aforementioned transition model.

The reward model is $r_t(x,a;\theta)=x^\top\alpha+a\,f(x)^\top\beta+\eta_t$, where $x$ is the full context, $f(x)$ is a subset of $x$ considered to have an impact on the treatment effects, and $\theta=(\alpha,\beta)\in\mathbb{R}^9$. Here $\eta_t$ is Gaussian noise on the reward observation, whose variance $\sigma_\eta^2$ is chosen to be 0.1, 1.0, and 5.0 respectively (Liao et al., 2016). For a detailed list of variables in the context, see Table 2 in Appendix A.

The true parameters $\{\theta^*_a\}_{a\in\{0,1\}}$ are estimated via GEE (Generalized Estimating Equations), with the reward being the log-transformed step count collected in the 30 minutes following the decision time.

In light of the measurement error setting in this paper, we consider observation noise on $B_t$ for the following reasons: 1) the burden $B_t$ can be understood as a prediction of the participant's burden level, which is particularly crucial in mobile health studies; 2) the other variables are generally believed to have little or no observation noise. Thus, we assume that the agent only observes $\tilde{x}_t=(I_t,Z_t,\tilde{B}_t)$, where $\tilde{B}_t=B_t+\epsilon_t$ and $\epsilon_t$ is drawn i.i.d. from a normal distribution with mean zero and variance $\sigma_\epsilon^2$.
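The burden transition and the noisy observation of $B_t$ can be sketched as follows; the decay value and the function name are placeholders of ours, not taken from the paper.

```python
import numpy as np

def burden_trajectory(actions, lam=0.9, sigma_eps=1.0, seed=0):
    """B_{t+1} = lam * B_t + 1{A_t = 1}; the agent observes B_t plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    B = np.zeros(len(actions) + 1)
    for t, a in enumerate(actions):
        B[t + 1] = lam * B[t] + (a == 1)
    B_tilde = B + rng.normal(scale=sigma_eps, size=B.shape)   # noisy burden observations
    return B, B_tilde
```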

3.3. Results

Table 1 (a) and (b) show the average regret (cumulative regret divided by $T$) in both the synthetic environment and the real-data environment based on HeartSteps V1. We use the same set of $\sigma_\epsilon^2\in\{0.1,1.0,2.0\}$, while the different $\sigma_\eta^2$ values reflect the different scales of the coefficients in the two environments ($\sigma_\eta^2=5.0$ is the level of reward noise in HeartSteps V1). MEB shows significantly smaller average regret than the baseline methods under most combinations of $\sigma_\eta^2$ and $\sigma_\epsilon^2$. In certain instances, MEB-naive exhibits performance comparable to MEB. This is attributed to its ability to reduce the variance of model estimation while incurring some bias compared to MEB, rendering it a feasible alternative in practical contexts. Notably, in two extreme scenarios, UCB surpasses both MEB and MEB-naive. This is as expected, since when the contextual noise is sufficiently negligible, traditional bandit algorithms are anticipated to outperform the proposed algorithms. An estimation error plot can be found in Appendix A, which also demonstrates that MEB has lower estimation error.

Table 1:

Average regret for both synthetic environment and real-data environment under different combinations of ση2 and σϵ2. The results are averages over 100 independent runs and the standard deviations are reported in the full table in Appendix A.

(a) Average regret in the synthetic environment over 50000 steps with clipping probability p = 0.2.
ση2 σϵ2 TS UCB MEB MEB-naive
0.01 0.1 0.047 0.046 0.027 0.038
0.1 0.1 0.047 0.047 0.026 0.039
1.0 0.1 0.048 0.048 0.027 0.038
0.01 1.0 0.757 0.647 0.198 0.371
0.1 1.0 0.769 0.721 0.205 0.392
1.0 1.0 0.714 0.697 0.218 0.404
0.01 2.0 1.492 1.504 0.358 0.616
0.1 2.0 1.195 1.333 0.368 0.584
1.0 2.0 1.299 1.476 0.416 0.625
(b) Average regret in the real-data environment over 2500 steps with clipping probability p = 0.2.
ση2 σϵ2 TS UCB MEB MEB-naive
0.05 0.1 0.027 0.027 0.022 0.024
0.1 0.1 0.026 0.024 0.020 0.020
5.0 0.1 1.030 0.743 0.831 1.173
0.05 1.0 0.412 0.408 0.117 0.112
0.1 1.0 0.309 0.316 0.085 0.087
5.0 1.0 1.321 0.918 1.458 1.322
0.05 2.0 0.660 0.634 0.144 0.148
0.1 2.0 0.740 0.704 0.151 0.155
5.0 2.0 1.585 2.415 1.577 1.436

4. Discussion and conclusions

We propose a new algorithm, MEB, which is the first algorithm with sublinear regret guarantees in contextual bandits with noisy context, where we have limited knowledge of the noise distribution. This setting is common in practice, especially where only predictions for unobserved context are available. MEB leverages the novel estimator (2.5), which extends the conventional measurement error adjustment techniques by considering the interplay between the policy and the measurement error.

Limitations and future directions.

Several questions remain for future investigation. First, is $\tilde{\mathcal{O}}(T^{2/3})$ the optimal rate of regret compared to the standard benchmark policy (2.2), as in some other bandits with semi-parametric reward models (e.g. Xu and Wang (2022))? Providing lower bounds on the regret would help us understand the limits of improvement for the online algorithm. Second, we assume that the agent has an unbiased prediction of the true context; it is important to understand how biased predictions affect the results. Last but not least, it would be interesting to see whether our method can be extended to more complicated decision-making settings (e.g. Markov decision processes).

A. Additional details for simulation studies

A. 1. Compared algorithms

In both simulation environments, we compare the following algorithms: Thompson sampling (TS, see details in Algorithm 2) with normal priors (Russo et al., 2018), the Linear Upper Confidence Bound (UCB, see details in Algorithm 3) approach (Chu et al., 2011), MEB (Algorithm 1), and MEB-naive (MEB with the naive measurement error estimator (2.4) plugged in instead of (2.5)). To make a fair comparison between algorithms, we use the same regularization parameter $\lambda=1$ for all algorithms. The hyper-parameters $\rho$ and $C$ are set to $\sigma_\eta^2$ in all results for TS and UCB.


We further compare with robust linear UCB (Algorithm 1 in Ding et al. (2022)), which is shown to achieve the minimax rate for adversarial linear bandits.

A. 2. Additional details for HeartStep V1 study

Table 2 presents the list of variables included in the reward model and in the feature construction for the algorithms. Recall that our reward model is $r_t(x,a;\theta)=x^\top\alpha+a\,f(x)^\top\beta+\eta_t$. All the variables are included in $x$, while only those considered to have an impact on the treatment effect are included in $f(x)$.

Table 2:

List of variables in HeartSteps V1 study.

Variable Type Treatment?
Availability (It) Discrete No
Prior 30-minute step count Continuous No
Yesterday’s step count Continuous No
Prior 30-minute step count Continuous No
Location Discrete Yes
Current temperature Continuous No
Step variation level Discrete Yes
Burden variable (Bt) Continuous Yes

A.3. Additional results on estimation error

Figure 1: Log-scaled $L_2$ norm of $\hat{\theta}_1-\theta^*_1$ for the four algorithms in the synthetic environment over 50000 steps under $\sigma_\epsilon^2\in\{0.1,1.0,2.0\}$ and $\sigma_\eta^2\in\{0.01,0.1,1.0\}$.

Figure 2: Log-scaled $L_2$ norm of $\hat{\theta}_1-\theta^*_1$ for the four algorithms in the real-data environment based on HeartSteps V1 over 2500 steps under $\sigma_\epsilon^2\in\{0.1,1.0,2.0\}$ and $\sigma_\eta^2\in\{0.05,0.1,5.0\}$.

A. 4. Average regret with standard deviation

Table 3:

Average regret for both synthetic environment and real-data environment under different combinations of ση2 and σϵ2. The numbers in parentheses are the standard deviations calculated from 100 independent runs.

(a) Average regret in the synthetic environment over 50000 steps with clipping probability p = 0.2.
ση2 σϵ2 TS UCB MEB MEB-naive RobustUCB
0.01 0.1 0.047 (0.0015) 0.046 (0.0015) 0.027 (0.0011) 0.038 (0.0013) 0.050 (0.0051)
0.1 0.1 0.047 (0.0015) 0.047 (0.0015) 0.026 (0.0011) 0.039 (0.0013) 0.049 (0.0048)
1.0 0.1 0.048 (0.0015) 0.048 (0.0015) 0.027 (0.0011) 0.038 (0.0013) 0.044 (0.0047)
0.01 1.0 0.757 (0.0164) 0.647 (0.0145) 0.198 (0.0079) 0.371 (0.0107) 0.652 (0.0050)
0.1 1.0 0.769 (0.0160) 0.721 (0.0156) 0.205 (0.0080) 0.392 (0.0110) 0.753 (0.0056)
1.0 1.0 0.714 (0.0155) 0.697 (0.0150) 0.218 (0.0083) 0.404 (0.0112) 0.589 (0.0047)
0.01 2.0 1.492 (0.0281) 1.504 (0.0283) 0.358 (0.0129) 0.616 (0.0169) 1.608 (0.0102)
0.1 2.0 1.195 (0.0244) 1.333 (0.0260) 0.368 (0.0131) 0.584 (0.0164) 1.064 (0.0079)
1.0 2.0 1.299 (0.0257) 1.476 (0.0277) 0.416 (0.0139) 0.625 (0.0170) 1.881 (0.0114)
(b) Average regret in the real-data environment over 2500 steps with clipping probability p = 0.2.
ση2 σϵ2 TS UCB MEB MEB-naive RobustUCB
0.05 0.1 0.027 (0.0067) 0.027 (0.0070) 0.022 (0.0057) 0.024 (0.0058) 0.025 (0.0079)
0.1 0.1 0.026 (0.0057) 0.024 (0.0053) 0.020 (0.0046) 0.020 (0.0046) 0.028 (0.0079)
5.0 0.1 1.030 (0.0287) 0.743 (0.0262) 0.831 (0.0267) 1.173 (0.0343) 1.400 (0.0447)
0.05 1.0 0.412 (0.0355) 0.408 (0.0351) 0.117 (0.0148) 0.112 (0.0143) 0.226 (0.0020)
0.1 1.0 0.309 (0.0293) 0.316 (0.0299) 0.085 (0.0112) 0.087 (0.0116) 0.206 (0.0125)
5.0 1.0 1.321 (0.0417) 0.918 (0.0304) 1.458 (0.0422) 1.322 (0.0388) 1.065 (0.0400)
0.05 2.0 0.660 (0.0343) 0.634 (0.0322) 0.144 (0.0129) 0.148 (0.0133) 0.304 (0.0241)
0.1 2.0 0.740 (0.0505) 0.704 (0.0489) 0.151 (0.0145) 0.155 (0.0149) 0.432 (0.0386)
5.0 2.0 1.585 (0.0454) 2.415 (0.0816) 1.577 (0.0508) 1.436 (0.0462) 1.345 (0.0423)

B. Additional explanations on the regularized least-squares (RLS) estimator and the naive estimator (2.4) under noisy context

B. 1. Inconsistency of the RLS estimator

Measurement error model and attenuation.

As briefly mentioned in the main text, a measurement error model is a regression model designed to accommodate inaccuracies in the measurement of regressors. Suppose that there is no action (i.e. set $a_\tau\equiv0$) and $\{(x_\tau,\epsilon_\tau,\eta_\tau)\}_{\tau\in[t]}$ are i.i.d.; then the measurement error model is a useful tool to learn $\theta^*_0$ from $\mathcal{H}_t$ collected as follows:

$$r_\tau=\langle\theta^*_0,x_\tau\rangle+\eta_\tau,\qquad \tilde{x}_\tau=x_\tau+\epsilon_\tau. \tag{B.1}$$

Here, from the measurement error model's perspective, $\{\tilde{x}_\tau\}_{\tau\in[t]}$ are regressors 'measured with error', and $\{r_\tau\}_{\tau\in[t]}$ are dependent variables.

Regression attenuation, a phenomenon intrinsic to measurement error models, refers to the observation that when the predictors are subject to measurement errors, the Ordinary Least Squares (OLS) estimators of regression coefficients become biased (see, for instance, Carroll et al. (1995)). Specifically, in simple linear regression, the OLS estimator for the slope tends to be biased towards zero. Intuitively, this is because the measurement errors effectively ‘dilute’ the true relationship between variables, making it appear weaker than it actually is.
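A quick numerical illustration of attenuation in simple linear regression (the values below are our own illustrative choices, not from the paper): with true slope 1 and measurement-error variance equal to the predictor variance, the slope fitted on the noisy regressor concentrates near $\Sigma_x/(\Sigma_x+\Sigma_e)=0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                  # true regressor, Var(x) = 1
x_tilde = x + rng.normal(size=n)        # observed with error, Var(eps) = 1
y = 1.0 * x + 0.1 * rng.normal(size=n)  # true slope = 1
slope_true_x = (x @ y) / (x @ x)                      # ~ 1.0
slope_noisy_x = (x_tilde @ y) / (x_tilde @ x_tilde)   # ~ 0.5: attenuated toward zero
print(slope_true_x, slope_noisy_x)
```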

Inconsistency of the RLS estimator.

Before presenting a concrete numerical example showing that the RLS estimator is inconsistent and leads to bad decision-making, we first apply the theory of measurement error models to give a heuristic argument for why the RLS estimator is inconsistent, even in the simplified situation where there is no action (i.e. set $a_\tau\equiv0$) and $\{(x_\tau,\epsilon_\tau,\eta_\tau)\}_{\tau\in[t]}$ are i.i.d.

From Section 3.3.2 in Carroll et al. (1995), given data $\{(\tilde{x}_\tau,r_\tau)\}_{\tau\in[t]}$ from (B.1) with multiple covariates (in the simplified case with no action as described above), the OLS estimator of $\theta^*_0$, denoted $\hat{\theta}_0^{\mathrm{OLS},t}$, consistently estimates not $\theta^*_0$ but

$$\tilde{\theta}^*_0=(\Sigma_x+\Sigma_e)^{-1}\Sigma_x\theta^*_0, \tag{B.2}$$

where $\Sigma_x=\mathrm{Var}(x_\tau)$ and $\Sigma_e=\mathrm{Var}(\epsilon_\tau)$. In addition, given fixed $\lambda$, the regularized least squares (RLS) estimator satisfies

$$\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)=\Big(\frac{\lambda}{t}I+\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top\Big)^{-1}\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau r_\tau=\Big(\frac{\lambda}{t}I+\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top\Big)^{-1}\Big(\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top\Big)\hat{\theta}_0^{\mathrm{OLS},t},$$

where $\frac{1}{t}\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top\to\mathrm{Var}(\tilde{x}_\tau)=\mathrm{Var}(x_\tau)+\mathrm{Var}(\epsilon_\tau)$ and $\lambda/t\to0$ as $t\to\infty$. This means that as $t\to\infty$, $\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)$ and $\hat{\theta}_0^{\mathrm{OLS},t}$ converge to the same limit, which is $\tilde{\theta}^*_0$. Thus, for fixed $\lambda$, as $t$ grows, $\|\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)-\theta^*_0\|_2$ converges to $\|\tilde{\theta}^*_0-\theta^*_0\|_2$, which is not zero in general.

Finally, recall that in classical bandit algorithms such as UCB, sublinear regret relies on the key property that with high probability, for all $t$, $\|\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)-\theta^*_0\|_{V_t}\le\beta$, where $V_t=\lambda I+\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top$ and $\beta=\tilde{\mathcal{O}}(\sqrt{d})$. Here for a vector $v\in\mathbb{R}^d$ and a positive definite matrix $M\in\mathbb{R}^{d\times d}$, $\|v\|_M:=\sqrt{v^\top Mv}$. We argue that this requirement generally no longer holds in the setting with measurement errors $\{\epsilon_\tau\}_{\tau\ge1}$. In fact, since $\tilde{x}_\tau$ is i.i.d. in this simplified setting and $V_t=\lambda I+\sum_{\tau\in[t]}\tilde{x}_\tau\tilde{x}_\tau^\top$, we expect $\frac{1}{t}V_t$ to concentrate around $\mathrm{Var}(\tilde{x}_t)$. As long as $\lambda_{\min}(\mathrm{Var}(\tilde{x}_t))>0$, with high probability, for all $t$, $\frac{1}{t}V_t\succeq cI$ for some constant $c$. If this holds,

$$\|\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)-\theta^*_0\|_{V_t}\ge\sqrt{ct}\,\|\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)-\theta^*_0\|_2,$$

while the last term $\|\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)-\theta^*_0\|_2$ converges to a nonzero limit $\|\tilde{\theta}^*_0-\theta^*_0\|_2$. This indicates that in general, $\|\hat{\theta}_0^{\mathrm{RLS},t}(\lambda)-\theta^*_0\|_{V_t}$ scales at a rate of at least $\sqrt{t}$, and will not be uniformly bounded by $\tilde{\mathcal{O}}(\sqrt{d})$ for all $t$.

An example. The following is an example where, given the errors $\{\epsilon_\tau\}_{\tau\in[t]}$, the RLS estimator used in classical bandit algorithms inconsistently estimates the true reward model and, in addition, leads to bad decision-making (linear regret) for the classic bandit algorithms.

Example B.1. Consider the standard setting described in Section 2.1. Let $T=10000$, $d=2$, $\theta^*_1=(1,0)$, $\theta^*_0=(-1,0)$. Let $x_t$ be sampled i.i.d. from $\mathrm{Unif}(\mathcal{S})$, where $\mathcal{S}=\{(1,3),(-3,1),(-1,-3),(3,-1)\}$. Conditional on $x_t$, $\epsilon_t$ is uniformly sampled from $\{(\rho_0 x_t[1],\rho_0 x_t[1]),(-\rho_0 x_t[1],-\rho_0 x_t[1])\}$, independent of any other variable in the history. Here $\rho_0=0.9$, and $x_t[1]$ denotes the first entry of $x_t$. We also let $\eta_t$ be drawn i.i.d. from $N(0,0.01)$.

We conduct 100 independent experiments. For each experiment, we generate data as described above and test the performance of UCB (Algorithm 1 in Chu et al. (2011)) and Thompson sampling with normal priors (Russo et al., 2018) using the noisy context $\tilde{x}_t$ instead of the true context. We choose the regularization parameter $\lambda=1$ in the RLS estimator. Additionally, in UCB (Chu et al., 2011), we choose the parameter $\alpha=1$. Figure 3 summarizes the estimation error of the RLS estimator and the cumulative regret of both algorithms as functions of time $t$, showing both the average and the standard error across the random experiments. We see that the RLS estimator is unable to estimate the true reward model well. Moreover, the regret of both UCB and Thompson sampling is clearly linear in the time horizon. Intuitively, this is because the direction of $\tilde{\theta}^*_0$ in (B.2) is twisted relative to $\theta^*_0$, which not only leads to inconsistent estimators but can also alter the optimal action.

Finally, we note that in this setup, Assumption 2.3 is satisfied. This demonstrates that even if the errors $\{\epsilon_t\}_{t\ge1}$ do not affect the optimal action given $\{\theta^*_a\}_{a\in\{0,1\}}$, the poor performance of the RLS estimator may still lead to linear regret in classical bandit algorithms.

B. 2. Inconsistency of the naive measurement error adjusted estimator (2.4)

Example B.2. Let $d=1$, $x_\tau\equiv1$ for all $\tau$, and $\epsilon_\tau\sim\mathrm{Unif}(-2,2)$ sampled independently. For the reward model, let $\theta^*_0=-1$, $\theta^*_1=1$, and $\eta_\tau\sim\mathrm{Unif}(-0.1,0.1)$ sampled independently. So in order to maximize expected reward, we should choose action 1 if $x_\tau$ is positive and action 0 otherwise.

Suppose the agent takes the following policy that is stationary and non-adaptive to history:

$$\pi_\tau(A)=\begin{cases}\tfrac23\mathbb{1}\{A=1\}+\tfrac13\mathbb{1}\{A=0\}, & \text{if } \tilde{x}_\tau>\rho,\\ \tfrac13\mathbb{1}\{A=1\}+\tfrac23\mathbb{1}\{A=0\}, & \text{otherwise.}\end{cases}$$

Here, $\rho$ is a pre-specified constant. Figure 4 (a) plots the mean and standard deviation of $\hat{\theta}_{0,\mathrm{me}}(t)$ (as in (2.4)) over 100 independent experiments for each $t=1,\ldots,10000$, where $\rho\in\{-0.5,0,0.5\}$. Observe that as $t$ grows, $\hat{\theta}_{0,\mathrm{me}}(t)$ converges to different limits for different policies. In general, the limit is not equal to $\theta^*_0=-1$.
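A sketch reproducing the comparison in Example B.2 is given below; the Monte Carlo details and function name are our assumptions. Here `naive` implements (2.4) and `weighted` implements (2.5) for this one-dimensional example with $\pi^{nd}\equiv1/2$.

```python
import numpy as np

def simulate(t=10000, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.ones(t)                                      # x_tau = 1
    eps = rng.uniform(-2, 2, size=t)
    x_tilde = x + eps
    var_e = 4.0 / 3.0                                   # Var(Unif(-2, 2))
    p1 = np.where(x_tilde > rho, 2.0 / 3.0, 1.0 / 3.0)  # P(A = 1 | x_tilde)
    a = rng.binomial(1, p1)
    theta = {0: -1.0, 1: 1.0}
    r = np.where(a == 1, theta[1], theta[0]) * x + rng.uniform(-0.1, 0.1, size=t)
    keep = (a == 0)
    prob_taken = np.where(a == 1, p1, 1 - p1)
    naive = (x_tilde[keep] @ r[keep]) / (x_tilde[keep] @ x_tilde[keep] - keep.sum() * var_e)
    w = (0.5 / prob_taken) * keep                       # importance weights with pi_nd = 1/2
    weighted = (w * x_tilde) @ r / ((w * x_tilde) @ x_tilde - 0.5 * t * var_e)
    return naive, weighted                              # weighted should be close to theta_0 = -1
```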

Figure 3: Estimation error of the RLS estimator and cumulative regret of UCB (Chu et al., 2011) and Thompson sampling (Russo et al., 2018) under contextual error in Example B.1. The red and pink lines correspond to Thompson sampling and UCB, respectively. The solid lines indicate the mean values, while the shaded bands represent the standard error across the independent experiments.

In contrast, Figure 4 (b) shows the mean and standard deviation of $\hat{\theta}_0(t)$ (as in (2.5)) over 100 independent experiments under the same setting, with the same three policies as in Figure 4 (a). Unlike the naive estimator (2.4), the proposed estimator (2.5) quickly converges to values around the true value $-1$ for all three candidate policies.

C. Generalization to K2 actions

In this section, we assume that $\mathcal{A}=\{1,2,\ldots,K\}$ instead of $\{0,1\}$. The standard and clipped benchmarks become

$$\pi^*_t(a)=\begin{cases}1, & \text{if } a=a^*_t=\arg\max_{a'}\langle\theta^*_{a'},x_t\rangle,\\ 0, & \text{otherwise,}\end{cases} \tag{C.1}$$

and

$$\bar{\pi}^*_t(a)=\begin{cases}1-(K-1)p_0, & \text{if } a=a^*_t,\\ p_0, & \text{otherwise.}\end{cases} \tag{C.2}$$

Figure 4: Estimated value of $\theta^*_0$ given the naive estimator (2.4) in (a) and our proposed estimator (2.5) in (b) under different policies, over 100 independent experiments. The green, blue, and red lines correspond to the policy with parameter $\rho=-0.5$, $0$, and $0.5$, respectively. The solid lines indicate the mean values, while the shaded bands represent the standard deviation across the independent experiments.

In the $K$-arm setting, we can still estimate $\theta^*_a$ using (2.5) for each $a\in\mathcal{A}$. Using the same proof ideas as Theorem 2.1, we get the following guarantee on the estimation error (proof omitted):

Theorem C.1. For any $t\in[T]$, let $q_t=\inf_{\tau\in[t],a\in\mathcal{A}}\pi_\tau(a\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})$. Then under Assumptions 2.1 and 2.2, there exist constants $C$ and $C_1$ such that as long as $q_t\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 t},\ \frac{\xi(d+\log t)}{\lambda_0^2 t}\right\}$, with probability at least $1-4Kt^{-2}$,

$$\|\hat{\theta}_a(t)-\theta^*_a\|_2\le\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\},\qquad\forall a\in\mathcal{A}.$$

MEB with $K$ actions is shown in Algorithm 4. As in Theorem 2.2, we can control the regret of Algorithm 4 by the estimation error of (2.5). Here, Assumption 2.3 needs to be generalized as follows to accommodate multiple actions:

Assumption C.1. There exists a constant $\rho\in(0,1)$ such that $\forall t\le T$ and $\forall a_1,a_2\in\mathcal{A}$, $|\langle\theta^*_{a_1}-\theta^*_{a_2},\epsilon_t\rangle|\le\rho\,|\langle\theta^*_{a_1}-\theta^*_{a_2},x_t\rangle|$ almost surely.

The theorem below is a generalization of Theorem 2.2 to multiple actions (the proof is only slightly different from that of Theorem 2.2; we briefly discuss the difference in Appendix G.1).

Theorem C.2. Let Assumptions 2.1 and C.1 hold.


  1. For the standard setting, Algorithm 4 outputs a policy with $\mathrm{Regret}(T;\pi^*)$ no more than
    $$2T_0R_\theta+\frac{2}{1-\rho}\sum_{t=T_0+1}^{T}\Big((K-1)p_0(t)R_\theta+\max_{a\in\mathcal{A}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2\Big).$$
  2. For the clipped policy setting, Algorithm 4 with the choice of $p_0(t)\equiv p_0$ outputs a policy with $\mathrm{Regret}(T;\bar{\pi}^*)$ no more than
    $$2T_0R_\theta+\frac{2(1-Kp_0)}{1-\rho}\sum_{t=T_0+1}^{T}\max_{a\in\mathcal{A}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2.$$

Combining Theorems C.1 and C.2, we obtain the following corollary.

Corollary C.1. Let Assumptions 2.1, 2.2 and C.1 hold. There exist universal constants $C,C'$ such that:

  1. For the standard setting, if $T\ge C\max\{(1+1/\lambda_0)^{9/4}(d+\log T)^3,\ (\xi/\lambda_0)^{9/4}(d+\log T)^{4/3}\}$, then with probability at least $1-\frac{8K}{T}$, Algorithm 4 with the choice of $T_0=2dT^{2/3}$ and $p_0(t)=\min\{\tfrac12,t^{-1/3}\}$ outputs a policy with $\mathrm{Regret}(T;\pi^*)$ no more than
    $$C'dT^{2/3}\left(\frac{(K-1)R_\theta}{1-\rho}+\frac{(\nu+\xi+1)(R+R_\theta)}{(1-\rho)\lambda_0}\sqrt{1+\frac{\log T}{d}}\right).$$
  2. For the clipped policy setting, if $T\ge C\max\{(d+\log T)^2/(\lambda_0 p_0)^2,\ (\xi^2/\lambda_0^4)(1+\log T/d)^2\}$, then with probability at least $1-\frac{8K}{T}$, Algorithm 4 with the choice of $T_0=2\sqrt{dT}$ and $p_0(t)\equiv p_0$ outputs a policy with $\mathrm{Regret}(T;\bar{\pi}^*)$ no more than
    $$C'd\sqrt{T}\left(R_\theta+\frac{(\nu+\xi+1)(1-Kp_0)(R+R_\theta)}{p_0(1-\rho)\lambda_0}\sqrt{1+\frac{\log T}{d}}\right).$$

D. Analysis of the proposed estimator (2.5)

D. 1. Proof of Theorem 2.1

We fix some $t\in[T]$ and control $\|\hat{\theta}_a(t)-\theta^*_a\|_2$ for $a\in\{0,1\}$. Towards this goal, we combine the analysis of the two random terms $\hat{\Sigma}_{\tilde{x},a}(t)$ and $\hat{\Sigma}_{\tilde{x},r,a}(t)$ in the lemma below.

Lemma D.1. Under the same assumptions as Theorem 2.1, there exists an absolute constant $C$ such that with probability at least $1-4/t^2$, both of the following hold:

$$\Big\|\hat{\Sigma}_{\tilde{x},a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\big(x_\tau x_\tau^\top+\Sigma_{e,\tau}\big)\Big\|_2\le C\max\left\{\frac{d+\log t}{q_t t},\ \frac{\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}, \tag{D.1}$$
$$\Big\|\hat{\Sigma}_{\tilde{x},r,a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,x_\tau x_\tau^\top\theta^*_a\Big\|_2\le CR\max\left\{\frac{d+\log t}{q_t t},\ \frac{\nu}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}. \tag{D.2}$$

Proof of Lemma D.1 is in Section G.2.

Denote $\Delta_1=\hat{\Sigma}_{\tilde{x},a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\big(x_\tau x_\tau^\top+\Sigma_{e,\tau}\big)$ and $\Delta_2=\hat{\Sigma}_{\tilde{x},r,a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,x_\tau x_\tau^\top\theta^*_a$. Then

$$\hat{\theta}_a(t)=\Big(\hat{\Sigma}_{\tilde{x},a}(t)-\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\Sigma_{e,\tau}\Big)^{-1}\hat{\Sigma}_{\tilde{x},r,a}(t)=\Big(\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,x_\tau x_\tau^\top+\Delta_1\Big)^{-1}\Big(\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,x_\tau x_\tau^\top\theta^*_a+\Delta_2\Big)=\theta^*_a-J_1+J_2,$$

where

$$J_1:=\Big(\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,x_\tau x_\tau^\top+\Delta_1\Big)^{-1}\Delta_1\theta^*_a,\qquad J_2:=\Big(\frac{1}{t}\sum_{\tau\in[t]}\pi^{nd}_\tau(a)\,x_\tau x_\tau^\top+\Delta_1\Big)^{-1}\Delta_2.$$

It’s easy to verify that under the event where both (D.1) and (D.2) hold, whenever

$$C\max\left\{\frac{d+\log t}{q_t t},\ \frac{\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}\le\frac{\lambda_0}{2d}, \tag{D.3}$$

we have

$$\|J_1\|_2\le\frac{2CR_\theta d}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ \frac{\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}$$

and

$$\|J_2\|_2\le\frac{2CRd}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ \frac{\nu}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}.$$

(D.3) can be ensured by $t\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 q_t},\ \frac{\xi(d+\log t)}{\lambda_0^2 q_t}\right\}$, where $C_1=\max\{2C,4C^2\}$. Given these guarantees, we have with probability at least $1-4/t^2$,

$$\|\hat{\theta}_a(t)-\theta^*_a\|_2\le\|J_1\|_2+\|J_2\|_2\le\frac{2C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}.$$

Thus we conclude the proof.

D. 2. Additional comments on Assumption 2.2

Following Remark 2.1, all of the following examples of $x_\tau$ allow the existence of $\lambda_0$ with $\pi^{nd}_\tau(a)\equiv1/2$, given any reasonably large $t$, with high probability:

  • $\{\sqrt{d}\,x_\tau\}$ is an i.i.d. sequence satisfying $d\,\mathbb{E}[x_\tau x_\tau^\top]\succeq\lambda_1 I_d$ for some $\lambda_1>0$;

  • $\{\sqrt{d}\,x_\tau\}$ is a weakly-dependent stationary time series (a common example is the multivariate ARMA process under regularity conditions; see e.g. Banna et al. (2016)), whose stationary distribution $P$ satisfies $d\,\mathbb{E}_{x\sim P}[xx^\top]\succeq\lambda_2 I_d$ for some $\lambda_2>0$;

  • $\{\sqrt{d}\,x_\tau\}$ is a periodic time series such that there exists $t_0\in\mathbb{N}^+$ which satisfies $\frac{d}{t_0}\sum_{\tau\in(kt_0,(k+1)t_0]}x_\tau x_\tau^\top\succeq\lambda_3 I_d$ a.s. for all $k\in\mathbb{N}$.

D. 3. Generalization to off-policy method-of-moment estimation

(2.5) can be generalized to a class of method-of-moment estimators for off-policy learning. In this section, we delve into the general framework of off-policy method-of-moment estimation. This framework proves valuable in scenarios where a fully parametric model class for the reward is unavailable, yet there is a desire to estimate certain model parameters using offline batched bandit data.

For simplicity, we assume that $\{(X_t,\{Y_t(a):a\in\mathcal{A}\})\}_{t\in[T]}$ are drawn i.i.d. from an unknown distribution $\mathcal{P}$. At each time $t\in[T]$, the action $A_t$ is drawn from a policy $\pi_t(\cdot\mid X_t,\mathcal{H}_{t-1})$, and the agent observes only $o_t=(X_t,A_t,Y_t(A_t))$ together with the action selection probabilities $\pi_t$. Define the history up to time $t$ as $\mathcal{H}_t=\{o_\tau\}_{\tau\le t}$. For $a_0\in\mathcal{A}$, we are interested in estimating $\theta^*_{a_0}$, a $d$-dimensional parameter of $\mathcal{P}_{a_0}$, the joint distribution of $(X_t,Y_t(a_0))$.

Remark D.1. When the context is i.i.d., the problem of estimating $\{\theta^*_a\}_{a\in\mathcal{A}}$ in Section 2.2 is a special case of this setup, obtained by taking $\tilde{x}_t$ as $X_t$ and $r_t$ as $Y_t$.

The traditional method-of-moments estimator looks for functions $f_1,\ldots,f_d$ as well as a mapping $\phi:\mathbb{R}^d\to\mathbb{R}^d$ such that

$$\theta^*_{a_0}=\phi\Big(\mathbb{E}_{(X,Y)\sim\mathcal{P}_{a_0}}[f_1(X,Y)],\ \ldots,\ \mathbb{E}_{(X,Y)\sim\mathcal{P}_{a_0}}[f_d(X,Y)]\Big).$$

Then, given i.i.d. samples $\{(U_t,V_t)\}_{t\in[n]}$ from $\mathcal{P}_{a_0}$, the estimator takes the form

$$\hat{\theta}_{a_0}=\phi\Big(\frac{1}{n}\sum_{t\in[n]}f_1(U_t,V_t),\ \ldots,\ \frac{1}{n}\sum_{t\in[n]}f_d(U_t,V_t)\Big).$$

In fact, the naive estimator (2.4) is of this form. It is clear that we cannot use this estimator with the offline batched data $\mathcal{H}_T$: there are no i.i.d. samples from $\mathcal{P}_{a_0}$ because of the policy $\{\pi_t\}_{t\in[T]}$. Instead, we propose the following estimator:

$$\hat{\theta}_{a_0}=\phi\Big(\frac{1}{T}\sum_{t\in[T]}W_t f_1(X_t,Y_t),\ \ldots,\ \frac{1}{T}\sum_{t\in[T]}W_t f_d(X_t,Y_t)\Big),$$

where $W_t=\mathbb{1}\{A_t=a_0\}\,\frac{\pi^{nd}(A_t)}{\pi_t(A_t\mid X_t,\mathcal{H}_{t-1})}$ for a data-independent probability distribution $\pi^{nd}$ on $\mathcal{A}$. Similar to the proof of Theorem 2.1, it is not difficult to see that $\hat{\theta}_{a_0}$ is consistent under mild conditions. In fact, (2.5) is a special case of this estimator when $\pi^{nd}_\tau$ does not depend on $\tau$. A more detailed analysis is left for future work.
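A generic rendering of this weighted method-of-moments construction might look as follows; the function names and interface are ours, and the sketch assumes the stated i.i.d. setup.

```python
import numpy as np

def weighted_mom(X, Y, A, probs, a0, pi_nd, fs, phi):
    """Off-policy method-of-moments: phi applied to importance-weighted sample moments.

    fs:  list of functions f_j(x, y) defining the moment conditions
    phi: mapping from the vector of moments to the parameter estimate
    """
    W = (A == a0) * (pi_nd[a0] / probs)   # W_t = 1{A_t = a0} * pi_nd(a0) / pi_t(A_t | ...)
    moments = [np.mean(W * np.array([f(x, y) for x, y in zip(X, Y)])) for f in fs]
    return phi(np.array(moments))
```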

E Analysis of MEB

E. 1. Additional comments on Example 2.1

In Example 2.1, the joint distribution of $(x_t,\tilde{x}_t)$ is:

$$P\big((x_t,\tilde{x}_t)=(0.2,1)\big)=0.3,\quad P\big((x_t,\tilde{x}_t)=(0.2,-1)\big)=0.2,\quad P\big((x_t,\tilde{x}_t)=(-0.2,1)\big)=0.2,\quad P\big((x_t,\tilde{x}_t)=(-0.2,-1)\big)=0.3.$$

The optimal action given $\{\theta^*_a\}_{a\in\{0,1\}}$ and $x_t$ is

$$a^*_t(x_t)=\arg\max_a\langle\theta^*_a,x_t\rangle=\begin{cases}1, & \text{if } x_t=0.2,\\ 0, & \text{if } x_t=-0.2.\end{cases}$$

In the standard bandit setting, the benchmark policy is

$$\pi^*_t(a)=\begin{cases}1, & \text{if } a=a^*_t(x_t),\\ 0, & \text{otherwise.}\end{cases}$$

For any policy $\pi=\{\pi_t\}_{t\ge1}$, the instantaneous regret at time $t$ is

$$\mathrm{Regret}_t(\pi_t,\pi^*_t)=0.4\,\mathbb{E}_{a_t\sim\pi_t(\cdot\mid\tilde{x}_t,\mathcal{H}_{t-1})}\big[\mathbb{1}\{a_t\ne a^*_t(x_t)\}\big].$$

Even if both $\{\theta^*_a\}_{a\in\{0,1\}}$ and the joint distribution of $(x_t,\tilde{x}_t)$ are known, $\pi_t$ can only depend on $\tilde{x}_t$ and the history, and cannot be based on $x_t$. Thus, there is always a constant positive probability that the action $a_t$ sampled from $\pi_t$ does not match $a^*_t(x_t)$ (otherwise, $a_t$ sampled from $\pi_t(\cdot\mid\tilde{x}_t,\mathcal{H}_{t-1})$ would equal $a^*_t(x_t)$ a.s.). Thus, the standard cumulative regret is linear in the time horizon.

E. 2. Proof of Theorem 2.2

We first prove the lemma below. Its proof is in Appendix G.3.

Lemma E.1. Under Assumption 2.3, we have $a^*_t=a^\dagger_t:=\arg\max_a\langle\theta^*_a,\tilde{x}_t\rangle$. Consequently, $\pi^*_t=\pi^\dagger_t$ and $\bar{\pi}^*_t=\bar{\pi}^\dagger_t$ (given a fixed minimum action selection probability $p_0$), where

$$\pi^\dagger_t(a)=\begin{cases}1, & \text{if } a=a^\dagger_t,\\ 0, & \text{otherwise,}\end{cases}\qquad \bar{\pi}^\dagger_t(a)=\begin{cases}1-p_0, & \text{if } a=a^\dagger_t,\\ p_0, & \text{otherwise.}\end{cases}$$

In the following, we define

$$\widehat{\mathrm{Regret}}_t(\pi,\pi^*):=\mathbb{E}_{a\sim\pi^*_t}\langle\theta^*_a,\tilde{x}_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta^*_a,\tilde{x}_t\rangle,\qquad \widehat{\mathrm{Regret}}_t(\pi,\bar{\pi}^*):=\mathbb{E}_{a\sim\bar{\pi}^*_t}\langle\theta^*_a,\tilde{x}_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta^*_a,\tilde{x}_t\rangle.$$

Standard setting. In the standard setting, we give the lemma below (proof in Appendix G.4).

Lemma E.2. Under the assumptions of Theorem 2.2, at any time $t>T_0$,

$$\widehat{\mathrm{Regret}}_t(\pi,\pi^*)\le 2p_0(t)R_\theta+2\max_{a\in\{0,1\}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2.$$

Note that for any time $t>T_0$, the instantaneous regret at time $t$ is $\mathrm{Regret}_t(\pi,\pi^*)=(\pi^*_t(1)-\pi_t(1))\langle\theta^*_1-\theta^*_0,x_t\rangle$, and that $\widehat{\mathrm{Regret}}_t(\pi,\pi^*)=(\pi^*_t(1)-\pi_t(1))\langle\theta^*_1-\theta^*_0,\tilde{x}_t\rangle$. Moreover,

$$|\langle\theta^*_1-\theta^*_0,\tilde{x}_t\rangle|=|\langle\theta^*_1-\theta^*_0,x_t\rangle+\langle\theta^*_1-\theta^*_0,\epsilon_t\rangle|\ge|\langle\theta^*_1-\theta^*_0,x_t\rangle|-|\langle\theta^*_1-\theta^*_0,\epsilon_t\rangle|\ge\frac{1}{\rho}|\langle\theta^*_1-\theta^*_0,\epsilon_t\rangle|-|\langle\theta^*_1-\theta^*_0,\epsilon_t\rangle|=\frac{1-\rho}{\rho}|\langle\theta^*_1-\theta^*_0,\epsilon_t\rangle|.$$

Here we used Assumption 2.3. Thus we have

$$\mathrm{Regret}_t(\pi,\pi^*)=\widehat{\mathrm{Regret}}_t(\pi,\pi^*)-(\pi^*_t(1)-\pi_t(1))\langle\theta^*_1-\theta^*_0,\epsilon_t\rangle\le\widehat{\mathrm{Regret}}_t(\pi,\pi^*)+(\pi^*_t(1)-\pi_t(1))\frac{\rho}{1-\rho}\langle\theta^*_1-\theta^*_0,\tilde{x}_t\rangle=\frac{1}{1-\rho}\widehat{\mathrm{Regret}}_t(\pi,\pi^*).$$

Combining with Lemma E.2, we obtain that for any $t>T_0$,

$$\mathrm{Regret}_t(\pi,\pi^*)\le\frac{2}{1-\rho}\Big(p_0(t)R_\theta+\max_{a\in\{0,1\}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2\Big).$$

Finally, when $t\le T_0$, since $\|x_t\|_2\le1$ and $\|\theta^*_a\|_2\le R_\theta$, the instantaneous regret satisfies $\mathrm{Regret}_t(\pi,\pi^*)\le 2R_\theta$. We conclude the proof by summing up all the instantaneous regret terms.

Clipped policy setting. In the clipped policy setting, we give the lemma below (proof in Appendix G.5).

Lemma E.3. Under the assumptions of Theorem 2.2, at any time $t>T_0$,

$$\widehat{\mathrm{Regret}}_t(\pi,\bar{\pi}^*)\le 2(1-2p_0)\max_{a\in\{0,1\}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2.$$

Note that the instantaneous regret at time $t$ is $\mathrm{Regret}_t(\pi,\bar{\pi}^*)=(\bar{\pi}^*_t(1)-\pi_t(1))\langle\theta^*_1-\theta^*_0,x_t\rangle$, and that $\widehat{\mathrm{Regret}}_t(\pi,\bar{\pi}^*)=(\bar{\pi}^*_t(1)-\pi_t(1))\langle\theta^*_1-\theta^*_0,\tilde{x}_t\rangle$. Similar to the standard setting, for $t>T_0$, under Assumption 2.3 we have

$$\mathrm{Regret}_t(\pi,\bar{\pi}^*)\le\frac{1}{1-\rho}\widehat{\mathrm{Regret}}_t(\pi,\bar{\pi}^*)\le\frac{2(1-2p_0)}{1-\rho}\max_{a\in\{0,1\}}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2.$$

We conclude the proof by summing up all the instantaneous regret terms, and noticing that for $t\le T_0$, $\mathrm{Regret}_t(\pi,\bar{\pi}^*)\le 2R_\theta$.

Results with a high-probability version of Assumption 2.3. As briefly mentioned in the main paper, Assumption 2.3 can be weakened to the inequalities holding with high probability. Instead of Assumption 2.3, we now assume the following:

Assumption E.1. There exist constants $\rho\in(0,1)$ and $c_e\in[0,1]$ such that $\sum_{t=1}^{T}P(A_t^c)\le c_e$. Here $A_t$ denotes the event $\{|\langle\delta_\theta,\epsilon_t\rangle|\le\rho|\langle\delta_\theta,x_t\rangle|\}$, and $\delta_\theta=\theta^*_1-\theta^*_0$.

It is easy to see that the result of Lemma E.1 holds at time $t$ under the event $A_t$. Further, following the same arguments, we obtain that under Assumption 2.1, the results for either the standard setting or the clipped policy setting hold under the event $\cap_{t=T_0+1}^{T}A_t$. Therefore, with Assumptions 2.1 and E.1, in either setting, the results of Theorem 2.2 hold with probability at least $1-c_e$.

E. 3. Proof of Corollary 2.1

Standard setting. First, notice that $q_t=\min_{\tau\le t,a\in\{0,1\}}\pi_\tau(a\mid\tilde{x}_\tau,\mathcal{H}_{\tau-1})=p_0(t)$, since $p_0(t)$ is monotonically decreasing in $t$. Theorem 2.1 indicates that, as long as $t>T_0$ and

$$p_0(t)\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 t},\ \frac{\xi(d+\log t)}{\lambda_0^2 t}\right\}, \tag{E.1}$$

then with probability at least $1-8t^{-2}$, for all $a\in\{0,1\}$,

$$\|\hat{\theta}_a(t)-\theta^*_a\|_2\le\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{q_t t}}\right\}. \tag{E.2}$$

Plugging this into Theorem 2.2, we have that with high probability,

$$\mathrm{Regret}(T;\pi^*)\le 2R_\theta\cdot 2dT^{2/3}+\frac{2}{1-\rho}I_1,$$

where

$$\begin{aligned} I_1 &= \sum_{t=T_0+1}^{T}\Big(p_0(t)R_\theta+\max_{a}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2\Big)\\ &\le \sum_{t=T_0+1}^{T}t^{-1/3}R_\theta+\sum_{t=T_0}^{T-1}\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{t^{2/3}},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{t^{2/3}}}\right\}\\ &\le 2R_\theta T^{2/3}+\frac{C(R+R_\theta)d}{\lambda_0}\sum_{t=T_0}^{T-1}\frac{d+\log t}{t^{2/3}}+\frac{C(R+R_\theta)d}{\lambda_0}\sum_{t=T_0}^{T-1}\frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{t^{2/3}}}\\ &\le 2R_\theta T^{2/3}+\frac{3C(R+R_\theta)}{\lambda_0}d(d+\log T)T^{1/3}+\frac{3C(\nu+\xi)(R+R_\theta)}{2\lambda_0}\sqrt{d(d+\log T)}\,T^{2/3}\\ &\le 2R_\theta T^{2/3}+\frac{C'}{\lambda_0}(\nu+\xi+1)(R+R_\theta)\sqrt{d(d+\log T)}\,T^{2/3}, \end{aligned}$$

for a universal constant $C'$, where the last inequality holds if, in addition, $T\ge[d(d+\log T)]^{3/2}$.

The proof is concluded by combining the above requirement on $T$ with (E.1).

Clipped policy setting.

Similar to the standard setting, according to Theorem 2.1, as long as $t>T_0$ and

$$p_0\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 t},\ \frac{\xi(d+\log t)}{\lambda_0^2 t}\right\}, \tag{E.3}$$

then with probability at least $1-8t^{-2}$, for all $a\in\{0,1\}$,

$$\|\hat{\theta}_a(t)-\theta^*_a\|_2\le\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{p_0 t},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{p_0 t}}\right\}. \tag{E.4}$$

Plugging this into Theorem 2.2, we have that with high probability,

$$\mathrm{Regret}(T;\bar{\pi}^*)\le 2R_\theta\cdot 2\sqrt{d}\,T^{1/2}+\frac{2(1-2p_0)}{1-\rho}I_2,$$

where

$$\begin{aligned} I_2 &= \sum_{t=T_0+1}^{T}\max_{a}\|\hat{\theta}_a(t-1)-\theta^*_a\|_2\\ &\le \sum_{t=T_0}^{T-1}\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{p_0 t},\ \frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{p_0 t}}\right\}\\ &\le \frac{C(R+R_\theta)d}{\lambda_0}\sum_{t=T_0}^{T-1}\frac{d+\log t}{p_0 t}+\frac{C(R+R_\theta)d}{\lambda_0}\sum_{t=T_0}^{T-1}\frac{\nu+\xi}{\sqrt{d}}\sqrt{\frac{d+\log t}{p_0 t}}\\ &\le \frac{2C(R+R_\theta)}{\lambda_0 p_0}d(d+\log T)\log T+\frac{2C(R+R_\theta)(\nu+\xi)}{\lambda_0\sqrt{p_0}}\sqrt{d(d+\log T)T}\\ &\le \frac{2C(R+R_\theta)(\nu+\xi+1)}{\lambda_0\sqrt{p_0}}\sqrt{d(d+\log T)T}, \end{aligned}$$

where the last inequality holds if, in addition, $T\ge d(d+\log T)\log^2 T$.

The proof is concluded by plugging the above into the regret upper bound formula and combining the requirements for T.

Results with a high-probability version of Assumption 2.3. Recall that Assumption E.1 is a weakened version of Assumption 2.3 with a high-probability statement. Given Assumptions 2.1, 2.2, and E.1 (instead of 2.3), in the standard setting, the conclusion holds as long as (E.2) holds for all $t>T_0$ and $a\in\{0,1\}$ and the event $\cap_{t=T_0+1}^{T}A_t$ occurs. Thus, we deduce that the regret upper bound in (i) holds with probability at least $1-16/T-c_e$. Similarly, given Assumptions 2.1, 2.2, and E.1, in the clipped benchmark setting, the regret upper bound in (ii) holds (using (E.4)) with probability at least $1-16/T-c_e$.

E. 4. MEB with infrequent model update

As mentioned at the end of Section 2.3, in certain scenarios (e.g. when $d$ is large), we can save computational resources by updating the estimates of $\{\theta^*_a\}_{a\in\{0,1\}}$ less frequently. In Algorithm 5, we propose a variant of Algorithm 1. At each time $t$, given the noisy context $\tilde{x}_t$, the algorithm computes the best action $\tilde{a}_t$ according to the most recently updated estimators of $\{\theta^*_a\}_{a\in\{0,1\}}$. Then, it samples $\tilde{a}_t$ with probability $1-p_0(t)$ and keeps an exploration probability of $p_0(t)$ for the other action. Meanwhile, the agent only has to update the estimates of $\{\theta^*_a\}_{a\in\{0,1\}}$ once in a while to save computational power: the algorithm specifies a subset $\mathcal{S}\subseteq[T]$ and updates the estimators according to (2.5) only when $t\in\mathcal{S}$.


Under mild conditions, Algorithm 5 achieves the same order of regret upper bound as Algorithm 1, as seen from Theorem E. 1 and Corollary E. 1 below. They are modified versions of Theorem 2.2 and Corollary 2.1.

Theorem E.1. Let $s_{\min}:=\min_{s\in\mathcal{S}}s$ be the first time Algorithm 5 updates the model. Suppose Assumptions 2.1 and 2.3 hold.

  1. For the standard setting, for any $T_0\ge s_{\min}$, Algorithm 5 outputs a policy such that
    $$\text{Regret}(T;\pi^*)\ \le\ 2T_0R_\theta+\frac{2}{1-\rho}\sum_{t\in(T_0,T]}\Big(p_0(t)R_\theta+\max_{a\in\{0,1\}}\big\|\hat\theta_a(s_t)-\theta_a^*\big\|_2\Big).$$
  2. For the clipped policy setting, for any $T_0\ge s_{\min}$, Algorithm 5 with the choice $p_0(t)\equiv p_0$ outputs a policy such that
    $$\text{Regret}(T;\pi^*)\ \le\ 2T_0R_\theta+\frac{2(1-2p_0)}{1-\rho}\sum_{t\in(T_0,T]}\max_{a\in\{0,1\}}\big\|\hat\theta_a(s_t)-\theta_a^*\big\|_2.$$

Here, for any $t\in[T]$, $s_t:=\max\{s\in\mathcal{S}:s<t\}$.

The proof of Theorem E.1 is very similar to that of Theorem 2.2, and is thus omitted.

Corollary E.1. Let Assumptions 2.1 to 2.3 hold. There exist constants $C$ and $C'$ such that:

  1. In the standard setting, as long as the set of model update times $\mathcal{S}$ satisfies (a) $s_{\min}\le\sqrt{d}\,T^{2/3}$ and (b) $s_t=\max\{s\in\mathcal{S}:s<t\}\ge\alpha t$ for all $t\in(\sqrt{d}\,T^{2/3},T]$ and some constant $\alpha\in(e^{-d},1)$, then for any $T\ge\frac{C}{\alpha}\max\big\{(1+1/\lambda_0)^{9/4}(d+\log T)^{3},\ (\xi/\lambda_0)^{9/4}(d+\log T)^{4/3}\big\}$, with probability at least $1-16/T$, Algorithm 5 with the choice $p_0(t)=\min\{\tfrac12,t^{-1/3}\}$ achieves $$\text{Regret}(T;\pi^*)\ \le\ C'dT^{2/3}\left(\frac{R_\theta}{\alpha}+\frac{R_\theta}{1-\rho}+\frac{(\nu+\xi+1)(R+R_\theta)}{\alpha^{1/3}\lambda_0(1-\rho)}\sqrt{1+\frac{\log T}{d}}\right).$$

  2. In the clipped policy setting, as long as the set of model update times $\mathcal{S}$ satisfies (a) $s_{\min}\le\sqrt{dT}$ and (b) $s_t=\max\{s\in\mathcal{S}:s<t\}\ge\alpha t$ for all $t\in(\sqrt{dT},T]$ and some constant $\alpha\in(e^{-d},1)$, then for any $T$ s.t. $T\ge\frac{C}{\alpha}\max\big\{(d+\log T)^2/(\lambda_0p_0)^2,\ \xi^2/\lambda_0^4\,(1+\log T/d)^2\big\}$, with probability at least $1-16/T$, Algorithm 5 with the choice $p_0(t)\equiv p_0$ achieves $$\text{Regret}(T;\pi^*)\ \le\ C'd\sqrt{T}\left(\frac{R_\theta}{\alpha}+\frac{(\nu+\xi+1)(1-2p_0)(R+R_\theta)}{\lambda_0(1-\rho)\sqrt{\alpha p_0}}\sqrt{1+\frac{\log T}{d}}\right).$$

The proof of Corollary E.1 can be obtained directly by combining Theorems 2.1 and E.1 with $T_0=\sqrt{d}\,T^{2/3}/\alpha$ in the standard setting and $T_0=\sqrt{d}\,T^{1/2}/\alpha$ in the clipped benchmark setting. Thus, the proof is omitted here.

In Corollary E.1, conditions (a) and (b) essentially require Algorithm 5 not to start learning the model too late, and to keep updating the learned model at least at time points with a 'geometric' growth rate. This covers a wide range of choices of $\mathcal{S}$ in practice. Two typical examples of $\mathcal{S}$ are: (1) $\mathcal{S}=\{t\in[T]:t=kt_0\text{ for some }k\in\mathbb{N}^+\}$ (the model is refit every $t_0$ time points routinely, where $t_0$ is a constant integer); (2) if $1/\alpha\in\mathbb{N}^+$, $\mathcal{S}=\{t\in[T]:t=(1/\alpha)^k\text{ for some }k\in\mathbb{N}^+\}$ (the model only needs to be refit $\mathcal{O}(\log T)$ times to save computation). A short sketch of these two schedules is given below.
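The following snippet constructs the two example schedules; the horizon, period $t_0$, and $\alpha=1/2$ are hypothetical values chosen only for illustration.

```python
# Illustrative construction of the two update schedules S discussed above.
T = 10_000

# (1) periodic schedule: refit every t0 rounds
t0 = 50
S_periodic = {t for t in range(1, T + 1) if t % t0 == 0}

# (2) geometric schedule with alpha = 1/2: refit at times 2, 4, 8, 16, ...
#     only O(log T) refits, and s_t = max{s in S : s < t} >= alpha * t
inv_alpha = 2
S_geometric = set()
t = inv_alpha
while t <= T:
    S_geometric.add(t)
    t *= inv_alpha

print(len(S_periodic), len(S_geometric))   # 200 refits vs. 13 refits
```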

F Analysis with estimated error variance

We consider the setting where at each time $t$, the agent does not have access to $\Sigma_{e,t}$, but has a (potentially adaptive) estimator $\hat\Sigma_{e,t}$. In this setting, we estimate the model using (2.7) instead of (2.5) and plug it into Algorithm 1. The following theorem controls the estimation error of $\tilde\theta_a(t)$. Note that, compared to Theorem 2.1, the additional error caused by the inaccuracy of $\hat\Sigma_{e,t}$ can be controlled by $\Delta_t(a):=\frac{1}{t}\sum_{\tau\in[t]}\pi_\tau^{nd}(a)\big(\hat\Sigma_{e,\tau}-\Sigma_{e,\tau}\big)$, the weighted average of the estimation errors $\{\hat\Sigma_{e,\tau}-\Sigma_{e,\tau}\}_{\tau\in[t]}$.

Theorem F.1. Recall that $q_t:=\inf_{\tau\le t,\,a\in\{0,1\}}\pi_\tau(a\mid\tilde x_\tau,\mathcal{H}_{\tau-1})$. Then under Assumptions 2.1 and 2.2, there exist constants $C$ and $C_1$ such that, as long as $q_t\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 t},\ \frac{\xi(d+\log t)}{\lambda_0^2 t}\right\}$ and $\max_{a\in\{0,1\}}\|\Delta_t(a)\|_2\le\frac{\lambda_0}{4d}$, with probability at least $1-8/t^2$,

$$\|\tilde\theta_a(t)-\theta_a^*\|_2\ \le\ \frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{q_t t},\ (\xi+\nu)\sqrt{\frac{d+\log t}{d\,q_t t}}\right\}+\frac{2R_\theta d}{\lambda_0}\|\Delta_t(a)\|_2,\qquad\forall a\in\{0,1\}.\tag{F.1}$$

The proof of Theorem F.1 is in Appendix G.6.
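For concreteness, below is a minimal numpy sketch of the plug-in estimator in the form it takes in the decomposition at the start of Appendix G.6, i.e. $\tilde\theta_a(t)=\big(\hat\Sigma_{\tilde x,a}(t)-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)\hat\Sigma_{e,\tau}\big)^{-1}\hat\Sigma_{\tilde x,r,a}(t)$, which we assume is what (2.7) denotes; the variable names and the tiny ridge safeguard are ours, not the authors'.

```python
# Sketch of the measurement-error-adjusted estimator with estimated noise
# covariances, assuming it matches the form shown in Appendix G.6.
import numpy as np

def theta_tilde(x_tilde, rewards, actions, probs, pi_nd, Sigma_e_hat, a, ridge=1e-8):
    """Estimate theta_a^* from logged data.

    x_tilde     : (t, d) noisy contexts
    rewards     : (t,)   observed rewards
    actions     : (t,)   integer actions taken (0/1)
    probs       : (t,)   probabilities pi_tau(A_tau | x_tilde_tau) of the taken actions
    pi_nd       : (t, 2) data-independent reference probabilities pi_tau^nd(.)
    Sigma_e_hat : (t, d, d) estimated context-noise covariances Sigma_hat_{e,tau}
    """
    t, d = x_tilde.shape
    w = pi_nd[np.arange(t), actions] / probs           # importance weights
    sel = (actions == a).astype(float) * w             # weight * indicator{A_tau = a}

    Sigma_xx = (x_tilde.T * sel) @ x_tilde / t         # Sigma_hat_{x~,a}(t)
    Sigma_xr = (x_tilde.T * sel) @ rewards / t         # Sigma_hat_{x~,r,a}(t)
    noise_corr = np.einsum('t,tij->ij', pi_nd[:, a], Sigma_e_hat) / t

    A = Sigma_xx - noise_corr + ridge * np.eye(d)      # subtract estimated context noise
    return np.linalg.solve(A, Sigma_xr)
```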

By combining Theorems F.1 and 2.2 (with (2.7) instead of (2.5)), we obtain the following regret bounds for Algorithm 1 with (2.7) as the plug-in estimator.

Corollary F.1. Suppose Assumptions 2.1 to 2.3 hold. Then there exist universal constants $C$ and $C'$ such that:

  1. In the standard setting, if $\max_{t\in[\sqrt{d}\,T^{2/3},\,T]}\max_{a\in\{0,1\}}\|\Delta_t(a)\|_2\le\frac{\lambda_0}{4d}$ and $T\ge C\max\big\{(1+1/\lambda_0)^{9/4}(d+\log T)^{3},\ (\xi/\lambda_0)^{9/4}(d+\log T)^{4/3}\big\}$, then by choosing $T_0=2\sqrt{d}\,T^{2/3}$, $p_0(t)=\min\{\tfrac12,t^{-1/3}\}$, and (2.7) instead of (2.5) in Algorithm 1, with probability at least $1-16/T$,
    $$\text{Regret}(T;\pi^*)\ \le\ C'dT^{2/3}\left(\frac{R_\theta}{1-\rho}+\frac{(\nu+\xi+1)(R+R_\theta)}{(1-\rho)\lambda_0}\sqrt{1+\frac{\log T}{d}}\right)+\frac{4R_\theta d}{(1-\rho)\lambda_0}\sum_{t=T_0-1}^{T}\max_{a\in\{0,1\}}\|\Delta_t(a)\|_2.$$
  2. In the clipped policy setting, as long as $\max_{t\in[\sqrt{dT},\,T]}\max_{a\in\{0,1\}}\|\Delta_t(a)\|_2\le\frac{\lambda_0}{4d}$ and $T\ge C\max\big\{(d+\log T)^2/(\lambda_0p_0)^2,\ \xi^2/\lambda_0^4\,(1+\log T/d)^2\big\}$, then by choosing $T_0=2\sqrt{dT}$, $p_0(t)\equiv p_0$, and (2.7) instead of (2.5) in Algorithm 1, with probability at least $1-16/T$,
    $$\text{Regret}(T;\pi^*)\ \le\ C'dT^{1/2}\left(R_\theta+\frac{(\nu+\xi+1)(1-2p_0)(R+R_\theta)}{\sqrt{p_0}(1-\rho)\lambda_0}\sqrt{1+\frac{\log T}{d}}\right)+\frac{4(1-2p_0)R_\theta d}{(1-\rho)\lambda_0}\sum_{t=T_0-1}^{T}\max_{a\in\{0,1\}}\|\Delta_t(a)\|_2.$$

The proof is in Appendix G.7.

G Additional proofs

G.1. Proof of Theorem C.2

The proof of Theorem C.2 is very similar to that of Theorem 2.2; we only need to note the difference in Lemmas E.2 and E.3 (for the standard setting and the clipped policy setting, respectively), as stated below. Recall that

$$\widehat{\text{Regret}}_t(\pi,\pi^*)\ :=\ \mathbb{E}_{a\sim\pi_t^*}\langle\theta_a^*,\tilde x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta_a^*,\tilde x_t\rangle.$$

Standard setting. At any time $t>T_0$, we have

$$\begin{aligned}
\widehat{\text{Regret}}_t(\pi,\pi^*)&=\mathbb{E}_{a\sim\pi_t^*}\langle\theta_a^*,\tilde x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta_a^*,\tilde x_t\rangle\\
&=\langle\theta_{a_t^*}^*,\tilde x_t\rangle-\Big[\big(1-(K-1)p_0(t)\big)\langle\theta_{\tilde a_t}^*,\tilde x_t\rangle+\sum_{a\ne\tilde a_t}p_0(t)\langle\theta_a^*,\tilde x_t\rangle\Big]\\
&=p_0(t)\sum_{a\ne a_t^*}\langle\theta_{a_t^*}^*-\theta_a^*,\tilde x_t\rangle+\mathbb{1}\{a_t^*\ne\tilde a_t\}\big(1-Kp_0(t)\big)\langle\theta_{a_t^*}^*-\theta_{\tilde a_t}^*,\tilde x_t\rangle\\
&\le 2(K-1)p_0(t)R_\theta+\mathbb{1}\{a_t^*\ne\tilde a_t\}\langle\theta_{a_t^*}^*-\theta_{\tilde a_t}^*,\tilde x_t\rangle.
\end{aligned}\tag{G.1}$$

Here note that Lemma E.1 still holds under Assumption C.1, so $a_t^*=\arg\max_a\langle\theta_a^*,x_t\rangle=\arg\max_a\langle\theta_a^*,\tilde x_t\rangle$ and $\tilde a_t=\arg\max_a\langle\hat\theta_a(t-1),\tilde x_t\rangle$.

Note that $a_t^*\ne\tilde a_t$ implies that

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ge\langle\theta_{\tilde a_t}^*,\tilde x_t\rangle\qquad\text{and}\qquad\langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle\le\langle\hat\theta_{\tilde a_t}(t-1),\tilde x_t\rangle,$$

which leads to

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ \ge\ \langle\theta_{\tilde a_t}^*,\tilde x_t\rangle\ \ge\ \langle\hat\theta_{\tilde a_t}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\theta_{a_t^*}^*,\tilde x_t\rangle-2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2,$$

and further implies $\langle\theta_{a_t^*}^*-\theta_{\tilde a_t}^*,\tilde x_t\rangle\le2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2$. Plugging the above into (G.1) leads to

$$\widehat{\text{Regret}}_t(\pi,\pi^*)\ \le\ 2(K-1)p_0(t)R_\theta+2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2.$$

The rest of the proof can be done in the same way as the proof of Theorem 2.2.

Clipped policy setting. At any time $t>T_0$, we have

$$\widehat{\text{Regret}}_t(\pi,\pi^*)=\mathbb{E}_{a\sim\pi_t^*}\langle\theta_a^*,\tilde x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta_a^*,\tilde x_t\rangle=\big(1-Kp_0\big)\mathbb{1}\{a_t^*\ne\tilde a_t\}\langle\theta_{a_t^*}^*-\theta_{\tilde a_t}^*,\tilde x_t\rangle.\tag{G.2}$$

Here recall that $a_t^*=\arg\max_a\langle\theta_a^*,\tilde x_t\rangle$ and $\tilde a_t:=\arg\max_a\langle\hat\theta_a(t-1),\tilde x_t\rangle$.

Note that $a_t^*\ne\tilde a_t$ implies that

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ge\langle\theta_{\tilde a_t}^*,\tilde x_t\rangle\qquad\text{and}\qquad\langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle\le\langle\hat\theta_{\tilde a_t}(t-1),\tilde x_t\rangle,$$

which leads to

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ \ge\ \langle\theta_{\tilde a_t}^*,\tilde x_t\rangle\ \ge\ \langle\hat\theta_{\tilde a_t}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\theta_{a_t^*}^*,\tilde x_t\rangle-2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2,$$

and further implies $\langle\theta_{a_t^*}^*-\theta_{\tilde a_t}^*,\tilde x_t\rangle\le2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2$.

Plugging the above into (G.2) leads to

$$\widehat{\text{Regret}}_t(\pi,\pi^*)\ \le\ 2\big(1-Kp_0\big)\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2$$

for $t>T_0$. The rest of the proof can be done in the same way as the proof of Theorem 2.2.

G.2. Proof of Lemma D.1

We first analyze $\hat\Sigma_{\tilde x,a}(t)$. Notice that $\hat\Sigma_{\tilde x,a}(t)=\frac1t\sum_{\tau\in[t]}V_{\tau,a}$, where $V_{\tau,a}=\frac{\pi_\tau^{nd}(A_\tau)}{\pi_\tau(A_\tau\mid\tilde x_\tau,\mathcal H_{\tau-1})}\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau\tilde x_\tau^\top$. For any fixed $u\in\mathcal{S}^{d-1}:=\{u\in\mathbb{R}^d:\|u\|_2=1\}$, $v_{u,\tau,a}:=u^\top\big(V_{\tau,a}-\mathbb{E}[V_{\tau,a}\mid\mathcal H_{\tau-1}]\big)u$ is a martingale difference sequence. Moreover, we can verify that $|v_{u,\tau,a}|\le\frac{2}{q_t}$ and

$$\begin{aligned}
\mathrm{Var}\big(v_{u,\tau,a}\mid\mathcal H_{\tau-1}\big)&\le\mathbb{E}\big[(u^\top V_{\tau,a}u)^2\mid\mathcal H_{\tau-1}\big]=\mathbb{E}\left[\Big(\frac{\pi_\tau^{nd}(A_\tau)}{\pi_\tau(A_\tau\mid\tilde x_\tau,\mathcal H_{\tau-1})}\Big)^2\mathbb{1}\{A_\tau=a\}\big(u^\top\tilde x_\tau\tilde x_\tau^\top u\big)^2\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau,\,A_\tau\sim\pi_\tau^{nd}(\cdot)}\left[\frac{\pi_\tau^{nd}(A_\tau)}{\pi_\tau(A_\tau\mid\tilde x_\tau,\mathcal H_{\tau-1})}\mathbb{1}\{A_\tau=a\}\big(u^\top\tilde x_\tau\tilde x_\tau^\top u\big)^2\ \Big|\ \mathcal H_{\tau-1}\right]\le\frac{1}{q_t}\mathbb{E}\big[\langle u,\tilde x_\tau\rangle^4\mid\mathcal H_{\tau-1}\big]\le\frac{\xi}{d^2q_t}.
\end{aligned}$$

According to Freedman's Inequality (Freedman, 1975), for any $\gamma_1,\gamma_2>0$,

$$P\left(\sum_{\tau\in[t]}v_{u,\tau,a}\ge\gamma_1,\ \sum_{\tau\in[t]}\mathrm{Var}\big(v_{u,\tau,a}\mid\mathcal H_{\tau-1}\big)\le\gamma_2\right)\ \le\ \exp\left(-\frac{\gamma_1^2}{2\big(\tfrac{2}{q_t}\gamma_1+\gamma_2\big)}\right).$$

Set $\gamma_2=\frac{\xi t}{d^2q_t}$, and we obtain $P\big(\sum_{\tau\in[t]}v_{u,\tau,a}\ge\gamma_1\big)\le\exp\big(-\frac{d^2q_t\gamma_1^2}{2(2d^2\gamma_1+\xi t)}\big)$. Applying the same analysis to $\{-v_{u,\tau,a}\}_\tau$ and combining the results gives $P\big(\big|\sum_{\tau\in[t]}v_{u,\tau,a}\big|\ge\gamma_1\big)\le2\exp\big(-\frac{d^2q_t\gamma_1^2}{2(2d^2\gamma_1+\xi t)}\big)$.

Denote $M_t=\frac1t\sum_{\tau\in[t]}\big(V_{\tau,a}-\mathbb{E}[V_{\tau,a}\mid\mathcal H_{\tau-1}]\big)$; then the above means that for any $u\in\mathcal S^{d-1}$,

$$P\left(|u^\top M_tu|\ge\frac{\gamma_1}{t}\right)\ \le\ 2\exp\left(-\frac{d^2q_t\gamma_1^2}{2(2d^2\gamma_1+\xi t)}\right).\tag{G.3}$$

Let $\mathcal N$ be a $\frac14$-net of $\mathcal S^{d-1}$ with $|\mathcal N|\le9^d$. For any $u\in\mathcal S^{d-1}$, find $u'\in\mathcal N$ s.t. $\|u-u'\|_2\le\frac14$, and we have

$$|u^\top M_tu-u'^\top M_tu'|\ \le\ |u^\top M_t(u-u')|+|(u-u')^\top M_tu'|\ \le\ \frac12\|M_t\|_2.$$

This implies that

$$\|M_t\|_2=\sup_{u\in\mathcal S^{d-1}}|u^\top M_tu|\ \le\ \sup_{u'\in\mathcal N}|u'^\top M_tu'|+\frac12\|M_t\|_2,$$

and thus $\sup_{u'\in\mathcal N}|u'^\top M_tu'|\ge\frac12\|M_t\|_2$. Combining the above and (G.3), we obtain that for any $\gamma_1>0$,

$$P\left(\|M_t\|_2\ge\frac{2\gamma_1}{t}\right)\ \le\ P\left(\sup_{u'\in\mathcal N}|u'^\top M_tu'|\ge\frac{\gamma_1}{t}\right)\ \le\ 9^d\,P\left(|u^\top M_tu|\ge\frac{\gamma_1}{t}\right)=2\cdot9^d\exp\left(-\frac{d^2q_t\gamma_1^2}{2(2d^2\gamma_1+\xi t)}\right).$$

By choosing $\gamma_1=24\max\left\{\frac{d+\log t}{q_t},\ \frac1d\sqrt{\frac{\xi t(d+\log t)}{q_t}}\right\}$, and noticing that $M_t=\hat\Sigma_{\tilde x,a}(t)-\frac1t\sum_{\tau\in[t]}\mathbb{E}[V_{\tau,a}\mid\mathcal H_{\tau-1}]$, we have

$$P\left(\Big\|\hat\Sigma_{\tilde x,a}(t)-\frac1t\sum_{\tau\in[t]}\mathbb{E}[V_{\tau,a}\mid\mathcal H_{\tau-1}]\Big\|_2\ge48\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\xi(d+\log t)}{d^2q_tt}}\right\}\right)\ \le\ \frac{2}{t^2}.\tag{G.4}$$

At the same time, we have

$$\begin{aligned}
\mathbb{E}[V_{\tau,a}\mid\mathcal H_{\tau-1}]&=\mathbb{E}_{\epsilon_\tau}\left[\mathbb{E}_{A_\tau\sim\pi_\tau(\cdot\mid\tilde x_\tau,\mathcal H_{\tau-1})}\left[\frac{\pi_\tau^{nd}(A_\tau)}{\pi_\tau(A_\tau\mid\tilde x_\tau,\mathcal H_{\tau-1})}\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau\tilde x_\tau^\top\ \Big|\ \epsilon_\tau,\mathcal H_{\tau-1}\right]\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau}\left[\mathbb{E}_{A_\tau\sim\pi_\tau^{nd}(\cdot)}\left[\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau\tilde x_\tau^\top\ \big|\ \epsilon_\tau,\mathcal H_{\tau-1}\right]\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau}\left[\pi_\tau^{nd}(a)\,\tilde x_\tau\tilde x_\tau^\top\mid\mathcal H_{\tau-1}\right]\\
&=\pi_\tau^{nd}(a)\big(x_\tau x_\tau^\top+\Sigma_{e,\tau}\big).
\end{aligned}\tag{G.5}$$

Here we have used the facts that (i) $\{\pi_\tau^{nd}\}_\tau$ is data-independent; (ii) $\mathbb{E}[\epsilon_\tau\mid\mathcal H_{\tau-1}]=0$ and $\mathrm{Var}(\epsilon_\tau\mid\mathcal H_{\tau-1})=\Sigma_{e,\tau}$. Plugging (G.5) into (G.4), we get

$$P\left(\Big\|\hat\Sigma_{\tilde x,a}(t)-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)\big(x_\tau x_\tau^\top+\Sigma_{e,\tau}\big)\Big\|_2\ge48\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\xi(d+\log t)}{d^2q_tt}}\right\}\right)\ \le\ \frac{2}{t^2}.\tag{G.6}$$

The analysis for $\hat\Sigma_{\tilde x,r,a}(t)$ is similar. Write $\hat\Sigma_{\tilde x,r,a}(t)=\frac1t\sum_{\tau\in[t]}Z_{\tau,a}$, where $Z_{\tau,a}:=\frac{\pi_\tau^{nd}(A_\tau)}{\pi_\tau(A_\tau\mid\tilde x_\tau,\mathcal H_{\tau-1})}\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau r_\tau$. Then for any $u\in\mathcal S^{d-1}$, it is easy to verify that $\big|\big(Z_{\tau,a}-\mathbb{E}[Z_{\tau,a}\mid\mathcal H_{\tau-1}]\big)^\top u\big|\le\frac{2R}{q_t}$ and $\mathrm{Var}\big(\big(Z_{\tau,a}-\mathbb{E}[Z_{\tau,a}\mid\mathcal H_{\tau-1}]\big)^\top u\mid\mathcal H_{\tau-1}\big)\le\mathbb{E}\big[(Z_{\tau,a}^\top u)^2\mid\mathcal H_{\tau-1}\big]\le\frac{\nu R^2}{q_td}$. Applying Freedman's Inequality leads to

$$P\left(|z_t^\top u|\ge\frac{\gamma_1}{t}\right)\ \le\ 2\exp\left(-\frac{dq_t\gamma_1^2}{4Rd\gamma_1+2\nu R^2t}\right),\tag{G.7}$$

where $z_t:=\frac1t\sum_{\tau\in[t]}\big(Z_{\tau,a}-\mathbb{E}[Z_{\tau,a}\mid\mathcal H_{\tau-1}]\big)$.

Recall that $\mathcal N$ is a $\frac14$-net of $\mathcal S^{d-1}$ with $|\mathcal N|\le9^d$. For any $u\in\mathcal S^{d-1}$, find $u'\in\mathcal N$ s.t. $\|u-u'\|_2\le1/4$; then $|z_t^\top u-z_t^\top u'|\le\frac14\|z_t\|_2$, and thus

$$\|z_t\|_2=\sup_{u\in\mathcal S^{d-1}}|z_t^\top u|\ \le\ \sup_{u'\in\mathcal N}|z_t^\top u'|+\frac14\|z_t\|_2,$$

which implies that $\sup_{u'\in\mathcal N}|z_t^\top u'|\ge\frac34\|z_t\|_2$. Taking this and (G.7) into account, we derive that

$$P\left(\|z_t\|_2\ge\frac43\cdot\frac{\gamma_1}{t}\right)\ \le\ P\left(\sup_{u'\in\mathcal N}|z_t^\top u'|\ge\frac{\gamma_1}{t}\right)\ \le\ 9^d\cdot2\exp\left(-\frac{dq_t\gamma_1^2}{4Rd\gamma_1+2\nu R^2t}\right).$$

By choosing $\gamma_1=24R\max\left\{\frac{d+\log t}{q_t},\ \sqrt{\frac{\nu(d+\log t)t}{dq_t}}\right\}$ and noticing that $z_t=\hat\Sigma_{\tilde x,r,a}(t)-\frac1t\sum_{\tau\in[t]}\mathbb{E}[Z_{\tau,a}\mid\mathcal H_{\tau-1}]$, we obtain

$$P\left(\Big\|\hat\Sigma_{\tilde x,r,a}(t)-\frac1t\sum_{\tau\in[t]}\mathbb{E}[Z_{\tau,a}\mid\mathcal H_{\tau-1}]\Big\|_2\ge32R\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\nu(d+\log t)}{dq_tt}}\right\}\right)\ \le\ \frac{2}{t^2}.\tag{G.8}$$

Finally, because

$$\begin{aligned}
\mathbb{E}[Z_{\tau,a}\mid\mathcal H_{\tau-1}]&=\mathbb{E}_{\epsilon_\tau}\left[\mathbb{E}_{A_\tau\sim\pi_\tau(\cdot\mid\tilde x_\tau,\mathcal H_{\tau-1}),\,\eta_\tau}\left[\frac{\pi_\tau^{nd}(A_\tau)}{\pi_\tau(A_\tau\mid\tilde x_\tau,\mathcal H_{\tau-1})}\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau r_\tau\ \Big|\ \mathcal H_{\tau-1},\epsilon_\tau\right]\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau}\left[\mathbb{E}_{A_\tau\sim\pi_\tau^{nd}(\cdot),\,\eta_\tau}\left[\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau r_\tau\ \big|\ \mathcal H_{\tau-1},\epsilon_\tau\right]\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau}\left[\mathbb{E}_{A_\tau\sim\pi_\tau^{nd}(\cdot),\,\eta_\tau}\left[\mathbb{1}\{A_\tau=a\}\,\tilde x_\tau\big(x_\tau^\top\theta_a^*+\eta_\tau\big)\ \big|\ \mathcal H_{\tau-1},\epsilon_\tau\right]\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau}\left[\pi_\tau^{nd}(a)\,\mathbb{E}_{\eta_\tau}\left[\tilde x_\tau\big(x_\tau^\top\theta_a^*+\eta_\tau\big)\ \big|\ \mathcal H_{\tau-1},\epsilon_\tau\right]\ \Big|\ \mathcal H_{\tau-1}\right]\\
&=\mathbb{E}_{\epsilon_\tau}\left[\pi_\tau^{nd}(a)\,\tilde x_\tau x_\tau^\top\theta_a^*\mid\mathcal H_{\tau-1}\right]=\pi_\tau^{nd}(a)\,x_\tau x_\tau^\top\theta_a^*,
\end{aligned}$$

plugging this into (G.8), we obtain

$$P\left(\Big\|\hat\Sigma_{\tilde x,r,a}(t)-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top\theta_a^*\Big\|_2\ge32R\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\nu(d+\log t)}{dq_tt}}\right\}\right)\ \le\ \frac{2}{t^2}.\tag{G.9}$$

Combining (G.9) and (G.6), we conclude the proof.

G.3. Proof of Lemma E.1

We only need to prove

$$\mathrm{sign}\big(\langle\theta_1^*-\theta_0^*,x_t\rangle\big)=\mathrm{sign}\big(\langle\theta_1^*-\theta_0^*,\tilde x_t\rangle\big).\tag{G.10}$$

If $\langle\theta_1^*-\theta_0^*,x_t\rangle=0$, (G.10) is a direct consequence of Assumption 2.3. If $\langle\theta_1^*-\theta_0^*,x_t\rangle\ne0$, without loss of generality, suppose $\langle\theta_1^*-\theta_0^*,x_t\rangle>0$. Then according to Assumption 2.3,

$$\langle\theta_1^*-\theta_0^*,\tilde x_t\rangle=\langle\theta_1^*-\theta_0^*,x_t\rangle+\langle\theta_1^*-\theta_0^*,\epsilon_t\rangle\ \ge\ (1-\rho)\langle\theta_1^*-\theta_0^*,x_t\rangle>0.$$

Thus (G.10) is true.

G.4. Proof of Lemma E.2

We have

$$\begin{aligned}
\widehat{\text{Regret}}_t(\pi,\pi^*)&=\mathbb{E}_{a\sim\pi_t^*}\langle\theta_a^*,\tilde x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta_a^*,\tilde x_t\rangle\\
&=\langle\theta_{a_t^*}^*,\tilde x_t\rangle-\Big[\big(1-p_0(t)\big)\langle\theta_{\tilde a_t}^*,\tilde x_t\rangle+p_0(t)\langle\theta_{1-\tilde a_t}^*,\tilde x_t\rangle\Big]\\
&=\mathbb{1}\{a_t^*=\tilde a_t\}\,p_0(t)\langle\theta_{a_t^*}^*-\theta_{1-a_t^*}^*,\tilde x_t\rangle+\mathbb{1}\{a_t^*\ne\tilde a_t\}\big(1-p_0(t)\big)\langle\theta_{a_t^*}^*-\theta_{1-a_t^*}^*,\tilde x_t\rangle\\
&\le 2p_0(t)R_\theta+\mathbb{1}\{a_t^*\ne\tilde a_t\}\langle\theta_{a_t^*}^*-\theta_{1-a_t^*}^*,\tilde x_t\rangle.
\end{aligned}\tag{G.11}$$

Here recall that $a_t^*=\arg\max_a\langle\theta_a^*,x_t\rangle=\arg\max_a\langle\theta_a^*,\tilde x_t\rangle$ and $\tilde a_t:=\arg\max_{a\in\{0,1\}}\langle\hat\theta_a(t-1),\tilde x_t\rangle$.

Note that $a_t^*\ne\tilde a_t$ implies that

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ge\langle\theta_{1-a_t^*}^*,\tilde x_t\rangle\qquad\text{and}\qquad\langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle\le\langle\hat\theta_{1-a_t^*}(t-1),\tilde x_t\rangle,$$

which leads to

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ \ge\ \langle\theta_{1-a_t^*}^*,\tilde x_t\rangle\ \ge\ \langle\hat\theta_{1-a_t^*}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\theta_{a_t^*}^*,\tilde x_t\rangle-2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2,$$

and further implies $\langle\theta_{a_t^*}^*-\theta_{1-a_t^*}^*,\tilde x_t\rangle\le2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2$.

Plugging the above into (G.11) leads to

$$\widehat{\text{Regret}}_t(\pi,\pi^*)\ \le\ 2p_0(t)R_\theta+2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2.$$

G.5. Proof of Lemma E.3

At any time $t>T_0$, we have

$$\begin{aligned}
\widehat{\text{Regret}}_t(\pi,\pi^*)&=\mathbb{E}_{a\sim\pi_t^*}\langle\theta_a^*,\tilde x_t\rangle-\mathbb{E}_{a\sim\pi_t}\langle\theta_a^*,\tilde x_t\rangle\\
&=\Big[(1-p_0)\langle\theta_{a_t^*}^*,\tilde x_t\rangle+p_0\langle\theta_{1-a_t^*}^*,\tilde x_t\rangle\Big]-\Big[(1-p_0)\langle\theta_{\tilde a_t}^*,\tilde x_t\rangle+p_0\langle\theta_{1-\tilde a_t}^*,\tilde x_t\rangle\Big]\\
&=(1-2p_0)\mathbb{1}\{a_t^*\ne\tilde a_t\}\langle\theta_{a_t^*}^*-\theta_{1-a_t^*}^*,\tilde x_t\rangle.
\end{aligned}\tag{G.12}$$

Here recall that $a_t^*=\arg\max_a\langle\theta_a^*,x_t\rangle=\arg\max_a\langle\theta_a^*,\tilde x_t\rangle$ and $\tilde a_t:=\arg\max_{a\in\{0,1\}}\langle\hat\theta_a(t-1),\tilde x_t\rangle$.

Note that $a_t^*\ne\tilde a_t$ implies that

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ge\langle\theta_{1-a_t^*}^*,\tilde x_t\rangle\qquad\text{and}\qquad\langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle\le\langle\hat\theta_{1-a_t^*}(t-1),\tilde x_t\rangle,$$

which leads to

$$\langle\theta_{a_t^*}^*,\tilde x_t\rangle\ \ge\ \langle\theta_{1-a_t^*}^*,\tilde x_t\rangle\ \ge\ \langle\hat\theta_{1-a_t^*}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\hat\theta_{a_t^*}(t-1),\tilde x_t\rangle-\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2\ \ge\ \langle\theta_{a_t^*}^*,\tilde x_t\rangle-2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2,$$

and further implies $\langle\theta_{a_t^*}^*-\theta_{1-a_t^*}^*,\tilde x_t\rangle\le2\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2$.

Plugging the above into (G.12) leads to

$$\widehat{\text{Regret}}_t(\pi,\pi^*)\ \le\ 2(1-2p_0)\max_a\|\hat\theta_a(t-1)-\theta_a^*\|_2$$

G.6. Proof of Theorem F.1

Fix $t\in[T]$ such that the conditions of Theorem 2.3 hold, and fix $a\in\{0,1\}$. As in Appendix D, define $\Delta_1:=\hat\Sigma_{\tilde x,a}(t)-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)\big(x_\tau x_\tau^\top+\Sigma_{e,\tau}\big)$ and $\Delta_2:=\hat\Sigma_{\tilde x,r,a}(t)-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top\theta_a^*$. We also let $\Delta_3:=-\Delta_t(a)=-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)\big(\hat\Sigma_{e,\tau}-\Sigma_{e,\tau}\big)$. Recall Lemma D.1: with probability at least $1-4/t^2$,

$$\|\Delta_1\|_2\le C\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\xi(d+\log t)}{d^2q_tt}}\right\},\qquad\|\Delta_2\|_2\le CR\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\nu(d+\log t)}{dq_tt}}\right\}.$$

Meanwhile,

$$\begin{aligned}
\tilde\theta_a(t)&=\Big(\hat\Sigma_{\tilde x,a}(t)-\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)\hat\Sigma_{e,\tau}\Big)^{-1}\hat\Sigma_{\tilde x,r,a}(t)\\
&=\Big(\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top+\Delta_1+\Delta_3\Big)^{-1}\Big[\Big(\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top\Big)\theta_a^*+\Delta_2\Big]\\
&=\theta_a^*-J_1+J_2,
\end{aligned}$$

where

$$J_1:=\Big(\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top+\Delta_1+\Delta_3\Big)^{-1}(\Delta_1+\Delta_3)\theta_a^*,\qquad J_2:=\Big(\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top+\Delta_1+\Delta_3\Big)^{-1}\Delta_2.$$
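(As a quick sanity check of the decomposition above, write $A:=\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top+\Delta_1+\Delta_3$; then $\big(\frac1t\sum_{\tau\in[t]}\pi_\tau^{nd}(a)x_\tau x_\tau^\top\big)\theta_a^*+\Delta_2=A\theta_a^*-(\Delta_1+\Delta_3)\theta_a^*+\Delta_2$, so multiplying by $A^{-1}$ indeed gives $\tilde\theta_a(t)=\theta_a^*-J_1+J_2$.)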

Under the events where both (D.1) and (D.2) hold, whenever

$$C\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\xi(d+\log t)}{d^2q_tt}}\right\}\ \le\ \frac{\lambda_0}{4d}\tag{G.13}$$

and $\|\Delta_3\|_2\le\frac{\lambda_0}{4d}$, we have

$$\|J_1\|_2\ \le\ \frac{2dR_\theta}{\lambda_0}\left(C\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\xi(d+\log t)}{d^2q_tt}}\right\}+\|\Delta_3\|_2\right)$$

and

$$\|J_2\|_2\ \le\ \frac{2CdR}{\lambda_0}\max\left\{\frac{d+\log t}{q_tt},\ \sqrt{\frac{\nu(d+\log t)}{dq_tt}}\right\}.$$

(G.13) can be ensured by $t\ge C_1\max\left\{\frac{d(d+\log t)}{\lambda_0 q_t},\ \frac{\xi(d+\log t)}{\lambda_0^2q_t}\right\}$, where $C_1=\max\{4C,16C^2\}$. Given these guarantees, we have with probability at least $1-4/t^2$,

$$\|\tilde\theta_a(t)-\theta_a^*\|_2\ \le\ \|J_1\|_2+\|J_2\|_2\ \le\ \frac{2C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{q_tt},\ (\xi+\nu)\sqrt{\frac{d+\log t}{dq_tt}}\right\}+\frac{2R_\theta d}{\lambda_0}\|\Delta_t(a)\|_2.$$

Thus we conclude the proof.

G.7. Proof of Corollary F.1

Standard setting. Notice that $p_0(t)$ is monotonically decreasing in $t$. Theorem F.1 indicates that, as long as $t>T_0$ and

$$p_0(t)\ \ge\ C_1\max\left\{\frac{d(d+\log t)}{\lambda_0t},\ \frac{\xi(d+\log t)}{\lambda_0^2t}\right\},\tag{G.14}$$

then with probability at least $1-8/t^2$, for all $a\in\{0,1\}$,

$$\|\tilde\theta_a(t)-\theta_a^*\|_2\ \le\ \frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{t^{2/3}},\ (\nu+\xi)\sqrt{\frac{d+\log t}{d\,t^{2/3}}}\right\}+\frac{2R_\theta d}{\lambda_0}\|\Delta_t(a)\|_2.\tag{G.15}$$

Plugging this into Theorem 2.2, we have that with high probability,

$$\text{Regret}(T;\pi^*)\ \le\ 2R_\theta\cdot 2\sqrt{d}\,T^{2/3}+\frac{2}{1-\rho}I_1,$$

where

$$\begin{aligned}
I_1 &= \sum_{t=T_0+1}^{T}\Big(p_0(t)R_\theta+\max_a\|\tilde\theta_a(t-1)-\theta_a^*\|_2\Big)\\
&\le \sum_{t=T_0+1}^{T}t^{-1/3}R_\theta+\sum_{t=T_0}^{T-1}\left[\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{t^{2/3}},\ (\nu+\xi)\sqrt{\frac{d+\log t}{d\,t^{2/3}}}\right\}+\max_a\frac{2R_\theta d}{\lambda_0}\|\Delta_t(a)\|_2\right]\\
&\le 2R_\theta T^{2/3}+\frac{C(\nu+\xi+1)(R+R_\theta)}{\lambda_0}\sqrt{d(d+\log T)}\,T^{2/3}+\frac{2R_\theta d}{\lambda_0}\sum_{t\ge T_0}\max_a\|\Delta_t(a)\|_2,
\end{aligned}$$

for a universal constant $C$, where the last inequality holds if, in addition, $T\ge[d(d+\log T)]^{3/2}$.

The proof is concluded by combining the above requirement for $T$ with (G.14).

Clipped policy setting.

Similarly to the standard setting, according to Theorem F.1, as long as $t>T_0$ and

$$p_0\ \ge\ C_1\max\left\{\frac{d(d+\log t)}{\lambda_0t},\ \frac{\xi(d+\log t)}{\lambda_0^2t}\right\},\tag{G.16}$$

then with probability at least $1-8/t^2$, for all $a\in\{0,1\}$,

$$\|\tilde\theta_a(t)-\theta_a^*\|_2\ \le\ \frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{p_0t},\ (\nu+\xi)\sqrt{\frac{d+\log t}{d\,p_0t}}\right\}+\frac{2R_\theta d}{\lambda_0}\|\Delta_t(a)\|_2.\tag{G.17}$$

Plugging this into Theorem 2.2, we have that with high probability,

$$\text{Regret}(T;\pi^*)\ \le\ 2R_\theta\cdot 2\sqrt{d}\,T^{1/2}+\frac{2(1-2p_0)}{1-\rho}I_2,$$

where

$$\begin{aligned}
I_2 &= \sum_{t=T_0+1}^{T}\max_a\|\tilde\theta_a(t-1)-\theta_a^*\|_2\\
&\le \sum_{t=T_0}^{T-1}\left[\frac{C(R+R_\theta)d}{\lambda_0}\max\left\{\frac{d+\log t}{p_0t},\ (\nu+\xi)\sqrt{\frac{d+\log t}{d\,p_0t}}\right\}+\max_a\frac{2R_\theta d}{\lambda_0}\|\Delta_t(a)\|_2\right]\\
&\le \frac{2C(R+R_\theta)(\nu+\xi+1)}{\lambda_0\sqrt{p_0}}\sqrt{d(d+\log T)T}+\frac{2R_\theta d}{\lambda_0}\sum_{t=T_0}^{T-1}\max_a\|\Delta_t(a)\|_2,
\end{aligned}$$

where the last inequality holds if, in addition, $T\ge d(d+\log T)\log^2T$.

The proof is concluded by plugging the above into the regret upper bound formula and combining the requirements for T.

Footnotes

1

For simplicity, we state our results under the binary-action setting, which is common in healthcare (Trella et al., 2022), economics (Athey et al., 2017; Kitagawa and Tetenov, 2018) and other applications. However, all the results presented in this paper can be extended to the setting with multiple actions. See Appendix C.

2

Note that this violates the contextual bandit assumption and leads to an MDP. We believe this is a good setup for testing the robustness of our proposed approach.

References

  1. Abbasi-Yadkori Y, Pál D and Szepesvári C (2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems 24.
  2. Agrawal S and Goyal N (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning. PMLR.
  3. Athey S, Wager S et al. (2017). Efficient policy learning. Tech. rep.
  4. Auer P (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 397–422.
  5. Banna M, Merlevède F and Youssef P (2016). Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications 5 1650006.
  6. Battalio SL, Conroy DE, Dempsey W, Liao P, Menictas M, Murphy S, Nahum-Shani I, Qian T, Kumar S and Spring B (2021). Sense2stop: a micro-randomized trial using wearable sensors to optimize a just-in-time-adaptive stress management intervention for smoking relapse prevention. Contemporary Clinical Trials 109 106534.
  7. Bibaut A, Dimakopoulou M, Kallus N, Chambaz A and van der Laan M (2021). Post-contextual-bandit inference. Advances in Neural Information Processing Systems 34 28548–28559.
  8. Bouneffouf D (2020). Online learning with corrupted context: Corrupted contextual bandits. arXiv preprint arXiv:2006.15194.
  9. Bouneffouf D, Bouzeghoub A and Gançarski AL (2012). A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing: 19th International Conference, ICONIP 2012, Doha, Qatar, November 12–15, 2012, Proceedings, Part III 19. Springer.
  10. Carroll RJ, Ruppert D and Stefanski LA (1995). Measurement error in nonlinear models, vol. 105. CRC Press.
  11. Chu W, Li L, Reyzin L and Schapire R (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings.
  12. Cohen S, Mermelstein R, Kamarck T and Hoberman HM (1985). Measuring the functional components of social support. Social Support: Theory, Research and Applications 73–94.
  13. Dempsey W, Liao P, Klasnja P, Nahum-Shani I and Murphy SA (2015). Randomised trials for the Fitbit generation. Significance 12 20–23.
  14. Ding Q, Hsieh C-J and Sharpnack J (2022). Robust stochastic linear contextual bandits under adversarial attacks. In International Conference on Artificial Intelligence and Statistics. PMLR.
  15. Fan J and Yao Q (2017). The elements of financial econometrics. Cambridge University Press.
  16. Freedman DA (1975). On tail probabilities for martingales. The Annals of Probability 100–118.
  17. Fuller WA (2009). Measurement error models. John Wiley & Sons.
  18. Galozy A and Nowaczyk S (2023). Information-gathering in latent bandits. Knowledge-Based Systems 260 110099.
  19. Galozy A, Nowaczyk S and Ohlsson M (2020). A new bandit setting balancing information from state evolution and corrupted context. arXiv preprint arXiv:2011.07989.
  20. Hong J, Kveton B, Zaheer M, Chow Y, Ahmed A and Boutilier C (2020a). Latent bandits revisited. Advances in Neural Information Processing Systems 33 13423–13433.
  21. Hong J, Kveton B, Zaheer M, Chow Y, Ahmed A, Ghavamzadeh M and Boutilier C (2020b). Non-stationary latent bandits. arXiv preprint arXiv:2012.00386.
  22. Jose ST and Moothedath S (2024). Thompson sampling for stochastic bandits with noisy contexts: An information-theoretic regret analysis. arXiv preprint arXiv:2401.11565.
  23. Kirschner J and Krause A (2019). Stochastic bandits with context distributions. Advances in Neural Information Processing Systems 32.
  24. Kitagawa T and Tetenov A (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86 591–616.
  25. Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A and Murphy SA (2015). Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34 1220.
  26. Klasnja P, Smith S, Seewald NJ, Lee A, Hall K, Luers B, Hekler EB and Murphy SA (2018). Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine 53 573–582.
  27. Langford J and Zhang T (2007). The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems 20 96–1.
  28. Li L, Chu W, Langford J and Schapire RE (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.
  29. Liao P, Greenewald K, Klasnja P and Murphy S (2020). Personalized HeartSteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 1–22.
  30. Liao P, Klasnja P, Tewari A and Murphy SA (2016). Sample size calculations for micro-randomized trials in mHealth. Statistics in Medicine 35 1944–1971.
  31. Lin J, Lee XY, Jubery T, Moothedath S, Sarkar S and Ganapathysubramanian B (2022). Stochastic conservative contextual linear bandits. arXiv preprint arXiv:2203.15629.
  32. Lin J and Moothedath S (2022). Distributed stochastic bandit learning with context distributions. arXiv preprint arXiv:2207.14391.
  33. Liu Y-E, Mandel T, Brunskill E and Popovic Z (2014). Trading off scientific knowledge and user learning with multi-armed bandits. In EDM.
  34. Nelson E, Bhattacharjya D, Gao T, Liu M, Bouneffouf D and Poupart P (2022). Linearizing contextual bandits with latent state dynamics. In The 38th Conference on Uncertainty in Artificial Intelligence.
  35. Park H and Faradonbeh MKS (2022). Worst-case performance of greedy policies in bandits with imperfect context observations. In 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE.
  36. Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z et al. (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11 1–96.
  37. Sarker H, Hovsepian K, Chatterjee S, Nahum-Shani I, Murphy SA, Spring B, Ertin E, Al’Absi M, Nakajima M and Kumar S (2017). From markers to interventions: The case of just-in-time stress intervention. Springer.
  38. Sarker H, Tyburski M, Rahman MM, Hovsepian K, Sharmin M, Epstein DH, Preston KL, Furr-Holden CD, Milam A, Nahum-Shani I et al. (2016). Finding significant stress episodes in a discontinuous time series of rapidly varying mobile sensor data. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems.
  39. Sen R, Shanmugam K, Kocaoglu M, Dimakis A and Shakkottai S (2017). Contextual bandits with latent confounders: An NMF approach. In Artificial Intelligence and Statistics. PMLR.
  40. Shaikh H, Modiri A, Williams JJ and Rafferty AN (2019). Balancing student success and inferring personalized effects in dynamic experiments. In EDM.
  41. Trella AL, Zhang KW, Nahum-Shani I, Shetty V, Doshi-Velez F and Murphy SA (2022). Reward design for an online reinforcement learning algorithm supporting oral self-care. arXiv preprint arXiv:2208.07406.
  42. Wang Y-X, Agarwal A and Dudik M (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning. PMLR.
  43. Xu J and Wang Y-X (2022). Towards agnostic feature-based dynamic pricing: Linear policies vs linear valuation with unknown noise. In International Conference on Artificial Intelligence and Statistics. PMLR.
  44. Xu X, Xie H and Lui JC (2021). Generalized contextual bandits with latent features: Algorithms and applications. IEEE Transactions on Neural Networks and Learning Systems.
  45. Yang J, Eckles D, Dhillon P and Aral S (2020a). Targeting for long-term outcomes. arXiv preprint arXiv:2010.15835.
  46. Yang J and Ren S (2021a). Bandit learning with predicted context: Regret analysis and selective context query. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. IEEE.
  47. Yang J and Ren S (2021b). Robust bandit learning with imperfect context. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35.
  48. Yang L, Yang J and Ren S (2020b). Multi-feedback bandit learning with probabilistic contexts. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Main Track.
  49. Yao J, Brunskill E, Pan W, Murphy S and Doshi-Velez F (2021). Power constrained bandits. In Machine Learning for Healthcare Conference. PMLR.
  50. Yom-Tov E, Feraru G, Kozdoba M, Mannor S, Tennenholtz M and Hochberg I (2017). Encouraging physical activity in patients with diabetes: Intervention using a reinforcement learning system. Journal of Medical Internet Research 19 e338.
  51. Yun S-Y, Nam JH, Mo S and Shin J (2017). Contextual multi-armed bandits under feature uncertainty. arXiv preprint arXiv:1703.01347.
  52. Zhan R, Hadad V, Hirshberg DA and Athey S (2021). Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
  53. Zhang K, Janson L and Murphy S (2021). Statistical inference with M-estimators on adaptively collected data. Advances in Neural Information Processing Systems 34 7460–7471.
  54. Zhou L and Brunskill E (2016). Latent contextual bandits and their application to personalized recommendations for new users. arXiv preprint arXiv:1604.06743.
