Abstract
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation environments based on synthetic and real digital intervention datasets.
1. Introduction
Contextual bandits (Auer, 2002; Langford and Zhang, 2007) represent a classical sequential decision-making problem where an agent aims to maximize cumulative reward based on context information. At each round, the agent observes a context and must choose one of the available actions based on both the current context and previous observations. Once the agent selects an action, she observes the associated reward, which is then used to refine future decision-making. Contextual bandits are typical examples of reinforcement learning problems where a balance between exploring new actions and exploiting previously acquired information is necessary to achieve optimal long-term rewards. They have numerous real-world applications including personalized recommendation systems (Li et al., 2010; Bouneffouf et al., 2012), healthcare (Yom-Tov et al., 2017; Liao et al., 2020), and online education (Liu et al., 2014; Shaikh et al., 2019).
Despite the extensive existing literature on contextual bandits, in many real-world applications the agent never observes the context exactly. One common reason is that the true context for decision-making can only be detected or learned approximately from observable auxiliary data. For instance, consider the Sense2Stop mobile health study, in which the context is whether the individual is currently physiologically stressed (Battalio et al., 2021). A complex predictor of current stress was constructed and validated based on multiple prior studies (Cohen et al., 1985; Sarker et al., 2016, 2017). This predictor was then tuned to each user in Sense2Stop prior to their quit-smoking attempt; following the user's attempt to quit smoking, at each minute the predictor takes high-dimensional sensor data on the user as input and outputs a continuous likelihood of stress for use by the decision-making algorithm. In many such health-intervention applications, models using validated predictions as contexts are preferred to raw sensor data because of the high noise in these settings and because the resulting decision rules are interpretable and can therefore be critiqued by domain experts. The second reason the context is not observed exactly is measurement error. Contextual variables, such as user preferences in online advertising, content attributes in recommendation systems, and patient conditions in clinical trials, are prone to noisy measurement. This introduces an additional level of uncertainty that must be accounted for when making decisions.
Motivated by the above, we consider the linear contextual bandit problem where at each round, the agent only has access to a noisy observation of the true context. Moreover, the agent has limited knowledge about the underlying distribution of this noisy observation, as in many practical applications (e.g. the above-mentioned ones). This is especially the case when this ‘observation’ is the output of a complex machine learning algorithm. We only assume that the noisy observation is unbiased, its variance is known or can be estimated, and we put no other essential restrictions on its distribution. In healthcare applications, the estimated error variance for context variables can often be derived from data in prior studies (e.g. pilot studies). This setting is intrinsically difficult for two main reasons: First, when estimating the reward model, the agent needs to take into account the misalignment between the noisy context observation and the reward which depends on the true context. Second, even if the reward model is known, the agent may suffer from making bad decisions at each round because of the inaccurate context.
Our contributions.
We present the first online algorithm, MEB (Measurement Error Bandit), with sublinear regret in this setting under mild conditions. MEB achieves sublinear regret compared to a standard benchmark and compared to a clipped benchmark with a minimum exploration probability, a constraint that is common in many applications (Yang et al., 2020a; Yao et al., 2021). MEB is based on a novel approach to model estimation which removes the systematic bias caused by the noisy context observation. The estimator is inspired by the measurement error literature in statistics (Carroll et al., 1995; Fuller, 2009): we extend this classical method with additional tools for the online decision-making setting, where the policy depends on the measurement error.
1.1. Related work
Our work complements several lines of literature in contextual bandits, as listed below.
Latent contextual bandit.
In the latent contextual bandit literature (Zhou and Brunskill, 2016; Sen et al., 2017; Hong et al., 2020a,b; Xu et al., 2021; Nelson et al., 2022; Galozy and Nowaczyk, 2023), the reward is typically modeled as jointly depending on the latent state, the context, and the action. Several works (Zhou and Brunskill, 2016; Hong et al., 2020a,b; Galozy and Nowaczyk, 2023) assume no direct relation between the latent state and the context while positing a parametric reward model. For example, Hong et al. (2020a) assume the latent state is unknown but constant over time, while Hong et al. (2020b) assume that the latent state evolves through a Markov chain. Xu et al. (2021) set a specific context as well as a latent feature for each action and model the reward as depending on them through a generalized linear model. Different from the aforementioned studies, we specify that the observed context is a noisy version of the latent context (which aligns with the applications we are addressing), and then leverage this structure to design the online algorithm.
In another line of work, Sen et al. (2017); Nelson et al. (2022) consider contextual bandits with a latent confounder, where the observed context influences the reward through a latent confounder variable that ranges over a small discrete set. Our setting is distinct from these works in that the latent context can span an infinite (or even continuous) space.
Bandit with noisy context.
In the literature on bandits with noisy context (Yun et al., 2017; Kirschner and Krause, 2019; Yang et al., 2020b; Lin and Moothedath, 2022; Lin et al., 2022), the agent has access to a noisy version of the true context and/or some knowledge of the distribution of the noisy observation. Yun et al. (2017); Park and Faradonbeh (2022); Jose and Moothedath (2024) consider settings where the joint distribution of the true context and the noisy observed context is known (up to a parameter). Other works, such as Kirschner and Krause (2019); Yang et al. (2020b); Lin et al. (2022), assume that the agent knows the exact distribution of the context at each time and observes no context. By assuming a linear reward model, Kirschner and Krause (2019) transform the problem into a linear contextual bandit and obtain sublinear regret compared to the policy maximizing the expected reward over the context distribution. Yang et al. (2020b); Lin et al. (2022); Lin and Moothedath (2022) consider variants of the problem such as multiple linear feedbacks, multiple agents, and delayed observation of the exact context. Compared to these works, we consider a practical but more challenging setting where, besides an unbiased noisy observation (prediction) of each context, the agent only knows second-moment information about its distribution. This does not transform into a standard linear contextual bandit as in Kirschner and Krause (2019).
Bandit with inaccurate/corrupted context.
These works consider settings where the context is simply inaccurate (without randomness) or is corrupted and cannot be recovered. In Yang and Ren (2021a,b), at each round only an inaccurate context is available to the decision-maker, and the exact context is revealed after the action is taken. In Bouneffouf (2020); Galozy et al. (2020), each context is completely corrupted with some probability and the corruption cannot be recovered. In Ding et al. (2022), the context is attacked by an adversarial agent. Because these works focus on adversarial settings for the context observations, applying their regret bounds to our setting generally results in linear regret. For example, in Yang and Ren (2021a), the regret bound of Thompson sampling contains a term driven by the magnitude of the context inaccuracy; as is typical in our setting, this inaccuracy is non-vanishing over time, so that term grows linearly in the horizon. Given the applications we consider, we can instead exploit the stochastic nature of the noisy context observations in our algorithm to achieve improved performance.
1.2. Notations
Throughout this paper, we use $[n]$ to represent the set $\{1, \dots, n\}$ for a positive integer $n$. For $a, b \in \mathbb{R}$, let $a \wedge b$ denote the minimum of $a$ and $b$. $I_d$ denotes the $d$-by-$d$ identity matrix, and $\mathbf{1}_d$ denotes the $d$-dimensional vector with 1 in each entry. For a vector $v$, $\|v\|_2$ denotes its $\ell_2$ norm. For a matrix $A$, $\|A\|_{\mathrm{op}}$ denotes its operator norm. The notation $O(x)$ refers to a quantity that is upper bounded by $x$ up to constant multiplicative factors, while $\tilde{O}(x)$ refers to a quantity that is upper bounded by $x$ up to poly-logarithmic factors.
2. Measurement error adjustment to bandit with noisy context
2.1. Problem setting
We consider a linear contextual bandit with a continuous context space and binary action space $\{0, 1\}$ (see Footnote 1). Let $T$ be the time horizon. As discussed above, we consider the setting where at each time the agent only observes a noisy version of the context instead of the true underlying context. Thus, at each time the observation contains only the noisy context, the chosen action, and the corresponding reward. We further assume that the noisy context equals the true context plus a zero-mean error that is independent of the history; we denote the covariance of this error by $\Sigma_e$, where the subscript 'e' stands for 'error'. Initially we assume that $\Sigma_e$ is known. In Section 2.4, we consider the setting where only estimators of $\Sigma_e$ are available. There is no restriction that the distribution of the error belongs to any known (parametric) family. The reward is linear in the true context with an unknown, action-specific parameter vector, plus mean-zero reward noise. It is worth noting that, besides the policy, all the randomness here comes from the reward noise and the context error. We treat the true contexts as fixed throughout but unknown to the algorithm (unlike Yun et al. (2017), we do not assume they are i.i.d.). Our goal at each time is to design a policy, based on the past history and the current observed noisy context, with which the agent selects an action so as to maximize the reward.
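Concretely, writing $x_t$ for the true context, $\tilde{x}_t = x_t + \epsilon_t$ for its noisy observation with $\mathbb{E}[\epsilon_t] = 0$ and $\mathrm{Cov}(\epsilon_t) = \Sigma_e$, $\theta^*_a$ for the reward parameter of action $a \in \{0,1\}$, and $\eta_t$ for the mean-zero reward noise (these symbols are one illustrative choice of notation), the model and the resulting reward decomposition read:

```latex
% Observation and reward model (illustrative notation):
\tilde{x}_t = x_t + \epsilon_t, \qquad \mathbb{E}[\epsilon_t] = 0, \quad \mathrm{Cov}(\epsilon_t) = \Sigma_e,
\qquad r_t = x_t^\top \theta^*_{a_t} + \eta_t .
% Rewriting the reward in terms of the observed context exposes the difficulty:
r_t = \tilde{x}_t^\top \theta^*_{a_t}
      + \underbrace{\bigl(\eta_t - \epsilon_t^\top \theta^*_{a_t}\bigr)}_{\text{effective noise}},
\qquad
\mathbb{E}\!\left[\eta_t - \epsilon_t^\top \theta^*_{a_t} \,\middle|\, \tilde{x}_t, a_t\right] \neq 0 \ \text{in general}.
```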
If the context error is non-vanishing, standard contextual bandit algorithms are generally sub-optimal. To see this, recall the decomposition displayed above: after rewriting the reward in terms of the observed noisy context, the effective noise contains both the reward noise and a term involving the context error, and these are dependent on the observed context. Consequently, the effective noise no longer has zero mean conditional on the observed context. This is in contrast to the standard linear bandit setting, where the noise has zero conditional mean given the true context, which ensures the sublinear regret of classical bandit algorithms such as UCB and Thompson sampling. Therefore, it is necessary to design an online algorithm that adjusts for the context errors. We assume that the context, parameters, and reward are bounded, as below.
Assumption 2.1 (Boundedness). The contexts are bounded; there exists a positive constant bounding the norm of the reward parameters; and there exists a positive constant bounding the rewards.
For any policy , we define the (standard) cumulative regret as
(2.1)
where
(2.2)
We denote the standard benchmark policy . This is summarized in the setting below.
Setting 1. (Standard setting) We aim to minimize Regret among the class of all policies.
In many applications, including clinical trials, it is desirable to design the policy under the constraint that each action is sampled with a minimum probability. One reason for maintaining exploration is that we can update and re-optimize the policy for future users to allow for potential non-stationarity (Yang et al., 2020a). Maintaining exploration is also important for after-study analysis (Yao et al., 2021), especially when the goal of the analysis is not pre-specified prior to collecting data with the online algorithm. In these situations, it is desirable to consider only policies that always maintain a minimum exploration probability for each arm, and to compare the performance to the clipped benchmark policy:
(2.3)
This is summarized in the setting below.
Setting 2. (Clipped policy setting) We minimize
among the class of policies that explore any action with probability at least .
In this work, we will provide policies with sublinear regret guarantees in both settings.
2.2. Estimation using weighted measurement error adjustment
In this section, we focus on learning the reward model parameters with data after a policy has been executed up to time . Learning a consistent model is important in many bandit algorithms for achieving low regret (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013). As we shall see in Section 2.3, consistent estimation of plays an essential role in controlling the regret of our proposed algorithm.
Inconsistency of the regularized least-squares (RLS) estimator.
UCB and Thompson sampling, two classical bandit algorithms, both achieve sublinear regret based on the consistency of the RLS estimator under certain norms. When the noisy context is observed instead of the true context, the RLS estimator is computed from the noisy observations; we refer to it as RLSCE, the RLS estimator with contextual error. However, when the context error is non-vanishing, this estimator is generally no longer consistent, which may lead to bad decision-making (see Appendix B for details). In the simple case where the contexts are i.i.d. and there is no action (i.e., the action is held fixed), the inconsistency of this estimator is studied in the measurement error model literature in statistics (Fuller, 2009; Carroll et al., 1995) and is known as attenuation.
A naive measurement error adjustment.
A measurement error model is a type of regression model designed to accommodate inaccuracies in the measurement of regressors (i.e., instead of observing the true regressors, we observe them plus a zero-mean noise term). As conventional regression techniques yield inconsistent estimators, measurement error models rectify this issue with adjustments to the estimator that account for these errors. In our setting, when we want to estimate the reward parameters from the history, the observed noisy contexts can be viewed as regressors 'measured with error', while the rewards are the dependent variables. If the contexts are i.i.d. and there is no action (i.e., the action is held fixed), the classical measurement-error-adjusted estimator is consistent for the reward parameters. When multiple actions are present, a naive generalization of the above estimator is
(2.4)
Unfortunately, this naive estimator is inconsistent in the multiple-action setting, even if the policy is stationary and not adaptive to the history. This difference is essentially due to the interaction between the policy and the measurement error. In the classical measurement-error (no-action) setting, the sample second moment of the observed contexts concentrates around its expectation, and likewise the sample cross moment between observed contexts and rewards concentrates around its expectation; combining the two corrected moments yields a consistent estimator. In our setting with multiple actions, however, the agent picks the action based on the observed noisy context, so a given action is selected only for certain values of the observation. Therefore, on the rounds where a particular action is chosen, the context error is more likely to fall in certain regions depending on the policy, and its conditional mean is no longer zero. In other words, the policy creates a complicated dependence between the error and the selected action at each round, which changes the limits of the per-action sample moments. This leads to the inconsistency of the naive estimator (see Appendix B for a concrete example). In Section 3, we provide examples showing that (2.4) not only deviates from the true parameters but also leads to suboptimal decision-making.
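To make the discussion concrete, below is a minimal sketch, in hypothetical notation, of a naive per-action adjustment in the spirit of (2.4): restrict to the rounds where a given action was chosen and apply the classical correction that subtracts the error covariance from the sample second moment. The function name and arguments are illustrative, not the paper's.

```python
import numpy as np

def naive_me_estimate(X_obs, A, R, Sigma_e, a):
    """Naive measurement-error adjustment in the spirit of (2.4) (illustrative only).

    X_obs: (n, d) observed noisy contexts; A: (n,) actions; R: (n,) rewards;
    Sigma_e: (d, d) covariance of the context error; a: the action whose parameter we estimate.
    """
    idx = (A == a)
    Xa, Ra = X_obs[idx], R[idx]
    n_a = idx.sum()
    # Classical correction: subtract the error covariance from the sample second moment.
    M = Xa.T @ Xa / n_a - Sigma_e
    b = Xa.T @ Ra / n_a
    # Inconsistent in general: conditioning on {A_t = a} makes the context error
    # non-zero-mean on the selected rounds whenever the policy depends on the noisy context.
    return np.linalg.solve(M, b)
```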
Our proposed estimator.
Inspired by the above observations, we construct the following estimator for the reward parameters, which corrects (2.4) using importance weights:
(2.5)
where the importance weights are defined through a pre-specified reference policy (which depends on neither the history nor the noisy contexts) that can be chosen by the algorithm. We only require the following:
Assumption 2.2. Let be scaled vectors of and . Then there exist positive constants s.t. (i) , (ii) , and (iii) .
Remark 2.1. In Assumption 2.2, (i) and (ii) are standard moment assumptions, and (iii) is mild. Under mild conditions, (iii) can be satisfied, with high probability, by deterministic or stochastic context sequences such as an i.i.d. sequence, a weakly dependent stationary time series (e.g., a multivariate ARMA process (Fan and Yao, 2017)), or a sequence with periodicity/seasonality (see Appendix D for details).
The theorem below gives a high-probability upper bound on the estimation error of (2.5) (proof in Appendix D).
Theorem 2.1. For any , denote . Then under Assumptions 2.1 and 2.2, there exist absolute constants , such that as long as , with probability at least is upper bounded by
(2.6)
Unlike the existing literature on off-policy learning in contextual bandits (e.g., Wang et al. (2017); Zhan et al. (2021); Zhang et al. (2021); Bibaut et al. (2021)), the role of the importance weights here is to correct for the dependence of the policy on the noisy context observations. The proof idea can be generalized to a large class of off-policy method-of-moment estimators, which might be of independent interest (see Appendix D).
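As an illustration of how importance weighting can be combined with the classical moment correction, here is a minimal sketch in the same hypothetical notation as the naive sketch above; the weights reweight each selected round toward a pre-specified, data-independent reference policy. The exact form of (2.5), including any regularization, may differ from this sketch.

```python
import numpy as np

def meb_estimate(X_obs, A, R, probs, ref_probs, Sigma_e, a, ridge=1e-6):
    """Importance-weighted, measurement-error-adjusted estimator (illustrative sketch of (2.5)).

    probs: (n,) probability with which the executed policy chose the observed action A_t
           (given the history and the noisy context); ref_probs: (n,) probability that the
           pre-specified, data-independent reference policy assigns to A_t.
    """
    idx = (A == a)
    w = ref_probs[idx] / probs[idx]          # importance weights: reference over executed policy
    Xa, Ra = X_obs[idx], R[idx]
    n = len(A)
    # Weighted moments, with the error-covariance correction applied inside the weights.
    M = (Xa * w[:, None]).T @ Xa / n - (w.sum() / n) * Sigma_e
    b = Xa.T @ (w * Ra) / n
    # A small ridge term (not part of the paper's estimator) stabilizes the matrix inversion.
    return np.linalg.solve(M + ridge * np.eye(M.shape[0]), b)
```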
2.3. MEB: Online bandit algorithm with measurement error adjustment
We propose MEB (Measurement Error Bandit), an online bandit algorithm with measurement error adjustment based on the estimator (2.5). The algorithm is presented in Algorithm 1 and is designed for the binary-action setting, although it can be generalized to the case with multiple actions (see Appendix C). During an initial warm-up stage, the algorithm can pick any policy that samples each action with some minimum probability; for instance, it can do pure exploration by sampling each action uniformly at random. After the warm-up, given the noisy context, the algorithm computes the best action according to the parameter estimates calculated from (2.5). It then samples this action with high probability while keeping an exploration probability for the other action. In practice, we can often set the exploration probability to be monotonically decreasing in time. A schematic sketch of this decision loop is given below.
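The following sketch reuses the hypothetical meb_estimate function from Section 2.2; the warm-up length T0, the exploration schedule p_t, and the bookkeeping are illustrative stand-ins for the corresponding quantities in Algorithm 1.

```python
import numpy as np

def meb_action(t, x_obs, hist, Sigma_e, p_t, T0, rng):
    """Choose one action in a MEB-style loop (schematic; not the paper's exact Algorithm 1).

    hist: dict of arrays collected so far: 'X_obs', 'A', 'R', 'probs', 'ref_probs'.
    Returns the sampled action and the probability with which it was sampled.
    """
    if t < T0:
        prob_one = 0.5                      # warm-up: explore both actions uniformly
    else:
        theta = [meb_estimate(hist["X_obs"], hist["A"], hist["R"],
                              hist["probs"], hist["ref_probs"], Sigma_e, a)
                 for a in (0, 1)]
        greedy = int(x_obs @ theta[1] > x_obs @ theta[0])
        # Sample the greedy action with probability 1 - p_t; keep p_t exploration otherwise.
        prob_one = 1.0 - p_t if greedy == 1 else p_t
    a_t = int(rng.random() < prob_one)
    return a_t, (prob_one if a_t == 1 else 1.0 - prob_one)
```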
Before presenting the regret analysis, we first note that our problem is harder than a standard contextual bandit: the true context is unknown, and only its noisy version is observed. Thus, even if the reward parameters are known, we may still perform suboptimally when the observed context is so far from the true context that it leads to a different optimal action. Example 2.1 below shows that, in general, we cannot avoid linear regret.
Example 2.1. Suppose the true contexts are drawn i.i.d. from a two-point distribution with equal probability, and the context errors are large enough that the observed and true contexts can have different signs. Intuitively, even if we know the reward parameters, there is still a constant probability at each time that we cannot make the right choice because the observed context and the true context have different signs, and the true context is never revealed (details in Appendix E). This results in linear regret.
Fortunately, in practice we expect that the errors are relatively 'small' in the sense that the optimal action is not changed by the error. Specifically, we assume the following:
Assumption 2.3. There exist a constant such that almost surely. Here .
Assumption 2.3 ensures that the perturbation to the suboptimality gap between the two arms caused by the context error is controlled by the true suboptimality gap. In this way, the optimal action computed from the noisy context will not deviate from that computed from the true context. As a special case, this assumption is satisfied when the context error is sufficiently small relative to the suboptimality gap. Assumption 2.3 can be further weakened to the corresponding inequalities holding with high probability (see Appendix E). Note that Assumption 2.3 only guarantees that the optimal action is not affected by the context error; to achieve sublinear regret, the reward model still needs to be well estimated. Thus, even with Assumption 2.3, classical bandit algorithms such as UCB may still suffer from linear regret because of the inconsistent estimator (see Appendix B for a concrete example).
We first prove the following theorem, which states that the regret of MEB can be directly controlled by the estimation error. In fact, this theorem holds regardless of the form or quality of the estimation procedure (i.e., the estimator used in line 7 of Algorithm 1). The proof is in Appendix E.
Theorem 2.2. Let Assumption 2.1 and 2.3 hold.
- For the standard setting, Algorithm 1 outputs a policy with no more than
- For the clipped policy setting, Algorithm 1 with the choice of outputs a policy with no more than
The following corollary provides regret guarantees for MEB by combining Theorems 2.1 and 2.2 (proof in Appendix E).
Corollary 2.1. Let Assumption 2.1 to 2.3 hold. There exist universal constants such that:
- For the standard setting, , with probability at least , Algorithm 1 with the choice of outputs a policy with no more than
- For the clipped policy setting, , with probability at least , Algorithm 1 with the choice of and outputs a policy with no more than
Ignoring other factors, the regret upper bound is sublinear in the horizon for both the standard setting and the clipped policy setting, with additional dependence on the dimension.
In certain scenarios, it is desirable to save computational resources by updating the estimates of the reward parameters less frequently in Algorithm 1. Fortunately, low regret guarantees can still be achieved: suppose that the agent only updates the estimators according to (2.5) at selected time points (in line 7), and otherwise simply makes decisions based on the most recently updated estimators. In Appendix E, we show that the time points at which updates are performed can be very infrequent, for example growing geometrically, while still achieving the same rate of regret upper bound as in Corollary 2.1.
2.4. MEB given estimated error variance
In practice, the agent might not have perfect knowledge of $\Sigma_e$, the covariance of the context error. In this section, we discuss the situation where at each time the agent does not know $\Sigma_e$ and only has a (potentially adaptive) estimator of it. This estimator may be derived from auxiliary data or outside knowledge. In this case, in Algorithm 1, we replace the estimator (2.5) with the following estimator for decision-making (i.e., in line 7 of Algorithm 1):
(2.7)
In Appendix F, we show that with this modification, the additional regret of Algorithm 1 is controlled, up to a constant depending on the assumptions, by a weighted average over time of the estimation errors of the error covariance. In practice, it is reasonable to assume that this quantity is small, so that it does not significantly affect the overall regret: for example, if the agent gathers more auxiliary data over time so that the covariance estimates become increasingly accurate, the additional regret term is controlled accordingly.
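When the error covariance must itself be estimated, the same sketch from Section 2.2 applies with the current covariance estimate plugged in; the auxiliary-sample construction below is only one hypothetical way such an estimate might be formed.

```python
import numpy as np

def estimate_error_cov(aux_errors):
    """Plug-in covariance estimate from auxiliary error samples (hypothetical construction)."""
    E = np.asarray(aux_errors)      # rows: prediction errors observed in pilot/auxiliary data
    return E.T @ E / len(E)         # errors assumed zero-mean, so no centering term

# Used in place of the known covariance, mirroring the switch from (2.5) to (2.7):
# theta_a = meb_estimate(X_obs, A, R, probs, ref_probs, estimate_error_cov(aux_errors), a)
```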
3. Simulation results
In this section, we complement our theoretical analyses with simulation results on a synthetic environment with artificial noise and reward models, as well as a simulation environment based on a real dataset, HeartSteps V1 (Klasnja et al., 2018).
Compared algorithms.
In both simulation environments, we compare the following algorithms: Thompson sampling (TS) with normal priors (Russo et al., 2018), Linear Upper Confidence Bound (UCB) approach (Chu et al., 2011), MEB (Algorithm 1), and MEB-naive (MEB plugged in with the naive measurement error estimator (2.4) instead of (2.5)). See Appendix A for a detailed description of the algorithms.
3.1. Synthetic environment
We first test our algorithms in a synthetic environment. We consider a contextual bandit environment with binary actions. In the reward model, the parameters are fixed and the true contexts are drawn i.i.d.; the context errors are independent and normally distributed with a specified covariance. We independently generate bandit data 100 times and compare the candidate algorithms in terms of estimation quality and cumulative regret with a moderate exploration probability (p = 0.2).
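A minimal sketch of a synthetic environment of this type is given below; the dimension, noise levels, and parameter values are hypothetical placeholders rather than the exact configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma_e, sigma_r = 3, 50_000, 1.0, 0.1     # hypothetical dimension and noise levels
theta = {a: rng.normal(size=d) for a in (0, 1)}  # fixed, unknown reward parameters
x_true = rng.normal(size=(T, d))                 # true contexts
x_obs = x_true + rng.normal(scale=sigma_e, size=(T, d))  # noisy observations shown to the agent

def reward(t, a):
    """Reward depends on the TRUE context, even though the agent only sees x_obs[t]."""
    return x_true[t] @ theta[a] + rng.normal(scale=sigma_r)
```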
3.2. HeartStep V1 simulation environment
We also construct a simulation environment based on the HeartSteps dataset. HeartSteps is a physical activity mobile health application whose primary goal is to help users prevent negative health outcomes and adopt and maintain healthy behaviors, for example a higher physical activity level. HeartSteps V1 is a 42-day mobile health trial (Dempsey et al., 2015; Klasnja et al., 2015; Liao et al., 2016) in which participants are provided a Fitbit tracker and a mobile phone application. One of the intervention components is a contextually tailored physical activity suggestion that may be delivered at any of five user-specified times during each day; the delivery times are roughly separated by 2.5 hours.
Construction of the simulated environment.
We follow the simulation setup in Liao et al. (2020). The true context at each time has three main components: an availability indicator $I_t$ of whether an intervention is feasible (e.g., it is 0 when the participant is driving a car, a situation in which the suggestion should not be sent); a set of features at the current time; and the true treatment burden $B_t$, which is a function of the participant's treatment history and evolves according to a specified transition model (see Footnote 2). We assume that the availability indicator and the features are sampled i.i.d. from the empirical distribution of the HeartSteps V1 dataset, and that the burden is given by the aforementioned transition model.
The reward model is linear in the full context together with action interactions on a subset of the context that is considered to have an impact on the treatment effect, plus Gaussian noise on the reward observation, whose variance is chosen to be 0.1, 1.0, and 5.0 respectively (Liao et al., 2016). For a detailed list of variables in the context, see Table 2 in Appendix A.
The true parameters are estimated using GEE (Generalized Estimating Equations), with the reward being the log-transformed step count collected 30 minutes after the decision time.
In light of the measurement error setting in this paper, we consider observation noise on the burden variable $B_t$ for the following reasons: 1) the burden can be understood as a prediction of the participant's burden level, which is particularly crucial in mobile health studies; 2) the other variables are normally believed to have low or no observation noise. Thus, we assume that the agent only observes a noisy burden equal to the true burden plus an error drawn i.i.d. from a normal distribution with mean zero and a specified variance.
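For intuition, here is a schematic of the noisy-burden observation model; the exponential-decay recursion and the numerical constants are hypothetical illustrations, whereas the actual transition model follows Liao et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(1)
decay, sigma_B = 0.9, 1.0        # hypothetical burden decay rate and observation-noise std

def next_burden(B_t, a_t):
    # Hypothetical recursion: sending a suggestion (a_t = 1) adds to the burden,
    # which otherwise decays toward zero over time.
    return decay * B_t + a_t

def observed_burden(B_t):
    # The agent sees only a noisy version of the true burden.
    return B_t + rng.normal(scale=sigma_B)
```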
3.3. Results
Table 1 (a) and (b) show the average regret (cumulative regret divided by the number of steps) in both the synthetic environment and the real-data environment based on HeartSteps V1. We use the same set of context-error levels in both environments, while the different reward-noise levels reflect the different magnitudes of the coefficients in the two environments. MEB shows significantly smaller average regret than the other baseline methods under most combinations of the noise levels. In certain instances, MEB-naive exhibits performance comparable to MEB. This is attributed to its ability to reduce the variance of model estimation while incurring some bias compared to MEB, rendering it a feasible alternative in practical contexts. Notably, in two extreme scenarios, UCB surpasses both MEB and MEB-naive. This is as expected, since when the contextual noise is sufficiently negligible, traditional bandit algorithms are anticipated to outperform the proposed algorithms. An estimation error plot can be found in Appendix A, which also demonstrates that MEB has a lower estimation error.
Table 1:
Average regret for both the synthetic environment and the real-data environment under different combinations of the reward noise level and the context error level. The results are averages over 100 independent runs, and the standard deviations are reported in the full table in Appendix A.
(a) Average regret in the synthetic environment over 50000 steps with clipping probability p = 0.2. | |||||
---|---|---|---|---|---|
TS | UCB | MEB | MEB-naive | ||
0.01 | 0.1 | 0.047 | 0.046 | 0.027 | 0.038 |
0.1 | 0.1 | 0.047 | 0.047 | 0.026 | 0.039 |
1.0 | 0.1 | 0.048 | 0.048 | 0.027 | 0.038 |
0.01 | 1.0 | 0.757 | 0.647 | 0.198 | 0.371 |
0.1 | 1.0 | 0.769 | 0.721 | 0.205 | 0.392 |
1.0 | 1.0 | 0.714 | 0.697 | 0.218 | 0.404 |
0.01 | 2.0 | 1.492 | 1.504 | 0.358 | 0.616 |
0.1 | 2.0 | 1.195 | 1.333 | 0.368 | 0.584 |
1.0 | 2.0 | 1.299 | 1.476 | 0.416 | 0.625 |
(b) Average regret in the real-data environment over 2500 steps with clipping probability p = 0.2. | | | | |
TS | UCB | MEB | MEB-naive | ||
0.05 | 0.1 | 0.027 | 0.027 | 0.022 | 0.024 |
0.1 | 0.1 | 0.026 | 0.024 | 0.020 | 0.020 |
5.0 | 0.1 | 1.030 | 0.743 | 0.831 | 1.173 |
0.05 | 1.0 | 0.412 | 0.408 | 0.117 | 0.112 |
0.1 | 1.0 | 0.309 | 0.316 | 0.085 | 0.087 |
5.0 | 1.0 | 1.321 | 0.918 | 1.458 | 1.322 |
0.05 | 2.0 | 0.660 | 0.634 | 0.144 | 0.148 |
0.1 | 2.0 | 0.740 | 0.704 | 0.151 | 0.155 |
5.0 | 2.0 | 1.585 | 2.415 | 1.577 | 1.436 |
4. Discussion and conclusions
We propose a new algorithm, MEB, which is the first algorithm with sublinear regret guarantees for contextual bandits with noisy context when we have only limited knowledge of the noise distribution. This setting is common in practice, especially when only predictions of an unobserved context are available. MEB leverages the novel estimator (2.5), which extends conventional measurement error adjustment techniques by accounting for the interplay between the policy and the measurement error.
Limitations and future directions.
Several questions remain for future investigation. First, what is the optimal rate of regret compared to the standard benchmark policy, as studied for some other bandits with semi-parametric reward models (e.g., Xu and Wang (2022))? Providing lower bounds on the regret would help us understand the limits of improvement for online algorithms in this setting. Second, we assume that the agent has an unbiased prediction of the true context; it is important to understand how biased predictions affect the results. Last but not least, it would be interesting to see whether our method can be extended to more complicated decision-making settings (e.g., Markov decision processes).
A. Additional details for simulation studies
A. 1. Compared algorithms
In both simulation environments, we compare the following algorithms: Thompson sampling (TS, see details in Algorithm 2) given normal priors (Russo et al., 2018), Linear Upper Confidence Bound (UCB, see details in Algorithm 3) approach (Chu et al., 2011), MEB (Algorithm 1), and MEB-naive (MEB plugged in with the naive measurement error estimator (2.4) instead of (2.5)). To make a fair comparison between algorithms, we use the same regularization parameter for all algorithms. The hyper-parameter is set to be for all results for TS and UCB.
We further compare with robust linear UCB (Algorithm 1 in Ding et al. (2022)), which is shown to achieve the minimax rate for adversarial linear bandits.
A. 2. Additional details for HeartStep V1 study
Table 2 presents the list of variables included in the reward model and in the feature construction for the algorithms. Recall the reward model from Section 3.2: all the variables are included in the baseline features, while only those considered to have an impact on the treatment effect are included in the treatment-interaction features.
Table 2:
List of variables in HeartSteps V1 study.
Variable | Type | Treatment? |
---|---|---|
Availability (It) | Discrete | No |
Prior 30-minute step count | Continuous | No |
Yesterday’s step count | Continuous | No |
Prior 30-minute step count | Continuous | No |
Location | Discrete | Yes |
Current temperature | Continuous | No |
Step variation level | Discrete | Yes |
Burden variable (Bt) | Continuous | Yes |
A.3. Additional results on estimation error
Figure 1:
Log-scaled L2 norm of the estimation error for the four algorithms in the synthetic environment over 50000 steps.
Figure 2:
Log-scaled L2 norm of the estimation error for the four algorithms in the real-data environment based on HeartSteps V1 over 2500 steps.
A. 4. Average regret with standard deviation
Table 3:
Average regret for both the synthetic environment and the real-data environment under different combinations of the reward noise level and the context error level. The numbers in parentheses are the standard deviations calculated from 100 independent runs.
(a) Average regret in the synthetic environment over 50000 steps with clipping probability p = 0.2. | ||||||
---|---|---|---|---|---|---|
TS | UCB | MEB | MEB-naive | RobustUCB | ||
0.01 | 0.1 | 0.047 (0.0015) | 0.046 (0.0015) | 0.027 (0.0011) | 0.038 (0.0013) | 0.050 (0.0051) |
0.1 | 0.1 | 0.047 (0.0015) | 0.047 (0.0015) | 0.026 (0.0011) | 0.039 (0.0013) | 0.049 (0.0048) |
1.0 | 0.1 | 0.048 (0.0015) | 0.048 (0.0015) | 0.027 (0.0011) | 0.038 (0.0013) | 0.044 (0.0047) |
0.01 | 1.0 | 0.757 (0.0164) | 0.647 (0.0145) | 0.198 (0.0079) | 0.371 (0.0107) | 0.652 (0.0050) |
0.1 | 1.0 | 0.769 (0.0160) | 0.721 (0.0156) | 0.205 (0.0080) | 0.392 (0.0110) | 0.753 (0.0056) |
1.0 | 1.0 | 0.714 (0.0155) | 0.697 (0.0150) | 0.218 (0.0083) | 0.404 (0.0112) | 0.589 (0.0047) |
0.01 | 2.0 | 1.492 (0.0281) | 1.504 (0.0283) | 0.358 (0.0129) | 0.616 (0.0169) | 1.608 (0.0102) |
0.1 | 2.0 | 1.195 (0.0244) | 1.333 (0.0260) | 0.368 (0.0131) | 0.584 (0.0164) | 1.064 (0.0079) |
1.0 | 2.0 | 1.299 (0.0257) | 1.476 (0.0277) | 0.416 (0.0139) | 0.625 (0.0170) | 1.881 (0.0114) |
(b) Average regret in the real-data environment over 2500 steps with clipping probability p = 0.2. | | | | | |
TS | UCB | MEB | MEB-naive | RobustUCB | ||
0.05 | 0.1 | 0.027 (0.0067) | 0.027 (0.0070) | 0.022 (0.0057) | 0.024 (0.0058) | 0.025 (0.0079) |
0.1 | 0.1 | 0.026 (0.0057) | 0.024 (0.0053) | 0.020 (0.0046) | 0.020 (0.0046) | 0.028 (0.0079) |
5.0 | 0.1 | 1.030 (0.0287) | 0.743 (0.0262) | 0.831 (0.0267) | 1.173 (0.0343) | 1.400 (0.0447) |
0.05 | 1.0 | 0.412 (0.0355) | 0.408 (0.0351) | 0.117 (0.0148) | 0.112 (0.0143) | 0.226 (0.0020) |
0.1 | 1.0 | 0.309 (0.0293) | 0.316 (0.0299) | 0.085 (0.0112) | 0.087 (0.0116) | 0.206 (0.0125) |
5.0 | 1.0 | 1.321 (0.0417) | 0.918 (0.0304) | 1.458 (0.0422) | 1.322 (0.0388) | 1.065 (0.0400) |
0.05 | 2.0 | 0.660 (0.0343) | 0.634 (0.0322) | 0.144 (0.0129) | 0.148 (0.0133) | 0.304 (0.0241) |
0.1 | 2.0 | 0.740 (0.0505) | 0.704 (0.0489) | 0.151 (0.0145) | 0.155 (0.0149) | 0.432 (0.0386) |
5.0 | 2.0 | 1.585 (0.0454) | 2.415 (0.0816) | 1.577 (0.0508) | 1.436 (0.0462) | 1.345 (0.0423) |
B. Additional explanations on the regularized least-squares (RLS) estimator and the naive estimator (2.4) under noisy context
B. 1. Inconsistency of the RLS estimator
Measurement error model and attenuation.
As briefly mentioned in the main text, a measurement error model is a regression model designed to accommodate inaccuracies in the measurement of regressors. Suppose that there is no action (i.e., the action is held fixed) and the contexts are i.i.d.; then the measurement error model is a useful tool for learning the reward parameters from the collected data, as follows:
(B.1) |
Here, from the measurement error model's perspective, the observed noisy contexts are regressors 'measured with error', and the rewards are the dependent variables.
Regression attenuation, a phenomenon intrinsic to measurement error models, refers to the observation that when the predictors are subject to measurement errors, the Ordinary Least Squares (OLS) estimators of regression coefficients become biased (see, for instance, Carroll et al. (1995)). Specifically, in simple linear regression, the OLS estimator for the slope tends to be biased towards zero. Intuitively, this is because the measurement errors effectively ‘dilute’ the true relationship between variables, making it appear weaker than it actually is.
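For concreteness, the classical attenuation identity in simple linear regression with additive measurement error can be written as follows (the standard result from the measurement error literature, stated in generic notation).

```latex
% Model: Y_i = \beta_0 + \beta_1 X_i + \eta_i, observed W_i = X_i + U_i with U_i independent of (X_i, \eta_i).
% Regressing Y on the error-prone W attenuates the slope:
\hat{\beta}_1^{\mathrm{OLS}} \;\xrightarrow{\;p\;}\; \lambda\,\beta_1,
\qquad
\lambda = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2} \;<\; 1 .
```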
Inconsistency of the RLS estimator.
Before presenting a concrete numerical example showing that the RLS estimator is inconsistent and that this leads to bad decision-making, we first apply the theory of measurement error models to give a heuristic argument for why the RLS estimator is inconsistent, even in the simplified situation where there is no action (i.e., the action is held fixed) and the contexts are i.i.d.
From Section 3.3.2 in Carroll et al. (1995), given data from (B.1) with multiple covariates (in the simplified case with no action described above), the OLS estimator consistently estimates not the true parameter but
(B.2) |
In addition, for a fixed regularization parameter, the regularized least squares (RLS) estimator differs from the OLS estimator by a term that vanishes as the sample size grows. This means that the RLS and OLS estimators converge to the same limit, namely (B.2). Thus, for a fixed regularization parameter, as the sample size grows, the RLS estimator converges to (B.2), and its estimation error does not converge to zero in general.
Finally, recall that in classical bandit algorithms such as UCB, sublinear regret relies on the key property that, with high probability, the estimation error of the RLS estimator, measured in the norm induced by the regularized design matrix, is uniformly bounded by the confidence radius over all rounds. We argue that this requirement generally no longer holds in the setting with measurement error. Since the observed contexts are i.i.d. in this simplified setting, the regularized design matrix concentrates around a matrix whose smallest eigenvalue grows linearly with the number of rounds. Because the RLS estimator converges to the biased limit (B.2), the corresponding weighted norm of its estimation error then scales at least with the square root of the number of rounds, and hence is not uniformly bounded by the confidence radius.
An example. The following is an example in which, given the context errors, the RLS estimator used by the classical bandit algorithms inconsistently estimates the true reward model and, in addition, leads to bad decision-making (linear regret) for these algorithms.
Example B.1. Consider the standard setting described in Section 2.1. Let . Let sampled i.i.d. from , where . Condition on is uniformly sampled from , independent from any other variable in the history. Here denotes the first entry in . We also let be i.i.d. drawn from .
We conduct 100 independent experiments. For each experiment, we generate data as described above and test the performance of UCB (Algorithm 1 in Chu et al. (2011)) and Thompson sampling with normal priors (Russo et al., 2018) using the noisy context instead of the true context. We choose a fixed regularization parameter in the RLS estimator and a fixed confidence parameter in UCB (Chu et al., 2011). Figure 3 summarizes the estimation error of the RLS estimator and the cumulative regret of both algorithms with respect to time, showing both the average and the standard error across the random experiments. We see that the RLS estimator is unable to estimate the true reward model well. Moreover, it is clear that the regret of both UCB and Thompson sampling is linear in the time horizon. Intuitively, this is because the direction of the limit in (B.2) is twisted compared to the true parameter, which not only leads to inconsistent estimators but also alters the optimal action.
Finally, we note that in this setup, Assumption 2.3 is satisfied. This demonstrates that even when the errors do not affect the optimal action given the true context, the poor performance of the RLS estimator may still lead to linear regret for classical bandit algorithms.
B. 2. Inconsistency of the naive measurement error adjusted estimator (2.4)
Example B.2. Let for all , and sampled independently. For the reward model, let sampled independently. So in order to maximize expected reward, we should choose action 1 if is positive and action 0 otherwise.
Suppose the agent takes the following policy that is stationary and non-adaptive to history:
Here, the policy depends on a pre-specified constant. Figure 4 (a) plots the mean and standard deviation of the naive estimator (2.4) over 100 independent experiments as the sample size grows. Observe that the naive estimator converges to different limits under different policies, and in general the limit is not equal to the true parameter.
Figure 3:
Estimation error of the RLS estimator and cumulative regret of UCB (Chu et al., 2011) and Thompson sampling (Russo et al., 2018) under contextual error in Example B.1. The red and pink lines correspond to Thompson sampling and UCB, respectively. The solid lines indicate the mean values, while the shaded bands represent the standard error across the independent experiments.
In contrast, Figure 4 (b) shows the mean and standard deviation of the proposed estimator (2.5) over 100 independent experiments under the same setting with the same three policies as in Figure 4 (a). Unlike the naive estimator (2.4), the proposed estimator (2.5) quickly converges to around the true value −1 for all three candidate policies.
C. Generalization to multiple actions
In this section, we allow more than two actions. The standard and clipped benchmarks become
(C.1) |
and
(C.2) |
Figure 4:
Estimated values from the naive estimator (2.4) in (a) and our proposed estimator (2.5) in (b) under different policies, over 100 independent experiments. The green, blue, and red lines correspond to policies with different values of the pre-specified constant (0.5 for the red line). The solid lines indicate the mean values, while the shaded bands represent the standard deviation across the independent experiments.
In the multi-arm setting, we can still estimate the reward parameters using (2.5) for each arm. Using the same proof ideas as in Theorem 2.1, we obtain the following guarantee on the estimation error (proof omitted):
Theorem C.1. For any , let . Then under Assumption 2.1 and 2.2, there exist constants and such that as long as , with probability at least ,
MEB with multiple actions is shown in Algorithm 4. As in Theorem 2.2, we can control the regret of Algorithm 4 by the estimation error of (2.5). Here, Assumption 2.3 needs to be generalized to the following to adapt to multiple actions:
Assumption C.1. There exists a constant such that almost surely.
The theorem below is a generalization of Theorem 2.2 to multiple actions (The proof is only slightly different from Theorem 2.2; We briefly discuss the difference in Appendix G.1).
Theorem C.2. Let Assumption 2.1 and C. 1 hold.
- For the standard setting, Algorithm 4 outputs a policy with no more than
- For the clipped policy setting, Algorithm 4 with the choice of outputs a policy with no more than
Combining Theorems C. 1 and C.2, we obtain the following corollary.
Corollary C.1. Let Assumption 2.1, 2.2 and C. 1 hold. There exist universal constants such that:
- For the standard setting, , with probability at least , Algorithm 4 with the choice of outputs a policy with no more than
- For the clipped policy setting, , with probability at least , Algorithm 4 with the choice of and outputs a policy with no more than
D. Analysis of the proposed estimator (2.5)
D. 1. Proof of Theorem 2.1
We fix some , and control for . Towards this goal, we combine analysis of the two random terms and in the lemma below.
Lemma D.1. Under the same assumptions of Theorem 2.1, there exists an absolute constant such that with probability at least , both of the followings hold:
(D.1) |
(D.2) |
Proof of Lemma D.1 is in Section G.2.
Denote . Then
where
It’s easy to verify that under the event where both (D.1) and (D.2) hold, whenever
(D.3) |
we have
and
(D.3) can be ensured by , where . Given these guarantees, we have with probability at least ,
Thus we conclude the proof.
D. 2. Additional comments on Assumption 2.2
Following Remark 2.1, each of the following examples of context sequences allows Assumption 2.2(iii) to be satisfied, given any reasonably large sample size, with high probability:
is an i.i.d. sequence satisfying ;
is a weakly-dependent stationary time series (a common example is the multivariate ARMA process under regularity conditions, see e.g. (Banna et al., 2016)). The stationary distribution satisfies ;
is a periodic time series such that there exists which satisfies a.s. .
D. 3. Generalization to off-policy method-of-moment estimation
(2.5) can be generalized to a class of method-of-moment estimators for off-policy learning. In this section, we delve into the general framework of off-policy method-of-moment estimation. This framework proves valuable in scenarios where a fully parametric model class for the reward is unavailable, yet there is a desire to estimate certain model parameters using offline batched bandit data.
For simplicity, we assume that are drawn i.i.d. from an unknown distribution . At each time , the action is drawn from a policy , and the agent observes only together with the action selection probabilities . Define the history up to time as . For , we’re interested in estimating , a -dimensional parameter in , which is the joint distribution of .
Remark D.1. When the context is i.i.d., the problem of estimating in Section 2.2 is a special case of this setup by taking as and as .
The traditional method-of-moment estimator looks for functions as well as a mapping , such that
Then, if given i.i.d. samples from , the estimator takes the form
In fact, the naive estimator (2.4) is of this form. It is clear that we cannot use this estimator for offline batched data : There are no i.i.d. samples from because of the policy . Instead, we propose the following estimator:
where for a data-independent probability distribution on . Similar to the proof of Theorem 2.1, it’s not difficult to see that is consistent under mild conditions. In fact, (2.5) is a special case of this estimator when does not depend on . A more detailed analysis is left for future work.
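In generic (hypothetical) notation, with $g$ the moment function, $\phi$ the mapping described above, $\pi_0$ a data-independent reference distribution over actions, and $\pi_t$ the executed policy, the weighted estimator described in this subsection can be sketched as:

```latex
\hat{\theta} \;=\; \phi\!\left( \frac{1}{T}\sum_{t=1}^{T}
      \frac{\pi_0(a_t)}{\pi_t\!\left(a_t \mid H_{t-1}\right)}\, g\!\left(D_t\right) \right).
```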
E Analysis of MEB
E. 1. Additional comments on Example 2.1
In example 2.1, the joint distribution of is:
The optimal action given and is
In the standard bandit setting, the benchmark policy is
For any policy , the instantaneous regret at time is
Even if both and the joint distribution of are known, can only depend on and history, and cannot be based on . Thus, there is always a (constant) positive probability that the action sampled from does not match (otherwise, sampled from should be equal to a.s.). Thus, the standard cumulative regret will be linear in the time horizon.
E. 2. Proof of Theorem 2.2
We first prove the lemma below. Its proof is in Appendix G.3.
Lemma E.1. Under Assumption 2.3, we have . Consequently, (given a fixed minimum action selection probability ), where
In the below, we define
Standard setting. In the standard setting, we give the lemma below (proof in Appendix G.4).
Lemma E.2. Under the assumptions of Theorem 2.2, at any time ,
Note that for any time , the instantaneous regret at time , and that . Moreover,
Here we used Assumption 2.3. Thus we have
Combining Lemma E.2, we obtain that for any ,
Finally, when , since , the instantaneous regret . We conclude the proof by summing up all the instantaneous regret terms.
Clipped policy setting. In the clipped policy setting, we give the lemma below (proof in Appendix G.5).
Lemma E.3. Under the assumptions of Theorem 2.2, at any time ,
Note that the instantaneous regret at time , and that . Similar to the standard setting, for , under Assumption 2.3, we have
We conclude the proof by summing up all the instantaneous regret terms, and noticing that for .
Results with a high-probability version of Assumption 2.3. As briefly mentioned in the main paper, Assumption 2.3 can be weakened to the inequalities holding with high probability. Now, instead of Assumption 2.3, we assume the following:
Assumption E.1. There exist constants such that . Here denotes the event , and .
It is easy to see that the result of Lemma E.1 holds at time under the event . Further, following the same arguments, we obtain that under Assumption 2.1, the results for either the standard setting or the clipped policy setting hold under the event . Therefore, with Assumptions 2.1 and E.1, in either setting, the results in Theorem 2.2 hold with probability at least .
E. 3. Proof of Corollary 2.1
Standard setting. First, notice that , since is monotonically decreasing in . Theorem 2.1 indicates that, as long as ,
(E.1) |
then with probability at least ,
(E.2) |
Plugging it into Theorem 2.2, we have that, with high probability,
where
for a universal constant , where the last inequality holds if in addition, .
The proof is concluded by combining the above requirement for as well as (E.1).
Clipped policy setting.
Similar to the standard setting, according to Theorem 2.1, as long as ,
(E.3) |
then with probability at least ,
(E.4) |
Plugging it into Theorem 2.2, we have that, with high probability,
where
where the last inequality holds if in addition, .
The proof is concluded by plugging the above into the regret upper bound formula and combining the requirements for .
Results with a high-probability version of Assumption 2.3. Recall that Assumption E.1 is a weakened version of Assumption 2.3 with a high-probability statement. Given Assumptions 2.1, 2.2, and E.1 (instead of 2.3), in the standard setting, the results of Theorem 2.1 hold as long as (E.2) or (E.4) holds, for all relevant times, under the corresponding event. Thus, we deduce that the regret upper bound in (i) holds with high probability. Similarly, given Assumptions 2.1, 2.2, and E.1, in the clipped benchmark setting, the regret upper bound in (ii) holds with high probability.
E. 4. MEB with infrequent model update
As mentioned at the end of Section 2.3, in certain scenarios (e.g. when is large), we can save computational resources by updating the estimates of less frequently. In Algorithm 5, we propose a variant of Algorithm 1. At each time , given the noisy context , the algorithm computes the best action according to the most recently updated estimators of . Then, it samples with probability and keeps an exploration probability of to sample the other action. In the meantime, the agent only has to update the estimate of once in a while to save computation power: The algorithm specifies a subset and updates the estimators according to (2.5) only when .
Under mild conditions, Algorithm 5 achieves the same order of regret upper bound as Algorithm 1, as seen from Theorem E. 1 and Corollary E. 1 below. They are modified versions of Theorem 2.2 and Corollary 2.1.
Theorem E.1. Let be the first time Algorithm 5 updates the model. Suppose Assumption 2.1 and 2.3 hold.
- For the standard setting, for any , Algorithm 5 outputs a policy such that
- For the clipped policy setting, for any , Algorithm 5 with the choice of outputs a policy such that
Here for any .
The proof of Theorem E. 1 is very similar to that of Theorem 2.2, and is thus omitted.
Corollary E.1. Let Assumption 2.1 to 2.3 hold. There exist constants such that:
In the standard setting, as long as the set of model update times satisfies: (a) ; (b) for some constant , then , with probability at least , Algorithm 5 with the choice of achieves .
In the clipped policy setting, as long as the set of model update times satisfies: (a) ; (b) for some constant , then for any s.t. , with probability at least , Algorithm 5 with the choice of achieves: .
The proof of Corollary E. 1 can be directly obtained by combining Theorem 2.1 and E. 1 with in the standard setting and in the clipped benchmark setting. Thus, the proof is omitted here.
In Corollary E.1, conditions (a) and (b) essentially require Algorithm 5 not to start learning the model too late, and to keep updating the learned model at least at time points with a 'geometric' growth rate. This covers a wide range of choices of the update schedule in practice. Two typical examples are: (1) updating the model routinely every fixed number of time points; (2) updating the model at geometrically spaced time points, so that only a small number of updates are needed to save computation. Both are sketched below.
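Two hypothetical update schedules of the kind just described, written out for concreteness (the constants are placeholders):

```python
T, T0, k = 50_000, 50, 100                      # hypothetical horizon, first update, routine gap
routine = list(range(T0, T + 1, k))             # example (1): update every k rounds
doubling = [T0 * 2 ** j for j in range(T.bit_length()) if T0 * 2 ** j <= T]  # example (2): geometric spacing
```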
F Analysis with estimated error variance
We consider the setting where at each time the agent does not have access to the error covariance but has a (potentially adaptive) estimator of it. In this setting, we estimate the model using (2.7) instead of (2.5) and plug it into Algorithm 1. The following theorem controls the estimation error of the resulting estimator. Note that, compared to Theorem 2.1, the additional error caused by the inaccuracy of the covariance estimates can be controlled by a weighted average of their estimation errors.
Theorem F.1. Recall that . Then under Assumptions 2.1 and 2.2, there exist constants and such that as long as and , with probability at least ,
(F.1) |
The proof of Theorem F. 1 is in Appendix G.6.
By combining Theorem F. 1 and 2.2 (with (2.7) instead of (2.5)), we obtain the following regret bounds for Algorithm 1 with (2.7) as the plug-in estimator.
Corollary F.1. Suppose Assumption 2.1 to 2.3 hold. Then there exist universal constants such that:
- In the standard setting, if , by choosing , and (2.7) instead of (2.5) in Algorithm 1, with probability at least ,
- In the clipped policy setting, as long as , by choosing , and (2.7) instead of (2.5) in Algorithm 1, with probability at least ,
The proof is in Appendix G.7.
G Additional proofs
G. 1. Proof of Theorem C. 2
The proof of Theorem C. 2 is very similar to Theorem 2.2, and we only need to note the difference in Lemma E. 2 and E. 3 (for the standard setting and the clipped policy setting respectively), as stated below. Recall that
Standard setting. At any time , we have
(G.1) |
Here note that Lemma E. 1 still holds under Assumption C.1, so .
Note that implies that
which leads to
and further implies . Plugging in the above to (G.1) leads to
The rest of the proof can be done in the same way as the proof of Theorem 2.2.
Clipped policy setting. At any time , we have
(G.2) |
Here recall that .
Note that implies that
which leads to
and further implies .
Plugging in the above to (G.2) leads to
for . The rest of the proof can be done in the same way as the proof of Theorem 2.2.
G. 2. Proof of Lemma D. 1
We first analyze . Notice that , where . For any fixed is a martingale difference sequence. Moreover, we can verify that and
According to Freedman’s Inequality (Freedman, 1975), for any ,
Set , and we obtain . Applying the same analysis to and combining the results gives .
Denote , then the above means that ,
(G.3) |
Let be a -net of , find s.t. , and we have
This implies that
and thus . Combining the above and (G.3), we obtain that for any ,
By choosing , and noticing that , we have
(G.4) |
At the same time, we have
(G.5) |
Here we’ve used the facts that (i) is data-independent; (ii) . Plug (G.5) into (G.4), and we get
(G.6) |
The analysis for is similar. Write . Then for any , it’s easy to verify that , and . Applying Freedman’s Inequality leads to
(G.7) |
where .
Recall that is a -net of , find s.t. , then , and thus
which implies that . Taking this and (G.7) into account, we derive that
By choosing and noticing that , we obtain
(G.8) |
Finally, because
Plug in (G.8), and we obtain
(G.9) |
G. 3. Proof of Lemma E.1
We only need to prove
(G.10) |
If is a direct consequence of Assumption 2.3. If , without loss of generality, suppose . Then according to Assumption 2.3,
Thus (G.10) is true.
G. 4. Proof of Lemma E.2
We have
(G.11) |
Here recall that .
Note that implies that
which leads to
and further implies .
Plugging in the above to (G.11) leads to
G. 5. Proof of Lemma E.3
At any time , we have
(G.12) |
Here recall that .
Note that implies that
which leads to
and further implies .
Plugging in the above to (G.12) leads to
G. 6. Proof of Theorem F.1
Fix such that the conditions of Theorem 2.3 hold. Fix . As in Appendix D, define . We also let . Recall Lemma D.1: with probability at least ,
Meanwhile,
where
Under the events where both (D.1) and (D.2) hold, whenever
(G.13) |
and , we have
and
(G.13) can be ensured by , where . Given these guarantees, we have with probability at least ,
Thus we conclude the proof.
G.7. Proof of Corollary F.1
Standard setting. Notice that is monotonically decreasing in . Theorem F.1 indicates that, as long as ,
(G.14) |
then with probability at least ,
(G.15) |
Plugging it into Theorem 2.2, we have that, with high probability,
where
for a universal constant , where the last inequality holds if in addition, .
The proof is concluded by combining the above requirement for as well as (G.14).
Clipped policy setting.
Similar to the standard setting, according to Theorem F.1, as long as ,
(G.16) |
then with probability at least ,
(G.17) |
Plugging it into Theorem 2.2, we have that, with high probability,
where
where the last inequality holds if in addition, .
The proof is concluded by plugging the above into the regret upper bound formula and combining the requirements for .
Footnotes
For simplicity, we state our results under the binary-action setting, which is common in healthcare (Trella et al., 2022), economics (Athey et al., 2017; Kitagawa and Tetenov, 2018) and other applications. However, all the results presented in this paper can be extended to the setting with multiple actions. See Appendix C.
Note that this violates the contextual bandit assumption and leads to an MDP. We believe this is a good setup for testing the robustness of our proposed approach.
References
- Abbasi-Yadkori Y, Pál D and Szepesvári C (2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems 24.
- Agrawal S and Goyal N (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning. PMLR.
- Athey S, Wager S et al. (2017). Efficient policy learning. Tech. rep.
- Auer P (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 397–422.
- Banna M, Merlevède F and Youssef P (2016). Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications 5 1650006.
- Battalio SL, Conroy DE, Dempsey W, Liao P, Menictas M, Murphy S, Nahum-Shani I, Qian T, Kumar S and Spring B (2021). Sense2Stop: a micro-randomized trial using wearable sensors to optimize a just-in-time-adaptive stress management intervention for smoking relapse prevention. Contemporary Clinical Trials 109 106534.
- Bibaut A, Dimakopoulou M, Kallus N, Chambaz A and van der Laan M (2021). Post-contextual-bandit inference. Advances in Neural Information Processing Systems 34 28548–28559.
- Bouneffouf D (2020). Online learning with corrupted context: Corrupted contextual bandits. arXiv preprint arXiv:2006.15194.
- Bouneffouf D, Bouzeghoub A and Gançarski AL (2012). A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing: 19th International Conference, ICONIP 2012, Doha, Qatar, November 12–15, 2012, Proceedings, Part III 19. Springer.
- Carroll RJ, Ruppert D and Stefanski LA (1995). Measurement Error in Nonlinear Models, vol. 105. CRC Press.
- Chu W, Li L, Reyzin L and Schapire R (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings.
- Cohen S, Mermelstein R, Kamarck T and Hoberman HM (1985). Measuring the functional components of social support. Social Support: Theory, Research and Applications 73–94.
- Dempsey W, Liao P, Klasnja P, Nahum-Shani I and Murphy SA (2015). Randomised trials for the Fitbit generation. Significance 12 20–23.
- Ding Q, Hsieh C-J and Sharpnack J (2022). Robust stochastic linear contextual bandits under adversarial attacks. In International Conference on Artificial Intelligence and Statistics. PMLR.
- Fan J and Yao Q (2017). The Elements of Financial Econometrics. Cambridge University Press.
- Freedman DA (1975). On tail probabilities for martingales. The Annals of Probability 100–118.
- Fuller WA (2009). Measurement Error Models. John Wiley & Sons.
- Galozy A and Nowaczyk S (2023). Information-gathering in latent bandits. Knowledge-Based Systems 260 110099.
- Galozy A, Nowaczyk S and Ohlsson M (2020). A new bandit setting balancing information from state evolution and corrupted context. arXiv preprint arXiv:2011.07989.
- Hong J, Kveton B, Zaheer M, Chow Y, Ahmed A and Boutilier C (2020a). Latent bandits revisited. Advances in Neural Information Processing Systems 33 13423–13433.
- Hong J, Kveton B, Zaheer M, Chow Y, Ahmed A, Ghavamzadeh M and Boutilier C (2020b). Non-stationary latent bandits. arXiv preprint arXiv:2012.00386.
- Jose ST and Moothedath S (2024). Thompson sampling for stochastic bandits with noisy contexts: An information-theoretic regret analysis. arXiv preprint arXiv:2401.11565.
- Kirschner J and Krause A (2019). Stochastic bandits with context distributions. Advances in Neural Information Processing Systems 32.
- Kitagawa T and Tetenov A (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86 591–616.
- Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A and Murphy SA (2015). Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology 34 1220.
- Klasnja P, Smith S, Seewald NJ, Lee A, Hall K, Luers B, Hekler EB and Murphy SA (2018). Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine 53 573–582.
- Langford J and Zhang T (2007). The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems 20 96–1. [Google Scholar]
- Li L, Chu W, Langford J and Schapire RE (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web. [Google Scholar]
- Liao P, Greenewald K, Klasnja P and Murphy S (2020). Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 1–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liao P, Klasnja P, Tewari A and Murphy SA (2016). Sample size calculations for micro-randomized trials in mhealth. Statistics in medicine 35 1944–1971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin J, Lee XY, Jubery T, Moothedath S, Sarkar S and Ganapathysubramanian B (2022). Stochastic conservative contextual linear bandits. arXiv preprint arXiv:2203.15629. [Google Scholar]
- Lin J and Moothedath S (2022). Distributed stochastic bandit learning with context distributions. arXiv preprint arXiv:2207.14391. [Google Scholar]
- Liu Y-E, Mandel T, Brunskill E and Popovic Z (2014). Trading off scientific knowledge and user learning with multi-armed bandits. In EDM. [Google Scholar]
- Nelson E, Bhattacharjya D, Gao T, Liu M, Bouneffouf D and Poupart P (2022). Linearizing contextual bandits with latent state dynamics. In The 38th Conference on Uncertainty in Artificial Intelligence. [Google Scholar]
- Park H and Faradonbeh MKS (2022). Worst-case performance of greedy policies in bandits with imperfect context observations. In 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE. [Google Scholar]
- Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z et al. (2018). A tutorial on thompson sampling. Foundations and Trends® in Machine Learning 11 1–96. [Google Scholar]
- Sarker H, Hovsepian K, Chatterjee S, Nahum-Shani I, Murphy SA, Spring B, Ertin E, Al’Absi M, Nakajima M and Kumar S (2017). From markers to interventions: The case of just-in-time stress intervention. Springer. [Google Scholar]
- Sarker H, Tyburski M, Rahman MM, Hovsepian K, Sharmin M, Epstein DH, Preston KL, Furr-Holden CD, Milam A, Nahum-Shani I et al. (2016). Finding significant stress episodes in a discontinuous time series of rapidly varying mobile sensor data. In Proceedings of the 2016 CHI conference on human factors in computing systems. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sen R, Shanmugam K, Kocaoglu M, Dimakis A and Shakkottai S (2017). Contextual bandits with latent confounders: An nmf approach. In Artificial Intelligence and Statistics. PMLR. [Google Scholar]
- Shaikh H, Modiri A, Williams JJ and Rafferty AN (2019). Balancing student success and inferring personalized effects in dynamic experiments. In EDM. [Google Scholar]
- Trella AL, Zhang KW, Nahum-Shani I, Shetty V, Doshi-Velez F and Murphy SA (2022). Reward design for an online reinforcement learning algorithm supporting oral self-care. arXiv preprint arXiv:2208.07406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang Y-X, Agarwal A and Dudik M (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning. PMLR. [Google Scholar]
- Xu J and Wang Y-X (2022). Towards agnostic feature-based dynamic pricing: Linear policies vs linear valuation with unknown noise. In International Conference on Artificial Intelligence and Statistics. PMLR. [Google Scholar]
- Xu X, Xie H and Lui JC (2021). Generalized contextual bandits with latent features: Algorithms and applications. IEEE Transactions on Neural Networks and Learning Systems. [DOI] [PubMed] [Google Scholar]
- Yang J, Eckles D, Dhillon P and Aral S (2020a). Targeting for long-term outcomes. arXiv preprint arXiv:2010.15835. [Google Scholar]
- Yang J and Ren S (2021a). Bandit learning with predicted context: Regret analysis and selective context query. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications. IEEE. [Google Scholar]
- Yang J and Ren S (2021b). Robust bandit learning with imperfect context. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35. [Google Scholar]
- Yang L, Yang J and Ren S (2020b). Multi-feedback bandit learning with probabilistic contexts. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence Main track. [Google Scholar]
- Yao J, Brunskill E, Pan W, Murphy S and Doshi-Velez F (2021). Power constrained bandits. In Machine Learning for Healthcare Conference. PMLR. [PMC free article] [PubMed] [Google Scholar]
- Yom-Tov E, Feraru G, Kozdoba M, Mannor S, Tennenholtz M and Hochberg I (2017). Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of medical Internet research 19 e338. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yun S-Y, Nam JH, Mo S and Shin J (2017). Contextual multi-armed bandits under feature uncertainty. arXiv preprint arXiv:1703.01347. [Google Scholar]
- Zhan R, Hadad V, Hirshberg DA and Athey S (2021). Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. [Google Scholar]
- Zhang K, Janson L and Murphy S (2021). Statistical inference with m-estimators on adaptively collected data. Advances in neural information processing systems 34 7460–7471. [PMC free article] [PubMed] [Google Scholar]
- Zhou L and Brunskill E (2016). Latent contextual bandits and their application to personalized recommendations for new users. arXiv preprint arXiv:1604.06743. [Google Scholar]