Author manuscript; available in PMC: 2022 Jan 6.
Published in final edited form as: Adv Neural Inf Process Syst. 2020 Dec;33:9818–9829.

Inference for Batched Bandits

Kelly W Zhang 1, Lucas Janson 2, Susan A Murphy 3
PMCID: PMC8734616  NIHMSID: NIHMS1641560  PMID: 35002190

Abstract

As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.

1. Introduction

Due to their regret-minimizing guarantees, bandit algorithms have been increasingly used in real-world sequential decision-making problems, like online advertising [27], mobile health [42], and online education [34]. However, for many real-world problems it is not enough to just minimize regret on a particular problem instance. For example, suppose we have run an online education experiment using a bandit algorithm in which we test different types of teaching strategies. When designing a new online course, ideally we could use the data from the previous experiment to inform the design, e.g., under-performing arms could be eliminated or modified. Moreover, to help others designing online courses, we would like to be able to publish our findings about how different teaching strategies compare in their performance. This example demonstrates the need for statistical inference methods for bandit data, which allow practitioners to draw generalizable knowledge from the data they have collected (e.g., how much better one teaching strategy is than another) for the sake of scientific discovery and informed decision making.

In this work we focus on methods to construct confidence intervals for the margin—the difference in expected rewards of two bandit arms—from batched bandit data. Rather than constructing high-probability confidence intervals, we are interested in constructing confidence intervals by using the asymptotic distribution of estimators to approximate their finite-sample distribution. Asymptotic approximation methods for statistical inference have a long history of success in science and lead to much narrower confidence intervals than those constructed using high-probability bounds. Most statistical inference methods based on asymptotic approximation assume that treatments are assigned independently [15]. However, bandit data violates this independence assumption because it is collected adaptively, meaning previous actions and rewards inform future action selections. The non-independence makes statistical inference more challenging; e.g., estimators like the sample mean are often biased on bandit data [32, 37].

Throughout, we focus on the batched bandit setting, in which arms of the bandit are pulled in batches. For our asymptotic analysis we fix the total number of batches, T, and allow the arm pulls in each batch, n, to go to infinity. Note that we do not need or expect n to go to infinity for real-world experiments; we use the asymptotic distribution of estimators to approximate their finite-sample distribution when constructing confidence intervals. We focus on the batched setting because it closely reflects many of the problem settings where bandit algorithms are applied. For example, in many mobile health [42, 23, 28] and online education problems [22, 34] multiple users use apps / take courses simultaneously, so a batch corresponds to the number of unique users the bandit algorithm acts on at once. The batched setting is even common in online recommendations and advertising because it is impractical to update the bandit after every action if many users visit the site simultaneously [38, 36, 12, 26]. In many such experimental settings the length of the study, T, cannot be arbitrarily adjusted, e.g., in online education, courses generally cannot be made arbitrarily long, and clinical trials often run for a standard amount of time that depends on the domain science (e.g. the length of mobile health studies is a function of the scientific community’s belief in how long it should take for users to form a habit). On the other hand, the number of users, n, can in principle grow as large as funding allows.

Additionally, in our batched setting, we assume that the means of the arms can change over time, i.e., from batch to batch, which reflects the temporal non-stationarity that is prevalent in many real world bandit application problems. For example, in online recommendation systems, the click through rate of a given recommendation typically varies over time, e.g., breaking news articles become less popular over time [38, 26]. Online education and mobile health are also highly non-stationary problems because users tend to disengage over time, so the same notification may be much less effective if sent near the end of an experiment than sent near the beginning [9, 21, 7]. Our statistical inference method does not need to assume that the number of stationary time periods in the experiment is large and is robust to temporal non-stationarity from batch to batch.

The first contribution of this work is proving that on bandit data, rather surprisingly, whether standard estimators are asymptotically normal can depend on whether the margin is zero.

We prove that for common bandit algorithms, the arm selection probabilities only concentrate if there is a unique optimal arm. Thus, for two-arm bandits, the arm selection probabilities do not concentrate when the margin—the difference in the expected rewards between the arms—is zero. We show that this leads the ordinary least squares (OLS) estimator to be asymptotically normal when the margin is non-zero, and asymptotically non-normal when the margin is zero. Since the OLS estimator does not converge uniformly (over values of the margin), standard inference methods (normal approximations, the bootstrap) can lead to inflated Type-1 error and unreliable confidence intervals on bandit data.

The second contribution of this work is introducing the Batched OLS (BOLS) estimator, which can be used for reliable inference—even in non-stationary settings—on data collected with batched bandits.

We prove that, regardless of whether the margin is zero or not, the BOLS estimator for the margin for both multi-arm and contextual bandits is asymptotically normal and thus can be used for both hypothesis testing and obtaining confidence intervals. Moreover, BOLS is also automatically robust to non-stationarity in the rewards and can be used for constructing valid confidence intervals even if there is non-stationarity in the baseline reward, i.e., if the rewards of the arms change from batch to batch, but the margin remains constant. If the margin itself is also non-stationary, BOLS can also be used for constructing simultaneous confidence intervals for the margins for each batch.

2. Related Work

Batched Bandits

Much work on batched bandits focuses on minimizing regret [33, 10] or identifying the best arm with high probability [2, 18]. The best arm identification literature utilizes high-probability confidence bounds to construct confidence intervals for bandit parameters; we discuss this method in the next section. Note that in contrast to other batched bandit literature that allows batch sizes to be adjusted adaptively [33], here we do not have adaptive control over the batch sizes. Batched bandits are closely related to multistage adaptive clinical trials, in which between each batch (or stage of the trial) the data collection procedure can be adjusted depending on the outcomes of the previous batches. Our Batched OLS estimator is most closely related to "stage-wise" p-values for group sequential trials that are computed on each stage separately [40]. p-value combination tests are commonly used to combine stage-wise p-values when the sequence of p-values is shown to be independent or p-clud, meaning that under the null each p-value has a Uniform(0,1) distribution conditional on past p-values [40]. [29] formally establish the independence of stage-wise p-values for two-stage trials in which there is a countable number of adaptive rules; note that this rules out bandit algorithms with real-valued arm selection probabilities, like Thompson Sampling. [4] establishes the p-clud property for two-stage adaptive clinical trials under the assumption that, under the null hypothesis, the distribution of the second-stage data is known conditional on the decision rule and the first-stage data. Neither of these methods is sufficient for obtaining independent p-values for adaptive trials (1) with an arbitrary number of stages, (2) where the exact distribution of rewards is unknown, and (3) where the action selection probabilities can be real numbers, as for Thompson Sampling.

High Probability Confidence Intervals

High-probability confidence intervals provide stronger guarantees than those constructed using asymptotic approximations. In particular, these bounds are guaranteed to hold for a finite number of observations and often even hold uniformly over all n and T. These types of bounds are used throughout the bandit and reinforcement learning literature to construct confidence intervals for bandit parameters [14, 20], prove regret bounds [1, 25], and provide guarantees regarding best arm identification [16, 17]. The primary drawback of high-probability confidence intervals is that they are much more conservative than those constructed using asymptotic approximations. This means that many more observations are needed to obtain a confidence interval of the same width, or for a statistical test to have the same power, when using high-probability confidence intervals rather than those constructed using asymptotic approximation. Since the cost of increasing the number of users in a study can be large, being able to construct narrow—yet reliable—confidence intervals is crucial to many applications.

In our simulations we compare our method to high-probability confidence bounds constructed using the self-normalized martingale bound of [1]. This bound is guaranteed to hold on adaptively collected data and is commonly used in the proofs of regret bounds for bandit algorithms. We find that all the approaches based on asymptotic approximations (which we discuss next) significantly outperform the statistical test constructed using the self-normalized martingale bound in terms of power. Moreover, despite the weaker guarantees of statistical inference based on asymptotic approximations, these methods are generally able to provide reliable confidence interval coverage and Type-1 error control.

Adaptive Inference based on Asymptotic Approximations

A common approach in the literature for performing inference on bandit data is to use adaptive weights, which are weights that are a function of the history. An early example of using adaptive weights is that of [31] and [30], who use adaptive weights in estimating the expected reward under the optimal policy when one has access to i.i.d. observational data. They use an Augmented-Inverse-Probability-Weighted estimator with adaptive weights that are a function of the estimated standard deviation of the reward. [30] conjecture that their approach can be adapted to the adaptive sampling case. Subsequently [11] developed the adaptively weighted method for inference on bandit data to produce the Adaptively-Weighted Augmented-Inverse-Probability-Weighted Estimator (AW-AIPW) for data collected via multi-arm bandits. They prove a central limit theorem (CLT) for AW-AIPW when the adaptive weights satisfy certain conditions. Note, however, the AW-AIPW estimator does not have guarantees in non-stationary settings.

Adaptive weights are also used by [6] to form the W-decorrelated estimator, a debiased version of OLS, that is asymptotically normal. In the multi-arm bandit setting, the adaptive weights are a function of the number of times an arm was chosen previously. We found that in the two-arm setting, the W-decorrelated estimator down-weights rewards from later in the study (Appendix F). [5] introduce the Online Debiased Estimator that also has bias guarantees on adaptive data, but in the more challenging high-dimensional linear regression setting. They prove the asymptotic normality of their estimator in the Gaussian autoregressive time series and the two-batch settings. Note that none of these estimation methods have guarantees in non-stationary bandit settings.

[24] provide conditions under which the OLS estimator is asymptotically normal on adaptively collected data. However, as noted in [39, 6, 11], classical inference techniques developed for i.i.d. data often empirically have inflated Type-1 error on bandit data. In Section 4.1, we discuss the restrictive nature of [24]’s CLT conditions.

3. Problem Formulation

Setup and Notation

Though our results generalize to K-arm, contextual bandits (see Section 5.2), we first focus on the two-arm bandit for expositional simplicity. Suppose there are T timesteps or batches in a study. In each batch t ∈ [1: T], we select n binary actions $\{A_{t,i}\}_{i=1}^n \in \{0,1\}^n$. We then observe independent rewards $\{R_{t,i}\}_{i=1}^n$, one for each action selected. Note that the distribution of these random variables changes with the batch size, n. For example, the distribution of the actions one chooses for the 2nd batch, $\{A_{2,i}\}_{i=1}^n$, may change if one has observed n = 10 vs. n = 100 samples $\{A_{1,i}, R_{1,i}\}_{i=1}^n$ in the first batch. For readability, we omit indexing random variables by n, except for the variables $\mathcal{H}_{t-1}^{(n)}$ and $\pi_t^{(n)}$, and filtrations like $\mathcal{G}_t^{(n)}$, to be introduced next.

For each t ∈ [1: T], the bandit selects actions $\{A_{t,i}\}_{i=1}^n \overset{i.i.d.}{\sim} \mathrm{Bernoulli}(\pi_t^{(n)})$ conditional on $\mathcal{H}_{t-1}^{(n)} \triangleq \{A_{t',i}, R_{t',i}\}_{i=1,\,t'=1}^{i=n,\,t'=t-1}$, the history prior to batch t. Note, the action selection probability $\pi_t^{(n)} \triangleq \mathbb{P}(A_{t,i} = 1 \mid \mathcal{H}_{t-1}^{(n)})$ depends on the history $\mathcal{H}_{t-1}^{(n)}$. We assume the following conditional mean for the rewards:

$$\mathbb{E}[R_{t,i} \mid \mathcal{H}_{t-1}^{(n)}, A_{t,i}] = (1 - A_{t,i})\beta_{t,0} + A_{t,i}\beta_{t,1}. \tag{1}$$

Note that although in equation (1) we condition on $\mathcal{H}_{t-1}^{(n)}$, the conditional mean of the reward does not depend on prior rewards or actions. Let $X_{t,i} \triangleq [1 - A_{t,i}, A_{t,i}]^\top \in \mathbb{R}^2$; note $X_{t,i}$ is higher dimensional when we add more arms and/or context variables. We define the errors as $\epsilon_{t,i} \triangleq R_{t,i} - X_{t,i}^\top \beta_t$. Equation (1) implies that $\{\epsilon_{t,i} : i \in [1:n], t \in [1:T]\}$ is a martingale difference array with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$, where $\mathcal{G}_t^{(n)} \triangleq \sigma(\mathcal{H}_{t-1}^{(n)} \cup \{A_{t,i}\}_{i=1}^n)$; thus $\mathbb{E}[\epsilon_{t,i} \mid \mathcal{G}_t^{(n)}] = 0$ for all t, i, n. The parameters $\beta_t = (\beta_{t,0}, \beta_{t,1})$ can change across batches t ∈ [1: T], which allows for non-stationarity between batches. Assuming that $\beta_t = \beta_{t'}$ for all t, t′ ∈ [1: T] recovers the stationary mean case.
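As a concrete illustration of this data-generating model, the following sketch simulates one batch under equation (1) with Gaussian errors. The function name and parameters are ours, not from the paper:

```python
import numpy as np

def generate_batch(rng, n, pi_t, beta_t, sigma=1.0):
    """Simulate one batch of the two-arm bandit model in equation (1).

    Actions are i.i.d. Bernoulli(pi_t) given the history; rewards follow
    R = (1 - A) * beta0 + A * beta1 + eps with E[eps | history, A] = 0.
    """
    A = rng.binomial(1, pi_t, size=n)       # actions A_{t,i}
    eps = rng.normal(0.0, sigma, size=n)    # errors eps_{t,i}
    R = (1 - A) * beta_t[0] + A * beta_t[1] + eps
    return A, R

rng = np.random.default_rng(0)
A, R = generate_batch(rng, n=10_000, pi_t=0.5, beta_t=(0.0, 1.0))
```

Running the bandit for T batches amounts to calling this once per batch, with `pi_t` updated from the accumulated history between calls.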

Action Selection Probability Constraint (Clipping)

In order to perform inference on bandit data it is necessary to guarantee that the bandit algorithm explores sufficiently. For example, the CLTs for both the W-decorrelated [6] and the AW-AIPW [11] estimators have conditions that implicitly require that the bandit algorithms cannot sample any given action with probability that goes to zero or one arbitrarily fast. Greater exploration also increases the power of statistical tests regarding the margin [41]. Moreover, if there is non-stationarity in the margin between batches, it is desirable for the bandit algorithm to continue exploring. We explicitly guarantee exploration by constraining the probability that any given action can be sampled (see Definition 1). We allow the action selection probabilities πt(n) to converge to 0 and/or 1 at some rate.

Definition 1. A clipping constraint with rate f(n) means that $\pi_t^{(n)}$ satisfies the following:

$$\lim_{n \to \infty} \mathbb{P}\big(\pi_t^{(n)} \in [f(n), 1 - f(n)]\big) = 1 \tag{2}$$
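In practice, the clipping constraint amounts to projecting whatever probability the bandit algorithm proposes onto [f(n), 1 − f(n)]. A minimal sketch (the helper name is ours):

```python
import numpy as np

def clip_probability(pi_raw, f_n):
    """Project a proposed action-selection probability onto [f(n), 1 - f(n)]."""
    return float(np.clip(pi_raw, f_n, 1.0 - f_n))
```

For example, Thompson Sampling's posterior probability that arm 1 is optimal would be passed through this projection before actions are sampled.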

4. Asymptotic Distribution of the Ordinary Least Squares Estimator

Suppose we are in the stationary case, and we would like to estimate β. Consider the OLS estimator: $\hat{\beta}^{OLS} = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top R$, where $\underline{X} \triangleq [X_{1,1}, \dots, X_{1,n}, \dots, X_{T,1}, \dots, X_{T,n}]^\top \in \mathbb{R}^{nT \times 2}$ and $R \triangleq [R_{1,1}, \dots, R_{1,n}, \dots, R_{T,1}, \dots, R_{T,n}]^\top \in \mathbb{R}^{nT}$. Note that $\underline{X}^\top \underline{X} = \sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top$.
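For this two-column design, the pooled OLS estimate reduces to the per-arm sample means over all batches. A sketch (function name ours):

```python
import numpy as np

def ols_two_arm(A, R):
    """OLS estimate of (beta0, beta1) with design rows X = [1 - A, A].

    Because the two columns are orthogonal, OLS reduces to the
    per-arm sample means of the pooled rewards.
    """
    X = np.column_stack([1 - A, A]).astype(float)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ R)
    return beta_hat

rng = np.random.default_rng(1)
A = rng.binomial(1, 0.5, size=1000)
R = 0.3 * A + rng.normal(size=1000)
beta_hat = ols_two_arm(A, R)
```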

4.1. Conditions for Asymptotic Normality of the OLS Estimator

If $(X_{t,i}, \epsilon_{t,i})$ are i.i.d., $\mathbb{E}[\epsilon_{t,i}] = 0$, $\mathbb{E}[\epsilon_{t,i}^2] = \sigma^2$, and the first two moments of $X_{t,i}$ exist, a classical result from statistics [3] is that the OLS estimator is asymptotically normal, i.e., as n → ∞,

$$(\underline{X}^\top \underline{X})^{1/2} (\hat{\beta}^{OLS} - \beta) \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_p).$$

[24] generalize this result by proving that the OLS estimator is still asymptotically normal in the adaptive sampling case when $\underline{X}^\top \underline{X}$ satisfies a certain stability condition. To show that a similar result holds for the batched setting, we generalize the CLT of [24] to triangular arrays (required since the distributions of our random variables vary as the batch size, n, changes), as stated in Theorem 1 below.

Condition 1 (Moments). For all t, n, i, $\mathbb{E}[\epsilon_{t,i}^2 \mid \mathcal{G}_t^{(n)}] = \sigma^2$ and $\mathbb{E}[\epsilon_{t,i}^4 \mid \mathcal{G}_t^{(n)}] < M < \infty$.

Condition 2 (Stability). For some non-random sequence of scalars $\{a_n\}_{n=1}^\infty$, as n → ∞,

$$a_n^{-1} \cdot \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n A_{t,i} \xrightarrow{P} 1.$$

Theorem 1 (Triangular array version of [24], Theorem 3). Assuming Conditions 1 and 2, as n → ∞,

$$(\underline{X}^\top \underline{X})^{1/2} (\hat{\beta}^{OLS} - \beta) \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_p).$$

Note that in the bandit setting, Condition 2 means that prior to running the experiment, the asymptotic rate at which arms will be selected is predictable. We will show that Condition 2 is in a sense necessary for the asymptotic normality of OLS. In Corollary 1 below we state that Conditions 1 and 3, and a non-zero margin are sufficient for stability Condition 2. Later, we will show that when the margin is zero, Condition 2 does not hold for many common bandit algorithms and prove that this leads the OLS estimator to be asymptotically non-normal.

Condition 3 (Conditionally i.i.d. actions). For each t ∈ [1: T], $\{A_{t,i}\}_{i=1}^n \sim \mathrm{Bernoulli}(\pi_t^{(n)})$, i.i.d. over i ∈ [1: n] conditional on $\mathcal{H}_{t-1}^{(n)}$.

Corollary 1 (Sufficient conditions for Theorem 1). If Conditions 1 and 3 hold, and the margin is non-zero, data collected in batches using ϵ-greedy, Thompson Sampling, or UCB with a clipping constraint with f(n) = c for some $0 < c \le \frac{1}{2}$ (see Definition 1) satisfy the conditions of Theorem 1.

4.2. Asymptotic Non-Normality under No Margin

We prove the conjecture of [6] that when the margin is zero, the OLS estimator is asymptotically non-normal under common bandit algorithms, including Thompson Sampling, ϵ-greedy, and UCB. Thus, as seen in Figure 1, assuming the OLS estimator is approximately normal on bandit data can lead to inflated Type-1 error, even asymptotically. The asymptotic non-normality of OLS occurs when the margin is zero because when there is no unique optimal arm, $\pi_t^{(n)}$ does not concentrate as n → ∞ (Appendix C).

Figure 1: Empirical distribution of the Z-statistic (σ² is known) of the OLS estimator for the margin. All simulations are with no margin (β₁ = β₀ = 0); N(0,1) rewards; T = 25; and n = 100. For ϵ-greedy, ϵ = 0.1.

We state the asymptotic non-normality result for Thompson Sampling in Theorem 2; see Appendix C for the proof and similar results for ϵ-greedy and UCB. It is sufficient to prove asymptotic non-normality for T = 2. Note, $\hat{\Delta}^{OLS}$ is the difference in the sample means for each arm, so $\hat{\Delta}^{OLS} = \hat{\beta}_1^{OLS} - \hat{\beta}_0^{OLS}$. The Z-statistic of $\hat{\Delta}^{OLS}$, which is asymptotically normal under i.i.d. sampling, is as follows:

$$\sqrt{\frac{\big(\sum_{t=1}^2 \sum_{i=1}^n A_{t,i}\big)\big(\sum_{t=1}^2 \sum_{i=1}^n 1 - A_{t,i}\big)}{2\sigma^2 n}}\,(\hat{\Delta}^{OLS} - \Delta). \tag{3}$$
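Statistic (3) can be computed directly from the pooled data; the sketch below handles general T, with (3) the special case T = 2 (function name ours):

```python
import numpy as np

def ols_z_statistic(A, R, Delta=0.0, sigma2=1.0):
    """Z-statistic (3) for the pooled OLS margin estimate.

    A, R hold the pooled actions and rewards over all batches (length n*T);
    the statistic is sqrt(N1 * N0 / (nT * sigma^2)) * (Delta_hat - Delta).
    """
    N1 = A.sum()
    N0 = A.size - N1
    delta_hat = R[A == 1].mean() - R[A == 0].mean()
    return np.sqrt(N1 * N0 / (A.size * sigma2)) * (delta_hat - Delta)

A = np.array([1, 1, 0, 0])
R = np.array([1.0, 1.0, 0.0, 0.0])
z = ols_z_statistic(A, R)   # equals 1.0 on this toy data
```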

Theorem 2 (Asymptotic non-normality of OLS estimator under zero margin for Thompson Sampling). Let T = 2 and $\pi_1^{(n)} = \frac{1}{2}$. If $\epsilon_{t,i} \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, we have independent normal priors on the arm means $\tilde{\beta}_0, \tilde{\beta}_1 \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, and $\pi_2^{(n)} = \pi_{\min} \vee \big[(1 - \pi_{\max}) \wedge \mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)})\big]$ for constants $\pi_{\min}, \pi_{\max}$ with $0 < \pi_{\min} \le \pi_{\max} < 1$, then (3) is asymptotically not normal when the margin is zero.

Since the OLS estimator is asymptotically normal when Δ ≠ 0 (Corollary 1) and asymptotically not normal when Δ = 0, the OLS estimator does not converge uniformly (over values of the margin) on data collected under standard bandit algorithms. The non-uniform convergence of the OLS estimator precludes us from using a normal approximation to perform hypothesis testing and construct confidence intervals (see [19]). In real-world applications, there is rarely exactly zero margin. However, the non-uniform convergence of the OLS estimator at zero margin is still practically important because the asymptotic distribution of the OLS estimator when the margin is zero is indicative of the finite-sample distribution when the margin is statistically difficult to distinguish from zero, i.e., when the signal-to-noise ratio $|\Delta|/\sigma$ is low. Figure 2 shows that even when the margin is non-zero, when the signal-to-noise ratio is low, confidence intervals constructed using a normal approximation have coverage probabilities below the nominal level. Moreover, for any batch size n and noise variance σ², there exists a non-zero margin size whose finite-sample distribution is poorly approximated by a normal distribution.

Figure 2: Empirical undercoverage probabilities (coverage probability below 95%) of confidence intervals based on a normal approximation for the OLS estimator. We use Thompson Sampling with N(0,1) priors, a clipping constraint of $0.05 \le \pi_t^{(n)} \le 0.95$, N(0,1) rewards, T = 25, and known σ². Standard errors are < 0.001.

5. Batched OLS Estimator

5.1. Batched OLS Estimator for Multi-Arm Bandits

We now introduce the Batched OLS (BOLS) estimator, which is asymptotically normal under a large class of bandit algorithms, even when the margin is zero. Instead of computing the OLS estimator on all the data, we compute the OLS estimator for each batch and normalize it by the variance estimated from that batch. For each t ∈ [1: T], the BOLS estimator of the margin $\Delta_t := \beta_{t,1} - \beta_{t,0}$ is:

$$\hat{\Delta}_t^{BOLS} = \frac{\sum_{i=1}^n A_{t,i} R_{t,i}}{\sum_{i=1}^n A_{t,i}} - \frac{\sum_{i=1}^n (1 - A_{t,i}) R_{t,i}}{\sum_{i=1}^n (1 - A_{t,i})}.$$
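Per batch, the BOLS estimate and the factor $\sqrt{N_0 N_1 / n}$ used to standardize it are simple to compute; a sketch (names ours):

```python
import numpy as np

def bols_batch(A_t, R_t):
    """BOLS margin estimate for one batch plus its standardizing factor.

    Returns (Delta_hat_t, sqrt(N0 * N1 / n)), where N1 and N0 count the
    pulls of each arm within the batch.
    """
    n = A_t.size
    N1 = A_t.sum()
    N0 = n - N1
    delta_hat = R_t[A_t == 1].mean() - R_t[A_t == 0].mean()
    return delta_hat, np.sqrt(N0 * N1 / n)

delta_hat, scale = bols_batch(np.array([1, 1, 0, 0]),
                              np.array([2.0, 2.0, 1.0, 1.0]))
```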

Theorem 3 (Asymptotic Normality of Batched OLS estimator for multi-arm bandits). Assuming Conditions 1 (moments) and 3 (conditionally i.i.d. actions), and a clipping rate $f(n) = \frac{1}{n^\alpha}$ for some 0 ≤ α < 1 (see Definition 1),

$$\begin{bmatrix} \sqrt{\frac{(\sum_{i=1}^n 1 - A_{1,i})(\sum_{i=1}^n A_{1,i})}{n}}\,(\hat{\Delta}_1^{BOLS} - \Delta_1) \\ \sqrt{\frac{(\sum_{i=1}^n 1 - A_{2,i})(\sum_{i=1}^n A_{2,i})}{n}}\,(\hat{\Delta}_2^{BOLS} - \Delta_2) \\ \vdots \\ \sqrt{\frac{(\sum_{i=1}^n 1 - A_{T,i})(\sum_{i=1}^n A_{T,i})}{n}}\,(\hat{\Delta}_T^{BOLS} - \Delta_T) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_T)$$

It is straightforward to generalize Theorem 3 to the case that batches are different sizes but the size of the smallest batch goes to infinity and the batch size is independent of the history.

By Theorem 3, for the stationary margin case, we can test H0 : Δ = c vs. H1 : Δ ≠ c with the following statistic, which is asymptotically normal under the null:

$$\frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{(\sum_{i=1}^n 1 - A_{t,i})(\sum_{i=1}^n A_{t,i})}{n\sigma^2}}\,(\hat{\Delta}_t^{BOLS} - c). \tag{4}$$
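Combining the per-batch Z-statistics as in (4) gives the overall test statistic; a sketch assuming known σ² (names ours):

```python
import numpy as np

def bols_test_statistic(batches, c=0.0, sigma2=1.0):
    """Statistic (4): scaled sum of per-batch BOLS Z-statistics.

    `batches` is a list of (A_t, R_t) pairs; the statistic is
    asymptotically N(0, 1) under H0: Delta = c when the margin is
    stationary across batches.
    """
    total = 0.0
    for A_t, R_t in batches:
        n = A_t.size
        N1 = A_t.sum()
        N0 = n - N1
        delta_hat = R_t[A_t == 1].mean() - R_t[A_t == 0].mean()
        total += np.sqrt(N0 * N1 / (n * sigma2)) * (delta_hat - c)
    return total / np.sqrt(len(batches))

batch = (np.array([1, 1, 0, 0]), np.array([2.0, 2.0, 1.0, 1.0]))
stat = bols_test_statistic([batch, batch])   # two identical toy batches
```

A two-sided level-α test rejects when |stat| exceeds the standard normal quantile $z_{1-\alpha/2}$.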

This type of test statistic—a weighted combination of asymptotically independent normals—is a special case of the inverse normal p-value combination test, which has been used in simple settings in which the studies (e.g., batches) are independent (e.g., when conducting meta-analyses across multiple studies) [26]. Here the ability to use this type of test statistic is novel since, due to the bandit algorithm, the batches are not independent. The work here demonstrates asymptotic independence, and thus for large n the Z-statistics from each batch should be approximately independently distributed.

The key to proving asymptotic normality for BOLS is that the following ratio converges in probability to one: $\sqrt{\frac{(\sum_{i=1}^n 1 - A_{t,i})(\sum_{i=1}^n A_{t,i})}{n}} \big/ \sqrt{n \pi_t^{(n)} (1 - \pi_t^{(n)})} \xrightarrow{P} 1$. Since $\pi_t^{(n)}$ is measurable with respect to $\mathcal{H}_{t-1}^{(n)}$, the quantity $\sqrt{n \pi_t^{(n)} (1 - \pi_t^{(n)})}$ is a constant given $\mathcal{H}_{t-1}^{(n)}$. Thus, even if $\pi_t^{(n)}$ does not concentrate, we are still able to apply the martingale CLT [8] to prove asymptotic normality. See Appendix B for more details.
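The concentration of this ratio is easy to check numerically: with actions i.i.d. Bernoulli(π_t) given the history, the empirical factor approaches the conditional-on-history factor as n grows. A small sanity-check simulation (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
pi_t = 0.3

def bols_scale_ratio(n):
    """Ratio of the empirical factor sqrt(N0 * N1 / n) to the
    conditional-on-history factor sqrt(n * pi * (1 - pi))."""
    A = rng.binomial(1, pi_t, size=n)
    N1 = A.sum()
    return np.sqrt(N1 * (n - N1) / n) / np.sqrt(n * pi_t * (1 - pi_t))

ratio = bols_scale_ratio(100_000)   # concentrates near 1 as n grows
```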

5.2. Batched OLS Estimator for Contextual Bandits

For contextual K-arm bandits, for any two arms x, y ∈ [0: K − 1], we can estimate the margin between them, $\Delta_{t,xy} \triangleq \beta_{t,x} - \beta_{t,y} \in \mathbb{R}^d$. In each batch, we observe context vectors $\{C_{t,i}\}_{i=1}^n$ for $C_{t,i} \in \mathbb{R}^d$. We redefine the history $\mathcal{H}_{t-1}^{(n)} \triangleq \{C_{t',i}, A_{t',i}, R_{t',i}\}_{i=1,\,t'=1}^{i=n,\,t'=t-1}$ and define the filtration $\mathcal{F}_t^{(n)} \triangleq \sigma(\mathcal{H}_{t-1}^{(n)} \cup \{A_{t,i}, C_{t,i}\}_{i=1}^n)$. The action selection probabilities are now functions of the context, so $\pi_t^{(n)}(C_{t,i}) \in [0,1]^K$ is a vector whose kth dimension equals $\mathbb{P}(A_{t,i} = k \mid \mathcal{H}_{t-1}^{(n)}, C_{t,i})$. We assume the following conditional mean model of the reward: $\mathbb{E}[R_{t,i} \mid \mathcal{F}_t^{(n)}] = \sum_{k=0}^{K-1} \mathbb{I}(A_{t,i} = k)\, C_{t,i}^\top \beta_{t,k}$, and let $\epsilon_{t,i} \triangleq R_{t,i} - \sum_{k=0}^{K-1} \mathbb{I}(A_{t,i} = k)\, C_{t,i}^\top \beta_{t,k}$.

Condition 4 (Conditionally i.i.d. contexts). For each t, the contexts $C_{t,1}, C_{t,2}, \dots, C_{t,n}$ are i.i.d., and their first two moments, $\mu_t$ and $\underline{\Sigma}_t$, are non-random given $\mathcal{H}_{t-1}^{(n)}$, i.e., $\mu_t, \underline{\Sigma}_t \in \sigma(\mathcal{H}_{t-1}^{(n)})$.

Condition 5 (Bounded context). $\|C_{t,i}\|_{\max} \le u$ for all i, t, n for some constant u. Also, the minimum eigenvalue of $\underline{\Sigma}_t$ is lower bounded, i.e., $\lambda_{\min}(\underline{\Sigma}_t) > l > 0$.

Definition 2. A conditional clipping constraint with rate f(n) means that the action selection probabilities $\pi_t^{(n)} : \mathbb{R}^d \to [0,1]^K$ satisfy the following:

$$\lim_{n \to \infty} \mathbb{P}\big(\forall\, c \in \mathbb{R}^d,\; \pi_t^{(n)}(c) \in [f(n), 1 - f(n)]^K\big) = 1$$

For each t ∈ [1: T], we have the OLS estimator of $\Delta_{t,xy}$: $\hat{\Delta}_t^{OLS} \triangleq \hat{\beta}_{t,x}^{OLS} - \hat{\beta}_{t,y}^{OLS}$, where $\underline{C}_{t,k} \triangleq \sum_{i=1}^n \mathbb{I}(A_{t,i}^{(n)} = k)\, C_{t,i} C_{t,i}^\top \in \mathbb{R}^{d \times d}$ and $\hat{\beta}_{t,k}^{OLS} = \underline{C}_{t,k}^{-1} \sum_{i=1}^n \mathbb{I}(A_{t,i}^{(n)} = k)\, C_{t,i} R_{t,i}$.
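A sketch of the per-batch contextual computation: per-arm Gram matrices and OLS coefficients, from which the margin estimate between arms x and y follows (names ours):

```python
import numpy as np

def contextual_bols_batch(C_t, A_t, R_t, x, y):
    """Per-batch OLS margin estimate between arms x and y.

    C_t is the (n, d) context matrix for the batch. Returns
    Delta_hat = beta_hat_x - beta_hat_y along with the per-arm Gram
    matrices, which enter the standardization of the estimate.
    """
    def arm_ols(k):
        mask = A_t == k
        gram = C_t[mask].T @ C_t[mask]   # sum of C C^T over arm-k pulls
        beta_hat = np.linalg.solve(gram, C_t[mask].T @ R_t[mask])
        return gram, beta_hat

    gram_x, beta_x = arm_ols(x)
    gram_y, beta_y = arm_ols(y)
    return beta_x - beta_y, gram_x, gram_y

C = np.ones((4, 1))                  # toy d = 1 contexts
A = np.array([0, 0, 1, 1])
R = np.array([0.0, 0.0, 1.0, 1.0])
delta, gram_x, gram_y = contextual_bols_batch(C, A, R, x=1, y=0)
```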

Theorem 4 (Asymptotic Normality of Batched OLS estimator for contextual bandits). Assuming Conditions 1 (moments), 3 (conditionally i.i.d. actions), 4, and 5, and a conditional clipping rate f(n) = c for some $0 \le c < \frac{1}{2}$ (see Definition 2),

$$\begin{bmatrix} [\underline{C}_{1,x}^{-1} + \underline{C}_{1,y}^{-1}]^{-1/2} (\hat{\Delta}_1^{OLS} - \Delta_{1,xy}) \\ [\underline{C}_{2,x}^{-1} + \underline{C}_{2,y}^{-1}]^{-1/2} (\hat{\Delta}_2^{OLS} - \Delta_{2,xy}) \\ \vdots \\ [\underline{C}_{T,x}^{-1} + \underline{C}_{T,y}^{-1}]^{-1/2} (\hat{\Delta}_T^{OLS} - \Delta_{T,xy}) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_{Td}).$$

5.3. Batched OLS Statistic for Non-Stationary Bandits

Many real-world problems for which we would like to use bandit algorithms exhibit non-stationarity over time. For example, in online advertising, the effectiveness of an ad may change over time due to exposure to competing ads and general societal changes that could affect perceptions of an ad. We may believe that the expected reward for a given action varies over time, but that the margin is constant from batch to batch. In the online advertising setting, this would mean that whether one ad is always better than another is stable, but the overall effectiveness of both ads may change over time. In this case, we can simply use the BOLS test statistic described earlier in equation (4) to test H0 : Δ = 0 vs. H1 : Δ ≠ 0. Note that the BOLS test statistic for the margin is robust to non-stationarity in the baseline reward without any adjustment. Moreover, in our simulation settings we estimate the variance σ² separately for each batch, which allows for non-stationarity in the variance between batches as well; see Appendix A for variance estimation details and Section 6 for simulation results. Additionally, in the case that we believe the margin itself may vary from batch to batch, the BOLS test statistic can also be used to construct confidence regions that contain the true margin Δt for each batch simultaneously; see Appendix A.5 for details.

6. Simulation Experiments

Procedure

We focus on the two-arm bandit setting and test whether the margin is zero, specifically H0 : Δ = 0 vs. H1 : Δ ≠ 0. We perform experiments for the case in which the noise variance σ² is estimated. We assume homoscedastic errors throughout. See Appendix A.4 for more details about how we estimate the noise variance and about our experimental setup. In Figures 3 and 4 we display results for stationary bandits, and in Figure 5 we show results for bandits with non-stationary baseline rewards. See Appendix A.5 for results for bandits with non-stationary margins.

Figure 5: Non-stationary baseline reward setting: Type-1 error (upper left) and power (upper right) for a two-sided test of H0 : Δ = 0 vs. H1 : Δ ≠ 0 (α = 0.05). In the lower two plots we plot the expected rewards for each arm; note the margin is constant across batches. We use n = 25 and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002.

In our simulations, we found that OLS and AW-AIPW have inflated Type-1 error. Since Type-1 error control is a hard constraint, solutions with inflated Type-1 error are infeasible. In the power plots, we adjust the cutoffs of the estimators to ensure proper Type-1 error control; if an estimator has inflated Type-1 error under the null, in the power simulations we use a critical value estimated using the simulations under the null. Note that it is infeasible to make these cutoff adjustments in real experiments (unless one found the worst-case setting), as there are many nuisance parameters—like the expected rewards for each arm and the noise variance—that can affect cutoff values.

Results

Figure 3 shows that for small sample sizes (nT ≲ 300), BOLS has more reliable Type-1 error control than AW-AIPW with variance-stabilizing weights. After nT ≥ 500 samples, AW-AIPW has proper Type-1 error, and by Figure 4 it always has slightly greater power than BOLS in the stationary setting. The W-decorrelated estimator has reliable Type-1 error control, but very low power compared to AW-AIPW and BOLS. Finally, the high-probability self-normalized martingale bound of [1], which we use for hypothesis testing, has very low power compared to the asymptotic approximation methods.

Figure 3: Stationary setting: Type-1 error for a two-sided test of H0 : Δ = 0 vs. H1 : Δ ≠ 0 (α = 0.05). We set β1 = β0 = 0, n = 25, and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.001.

Figure 4: Stationary setting: Power for a two-sided test of H0 : Δ = 0 vs. H1 : Δ ≠ 0 (α = 0.05). We set β1 = 0, β0 = 0.25, n = 25, and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002. We account for Type-1 error inflation as described in Section 6.

In Figure 5, we display simulation results for the non-stationary baseline reward setting. Whereas other estimators have no Type-1 error guarantees, BOLS still has proper Type-1 error control in the non-stationary baseline reward setting. Moreover, BOLS can have much greater power than other estimators when there is non-stationarity in the baseline reward. Overall, BOLS is favorable over other estimators in small-sample settings or when one wants to be robust to non-stationarity in the baseline reward—at the cost of losing a little power if the environment is stationary.

7. Discussion

We found that the OLS estimator is asymptotically non-normal when the margin is zero due to the non-concentration of the action selection probabilities. Since the OLS estimator is a canonical example of a method-of-moments estimator [13], our results suggest that the inferential guarantees of standard method-of-moments estimators may fail to hold on adaptively collected data when there is no unique optimal, regret-minimizing policy. We develop the Batched OLS estimator, which is asymptotically normal even when the action selection probabilities do not concentrate. An open question is whether batched versions of general method-of-moments estimators could similarly be used for adaptive inference.

Broader Impact.

Our work has the positive impact of encouraging the use of valid statistical inference methods on bandit data, which ultimately leads to more reliable scientific conclusions. In addition, by providing a valid statistical inference method on bandit data, our work facilitates the use of bandit algorithms in experimentation.

Acknowledgments and Disclosure of Funding

Research reported in this paper was supported by the National Institute on Alcohol Abuse and Alcoholism (NIAAA) of the National Institutes of Health under award number R01AA23187, the National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award number P50DA039838, and the National Cancer Institute (NCI) of the National Institutes of Health under award number U01CA229437. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

A. Simulation Details

A.1. W-Decorrelated Estimator

For the W-decorrelated estimator [6], for a batch size of n and T batches, we set λ to be the $\frac{1}{nT}$ quantile of $\lambda_{\min}(\underline{X}^\top \underline{X}) / \log(nT)$, where $\lambda_{\min}(\underline{X}^\top \underline{X})$ denotes the minimum eigenvalue of $\underline{X}^\top \underline{X}$. This procedure for choosing λ is motivated by the conditions of Theorem 4 of [6] and follows the methods used by [6] in their simulation experiments. We had to adjust the original procedure of [6] for choosing λ (they set λ to the 0.15 quantile of $\lambda_{\min}(\underline{X}^\top \underline{X})$), because they only evaluated the W-decorrelated method when the total number of samples was nT = 1000, and the range of valid values of λ changes with the sample size.
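As a concrete illustration, this tuning procedure can be sketched in Python. The Monte Carlo routine below, including the function name `choose_lambda` and the use of Bernoulli(1/2) action draws as a stand-in for the algorithm's randomization, is our own simplified reading of the procedure (with the quantile level read as 1/(nT)), not the exact code of [6].

```python
import numpy as np

def choose_lambda(n, T, n_mc=500, seed=0):
    """Estimate the 1/(nT) quantile of lambda_min(X'X) / log(nT) by
    Monte Carlo, using Bernoulli(1/2) actions as a stand-in for the
    bandit's action randomization."""
    rng = np.random.default_rng(seed)
    vals = np.empty(n_mc)
    for m in range(n_mc):
        A = rng.binomial(1, 0.5, size=n * T)
        X = np.column_stack([1 - A, A])  # one indicator column per arm
        # eigvalsh returns eigenvalues in ascending order, so [0] is the minimum
        vals[m] = np.linalg.eigvalsh(X.T @ X)[0] / np.log(n * T)
    return float(np.quantile(vals, 1.0 / (n * T)))

lam = choose_lambda(n=25, T=4)
```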

A.2. AW-AIPW Estimator

Since the AW-AIPW test statistic for the treatment effect is not explicitly written in the original paper [11], we now write the formulas for the AW-AIPW estimator of the treatment effect: $\hat{\Delta}^{\text{AW-AIPW}} \triangleq \hat{\beta}_1^{\text{AW-AIPW}} - \hat{\beta}_0^{\text{AW-AIPW}}$. We use the variance-stabilizing weights, equal to the square roots of the sampling probabilities, $\sqrt{\pi_t^{(n)}}$ and $\sqrt{1 - \pi_t^{(n)}}$. Below, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$.

$$Y_{t,i,1} \triangleq \frac{A_{t,i}}{\pi_t^{(n)}} R_{t,i} + \left(1 - \frac{A_{t,i}}{\pi_t^{(n)}}\right) \frac{\sum_{t'=1}^{t-1} \sum_{i=1}^{n} A_{t',i} R_{t',i}}{\sum_{t'=1}^{t-1} N_{t',1}}$$
$$Y_{t,i,0} \triangleq \frac{1 - A_{t,i}}{1 - \pi_t^{(n)}} R_{t,i} + \left(1 - \frac{1 - A_{t,i}}{1 - \pi_t^{(n)}}\right) \frac{\sum_{t'=1}^{t-1} \sum_{i=1}^{n} (1 - A_{t',i}) R_{t',i}}{\sum_{t'=1}^{t-1} N_{t',0}}$$
$$\hat{\beta}_1^{\text{AW-AIPW}} \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\, Y_{t,i,1}}{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}} \quad\text{and}\quad \hat{\beta}_0^{\text{AW-AIPW}} \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}\, Y_{t,i,0}}{\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}}$$

The variance estimator for $\hat{\Delta}^{\text{AW-AIPW}}$ is $\hat{V}_0 + \hat{V}_1 + 2\hat{C}_{0,1}$, where

$$\hat{V}_1 \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \pi_t^{(n)} \big(Y_{t,i,1} - \hat{\beta}_1^{\text{AW-AIPW}}\big)^2}{\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\big)^2} \quad\text{and}\quad \hat{V}_0 \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \big(1 - \pi_t^{(n)}\big) \big(Y_{t,i,0} - \hat{\beta}_0^{\text{AW-AIPW}}\big)^2}{\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}\big)^2}$$
$$\hat{C}_{0,1} \triangleq -\frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)} \big(1 - \pi_t^{(n)}\big)}\, \big(Y_{t,i,1} - \hat{\beta}_1^{\text{AW-AIPW}}\big)\big(Y_{t,i,0} - \hat{\beta}_0^{\text{AW-AIPW}}\big)}{\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\big)\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}\big)}$$
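A small Python sketch of the point estimator may be helpful. The array layout, the function name `aw_aipw_delta`, and the convention of taking the first batch's history mean to be 0 (the formulas above leave it undefined at t = 1) are our own simplifying assumptions, not part of [11].

```python
import numpy as np

def aw_aipw_delta(R, A, pi):
    """AW-AIPW treatment-effect estimate (sketch of the formulas above).
    R, A: (T, n) arrays of rewards and binary actions; pi: length-T array
    of per-batch sampling probabilities."""
    T, n = R.shape
    est = {}
    for arm in (0, 1):
        Aa = A if arm == 1 else 1 - A
        pa = pi if arm == 1 else 1 - pi
        num, den = 0.0, 0.0
        cum_rew, cum_cnt = 0.0, 0.0   # running arm-a reward sum / pull count
        for t in range(T):
            hist_mean = cum_rew / cum_cnt if cum_cnt > 0 else 0.0
            # Pseudo-outcome: IPW term plus augmentation by the history mean.
            Y = Aa[t] / pa[t] * R[t] + (1 - Aa[t] / pa[t]) * hist_mean
            w = np.sqrt(pa[t])        # variance-stabilizing weight
            num += np.sum(w * Y)
            den += n * w
            cum_rew += np.sum(Aa[t] * R[t])
            cum_cnt += np.sum(Aa[t])
        est[arm] = num / den
    return est[1] - est[0]

rng = np.random.default_rng(1)
T, n = 5, 200
pi = np.full(T, 0.5)
A = rng.binomial(1, pi[:, None], size=(T, n))
R = A * 1.0 + rng.normal(size=(T, n))   # true Delta = 1
delta_hat = aw_aipw_delta(R, A, pi)
```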

A.3. Self-Normalized Martingale Bound

By the self-normalized martingale bound of [1], specifically Theorem 1 and Lemma 6, we have that in the two-arm bandit setting,

$$\mathbb{P}\left(\forall T, n \geq 1,\; |\hat{\beta}_1^{\text{OLS}} - \beta_1| \leq c_{1,T} \text{ and } |\hat{\beta}_0^{\text{OLS}} - \beta_0| \leq c_{0,T}\right) \geq 1 - \delta$$

where

$$c_{a,T} = \sqrt{\sigma^2 \frac{1 + \sum_{t=1}^T N_{t,a}}{\big(\sum_{t=1}^T N_{t,a}\big)^2} \left(1 + 2 \log\left(\frac{2\sqrt{1 + \sum_{t=1}^T N_{t,a}}}{\delta}\right)\right)}$$

We estimate σ² using the procedure stated below for the OLS estimator. We reject the null hypothesis that Δ = 0 whenever the confidence bounds for the two arms are non-overlapping, specifically when

$$\hat{\beta}_1^{\text{OLS}} + c_{1,T} \leq \hat{\beta}_0^{\text{OLS}} - c_{0,T} \quad\text{or}\quad \hat{\beta}_0^{\text{OLS}} + c_{0,T} \leq \hat{\beta}_1^{\text{OLS}} - c_{1,T}$$
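In code, the radius and the rejection rule can be sketched as follows; the function names are our own, and σ² is passed in directly rather than estimated.

```python
import numpy as np

def snm_radius(N_a, sigma2, delta):
    """Confidence radius c_{a,T} from the self-normalized martingale
    bound, with N_a the total number of pulls of arm a."""
    S = float(N_a)
    return np.sqrt(sigma2 * (1 + S) / S**2
                   * (1 + 2 * np.log(2 * np.sqrt(1 + S) / delta)))

def snm_reject(beta1_hat, beta0_hat, N1, N0, sigma2, delta=0.05):
    """Reject H0: Delta = 0 iff the two arms' intervals do not overlap."""
    c1 = snm_radius(N1, sigma2, delta)
    c0 = snm_radius(N0, sigma2, delta)
    return (beta1_hat + c1 <= beta0_hat - c0) or (beta0_hat + c0 <= beta1_hat - c1)
```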

A.4. Estimating Noise Variance

OLS Estimator

Given the OLS estimators for the means of each arm, $\hat{\beta}_1^{\text{OLS}}, \hat{\beta}_0^{\text{OLS}}$, we estimate the noise variance σ² as follows:

$$\hat{\sigma}^2 \triangleq \frac{1}{nT - 2} \sum_{t=1}^T \sum_{i=1}^n \left(R_{t,i} - A_{t,i}\hat{\beta}_1^{\text{OLS}} - (1 - A_{t,i})\hat{\beta}_0^{\text{OLS}}\right)^2.$$

We use a degrees-of-freedom bias correction by normalizing by nT − 2 rather than nT. Since the W-decorrelated estimator is a modified version of the OLS estimator, we also use this same noise variance estimator for the W-decorrelated estimator; we found that this worked well in practice in terms of Type-1 error control.

Batched OLS

Given the Batched OLS estimators for the means of each arm for each batch, $\hat{\beta}_{t,1}^{\text{BOLS}}, \hat{\beta}_{t,0}^{\text{BOLS}}$, we estimate the noise variance for each batch $\sigma_t^2$ as follows:

$$\hat{\sigma}_t^2 \triangleq \frac{1}{n - 2} \sum_{i=1}^n \left(R_{t,i} - A_{t,i}\hat{\beta}_{t,1}^{\text{BOLS}} - (1 - A_{t,i})\hat{\beta}_{t,0}^{\text{BOLS}}\right)^2.$$

Again, we use a degrees-of-freedom bias correction by normalizing by n − 2 rather than n. We prove the consistency of $\hat{\sigma}_t^2$ (meaning $\hat{\sigma}_t^2 \stackrel{P}{\to} \sigma^2$) in Corollary 4. To use BOLS to test H₀ : Δ = a vs. H₁ : Δ ≠ a, we use the following test statistic:

$$\frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{N_{t,0} N_{t,1}}{n \hat{\sigma}_t^2}} \left(\hat{\Delta}_t^{\text{BOLS}} - a\right).$$

Above, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$. For this test statistic, we use cutoffs based on the Student-t distribution, i.e., for $Y_t \stackrel{\text{i.i.d.}}{\sim} t_{n-2}$ we use a cutoff $c_{\alpha/2}$ such that

$$\mathbb{P}\left(\left|\frac{1}{\sqrt{T}} \sum_{t=1}^T Y_t\right| > c_{\alpha/2}\right) = \alpha.$$

We found cα/2 by simulating draws from the Student-t distribution.
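The test statistic and the simulated cutoff can be sketched as follows; the function names are ours, and equal-sized batches in a two-arm bandit are assumed.

```python
import numpy as np

def bols_test_stat(R, A, a=0.0):
    """BOLS test statistic for H0: Delta = a, with R, A given as
    (T, n) arrays of rewards and binary actions."""
    T, n = R.shape
    total = 0.0
    for t in range(T):
        N1 = int(A[t].sum()); N0 = n - N1
        b1 = R[t][A[t] == 1].mean()
        b0 = R[t][A[t] == 0].mean()
        resid = R[t] - A[t] * b1 - (1 - A[t]) * b0
        sig2_t = resid @ resid / (n - 2)   # per-batch df-corrected variance
        total += np.sqrt(N0 * N1 / (n * sig2_t)) * ((b1 - b0) - a)
    return total / np.sqrt(T)

def bols_cutoff(T, n, alpha=0.05, n_mc=100_000, seed=0):
    """Monte Carlo cutoff for |T^{-1/2} * sum of T i.i.d. t_{n-2} draws|."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_t(n - 2, size=(n_mc, T)).sum(axis=1) / np.sqrt(T)
    return float(np.quantile(np.abs(Y), 1 - alpha))

cut = bols_cutoff(T=5, n=25)
```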

A.5. Non-Stationary Treatment Effect

When we believe that the margin itself varies from batch to batch, we are able to construct a confidence region that contains the true margin Δt for each batch simultaneously with probability 1 − α.

Corollary 2 (Confidence band for the margin for non-stationary bandits). Assume the same conditions as Theorem 3. Let $z_x$ be the $x^{\text{th}}$ quantile of the standard normal distribution, i.e., for $Z \sim \mathcal{N}(0,1)$, $\mathbb{P}(Z < z_\alpha) = \alpha$. For each t ∈ [1: T], we define the interval

$$L_t = \hat{\Delta}_t^{\text{BOLS}} \pm z_{1 - \frac{\alpha}{2T}} \sqrt{\frac{\sigma^2 n}{N_{t,0} N_{t,1}}}.$$

Then $\lim_{n \to \infty} \mathbb{P}(\forall t \in [1:T], \Delta_t \in L_t) \geq 1 - \alpha$. Above, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$.

Proof: Note that by Corollary 3,

$$\mathbb{P}\left(\exists t \in [1:T] \text{ s.t. } \Delta_t \notin L_t\right) \leq \sum_{t=1}^T \mathbb{P}(\Delta_t \notin L_t) \to \sum_{t=1}^T \frac{\alpha}{T} = \alpha$$

where the limit is as n → ∞. Since

$$\mathbb{P}(\forall t \in [1:T], \Delta_t \in L_t) = 1 - \mathbb{P}\left(\exists t \in [1:T] \text{ s.t. } \Delta_t \notin L_t\right),$$

we conclude that

$$\lim_{n \to \infty} \mathbb{P}(\forall t \in [1:T], \Delta_t \in L_t) \geq 1 - \alpha. \qquad \square$$
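A sketch of the band construction in Python (the function name is ours; σ² is treated as known, matching the corollary's statement):

```python
import numpy as np
from statistics import NormalDist

def bols_confidence_band(R, A, sigma2, alpha=0.05):
    """Per-batch intervals L_t = Delta_hat_t +/- z_{1-alpha/(2T)} *
    sqrt(sigma2 * n / (N_{t,0} N_{t,1})), which cover all Delta_t
    simultaneously with asymptotic probability >= 1 - alpha."""
    T, n = R.shape
    z = NormalDist().inv_cdf(1 - alpha / (2 * T))   # Bonferroni over batches
    band = []
    for t in range(T):
        N1 = int(A[t].sum()); N0 = n - N1
        d_t = R[t][A[t] == 1].mean() - R[t][A[t] == 0].mean()
        half = z * np.sqrt(sigma2 * n / (N0 * N1))
        band.append((d_t - half, d_t + half))
    return band

rng = np.random.default_rng(0)
A = rng.binomial(1, 0.5, size=(4, 50))
R = rng.normal(size=(4, 50))          # all Delta_t = 0
band = bols_confidence_band(R, A, sigma2=1.0)
```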

We can also test the null hypothesis of no margin against the alternative that at least one batch has a non-zero margin, i.e., H₀ : ∀t ∈ [1: T], Δ_t = 0 vs. H₁ : ∃t ∈ [1: T] s.t. Δ_t ≠ 0. Note that the global null stated above is of great interest in the mobile health literature [?, ?]. Specifically, we use the following test statistic:

$$\sum_{t=1}^T \frac{N_{t,0} N_{t,1}}{\sigma^2 n} \left(\hat{\Delta}_t^{\text{BOLS}} - 0\right)^2,$$

which by Theorem 3 converges in distribution to a chi-squared distribution with T degrees of freedom under the null Δt = 0 for all t.

To account for estimating the noise variance σ², in our simulations for this test statistic we use cutoffs based on the Student-t distribution, i.e., for $Y_t \stackrel{\text{i.i.d.}}{\sim} t_{n-2}$ we use a cutoff $c_\alpha$ such that

$$\mathbb{P}\left(\frac{1}{T} \sum_{t=1}^T Y_t^2 > c_\alpha\right) = \alpha.$$

We found cα by simulating draws from the Student-t distribution.
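The cutoff defined by this display can be approximated by simulation, for example:

```python
import numpy as np

def global_null_cutoff(T, n, alpha=0.05, n_mc=100_000, seed=0):
    """Monte Carlo cutoff c_alpha with P((1/T) * sum_t Y_t^2 > c_alpha)
    = alpha for Y_t i.i.d. Student-t with n - 2 degrees of freedom."""
    rng = np.random.default_rng(seed)
    S = (rng.standard_t(n - 2, size=(n_mc, T)) ** 2).mean(axis=1)
    return float(np.quantile(S, 1 - alpha))

c_alpha = global_null_cutoff(T=5, n=25)
```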

In the plots below we call the test statistic in (A.5) "BOLS Non-Stationary Treatment Effect" (BOLS NSTE). BOLS NSTE performs poorly in terms of power compared to the other test statistics in the stationary setting; however, in the non-stationary setting, BOLS NSTE significantly outperforms all other test statistics, which tend to have low power when the average treatment effect is close to zero. Note that the W-decorrelated estimator performs well in the left plot of Figure 8; this is because, as we show in Appendix F, the W-decorrelated estimator upweights samples from the earlier batches of the study. So when the treatment effect is large at the beginning of the study the W-decorrelated estimator has high power, and when the treatment effect is small or zero at the beginning of the study it has low power.

Figure 6:

Stationary Setting: Type-1 error for a two-sided test of H₀ : Δ = 0 vs. H₁ : Δ ≠ 0 (α = 0.05). We set β₁ = β₀ = 0, n = 25, and a clipping constraint of $0.1 \leq \pi_t^{(n)} \leq 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.001.

Figure 7:

Stationary Setting: Power for a two-sided test of H₀ : Δ = 0 vs. H₁ : Δ ≠ 0 (α = 0.05). We set β₁ = 0, β₀ = 0.25, n = 25, and a clipping constraint of $0.1 \leq \pi_t^{(n)} \leq 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002. We account for Type-1 error inflation as described in Section 6.

Figure 8:

Nonstationary Setting: The two upper plots display the power of estimators for a two-sided test of H₀ : ∀t ∈ [1: T], β_{t,1} − β_{t,0} = 0 vs. H₁ : ∃t ∈ [1: T], β_{t,1} − β_{t,0} ≠ 0 (α = 0.05). The two lower plots display two treatment effect trends; the left plot considers a decreasing trend (a quadratic function) and the right plot considers an oscillating trend (a sine function). We set n = 25 and a clipping constraint of $0.1 \leq \pi_t^{(n)} \leq 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002.

B. Asymptotic Normality of the OLS Estimator

Condition 6 (Weak moments). For all t, n, i, $\mathbb{E}[\epsilon_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)}] = \sigma^2$, and for all t, n, i, $\mathbb{E}[\varphi(\epsilon_{t,i}^2) \mid \mathcal{G}_{t-1}^{(n)}] < M < \infty$ a.s. for some function φ with $\lim_{x \to \infty} \frac{\varphi(x)}{x} = \infty$.

Condition 7 (Stability). There exists a sequence of nonrandom positive-definite symmetric matrices $\underline{V}_n$ such that

  1. $\underline{V}_n^{-1} \left(\sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top\right)^{1/2} = \underline{V}_n^{-1} (\underline{X}^\top \underline{X})^{1/2} \stackrel{P}{\to} \underline{I}_p$

  2. $\max_{i \in [1:n], t \in [1:T]} \|\underline{V}_n^{-1} X_{t,i}\|_2 \stackrel{P}{\to} 0$

Theorem 5 (Triangular array version of Lai & Wei (1982), Theorem 3). Let $X_{t,i} \in \mathbb{R}^p$ be non-anticipating with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$, so $X_{t,i}$ is $\mathcal{G}_{t-1}^{(n)}$-measurable. We assume the following conditional mean model for rewards:

$$\mathbb{E}[R_{t,i} \mid \mathcal{G}_{t-1}^{(n)}] = X_{t,i}^\top \beta.$$

We define $\epsilon_{t,i} \triangleq R_{t,i} - X_{t,i}^\top \beta$. Note that $\{\epsilon_{t,i}\}_{i=1,t=1}^{i=n,t=T}$ is a martingale difference array with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$.

Assuming Conditions 6 and 7, as n → ∞,

$$(\underline{X}^\top \underline{X})^{1/2} (\hat{\beta}^{\text{OLS}} - \beta) \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_p)$$

Note that in the body of the paper we state that this theorem holds in the two-arm bandit case assuming Conditions 2 and 1. Condition 1 is sufficient for Condition 6, and Condition 2 is sufficient for Condition 7, in the two-arm bandit case.

Proof:

$$\hat{\beta}^{\text{OLS}} = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top R = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top (\underline{X} \beta + \epsilon)$$
$$\hat{\beta}^{\text{OLS}} - \beta = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top \epsilon = \left(\sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top\right)^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i}$$

It is sufficient to show that as n → ∞:

$$(\underline{X}^\top \underline{X})^{-1/2} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_p)$$

By Slutsky’s Theorem and Condition 7 (a), it is also sufficient to show that as n → ∞,

$$\underline{V}_n^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_p)$$

By the Cramer-Wold device, to show multivariate normality it is sufficient to show that for any fixed $c \in \mathbb{R}^p$ s.t. $\|c\|_2 = 1$, as n → ∞,

$$c^\top \underline{V}_n^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2)$$

We will prove this central limit theorem by using a triangular array martingale central limit theorem, specifically Theorem 2.2 of [8]. We let $Y_{t,i} = c^\top \underline{V}_n^{-1} X_{t,i} \epsilon_{t,i}$. The theorem states that as n → ∞, $\sum_{t=1}^T \sum_{i=1}^n Y_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2)$ if the following conditions hold as n → ∞:

  1. $\sum_{t=1}^T \sum_{i=1}^n \mathbb{E}[Y_{t,i} \mid \mathcal{G}_{t-1}^{(n)}] \stackrel{P}{\to} 0$

  2. $\sum_{t=1}^T \sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)}] \stackrel{P}{\to} \sigma^2$

  3. $\forall\, \delta > 0,\; \sum_{t=1}^T \sum_{i=1}^n \mathbb{E}[Y_{t,i}^2\, \mathbb{I}(|Y_{t,i}| > \delta) \mid \mathcal{G}_{t-1}^{(n)}] \stackrel{P}{\to} 0$

Useful Properties

Note that by Cauchy-Schwarz and Condition 7 (b), as n → ∞,

$$\max_{i \in [1:n], t \in [1:T]} |c^\top \underline{V}_n^{-1} X_{t,i}| \leq \max_{i \in [1:n], t \in [1:T]} \|c\|_2 \|\underline{V}_n^{-1} X_{t,i}\|_2 \stackrel{P}{\to} 0$$

By continuous mapping theorem and since the square function on non-negative inputs is order preserving, as n → ∞,

(maxi[1:n],t[1:T]|cV_n1Xt,i|)2=maxi[1:n],t[1:T](cV_n1Xt,i)2P0 (5)

By Condition 7 (a) and continuous mapping theorem, cV_n1(Xt,iXt,i)1/2Pc, so

cV_n1(Xt,iXt,i)1/2(Xt,iXt,i)1/2V_n1cPcc=1

Thus,

cV_n1(t=1Ti=1nXt,iXt,i)V_n1c=t=1Ti=1ncV_n1Xt,iXt,iV_n1cP1

Since cV_n1Xt,i is a scalar, as n → ∞,

t=1Ti=1n(cV_n1Xt,i)2P1 (6)

Condition (a): Martingale

t=1Ti=1nE[cV_n1Xt,iϵt,iGt1(n)]=t=1Ti=1ncV_n1Xt,iE[ϵt,iGt1(n)]=0

Condition (b): Conditional Variance

t=1Ti=1nE[(cV_n1Xt,i)2ϵt,i2Gt1(n)]=t=1Ti=1n(cV_n1Xt,i)2E[ϵt,i2Gt1(n)]=σ2t=1Ti=1n(cV_n1Xt,i)2Pσ2

where the last equality holds by Condition 6 and the limit holds by (6) as n → ∞.

Condition (c): Lindeberg Condition Let δ > 0. We want to show that as n → ∞,

t=1Ti=1nZt,i2E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]P0

where above, we define Zt,i(n)cV_n1Xt,i. By Condition 6, we have that for all n ≥ 1,

maxt[1:T],i[1:n]E[φ(ϵt,i2)Gt1(n)]<M

Since we assume that $\lim_{x \to \infty} \frac{\varphi(x)}{x} = \infty$, for all m ≥ 1 there exists a $b_m$ s.t. $\varphi(x) \geq mMx$ for all $x \geq b_m$. So, for all n, t, i,

ME[φ(ϵt,i2)Gt1(n)]E[φ(ϵt,i2)I(ϵt,i2bm)Gt1(n)]mME[ϵt,i2I(ϵt,i2bm)Gt1(n)]

Thus,

maxt[1:T],i[1:n]E[ϵt,i2I(ϵt,i2bm)Gt1(n)]1m

So we have that

t=1Ti=1nZt,i2E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]=t=1Ti=1nZt,i2(E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]I(Zt,i2δ2/bm)+E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]I(Zt,i2>δ2/bm))t=1Ti=1nZt,i2(E[ϵt,i2I(ϵt,i2>bm)Gt1(n)]+σ2I(Zt,i2>δ2/bm))(1m+σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm))t=1Ti=1nZt,i2

By Slutsky’s Theorem and (6), it is sufficient to show that as n → ∞,

1m+σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)P0

For any ϵ > 0,

(1m+σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)>ϵ)I(1m>ϵ2)+(σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)>ϵ2)

We can choose m such that 1mϵ2, so (1m>ϵ2)=0. For the second term (note that m is now fixed),

(σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)>ϵ2)(maxt[1:T],j[1:n]Zt,j2>δ2/bm)0

where the last limit holds by (5) as n → ∞. □

B.1. Corollary 1 (Sufficient conditions for Theorem 5)

Under Conditions 1 and 3, when the treatment effect is non-zero, data collected in batches using ϵ-greedy, Thompson Sampling, or UCB with a fixed clipping constraint (see Definition 1) will satisfy the conditions of Theorem 5.

Proof: The only condition of Theorem 5 that needs to be verified is Condition 2. To satisfy Condition 2, it is sufficient to show that for any given Δ, for some constant c ∈ (0, T),

1nt=1Ti=1nAt,i=1nt=1TNt,1Pc.

ϵ-greedy

We assume without loss of generality that Δ > 0 and π1(n)=12. Recall that for ϵ-greedy, for a ∈ [2: T],

πa(n)={1ϵ2 if t1ai=1nAt,iRt,it=1nNt,1>t=1ai=1n(1At,i)Rt,it=1nNt,0ϵ2 otherwise 

Thus to show that πa(n)P1ϵ2 for all a ∈ [2: T], it is sufficient to show that

(t=1ai=1nAt,iRt,it=1aNt,1>t=1ai=1n(1At,i)Rt,it=1aNt,0)1 (7)

To show (7), it is equivalent to show that

(Δ>t=1ai=1n(1At,i)ϵt,it=1aNt,0t=1ai=1nAt,iϵt,it=1aNt,1)1 (8)

To show (8), it is sufficient to show that

t=1ai=1n(1At,i)ϵt,it=1aNt,0t=1ai=1nAt,iϵt,it=1aNt,1P0. (9)

To show (9), it is equivalent to show that

t=1aNt,0t=1aNt,0i=1n(1At,i)ϵt,iNt,0t=1aNt,1t=1aNt,1i=1nAt,iϵt,iNt,1P0. (10)

By Lemma 1, for all t ∈ [1: T],

Nt,1πt(n)nP1

Thus by Slutsky’s Theorem, to show (10), it is sufficient to show that

t=1an(1πt(n))nt=1a(1πt(n))i=1n(1At,i)ϵt,iNt,0t=1anπt(n)nt=1aπt(n)i=1nAt,iϵt,iNt,1P0. (11)

Since πt(n)[ϵ2,1ϵ2] for all t, n, the left hand side of (11) equals the following:

t=1aop(1)i=1n(1At,i)ϵt,iNt,0t=1aop(1)i=1nAt,iϵt,iNt,1P0.

The above limit holds because, by Theorem 3, we have that

(i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,,i=1nAT,iϵT,iNT,1,i=1n(1AT,i)ϵT,iNT,0)DN(0,σ2I_2T). (12)

Thus, by Slutsky’s Theorem and Lemma 1, we have that

1nt=1TNt,1P12+(T1)(1ϵ2)     and     1nt=1TNt,0P12+(T1)ϵ2

Thompson Sampling

We assume without loss of generality that Δ > 0 and π1(n)=12. Recall that for Thompson Sampling with independent standard normal priors (β˜1,β˜0~i.i.d.N(0,1)) for a ∈ [2: T],

πa(n)=πmin[πmax(β˜1>β˜0Ha1(n))]

Given the independent standard normal priors on β˜1, β˜0 we have the following posterior distribution:

β˜1β˜0Ha1(n)~N(t=1a1i=1nAt,iRt,iσ2+t=1a1Nt,1t=1a1i=1n(1At,i)Rt,iσ2+t=1a1Nt,0,σ2(σ2+t=1a1Nt,1)+σ2(σ2+t=1a1Nt,0)(σ2+t=1a1Nt,0)(σ2+t=1a1Nt,1))N(μa1(n),(σa1(n))2)

Thus to show that πa(n)Pπmax for all a ∈ [2: T], it is sufficient to show that μa1(n)PΔ and (σa1(n))2P0 for all a ∈ [2: T]. By Lemma 1, for all t ∈ [1: T],

Nt,1πt(n)nP1

Thus, to show (σa1(n))2P0, it is sufficient to show that

σ2(σ2+nt=1a1πt(n))+σ2(σ2+nt=1a1(1πt(n)))(σ2+nt=1a1(1πt(n)))(σ2+nt=1a1πt(n))P0

The above limit holds because πt(n)[πmin,πmax] for 0 < πminπmax < 1 by the clipping condition.

We now show that μa1(n)PΔ, which is equivalent to showing that the following converges in probability to Δ

t=1a1i=1nAt,iRt,iσ2+t=1a1Nt,1t=1a1i=1n(1At,i)Rt,iσ2+t=1a1Nt,0=t=1a1Nt,1σ2+t=1a1Nt,1t=1a1i=1nAt,iRt,it=1a1Nt,1t=1a1Nt,0σ2+t=1a1Nt,0t=1a1i=1n(1At,i)Rt,it=1a1Nt,0=t=1a1Nt,1σ2+t=1a1Nt,1(β1+t=1a1i=1nAt,iϵt,it=1a1Nt,1)t=1a1Nt,0σ2+t=1a1Nt,0(β0+t=1a1i=1n(1At,i)ϵt,it=1a1Nt,0) (13)

Note that

t=1a1Nt,1σ2+t=1a1Nt,1β1t=1a1Nt,0σ2+t=1a1Nt,0β0PΔ (14)

Equation (14) above holds by Lemma 1, because

nt=1a1πt(n)σ2+nt=1a1πt(n)1P                                        nt=1a1(1πt(n))σ2+nt=1a1(1πt(n))P1 (15)

which hold because πt(n)[πmin,πmax] due to our clipping condition.

By Slutsky’s Theorem and (14), to show (13), it is sufficient to show that

t=1a1i=1nAt,iϵt,iσ2+t=1a1Nt,1t=1a1i=1n(1At,i)ϵt,iσ2+t=1a1Nt,0P0. (16)

Equation (16) is equivalent to the following:

t=1a1Nt,1σ2+t=1a1Nt,1i=1nAt,iϵt,iNt,1t=1a1Nt,0σ2+t=1a1Nt,0i=1n(1At,i)ϵt,iNt,0P0 (17)

By Lemma 1, to show (17) it is sufficient to show that

t=1a1nπt(n)σ2+nt=1a1πt(n)i=1nAt,iϵt,iNt,1t=1a1n(1πt(n))σ2+nt=1a1(1πt(n))i=1n(1At,i)ϵt,iNt,0P0 (18)

Since πt(n)[πmin,πmax] due to our clipping condition, the left hand side of (18) equals the following

t=1a1op(1)i=1nAt,iϵt,iNt,1t=1a1op(1)i=1n(1At,i)ϵt,iNt,0P0

The above limit holds by (12).

Thus, by Slutsky’s Theorem and Lemma 1, we have that

1nt=1TNt,1P12+(T1)πmax     and     1nt=1TNt,0P12+(T1)πmin

UCB

We assume without loss of generality that Δ > 0 and π1(n)=12. Recall that for UCB, for a ∈ [2: T],

πa(n)={πmax if Ua1,1>Ua1,01πmax otherwise 

where we define the upper confidence bounds U for any confidence level δ with 0 < δ < 1 as follows:

Ua1,1={ if t=1a1Nt,1=0t=1a1i=1nAt,iRt,it=1a1Nt,1+2log1/δt=1a1Nt,1 otherwise 
Ua1,0={ if N1,0=0t=1a1i=1n(1At,i)Rt,it=1a1Nt,0+2log1/δt=1a1Nt,0 otherwise 

Thus to show that πa(n)Pπmax for all a ∈ [2: T], it is sufficient to show that I(Ua,1>Ua,0)P1, which is equivalent to showing that the following converges in probability to 1:

I(t=1aNt,1>0,t=1aNt,0>0)I(t=1ai=1nAt,iRt,it=1nNt,1+2log1/δt=1αNt,1>t=1ai=1n(1At,i)Rt,it=1αNt,1+2log1/δt=1αNt,0)+I(t=1aNt,1=0,t=1aNt,0>0)=I((β1β0)+t=1ai=1nAt,iϵt,iΣt=1aNt,1+2log1/δt=1aNt,1>t=1ai=1n(1At,i)ϵt,it=1nNt,1+2log1/δt=1aNt,0)+op(1)

Note that to show that the above converges in probability to 1, it is sufficient to show that the following:

t=1ai=1n(1At,i)ϵt,it=1aNt,1+2log1/δt=1aNt,0t=1ai=1nAt,iϵt,it=1aNt,12log1/δt=1aNt,1P0

Note that for fixed δ, we have that 2log1/δt=1aNt,0P0, since Nt,0n/2P1. Also note that t=1ai=1n(1At,i)ϵt,it=1aNt,1t=1ai=1nAt,iϵt,it=1aNt,1P0, by the same argument made in the ϵ-greedy case to show (9).

Thus, by Slutsky’s Theorem and Lemma 1, we have that

1nt=1TNt,1P12+(T1)πmax     and     1nt=1TNt,0P12+(T1)(1πmax)

C. Non-uniform convergence of the OLS Estimator

Definition 3 (Non-concentration of a sequence of random variables). For a sequence of random variables $\{Y_n\}_{n=1}^{\infty}$ on probability space (Ω, F, ℙ), we say $Y_n$ does not concentrate if for each constant a there exists an $\epsilon_a > 0$ such that

$$\mathbb{P}\left(\{\omega \in \Omega : |Y_n(\omega) - a| > \epsilon_a\}\right) \not\to 0.$$

C.1. Thompson Sampling

Proposition 1 (Non-concentration of sampling probabilities under Thompson Sampling). Under the assumptions of Theorem 2, the posterior probability that arm 1 is better than arm 0 converges as follows:

$$\mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)}) \stackrel{D}{\to} \begin{cases} 1 & \text{if } \Delta > 0 \\ 0 & \text{if } \Delta < 0 \\ \text{Uniform}[0, 1] & \text{if } \Delta = 0 \end{cases}$$

Thus, the sampling probabilities $\pi_t^{(n)}$ do not concentrate when Δ = 0.

Proof: Below, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$. The posterior distributions are:

$$\tilde{\beta}_0 \mid \mathcal{H}_1^{(n)} \sim \mathcal{N}\left(\frac{\sum_{i=1}^n (1 - A_{1,i}) R_{1,i}}{\sigma_a^2 + N_{1,0}},\; \frac{\sigma_a^2}{\sigma_a^2 + N_{1,0}}\right)$$
$$\tilde{\beta}_1 \mid \mathcal{H}_1^{(n)} \sim \mathcal{N}\left(\frac{\sum_{i=1}^n A_{1,i} R_{1,i}}{\sigma_a^2 + N_{1,1}},\; \frac{\sigma_a^2}{\sigma_a^2 + N_{1,1}}\right)$$
$$\tilde{\beta}_1 - \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)} \sim \mathcal{N}(\mu_n, \sigma_n^2)$$

for $\mu_n \triangleq \frac{\sum_{i=1}^n A_{1,i} R_{1,i}}{\sigma_a^2 + N_{1,1}} - \frac{\sum_{i=1}^n (1 - A_{1,i}) R_{1,i}}{\sigma_a^2 + N_{1,0}}$ and $\sigma_n^2 \triangleq \frac{\sigma_a^2(\sigma_a^2 + N_{1,1}) + \sigma_a^2(\sigma_a^2 + N_{1,0})}{(\sigma_a^2 + N_{1,0})(\sigma_a^2 + N_{1,1})}$.

$$\mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)}) = \mathbb{P}(\tilde{\beta}_1 - \tilde{\beta}_0 > 0 \mid \mathcal{H}_1^{(n)}) = \mathbb{P}\left(\frac{(\tilde{\beta}_1 - \tilde{\beta}_0) - \mu_n}{\sigma_n} > -\frac{\mu_n}{\sigma_n} \,\Big|\, \mathcal{H}_1^{(n)}\right)$$

For $Z \sim \mathcal{N}(0,1)$ independent of $\mu_n$, $\sigma_n$, this equals

$$\mathbb{P}\left(Z > -\frac{\mu_n}{\sigma_n} \,\Big|\, \mathcal{H}_1^{(n)}\right) = \mathbb{P}\left(Z < \frac{\mu_n}{\sigma_n} \,\Big|\, \mathcal{H}_1^{(n)}\right) = \Phi\left(\frac{\mu_n}{\sigma_n}\right)$$
μnσn=(i=1nA1,iR1,iσa2+N1,1i=1n(1A1,i)R1,iσa2+N1,0)(σa2+N1,0)(σa2+N1,1)2σa4+σa2n=(β1N1,1+i=1nA1,iϵ1,iσa2+N1,1β0N1,0+i=1n(1A1,i)ϵ1,iσa2+N1,0)(σa2+N1,0)(σa2+N1,1)2σa4+σa2n=i=1nA1,iϵ1,iN1,1N1,1(σa2+N1,0)(2σa4+σa2n)(σa2+N1,1)i=1n(1A1,i)ϵ1,iN1,0N1,0(σa2+N1,1)(2σa4+σa2n)(σa2+N1,0)+(β1N1,1σa2+N1,1β0N1,0σa2+N1,0)(σa2+N1,0)(σa2+N1,1)2σa4+σa2nBn+Cn

Let’s first examine Cn. Note that β1 = β0 + Δ, β1N1,1σa2+N1,1β0N1,0σa2+N1,0 equals

=(β0+Δ)N1,1σa2+N1,1β0N1,0σa2+N1,0=ΔN1,1σa2+N1,1+β0(N1,1σa2+N1,1N1,0σa2+N1,0)
=ΔN1,1/n(σa2+N1,1)/n+β0(N1,1(σa2+N1,0)N1,0(σa2+N1,1)(σa2+N1,1)(σa2+N1,1))
=Δ12+o(1)12+o(1)+β0σa2(N1,1N1,0(σa2+N1,1)(σa2+N1,1))=Δ[1+o(1)]+o(1n)

where the last equality holds by the Strong Law of Large Numbers because

1n2(N1,1N1,0)1n2(σa2+N1,1)(σa2+N1,1)=1n[1212+o(1)][12+o(1)][12+o(1)]=1no(1)14+o(1)=o(1n)

Thus,

Cn=[Δ[1+o(1)]+o(1n)](σa2+N1,0)(σa2+N1,1)2σa4+σa2n=[Δ[1+o(1)]+o(1n)]n[12+o(1)][12+o(1)]o(1)+σa2=nΔ[1/(2σa)+o(1)]+o(1n)

Let’s now examine Bn.

N1,1(σa2+N1,0)(2σa4+σa2n)(σa2+N1,1)=[12+o(1)][12+o(1)][σa2+o(1)][12+o(1)]=12σa2+o(1)
N1,0(σa2+N1,1)(2σa4+σa2n)(σa2+N1,0)=[12+o(1)][12+o(1)][σa2+o(1)][12+o(1)]=12σa2+o(1)

Note that by Theorem 3, $\left[\frac{1}{\sqrt{N_{1,1}}} \sum_{i=1}^n \epsilon_{1,i} A_{1,i},\; \frac{1}{\sqrt{N_{1,0}}} \sum_{i=1}^n \epsilon_{1,i} (1 - A_{1,i})\right] \stackrel{D}{\to} \mathcal{N}(0, \underline{I}_2)$. Thus by Slutsky's Theorem,

[i=1nA1,iϵ1,iN1,1N1,1(σa2+N1,0)(2σa4+σa2n)(σa2+N1,1)i=1n(1A1,i)ϵ1,iN1,0N1,0(σa2+N1,1)(2σa4+σa2n)(σa2+N1,0)]=[i=1nA1,iϵ1,iN1,1[12σa2+o(1)]i=1n(1A1,i)ϵ1,iN1,0[12σa2+o(1)]]DN(0,12σa2I_2)

Thus, we have that $B_n \stackrel{D}{\to} \mathcal{N}(0, \frac{1}{\sigma_a^2})$. Since we assume that the algorithm's variance is correctly specified, i.e., $\sigma_a^2 = 1$,

$$B_n + C_n \stackrel{D}{\to} \begin{cases} \infty & \text{if } \Delta > 0 \\ -\infty & \text{if } \Delta < 0 \\ \mathcal{N}(0, 1) & \text{if } \Delta = 0 \end{cases}$$

Thus, by the continuous mapping theorem,

$$\mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)}) = \Phi\left(\frac{\mu_n}{\sigma_n}\right) = \Phi(B_n + C_n) \stackrel{D}{\to} \begin{cases} 1 & \text{if } \Delta > 0 \\ 0 & \text{if } \Delta < 0 \\ \text{Uniform}[0, 1] & \text{if } \Delta = 0 \end{cases}$$
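This non-concentration is easy to see numerically. The sketch below (our own function name; unit prior variance and unit algorithm variance, matching σ_a² = 1) computes the posterior probability from a single batch of uniformly randomized data:

```python
import numpy as np
from statistics import NormalDist

def ts_posterior_prob(n, delta, seed=None):
    """P(beta1 > beta0 | H_1) for N(0, 1) priors and unit noise variance,
    after one batch of n Bernoulli(1/2) action draws."""
    rng = np.random.default_rng(seed)
    A = rng.binomial(1, 0.5, size=n)
    R = delta * A + rng.normal(size=n)
    N1 = A.sum(); N0 = n - N1
    mu = A @ R / (1 + N1) - (1 - A) @ R / (1 + N0)
    var = 1 / (1 + N1) + 1 / (1 + N0)
    return NormalDist().cdf(mu / np.sqrt(var))

# Under Delta = 0 the posterior probability stays spread out over (0, 1),
# consistent with the Uniform[0,1] limit; under Delta = 1 it piles up near 1.
p_null = np.array([ts_posterior_prob(1000, 0.0, seed=s) for s in range(300)])
p_alt = np.array([ts_posterior_prob(1000, 1.0, seed=s) for s in range(50)])
```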

Proof of Theorem 2 (Non-uniform convergence of the OLS estimator of the treatment effect for Thompson Sampling): The normalized errors of the OLS estimator for Δ, which are asymptotically normal under i.i.d. sampling are as follows:

(N1,1+N2,1)(N1,0+N2,0)2n(β^1OLSβ^0OLSΔ)=(N1,1+N2,1)(N1,0+N2,0)2n(t=12i=1nAt,iRt,iN1,1+N2,1t=12i=1n(1At,i)Rt,iN1,0+N2,0Δ)=(N1,1+N2,1)(N1,0+N2,0)2n((β1β0)Δ+t=12i=1nAt,iϵt,iN1,1+N2,1t=12i=1n(1At,i)ϵt,iN1,0+N2,0)=N1,0+N2,02nt=12i=1nAt,iϵt,iN1,1+N2,0N1,1+N2,12nt=12i=1n(1At,i)ϵt,iN1,0+N2,0=[1,1,1,1][N1,0+N2,02ni=1nA1,iϵ1,iN1,1+N2,1N1,1+N2,12ni=1n(1A1,t)ϵ1,iN1,0+N2,0N1,0+N2,02ni=1nA2,iϵ2,iN1,1+N2,1N1,1+N2,12ni=1n(1A2,1)ϵ2,iN1,0+N2,0]=[1,1,1,1][N1,0+N2,02(N1,1+N2,1)N1,1n1=1nA1,iϵ1,iN1,1N1,1+N2,12(N1,0+N2,0)N1,0nt=1n(1A1,i)ϵ1,iN1,0N1,0+N2,02(N1,1+N2,1)N2,1ni=1nA2,iϵ2,iN2,1N1,1+N2,12(N1,0+N2,0)N2,0nt=1n(1A2,i)ϵ2,iN2,0] (19)

By Theorem 3, (i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,i=1nA2,iϵ2,iN2,1,i=1n(1A2,i)ϵ2,iN2,0)DN(0,I_4). By Lemma 1 and Slutsky’s Theorem, 2n(N1,1+N2,1)N1,1(N1,0+N2,0)12(12+[1π2])2(12+π2)=1+op(1), thus,

N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1=(2n(N1,1+N2,1)N1,1(N1,0+N2,0)12(12+[1π2])2(12+π2)+op(1))N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1=12(12+[1π2])2(12+π2)i=1nA1,iϵ1,iN1,1+op(1)N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1

Note that N1,0+N2,02(N1,1+N2,1) is stochastically bounded because for any K > 2,

(N1,0+N2,02(N1,1+N2,1)>K)(nN1,1>K)=(1K>N1,1n)0

where the limit holds by the law of large numbers since N1,1(n)~Binomial(n,12). Thus, since N1,1n1 and i=1nA1,iϵ1,iN1,1DN(0,1),

op(1)N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1=op(1)

We can perform the above procedure on the other three terms. Thus, equation (19) is equal to the following:

[1,1,1,1][1/2+1π24(1/2+π2)i=1nA1,tϵ1,iN1,11/2+π24(1/2+1π2)i=1n(1A1,i)ϵ1,iN1,0(1/2+1π2)π22(1/2+π2)i=1nA2,iϵ2,iN2,1(1/2+π2)(1π2)2(1/2+1π2)i=1n(1A2,i)ϵ2,iN2,0]+op(1)

Recall that we showed earlier in Proposition 1 that

π2(n)=πmin[πmaxΦ(μnσn)]=πmin[πmaxΦ(Bn+Cn)]=πmin[πmaxΦ(i=1nA1,iϵ1,i2N1,1i=1n(1A1,i)ϵ1,i2N1,0+nΔ[12+o(1)]+o(1))]

When Δ > 0, π2(n)Pπmax and when Δ < 0, π2(n)Pπmin. We now consider the Δ = 0 case.

π2(n)=πmin[πmaxΦ(12[i=1nA1,iϵ1,iN1,1i=1n(1A1,i)ϵ1,iN1,0]+o(1))]=πmin[πmaxΦ(12[i=1nA1,iϵ1,iN1,1i=1n(1A1,i)ϵ1,iN1,0])]+o(1)

By Slutsky’s Theorem, for Z1, Z2, Z3, Z4~i.i.d.N(0,1),

[1,1,1,1][1/2+1π24(1/2+π2)i=1nA1,iϵ1,iN1,11/2+π24(1/2+1π2)l=1n(1A1,i)ϵ1,iN1,0(1/2+1π2)π22(1/2+π2)i=1nA2,tϵ2,tN2,1(1/2+π2)(1π2)2(1/2+1π2)i=1n(1A2,i)ϵ2,iN2,0]+op(1)D[1,1,1,1][1/2+1π*4(1/2+π*)Z11/2+π*4(1/2+1π*)Z2(1/2+1πs)π*2(1/2+π*)Z3(1/2+π*)(1π*)2(1/2+1π*)Z4]=1/2+1π*2(1/2+π*)(1/2Z1+π*Z3)1/2+π*2(1/2+1π*)(1/2Z2+1π*Z4)

where π*={πmax if Δ>0πmin if Δ<0πmin(πmaxΦ[1/2(Z1Z2)]) if Δ=0

C.2. ϵ-Greedy

Proposition 2 (Non-concentration of the sampling probabilities under zero treatment effect for ϵ-greedy). Let T = 2 and $\pi_1^{(n)} = \frac{1}{2}$ for all n. We assume that $\{\epsilon_{t,i}\}_{i=1}^n \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$, and

$$\pi_2^{(n)} = \begin{cases} 1 - \frac{\epsilon}{2} & \text{if } \frac{\sum_{i=1}^n A_{1,i} R_{1,i}}{N_{1,1}} > \frac{\sum_{i=1}^n (1 - A_{1,i}) R_{1,i}}{N_{1,0}} \\ \frac{\epsilon}{2} & \text{otherwise.} \end{cases}$$

Then the sampling probability $\pi_2^{(n)}$ does not concentrate when β₁ = β₀.

Proof: We define MnI(i=1nA1,iR1,iN1,1>i=1n(1A1,i)R1,iN1,0)=I((β1β0)+i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0). Note that when Mn = 1, π2(n)=1ϵ2 and when Mn = 0, π2(n)=ϵ2.

When the margin is zero, Mn does not concentrate because for all N1,1, N1,0, since ϵ1,i~i.i.d.N(0,1),

(i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0)=(1N1,1Z11N1,0Z2>0)=12

for Z1, Z2~i.i.d.N(0,1). Thus, we have shown that π2(n) does not concentrate when β1β0 = 0. □
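A quick simulation illustrates this non-concentration; the function name and simulation sizes below are our own choices:

```python
import numpy as np

def eps_greedy_pi2(n, delta, eps=0.1, seed=None):
    """Second-batch epsilon-greedy probability of arm 1 after one batch
    of n uniformly randomized pulls with treatment effect delta."""
    rng = np.random.default_rng(seed)
    A = rng.binomial(1, 0.5, size=n)
    R = delta * A + rng.normal(size=n)
    arm1_better = R[A == 1].mean() > R[A == 0].mean()
    return 1 - eps / 2 if arm1_better else eps / 2

# Even for large n, pi_2 keeps flipping between eps/2 and 1 - eps/2 when
# delta = 0, but settles at 1 - eps/2 when delta > 0.
pis_null = [eps_greedy_pi2(10_000, 0.0, seed=s) for s in range(200)]
pis_alt = [eps_greedy_pi2(10_000, 1.0, seed=s) for s in range(50)]
```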

Theorem 6 (Non-uniform convergence of the OLS estimator of the treatment effect for ϵ-greedy). Assuming the setup and conditions of Proposition 2, and that β₁ = b, we show that the normalized errors of the OLS estimator converge in distribution as follows:

$$\sqrt{N_{1,1} + N_{2,1}}\,\big(\hat{\beta}_1^{\text{OLS}} - b\big) \stackrel{D}{\to} Y$$

$$Y = \begin{cases} Z_1 & \text{if } \beta_1 - \beta_0 \neq 0 \\ \frac{1}{\sqrt{3 - \epsilon}}\big(Z_1 + \sqrt{2 - \epsilon}\, Z_3\big)\mathbb{I}(Z_1 > Z_2) + \frac{1}{\sqrt{1 + \epsilon}}\big(Z_1 + \sqrt{\epsilon}\, Z_3\big)\mathbb{I}(Z_1 < Z_2) & \text{if } \beta_1 - \beta_0 = 0 \end{cases}$$

for $Z_1, Z_2, Z_3 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Note that in the β₁ − β₀ = 0 case, Y is non-normal.

Proof: The normalized errors of the OLS estimator for β1 are

N1,1+N2,1(t=12i=1nAt,iRt,iN1,1+N2,1b)=t=12i=1nAt,iϵt,iN1,1+N2,1=[1,1][i=1nA1,iϵ1,iN1,1+N2,1i=1nA2,iϵ2,iN1,1+N2,1]=[1,1][N1,1N1,1+N2,1i=1nA1,iϵ1,iN1,1N2,1N1,1+N2,1i=1nA2,iϵ2,iN2,1]

By Slutsky’s Theorem and Lemma 1, (1/21/2+π2(n)N1,1+N2,1N1,1,π2(n)1/2+π2(n)N1,1+N2,1N2,1)P(1,1), so

=[1,1][(1/21/2+π2(n)N1,1+N2,1N1,1+op(1))N1,1N1,1+N2,1i=1nA1,iϵ1,iN1,1(π2(n)1/2+π2(n)N1,1+N2,1N2,1+op(1))N2,1N1,1+N2,1i=1nA2,iϵ2,iN2,1]
=[1,1][1/21/2+π2(n)i=1nA1,iϵ1,iN1,1+op(1)π2(n)1/2+π2(n)i=1nA2,iϵ2,iN2,1+op(1)]

The last equality holds because by Theorem 3, (i=1nA1,iϵ1,iN1,1,i=1nA2,iϵ2,iN2,1)DN(0,I_2).

Let’s define Mn(i=1nA1,iR1,iN1,1>i=1n(1A1,i)R1,iN1,0)=I((β1β0)+i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0).

Note that when Mn = 1, π2(n)=1ϵ2 and when Mn = 0, π2(n)=ϵ2.

Mn=I((β1β0)+i=1nA1,iϵ1,iN1,1>i=1n(1A1,1)ϵ1,iN1,0)=I(N1,0(β1β0)+N1,0N1,1i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0)=I(N1,0(β1β0)+[1+op(1)]i=1nA1,1ϵ1,iN1,1>i=1n(1A1,t)ϵ1,iN1,0)

where the last equality holds because N1,0N1,1P1 by Lemma 1, Slutsky’s Theorem, and continuous mapping theorem. Thus, by Proposition 2,

M(n)P{1 if β1β0>00 if β1β0<0 does not concentrate  if β1β0=0

Note that

[1212+π2(n)t=1nA1,iϵ1,iN1,1+op(1)π2(n)12+π2(n)1=1nA2,iϵ2,iN2,1+op(1)]=[112+1ε21=1nA1,tϵ1,iN1,1+op(1)1ϵ/212+1ϵ21=1nA2,1ϵ2,iN2,1+op(1)]Mn+[1122+ε2i=1nA1,tϵ1,tN1,1+op(1)ϵ212+ϵ2i=1nA2,iϵ2,iN2,1+op(1)](1Mn)

Also note that by Theorem 3, (i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,t=1nA2,1ϵ2,iN2,1,1=1n(1A2,i)ϵ2,iN2,1)DN(0,I_4).

When β1 > β0, MnP1 and when β1 < β0, MnP0; in both these cases the normalized errors are asymptotically normal. We now focus on the case that β1 = β0. By continuous mapping theorem and Slutsky’s theorem for Z1, Z2, Z3, Z4~i.i.d.N(0,1),

=[1,1][1212+1ϵ2i=1nA1,iϵ1,iN1,1+op(1)1ϵ/212+1ϵ2i=1nA2,iϵ2,iN2,1+op(1)]I([1+o(1)]i=1nA1,tϵ1,tN1,1>i=1n(1A1,i)ϵ1,iN1,0)+[1,1][1212+ϵ2i=1nA1,iϵ1,iN1,1+op(1)ϵ212+12i=1nA2,iϵ2,iN2,1+op(1)](1I([1+o(1)]i=1nA1,iϵ1,tN1,1>i=1n(1A1,i)ϵ1,iN1,0))D[1,1][1/21/2+1ϵ/2Z11ϵ/21/2+1ϵ/2Z3]I(Z1>Z2)+[1,1][1/21/2+ϵ/2Z1ϵ/21/2+ϵ/2Z3]I(Z1<Z2)=(13ϵZ1+2ϵ3ϵZ3)I(Z1>Z2)+(11+ϵZ1+ϵ1+ϵZ3)I(Z1<Z2)
+[1,1][1212+ϵ2i=1nA1,iϵ1,iN1,1+op(1)ϵ212+12i=1nA2,iϵ2,iN2,1+op(1)](1I([1+o(1)]i=1nA1,iϵ1,tN1,1>i=1n(1A1,i)ϵ1,iN1,0))D[1,1][1/21/2+1ϵ/2Z11ϵ/21/2+1ϵ/2Z3]I(Z1>Z2)+[1,1][1/21/2+ϵ/2Z1ϵ/21/2+ϵ/2Z3]I(Z1<Z2)

Thus,

t=12i=1nAt,iϵt,iN1,1+N2,1DY
Y{13ϵ(Z12ϵZ3) if β1β0>011+ϵ(Z1ϵZ3) if β1β0<013ϵ(Z12ϵZ3)I(Z1>Z2)+11+ϵ(Z1ϵZ3)I(Z1<Z2) if β1β0=0

C.3. UCB

Theorem 7 (Asymptotic non-Normality under zero treatment effect for clipped UCB). Let T = 2 and π1(n)=12 for all n. We assume that {ϵt,i}i=1n~i.i.d.N(0,1), and

π2(n)={πmax if U1>U01πmax otherwise 

where we define the upper confidence bounds U for any confidence level δ with 0 < δ < 1 as follows:

U1={ if N1,1=0i=1nA1,iR1,iN1,1+2log1/δN1,1 otherwise 
U0={ if N1,0=0i=1n(1A1,i)R1,tN1,1+2log1/δN1,0 otherwise 

Assuming the above conditions, and that β₁ = b, we show that the normalized errors of the OLS estimator converge in distribution as follows:

$$\sqrt{N_{1,1} + N_{2,1}}\,\big(\hat{\beta}_1^{\text{OLS}} - b\big) \stackrel{D}{\to} Y$$

$$Y = \begin{cases} Z_1 & \text{if } \Delta \neq 0 \\ \left(\frac{1/\sqrt{2}}{\sqrt{1/2 + \pi_{\max}}}\, Z_1 + \frac{\sqrt{\pi_{\max}}}{\sqrt{1/2 + \pi_{\max}}}\, Z_3\right)\mathbb{I}(Z_1 > Z_2) + \left(\frac{1/\sqrt{2}}{\sqrt{3/2 - \pi_{\max}}}\, Z_1 + \frac{\sqrt{1 - \pi_{\max}}}{\sqrt{3/2 - \pi_{\max}}}\, Z_3\right)\mathbb{I}(Z_1 < Z_2) & \text{if } \Delta = 0 \end{cases}$$

for $Z_1, Z_2, Z_3 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Note that in the Δ ≜ β₁ − β₀ = 0 case, Y is non-normal.

Proof: The proof is very similar to that of the asymptotic non-normality result for ϵ-greedy. By the same arguments as in the ϵ-greedy case, we have that

N1,1+N2,1(t=12i=1nAt,iRt,iN1,1+N2,1b)=[1,1][1/21/2+π2(n)i=1nA1,iϵ1,iN1,1+op(1)π2(n)1/2+π2(n)i=1nA2,tϵ2,tN2,1+op(1)]

Assuming n ≥ 1, we then define

MnI(U1>U0)=I(N1,1>0,N1,0>0)I(i=1nA1,1R1,tN1,1+2log1/δN1,1>i=1n(1A1,t)R1,iN1,1+2log1/δN1,0)+I(N1,1=0,N1,0>0)=I(N1,1>0,N1,0>0)I((β1β0)+1=1nA1,iϵ1,1N1,1+2log1/δN1,1>1=1n(1A1,1)ϵ1,tN1,1+2log1/δN1,0)+I(N1,1=0,N1,0>0)=I(N1,1>0,N1,0>0)I(N1,0(β1β0)+N1,0N1,1[i=1nA1,iϵ1,iN1,1+2log1/δ]>1=1n(1A1,1)ϵ1,1N1,1+2log1/δ)+I(N1,1=0,N1,0>0)

Note that N1,0N1,1P1 by Lemma 1. Thus by Slutsky’s Theorem and continuous mapping theorem,

=I(N1,0(β1β0)+[1+op(1)]1=1nA1,iϵ1,iN1,1+op(1)>i=1n(1A1,i)ϵ1,iN1,1)+op(1) (20)

Note that

[1212+π2(n)i=1nA1,iϵ1,iN1,1+op(1)π2(n)12+π2(n)1=1nA2,iϵ2,iN2,1+op(1)]=[12i=1nA1,1ϵ1,tN1,1+op(1)12+πmaxi=1nA2,1ϵ2,iN2,1+op(1)]Mn+[112+1πmaxi=1nA1,iϵ1,iN1,1+op(1)1πmax2+1πmaxi=1nA2,iϵ2,iN2,1+op(1)](1Mn)

Let (Z1(n),Z2(n),Z3(n),Z4(n))(i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,i=1nA2,iϵ2,1N2,1,i=1n(1A2,1)ϵ2,iN2,1).

Note that by Theorem 3, (Z1(n),Z2(n),Z3(n),Z4(n))DN(0,I_4).

When β1 > β0, MnP1 and when β1 < β0, MnP0; in both these cases the normalized errors are asymptotically normal. We now focus on the case that β1 = β0. By continuous mapping theorem and Slutsky’s theorem,

=[1,1][12Z1(n)+op(1)12+πmaxπmax12+πmaxZ3(n)+op(1)][I([1+op(1)]Z1(n)+op(1)>Z2(n))+op(1)]+[1,1][112+1πmaxZ1(n)+op(1)1πmax2+1πmaxZ3(n)+op(1)][1I([1+op(1)]Z1(n)+op(1)>Z2(n))+op(1)])D[1,1][12+πmaxZ1πmax2+πmaxZ3]I(Z1>Z2)+[1,1][1212+1πmaxZ11πmax2+1πmaxZ3]I(Z1<Z2)=(1212+πmaxZ1+πmax12+πmaxZ3)I(Z1>Z2)+(1232πmaxZ1+1πmax32πmaxZ3)I(Z1<Z2)

Note that (20) implies that if β1 = β0, that π2(n) will not concentrate.

D. Asymptotic Normality of the Batched OLS Estimator: Multi-Arm Bandits

Theorem 3 (Asymptotic normality of the Batched OLS estimator for multi-arm bandits). Assuming Conditions 6 (weak moments) and 3 (conditionally i.i.d. actions), and a clipping rate of $f(n) = \omega(1/n)$ (Definition 1),

$$\begin{bmatrix} \begin{bmatrix} N_{1,0} & 0 \\ 0 & N_{1,1} \end{bmatrix}^{1/2} (\hat{\beta}_1^{\text{BOLS}} - \beta_1) \\ \begin{bmatrix} N_{2,0} & 0 \\ 0 & N_{2,1} \end{bmatrix}^{1/2} (\hat{\beta}_2^{\text{BOLS}} - \beta_2) \\ \vdots \\ \begin{bmatrix} N_{T,0} & 0 \\ 0 & N_{T,1} \end{bmatrix}^{1/2} (\hat{\beta}_T^{\text{BOLS}} - \beta_T) \end{bmatrix} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_{2T})$$

where $\beta_t = (\beta_{t,0}, \beta_{t,1})^\top$, $N_{t,1} = \sum_{i=1}^n A_{t,i}$, and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$. Note that in the body of this paper we state Theorem 3 with conditions that are sufficient for the weaker conditions we use here.

Lemma 1. Assuming the conditions of Theorem 3, for any batch t ∈ [1: T],

$$\frac{N_{t,1}}{n \pi_t^{(n)}} = \frac{\sum_{i=1}^n A_{t,i}}{n \pi_t^{(n)}} \stackrel{P}{\to} 1 \quad\text{and}\quad \frac{N_{t,0}}{n (1 - \pi_t^{(n)})} = \frac{\sum_{i=1}^n (1 - A_{t,i})}{n (1 - \pi_t^{(n)})} \stackrel{P}{\to} 1$$

Proof of Lemma 1: To prove that $\frac{N_{t,1}}{n \pi_t^{(n)}} \stackrel{P}{\to} 1$, it is equivalent to show that $\frac{1}{n \pi_t^{(n)}} \sum_{i=1}^n (A_{t,i} - \pi_t^{(n)}) \stackrel{P}{\to} 0$. Let ϵ > 0.

(|1nπt(n)i=1n(At,iπt(n))|>ϵ)=(|1nπt(n)i=1n(At,iπt(n))|[I(πt(n)[f(n),1f(n)])+I(πt(n)[f(n),1f(n)])]>ϵ)(|1nπt(n)i=1n(At,iπt(n))|I(πt(n)[f(n),1f(n)])>ϵ2)+(|1nπt(n)i=1n(At,iπt(n))|I(πt(n)[f(n),1f(n)])>ϵ2)

Since by our clipping assumption, I(πt(n)[f(n),1f(n)])P1, the second probability in the summation above converges to 0 as n → ∞. We will now show that the first probability in the summation above also goes to zero. Note that E[1nπt(n)i=1n(At,iπt(n))]=E[1nπt(n)i=1n(E[At,iHt1(n)]πt(n))]=0. So by Chebychev inequality, for any ϵ > 0,

(|1nπt(n)i=1n(At,iπt(n))|I(πt(n)[f(n),1f(n)])>ϵ)1ϵ2n2E[1(πt(n))2(i=1n(At,iπt(n)))2I(πt(n)[f(n),1f(n)])]1ϵ2n2i=1nj=1nE[1(πt(n))2(At,iπt(n))(At,jπt(n))I(πt(n)[f(n),1f(n)])]=1ϵ2n2i=1nj=1nE[1(πt(n))2I(πt(n)[f(n),1f(n)])E[At,iAt,jπt(n)(At,i+At,j)+(πt(n))2Ht1(n)]]=1ϵ2n2i=1nj=1nE[1(πt(n))2I(πt(n)[f(n),1f(n)])(E[At,iAt,jHt1(n)](πt(n))2)] (21)

Note that if ij, since At,i~i.i.d.Bernoulli(πt(n)), E[At,iAt,jHt1(n)]=E[At,iHt1(n)]E[At,jHt1(n)]=(πt(n))2, so (21) above equals the following

=1ϵ2n2i=1nE[1(πt(n))2I(πi(n)[f(n),1f(n)])(E[At,iHt1(n)](πt(n))2)]
=1ϵ2n2i=1nE[1πt(n)πt(n)I(πi(n)[f(n),1f(n)])]=1ϵ2nE[1πt(n)πt(n)I(πt(n)[f(n),1f(n)])]1ϵ2n1f(n)0

where the limit holds because we assume $f(n) = \omega(1/n)$, so $f(n)\, n \to \infty$. We can make a very similar argument for $\frac{N_{t,0}}{n(1 - \pi_t^{(n)})} \stackrel{P}{\to} 1$. □
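Lemma 1 is easy to check numerically for a fixed sampling probability; the sketch below (our own function name) draws one batch of i.i.d. Bernoulli actions and computes the ratio.

```python
import numpy as np

def count_ratio(n, pi, seed=None):
    """N_{t,1} / (n * pi_t) for one batch of i.i.d. Bernoulli(pi_t) actions.
    By Lemma 1, this ratio converges in probability to 1 as n grows."""
    rng = np.random.default_rng(seed)
    return rng.binomial(1, pi, size=n).mean() / pi

ratios = [count_ratio(100_000, 0.3, seed=s) for s in range(20)]
```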

Proof for Theorem 3 (Asymptotic normality of Batched OLS estimator for multi-arm bandits): For readability, for this proof we drop the (n) superscript on πt(n). Note that

[Nt,000Nt,1]1/2(β^tBOLSβt)=[Nt,000Nt,1]1/2i=1n[1At,iAt,i]ϵt,i.

We want to show that

[[N0,100N1,1]1/2i=1n[1A1,iA1,i][N0,200N1,2]1/2i=1n[1A2,iA2,i]ϵ2,i[Nt,000Nt,1]1/2i=1n[1AT,iAT,i]ϵT,i]=[N0,11/2i=1n(1A1,i)ϵ1,iN1,11/2i=1nA1,iϵ1,iN0,21/2i=1n(1A2,i)ϵ2,iN1,21/2i=1nA2,iϵ2,iNt,01/2i=1n(1AT,i)ϵT,iNt,11/2i=1nAT,iϵT,i]DN(0,σ2I_2T).

By Lemma 1 and Slutsky’s Theorem it is sufficient to show that as n → ∞,

[1n(1π1)i=1n(1A1,i)ϵ1,i1nπ1i=1nA1,iϵ1,i1n(1π2)i=1n(1A2,i)ϵ2,i1nπ2i=1nA2,iϵ2,i1n(1πT)i=1n(1AT,i)ϵT,i1nπTi=1nAT,iϵT,i]=[1n[1π1,100π1,1]1/2i=1n[1A1,iA1,i]ϵ1,i1n[1π2(n)00π2(n)]1/2i=1n[1A2,iA2,i]ϵ2,i1n[1πt(n)00πt(n)]1/2i=1n[1AT,iAT,i]ϵT,i]DN(0,σ2I_2T)

By Cramer-Wold device, it is sufficient to show that for any fixed vector c2Ts.t.c2=1 that as n → ∞,

c[n1/2[1π1,100π1,1]1/2i=1n[1A1,iA1,i]ϵ1,in1/2[1π2(n)00π2(n)]1/2i=1n[1A2,iA2,i]ϵ2,in1/2[1πt(n)00πt(n)]1/2i=1n[1AT,iAT,i]ϵT,i]DN(0,σ2)

Let us break up c so that c=[c1,c2,,cT]2T with ct2 for t ∈ [1: T]. The above is equivalent to

t=1Tn1/2ct[1πt(n)00πt(n)]1/2i=1n[1At,iAt,i]ϵt,iDN(0,σ2)

Let us define Yt,in1/2ct[1πt,i00πt,i]1/2[1At,iAt,i]ϵt,i.

The sequence {Y1,1,Y1,2, …,Y1,n, …,YT,1, YT,2, …,YT,n} is a martingale with respect to sequence of histories {Ht(n)}t=1T, since

E[Yt,iHt1(n)]=n1/2ct[1πt(n)00πt(n)]1/2E[[1At,iAt,i]ϵt,iHt1(n)]=n1/2ct[1πt(n)00πt(n)]1/2E[[(1πt(n))E[ϵt,iHt1(n),At,i=0]πt,iE[ϵt,iHt1(n),At,i=1]]Ht1(n)]=0

for all i ∈ [1: n] and all t ∈ [1: T]. We then apply [8] martingale central limit theorem to Yt,i to show the desired result (see the proof of Theorem 5 in Appendix B for the statement of the martingale CLT conditions).

Condition(a): Martingale Condition The first condition holds because E[Yt,iHt1(n)]=0 for all i ∈ [1: n] and all t ∈ [1: T].

Condition(b): Conditional Variance

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid H_{t-1}^{(n)}] = \sum_{t=1}^T\sum_{i=1}^n \frac{1}{n}\, c_t^\top \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \begin{bmatrix} \mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] & 0 \\ 0 & \mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] \end{bmatrix} \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} c_t$$

Since $\mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=1] = \sigma^2\pi_t^{(n)}$ and $\mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=0] = \sigma^2(1-\pi_t^{(n)})$, the above equals

$$\sum_{t=1}^T\sum_{i=1}^n \frac{1}{n}\, c_t^\top c_t\, \sigma^2 = \sum_{t=1}^T c_t^\top c_t\, \sigma^2 = \sigma^2.$$

Condition(c): Lindeberg Condition Let δ > 0.

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] = \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n c_t^\top \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \begin{bmatrix} \mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] & 0 \\ 0 & \mathbb{E}[A_{t,i}\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] \end{bmatrix} \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} c_t$$

Note that for $c_t = [c_{t,0}, c_{t,1}]^\top$, $\mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] = (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\tfrac{c_{t,0}^2}{1-\pi_t^{(n)}}\epsilon_{t,i}^2 > n\delta^2) \mid H_{t-1}^{(n)}, A_{t,i}=0]$ and $\mathbb{E}[A_{t,i}\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] = \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\tfrac{c_{t,1}^2}{\pi_t^{(n)}}\epsilon_{t,i}^2 > n\delta^2) \mid H_{t-1}^{(n)}, A_{t,i}=1]$. Thus, the above equals

$$\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n c_{t,0}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=0\right] + c_{t,1}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right]$$
$$\le \sum_{t=1}^T \max_{i\in[1:n]}\left\{ c_{t,0}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=0\right] + c_{t,1}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right]\right\}$$

Note that for any t ∈ [1: T] and i ∈ [1: n]

$$\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right] = \mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right]\Big(\mathbb{I}\big(\pi_t^{(n)}\in[f(n),1-f(n)]\big) + \mathbb{I}\big(\pi_t^{(n)}\notin[f(n),1-f(n)]\big)\Big) \le \mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2 f(n)}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right] + \sigma^2\,\mathbb{I}\big(\pi_t^{(n)}\notin[f(n),1-f(n)]\big)$$

The second term converges in probability to zero as n → ∞ by our clipping assumption. We now show that the first term also goes to zero in probability. Since we assume $f(n) = \omega(1/n)$, we have $n f(n) \to \infty$. So, it is sufficient to show that for all t and n,

$$\lim_{m\to\infty} \max_{i\in[1:n]} \mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 > m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] = 0$$

By Condition 6, we have that for all n ≥ 1,

$$\max_{t\in[1:T],\, i\in[1:n]} \mathbb{E}\big[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] < M$$

Since we assume that $\lim_{x\to\infty} \varphi(x)/x = \infty$, for all m there exists a $b_m$ s.t. $\varphi(x) \ge mMx$ for all $x \ge b_m$. So, for all n, t, i,

$$M \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2)\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] \ge mM\,\mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big]$$

Thus,

$$\max_{t\in[1:T],\, i\in[1:n]} \mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] \le \frac{1}{m}$$

We can make a very similar argument that for all t ∈ [1: T], as n → ∞,

$$\max_{i\in[1:n]} \mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=0\right] \xrightarrow{P} 0. \;\square$$
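To make the object being studentized concrete, here is a minimal sketch (ours, not the authors' code) of the per-batch OLS estimator for a two-arm bandit; `A` is the batch's 0/1 action vector and `R` its rewards:

```python
import numpy as np

def bols_batch(A, R):
    """Per-batch OLS for a two-arm bandit: returns the arm-mean estimates
    (beta_t0_hat, beta_t1_hat) and the counts (N_t0, N_t1) used to
    studentize them."""
    A = np.asarray(A)
    R = np.asarray(R, dtype=float)
    N1 = int(A.sum())
    N0 = len(A) - N1
    beta0 = R[A == 0].mean()  # sample mean of arm-0 rewards in this batch
    beta1 = R[A == 1].mean()  # sample mean of arm-1 rewards in this batch
    return (beta0, beta1), (N0, N1)
```

The studentized statistic $\mathrm{diag}(N_{t,0}, N_{t,1})^{1/2}(\hat\beta_t^{BOLS} - \beta_t)$ built from these quantities is what Theorem 3 shows to be asymptotically $\mathcal{N}(0, \sigma^2 I_2)$, independently across batches.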

Corollary 3 (Asymptotic Normality of the Batched OLS Estimator of the Margin; two-arm bandit setting). Assume the same conditions as Theorem 3. For each t ∈ [1: T], we have the BOLS estimator of the margin $\Delta_t$ (the difference in the arm means):

$$\hat\Delta_t^{BOLS} = \frac{\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{N_{t,0}} - \frac{\sum_{i=1}^n A_{t,i}R_{t,i}}{N_{t,1}}$$

We show that as n → ∞,

$$\begin{bmatrix} \sqrt{\frac{N_{1,0}N_{1,1}}{n}}\,(\hat\Delta_1^{BOLS}-\Delta_1) \\ \sqrt{\frac{N_{2,0}N_{2,1}}{n}}\,(\hat\Delta_2^{BOLS}-\Delta_2) \\ \vdots \\ \sqrt{\frac{N_{T,0}N_{T,1}}{n}}\,(\hat\Delta_T^{BOLS}-\Delta_T) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_T)$$

Proof:

$$\sqrt{\frac{N_{t,0}N_{t,1}}{n}}\,(\hat\Delta_t^{BOLS}-\Delta_t) = \sqrt{\frac{N_{t,0}N_{t,1}}{n}}\left(\frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{N_{t,0}} - \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{N_{t,1}}\right) = \left[\sqrt{\tfrac{N_{t,1}}{n}}\;\; -\!\sqrt{\tfrac{N_{t,0}}{n}}\right]\begin{bmatrix} N_{t,0} & 0 \\ 0 & N_{t,1} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix}\epsilon_{t,i}$$

By Slutsky’s Theorem and Lemma 1, it is sufficient to show that as n → ∞,

$$\begin{bmatrix} \frac{1}{\sqrt{n}}\left[\sqrt{\pi_1^{(n)}}\;\; -\!\sqrt{1-\pi_1^{(n)}}\right]\begin{bmatrix} 1-\pi_1^{(n)} & 0 \\ 0 & \pi_1^{(n)} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{1,i} \\ A_{1,i} \end{bmatrix}\epsilon_{1,i} \\ \vdots \\ \frac{1}{\sqrt{n}}\left[\sqrt{\pi_T^{(n)}}\;\; -\!\sqrt{1-\pi_T^{(n)}}\right]\begin{bmatrix} 1-\pi_T^{(n)} & 0 \\ 0 & \pi_T^{(n)} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{T,i} \\ A_{T,i} \end{bmatrix}\epsilon_{T,i} \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_T)$$

By the Cramer-Wold device, it is sufficient to show that for any fixed vector $d \in \mathbb{R}^T$ with $\|d\|_2 = 1$, writing $d = [d_1, d_2, \ldots, d_T]$ with $d_t \in \mathbb{R}$, as n → ∞,

$$\sum_{t=1}^T \frac{d_t}{\sqrt{n}}\left[\sqrt{\pi_t^{(n)}}\;\; -\!\sqrt{1-\pi_t^{(n)}}\right]\begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix}\epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2)$$

Define $Y_{t,i} \triangleq \frac{d_t}{\sqrt{n}}\left[\sqrt{\pi_t^{(n)}}\;\; -\!\sqrt{1-\pi_t^{(n)}}\right]\begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2}\begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix}\epsilon_{t,i}$. The sequence $\{Y_{1,1}, Y_{1,2}, \ldots, Y_{1,n}, \ldots, Y_{T,1}, Y_{T,2}, \ldots, Y_{T,n}\}$ is a martingale difference array with respect to the sequence of histories $\{H_t^{(n)}\}_{t=1}^T$ because for all i ∈ [1: n] and t ∈ [1: T],

$$\mathbb{E}[Y_{t,i} \mid H_{t-1}^{(n)}] = \frac{d_t}{\sqrt{n}}\left[\sqrt{\pi_t^{(n)}}\;\; -\!\sqrt{1-\pi_t^{(n)}}\right]\begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2}\begin{bmatrix} (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i} \mid H_{t-1}^{(n)}, A_{t,i}=0] \\ \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i} \mid H_{t-1}^{(n)}, A_{t,i}=1] \end{bmatrix} = 0$$

We now apply the martingale central limit theorem of [8] to $Y_{t,i}$ to show the desired result. Verifying the martingale CLT conditions is equivalent to what we did in the proof of Theorem 3—the only difference is that we replace $c_t$ in the Theorem 3 proof with $d_t\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]^\top$ in this proof. Even though $c_t$ is a constant vector and $d_t\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]^\top$ is a random vector, the proof still goes through with this adjusted $c_t$ vector, since (i) $d_t\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]^\top$ is measurable with respect to $H_{t-1}^{(n)}$; (ii) $\big\|\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]\big\|_2 = 1$; and (iii) the Lindeberg thresholds $\frac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}$ and $\frac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}$ change only by the bounded factor $d_t^2 \le 1$. □
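A small sketch of the batch margin statistic from Corollary 3 (our illustration, not the authors' code; we take the margin as the arm-1 mean minus the arm-0 mean, a sign convention that does not affect the statistic's limiting distribution):

```python
import numpy as np

def bols_margin_stat(A, R, delta=0.0):
    """Return the studentized margin statistic
    sqrt(N_t0 * N_t1 / n) * (delta_hat - delta) for one batch,
    where delta_hat is the difference of within-batch arm means."""
    A = np.asarray(A)
    R = np.asarray(R, dtype=float)
    n = len(A)
    N1 = A.sum()
    N0 = n - N1
    delta_hat = R[A == 1].mean() - R[A == 0].mean()
    return np.sqrt(N0 * N1 / n) * (delta_hat - delta)
```

Under the corollary, stacking these statistics across batches (each divided by a consistent estimate of σ) yields asymptotically independent standard normals.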

Corollary 4 (Consistency of BOLS Variance Estimator). Assuming Conditions 1 (moments) and 3 (conditionally i.i.d. actions), and a clipping rate of $f(n) = \omega(1/n)$ (Definition 1), for all t ∈ [1: T], as n → ∞,

$$\hat\sigma_t^2 = \frac{1}{n-2}\sum_{i=1}^n \big(R_{t,i} - A_{t,i}\hat\beta_{t,1}^{BOLS} - (1-A_{t,i})\hat\beta_{t,0}^{BOLS}\big)^2 \xrightarrow{P} \sigma^2$$

Proof:

$$\hat\sigma_t^2 = \frac{1}{n-2}\sum_{i=1}^n \big(R_{t,i} - A_{t,i}\hat\beta_{t,1}^{BOLS} - (1-A_{t,i})\hat\beta_{t,0}^{BOLS}\big)^2 = \frac{1}{n-2}\sum_{i=1}^n \left(\epsilon_{t,i} - A_{t,i}\frac{\sum_{j=1}^n A_{t,j}\epsilon_{t,j}}{N_{t,1}} - (1-A_{t,i})\frac{\sum_{j=1}^n (1-A_{t,j})\epsilon_{t,j}}{N_{t,0}}\right)^2$$
$$= \frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 - \frac{\big(\sum_{i=1}^n A_{t,i}\epsilon_{t,i}\big)^2}{(n-2)N_{t,1}} - \frac{\big(\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}\big)^2}{(n-2)N_{t,0}}$$

Note that $\frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 \xrightarrow{P} \sigma^2$ because for all δ > 0,

$$\mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 - \sigma^2\right| > \delta\right) \le \mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 - \frac{n\sigma^2}{n-2}\right| > \delta/2\right) + \mathbb{P}\left(\left|\frac{n\sigma^2}{n-2} - \sigma^2\right| > \delta/2\right) = \mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n (\epsilon_{t,i}^2 - \sigma^2)\right| > \delta/2\right) + \mathbb{P}\left(\frac{2\sigma^2}{n-2} > \delta/2\right)$$

Since the second term above is zero for sufficiently large n, we now focus on the first term. By Chebyshev's inequality,

$$\mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n (\epsilon_{t,i}^2 - \sigma^2)\right| > \delta/2\right) \le \frac{4}{\delta^2(n-2)^2}\,\mathbb{E}\left[\sum_{i=1}^n\sum_{j=1}^n (\epsilon_{t,i}^2 - \sigma^2)(\epsilon_{t,j}^2 - \sigma^2)\right] = \frac{4}{\delta^2(n-2)^2}\,\mathbb{E}\left[\sum_{i=1}^n (\epsilon_{t,i}^2 - \sigma^2)^2\right]$$

where the equality above holds because for $i \ne j$, $\mathbb{E}[(\epsilon_{t,i}^2 - \sigma^2)(\epsilon_{t,j}^2 - \sigma^2)] = \mathbb{E}\big[\mathbb{E}[(\epsilon_{t,i}^2 - \sigma^2)(\epsilon_{t,j}^2 - \sigma^2) \mid H_{t-1}^{(n)}]\big] = \mathbb{E}\big[\mathbb{E}[\epsilon_{t,i}^2 - \sigma^2 \mid H_{t-1}^{(n)}]\,\mathbb{E}[\epsilon_{t,j}^2 - \sigma^2 \mid H_{t-1}^{(n)}]\big] = 0$. By Condition 1, $\mathbb{E}[\epsilon_{t,i}^4 \mid H_{t-1}^{(n)}] < M < \infty$, so the above is bounded by

$$\frac{4}{\delta^2(n-2)^2}\,\mathbb{E}\left[\sum_{i=1}^n \mathbb{E}\big[\epsilon_{t,i}^4 - 2\epsilon_{t,i}^2\sigma^2 + \sigma^4 \mid H_{t-1}^{(n)}\big]\right] \le \frac{4n(M+\sigma^4)}{\delta^2(n-2)^2} \to 0$$

Thus by Slutsky's Theorem it is sufficient to show that $\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)N_{t,1}} + \frac{(\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i})^2}{(n-2)N_{t,0}} \xrightarrow{P} 0$. We will only show that $\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)N_{t,1}} \xrightarrow{P} 0$; the other term converges to zero by a very similar argument.

Note that by Lemma 1, $\frac{N_{t,1}}{n\pi_t^{(n)}} \xrightarrow{P} 1$. Thus, by Slutsky's Theorem it is sufficient to show that $\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)\,n\pi_t^{(n)}} \xrightarrow{P} 0$. Let δ > 0. By Markov's inequality,

$$\mathbb{P}\left(\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)\,n\pi_t^{(n)}} > \delta\right) \le \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\left(\sum_{i=1}^n A_{t,i}\epsilon_{t,i}\right)^2\right] = \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\sum_{j=1}^n\sum_{i=1}^n A_{t,j}A_{t,i}\epsilon_{t,i}\epsilon_{t,j}\right]$$

Since πt(n)Ht1(n),

$$= \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\sum_{j=1}^n\sum_{i=1}^n \mathbb{E}[A_{t,j}A_{t,i}\epsilon_{t,i}\epsilon_{t,j} \mid H_{t-1}^{(n)}]\right]$$

Since for ij, E[At,jAt,iϵt,jϵt,iHt1(n)]=E[At,jϵt,jHt1(n)]E[At,iϵt,iHt1(n)]=0,

$$= \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\sum_{i=1}^n \mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}]\right]$$

Since E[At,iϵt,i2Ht1(n)]=E[ϵt,i2Ht1(n),At,i=1]πt(n)=σ2πt(n),

$$= \mathbb{E}\left[\frac{n\sigma^2\pi_t^{(n)}}{\delta(n-2)n\pi_t^{(n)}}\right] = \frac{\sigma^2}{\delta(n-2)} \to 0. \;\square$$
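In code, the Corollary 4 variance estimator is just the pooled residual sum of squares around the per-arm batch means; a minimal sketch (ours, not from the paper):

```python
import numpy as np

def bols_sigma2(A, R):
    """sigma_hat_t^2 = (1/(n-2)) * sum of squared residuals around the
    per-arm batch means (Corollary 4). A: 0/1 actions, R: rewards."""
    A = np.asarray(A, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(A)
    beta1 = R[A == 1].mean()
    beta0 = R[A == 0].mean()
    resid = R - A * beta1 - (1 - A) * beta0
    return float(resid @ resid) / (n - 2)
```

The divisor n − 2 reflects the two fitted arm means per batch, mirroring the classical degrees-of-freedom correction.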

E. Asymptotic Normality of the Batched OLS Estimator: Contextual Bandits

Theorem 4 (Asymptotic Normality of the Batched OLS Statistic). For a K-armed contextual bandit, for each t ∈ [1: T] we have the BOLS estimator:

$$\hat\beta_t^{BOLS} = \begin{bmatrix} \underline{C}_{t,0} & 0 & \cdots & 0 \\ 0 & \underline{C}_{t,1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \underline{C}_{t,K-1} \end{bmatrix}^{-1}\sum_{i=1}^n \begin{bmatrix} \mathbb{I}_{A_{t,i}=0}\,C_{t,i} \\ \mathbb{I}_{A_{t,i}=1}\,C_{t,i} \\ \vdots \\ \mathbb{I}_{A_{t,i}=K-1}\,C_{t,i} \end{bmatrix} R_{t,i} \in \mathbb{R}^{Kd}$$

where $\underline{C}_{t,k} \triangleq \sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\, C_{t,i} C_{t,i}^\top \in \mathbb{R}^{d\times d}$. Assuming Conditions 6 (weak moments), 3 (conditionally i.i.d. actions), 4 (conditionally i.i.d. contexts), and 5 (bounded contexts), and a conditional clipping rate f(n) = c for some $0 < c \le \frac{1}{2}$ (see Definition 2), we show that as n → ∞,

$$\begin{bmatrix} \mathrm{Diagonal}\,[\underline{C}_{1,0}, \underline{C}_{1,1}, \ldots, \underline{C}_{1,K-1}]^{1/2}\,(\hat\beta_1^{BOLS}-\beta_1) \\ \mathrm{Diagonal}\,[\underline{C}_{2,0}, \underline{C}_{2,1}, \ldots, \underline{C}_{2,K-1}]^{1/2}\,(\hat\beta_2^{BOLS}-\beta_2) \\ \vdots \\ \mathrm{Diagonal}\,[\underline{C}_{T,0}, \underline{C}_{T,1}, \ldots, \underline{C}_{T,K-1}]^{1/2}\,(\hat\beta_T^{BOLS}-\beta_T) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd})$$
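In code, the Theorem 4 estimator amounts to a separate OLS regression of rewards on contexts for each arm within the batch; the following is a minimal sketch (our names and array shapes, not from the paper's code):

```python
import numpy as np

def bols_contextual(A, C, R, K):
    """Per-batch BOLS for a K-armed contextual bandit.
    A: (n,) actions in {0,...,K-1}; C: (n, d) contexts; R: (n,) rewards.
    Returns a (K, d) array whose k-th row is the OLS fit of rewards on
    contexts using only the samples where arm k was played."""
    betas = []
    for k in range(K):
        mask = A == k
        Ck, Rk = C[mask], R[mask]
        gram = Ck.T @ Ck               # the block \underline{C}_{t,k}
        betas.append(np.linalg.solve(gram, Ck.T @ Rk))
    return np.vstack(betas)
```

The block-diagonal structure of the Gram matrix in the theorem statement is what makes this per-arm decomposition exact.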

Lemma 2. Assuming the conditions of Theorem 4, for any batch t ∈ [1: T] and any arm k ∈ [0: K − 1], as n → ∞,

$$\left[\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\, C_{t,i}C_{t,i}^\top\right]\left[n\underline{Z}_{t,k}P_{t,k}\right]^{-1} \xrightarrow{P} I_d \quad (22)$$
$$\left[\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\, C_{t,i}C_{t,i}^\top\right]^{1/2}\left[n\underline{Z}_{t,k}P_{t,k}\right]^{-1/2} \xrightarrow{P} I_d \quad (23)$$

where $P_{t,k} \triangleq \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$ and $\underline{Z}_{t,k} \triangleq \mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k]$.

Proof of Lemma 2: We first show that as n → ∞, $\frac{1}{n}\sum_{i=1}^n (\mathbb{I}_{A_{t,i}=k}\, C_{t,i}C_{t,i}^\top - \underline{Z}_{t,k}P_{t,k}) \xrightarrow{P} \underline{0}$. It is sufficient to show that convergence holds entry-wise, so for any r, s ∈ [0: d − 1], as n → ∞, $\frac{1}{n}\sum_{i=1}^n \big(\mathbb{I}_{A_{t,i}=k}\, [C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big) \xrightarrow{P} 0$. Note that

$$\mathbb{E}\big[\mathbb{I}_{A_{t,i}=k}\, [C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big] = \mathbb{E}\big[\mathbb{E}[[C_{t,i}C_{t,i}^\top]^{(r,s)} \mid H_{t-1}^{(n)}, A_{t,i}=k]\,P_{t,k} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big] = 0$$

By Chebyshev's inequality, for any ϵ > 0,

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\,[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\right| > \epsilon\right) \le \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}\Big[\big(\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big)\big(\mathbb{I}_{A_{t,j}=k}[C_{t,j}C_{t,j}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big)\Big] \quad (24)$$

By conditional independence and the law of iterated expectations (conditioning on $H_{t-1}^{(n)}$), for $i \ne j$, $\mathbb{E}\big[(\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)})(\mathbb{I}_{A_{t,j}=k}[C_{t,j}C_{t,j}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)})\big] = 0$. Thus, (24) above equals the following:

$$= \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\big(\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big)^2\Big] = \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\mathbb{I}_{A_{t,i}=k}\big([C_{t,i}C_{t,i}^\top]^{(r,s)}\big)^2 - 2\,\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)}P_{t,k}\underline{Z}_{t,k}^{(r,s)} + P_{t,k}^2\big(\underline{Z}_{t,k}^{(r,s)}\big)^2\Big]$$
$$= \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\mathbb{I}_{A_{t,i}=k}\big([C_{t,i}C_{t,i}^\top]^{(r,s)}\big)^2 - P_{t,k}^2\big(\underline{Z}_{t,k}^{(r,s)}\big)^2\Big] = \frac{1}{\epsilon^2 n}\,\mathbb{E}\Big[\mathbb{I}_{A_{t,i}=k}\big([C_{t,i}C_{t,i}^\top]^{(r,s)}\big)^2 - P_{t,k}^2\big(\underline{Z}_{t,k}^{(r,s)}\big)^2\Big] \le \frac{2\max(du^2,1)}{\epsilon^2 n} \to 0$$

as n → ∞. The last inequality above holds by Condition 5.

Proving Equation (22):

It is sufficient to show that

$$\frac{2\max(du^2,1)}{\epsilon^2 n}\,\big\|[n\underline{Z}_{t,k}P_{t,k}]^{-1}\big\|_{op} = \frac{2\max(du^2,1)}{\epsilon^2 n^2 P_{t,k}}\,\big\|\underline{Z}_{t,k}^{-1}\big\|_{op} \xrightarrow{P} 0 \quad (25)$$

We define the random variable $M_t^{(n)} = \mathbb{I}\big(\forall c \in \mathbb{R}^d,\; A_t(H_{t-1}^{(n)}, c) \in [f(n), 1-f(n)]^K\big)$, which indicates whether the conditional clipping condition is satisfied. Note that by our conditional clipping assumption, $M_t^{(n)} \xrightarrow{P} 1$ as n → ∞. The left-hand side of (25) is equal to the following:

$$\frac{2\max(du^2,1)}{\epsilon^2 n^2 P_{t,k}}\,\big\|\underline{Z}_{t,k}^{-1}\big\|_{op}\big(M_t^{(n)} + (1-M_t^{(n)})\big) = \frac{2\max(du^2,1)}{\epsilon^2 n^2 P_{t,k}}\,\big\|\underline{Z}_{t,k}^{-1}\big\|_{op}\, M_t^{(n)} + o_p(1) \quad (26)$$

By our conditional clipping condition and Bayes rule we have that for all c ∈ [−u, u]d,

$$\mathbb{P}(C_{t,i}=c \mid A_{t,i}=k, H_{t-1}^{(n)}, M_t^{(n)}=1) = \frac{\mathbb{P}(A_{t,i}=k \mid C_{t,i}=c, H_{t-1}^{(n)}, M_t^{(n)}=1)\,\mathbb{P}(C_{t,i}=c \mid H_{t-1}^{(n)}, M_t^{(n)}=1)}{\mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)}, M_t^{(n)}=1)} \ge \frac{f(n)\,\mathbb{P}(C_{t,i}=c \mid H_{t-1}^{(n)}, M_t^{(n)}=1)}{1}.$$

Thus, we have that

$$\underline{Z}_{t,k}M_t^{(n)} = \mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k]\,M_t^{(n)} = \mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k, M_t^{(n)}=1]\,M_t^{(n)} \succeq f(n)\,\mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, M_t^{(n)}=1]\,M_t^{(n)} = f(n)\,\mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}]\,M_t^{(n)} = f(n)\,\underline{\Sigma}_t^{(n)}M_t^{(n)}.$$

By applying matrix inverses to both sides of the above inequality, we get that

$$\lambda_{\max}\big(\underline{Z}_{t,k}^{-1}M_t^{(n)}\big) \le \frac{1}{f(n)}\,\lambda_{\max}\big((\underline{\Sigma}_t^{(n)})^{-1}\big)M_t^{(n)} \le \frac{1}{l\, f(n)} \quad (27)$$

where the last inequality above holds for a constant l by Condition 5. Recall that $P_{t,k} = \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k}M_t^{(n)} \ge f(n)M_t^{(n)}$. Thus, equation (26) is bounded above by the following:

$$\frac{2\max(du^2,1)}{\epsilon^2 n^2\, l\, f(n)^2} + o_p(1) \xrightarrow{P} 0$$

where the limit above holds because we assume that f(n) = c for some $0 < c \le \frac{1}{2}$. □

Proving Equation (23): By Condition 5, $\|\frac{1}{n}\underline{C}_{t,k}\|_{\max} \le u$ and $\|\underline{Z}_{t,k}P_{t,k}\|_{\max} \le u$, so $\frac{1}{n}\underline{C}_{t,k}$ and $\underline{Z}_{t,k}P_{t,k}$ lie in a compact set, on which any continuous function is uniformly continuous. For any uniformly continuous function $f: \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ and any ϵ > 0, there exists a δ > 0 such that for any matrices $\underline{A}, \underline{B} \in \mathbb{R}^{d\times d}$ in this set, whenever $\|\underline{A}-\underline{B}\|_{op} < \delta$, then $\|f(\underline{A})-f(\underline{B})\|_{op} < \epsilon$. Thus, for any ϵ > 0, there exists some δ > 0 such that

$$\mathbb{P}\left(\left\|\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top - \underline{Z}_{t,k}P_{t,k}\right\|_{op} > \delta\right) \to 0$$

implies

$$\mathbb{P}\left(\left\|f\!\left(\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top\right) - f\big(\underline{Z}_{t,k}P_{t,k}\big)\right\|_{op} > \epsilon\right) \to 0$$

Thus, by letting f be the matrix square-root function,

$$\left(\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top\right)^{1/2} - \big(\underline{Z}_{t,k}P_{t,k}\big)^{1/2} \xrightarrow{P} \underline{0}.$$

We now want to show that for some constant r > 0, $\mathbb{P}\big(\big\|\underline{Z}_{t,k}^{-1}\frac{1}{P_{t,k}}\big\|_{op} > r\big) \to 0$, because this would imply that

$$\left[\left(\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top\right)^{1/2} - \big(\underline{Z}_{t,k}P_{t,k}\big)^{1/2}\right]\big(\underline{Z}_{t,k}P_{t,k}\big)^{-1/2} \xrightarrow{P} \underline{0}.$$

Recall that for Mt(n)=I(cd,At(Ht1(n),c)[f(n),1f(n)]K), representing whether the conditional clipping condition is satisfied,

$$\underline{Z}_{t,k}^{-1} = \underline{Z}_{t,k}^{-1}\big(M_t^{(n)} + (1-M_t^{(n)})\big) = \underline{Z}_{t,k}^{-1}M_t^{(n)} + o_p(1).$$

Thus it is sufficient to show that $\mathbb{P}\big(\big\|\underline{Z}_{t,k}^{-1}\frac{1}{P_{t,k}}M_t^{(n)}\big\|_{op} > r\big) \to 0$. Recall that by equation (27) we have that

$$\lambda_{\max}\big(\underline{Z}_{t,k}^{-1}M_t^{(n)}\big) \le \frac{1}{f(n)}\,\lambda_{\max}\big((\underline{\Sigma}_t^{(n)})^{-1}\big)M_t^{(n)} \le \frac{1}{l\, f(n)}$$

Also note that $P_{t,k} = \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k}M_t^{(n)} \ge f(n)M_t^{(n)}$. Thus we have that

$$\mathbb{P}\left(\left\|\underline{Z}_{t,k}^{-1}\frac{1}{P_{t,k}}M_t^{(n)}\right\|_{op} > r\right) \le \mathbb{I}\left(\frac{1}{l\, f(n)^2} > r\right) = 0$$

for $r > \frac{1}{l\, f(n)^2} = \frac{1}{l c^2}$, since we assume that f(n) = c for some $0 < c \le \frac{1}{2}$. □

Proof of Theorem 4: We define Pt,k(At,i=kHt1(n)) and Z_t,kE[Ct,iCt,iHt1(n),At,i=k]. We also define

$$D_t^{(n)} \triangleq \mathrm{Diagonal}\,[\underline{C}_{t,0}, \underline{C}_{t,1}, \ldots, \underline{C}_{t,K-1}]^{1/2}\,(\hat\beta_t - \beta_t) = \sum_{i=1}^n \begin{bmatrix} \underline{C}_{t,0}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=0} \\ \underline{C}_{t,1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=1} \\ \vdots \\ \underline{C}_{t,K-1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=K-1} \end{bmatrix}\epsilon_{t,i}$$

We want to show that $[D_1^{(n)}, D_2^{(n)}, \ldots, D_T^{(n)}] \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd})$. By Lemma 2 and Slutsky's Theorem, it is sufficient to show that as n → ∞, $[Q_1^{(n)}, Q_2^{(n)}, \ldots, Q_T^{(n)}] \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd})$ for

$$Q_t^{(n)} \triangleq \sum_{i=1}^n \begin{bmatrix} \frac{1}{\sqrt{nP_{t,0}}}\,\underline{Z}_{t,0}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=0} \\ \frac{1}{\sqrt{nP_{t,1}}}\,\underline{Z}_{t,1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=1} \\ \vdots \\ \frac{1}{\sqrt{nP_{t,K-1}}}\,\underline{Z}_{t,K-1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=K-1} \end{bmatrix}\epsilon_{t,i}$$

By the Cramer-Wold device, it is sufficient to show that for any $b \in \mathbb{R}^{TKd}$ with $\|b\|_2 = 1$, where b = [b₁, b₂, …, b_T] for $b_t \in \mathbb{R}^{Kd}$, as n → ∞,

t=1TbtQt(n)DN(0,σ2) (28)

We can further decompose each $b_t = [b_{t,0}, b_{t,1}, \ldots, b_{t,K-1}]$ with $b_{t,k} \in \mathbb{R}^d$. Thus showing (28) is equivalent to showing that

$$\sum_{t=1}^T\sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\underline{Z}_{t,k}^{-1/2}\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\,C_{t,i}\,\epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2)$$

We define $Y_{t,i}^{(n)} \triangleq \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\mathbb{I}_{A_{t,i}=k}\,\underline{Z}_{t,k}^{-1/2}\,C_{t,i}\,\epsilon_{t,i}$. The sequence $Y_{1,1}^{(n)}, Y_{1,2}^{(n)}, \ldots, Y_{1,n}^{(n)}, \ldots, Y_{T,1}^{(n)}, Y_{T,2}^{(n)}, \ldots, Y_{T,n}^{(n)}$ is a martingale difference array with respect to the sequence of histories $\{H_t^{(n)}\}_{t=1}^T$ because $\mathbb{E}[Y_{t,i}^{(n)} \mid H_{t-1}^{(n)}] = \mathbb{E}\big[\mathbb{E}[Y_{t,i}^{(n)} \mid H_{t-1}^{(n)}, A_{t,i}, C_{t,i}] \mid H_{t-1}^{(n)}\big] = 0$ for all i ∈ [1: n] and all t ∈ [1: T]. We then apply the martingale central limit theorem of [8] to $Y_{t,i}^{(n)}$ to show the desired result (see the proof of Theorem 5 in Appendix B for the statement of the martingale CLT conditions). Note that the first condition (a) of the martingale CLT is already satisfied, as we just showed that the $Y_{t,i}^{(n)}$ form a martingale difference array with respect to $H_{t-1}^{(n)}$.

Condition(b): Conditional Variance

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid H_{t-1}^{(n)}] = \sum_{t=1}^T\sum_{i=1}^n \mathbb{E}\left[\left(\sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{I}_{A_{t,i}=k}\,C_{t,i}\,\epsilon_{t,i}\right)^2 \middle|\, H_{t-1}^{(n)}\right] = \sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

By law of iterated expectations (conditioning on Ht1(n), At,i, Ct,i) and Condition 6,

$$= \frac{\sigma^2}{n}\sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{P_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k} = \frac{\sigma^2}{n}\sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{P_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k]\,P_{t,k}\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$
$$= \frac{\sigma^2}{n}\sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} b_{t,k}^\top I_d\, b_{t,k} = \sigma^2\sum_{t=1}^T\sum_{k=0}^{K-1} b_{t,k}^\top b_{t,k} = \sigma^2$$

Condition(c): Lindeberg Condition Let δ > 0.

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2\,\mathbb{I}(|Y_{t,i}|>\delta) \mid H_{t-1}^{(n)}] = \sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

It is sufficient to show that for any t ∈ [1: T] and any k ∈ [0 : K − 1] the following converges in probability to zero:

$$\sum_{i=1}^n \frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

Recall that $Y_{t,i} = \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\mathbb{I}_{A_{t,i}=k}\,\underline{Z}_{t,k}^{-1/2}\,C_{t,i}\,\epsilon_{t,i}$. The above equals

$$\frac{1}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}\!\left[C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2\,\mathbb{I}\!\left(\frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,C_{t,i}C_{t,i}^\top\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

Since $C_{t,i} \in [-u,u]^d$, by the Gershgorin circle theorem we can bound the maximum eigenvalue of $C_{t,i}C_{t,i}^\top$ by some constant a > 0. The above is thus bounded by

$$\frac{a}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\;\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\frac{a}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]$$

We define random variable Mt(n)=I(cd,At(Ht1(n),c)[f(n),1f(n)]K), representing whether the conditional clipping condition is satisfied. Note that by our conditional clipping assumption, Mt(n)P1 as n → ∞.

$$= \frac{a}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\;\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\tfrac{a}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]\big(M_t^{(n)} + (1-M_t^{(n)})\big)$$
$$= \frac{a}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\;\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\tfrac{a}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]M_t^{(n)} + o_p(1) \quad (29)$$

By equation (27), we have that

$$\lambda_{\max}\big(\underline{Z}_{t,k}^{-1}\big)M_t^{(n)} \le \frac{1}{f(n)}\,\lambda_{\max}\big((\underline{\Sigma}_t^{(n)})^{-1}\big) \le \frac{1}{l\, f(n)}$$

Recall that $P_{t,k} = \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k}M_t^{(n)} \ge f(n)M_t^{(n)}$. Thus we have that equation (29) is upper bounded by the following:

$$\frac{a}{n}\sum_{i=1}^n \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\frac{a\, b_{t,k}^\top b_{t,k}}{l\, n f(n)^2}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right] + o_p(1)$$
$$= \frac{a}{n}\sum_{i=1}^n \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \frac{\delta^2 l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right] + o_p(1)$$

It is sufficient to show that

$$\lim_{n\to\infty}\max_{i\in[1:n]} \frac{1}{f(n)}\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \frac{\delta^2 l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right] = 0. \quad (30)$$

By Condition 6, we have that for all n ≥ 1, $\max_{t\in[1:T],\,i\in[1:n]} \mathbb{E}[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=k] < M$. Since we assume that $\lim_{x\to\infty}\varphi(x)/x = \infty$, for all m ≥ 1, there exists a $b_m$ s.t. $\varphi(x) \ge mMx$ for all $x \ge b_m$. So, for all n, t, i,

$$M \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=k\big] \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2)\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k\big] \ge mM\,\mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k\big]$$

Thus, $\max_{i\in[1:n]} \mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k] \le \frac{1}{m}$, so $\lim_{m\to\infty}\max_{i\in[1:n]} \mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k] = 0$. Since by our conditional clipping assumption f(n) = c for some $0 < c \le \frac{1}{2}$, we have $n f(n)^2 \to \infty$. So equation (30) holds. □

Corollary 5 (Asymptotic Normality of the Batched OLS for Margin with Context Statistic). Assume the same conditions as Theorem 4. For any two arms x,y ∈ [0: K − 1] for all t ∈ [1: T], we have the BOLS estimator for Δt,x−y := βt,xβt,y. We show that as n → ∞,

$$\begin{bmatrix} \big[\underline{C}_{1,x}^{-1} + \underline{C}_{1,y}^{-1}\big]^{-1/2}\,(\hat\Delta_{1,x-y}^{BOLS} - \Delta_{1,x-y}) \\ \big[\underline{C}_{2,x}^{-1} + \underline{C}_{2,y}^{-1}\big]^{-1/2}\,(\hat\Delta_{2,x-y}^{BOLS} - \Delta_{2,x-y}) \\ \vdots \\ \big[\underline{C}_{T,x}^{-1} + \underline{C}_{T,y}^{-1}\big]^{-1/2}\,(\hat\Delta_{T,x-y}^{BOLS} - \Delta_{T,x-y}) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{Td})$$

where

$$\hat\Delta_{t,x-y}^{BOLS} = \big[\underline{C}_{t,x}^{-1} + \underline{C}_{t,y}^{-1}\big]^{-1}\left(\underline{C}_{t,y}^{-1}\sum_{i=1}^n A_{t,i}\,C_{t,i}\,R_{t,i} - \underline{C}_{t,x}^{-1}\sum_{i=1}^n (1-A_{t,i})\,C_{t,i}\,R_{t,i}\right).$$

Proof: By the Cramer-Wold device, it is sufficient to show that for any fixed vector $d \in \mathbb{R}^{Td}$ with $\|d\|_2 = 1$, where d = [d₁, d₂, …, d_T] for $d_t \in \mathbb{R}^d$, $\sum_{t=1}^T d_t^\top \big[\underline{C}_{t,x}^{-1} + \underline{C}_{t,y}^{-1}\big]^{-1/2}(\hat\Delta_{t,x-y}^{BOLS} - \Delta_{t,x-y}) \xrightarrow{D} \mathcal{N}(0, \sigma^2)$ as n → ∞.

t=1Tdt[C_t,x1+C_t,y1]1/2(Δ^t,xyBOLS Δt,xy)=t=1Tdt[C_t,x1+C_t,y1]1/2(C_t,y1i=1nAt,iCt,iϵt,iC_t,x1i=1n(1At,i)Ct,iϵt,i)

By Lemma 2, as n → ∞, $\frac{1}{nP_{t,x}}\underline{Z}_{t,x}^{-1}\underline{C}_{t,x} \xrightarrow{P} I_d$ and $\frac{1}{nP_{t,y}}\underline{Z}_{t,y}^{-1}\underline{C}_{t,y} \xrightarrow{P} I_d$, so by Slutsky's Theorem it is sufficient to show that as n → ∞,

t=1Tdt[C_t,x1+C_t,y1]1/2(1nPt,yZ_t,y1i=1nAt,iCt,iϵt,i1nPt,xZ_t,x1i=1n(1At,i)Ct,iϵt,i)DN(0,σ2)

We know that [1Pt,xZ_t,x1+1Pt,yZ_t,y1]1/2[1Pt,xZ_t,x1+1Pt,yZ_t,y1]1/2PI_d.

By Lemma 2 and continuous mapping theorem, nPt,xZ_t,xC_t,x1PI_d and nPt,yZ_t,yC_t,y1PI_d. So by Slutsky’s Theorem,

[1Pt,xZ_t,x1+1Pt,yZ_t,y1]1/2[nC_t,x1+nC_t,y1]1/2PI_d

So, returning to our CLT, by Slutsky’s Theorem, it is sufficient to show that as n → ∞,

t=1Tdt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,yZ_t,y1i=1nAt,iCt,iϵt,it=1Tdt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,xZ_t,x1i=1n(1At,i)Ct,iϵt,iDN(0,σ2)

The above sum equals the following:

=t=1Tdt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,xZ_t,x1/2(1nPt,xZ_t,x1/2i=1nAt,iCt,iϵt,i)t=1Tdt[1nPt,xZ_t,x1+1nPt,xZ_t,y1]1/21nPt,yZ_t,y1/2(1nPt,yZ_t,y1/2i=1n(1At,i)Ct,iϵt,i)

Asymptotic normality holds by the same martingale CLT as we used in the proof of Theorem 4. The only difference is that we adjust our bt,k vector from Theorem 4 to the following:

bt,k:={0 if k{x,y}dt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,xZ_t,x1/2 if k=xdt[1nPt,xZt,x1+1nPt,yZt,y1]1/21nPt,yZ_t,y1/2 if k=y

The proof still goes through with this adjustment because for all k ∈ [0: K − 1], (i) $b_{t,k}$ is measurable with respect to $H_{t-1}^{(n)}$; (ii) $\sum_{t=1}^T\sum_{k=0}^{K-1} b_{t,k}^\top b_{t,k} = \sum_{t=1}^T d_t^\top d_t = 1$; and (iii) the divergence of the threshold $\frac{\delta^2 l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}}$ still holds because $b_{t,k}^\top b_{t,k}$ is bounded above by one. □

F. W-Decorrelated Estimator [6]

To better understand why the W-decorrelated estimator has relatively low power, but is still able to guarantee asymptotic normality, we now investigate the form of the W-decorrelated estimator in the two-arm bandit setting.

F.1. Decorrelation Approach

We now assume we are in the unbatched setting (i.e., a batch size of one), as the W-decorrelated estimator was developed for this setting; however, these results easily translate to the batched setting. We now let n index the total number of samples (previously this was nT) and examine asymptotics as n → ∞. We assume the following model:

Rn=X_nβ+ϵn

where $R_n, \epsilon_n \in \mathbb{R}^n$, $\underline{X}_n \in \mathbb{R}^{n\times p}$, and $\beta \in \mathbb{R}^p$. The W-decorrelated OLS estimator is defined as follows:

β^d=β^OLS+W_n(RnX_nβ^OLS)

With this definition we have that,

$$\hat\beta^d - \beta = \hat\beta^{OLS} + \underline{W}_n(R_n - \underline{X}_n\hat\beta^{OLS}) - \beta = \hat\beta^{OLS} + \underline{W}_n(\underline{X}_n\beta + \epsilon_n) - \underline{W}_n\underline{X}_n\hat\beta^{OLS} - \beta = (I_p - \underline{W}_n\underline{X}_n)(\hat\beta^{OLS} - \beta) + \underline{W}_n\epsilon_n$$

Note that if E[W_nϵn]=E[i=1nWiϵi]=0 (where Wi is the ith column of W_n), then E[(I_pW_nX_n)(β^OLSβ)] would be the bias of the estimator. We assume {ϵi} is a martingale difference sequence w.r.t. filtration {Gi}i=1n. Thus, if we constrain Wi to be Gi1 measurable,

$$\mathbb{E}[\underline{W}_n\epsilon_n] = \mathbb{E}\left[\sum_{i=1}^n W_i\epsilon_i\right] = \sum_{i=1}^n \mathbb{E}\big[\mathbb{E}[W_i\epsilon_i \mid \mathcal{G}_{i-1}]\big] = \sum_{i=1}^n \mathbb{E}\big[W_i\,\mathbb{E}[\epsilon_i \mid \mathcal{G}_{i-1}]\big] = 0$$

Trading off Bias and Variance

While decreasing $\mathbb{E}[(I_p - \underline{W}_n\underline{X}_n)(\hat\beta^{OLS} - \beta)]$ will decrease the bias, making $\underline{W}_n$ larger in norm will increase the variance. So the trade-off between bias and variance can be adjusted with different values of λ in the following optimization problem:

$$\|I_p - \underline{W}_n\underline{X}_n\|_F^2 + \lambda\|\underline{W}_n\|_F^2 = \|I_p - \underline{W}_n\underline{X}_n\|_F^2 + \lambda\,\mathrm{Tr}(\underline{W}_n\underline{W}_n^\top)$$

Optimizing for W_n

The authors propose to optimize for $\underline{W}_n$ in a recursive fashion, so that the ith column $W_i$ only depends on $\{X_j\}_{j\le i} \cup \{\epsilon_j\}_{j\le i-1}$ (so $\sum_{i=1}^n \mathbb{E}[W_i\epsilon_i] = 0$). We let $W_0 = 0$, $X_0 = 0$, and recursively define $\underline{W}_n \triangleq [\underline{W}_{n-1}\; W_n]$, where

$$W_n = \operatorname*{argmin}_{W\in\mathbb{R}^p}\; \big\|I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big\|_F^2 + \lambda\|W\|_2^2$$

where $\underline{W}_{n-1} = [W_1; W_2; \ldots; W_{n-1}] \in \mathbb{R}^{p\times(n-1)}$ and $\underline{X}_{n-1} = [X_1; X_2; \ldots; X_{n-1}] \in \mathbb{R}^{(n-1)\times p}$. Now, let us find the closed-form solution for each step of this minimization:

$$\frac{d}{dW}\Big(\big\|I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big\|_F^2 + \lambda\|W\|_2^2\Big) = -2\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big)X_n + 2\lambda W$$

Since the Hessian is positive definite,

$$\frac{d^2}{dW\,dW^\top}\Big(\big\|I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big\|_F^2 + \lambda\|W\|_2^2\Big) = 2X_n^\top X_n + 2\lambda I_p \succ 0,$$

we can find the minimizing W by setting the first derivative to 0:

$$0 = -2\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big)X_n + 2\lambda W$$
$$\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big)X_n = \lambda W$$
$$\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1}\big)X_n = \lambda W + WX_n^\top X_n = \big(\lambda + \|X_n\|_2^2\big)W$$
$$W^* = \frac{\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1}\big)X_n}{\lambda + \|X_n\|_2^2}$$

Proposition 3 (W-decorrelated estimator and time discounting in the two-arm bandit setting). Suppose we have a 2-arm bandit. Ai is an indicator that equals 1 if arm 1 is chosen for the ith sample, and 0 if arm 0 is chosen. We define Xi[1Ai,Ai]2. We assume the following model of rewards:

Ri=Xiβ+ϵi=Aiβ1+(1Ai)β0+ϵi

We further assume that {ϵi}i=1n are a martingale difference sequence with respect to filtration {Gi}i=1n. We also assume that Xi are non-anticipating with respect to filtration {Gi}i=1n. Note the W-decorrelated estimator:

β^d=β^OLS+W_n(RnX_nβ^OLS)

We show that for W_n=[W1;W2;;Wn]p×n and choice of constant λ,

$$W_i = \begin{bmatrix} \frac{1-A_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{0,i-1}} \\[4pt] \frac{A_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{1,i-1}} \end{bmatrix} \in \mathbb{R}^2$$

Moreover, we show that the W-decorrelated estimator for the mean of arm 1, β1, is as follows:

$$\hat\beta_1^d = \left(1 - \sum_{i=1}^n \frac{A_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{1,i-1}}\right)\hat\beta_1^{OLS} + \sum_{i=1}^n \frac{A_iR_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{1,i-1}}$$

where $\hat\beta_1^{OLS} = \frac{\sum_{i=1}^n A_iR_i}{N_{1,n}}$ for $N_{1,n} = \sum_{i=1}^n A_i$. Since [6] require that λ ≥ 1 for their CLT results to hold, the W-decorrelated estimator down-weights samples drawn later in the study and up-weights earlier samples.

Proof: Recall the formula for Wi,

Wi=(I_pW_i1X_i1)Xiλ+Xi22

We let $W_i = [W_{0,i}, W_{1,i}]^\top$. For notational simplicity, we let $r = \frac{1}{\lambda+1}$. We now solve for the first few columns:

$$W_{1,1} = (1-0)\,rA_1 = rA_1$$
$$W_{1,2} = (1 - W_{1,1}A_1)\,rA_2 = (1-rA_1)\,rA_2$$
$$W_{1,3} = \Big(1-\sum_{i=1}^2 W_{1,i}A_i\Big)rA_3 = \big(1-rA_1-(1-rA_1)rA_2\big)rA_3 = (1-rA_1)(1-rA_2)\,rA_3$$
$$W_{1,4} = \Big(1-\sum_{i=1}^3 W_{1,i}A_i\Big)rA_4 = (1-rA_1)(1-rA_2)(1-rA_3)\,rA_4$$

We have that for arbitrary n,

$$W_{1,n} = \Big(1-\sum_{i=1}^{n-1} W_{1,i}A_i\Big)rA_n = rA_n\prod_{i=1}^{n-1}(1-rA_i) = rA_n(1-r)^{\sum_{i=1}^{n-1}A_i} = rA_n(1-r)^{N_{1,n-1}}$$

By symmetry, we have that

$$W_{0,n} = \Big(1-\sum_{i=1}^{n-1} W_{0,i}(1-A_i)\Big)r(1-A_n) = r(1-A_n)(1-r)^{N_{0,n-1}}$$

Note the W-decorrelated estimator for β1:

$$\hat\beta_1^d = \hat\beta_1^{OLS} + \sum_{i=1}^n A_i\big(R_i - \hat\beta_1^{OLS}\big)\,r(1-r)^{N_{1,i-1}} = \left(1 - \sum_{i=1}^n A_i\,r(1-r)^{N_{1,i-1}}\right)\hat\beta_1^{OLS} + \sum_{i=1}^n A_iR_i\,r(1-r)^{N_{1,i-1}}$$
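The recursion for $\underline{W}_n$ and Proposition 3's closed form can be cross-checked numerically. The sketch below (our code, with λ and the action sequence chosen arbitrarily) builds the columns $W_i$ from the per-step minimizer $W_i = (I_p - \underline{W}_{i-1}\underline{X}_{i-1})X_i/(\lambda + \|X_i\|_2^2)$ and compares the arm-1 weights against $r A_i (1-r)^{N_{1,i-1}}$ with $r = 1/(\lambda+1)$:

```python
import numpy as np

def w_columns(X, lam):
    """Recursively build W_n column by column from the closed-form
    per-step minimizer W_i = (I - W_{i-1} X_{i-1}) X_i / (lam + ||X_i||^2).
    X: (n, p) design matrix; returns W of shape (p, n)."""
    n, p = X.shape
    W = np.zeros((p, n))
    M = np.eye(p)                      # tracks I_p - W_{i-1} X_{i-1}
    for i in range(n):
        x = X[i]
        W[:, i] = M @ x / (lam + x @ x)
        M = M - np.outer(W[:, i], x)   # update to I_p - W_i X_i
    return W

lam = 2.0
r = 1.0 / (lam + 1.0)
A = np.array([1, 0, 1, 1, 0, 1])
X = np.column_stack([1 - A, A]).astype(float)   # X_i = [1 - A_i, A_i]
W = w_columns(X, lam)

# Closed form from Proposition 3 for the arm-1 weights:
N1_prev = np.concatenate([[0], np.cumsum(A)[:-1]])
closed = r * A * (1 - r) ** N1_prev
```

Because each $\|X_i\|_2^2 = 1$ in the two-arm design, the recursion collapses to the geometric down-weighting shown in the proof, and `W[1]` matches `closed` exactly.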

Footnotes

1. Note that the validity of bootstrap methods relies on uniform convergence [35].

2. Assume an analogous moment condition for the contextual bandit case, where $G_t^{(n)}$ is replaced by $F_t^{(n)}$.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Contributor Information

Kelly W. Zhang, Department of Computer Science, Harvard University

Lucas Janson, Departments of Statistics, Harvard University.

Susan A. Murphy, Departments of Statistics and Computer Science, Harvard University

References

  • [1]. Abbasi-Yadkori Yasin, Pál Dávid, and Szepesvári Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [2]. Agarwal Arpit, Agarwal Shivani, Assadi Sepehr, and Khanna Sanjeev. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75, 2017.
  • [3]. Amemiya Takeshi. Advanced Econometrics. Harvard University Press, 1985.
  • [4]. Brannath W, Gutjahr G, and Bauer P. Probabilistic foundation of confirmatory adaptive designs. Journal of the American Statistical Association, 107(498):824–832, 2012.
  • [5]. Deshpande Yash, Javanmard Adel, and Mehrabi Mohammad. Online debiasing for adaptively collected high-dimensional data. arXiv preprint arXiv:1911.01040, 2019.
  • [6]. Deshpande Yash, Mackey Lester, Syrgkanis Vasilis, and Taddy Matt. Accurate inference for adaptive linear models. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1194–1203, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [7]. Druce Katie L, Dixon William G, and McBeth John. Maximizing engagement in mobile health studies: lessons learned and future directions. Rheumatic Disease Clinics, 45(2):159–172, 2019.
  • [8]. Dvoretzky Aryeh. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
  • [9]. Eysenbach Gunther. The law of attrition. Journal of Medical Internet Research, 7(1):e11, 2005.
  • [10]. Gao Zijun, Han Yanjun, Ren Zhimei, and Zhou Zhengqing. Batched multi-armed bandits problem. Conference on Neural Information Processing Systems, 2019.
  • [11]. Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
  • [12]. Han Yanjun, Zhou Zhengqing, Zhou Zhengyuan, Blanchet Jose, Glynn Peter W, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
  • [13]. Hazelton Martin L. Methods of Moments Estimation, pages 816–817. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
  • [14]. Howard Steven R, Ramdas Aaditya, McAuliffe Jon, and Sekhon Jasjeet. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
  • [15]. Imbens Guido W and Rubin Donald B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • [16]. Jamieson Kevin, Malloy Matthew, Nowak Robert, and Bubeck Sébastien. lil'ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
  • [17]. Jamieson Kevin and Nowak Robert. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
  • [18]. Jun Kwang-Sung, Jamieson Kevin G, Nowak Robert D, and Zhu Xiaojin. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148, 2016.
  • [19]. Kasy Maximilian. Uniformity and the delta method. Journal of Econometric Methods, 8(1), 2019.
  • [20]. Kaufmann Emilie and Koolen Wouter. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419, 2018.
  • [21]. Kizilcec René F, Piech Chris, and Schneider Emily. Deconstructing disengagement: analyzing learner subpopulations in massive open online courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, pages 170–179, 2013.
  • [22]. Kizilcec René F, Reich Justin, Yeomans Michael, Dann Christoph, Brunskill Emma, Lopez Glenn, Turkay Selen, Williams Joseph Jay, and Tingley Dustin. Scaling up behavioral science interventions in online education. Proceedings of the National Academy of Sciences, 117(26):14900–14905, 2020.
  • [23]. Klasnja Predrag, Smith Shawna, Seewald Nicholas J, Lee Andy, Hall Kelly, Luers Brook, Hekler Eric B, and Murphy Susan A. Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of heartsteps. Annals of Behavioral Medicine, 53(6):573–582, 2019.
  • [24]. Lai Tze Leung and Wei Ching Zong. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.
  • [25]. Lattimore Tor and Szepesvári Csaba. Bandit Algorithms. Cambridge University Press, 2020.
  • [26]. Li Chang and De Rijke Maarten. Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. arXiv preprint arXiv:1905.12370, 2019.
  • [27]. Li Lihong, Chu Wei, Langford John, and Schapire Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
  • [28]. Liao Peng, Greenewald Kristjan, Klasnja Predrag, and Murphy Susan. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020.
  • [29]. Liu Qing, Proschan Michael A, and Pledger Gordon W. A unified theory of two-stage adaptive designs. Journal of the American Statistical Association, 97(460):1034–1041, 2002.
  • [30]. Luedtke Alexander R and van der Laan Mark J. Parametric-rate inference for one-sided differentiable parameters. Journal of the American Statistical Association, 113(522):780–788, 2018.
  • [31]. Luedtke Alexander R and Van Der Laan Mark J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2):713, 2016.
  • [32]. Nie Xinkun, Tian Xiaoying, Taylor Jonathan, and Zou James. Why adaptively collected data have negative bias and how to correct for it. International Conference on Artificial Intelligence and Statistics, 2018.
  • [33]. Perchet Vianney, Rigollet Philippe, Chassang Sylvain, Snowberg Erik, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
  • [34]. Rafferty Anna, Ying Huiji, and Williams Joseph. Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. JEDM: Journal of Educational Data Mining, 11(1):47–79, 2019.
  • [35]. Romano Joseph P, Shaikh Azeem M, et al. On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6):2798–2822, 2012.
  • [36]. Schwartz Eric M, Bradlow Eric T, and Fader Peter S. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
  • [37]. Shin Jaehyeok, Ramdas Aaditya, and Rinaldo Alessandro. Are sample means in multi-armed bandits positively or negatively biased? In Advances in Neural Information Processing Systems, pages 7100–7109, 2019.
  • [38]. Tang Liang, Jiang Yexi, Li Lei, and Li Tao. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80, 2014.
  • [39]. Villar Sofía S, Bowden Jack, and Wason James. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
  • [40]. Wassmer Gernot and Brannath Werner. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Springer, 2016.
  • [41]. Yao Jiayu, Brunskill Emma, Pan Weiwei, Murphy Susan, and Doshi-Velez Finale. Power-constrained bandits. arXiv preprint arXiv:2004.06230, 2020.
  • [42]. Yom-Tov Elad, Feraru Guy, Kozdoba Mark, Mannor Shie, Tennenholtz Moshe, and Hochberg Irit. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of Medical Internet Research, 19(10):e338, 2017.
