Author manuscript; available in PMC: 2022 Jan 6.
Published in final edited form as: Adv Neural Inf Process Syst. 2020 Dec;33:9818–9829.

Inference for Batched Bandits

Kelly W Zhang 1, Lucas Janson 2, Susan A Murphy 3
PMCID: PMC8734616  NIHMSID: NIHMS1641560  PMID: 35002190

Abstract

As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.

1. Introduction

Due to their regret-minimizing guarantees, bandit algorithms have been increasingly used in real-world sequential decision-making problems, like online advertising [27], mobile health [42], and online education [34]. However, for many real-world problems it is not enough to just minimize regret on a particular problem instance. For example, suppose we have run an online education experiment using a bandit algorithm in which we test different types of teaching strategies. When designing a new online course, ideally we could use the data from the previous experiment to inform the design, e.g., under-performing arms could be eliminated or modified. Moreover, to help others designing online courses, we would like to be able to publish our findings about how different teaching strategies compare in their performance. This example demonstrates the need for statistical inference methods for bandit data, which allow practitioners to draw generalizable knowledge from the data they have collected (e.g., how much better one teaching strategy is than another) for the sake of scientific discovery and informed decision making.

In this work we focus on methods to construct confidence intervals for the margin—the difference in expected rewards of two bandit arms—from batched bandit data. Rather than constructing high-probability confidence intervals, we are interested in constructing confidence intervals by using the asymptotic distribution of estimators to approximate their finite-sample distribution. Asymptotic approximation methods for statistical inference have a long history of success in science and lead to much narrower confidence intervals than those constructed using high-probability bounds. Most statistical inference methods based on asymptotic approximation assume that treatments are assigned independently [15]. However, bandit data violates this independence assumption because it is collected adaptively, meaning previous actions and rewards inform future action selections. The non-independence makes statistical inference more challenging; e.g., estimators like the sample mean are often biased on bandit data [32, 37].

Throughout, we focus on the batched bandit setting, in which arms of the bandit are pulled in batches. For our asymptotic analysis we fix the total number of batches, T, and allow the arm pulls in each batch, n, to go to infinity. Note that we do not need or expect n to go to infinity for real-world experiments; we use the asymptotic distribution of estimators to approximate their finite-sample distribution when constructing confidence intervals. We focus on the batched setting because it closely reflects many of the problem settings where bandit algorithms are applied. For example, in many mobile health [42, 23, 28] and online education problems [22, 34] multiple users use apps / take courses simultaneously, so a batch corresponds to the number of unique users the bandit algorithm acts on at once. The batched setting is even common in online recommendations and advertising because it is impractical to update the bandit after every action if many users visit the site simultaneously [38, 36, 12, 26]. In many such experimental settings the length of the study, T, cannot be arbitrarily adjusted, e.g., in online education, courses generally cannot be made arbitrarily long, and clinical trials often run for a standard amount of time that depends on the domain science (e.g. the length of mobile health studies is a function of the scientific community’s belief in how long it should take for users to form a habit). On the other hand, the number of users, n, can in principle grow as large as funding allows.

Additionally, in our batched setting, we assume that the means of the arms can change over time, i.e., from batch to batch, which reflects the temporal non-stationarity that is prevalent in many real world bandit application problems. For example, in online recommendation systems, the click through rate of a given recommendation typically varies over time, e.g., breaking news articles become less popular over time [38, 26]. Online education and mobile health are also highly non-stationary problems because users tend to disengage over time, so the same notification may be much less effective if sent near the end of an experiment than sent near the beginning [9, 21, 7]. Our statistical inference method does not need to assume that the number of stationary time periods in the experiment is large and is robust to temporal non-stationarity from batch to batch.

The first contribution of this work is proving that on bandit data, rather surprisingly, whether standard estimators are asymptotically normal can depend on whether the margin is zero.

We prove that for common bandit algorithms, the arm selection probabilities only concentrate if there is a unique optimal arm. Thus, for two-arm bandits, the arm selection probabilities do not concentrate when the margin—the difference in the expected rewards between the arms—is zero. We show that this leads the ordinary least squares (OLS) estimator to be asymptotically normal when the margin is non-zero, and asymptotically non-normal when the margin is zero. Since the OLS estimator does not converge uniformly (over values of the margin), standard inference methods (normal approximations, the bootstrap) can lead to inflated Type-1 error and unreliable confidence intervals on bandit data.

The second contribution of this work is introducing the Batched OLS (BOLS) estimator, which can be used for reliable inference—even in non-stationary settings—on data collected with batched bandits.

We prove that, regardless of whether the margin is zero or not, the BOLS estimator for the margin for both multi-arm and contextual bandits is asymptotically normal and thus can be used for both hypothesis testing and obtaining confidence intervals. Moreover, BOLS is also automatically robust to non-stationarity in the rewards and can be used for constructing valid confidence intervals even if there is non-stationarity in the baseline reward, i.e., if the rewards of the arms change from batch to batch, but the margin remains constant. If the margin itself is also non-stationary, BOLS can also be used for constructing simultaneous confidence intervals for the margins for each batch.

2. Related Work

Batched Bandits

Much work on batched bandits focuses on minimizing regret [33, 10] or identifying the best arm with high probability [2, 18]. The best arm identification literature utilizes high-probability confidence bounds to construct confidence intervals for bandit parameters; we discuss this method in the next section. Note that in contrast to other batched bandit literature that allows batch sizes to be adjusted adaptively [33], here we do not have adaptive control over the batch sizes. Batched bandits are closely related to multistage adaptive clinical trials, in which between each batch (or stage of the trial) the data collection procedure can be adjusted depending on the outcomes of the previous batches. Our Batched OLS estimator is most closely related to "stage-wise" p-values for group sequential trials that are computed on each stage separately [40]. p-value combination tests are commonly used to combine stage-wise p-values when the sequence of p-values is shown to be independent or p-clud, meaning that under the null each p-value has a Uniform(0,1) distribution conditional on past p-values [40]. [29] formally establish the independence of stage-wise p-values for two-stage trials in which there is a countable number of adaptive rules; note that this rules out bandit algorithms with real-valued arm selection probabilities, like Thompson Sampling. [4] establishes the p-clud property for two-stage adaptive clinical trials under the assumption that, under the null hypothesis, the distribution of the second-stage data is known conditional on the decision rule and the first-stage data. Neither of these methods is sufficient for obtaining independent p-values for adaptive trials (1) with an arbitrary number of stages, (2) where the exact distribution of rewards is unknown, and (3) where the action selection probabilities can be real numbers, as for Thompson Sampling.

High Probability Confidence Intervals

High-probability confidence intervals provide stronger guarantees than those constructed using asymptotic approximations. In particular, these bounds are guaranteed to hold for a finite number of observations and often even hold uniformly over all n and T. These types of bounds are used throughout the bandit and reinforcement learning literature to construct confidence intervals for bandit parameters [14, 20], prove regret bounds [1, 25], and provide guarantees regarding best arm identification [16, 17]. The primary drawback of high-probability confidence intervals is that they are much more conservative than those constructed using asymptotic approximations. This means that many more observations are needed to obtain a confidence interval of the same width, or for a statistical test to have the same power, when using high-probability confidence intervals rather than those constructed using asymptotic approximation. Since the cost of increasing the number of users in a study can be large, being able to construct narrow—yet reliable—confidence intervals is crucial to many applications.

In our simulations we compare our method to high-probability confidence bounds constructed using the self-normalized martingale bound of [1]. This bound is guaranteed to hold on adaptively collected data and is commonly used in the proofs of regret bounds for bandit algorithms. We find that all the approaches based on asymptotic approximations (which we discuss next) significantly outperform the statistical test constructed using the self-normalized martingale bound in terms of power. Moreover, despite the weaker guarantees of statistical inference based on asymptotic approximations, these methods are generally able to provide reliable confidence interval coverage and Type-1 error control.

Adaptive Inference based on Asymptotic Approximations

A common approach in the literature for performing inference on bandit data is to use adaptive weights, which are weights that are a function of the history. An early example of using adaptive weights is that of [31] and [30], who use adaptive weights in estimating the expected reward under the optimal policy when one has access to i.i.d. observational data. They use an Augmented-Inverse-Probability-Weighted estimator with adaptive weights that are a function of the estimated standard deviation of the reward. [30] conjecture that their approach can be adapted to the adaptive sampling case. Subsequently [11] developed the adaptively weighted method for inference on bandit data to produce the Adaptively-Weighted Augmented-Inverse-Probability-Weighted Estimator (AW-AIPW) for data collected via multi-arm bandits. They prove a central limit theorem (CLT) for AW-AIPW when the adaptive weights satisfy certain conditions. Note, however, the AW-AIPW estimator does not have guarantees in non-stationary settings.

Adaptive weights are also used by [6] to form the W-decorrelated estimator, a debiased version of OLS, that is asymptotically normal. In the multi-arm bandit setting, the adaptive weights are a function of the number of times an arm was chosen previously. We found that in the two-arm setting, the W-decorrelated estimator down-weights rewards from later in the study (Appendix F). [5] introduce the Online Debiased Estimator that also has bias guarantees on adaptive data, but in the more challenging high-dimensional linear regression setting. They prove the asymptotic normality of their estimator in the Gaussian autoregressive time series and the two-batch settings. Note that none of these estimation methods have guarantees in non-stationary bandit settings.

[24] provide conditions under which the OLS estimator is asymptotically normal on adaptively collected data. However, as noted in [39, 6, 11], classical inference techniques developed for i.i.d. data often empirically have inflated Type-1 error on bandit data. In Section 4.1, we discuss the restrictive nature of [24]’s CLT conditions.

3. Problem Formulation

Setup and Notation

Though our results generalize to K-arm, contextual bandits (see Section 5.2), we first focus on the two-arm bandit for expositional simplicity. Suppose there are T timesteps or batches in a study. In each batch t ∈ [1: T], we select n binary actions $\{A_{t,i}\}_{i=1}^n \in \{0,1\}^n$. We then observe independent rewards $\{R_{t,i}\}_{i=1}^n$, one for each action selected. Note that the distribution of these random variables changes with the batch size, n. For example, the distribution of the actions one chooses for the 2nd batch, $\{A_{2,i}\}_{i=1}^n$, may change if one has observed n = 10 vs. n = 100 samples $\{A_{1,i}, R_{1,i}\}_{i=1}^n$ in the first batch. For readability, we omit indexing random variables by n, except for the variables $\mathcal{H}_{t-1}^{(n)}$ and $\pi_t^{(n)}$, and filtrations like $\mathcal{G}_t^{(n)}$, to be introduced next.

For each t ∈ [1: T], the bandit selects actions $\{A_{t,i}\}_{i=1}^n \overset{i.i.d.}{\sim} \mathrm{Bernoulli}(\pi_t^{(n)})$ conditional on $\mathcal{H}_{t-1}^{(n)} \triangleq \{A_{t',i}, R_{t',i}\}_{i=1,\,t'=1}^{i=n,\,t'=t-1}$, the history prior to batch t. Note, the action selection probability $\pi_t^{(n)} \triangleq \mathbb{P}(A_{t,i} = 1 \mid \mathcal{H}_{t-1}^{(n)})$ depends on the history $\mathcal{H}_{t-1}^{(n)}$. We assume the following conditional mean for the rewards:

$$\mathbb{E}[R_{t,i} \mid \mathcal{H}_{t-1}^{(n)}, A_{t,i}] = (1 - A_{t,i})\beta_{t,0} + A_{t,i}\beta_{t,1}. \tag{1}$$

Note that although in equation (1) we condition on $\mathcal{H}_{t-1}^{(n)}$, the conditional mean of the reward does not depend on prior rewards or actions. Let $X_{t,i} \triangleq [1 - A_{t,i}, A_{t,i}]^\top \in \mathbb{R}^2$; note $X_{t,i}$ is higher dimensional when we add more arms and/or context variables. We define the errors as $\epsilon_{t,i} \triangleq R_{t,i} - X_{t,i}^\top \beta_t$. Equation (1) implies that $\{\epsilon_{t,i} : i \in [1:n], t \in [1:T]\}$ is a martingale difference array with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$, where $\mathcal{G}_t^{(n)} \triangleq \sigma(\mathcal{H}_{t-1}^{(n)} \cup \{A_{t,i}\}_{i=1}^n)$; thus $\mathbb{E}[\epsilon_{t,i} \mid \mathcal{G}_t^{(n)}] = 0$ for all t, i, n. The parameters $\beta_t = (\beta_{t,0}, \beta_{t,1})$ can change across batches t ∈ [1: T], which allows for non-stationarity between batches. Assuming that $\beta_t = \beta_{t'}$ for all t, t′ ∈ [1: T] recovers the stationary mean case.
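As a concrete illustration of this data-generating model, the following sketch simulates one batch under equation (1) with Gaussian errors. The function name and parameters are ours, not from the paper:

```python
import numpy as np

def generate_batch(rng, n, pi_t, beta_t, sigma=1.0):
    """Simulate one batch of the two-arm bandit model in equation (1).

    Actions are i.i.d. Bernoulli(pi_t) given the history; rewards follow
    R = (1 - A) * beta0 + A * beta1 + eps with E[eps | history, A] = 0.
    """
    A = rng.binomial(1, pi_t, size=n)       # actions A_{t,i}
    eps = rng.normal(0.0, sigma, size=n)    # errors eps_{t,i}
    R = (1 - A) * beta_t[0] + A * beta_t[1] + eps
    return A, R

rng = np.random.default_rng(0)
A, R = generate_batch(rng, n=10_000, pi_t=0.5, beta_t=(0.0, 1.0))
```

Running the bandit for T batches amounts to calling this once per batch, with `pi_t` updated from the accumulated history between calls.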

Action Selection Probability Constraint (Clipping)

In order to perform inference on bandit data it is necessary to guarantee that the bandit algorithm explores sufficiently. For example, the CLTs for both the W-decorrelated [6] and the AW-AIPW [11] estimators have conditions that implicitly require that the bandit algorithms cannot sample any given action with probability that goes to zero or one arbitrarily fast. Greater exploration also increases the power of statistical tests regarding the margin [41]. Moreover, if there is non-stationarity in the margin between batches, it is desirable for the bandit algorithm to continue exploring. We explicitly guarantee exploration by constraining the probability that any given action can be sampled (see Definition 1). We allow the action selection probabilities πt(n) to converge to 0 and/or 1 at some rate.

Definition 1. A clipping constraint with rate f(n) means that $\pi_t^{(n)}$ satisfies the following:

$$\lim_{n \to \infty} \mathbb{P}\big(\pi_t^{(n)} \in [f(n), 1 - f(n)]\big) = 1 \tag{2}$$
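In practice, the clipping constraint amounts to projecting whatever probability the bandit algorithm proposes onto [f(n), 1 − f(n)]. A minimal sketch (the helper name is ours):

```python
import numpy as np

def clip_probability(pi_raw, f_n):
    """Project a proposed action-selection probability onto [f(n), 1 - f(n)]."""
    return float(np.clip(pi_raw, f_n, 1.0 - f_n))
```

For example, Thompson Sampling's posterior probability that arm 1 is optimal would be passed through this projection before actions are sampled.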

4. Asymptotic Distribution of the Ordinary Least Squares Estimator

Suppose we are in the stationary case, and we would like to estimate β. Consider the OLS estimator: $\hat{\beta}^{OLS} = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top R$, where $\underline{X} \triangleq [X_{1,1}, \dots, X_{1,n}, \dots, X_{T,1}, \dots, X_{T,n}]^\top \in \mathbb{R}^{nT \times 2}$ and $R \triangleq [R_{1,1}, \dots, R_{1,n}, \dots, R_{T,1}, \dots, R_{T,n}]^\top \in \mathbb{R}^{nT}$. Note that $\underline{X}^\top \underline{X} = \sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top$.
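For this two-column design, the pooled OLS estimate reduces to the per-arm sample means over all batches. A sketch (function name ours):

```python
import numpy as np

def ols_two_arm(A, R):
    """OLS estimate of (beta0, beta1) with design rows X = [1 - A, A].

    Because the two columns are orthogonal, OLS reduces to the
    per-arm sample means of the pooled rewards.
    """
    X = np.column_stack([1 - A, A]).astype(float)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ R)
    return beta_hat

rng = np.random.default_rng(1)
A = rng.binomial(1, 0.5, size=1000)
R = 0.3 * A + rng.normal(size=1000)
beta_hat = ols_two_arm(A, R)
```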

4.1. Conditions for Asymptotic Normality of the OLS Estimator

If $(X_{t,i}, \epsilon_{t,i})$ are i.i.d., $\mathbb{E}[\epsilon_{t,i}] = 0$, $\mathbb{E}[\epsilon_{t,i}^2] = \sigma^2$, and the first two moments of $X_{t,i}$ exist, a classical result from statistics [3] is that the OLS estimator is asymptotically normal, i.e., as n → ∞,

$$(\underline{X}^\top \underline{X})^{1/2} (\hat{\beta}^{OLS} - \beta) \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_p).$$

[24] generalize this result by proving that the OLS estimator is still asymptotically normal in the adaptive sampling case when $\underline{X}^\top \underline{X}$ satisfies a certain stability condition. To show that a similar result holds for the batched setting, we generalize the CLT of [24] to triangular arrays (required since the distributions of our random variables vary as the batch size, n, changes), as stated in Theorem 1 below.

Condition 1 (Moments). For all t, n, i, $\mathbb{E}[\epsilon_{t,i}^2 \mid \mathcal{G}_t^{(n)}] = \sigma^2$ and $\mathbb{E}[\epsilon_{t,i}^4 \mid \mathcal{G}_t^{(n)}] < M < \infty$.

Condition 2 (Stability). For some non-random sequence of scalars $\{a_n\}_{n=1}^\infty$, as n → ∞,

$$a_n^{-1} \cdot \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n A_{t,i} \xrightarrow{P} 1.$$

Theorem 1 (Triangular array version of [24], Theorem 3). Assuming Conditions 1 and 2, as n → ∞,

$$(\underline{X}^\top \underline{X})^{1/2} (\hat{\beta}^{OLS} - \beta) \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_p).$$

Note that in the bandit setting, Condition 2 means that prior to running the experiment, the asymptotic rate at which arms will be selected is predictable. We will show that Condition 2 is in a sense necessary for the asymptotic normality of OLS. In Corollary 1 below we state that Conditions 1 and 3, and a non-zero margin are sufficient for stability Condition 2. Later, we will show that when the margin is zero, Condition 2 does not hold for many common bandit algorithms and prove that this leads the OLS estimator to be asymptotically non-normal.

Condition 3 (Conditionally i.i.d. actions). For each t ∈ [1: T], $\{A_{t,i}\}_{i=1}^n \sim \mathrm{Bernoulli}(\pi_t^{(n)})$, i.i.d. over i ∈ [1: n] conditional on $\mathcal{H}_{t-1}^{(n)}$.

Corollary 1 (Sufficient conditions for Theorem 1). If Conditions 1 and 3 hold, and the margin is non-zero, data collected in batches using ϵ-greedy, Thompson Sampling, or UCB with a clipping constraint with f(n) = c for some $0 < c \le \frac{1}{2}$ (see Definition 1) satisfy the conditions of Theorem 1.

4.2. Asymptotic Non-Normality under No Margin

We prove the conjecture of [6] that when the margin is zero, the OLS estimator is asymptotically non-normal under common bandit algorithms, including Thompson Sampling, ϵ-greedy, and UCB. Thus, as seen in Figure 1, assuming the OLS estimator is approximately normal on bandit data can lead to inflated Type-1 error, even asymptotically. The asymptotic non-normality of OLS occurs when the margin is zero because when there is no unique optimal arm, $\pi_t^{(n)}$ does not concentrate as n → ∞ (Appendix C).

Figure 1: Empirical distribution of the Z-statistic (σ² is known) of the OLS estimator for the margin. All simulations are with no margin (β₁ = β₀ = 0); N(0,1) rewards; T = 25; and n = 100. For ϵ-greedy, ϵ = 0.1.

We state the asymptotic non-normality result for Thompson Sampling in Theorem 2; see Appendix C for the proof and similar results for ϵ-greedy and UCB. It is sufficient to prove asymptotic non-normality for T = 2. Note, $\hat{\Delta}^{OLS}$ is the difference in the sample means for each arm, so $\hat{\Delta}^{OLS} = \hat{\beta}_1^{OLS} - \hat{\beta}_0^{OLS}$. The Z-statistic of $\hat{\Delta}^{OLS}$, which is asymptotically normal under i.i.d. sampling, is as follows:

$$\sqrt{\frac{\big(\sum_{t=1}^2 \sum_{i=1}^n A_{t,i}\big)\big(\sum_{t=1}^2 \sum_{i=1}^n 1 - A_{t,i}\big)}{2\sigma^2 n}}\,(\hat{\Delta}^{OLS} - \Delta). \tag{3}$$
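Statistic (3) can be computed directly from the pooled data; the sketch below handles general T, with (3) the special case T = 2 (function name ours):

```python
import numpy as np

def ols_z_statistic(A, R, Delta=0.0, sigma2=1.0):
    """Z-statistic (3) for the pooled OLS margin estimate.

    A, R hold the pooled actions and rewards over all batches (length n*T);
    the statistic is sqrt(N1 * N0 / (nT * sigma^2)) * (Delta_hat - Delta).
    """
    N1 = A.sum()
    N0 = A.size - N1
    delta_hat = R[A == 1].mean() - R[A == 0].mean()
    return np.sqrt(N1 * N0 / (A.size * sigma2)) * (delta_hat - Delta)

A = np.array([1, 1, 0, 0])
R = np.array([1.0, 1.0, 0.0, 0.0])
z = ols_z_statistic(A, R)   # equals 1.0 on this toy data
```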

Theorem 2 (Asymptotic non-normality of OLS estimator under zero margin for Thompson Sampling). Let T = 2 and $\pi_1^{(n)} = \frac{1}{2}$. If $\epsilon_{t,i} \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, we have independent normal priors on the arm means $\tilde{\beta}_0, \tilde{\beta}_1 \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, and $\pi_2^{(n)} = \pi_{\min} \vee \big[(1 - \pi_{\max}) \wedge \mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)})\big]$ for constants $\pi_{\min}, \pi_{\max}$ with $0 < \pi_{\min} \le \pi_{\max} < 1$, then (3) is asymptotically not normal when the margin is zero.

Since the OLS estimator is asymptotically normal when Δ ≠ 0 (Corollary 1) and asymptotically not normal when Δ = 0, the OLS estimator does not converge uniformly (over values of the margin) on data collected under standard bandit algorithms. The non-uniform convergence of the OLS estimator precludes us from using a normal approximation to perform hypothesis testing and construct confidence intervals (see [19]). In real-world applications, there is rarely exactly zero margin. However, the non-uniform convergence of the OLS estimator at zero margin is still practically important because the asymptotic distribution of the OLS estimator when the margin is zero is indicative of the finite-sample distribution when the margin is statistically difficult to distinguish from zero, i.e., when the signal-to-noise ratio $|\Delta|/\sigma$ is low. Figure 2 shows that even when the margin is non-zero, when the signal-to-noise ratio is low, confidence intervals constructed using a normal approximation have coverage probabilities below the nominal level. Moreover, for any batch size n and noise variance σ², there exists a non-zero margin size whose finite-sample distribution is poorly approximated by a normal distribution.

Figure 2: Empirical undercoverage probabilities (coverage probability below 95%) of confidence intervals based on a normal approximation for the OLS estimator. We use Thompson Sampling with N(0,1) priors, a clipping constraint of $0.05 \le \pi_t^{(n)} \le 0.95$, N(0,1) rewards, T = 25, and known σ². Standard errors are < 0.001.

5. Batched OLS Estimator

5.1. Batched OLS Estimator for Multi-Arm Bandits

We now introduce the Batched OLS (BOLS) estimator, which is asymptotically normal under a large class of bandit algorithms, even when the margin is zero. Instead of computing the OLS estimator on all the data, we compute the OLS estimator for each batch and normalize it by the variance estimated from that batch. For each t ∈ [1: T], the BOLS estimator of the margin $\Delta_t := \beta_{t,1} - \beta_{t,0}$ is:

$$\hat{\Delta}_t^{BOLS} = \frac{\sum_{i=1}^n A_{t,i} R_{t,i}}{\sum_{i=1}^n A_{t,i}} - \frac{\sum_{i=1}^n (1 - A_{t,i}) R_{t,i}}{\sum_{i=1}^n (1 - A_{t,i})}.$$
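Per batch, the BOLS estimate and the factor $\sqrt{N_0 N_1 / n}$ used to standardize it are simple to compute; a sketch (names ours):

```python
import numpy as np

def bols_batch(A_t, R_t):
    """BOLS margin estimate for one batch plus its standardizing factor.

    Returns (Delta_hat_t, sqrt(N0 * N1 / n)), where N1 and N0 count the
    pulls of each arm within the batch.
    """
    n = A_t.size
    N1 = A_t.sum()
    N0 = n - N1
    delta_hat = R_t[A_t == 1].mean() - R_t[A_t == 0].mean()
    return delta_hat, np.sqrt(N0 * N1 / n)

delta_hat, scale = bols_batch(np.array([1, 1, 0, 0]),
                              np.array([2.0, 2.0, 1.0, 1.0]))
```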

Theorem 3 (Asymptotic Normality of Batched OLS estimator for multi-arm bandits). Assuming Conditions 1 (moments) and 3 (conditionally i.i.d. actions), and a clipping rate $f(n) = \frac{1}{n^\alpha}$ for some 0 ≤ α < 1 (see Definition 1),

$$\begin{bmatrix} \sqrt{\frac{(\sum_{i=1}^n 1 - A_{1,i})(\sum_{i=1}^n A_{1,i})}{n}}\,(\hat{\Delta}_1^{BOLS} - \Delta_1) \\ \sqrt{\frac{(\sum_{i=1}^n 1 - A_{2,i})(\sum_{i=1}^n A_{2,i})}{n}}\,(\hat{\Delta}_2^{BOLS} - \Delta_2) \\ \vdots \\ \sqrt{\frac{(\sum_{i=1}^n 1 - A_{T,i})(\sum_{i=1}^n A_{T,i})}{n}}\,(\hat{\Delta}_T^{BOLS} - \Delta_T) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_T)$$

It is straightforward to generalize Theorem 3 to the case that batches are different sizes but the size of the smallest batch goes to infinity and the batch size is independent of the history.

By Theorem 3, for the stationary margin case, we can test H0 : Δ = c vs. H1 : Δ ≠ c with the following statistic, which is asymptotically normal under the null:

$$\frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{(\sum_{i=1}^n 1 - A_{t,i})(\sum_{i=1}^n A_{t,i})}{n\sigma^2}}\,(\hat{\Delta}_t^{BOLS} - c). \tag{4}$$
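Combining the per-batch Z-statistics as in (4) gives the overall test statistic; a sketch assuming known σ² (names ours):

```python
import numpy as np

def bols_test_statistic(batches, c=0.0, sigma2=1.0):
    """Statistic (4): scaled sum of per-batch BOLS Z-statistics.

    `batches` is a list of (A_t, R_t) pairs; the statistic is
    asymptotically N(0, 1) under H0: Delta = c when the margin is
    stationary across batches.
    """
    total = 0.0
    for A_t, R_t in batches:
        n = A_t.size
        N1 = A_t.sum()
        N0 = n - N1
        delta_hat = R_t[A_t == 1].mean() - R_t[A_t == 0].mean()
        total += np.sqrt(N0 * N1 / (n * sigma2)) * (delta_hat - c)
    return total / np.sqrt(len(batches))

batch = (np.array([1, 1, 0, 0]), np.array([2.0, 2.0, 1.0, 1.0]))
stat = bols_test_statistic([batch, batch])   # two identical toy batches
```

A two-sided level-α test rejects when |stat| exceeds the standard normal quantile $z_{1-\alpha/2}$.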

This type of test statistic—a weighted combination of asymptotically independent normals—is a special case of the inverse normal p-value combination test, which has been used in simple settings in which the studies (e.g., batches) are independent (e.g., when conducting meta-analyses across multiple studies) [26]. Here the ability to use this type of test statistic is novel since, due to the bandit algorithm, the batches are not independent. The work here demonstrates asymptotic independence, and thus for large n the Z-statistics from each batch should be approximately independently distributed.

The key to proving asymptotic normality for BOLS is that the following ratio converges in probability to one: $\sqrt{\frac{(\sum_{i=1}^n 1 - A_{t,i})(\sum_{i=1}^n A_{t,i})}{n}} \big/ \sqrt{n \pi_t^{(n)} (1 - \pi_t^{(n)})} \xrightarrow{P} 1$. Since $\pi_t^{(n)}$ is measurable with respect to $\mathcal{H}_{t-1}^{(n)}$, the quantity $\sqrt{n \pi_t^{(n)} (1 - \pi_t^{(n)})}$ is a constant given $\mathcal{H}_{t-1}^{(n)}$. Thus, even if $\pi_t^{(n)}$ does not concentrate, we are still able to apply the martingale CLT [8] to prove asymptotic normality. See Appendix B for more details.
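The concentration of this ratio is easy to check numerically: with actions i.i.d. Bernoulli(π_t) given the history, the empirical factor approaches the conditional-on-history factor as n grows. A small sanity-check simulation (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
pi_t = 0.3

def bols_scale_ratio(n):
    """Ratio of the empirical factor sqrt(N0 * N1 / n) to the
    conditional-on-history factor sqrt(n * pi * (1 - pi))."""
    A = rng.binomial(1, pi_t, size=n)
    N1 = A.sum()
    return np.sqrt(N1 * (n - N1) / n) / np.sqrt(n * pi_t * (1 - pi_t))

ratio = bols_scale_ratio(100_000)   # concentrates near 1 as n grows
```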

5.2. Batched OLS Estimator for Contextual Bandits

For contextual K-arm bandits, for any two arms x, y ∈ [0: K − 1], we can estimate the margin between them, $\Delta_{t,xy} \triangleq \beta_{t,x} - \beta_{t,y} \in \mathbb{R}^d$. In each batch, we observe context vectors $\{C_{t,i}\}_{i=1}^n$ for $C_{t,i} \in \mathbb{R}^d$. We redefine the history $\mathcal{H}_{t-1}^{(n)} \triangleq \{C_{t',i}, A_{t',i}, R_{t',i}\}_{i=1,\,t'=1}^{i=n,\,t'=t-1}$ and define the filtration $\mathcal{F}_t^{(n)} \triangleq \sigma(\mathcal{H}_{t-1}^{(n)} \cup \{A_{t,i}, C_{t,i}\}_{i=1}^n)$. The action selection probabilities are now functions of the context, so $\pi_t^{(n)}(C_{t,i}) \in [0,1]^K$ is a vector whose kth dimension equals $\mathbb{P}(A_{t,i} = k \mid \mathcal{H}_{t-1}^{(n)}, C_{t,i})$. We assume the following conditional mean model of the reward: $\mathbb{E}[R_{t,i} \mid \mathcal{F}_t^{(n)}] = \sum_{k=0}^{K-1} \mathbb{I}(A_{t,i} = k)\, C_{t,i}^\top \beta_{t,k}$, and let $\epsilon_{t,i} \triangleq R_{t,i} - \sum_{k=0}^{K-1} \mathbb{I}(A_{t,i} = k)\, C_{t,i}^\top \beta_{t,k}$.

Condition 4 (Conditionally i.i.d. contexts). For each t, the contexts $C_{t,1}, C_{t,2}, \dots, C_{t,n}$ are i.i.d., and their first two moments, $\mu_t$ and $\underline{\Sigma}_t$, are non-random given $\mathcal{H}_{t-1}^{(n)}$, i.e., $\mu_t, \underline{\Sigma}_t \in \sigma(\mathcal{H}_{t-1}^{(n)})$.

Condition 5 (Bounded context). $\|C_{t,i}\|_{\max} \le u$ for all i, t, n for some constant u. Also, the minimum eigenvalue of $\underline{\Sigma}_t$ is lower bounded, i.e., $\lambda_{\min}(\underline{\Sigma}_t) > l > 0$.

Definition 2. A conditional clipping constraint with rate f(n) means that the action selection probabilities $\pi_t^{(n)} : \mathbb{R}^d \to [0,1]^K$ satisfy the following:

$$\lim_{n \to \infty} \mathbb{P}\big(\forall\, c \in \mathbb{R}^d,\; \pi_t^{(n)}(c) \in [f(n), 1 - f(n)]^K\big) = 1$$

For each t ∈ [1: T], we have the OLS estimator of $\Delta_{t,xy}$: $\hat{\Delta}_t^{OLS} \triangleq \hat{\beta}_{t,x}^{OLS} - \hat{\beta}_{t,y}^{OLS}$, where $\underline{C}_{t,k} \triangleq \sum_{i=1}^n \mathbb{I}(A_{t,i}^{(n)} = k)\, C_{t,i} C_{t,i}^\top \in \mathbb{R}^{d \times d}$ and $\hat{\beta}_{t,k}^{OLS} = \underline{C}_{t,k}^{-1} \sum_{i=1}^n \mathbb{I}(A_{t,i}^{(n)} = k)\, C_{t,i} R_{t,i}$.
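A sketch of the per-batch contextual computation: per-arm Gram matrices and OLS coefficients, from which the margin estimate between arms x and y follows (names ours):

```python
import numpy as np

def contextual_bols_batch(C_t, A_t, R_t, x, y):
    """Per-batch OLS margin estimate between arms x and y.

    C_t is the (n, d) context matrix for the batch. Returns
    Delta_hat = beta_hat_x - beta_hat_y along with the per-arm Gram
    matrices, which enter the standardization of the estimate.
    """
    def arm_ols(k):
        mask = A_t == k
        gram = C_t[mask].T @ C_t[mask]   # sum of C C^T over arm-k pulls
        beta_hat = np.linalg.solve(gram, C_t[mask].T @ R_t[mask])
        return gram, beta_hat

    gram_x, beta_x = arm_ols(x)
    gram_y, beta_y = arm_ols(y)
    return beta_x - beta_y, gram_x, gram_y

C = np.ones((4, 1))                  # toy d = 1 contexts
A = np.array([0, 0, 1, 1])
R = np.array([0.0, 0.0, 1.0, 1.0])
delta, gram_x, gram_y = contextual_bols_batch(C, A, R, x=1, y=0)
```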

Theorem 4 (Asymptotic Normality of Batched OLS estimator for contextual bandits). Assuming Conditions 1 (moments), 3 (conditionally i.i.d. actions), 4, and 5, and a conditional clipping rate f(n) = c for some $0 \le c < \frac{1}{2}$ (see Definition 2),

$$\begin{bmatrix} [\underline{C}_{1,x}^{-1} + \underline{C}_{1,y}^{-1}]^{-1/2} (\hat{\Delta}_1^{OLS} - \Delta_{1,xy}) \\ [\underline{C}_{2,x}^{-1} + \underline{C}_{2,y}^{-1}]^{-1/2} (\hat{\Delta}_2^{OLS} - \Delta_{2,xy}) \\ \vdots \\ [\underline{C}_{T,x}^{-1} + \underline{C}_{T,y}^{-1}]^{-1/2} (\hat{\Delta}_T^{OLS} - \Delta_{T,xy}) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 \underline{I}_{Td}).$$

5.3. Batched OLS Statistic for Non-Stationary Bandits

Many real-world problems for which we would like to use bandit algorithms exhibit non-stationarity over time. For example, in online advertising, the effectiveness of an ad may change over time due to exposure to competing ads and general societal changes that could affect perceptions of an ad. We may believe that the expected reward for a given action varies over time, but that the margin is constant from batch to batch. In the online advertising setting, this would mean that whether one ad is always better than another is stable, but the overall effectiveness of both ads may change over time. In this case, we can simply use the BOLS test statistic described earlier in equation (4) to test H0 : Δ = 0 vs. H1 : Δ ≠ 0. Note that the BOLS test statistic for the margin is robust to non-stationarity in the baseline reward without any adjustment. Moreover, in our simulation settings we estimate the variance σ² separately for each batch, which allows for non-stationarity in the variance between batches as well; see Appendix A for variance estimation details and Section 6 for simulation results. Additionally, in the case that we believe the margin itself may vary from batch to batch, the BOLS test statistic can also be used to construct confidence regions that contain the true margin Δt for each batch simultaneously; see Appendix A.5 for details.

6. Simulation Experiments

Procedure

We focus on the two-arm bandit setting and test whether the margin is zero, specifically H0 : Δ = 0 vs. H1 : Δ ≠ 0. We perform experiments for the case in which the noise variance σ² is estimated. We assume homoscedastic errors throughout. See Appendix A.4 for more details about how we estimate the noise variance and about our experimental setup. In Figures 3 and 4 we display results for stationary bandits, and in Figure 5 we show results for bandits with non-stationary baseline rewards. See Appendix A.5 for results for bandits with non-stationary margins.

Figure 5: Non-stationary baseline reward setting: Type-1 error (upper left) and power (upper right) for a two-sided test of H0 : Δ = 0 vs. H1 : Δ ≠ 0 (α = 0.05). In the lower two plots we plot the expected rewards for each arm; note the margin is constant across batches. We use n = 25 and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002.

In our simulations, we found that OLS and AW-AIPW have inflated Type-1 error. Since Type-1 error control is a hard constraint, solutions with inflated Type-1 error are infeasible. In the power plots, we adjust the cutoffs of the estimators to ensure proper Type-1 error control; if an estimator has inflated Type-1 error under the null, in the power simulations we use a critical value estimated using the simulations under the null. Note that it is infeasible to make these cutoff adjustments in real experiments (unless one found the worst-case setting), as there are many nuisance parameters—like the expected rewards for each arm and the noise variance—that can affect cutoff values.

Results

Figure 3 shows that for small sample sizes (nT ≲ 300), BOLS has more reliable Type-1 error control than AW-AIPW with variance-stabilizing weights. After nT ≥ 500 samples, AW-AIPW has proper Type-1 error, and by Figure 4 it always has slightly greater power than BOLS in the stationary setting. The W-decorrelated estimator has reliable Type-1 error control, but very low power compared to AW-AIPW and BOLS. Finally, the high-probability self-normalized martingale bound of [1], which we use for hypothesis testing, has very low power compared to the asymptotic approximation methods.

Figure 3: Stationary setting: Type-1 error for a two-sided test of H0 : Δ = 0 vs. H1 : Δ ≠ 0 (α = 0.05). We set β1 = β0 = 0, n = 25, and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.001.

Figure 4: Stationary setting: Power for a two-sided test of H0 : Δ = 0 vs. H1 : Δ ≠ 0 (α = 0.05). We set β1 = 0, β0 = 0.25, n = 25, and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002. We account for Type-1 error inflation as described in Section 6.

In Figure 5, we display simulation results for the non-stationary baseline reward setting. Whereas other estimators have no Type-1 error guarantees, BOLS still has proper Type-1 error control in the non-stationary baseline reward setting. Moreover, BOLS can have much greater power than other estimators when there is non-stationarity in the baseline reward. Overall, BOLS is favorable over other estimators in small-sample settings or when one wants to be robust to non-stationarity in the baseline reward—at the cost of losing a little power if the environment is stationary.

7. Discussion

We found that the OLS estimator is asymptotically non-normal when the margin is zero due to the non-concentration of the action selection probabilities. Since the OLS estimator is a canonical example of a method-of-moments estimator [13], our results suggest that the inferential guarantees of standard method-of-moments estimators may fail to hold on adaptively collected data when there is no unique optimal, regret-minimizing policy. We develop the Batched OLS estimator, which is asymptotically normal even when the action selection probabilities do not concentrate. An open question is whether batched versions of general method-of-moments estimators could similarly be used for adaptive inference.

Broader Impact.

Our work has the positive impact of encouraging the use of valid statistical inference methods on bandit data, which ultimately leads to more reliable scientific conclusions. In addition, by providing a valid statistical inference method on bandit data, our work facilitates the use of bandit algorithms in experimentation.

Acknowledgments and Disclosure of Funding

Research reported in this paper was supported by the National Institute on Alcohol Abuse and Alcoholism (NIAAA) of the National Institutes of Health under award number R01AA23187, the National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award number P50DA039838, and the National Cancer Institute (NCI) of the National Institutes of Health under award number U01CA229437. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

A. Simulation Details

A.1. W-Decorrelated Estimator

For the W-decorrelated estimator [6], for a batch size of n and T batches, we set λ to be the $\frac{1}{nT}$ quantile of $\lambda_{\min}(\underline{X}^\top \underline{X}) / \log(nT)$, where $\lambda_{\min}(\underline{X}^\top \underline{X})$ denotes the minimum eigenvalue of $\underline{X}^\top \underline{X}$. This procedure for choosing λ is motivated by the conditions of Theorem 4 of [6] and follows the methods used by [6] in their simulation experiments. We had to adjust the original procedure of [6] for choosing λ (they set λ to the 0.15 quantile of $\lambda_{\min}(\underline{X}^\top \underline{X})$), because they only evaluated the W-decorrelated method when the total number of samples was nT = 1000, and the range of valid values of λ changes with the sample size.
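As a concrete illustration, this tuning procedure can be sketched in Python. The Monte Carlo routine below, including the function name `choose_lambda` and the use of Bernoulli(1/2) action draws as a stand-in for the algorithm's randomization, is our own simplified reading of the procedure (with the quantile level read as 1/(nT)), not the exact code of [6].

```python
import numpy as np

def choose_lambda(n, T, n_mc=500, seed=0):
    """Estimate the 1/(nT) quantile of lambda_min(X'X) / log(nT) by
    Monte Carlo, using Bernoulli(1/2) actions as a stand-in for the
    bandit's action randomization."""
    rng = np.random.default_rng(seed)
    vals = np.empty(n_mc)
    for m in range(n_mc):
        A = rng.binomial(1, 0.5, size=n * T)
        X = np.column_stack([1 - A, A])  # one indicator column per arm
        # eigvalsh returns eigenvalues in ascending order, so [0] is the minimum
        vals[m] = np.linalg.eigvalsh(X.T @ X)[0] / np.log(n * T)
    return float(np.quantile(vals, 1.0 / (n * T)))

lam = choose_lambda(n=25, T=4)
```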

A.2. AW-AIPW Estimator

Since the AW-AIPW test statistic for the treatment effect is not explicitly written in the original paper [11], we now write the formulas for the AW-AIPW estimator of the treatment effect: $\hat{\Delta}^{\text{AW-AIPW}} \triangleq \hat{\beta}_1^{\text{AW-AIPW}} - \hat{\beta}_0^{\text{AW-AIPW}}$. We use the variance-stabilizing weights, equal to the square roots of the sampling probabilities, $\sqrt{\pi_t^{(n)}}$ and $\sqrt{1 - \pi_t^{(n)}}$. Below, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$.

$$Y_{t,i,1} \triangleq \frac{A_{t,i}}{\pi_t^{(n)}} R_{t,i} + \left(1 - \frac{A_{t,i}}{\pi_t^{(n)}}\right) \frac{\sum_{t'=1}^{t-1} \sum_{i=1}^{n} A_{t',i} R_{t',i}}{\sum_{t'=1}^{t-1} N_{t',1}}$$
$$Y_{t,i,0} \triangleq \frac{1 - A_{t,i}}{1 - \pi_t^{(n)}} R_{t,i} + \left(1 - \frac{1 - A_{t,i}}{1 - \pi_t^{(n)}}\right) \frac{\sum_{t'=1}^{t-1} \sum_{i=1}^{n} (1 - A_{t',i}) R_{t',i}}{\sum_{t'=1}^{t-1} N_{t',0}}$$
$$\hat{\beta}_1^{\text{AW-AIPW}} \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\, Y_{t,i,1}}{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}} \quad\text{and}\quad \hat{\beta}_0^{\text{AW-AIPW}} \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}\, Y_{t,i,0}}{\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}}$$

The variance estimator for $\hat{\Delta}^{\text{AW-AIPW}}$ is $\hat{V}_0 + \hat{V}_1 + 2\hat{C}_{0,1}$, where

$$\hat{V}_1 \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \pi_t^{(n)} \big(Y_{t,i,1} - \hat{\beta}_1^{\text{AW-AIPW}}\big)^2}{\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\big)^2} \quad\text{and}\quad \hat{V}_0 \triangleq \frac{\sum_{t=1}^T \sum_{i=1}^n \big(1 - \pi_t^{(n)}\big) \big(Y_{t,i,0} - \hat{\beta}_0^{\text{AW-AIPW}}\big)^2}{\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}\big)^2}$$
$$\hat{C}_{0,1} \triangleq -\frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)} \big(1 - \pi_t^{(n)}\big)}\, \big(Y_{t,i,1} - \hat{\beta}_1^{\text{AW-AIPW}}\big)\big(Y_{t,i,0} - \hat{\beta}_0^{\text{AW-AIPW}}\big)}{\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\big)\big(\sum_{t=1}^T \sum_{i=1}^n \sqrt{1 - \pi_t^{(n)}}\big)}$$
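A small Python sketch of the point estimator may be helpful. The array layout, the function name `aw_aipw_delta`, and the convention of taking the first batch's history mean to be 0 (the formulas above leave it undefined at t = 1) are our own simplifying assumptions, not part of [11].

```python
import numpy as np

def aw_aipw_delta(R, A, pi):
    """AW-AIPW treatment-effect estimate (sketch of the formulas above).
    R, A: (T, n) arrays of rewards and binary actions; pi: length-T array
    of per-batch sampling probabilities."""
    T, n = R.shape
    est = {}
    for arm in (0, 1):
        Aa = A if arm == 1 else 1 - A
        pa = pi if arm == 1 else 1 - pi
        num, den = 0.0, 0.0
        cum_rew, cum_cnt = 0.0, 0.0   # running arm-a reward sum / pull count
        for t in range(T):
            hist_mean = cum_rew / cum_cnt if cum_cnt > 0 else 0.0
            # Pseudo-outcome: IPW term plus augmentation by the history mean.
            Y = Aa[t] / pa[t] * R[t] + (1 - Aa[t] / pa[t]) * hist_mean
            w = np.sqrt(pa[t])        # variance-stabilizing weight
            num += np.sum(w * Y)
            den += n * w
            cum_rew += np.sum(Aa[t] * R[t])
            cum_cnt += np.sum(Aa[t])
        est[arm] = num / den
    return est[1] - est[0]

rng = np.random.default_rng(1)
T, n = 5, 200
pi = np.full(T, 0.5)
A = rng.binomial(1, pi[:, None], size=(T, n))
R = A * 1.0 + rng.normal(size=(T, n))   # true Delta = 1
delta_hat = aw_aipw_delta(R, A, pi)
```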

A.3. Self-Normalized Martingale Bound

By the self-normalized martingale bound of [1], specifically Theorem 1 and Lemma 6, we have that in the two-arm bandit setting,

$$\mathbb{P}\left(\forall T, n \geq 1,\; |\hat{\beta}_1^{\text{OLS}} - \beta_1| \leq c_{1,T} \text{ and } |\hat{\beta}_0^{\text{OLS}} - \beta_0| \leq c_{0,T}\right) \geq 1 - \delta$$

where

$$c_{a,T} = \sqrt{\sigma^2 \frac{1 + \sum_{t=1}^T N_{t,a}}{\big(\sum_{t=1}^T N_{t,a}\big)^2} \left(1 + 2 \log\left(\frac{2\sqrt{1 + \sum_{t=1}^T N_{t,a}}}{\delta}\right)\right)}$$

We estimate σ² using the procedure stated below for the OLS estimator. We reject the null hypothesis that Δ = 0 whenever the confidence bounds for the two arms are non-overlapping, specifically when

$$\hat{\beta}_1^{\text{OLS}} + c_{1,T} \leq \hat{\beta}_0^{\text{OLS}} - c_{0,T} \quad\text{or}\quad \hat{\beta}_0^{\text{OLS}} + c_{0,T} \leq \hat{\beta}_1^{\text{OLS}} - c_{1,T}$$
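In code, the radius and the rejection rule can be sketched as follows; the function names are our own, and σ² is passed in directly rather than estimated.

```python
import numpy as np

def snm_radius(N_a, sigma2, delta):
    """Confidence radius c_{a,T} from the self-normalized martingale
    bound, with N_a the total number of pulls of arm a."""
    S = float(N_a)
    return np.sqrt(sigma2 * (1 + S) / S**2
                   * (1 + 2 * np.log(2 * np.sqrt(1 + S) / delta)))

def snm_reject(beta1_hat, beta0_hat, N1, N0, sigma2, delta=0.05):
    """Reject H0: Delta = 0 iff the two arms' intervals do not overlap."""
    c1 = snm_radius(N1, sigma2, delta)
    c0 = snm_radius(N0, sigma2, delta)
    return (beta1_hat + c1 <= beta0_hat - c0) or (beta0_hat + c0 <= beta1_hat - c1)
```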

A.4. Estimating Noise Variance

OLS Estimator

Given the OLS estimators for the means of each arm, $\hat{\beta}_1^{\text{OLS}}, \hat{\beta}_0^{\text{OLS}}$, we estimate the noise variance σ² as follows:

$$\hat{\sigma}^2 \triangleq \frac{1}{nT - 2} \sum_{t=1}^T \sum_{i=1}^n \left(R_{t,i} - A_{t,i}\hat{\beta}_1^{\text{OLS}} - (1 - A_{t,i})\hat{\beta}_0^{\text{OLS}}\right)^2.$$

We use a degrees-of-freedom bias correction by normalizing by nT − 2 rather than nT. Since the W-decorrelated estimator is a modified version of the OLS estimator, we also use this same noise variance estimator for the W-decorrelated estimator; we found that this worked well in practice in terms of Type-1 error control.

Batched OLS

Given the Batched OLS estimators for the means of each arm for each batch, $\hat{\beta}_{t,1}^{\text{BOLS}}, \hat{\beta}_{t,0}^{\text{BOLS}}$, we estimate the noise variance for each batch $\sigma_t^2$ as follows:

$$\hat{\sigma}_t^2 \triangleq \frac{1}{n - 2} \sum_{i=1}^n \left(R_{t,i} - A_{t,i}\hat{\beta}_{t,1}^{\text{BOLS}} - (1 - A_{t,i})\hat{\beta}_{t,0}^{\text{BOLS}}\right)^2.$$

Again, we use a degrees-of-freedom bias correction by normalizing by n − 2 rather than n. We prove the consistency of $\hat{\sigma}_t^2$ (meaning $\hat{\sigma}_t^2 \stackrel{P}{\to} \sigma^2$) in Corollary 4. To use BOLS to test H₀ : Δ = a vs. H₁ : Δ ≠ a, we use the following test statistic:

$$\frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{N_{t,0} N_{t,1}}{n \hat{\sigma}_t^2}} \left(\hat{\Delta}_t^{\text{BOLS}} - a\right).$$

Above, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$. For this test statistic, we use cutoffs based on the Student-t distribution, i.e., for $Y_t \stackrel{\text{i.i.d.}}{\sim} t_{n-2}$ we use a cutoff $c_{\alpha/2}$ such that

$$\mathbb{P}\left(\left|\frac{1}{\sqrt{T}} \sum_{t=1}^T Y_t\right| > c_{\alpha/2}\right) = \alpha.$$

We found cα/2 by simulating draws from the Student-t distribution.
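The test statistic and the simulated cutoff can be sketched as follows; the function names are ours, and equal-sized batches in a two-arm bandit are assumed.

```python
import numpy as np

def bols_test_stat(R, A, a=0.0):
    """BOLS test statistic for H0: Delta = a, with R, A given as
    (T, n) arrays of rewards and binary actions."""
    T, n = R.shape
    total = 0.0
    for t in range(T):
        N1 = int(A[t].sum()); N0 = n - N1
        b1 = R[t][A[t] == 1].mean()
        b0 = R[t][A[t] == 0].mean()
        resid = R[t] - A[t] * b1 - (1 - A[t]) * b0
        sig2_t = resid @ resid / (n - 2)   # per-batch df-corrected variance
        total += np.sqrt(N0 * N1 / (n * sig2_t)) * ((b1 - b0) - a)
    return total / np.sqrt(T)

def bols_cutoff(T, n, alpha=0.05, n_mc=100_000, seed=0):
    """Monte Carlo cutoff for |T^{-1/2} * sum of T i.i.d. t_{n-2} draws|."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_t(n - 2, size=(n_mc, T)).sum(axis=1) / np.sqrt(T)
    return float(np.quantile(np.abs(Y), 1 - alpha))

cut = bols_cutoff(T=5, n=25)
```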

A.5. Non-Stationary Treatment Effect

When we believe that the margin itself varies from batch to batch, we are able to construct a confidence region that contains the true margin Δt for each batch simultaneously with probability 1 − α.

Corollary 2 (Confidence band for the margin for non-stationary bandits). Assume the same conditions as Theorem 3. Let $z_x$ be the $x^{\text{th}}$ quantile of the standard normal distribution, i.e., for $Z \sim \mathcal{N}(0,1)$, $\mathbb{P}(Z < z_\alpha) = \alpha$. For each t ∈ [1: T], we define the interval

$$L_t = \hat{\Delta}_t^{\text{BOLS}} \pm z_{1 - \frac{\alpha}{2T}} \sqrt{\frac{\sigma^2 n}{N_{t,0} N_{t,1}}}.$$

Then $\lim_{n \to \infty} \mathbb{P}(\forall t \in [1:T], \Delta_t \in L_t) \geq 1 - \alpha$. Above, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$.

Proof: Note that by Corollary 3,

$$\mathbb{P}\left(\exists t \in [1:T] \text{ s.t. } \Delta_t \notin L_t\right) \leq \sum_{t=1}^T \mathbb{P}(\Delta_t \notin L_t) \to \sum_{t=1}^T \frac{\alpha}{T} = \alpha$$

where the limit is as n → ∞. Since

$$\mathbb{P}(\forall t \in [1:T], \Delta_t \in L_t) = 1 - \mathbb{P}\left(\exists t \in [1:T] \text{ s.t. } \Delta_t \notin L_t\right),$$

we conclude that

$$\lim_{n \to \infty} \mathbb{P}(\forall t \in [1:T], \Delta_t \in L_t) \geq 1 - \alpha. \qquad \square$$
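A sketch of the band construction in Python (the function name is ours; σ² is treated as known, matching the corollary's statement):

```python
import numpy as np
from statistics import NormalDist

def bols_confidence_band(R, A, sigma2, alpha=0.05):
    """Per-batch intervals L_t = Delta_hat_t +/- z_{1-alpha/(2T)} *
    sqrt(sigma2 * n / (N_{t,0} N_{t,1})), which cover all Delta_t
    simultaneously with asymptotic probability >= 1 - alpha."""
    T, n = R.shape
    z = NormalDist().inv_cdf(1 - alpha / (2 * T))   # Bonferroni over batches
    band = []
    for t in range(T):
        N1 = int(A[t].sum()); N0 = n - N1
        d_t = R[t][A[t] == 1].mean() - R[t][A[t] == 0].mean()
        half = z * np.sqrt(sigma2 * n / (N0 * N1))
        band.append((d_t - half, d_t + half))
    return band

rng = np.random.default_rng(0)
A = rng.binomial(1, 0.5, size=(4, 50))
R = rng.normal(size=(4, 50))          # all Delta_t = 0
band = bols_confidence_band(R, A, sigma2=1.0)
```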

We can also test the null hypothesis of no margin against the alternative that at least one batch has a non-zero margin, i.e., H₀ : ∀t ∈ [1: T], Δ_t = 0 vs. H₁ : ∃t ∈ [1: T] s.t. Δ_t ≠ 0. Note that the global null stated above is of great interest in the mobile health literature [?, ?]. Specifically, we use the following test statistic:

$$\sum_{t=1}^T \frac{N_{t,0} N_{t,1}}{\sigma^2 n} \left(\hat{\Delta}_t^{\text{BOLS}} - 0\right)^2,$$

which by Theorem 3 converges in distribution to a chi-squared distribution with T degrees of freedom under the null Δt = 0 for all t.

To account for estimating the noise variance σ², in our simulations for this test statistic we use cutoffs based on the Student-t distribution, i.e., for $Y_t \stackrel{\text{i.i.d.}}{\sim} t_{n-2}$ we use a cutoff $c_\alpha$ such that

$$\mathbb{P}\left(\frac{1}{T} \sum_{t=1}^T Y_t^2 > c_\alpha\right) = \alpha.$$

We found cα by simulating draws from the Student-t distribution.
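The cutoff defined by this display can be approximated by simulation, for example:

```python
import numpy as np

def global_null_cutoff(T, n, alpha=0.05, n_mc=100_000, seed=0):
    """Monte Carlo cutoff c_alpha with P((1/T) * sum_t Y_t^2 > c_alpha)
    = alpha for Y_t i.i.d. Student-t with n - 2 degrees of freedom."""
    rng = np.random.default_rng(seed)
    S = (rng.standard_t(n - 2, size=(n_mc, T)) ** 2).mean(axis=1)
    return float(np.quantile(S, 1 - alpha))

c_alpha = global_null_cutoff(T=5, n=25)
```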

In the plots below we call the test statistic in (A.5) "BOLS Non-Stationary Treatment Effect" (BOLS NSTE). BOLS NSTE performs poorly in terms of power compared to the other test statistics in the stationary setting; however, in the non-stationary setting, BOLS NSTE significantly outperforms all other test statistics, which tend to have low power when the average treatment effect is close to zero. Note that the W-decorrelated estimator performs well in the left plot of Figure 8; this is because, as we show in Appendix F, the W-decorrelated estimator upweights samples from the earlier batches of the study. So when the treatment effect is large at the beginning of the study the W-decorrelated estimator has high power, and when the treatment effect is small or zero at the beginning of the study it has low power.

Figure 6:

Stationary Setting: Type-1 error for a two-sided test of H₀ : Δ = 0 vs. H₁ : Δ ≠ 0 (α = 0.05). We set β₁ = β₀ = 0, n = 25, and a clipping constraint of $0.1 \leq \pi_t^{(n)} \leq 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.001.

Figure 7:

Stationary Setting: Power for a two-sided test of H₀ : Δ = 0 vs. H₁ : Δ ≠ 0 (α = 0.05). We set β₁ = 0, β₀ = 0.25, n = 25, and a clipping constraint of $0.1 \leq \pi_t^{(n)} \leq 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002. We account for Type-1 error inflation as described in Section 6.

Figure 8:

Nonstationary Setting: The two upper plots display the power of estimators for a two-sided test of H₀ : ∀t ∈ [1: T], β_{t,1} − β_{t,0} = 0 vs. H₁ : ∃t ∈ [1: T], β_{t,1} − β_{t,0} ≠ 0 (α = 0.05). The two lower plots display two treatment effect trends; the left plot considers a decreasing trend (a quadratic function) and the right plot considers an oscillating trend (a sine function). We set n = 25 and a clipping constraint of $0.1 \leq \pi_t^{(n)} \leq 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002.

B. Asymptotic Normality of the OLS Estimator

Condition 6 (Weak moments). For all t, n, i, $\mathbb{E}[\epsilon_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)}] = \sigma^2$, and for all t, n, i, $\mathbb{E}[\varphi(\epsilon_{t,i}^2) \mid \mathcal{G}_{t-1}^{(n)}] < M < \infty$ a.s. for some function φ with $\lim_{x \to \infty} \frac{\varphi(x)}{x} = \infty$.

Condition 7 (Stability). There exists a sequence of nonrandom positive-definite symmetric matrices $\underline{V}_n$ such that

  1. $\underline{V}_n^{-1} \left(\sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top\right)^{1/2} = \underline{V}_n^{-1} (\underline{X}^\top \underline{X})^{1/2} \stackrel{P}{\to} \underline{I}_p$

  2. $\max_{i \in [1:n], t \in [1:T]} \|\underline{V}_n^{-1} X_{t,i}\|_2 \stackrel{P}{\to} 0$

Theorem 5 (Triangular array version of Lai & Wei (1982), Theorem 3). Let $X_{t,i} \in \mathbb{R}^p$ be non-anticipating with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$, so $X_{t,i}$ is $\mathcal{G}_{t-1}^{(n)}$-measurable. We assume the following conditional mean model for rewards:

$$\mathbb{E}[R_{t,i} \mid \mathcal{G}_{t-1}^{(n)}] = X_{t,i}^\top \beta.$$

We define $\epsilon_{t,i} \triangleq R_{t,i} - X_{t,i}^\top \beta$. Note that $\{\epsilon_{t,i}\}_{i=1,t=1}^{i=n,t=T}$ is a martingale difference array with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$.

Assuming Conditions 6 and 7, as n → ∞,

$$(\underline{X}^\top \underline{X})^{1/2} (\hat{\beta}^{\text{OLS}} - \beta) \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_p)$$

Note that in the body of the paper we state that this theorem holds in the two-arm bandit case assuming Conditions 2 and 1. Condition 1 is sufficient for Condition 6, and Condition 2 is sufficient for Condition 7, in the two-arm bandit case.

Proof:

$$\hat{\beta}^{\text{OLS}} = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top R = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top (\underline{X} \beta + \epsilon)$$
$$\hat{\beta}^{\text{OLS}} - \beta = (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top \epsilon = \left(\sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top\right)^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i}$$

It is sufficient to show that as n → ∞:

$$(\underline{X}^\top \underline{X})^{-1/2} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_p)$$

By Slutsky’s Theorem and Condition 7 (a), it is also sufficient to show that as n → ∞,

$$\underline{V}_n^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_p)$$

By the Cramer-Wold device, to show multivariate normality it is sufficient to show that for any fixed $c \in \mathbb{R}^p$ s.t. $\|c\|_2 = 1$, as n → ∞,

$$c^\top \underline{V}_n^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2)$$

We will prove this central limit theorem by using a triangular array martingale central limit theorem, specifically Theorem 2.2 of [8]. We let $Y_{t,i} = c^\top \underline{V}_n^{-1} X_{t,i} \epsilon_{t,i}$. The theorem states that as n → ∞, $\sum_{t=1}^T \sum_{i=1}^n Y_{t,i} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2)$ if the following conditions hold as n → ∞:

  1. $\sum_{t=1}^T \sum_{i=1}^n \mathbb{E}[Y_{t,i} \mid \mathcal{G}_{t-1}^{(n)}] \stackrel{P}{\to} 0$

  2. $\sum_{t=1}^T \sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)}] \stackrel{P}{\to} \sigma^2$

  3. $\forall\, \delta > 0,\; \sum_{t=1}^T \sum_{i=1}^n \mathbb{E}[Y_{t,i}^2\, \mathbb{I}(|Y_{t,i}| > \delta) \mid \mathcal{G}_{t-1}^{(n)}] \stackrel{P}{\to} 0$

Useful Properties

Note that by Cauchy-Schwarz and Condition 7 (b), as n → ∞,

$$\max_{i \in [1:n], t \in [1:T]} |c^\top \underline{V}_n^{-1} X_{t,i}| \leq \max_{i \in [1:n], t \in [1:T]} \|c\|_2 \|\underline{V}_n^{-1} X_{t,i}\|_2 \stackrel{P}{\to} 0$$

By continuous mapping theorem and since the square function on non-negative inputs is order preserving, as n → ∞,

(maxi[1:n],t[1:T]|cV_n1Xt,i|)2=maxi[1:n],t[1:T](cV_n1Xt,i)2P0 (5)

By Condition 7 (a) and continuous mapping theorem, cV_n1(Xt,iXt,i)1/2Pc, so

cV_n1(Xt,iXt,i)1/2(Xt,iXt,i)1/2V_n1cPcc=1

Thus,

cV_n1(t=1Ti=1nXt,iXt,i)V_n1c=t=1Ti=1ncV_n1Xt,iXt,iV_n1cP1

Since cV_n1Xt,i is a scalar, as n → ∞,

t=1Ti=1n(cV_n1Xt,i)2P1 (6)

Condition (a): Martingale

t=1Ti=1nE[cV_n1Xt,iϵt,iGt1(n)]=t=1Ti=1ncV_n1Xt,iE[ϵt,iGt1(n)]=0

Condition (b): Conditional Variance

t=1Ti=1nE[(cV_n1Xt,i)2ϵt,i2Gt1(n)]=t=1Ti=1n(cV_n1Xt,i)2E[ϵt,i2Gt1(n)]=σ2t=1Ti=1n(cV_n1Xt,i)2Pσ2

where the last equality holds by Condition 6 and the limit holds by (6) as n → ∞.

Condition (c): Lindeberg Condition Let δ > 0. We want to show that as n → ∞,

t=1Ti=1nZt,i2E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]P0

where above, we define Zt,i(n)cV_n1Xt,i. By Condition 6, we have that for all n ≥ 1,

maxt[1:T],i[1:n]E[φ(ϵt,i2)Gt1(n)]<M

Since we assume that $\lim_{x \to \infty} \frac{\varphi(x)}{x} = \infty$, for all m ≥ 1 there exists a $b_m$ s.t. $\varphi(x) \geq mMx$ for all $x \geq b_m$. So, for all n, t, i,

ME[φ(ϵt,i2)Gt1(n)]E[φ(ϵt,i2)I(ϵt,i2bm)Gt1(n)]mME[ϵt,i2I(ϵt,i2bm)Gt1(n)]

Thus,

maxt[1:T],i[1:n]E[ϵt,i2I(ϵt,i2bm)Gt1(n)]1m

So we have that

t=1Ti=1nZt,i2E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]=t=1Ti=1nZt,i2(E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]I(Zt,i2δ2/bm)+E[ϵt,i2I(Zt,i2ϵt,i2>δ2)Gt1(n)]I(Zt,i2>δ2/bm))t=1Ti=1nZt,i2(E[ϵt,i2I(ϵt,i2>bm)Gt1(n)]+σ2I(Zt,i2>δ2/bm))(1m+σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm))t=1Ti=1nZt,i2

By Slutsky’s Theorem and (6), it is sufficient to show that as n → ∞,

1m+σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)P0

For any ϵ > 0,

(1m+σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)>ϵ)I(1m>ϵ2)+(σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)>ϵ2)

We can choose m such that 1mϵ2, so (1m>ϵ2)=0. For the second term (note that m is now fixed),

(σ2I(maxt[1:T],j[1:n]Zt,j2>δ2/bm)>ϵ2)(maxt[1:T],j[1:n]Zt,j2>δ2/bm)0

where the last limit holds by (5) as n → ∞. □

B.1. Corollary 1 (Sufficient conditions for Theorem 5)

Under Conditions 1 and 3, when the treatment effect is non-zero, data collected in batches using ϵ-greedy, Thompson Sampling, or UCB with a fixed clipping constraint (see Definition 1) will satisfy the conditions of Theorem 5.

Proof: The only condition of Theorem 5 that needs to be verified is Condition 2. To satisfy Condition 2, it is sufficient to show that for any given Δ, for some constant c ∈ (0, T),

1nt=1Ti=1nAt,i=1nt=1TNt,1Pc.

ϵ-greedy

We assume without loss of generality that Δ > 0 and π1(n)=12. Recall that for ϵ-greedy, for a ∈ [2: T],

πa(n)={1ϵ2 if t1ai=1nAt,iRt,it=1nNt,1>t=1ai=1n(1At,i)Rt,it=1nNt,0ϵ2 otherwise 

Thus to show that πa(n)P1ϵ2 for all a ∈ [2: T], it is sufficient to show that

(t=1ai=1nAt,iRt,it=1aNt,1>t=1ai=1n(1At,i)Rt,it=1aNt,0)1 (7)

To show (7), it is equivalent to show that

(Δ>t=1ai=1n(1At,i)ϵt,it=1aNt,0t=1ai=1nAt,iϵt,it=1aNt,1)1 (8)

To show (8), it is sufficient to show that

t=1ai=1n(1At,i)ϵt,it=1aNt,0t=1ai=1nAt,iϵt,it=1aNt,1P0. (9)

To show (9), it is equivalent to show that

t=1aNt,0t=1aNt,0i=1n(1At,i)ϵt,iNt,0t=1aNt,1t=1aNt,1i=1nAt,iϵt,iNt,1P0. (10)

By Lemma 1, for all t ∈ [1: T],

Nt,1πt(n)nP1

Thus by Slutsky’s Theorem, to show (10), it is sufficient to show that

t=1an(1πt(n))nt=1a(1πt(n))i=1n(1At,i)ϵt,iNt,0t=1anπt(n)nt=1aπt(n)i=1nAt,iϵt,iNt,1P0. (11)

Since πt(n)[ϵ2,1ϵ2] for all t, n, the left hand side of (11) equals the following:

t=1aop(1)i=1n(1At,i)ϵt,iNt,0t=1aop(1)i=1nAt,iϵt,iNt,1P0.

The above limit holds because, by Theorem 3, we have that

(i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,,i=1nAT,iϵT,iNT,1,i=1n(1AT,i)ϵT,iNT,0)DN(0,σ2I_2T). (12)

Thus, by Slutsky’s Theorem and Lemma 1, we have that

1nt=1TNt,1P12+(T1)(1ϵ2)     and     1nt=1TNt,0P12+(T1)ϵ2

Thompson Sampling

We assume without loss of generality that Δ > 0 and π1(n)=12. Recall that for Thompson Sampling with independent standard normal priors (β˜1,β˜0~i.i.d.N(0,1)) for a ∈ [2: T],

πa(n)=πmin[πmax(β˜1>β˜0Ha1(n))]

Given the independent standard normal priors on β˜1, β˜0 we have the following posterior distribution:

β˜1β˜0Ha1(n)~N(t=1a1i=1nAt,iRt,iσ2+t=1a1Nt,1t=1a1i=1n(1At,i)Rt,iσ2+t=1a1Nt,0,σ2(σ2+t=1a1Nt,1)+σ2(σ2+t=1a1Nt,0)(σ2+t=1a1Nt,0)(σ2+t=1a1Nt,1))N(μa1(n),(σa1(n))2)

Thus to show that πa(n)Pπmax for all a ∈ [2: T], it is sufficient to show that μa1(n)PΔ and (σa1(n))2P0 for all a ∈ [2: T]. By Lemma 1, for all t ∈ [1: T],

Nt,1πt(n)nP1

Thus, to show (σa1(n))2P0, it is sufficient to show that

σ2(σ2+nt=1a1πt(n))+σ2(σ2+nt=1a1(1πt(n)))(σ2+nt=1a1(1πt(n)))(σ2+nt=1a1πt(n))P0

The above limit holds because πt(n)[πmin,πmax] for 0 < πminπmax < 1 by the clipping condition.

We now show that μa1(n)PΔ, which is equivalent to showing that the following converges in probability to Δ

t=1a1i=1nAt,iRt,iσ2+t=1a1Nt,1t=1a1i=1n(1At,i)Rt,iσ2+t=1a1Nt,0=t=1a1Nt,1σ2+t=1a1Nt,1t=1a1i=1nAt,iRt,it=1a1Nt,1t=1a1Nt,0σ2+t=1a1Nt,0t=1a1i=1n(1At,i)Rt,it=1a1Nt,0=t=1a1Nt,1σ2+t=1a1Nt,1(β1+t=1a1i=1nAt,iϵt,it=1a1Nt,1)t=1a1Nt,0σ2+t=1a1Nt,0(β0+t=1a1i=1n(1At,i)ϵt,it=1a1Nt,0) (13)

Note that

t=1a1Nt,1σ2+t=1a1Nt,1β1t=1a1Nt,0σ2+t=1a1Nt,0β0PΔ (14)

Equation (14) above holds by Lemma 1, because

nt=1a1πt(n)σ2+nt=1a1πt(n)1P                                        nt=1a1(1πt(n))σ2+nt=1a1(1πt(n))P1 (15)

which hold because πt(n)[πmin,πmax] due to our clipping condition.

By Slutsky’s Theorem and (14), to show (13), it is sufficient to show that

t=1a1i=1nAt,iϵt,iσ2+t=1a1Nt,1t=1a1i=1n(1At,i)ϵt,iσ2+t=1a1Nt,0P0. (16)

Equation (16) is equivalent to the following:

t=1a1Nt,1σ2+t=1a1Nt,1i=1nAt,iϵt,iNt,1t=1a1Nt,0σ2+t=1a1Nt,0i=1n(1At,i)ϵt,iNt,0P0 (17)

By Lemma 1, to show (17) it is sufficient to show that

t=1a1nπt(n)σ2+nt=1a1πt(n)i=1nAt,iϵt,iNt,1t=1a1n(1πt(n))σ2+nt=1a1(1πt(n))i=1n(1At,i)ϵt,iNt,0P0 (18)

Since πt(n)[πmin,πmax] due to our clipping condition, the left hand side of (18) equals the following

t=1a1op(1)i=1nAt,iϵt,iNt,1t=1a1op(1)i=1n(1At,i)ϵt,iNt,0P0

The above limit holds by (12).

Thus, by Slutsky’s Theorem and Lemma 1, we have that

1nt=1TNt,1P12+(T1)πmax     and     1nt=1TNt,0P12+(T1)πmin

UCB

We assume without loss of generality that Δ > 0 and π1(n)=12. Recall that for UCB, for a ∈ [2: T],

πa(n)={πmax if Ua1,1>Ua1,01πmax otherwise 

where we define the upper confidence bounds U for any confidence level δ with 0 < δ < 1 as follows:

Ua1,1={ if t=1a1Nt,1=0t=1a1i=1nAt,iRt,it=1a1Nt,1+2log1/δt=1a1Nt,1 otherwise 
Ua1,0={ if N1,0=0t=1a1i=1n(1At,i)Rt,it=1a1Nt,0+2log1/δt=1a1Nt,0 otherwise 

Thus to show that πa(n)Pπmax for all a ∈ [2: T], it is sufficient to show that I(Ua,1>Ua,0)P1, which is equivalent to showing that the following converges in probability to 1:

I(t=1aNt,1>0,t=1aNt,0>0)I(t=1ai=1nAt,iRt,it=1nNt,1+2log1/δt=1αNt,1>t=1ai=1n(1At,i)Rt,it=1αNt,1+2log1/δt=1αNt,0)+I(t=1aNt,1=0,t=1aNt,0>0)=I((β1β0)+t=1ai=1nAt,iϵt,iΣt=1aNt,1+2log1/δt=1aNt,1>t=1ai=1n(1At,i)ϵt,it=1nNt,1+2log1/δt=1aNt,0)+op(1)

Note that to show that the above converges in probability to 1, it is sufficient to show that the following:

t=1ai=1n(1At,i)ϵt,it=1aNt,1+2log1/δt=1aNt,0t=1ai=1nAt,iϵt,it=1aNt,12log1/δt=1aNt,1P0

Note that for fixed δ, we have that 2log1/δt=1aNt,0P0, since Nt,0n/2P1. Also note that t=1ai=1n(1At,i)ϵt,it=1aNt,1t=1ai=1nAt,iϵt,it=1aNt,1P0, by the same argument made in the ϵ-greedy case to show (9).

Thus, by Slutsky’s Theorem and Lemma 1, we have that

1nt=1TNt,1P12+(T1)πmax     and     1nt=1TNt,0P12+(T1)(1πmax)

C. Non-uniform convergence of the OLS Estimator

Definition 3 (Non-concentration of a sequence of random variables). For a sequence of random variables $\{Y_n\}_{n=1}^{\infty}$ on probability space (Ω, F, ℙ), we say $Y_n$ does not concentrate if for each constant a there exists an $\epsilon_a > 0$ such that

$$\mathbb{P}\left(\{\omega \in \Omega : |Y_n(\omega) - a| > \epsilon_a\}\right) \not\to 0.$$

C.1. Thompson Sampling

Proposition 1 (Non-concentration of sampling probabilities under Thompson Sampling). Under the assumptions of Theorem 2, the posterior probability that arm 1 is better than arm 0 converges as follows:

$$\mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)}) \stackrel{D}{\to} \begin{cases} 1 & \text{if } \Delta > 0 \\ 0 & \text{if } \Delta < 0 \\ \text{Uniform}[0, 1] & \text{if } \Delta = 0 \end{cases}$$

Thus, the sampling probabilities $\pi_t^{(n)}$ do not concentrate when Δ = 0.

Proof: Below, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$. The posterior distributions are:

$$\tilde{\beta}_0 \mid \mathcal{H}_1^{(n)} \sim \mathcal{N}\left(\frac{\sum_{i=1}^n (1 - A_{1,i}) R_{1,i}}{\sigma_a^2 + N_{1,0}},\; \frac{\sigma_a^2}{\sigma_a^2 + N_{1,0}}\right)$$
$$\tilde{\beta}_1 \mid \mathcal{H}_1^{(n)} \sim \mathcal{N}\left(\frac{\sum_{i=1}^n A_{1,i} R_{1,i}}{\sigma_a^2 + N_{1,1}},\; \frac{\sigma_a^2}{\sigma_a^2 + N_{1,1}}\right)$$
$$\tilde{\beta}_1 - \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)} \sim \mathcal{N}(\mu_n, \sigma_n^2)$$

for $\mu_n \triangleq \frac{\sum_{i=1}^n A_{1,i} R_{1,i}}{\sigma_a^2 + N_{1,1}} - \frac{\sum_{i=1}^n (1 - A_{1,i}) R_{1,i}}{\sigma_a^2 + N_{1,0}}$ and $\sigma_n^2 \triangleq \frac{\sigma_a^2(\sigma_a^2 + N_{1,1}) + \sigma_a^2(\sigma_a^2 + N_{1,0})}{(\sigma_a^2 + N_{1,0})(\sigma_a^2 + N_{1,1})}$.

$$\mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)}) = \mathbb{P}(\tilde{\beta}_1 - \tilde{\beta}_0 > 0 \mid \mathcal{H}_1^{(n)}) = \mathbb{P}\left(\frac{(\tilde{\beta}_1 - \tilde{\beta}_0) - \mu_n}{\sigma_n} > -\frac{\mu_n}{\sigma_n} \,\Big|\, \mathcal{H}_1^{(n)}\right)$$

For $Z \sim \mathcal{N}(0,1)$ independent of $\mu_n$, $\sigma_n$, this equals

$$\mathbb{P}\left(Z > -\frac{\mu_n}{\sigma_n} \,\Big|\, \mathcal{H}_1^{(n)}\right) = \mathbb{P}\left(Z < \frac{\mu_n}{\sigma_n} \,\Big|\, \mathcal{H}_1^{(n)}\right) = \Phi\left(\frac{\mu_n}{\sigma_n}\right)$$
μnσn=(i=1nA1,iR1,iσa2+N1,1i=1n(1A1,i)R1,iσa2+N1,0)(σa2+N1,0)(σa2+N1,1)2σa4+σa2n=(β1N1,1+i=1nA1,iϵ1,iσa2+N1,1β0N1,0+i=1n(1A1,i)ϵ1,iσa2+N1,0)(σa2+N1,0)(σa2+N1,1)2σa4+σa2n=i=1nA1,iϵ1,iN1,1N1,1(σa2+N1,0)(2σa4+σa2n)(σa2+N1,1)i=1n(1A1,i)ϵ1,iN1,0N1,0(σa2+N1,1)(2σa4+σa2n)(σa2+N1,0)+(β1N1,1σa2+N1,1β0N1,0σa2+N1,0)(σa2+N1,0)(σa2+N1,1)2σa4+σa2nBn+Cn

Let’s first examine Cn. Note that β1 = β0 + Δ, β1N1,1σa2+N1,1β0N1,0σa2+N1,0 equals

=(β0+Δ)N1,1σa2+N1,1β0N1,0σa2+N1,0=ΔN1,1σa2+N1,1+β0(N1,1σa2+N1,1N1,0σa2+N1,0)
=ΔN1,1/n(σa2+N1,1)/n+β0(N1,1(σa2+N1,0)N1,0(σa2+N1,1)(σa2+N1,1)(σa2+N1,1))
=Δ12+o(1)12+o(1)+β0σa2(N1,1N1,0(σa2+N1,1)(σa2+N1,1))=Δ[1+o(1)]+o(1n)

where the last equality holds by the Strong Law of Large Numbers because

1n2(N1,1N1,0)1n2(σa2+N1,1)(σa2+N1,1)=1n[1212+o(1)][12+o(1)][12+o(1)]=1no(1)14+o(1)=o(1n)

Thus,

Cn=[Δ[1+o(1)]+o(1n)](σa2+N1,0)(σa2+N1,1)2σa4+σa2n=[Δ[1+o(1)]+o(1n)]n[12+o(1)][12+o(1)]o(1)+σa2=nΔ[1/(2σa)+o(1)]+o(1n)

Let’s now examine Bn.

N1,1(σa2+N1,0)(2σa4+σa2n)(σa2+N1,1)=[12+o(1)][12+o(1)][σa2+o(1)][12+o(1)]=12σa2+o(1)
N1,0(σa2+N1,1)(2σa4+σa2n)(σa2+N1,0)=[12+o(1)][12+o(1)][σa2+o(1)][12+o(1)]=12σa2+o(1)

Note that by Theorem 3, $\left[\frac{1}{\sqrt{N_{1,1}}} \sum_{i=1}^n \epsilon_{1,i} A_{1,i},\; \frac{1}{\sqrt{N_{1,0}}} \sum_{i=1}^n \epsilon_{1,i} (1 - A_{1,i})\right] \stackrel{D}{\to} \mathcal{N}(0, \underline{I}_2)$. Thus by Slutsky's Theorem,

[i=1nA1,iϵ1,iN1,1N1,1(σa2+N1,0)(2σa4+σa2n)(σa2+N1,1)i=1n(1A1,i)ϵ1,iN1,0N1,0(σa2+N1,1)(2σa4+σa2n)(σa2+N1,0)]=[i=1nA1,iϵ1,iN1,1[12σa2+o(1)]i=1n(1A1,i)ϵ1,iN1,0[12σa2+o(1)]]DN(0,12σa2I_2)

Thus, we have that $B_n \stackrel{D}{\to} \mathcal{N}(0, \frac{1}{\sigma_a^2})$. Since we assume that the algorithm's variance is correctly specified, i.e., $\sigma_a^2 = 1$,

$$B_n + C_n \stackrel{D}{\to} \begin{cases} \infty & \text{if } \Delta > 0 \\ -\infty & \text{if } \Delta < 0 \\ \mathcal{N}(0, 1) & \text{if } \Delta = 0 \end{cases}$$

Thus, by the continuous mapping theorem,

$$\mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)}) = \Phi\left(\frac{\mu_n}{\sigma_n}\right) = \Phi(B_n + C_n) \stackrel{D}{\to} \begin{cases} 1 & \text{if } \Delta > 0 \\ 0 & \text{if } \Delta < 0 \\ \text{Uniform}[0, 1] & \text{if } \Delta = 0 \end{cases}$$
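This non-concentration is easy to see numerically. The sketch below (our own function name; unit prior variance and unit algorithm variance, matching σ_a² = 1) computes the posterior probability from a single batch of uniformly randomized data:

```python
import numpy as np
from statistics import NormalDist

def ts_posterior_prob(n, delta, seed=None):
    """P(beta1 > beta0 | H_1) for N(0, 1) priors and unit noise variance,
    after one batch of n Bernoulli(1/2) action draws."""
    rng = np.random.default_rng(seed)
    A = rng.binomial(1, 0.5, size=n)
    R = delta * A + rng.normal(size=n)
    N1 = A.sum(); N0 = n - N1
    mu = A @ R / (1 + N1) - (1 - A) @ R / (1 + N0)
    var = 1 / (1 + N1) + 1 / (1 + N0)
    return NormalDist().cdf(mu / np.sqrt(var))

# Under Delta = 0 the posterior probability stays spread out over (0, 1),
# consistent with the Uniform[0,1] limit; under Delta = 1 it piles up near 1.
p_null = np.array([ts_posterior_prob(1000, 0.0, seed=s) for s in range(300)])
p_alt = np.array([ts_posterior_prob(1000, 1.0, seed=s) for s in range(50)])
```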

Proof of Theorem 2 (Non-uniform convergence of the OLS estimator of the treatment effect for Thompson Sampling): The normalized errors of the OLS estimator for Δ, which are asymptotically normal under i.i.d. sampling are as follows:

(N1,1+N2,1)(N1,0+N2,0)2n(β^1OLSβ^0OLSΔ)=(N1,1+N2,1)(N1,0+N2,0)2n(t=12i=1nAt,iRt,iN1,1+N2,1t=12i=1n(1At,i)Rt,iN1,0+N2,0Δ)=(N1,1+N2,1)(N1,0+N2,0)2n((β1β0)Δ+t=12i=1nAt,iϵt,iN1,1+N2,1t=12i=1n(1At,i)ϵt,iN1,0+N2,0)=N1,0+N2,02nt=12i=1nAt,iϵt,iN1,1+N2,0N1,1+N2,12nt=12i=1n(1At,i)ϵt,iN1,0+N2,0=[1,1,1,1][N1,0+N2,02ni=1nA1,iϵ1,iN1,1+N2,1N1,1+N2,12ni=1n(1A1,t)ϵ1,iN1,0+N2,0N1,0+N2,02ni=1nA2,iϵ2,iN1,1+N2,1N1,1+N2,12ni=1n(1A2,1)ϵ2,iN1,0+N2,0]=[1,1,1,1][N1,0+N2,02(N1,1+N2,1)N1,1n1=1nA1,iϵ1,iN1,1N1,1+N2,12(N1,0+N2,0)N1,0nt=1n(1A1,i)ϵ1,iN1,0N1,0+N2,02(N1,1+N2,1)N2,1ni=1nA2,iϵ2,iN2,1N1,1+N2,12(N1,0+N2,0)N2,0nt=1n(1A2,i)ϵ2,iN2,0] (19)

By Theorem 3, (i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,i=1nA2,iϵ2,iN2,1,i=1n(1A2,i)ϵ2,iN2,0)DN(0,I_4). By Lemma 1 and Slutsky’s Theorem, 2n(N1,1+N2,1)N1,1(N1,0+N2,0)12(12+[1π2])2(12+π2)=1+op(1), thus,

N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1=(2n(N1,1+N2,1)N1,1(N1,0+N2,0)12(12+[1π2])2(12+π2)+op(1))N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1=12(12+[1π2])2(12+π2)i=1nA1,iϵ1,iN1,1+op(1)N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1

Note that N1,0+N2,02(N1,1+N2,1) is stochastically bounded because for any K > 2,

(N1,0+N2,02(N1,1+N2,1)>K)(nN1,1>K)=(1K>N1,1n)0

where the limit holds by the law of large numbers since N1,1(n)~Binomial(n,12). Thus, since N1,1n1 and i=1nA1,iϵ1,iN1,1DN(0,1),

op(1)N1,0+N2,02(N1,1+N2,1)N1,1ni=1nA1,iϵ1,iN1,1=op(1)

We can perform the above procedure on the other three terms. Thus, equation (19) is equal to the following:

[1,1,1,1][1/2+1π24(1/2+π2)i=1nA1,tϵ1,iN1,11/2+π24(1/2+1π2)i=1n(1A1,i)ϵ1,iN1,0(1/2+1π2)π22(1/2+π2)i=1nA2,iϵ2,iN2,1(1/2+π2)(1π2)2(1/2+1π2)i=1n(1A2,i)ϵ2,iN2,0]+op(1)

Recall that we showed earlier in Proposition 1 that

π2(n)=πmin[πmaxΦ(μnσn)]=πmin[πmaxΦ(Bn+Cn)]=πmin[πmaxΦ(i=1nA1,iϵ1,i2N1,1i=1n(1A1,i)ϵ1,i2N1,0+nΔ[12+o(1)]+o(1))]

When Δ > 0, π2(n)Pπmax and when Δ < 0, π2(n)Pπmin. We now consider the Δ = 0 case.

π2(n)=πmin[πmaxΦ(12[i=1nA1,iϵ1,iN1,1i=1n(1A1,i)ϵ1,iN1,0]+o(1))]=πmin[πmaxΦ(12[i=1nA1,iϵ1,iN1,1i=1n(1A1,i)ϵ1,iN1,0])]+o(1)

By Slutsky’s Theorem, for Z1, Z2, Z3, Z4~i.i.d.N(0,1),

[1,1,1,1][1/2+1π24(1/2+π2)i=1nA1,iϵ1,iN1,11/2+π24(1/2+1π2)l=1n(1A1,i)ϵ1,iN1,0(1/2+1π2)π22(1/2+π2)i=1nA2,tϵ2,tN2,1(1/2+π2)(1π2)2(1/2+1π2)i=1n(1A2,i)ϵ2,iN2,0]+op(1)D[1,1,1,1][1/2+1π*4(1/2+π*)Z11/2+π*4(1/2+1π*)Z2(1/2+1πs)π*2(1/2+π*)Z3(1/2+π*)(1π*)2(1/2+1π*)Z4]=1/2+1π*2(1/2+π*)(1/2Z1+π*Z3)1/2+π*2(1/2+1π*)(1/2Z2+1π*Z4)

where π*={πmax if Δ>0πmin if Δ<0πmin(πmaxΦ[1/2(Z1Z2)]) if Δ=0

C.2. ϵ-Greedy

Proposition 2 (Non-concentration of the sampling probabilities under zero treatment effect for ϵ-greedy). Let T = 2 and $\pi_1^{(n)} = \frac{1}{2}$ for all n. We assume that $\{\epsilon_{t,i}\}_{i=1}^n \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$, and

$$\pi_2^{(n)} = \begin{cases} 1 - \frac{\epsilon}{2} & \text{if } \frac{\sum_{i=1}^n A_{1,i} R_{1,i}}{N_{1,1}} > \frac{\sum_{i=1}^n (1 - A_{1,i}) R_{1,i}}{N_{1,0}} \\ \frac{\epsilon}{2} & \text{otherwise.} \end{cases}$$

Then the sampling probability $\pi_2^{(n)}$ does not concentrate when β₁ = β₀.

Proof: We define MnI(i=1nA1,iR1,iN1,1>i=1n(1A1,i)R1,iN1,0)=I((β1β0)+i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0). Note that when Mn = 1, π2(n)=1ϵ2 and when Mn = 0, π2(n)=ϵ2.

When the margin is zero, Mn does not concentrate because for all N1,1, N1,0, since ϵ1,i~i.i.d.N(0,1),

(i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0)=(1N1,1Z11N1,0Z2>0)=12

for Z1, Z2~i.i.d.N(0,1). Thus, we have shown that π2(n) does not concentrate when β1β0 = 0. □
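A quick simulation illustrates this non-concentration; the function name and simulation sizes below are our own choices:

```python
import numpy as np

def eps_greedy_pi2(n, delta, eps=0.1, seed=None):
    """Second-batch epsilon-greedy probability of arm 1 after one batch
    of n uniformly randomized pulls with treatment effect delta."""
    rng = np.random.default_rng(seed)
    A = rng.binomial(1, 0.5, size=n)
    R = delta * A + rng.normal(size=n)
    arm1_better = R[A == 1].mean() > R[A == 0].mean()
    return 1 - eps / 2 if arm1_better else eps / 2

# Even for large n, pi_2 keeps flipping between eps/2 and 1 - eps/2 when
# delta = 0, but settles at 1 - eps/2 when delta > 0.
pis_null = [eps_greedy_pi2(10_000, 0.0, seed=s) for s in range(200)]
pis_alt = [eps_greedy_pi2(10_000, 1.0, seed=s) for s in range(50)]
```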

Theorem 6 (Non-uniform convergence of the OLS estimator of the treatment effect for ϵ-greedy). Assuming the setup and conditions of Proposition 2, and that β₁ = b, we show that the normalized errors of the OLS estimator converge in distribution as follows:

$$\sqrt{N_{1,1} + N_{2,1}}\,\big(\hat{\beta}_1^{\text{OLS}} - b\big) \stackrel{D}{\to} Y$$

$$Y = \begin{cases} Z_1 & \text{if } \beta_1 - \beta_0 \neq 0 \\ \frac{1}{\sqrt{3 - \epsilon}}\big(Z_1 + \sqrt{2 - \epsilon}\, Z_3\big)\mathbb{I}(Z_1 > Z_2) + \frac{1}{\sqrt{1 + \epsilon}}\big(Z_1 + \sqrt{\epsilon}\, Z_3\big)\mathbb{I}(Z_1 < Z_2) & \text{if } \beta_1 - \beta_0 = 0 \end{cases}$$

for $Z_1, Z_2, Z_3 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Note that in the β₁ − β₀ = 0 case, Y is non-normal.

Proof: The normalized errors of the OLS estimator for β1 are

N1,1+N2,1(t=12i=1nAt,iRt,iN1,1+N2,1b)=t=12i=1nAt,iϵt,iN1,1+N2,1=[1,1][i=1nA1,iϵ1,iN1,1+N2,1i=1nA2,iϵ2,iN1,1+N2,1]=[1,1][N1,1N1,1+N2,1i=1nA1,iϵ1,iN1,1N2,1N1,1+N2,1i=1nA2,iϵ2,iN2,1]

By Slutsky’s Theorem and Lemma 1, (1/21/2+π2(n)N1,1+N2,1N1,1,π2(n)1/2+π2(n)N1,1+N2,1N2,1)P(1,1), so

=[1,1][(1/21/2+π2(n)N1,1+N2,1N1,1+op(1))N1,1N1,1+N2,1i=1nA1,iϵ1,iN1,1(π2(n)1/2+π2(n)N1,1+N2,1N2,1+op(1))N2,1N1,1+N2,1i=1nA2,iϵ2,iN2,1]
=[1,1][1/21/2+π2(n)i=1nA1,iϵ1,iN1,1+op(1)π2(n)1/2+π2(n)i=1nA2,iϵ2,iN2,1+op(1)]

The last equality holds because by Theorem 3, (i=1nA1,iϵ1,iN1,1,i=1nA2,iϵ2,iN2,1)DN(0,I_2).

Let’s define Mn(i=1nA1,iR1,iN1,1>i=1n(1A1,i)R1,iN1,0)=I((β1β0)+i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0).

Note that when Mn = 1, π2(n)=1ϵ2 and when Mn = 0, π2(n)=ϵ2.

Mn=I((β1β0)+i=1nA1,iϵ1,iN1,1>i=1n(1A1,1)ϵ1,iN1,0)=I(N1,0(β1β0)+N1,0N1,1i=1nA1,iϵ1,iN1,1>i=1n(1A1,i)ϵ1,iN1,0)=I(N1,0(β1β0)+[1+op(1)]i=1nA1,1ϵ1,iN1,1>i=1n(1A1,t)ϵ1,iN1,0)

where the last equality holds because N1,0N1,1P1 by Lemma 1, Slutsky’s Theorem, and continuous mapping theorem. Thus, by Proposition 2,

M(n)P{1 if β1β0>00 if β1β0<0 does not concentrate  if β1β0=0

Note that

[1212+π2(n)t=1nA1,iϵ1,iN1,1+op(1)π2(n)12+π2(n)1=1nA2,iϵ2,iN2,1+op(1)]=[112+1ε21=1nA1,tϵ1,iN1,1+op(1)1ϵ/212+1ϵ21=1nA2,1ϵ2,iN2,1+op(1)]Mn+[1122+ε2i=1nA1,tϵ1,tN1,1+op(1)ϵ212+ϵ2i=1nA2,iϵ2,iN2,1+op(1)](1Mn)

Also note that by Theorem 3, (i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,t=1nA2,1ϵ2,iN2,1,1=1n(1A2,i)ϵ2,iN2,1)DN(0,I_4).

When β1 > β0, MnP1 and when β1 < β0, MnP0; in both these cases the normalized errors are asymptotically normal. We now focus on the case that β1 = β0. By continuous mapping theorem and Slutsky’s theorem for Z1, Z2, Z3, Z4~i.i.d.N(0,1),

=[1,1][1212+1ϵ2i=1nA1,iϵ1,iN1,1+op(1)1ϵ/212+1ϵ2i=1nA2,iϵ2,iN2,1+op(1)]I([1+o(1)]i=1nA1,tϵ1,tN1,1>i=1n(1A1,i)ϵ1,iN1,0)+[1,1][1212+ϵ2i=1nA1,iϵ1,iN1,1+op(1)ϵ212+12i=1nA2,iϵ2,iN2,1+op(1)](1I([1+o(1)]i=1nA1,iϵ1,tN1,1>i=1n(1A1,i)ϵ1,iN1,0))D[1,1][1/21/2+1ϵ/2Z11ϵ/21/2+1ϵ/2Z3]I(Z1>Z2)+[1,1][1/21/2+ϵ/2Z1ϵ/21/2+ϵ/2Z3]I(Z1<Z2)=(13ϵZ1+2ϵ3ϵZ3)I(Z1>Z2)+(11+ϵZ1+ϵ1+ϵZ3)I(Z1<Z2)
+[1,1][1212+ϵ2i=1nA1,iϵ1,iN1,1+op(1)ϵ212+12i=1nA2,iϵ2,iN2,1+op(1)](1I([1+o(1)]i=1nA1,iϵ1,tN1,1>i=1n(1A1,i)ϵ1,iN1,0))D[1,1][1/21/2+1ϵ/2Z11ϵ/21/2+1ϵ/2Z3]I(Z1>Z2)+[1,1][1/21/2+ϵ/2Z1ϵ/21/2+ϵ/2Z3]I(Z1<Z2)

Thus,

t=12i=1nAt,iϵt,iN1,1+N2,1DY
Y{13ϵ(Z12ϵZ3) if β1β0>011+ϵ(Z1ϵZ3) if β1β0<013ϵ(Z12ϵZ3)I(Z1>Z2)+11+ϵ(Z1ϵZ3)I(Z1<Z2) if β1β0=0

C.3. UCB

Theorem 7 (Asymptotic non-Normality under zero treatment effect for clipped UCB). Let T = 2 and π1(n)=12 for all n. We assume that {ϵt,i}i=1n~i.i.d.N(0,1), and

π2(n)={πmax if U1>U01πmax otherwise 

where we define the upper confidence bounds U for any confidence level δ with 0 < δ < 1 as follows:

U1={ if N1,1=0i=1nA1,iR1,iN1,1+2log1/δN1,1 otherwise 
U0={ if N1,0=0i=1n(1A1,i)R1,tN1,1+2log1/δN1,0 otherwise 

Assuming the above conditions, and that β₁ = b, we show that the normalized errors of the OLS estimator converge in distribution as follows:

$$\sqrt{N_{1,1} + N_{2,1}}\,\big(\hat{\beta}_1^{\text{OLS}} - b\big) \stackrel{D}{\to} Y$$

$$Y = \begin{cases} Z_1 & \text{if } \Delta \neq 0 \\ \left(\frac{1/\sqrt{2}}{\sqrt{1/2 + \pi_{\max}}}\, Z_1 + \frac{\sqrt{\pi_{\max}}}{\sqrt{1/2 + \pi_{\max}}}\, Z_3\right)\mathbb{I}(Z_1 > Z_2) + \left(\frac{1/\sqrt{2}}{\sqrt{3/2 - \pi_{\max}}}\, Z_1 + \frac{\sqrt{1 - \pi_{\max}}}{\sqrt{3/2 - \pi_{\max}}}\, Z_3\right)\mathbb{I}(Z_1 < Z_2) & \text{if } \Delta = 0 \end{cases}$$

for $Z_1, Z_2, Z_3 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Note that in the Δ ≜ β₁ − β₀ = 0 case, Y is non-normal.

Proof: The proof is very similar to that of the asymptotic non-normality result for ϵ-greedy. By the same arguments as in the ϵ-greedy case, we have that

N1,1+N2,1(t=12i=1nAt,iRt,iN1,1+N2,1b)=[1,1][1/21/2+π2(n)i=1nA1,iϵ1,iN1,1+op(1)π2(n)1/2+π2(n)i=1nA2,tϵ2,tN2,1+op(1)]

Assuming n ≥ 1, we then define

MnI(U1>U0)=I(N1,1>0,N1,0>0)I(i=1nA1,1R1,tN1,1+2log1/δN1,1>i=1n(1A1,t)R1,iN1,1+2log1/δN1,0)+I(N1,1=0,N1,0>0)=I(N1,1>0,N1,0>0)I((β1β0)+1=1nA1,iϵ1,1N1,1+2log1/δN1,1>1=1n(1A1,1)ϵ1,tN1,1+2log1/δN1,0)+I(N1,1=0,N1,0>0)=I(N1,1>0,N1,0>0)I(N1,0(β1β0)+N1,0N1,1[i=1nA1,iϵ1,iN1,1+2log1/δ]>1=1n(1A1,1)ϵ1,1N1,1+2log1/δ)+I(N1,1=0,N1,0>0)

Note that N1,0N1,1P1 by Lemma 1. Thus by Slutsky’s Theorem and continuous mapping theorem,

=I(N1,0(β1β0)+[1+op(1)]1=1nA1,iϵ1,iN1,1+op(1)>i=1n(1A1,i)ϵ1,iN1,1)+op(1) (20)

Note that

[1212+π2(n)i=1nA1,iϵ1,iN1,1+op(1)π2(n)12+π2(n)1=1nA2,iϵ2,iN2,1+op(1)]=[12i=1nA1,1ϵ1,tN1,1+op(1)12+πmaxi=1nA2,1ϵ2,iN2,1+op(1)]Mn+[112+1πmaxi=1nA1,iϵ1,iN1,1+op(1)1πmax2+1πmaxi=1nA2,iϵ2,iN2,1+op(1)](1Mn)

Let (Z1(n),Z2(n),Z3(n),Z4(n))(i=1nA1,iϵ1,iN1,1,i=1n(1A1,i)ϵ1,iN1,0,i=1nA2,iϵ2,1N2,1,i=1n(1A2,1)ϵ2,iN2,1).

Note that by Theorem 3, (Z1(n),Z2(n),Z3(n),Z4(n))DN(0,I_4).

When β1 > β0, MnP1 and when β1 < β0, MnP0; in both these cases the normalized errors are asymptotically normal. We now focus on the case that β1 = β0. By continuous mapping theorem and Slutsky’s theorem,

=[1,1][12Z1(n)+op(1)12+πmaxπmax12+πmaxZ3(n)+op(1)][I([1+op(1)]Z1(n)+op(1)>Z2(n))+op(1)]+[1,1][112+1πmaxZ1(n)+op(1)1πmax2+1πmaxZ3(n)+op(1)][1I([1+op(1)]Z1(n)+op(1)>Z2(n))+op(1)])D[1,1][12+πmaxZ1πmax2+πmaxZ3]I(Z1>Z2)+[1,1][1212+1πmaxZ11πmax2+1πmaxZ3]I(Z1<Z2)=(1212+πmaxZ1+πmax12+πmaxZ3)I(Z1>Z2)+(1232πmaxZ1+1πmax32πmaxZ3)I(Z1<Z2)

Note that (20) implies that if β1 = β0, that π2(n) will not concentrate.

D. Asymptotic Normality of the Batched OLS Estimator: Multi-Arm Bandits

Theorem 3 (Asymptotic normality of the Batched OLS estimator for multi-arm bandits). Assuming Conditions 6 (weak moments) and 3 (conditionally i.i.d. actions), and a clipping rate of $f(n) = \omega(1/n)$ (Definition 1),

$$\begin{bmatrix} \begin{bmatrix} N_{1,0} & 0 \\ 0 & N_{1,1} \end{bmatrix}^{1/2} (\hat{\beta}_1^{\text{BOLS}} - \beta_1) \\ \begin{bmatrix} N_{2,0} & 0 \\ 0 & N_{2,1} \end{bmatrix}^{1/2} (\hat{\beta}_2^{\text{BOLS}} - \beta_2) \\ \vdots \\ \begin{bmatrix} N_{T,0} & 0 \\ 0 & N_{T,1} \end{bmatrix}^{1/2} (\hat{\beta}_T^{\text{BOLS}} - \beta_T) \end{bmatrix} \stackrel{D}{\to} \mathcal{N}(0, \sigma^2 \underline{I}_{2T})$$

where $\beta_t = (\beta_{t,0}, \beta_{t,1})^\top$, $N_{t,1} = \sum_{i=1}^n A_{t,i}$, and $N_{t,0} = \sum_{i=1}^n (1 - A_{t,i})$. Note that in the body of this paper we state Theorem 3 with conditions that are sufficient for the weaker conditions we use here.

Lemma 1. Assuming the conditions of Theorem 3, for any batch t ∈ [1: T],

$$\frac{N_{t,1}}{n \pi_t^{(n)}} = \frac{\sum_{i=1}^n A_{t,i}}{n \pi_t^{(n)}} \stackrel{P}{\to} 1 \quad\text{and}\quad \frac{N_{t,0}}{n (1 - \pi_t^{(n)})} = \frac{\sum_{i=1}^n (1 - A_{t,i})}{n (1 - \pi_t^{(n)})} \stackrel{P}{\to} 1$$

Proof of Lemma 1: To prove that $\frac{N_{t,1}}{n \pi_t^{(n)}} \stackrel{P}{\to} 1$, it is equivalent to show that $\frac{1}{n \pi_t^{(n)}} \sum_{i=1}^n (A_{t,i} - \pi_t^{(n)}) \stackrel{P}{\to} 0$. Let ϵ > 0.

(|1nπt(n)i=1n(At,iπt(n))|>ϵ)=(|1nπt(n)i=1n(At,iπt(n))|[I(πt(n)[f(n),1f(n)])+I(πt(n)[f(n),1f(n)])]>ϵ)(|1nπt(n)i=1n(At,iπt(n))|I(πt(n)[f(n),1f(n)])>ϵ2)+(|1nπt(n)i=1n(At,iπt(n))|I(πt(n)[f(n),1f(n)])>ϵ2)

Since by our clipping assumption, I(πt(n)[f(n),1f(n)])P1, the second probability in the summation above converges to 0 as n → ∞. We will now show that the first probability in the summation above also goes to zero. Note that E[1nπt(n)i=1n(At,iπt(n))]=E[1nπt(n)i=1n(E[At,iHt1(n)]πt(n))]=0. So by Chebychev inequality, for any ϵ > 0,

(|1nπt(n)i=1n(At,iπt(n))|I(πt(n)[f(n),1f(n)])>ϵ)1ϵ2n2E[1(πt(n))2(i=1n(At,iπt(n)))2I(πt(n)[f(n),1f(n)])]1ϵ2n2i=1nj=1nE[1(πt(n))2(At,iπt(n))(At,jπt(n))I(πt(n)[f(n),1f(n)])]=1ϵ2n2i=1nj=1nE[1(πt(n))2I(πt(n)[f(n),1f(n)])E[At,iAt,jπt(n)(At,i+At,j)+(πt(n))2Ht1(n)]]=1ϵ2n2i=1nj=1nE[1(πt(n))2I(πt(n)[f(n),1f(n)])(E[At,iAt,jHt1(n)](πt(n))2)] (21)

Note that if ij, since At,i~i.i.d.Bernoulli(πt(n)), E[At,iAt,jHt1(n)]=E[At,iHt1(n)]E[At,jHt1(n)]=(πt(n))2, so (21) above equals the following

=1ϵ2n2i=1nE[1(πt(n))2I(πi(n)[f(n),1f(n)])(E[At,iHt1(n)](πt(n))2)]
=1ϵ2n2i=1nE[1πt(n)πt(n)I(πi(n)[f(n),1f(n)])]=1ϵ2nE[1πt(n)πt(n)I(πt(n)[f(n),1f(n)])]1ϵ2n1f(n)0

where the limit holds because we assume $f(n) = \omega(1/n)$, so $f(n)\, n \to \infty$. We can make a very similar argument for $\frac{N_{t,0}}{n(1 - \pi_t^{(n)})} \stackrel{P}{\to} 1$. □
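Lemma 1 is easy to check numerically for a fixed sampling probability; the sketch below (our own function name) draws one batch of i.i.d. Bernoulli actions and computes the ratio.

```python
import numpy as np

def count_ratio(n, pi, seed=None):
    """N_{t,1} / (n * pi_t) for one batch of i.i.d. Bernoulli(pi_t) actions.
    By Lemma 1, this ratio converges in probability to 1 as n grows."""
    rng = np.random.default_rng(seed)
    return rng.binomial(1, pi, size=n).mean() / pi

ratios = [count_ratio(100_000, 0.3, seed=s) for s in range(20)]
```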

Proof for Theorem 3 (Asymptotic normality of Batched OLS estimator for multi-arm bandits): For readability, for this proof we drop the (n) superscript on πt(n). Note that

[Nt,000Nt,1]1/2(β^tBOLSβt)=[Nt,000Nt,1]1/2i=1n[1At,iAt,i]ϵt,i.

We want to show that

[[N0,100N1,1]1/2i=1n[1A1,iA1,i][N0,200N1,2]1/2i=1n[1A2,iA2,i]ϵ2,i[Nt,000Nt,1]1/2i=1n[1AT,iAT,i]ϵT,i]=[N0,11/2i=1n(1A1,i)ϵ1,iN1,11/2i=1nA1,iϵ1,iN0,21/2i=1n(1A2,i)ϵ2,iN1,21/2i=1nA2,iϵ2,iNt,01/2i=1n(1AT,i)ϵT,iNt,11/2i=1nAT,iϵT,i]DN(0,σ2I_2T).

By Lemma 1 and Slutsky’s Theorem it is sufficient to show that as n → ∞,

[1n(1π1)i=1n(1A1,i)ϵ1,i1nπ1i=1nA1,iϵ1,i1n(1π2)i=1n(1A2,i)ϵ2,i1nπ2i=1nA2,iϵ2,i1n(1πT)i=1n(1AT,i)ϵT,i1nπTi=1nAT,iϵT,i]=[1n[1π1,100π1,1]1/2i=1n[1A1,iA1,i]ϵ1,i1n[1π2(n)00π2(n)]1/2i=1n[1A2,iA2,i]ϵ2,i1n[1πt(n)00πt(n)]1/2i=1n[1AT,iAT,i]ϵT,i]DN(0,σ2I_2T)

By Cramer-Wold device, it is sufficient to show that for any fixed vector c2Ts.t.c2=1 that as n → ∞,

c[n1/2[1π1,100π1,1]1/2i=1n[1A1,iA1,i]ϵ1,in1/2[1π2(n)00π2(n)]1/2i=1n[1A2,iA2,i]ϵ2,in1/2[1πt(n)00πt(n)]1/2i=1n[1AT,iAT,i]ϵT,i]DN(0,σ2)

Let us break up c so that c=[c1,c2,,cT]2T with ct2 for t ∈ [1: T]. The above is equivalent to

t=1Tn1/2ct[1πt(n)00πt(n)]1/2i=1n[1At,iAt,i]ϵt,iDN(0,σ2)

Let us define Yt,in1/2ct[1πt,i00πt,i]1/2[1At,iAt,i]ϵt,i.

The sequence {Y1,1,Y1,2, …,Y1,n, …,YT,1, YT,2, …,YT,n} is a martingale with respect to sequence of histories {Ht(n)}t=1T, since

E[Yt,iHt1(n)]=n1/2ct[1πt(n)00πt(n)]1/2E[[1At,iAt,i]ϵt,iHt1(n)]=n1/2ct[1πt(n)00πt(n)]1/2E[[(1πt(n))E[ϵt,iHt1(n),At,i=0]πt,iE[ϵt,iHt1(n),At,i=1]]Ht1(n)]=0

for all i ∈ [1: n] and all t ∈ [1: T]. We then apply [8] martingale central limit theorem to Yt,i to show the desired result (see the proof of Theorem 5 in Appendix B for the statement of the martingale CLT conditions).

Condition(a): Martingale Condition The first condition holds because E[Yt,iHt1(n)]=0 for all i ∈ [1: n] and all t ∈ [1: T].

Condition(b): Conditional Variance

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid H_{t-1}^{(n)}] = \sum_{t=1}^T\sum_{i=1}^n \frac{1}{n}\, c_t^\top \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \begin{bmatrix} \mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] & 0 \\ 0 & \mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] \end{bmatrix} \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} c_t$$

Since $\mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=1] = \sigma^2\pi_t^{(n)}$ and $\mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=0] = \sigma^2(1-\pi_t^{(n)})$, the above equals

$$\sum_{t=1}^T\sum_{i=1}^n \frac{1}{n}\, c_t^\top c_t\, \sigma^2 = \sum_{t=1}^T c_t^\top c_t\, \sigma^2 = \sigma^2.$$

Condition(c): Lindeberg Condition Let δ > 0.

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] = \sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n c_t^\top \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \begin{bmatrix} \mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] & 0 \\ 0 & \mathbb{E}[A_{t,i}\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] \end{bmatrix} \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} c_t$$

Note that for $c_t = [c_{t,0}, c_{t,1}]^\top$, $\mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] = (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\tfrac{c_{t,0}^2}{1-\pi_t^{(n)}}\epsilon_{t,i}^2 > n\delta^2) \mid H_{t-1}^{(n)}, A_{t,i}=0]$ and $\mathbb{E}[A_{t,i}\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}] = \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\tfrac{c_{t,1}^2}{\pi_t^{(n)}}\epsilon_{t,i}^2 > n\delta^2) \mid H_{t-1}^{(n)}, A_{t,i}=1]$. Thus, the above equals

$$\sum_{t=1}^T \frac{1}{n}\sum_{i=1}^n c_{t,0}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=0\right] + c_{t,1}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right]$$
$$\le \sum_{t=1}^T \max_{i\in[1:n]}\left\{ c_{t,0}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=0\right] + c_{t,1}^2\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right]\right\}$$

Note that for any t ∈ [1: T] and i ∈ [1: n]

$$\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right] = \mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right]\Big(\mathbb{I}\big(\pi_t^{(n)}\in[f(n),1-f(n)]\big) + \mathbb{I}\big(\pi_t^{(n)}\notin[f(n),1-f(n)]\big)\Big) \le \mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2 f(n)}{c_{t,1}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=1\right] + \sigma^2\,\mathbb{I}\big(\pi_t^{(n)}\notin[f(n),1-f(n)]\big)$$

The second term converges in probability to zero as n → ∞ by our clipping assumption. We now show that the first term also goes to zero in probability. Since we assume $f(n) = \omega(1/n)$, we have $n f(n) \to \infty$. So, it is sufficient to show that for all t and n,

$$\lim_{m\to\infty} \max_{i\in[1:n]} \mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 > m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] = 0$$

By Condition 6, we have that for all n ≥ 1,

$$\max_{t\in[1:T],\, i\in[1:n]} \mathbb{E}\big[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] < M$$

Since we assume that $\lim_{x\to\infty} \varphi(x)/x = \infty$, for all m there exists a $b_m$ s.t. $\varphi(x) \ge mMx$ for all $x \ge b_m$. So, for all n, t, i,

$$M \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2)\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] \ge mM\,\mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big]$$

Thus,

$$\max_{t\in[1:T],\, i\in[1:n]} \mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=1\big] \le \frac{1}{m}$$

We can make a very similar argument that for all t ∈ [1: T], as n → ∞,

$$\max_{i\in[1:n]} \mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=0\right] \xrightarrow{P} 0. \;\square$$
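To make the object being studentized concrete, here is a minimal sketch (ours, not the authors' code) of the per-batch OLS estimator for a two-arm bandit; `A` is the batch's 0/1 action vector and `R` its rewards:

```python
import numpy as np

def bols_batch(A, R):
    """Per-batch OLS for a two-arm bandit: returns the arm-mean estimates
    (beta_t0_hat, beta_t1_hat) and the counts (N_t0, N_t1) used to
    studentize them."""
    A = np.asarray(A)
    R = np.asarray(R, dtype=float)
    N1 = int(A.sum())
    N0 = len(A) - N1
    beta0 = R[A == 0].mean()  # sample mean of arm-0 rewards in this batch
    beta1 = R[A == 1].mean()  # sample mean of arm-1 rewards in this batch
    return (beta0, beta1), (N0, N1)
```

The studentized statistic $\mathrm{diag}(N_{t,0}, N_{t,1})^{1/2}(\hat\beta_t^{BOLS} - \beta_t)$ built from these quantities is what Theorem 3 shows to be asymptotically $\mathcal{N}(0, \sigma^2 I_2)$, independently across batches.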

Corollary 3 (Asymptotic Normality of the Batched OLS Estimator of the Margin; two-arm bandit setting). Assume the same conditions as Theorem 3. For each t ∈ [1: T], we have the BOLS estimator of the margin $\Delta_t$ (the difference in the arm means):

$$\hat\Delta_t^{BOLS} = \frac{\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{N_{t,0}} - \frac{\sum_{i=1}^n A_{t,i}R_{t,i}}{N_{t,1}}$$

We show that as n → ∞,

$$\begin{bmatrix} \sqrt{\frac{N_{1,0}N_{1,1}}{n}}\,(\hat\Delta_1^{BOLS}-\Delta_1) \\ \sqrt{\frac{N_{2,0}N_{2,1}}{n}}\,(\hat\Delta_2^{BOLS}-\Delta_2) \\ \vdots \\ \sqrt{\frac{N_{T,0}N_{T,1}}{n}}\,(\hat\Delta_T^{BOLS}-\Delta_T) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_T)$$

Proof:

$$\sqrt{\frac{N_{t,0}N_{t,1}}{n}}\,(\hat\Delta_t^{BOLS}-\Delta_t) = \sqrt{\frac{N_{t,0}N_{t,1}}{n}}\left(\frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{N_{t,0}} - \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{N_{t,1}}\right) = \left[\sqrt{\tfrac{N_{t,1}}{n}}\;\; -\!\sqrt{\tfrac{N_{t,0}}{n}}\right]\begin{bmatrix} N_{t,0} & 0 \\ 0 & N_{t,1} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix}\epsilon_{t,i}$$

By Slutsky’s Theorem and Lemma 1, it is sufficient to show that as n → ∞,

$$\begin{bmatrix} \frac{1}{\sqrt{n}}\left[\sqrt{\pi_1^{(n)}}\;\; -\!\sqrt{1-\pi_1^{(n)}}\right]\begin{bmatrix} 1-\pi_1^{(n)} & 0 \\ 0 & \pi_1^{(n)} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{1,i} \\ A_{1,i} \end{bmatrix}\epsilon_{1,i} \\ \vdots \\ \frac{1}{\sqrt{n}}\left[\sqrt{\pi_T^{(n)}}\;\; -\!\sqrt{1-\pi_T^{(n)}}\right]\begin{bmatrix} 1-\pi_T^{(n)} & 0 \\ 0 & \pi_T^{(n)} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{T,i} \\ A_{T,i} \end{bmatrix}\epsilon_{T,i} \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_T)$$

By the Cramer-Wold device, it is sufficient to show that for any fixed vector $d \in \mathbb{R}^T$ with $\|d\|_2 = 1$, writing $d = [d_1, d_2, \ldots, d_T]$ with $d_t \in \mathbb{R}$, as n → ∞,

$$\sum_{t=1}^T \frac{d_t}{\sqrt{n}}\left[\sqrt{\pi_t^{(n)}}\;\; -\!\sqrt{1-\pi_t^{(n)}}\right]\begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2}\sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix}\epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2)$$

Define $Y_{t,i} \triangleq \frac{d_t}{\sqrt{n}}\left[\sqrt{\pi_t^{(n)}}\;\; -\!\sqrt{1-\pi_t^{(n)}}\right]\begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2}\begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix}\epsilon_{t,i}$. The sequence $\{Y_{1,1}, Y_{1,2}, \ldots, Y_{1,n}, \ldots, Y_{T,1}, Y_{T,2}, \ldots, Y_{T,n}\}$ is a martingale difference array with respect to the sequence of histories $\{H_t^{(n)}\}_{t=1}^T$ because for all i ∈ [1: n] and t ∈ [1: T],

$$\mathbb{E}[Y_{t,i} \mid H_{t-1}^{(n)}] = \frac{d_t}{\sqrt{n}}\left[\sqrt{\pi_t^{(n)}}\;\; -\!\sqrt{1-\pi_t^{(n)}}\right]\begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2}\begin{bmatrix} (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i} \mid H_{t-1}^{(n)}, A_{t,i}=0] \\ \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i} \mid H_{t-1}^{(n)}, A_{t,i}=1] \end{bmatrix} = 0$$

We now apply the martingale central limit theorem of [8] to $Y_{t,i}$ to show the desired result. Verifying the martingale CLT conditions is equivalent to what we did in the proof of Theorem 3—the only difference is that we replace $c_t$ in the Theorem 3 proof with $d_t\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]^\top$ in this proof. Even though $c_t$ is a constant vector and $d_t\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]^\top$ is a random vector, the proof still goes through with this adjusted $c_t$ vector, since (i) $d_t\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]^\top$ is measurable with respect to $H_{t-1}^{(n)}$; (ii) $\big\|\big[\sqrt{\pi_t^{(n)}},\, -\sqrt{1-\pi_t^{(n)}}\big]\big\|_2 = 1$; and (iii) the Lindeberg thresholds $\frac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2}$ and $\frac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2}$ change only by the bounded factor $d_t^2 \le 1$. □
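A small sketch of the batch margin statistic from Corollary 3 (our illustration, not the authors' code; we take the margin as the arm-1 mean minus the arm-0 mean, a sign convention that does not affect the statistic's limiting distribution):

```python
import numpy as np

def bols_margin_stat(A, R, delta=0.0):
    """Return the studentized margin statistic
    sqrt(N_t0 * N_t1 / n) * (delta_hat - delta) for one batch,
    where delta_hat is the difference of within-batch arm means."""
    A = np.asarray(A)
    R = np.asarray(R, dtype=float)
    n = len(A)
    N1 = A.sum()
    N0 = n - N1
    delta_hat = R[A == 1].mean() - R[A == 0].mean()
    return np.sqrt(N0 * N1 / n) * (delta_hat - delta)
```

Under the corollary, stacking these statistics across batches (each divided by a consistent estimate of σ) yields asymptotically independent standard normals.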

Corollary 4 (Consistency of BOLS Variance Estimator). Assuming Conditions 1 (moments) and 3 (conditionally i.i.d. actions), and a clipping rate of $f(n) = \omega(1/n)$ (Definition 1), for all t ∈ [1: T], as n → ∞,

$$\hat\sigma_t^2 = \frac{1}{n-2}\sum_{i=1}^n \big(R_{t,i} - A_{t,i}\hat\beta_{t,1}^{BOLS} - (1-A_{t,i})\hat\beta_{t,0}^{BOLS}\big)^2 \xrightarrow{P} \sigma^2$$

Proof:

$$\hat\sigma_t^2 = \frac{1}{n-2}\sum_{i=1}^n \big(R_{t,i} - A_{t,i}\hat\beta_{t,1}^{BOLS} - (1-A_{t,i})\hat\beta_{t,0}^{BOLS}\big)^2 = \frac{1}{n-2}\sum_{i=1}^n \left(\epsilon_{t,i} - A_{t,i}\frac{\sum_{j=1}^n A_{t,j}\epsilon_{t,j}}{N_{t,1}} - (1-A_{t,i})\frac{\sum_{j=1}^n (1-A_{t,j})\epsilon_{t,j}}{N_{t,0}}\right)^2$$
$$= \frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 - \frac{\big(\sum_{i=1}^n A_{t,i}\epsilon_{t,i}\big)^2}{(n-2)N_{t,1}} - \frac{\big(\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}\big)^2}{(n-2)N_{t,0}}$$

Note that $\frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 \xrightarrow{P} \sigma^2$ because for all δ > 0,

$$\mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 - \sigma^2\right| > \delta\right) \le \mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n \epsilon_{t,i}^2 - \frac{n\sigma^2}{n-2}\right| > \delta/2\right) + \mathbb{P}\left(\left|\frac{n\sigma^2}{n-2} - \sigma^2\right| > \delta/2\right) = \mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n (\epsilon_{t,i}^2 - \sigma^2)\right| > \delta/2\right) + \mathbb{P}\left(\frac{2\sigma^2}{n-2} > \delta/2\right)$$

Since the second term above is zero for sufficiently large n, we now focus on the first term. By Chebyshev's inequality,

$$\mathbb{P}\left(\left|\frac{1}{n-2}\sum_{i=1}^n (\epsilon_{t,i}^2 - \sigma^2)\right| > \delta/2\right) \le \frac{4}{\delta^2(n-2)^2}\,\mathbb{E}\left[\sum_{i=1}^n\sum_{j=1}^n (\epsilon_{t,i}^2 - \sigma^2)(\epsilon_{t,j}^2 - \sigma^2)\right] = \frac{4}{\delta^2(n-2)^2}\,\mathbb{E}\left[\sum_{i=1}^n (\epsilon_{t,i}^2 - \sigma^2)^2\right]$$

where the equality above holds because for $i \ne j$, $\mathbb{E}[(\epsilon_{t,i}^2 - \sigma^2)(\epsilon_{t,j}^2 - \sigma^2)] = \mathbb{E}\big[\mathbb{E}[(\epsilon_{t,i}^2 - \sigma^2)(\epsilon_{t,j}^2 - \sigma^2) \mid H_{t-1}^{(n)}]\big] = \mathbb{E}\big[\mathbb{E}[\epsilon_{t,i}^2 - \sigma^2 \mid H_{t-1}^{(n)}]\,\mathbb{E}[\epsilon_{t,j}^2 - \sigma^2 \mid H_{t-1}^{(n)}]\big] = 0$. By Condition 1, $\mathbb{E}[\epsilon_{t,i}^4 \mid H_{t-1}^{(n)}] < M < \infty$, so the above is bounded by

$$\frac{4}{\delta^2(n-2)^2}\,\mathbb{E}\left[\sum_{i=1}^n \mathbb{E}\big[\epsilon_{t,i}^4 - 2\epsilon_{t,i}^2\sigma^2 + \sigma^4 \mid H_{t-1}^{(n)}\big]\right] \le \frac{4n(M+\sigma^4)}{\delta^2(n-2)^2} \to 0$$

Thus by Slutsky's Theorem it is sufficient to show that $\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)N_{t,1}} + \frac{(\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i})^2}{(n-2)N_{t,0}} \xrightarrow{P} 0$. We will only show that $\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)N_{t,1}} \xrightarrow{P} 0$; the other term converges to zero by a very similar argument.

Note that by Lemma 1, $\frac{N_{t,1}}{n\pi_t^{(n)}} \xrightarrow{P} 1$. Thus, by Slutsky's Theorem it is sufficient to show that $\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)\,n\pi_t^{(n)}} \xrightarrow{P} 0$. Let δ > 0. By Markov's inequality,

$$\mathbb{P}\left(\frac{(\sum_{i=1}^n A_{t,i}\epsilon_{t,i})^2}{(n-2)\,n\pi_t^{(n)}} > \delta\right) \le \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\left(\sum_{i=1}^n A_{t,i}\epsilon_{t,i}\right)^2\right] = \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\sum_{j=1}^n\sum_{i=1}^n A_{t,j}A_{t,i}\epsilon_{t,i}\epsilon_{t,j}\right]$$

Since πt(n)Ht1(n),

$$= \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\sum_{j=1}^n\sum_{i=1}^n \mathbb{E}[A_{t,j}A_{t,i}\epsilon_{t,i}\epsilon_{t,j} \mid H_{t-1}^{(n)}]\right]$$

Since for ij, E[At,jAt,iϵt,jϵt,iHt1(n)]=E[At,jϵt,jHt1(n)]E[At,iϵt,iHt1(n)]=0,

$$= \mathbb{E}\left[\frac{1}{\delta(n-2)n\pi_t^{(n)}}\sum_{i=1}^n \mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}]\right]$$

Since E[At,iϵt,i2Ht1(n)]=E[ϵt,i2Ht1(n),At,i=1]πt(n)=σ2πt(n),

$$= \mathbb{E}\left[\frac{n\sigma^2\pi_t^{(n)}}{\delta(n-2)n\pi_t^{(n)}}\right] = \frac{\sigma^2}{\delta(n-2)} \to 0. \;\square$$
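In code, the Corollary 4 variance estimator is just the pooled residual sum of squares around the per-arm batch means; a minimal sketch (ours, not from the paper):

```python
import numpy as np

def bols_sigma2(A, R):
    """sigma_hat_t^2 = (1/(n-2)) * sum of squared residuals around the
    per-arm batch means (Corollary 4). A: 0/1 actions, R: rewards."""
    A = np.asarray(A, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(A)
    beta1 = R[A == 1].mean()
    beta0 = R[A == 0].mean()
    resid = R - A * beta1 - (1 - A) * beta0
    return float(resid @ resid) / (n - 2)
```

The divisor n − 2 reflects the two fitted arm means per batch, mirroring the classical degrees-of-freedom correction.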

E. Asymptotic Normality of the Batched OLS Estimator: Contextual Bandits

Theorem 4 (Asymptotic Normality of the Batched OLS Statistic). For a K-armed contextual bandit, for each t ∈ [1: T] we have the BOLS estimator:

$$\hat\beta_t^{BOLS} = \begin{bmatrix} \underline{C}_{t,0} & 0 & \cdots & 0 \\ 0 & \underline{C}_{t,1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \underline{C}_{t,K-1} \end{bmatrix}^{-1}\sum_{i=1}^n \begin{bmatrix} \mathbb{I}_{A_{t,i}=0}\,C_{t,i} \\ \mathbb{I}_{A_{t,i}=1}\,C_{t,i} \\ \vdots \\ \mathbb{I}_{A_{t,i}=K-1}\,C_{t,i} \end{bmatrix} R_{t,i} \in \mathbb{R}^{Kd}$$

where $\underline{C}_{t,k} \triangleq \sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\, C_{t,i} C_{t,i}^\top \in \mathbb{R}^{d\times d}$. Assuming Conditions 6 (weak moments), 3 (conditionally i.i.d. actions), 4 (conditionally i.i.d. contexts), and 5 (bounded contexts), and a conditional clipping rate f(n) = c for some $0 < c \le \frac{1}{2}$ (see Definition 2), we show that as n → ∞,

$$\begin{bmatrix} \mathrm{Diagonal}\,[\underline{C}_{1,0}, \underline{C}_{1,1}, \ldots, \underline{C}_{1,K-1}]^{1/2}\,(\hat\beta_1^{BOLS}-\beta_1) \\ \mathrm{Diagonal}\,[\underline{C}_{2,0}, \underline{C}_{2,1}, \ldots, \underline{C}_{2,K-1}]^{1/2}\,(\hat\beta_2^{BOLS}-\beta_2) \\ \vdots \\ \mathrm{Diagonal}\,[\underline{C}_{T,0}, \underline{C}_{T,1}, \ldots, \underline{C}_{T,K-1}]^{1/2}\,(\hat\beta_T^{BOLS}-\beta_T) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd})$$
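In code, the Theorem 4 estimator amounts to a separate OLS regression of rewards on contexts for each arm within the batch; the following is a minimal sketch (our names and array shapes, not from the paper's code):

```python
import numpy as np

def bols_contextual(A, C, R, K):
    """Per-batch BOLS for a K-armed contextual bandit.
    A: (n,) actions in {0,...,K-1}; C: (n, d) contexts; R: (n,) rewards.
    Returns a (K, d) array whose k-th row is the OLS fit of rewards on
    contexts using only the samples where arm k was played."""
    betas = []
    for k in range(K):
        mask = A == k
        Ck, Rk = C[mask], R[mask]
        gram = Ck.T @ Ck               # the block \underline{C}_{t,k}
        betas.append(np.linalg.solve(gram, Ck.T @ Rk))
    return np.vstack(betas)
```

The block-diagonal structure of the Gram matrix in the theorem statement is what makes this per-arm decomposition exact.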

Lemma 2. Assuming the conditions of Theorem 4, for any batch t ∈ [1: T] and any arm k ∈ [0: K − 1], as n → ∞,

$$\left[\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\, C_{t,i}C_{t,i}^\top\right]\left[n\underline{Z}_{t,k}P_{t,k}\right]^{-1} \xrightarrow{P} I_d \quad (22)$$
$$\left[\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\, C_{t,i}C_{t,i}^\top\right]^{1/2}\left[n\underline{Z}_{t,k}P_{t,k}\right]^{-1/2} \xrightarrow{P} I_d \quad (23)$$

where $P_{t,k} \triangleq \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$ and $\underline{Z}_{t,k} \triangleq \mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k]$.

Proof of Lemma 2: We first show that as n → ∞, $\frac{1}{n}\sum_{i=1}^n (\mathbb{I}_{A_{t,i}=k}\, C_{t,i}C_{t,i}^\top - \underline{Z}_{t,k}P_{t,k}) \xrightarrow{P} \underline{0}$. It is sufficient to show that convergence holds entry-wise, so for any r, s ∈ [0: d − 1], as n → ∞, $\frac{1}{n}\sum_{i=1}^n \big(\mathbb{I}_{A_{t,i}=k}\, [C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big) \xrightarrow{P} 0$. Note that

$$\mathbb{E}\big[\mathbb{I}_{A_{t,i}=k}\, [C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big] = \mathbb{E}\big[\mathbb{E}[[C_{t,i}C_{t,i}^\top]^{(r,s)} \mid H_{t-1}^{(n)}, A_{t,i}=k]\,P_{t,k} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big] = 0$$

By Chebyshev's inequality, for any ϵ > 0,

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\,[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\right| > \epsilon\right) \le \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}\Big[\big(\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big)\big(\mathbb{I}_{A_{t,j}=k}[C_{t,j}C_{t,j}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big)\Big] \quad (24)$$

By conditional independence and the law of iterated expectations (conditioning on $H_{t-1}^{(n)}$), for $i \ne j$, $\mathbb{E}\big[(\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)})(\mathbb{I}_{A_{t,j}=k}[C_{t,j}C_{t,j}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)})\big] = 0$. Thus, (24) above equals the following:

$$= \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\big(\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)} - P_{t,k}\underline{Z}_{t,k}^{(r,s)}\big)^2\Big] = \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\mathbb{I}_{A_{t,i}=k}\big([C_{t,i}C_{t,i}^\top]^{(r,s)}\big)^2 - 2\,\mathbb{I}_{A_{t,i}=k}[C_{t,i}C_{t,i}^\top]^{(r,s)}P_{t,k}\underline{Z}_{t,k}^{(r,s)} + P_{t,k}^2\big(\underline{Z}_{t,k}^{(r,s)}\big)^2\Big]$$
$$= \frac{1}{\epsilon^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\mathbb{I}_{A_{t,i}=k}\big([C_{t,i}C_{t,i}^\top]^{(r,s)}\big)^2 - P_{t,k}^2\big(\underline{Z}_{t,k}^{(r,s)}\big)^2\Big] = \frac{1}{\epsilon^2 n}\,\mathbb{E}\Big[\mathbb{I}_{A_{t,i}=k}\big([C_{t,i}C_{t,i}^\top]^{(r,s)}\big)^2 - P_{t,k}^2\big(\underline{Z}_{t,k}^{(r,s)}\big)^2\Big] \le \frac{2\max(du^2,1)}{\epsilon^2 n} \to 0$$

as n → ∞. The last inequality above holds by Condition 5.

Proving Equation (22):

It is sufficient to show that

$$\frac{2\max(du^2,1)}{\epsilon^2 n}\,\big\|[n\underline{Z}_{t,k}P_{t,k}]^{-1}\big\|_{op} = \frac{2\max(du^2,1)}{\epsilon^2 n^2 P_{t,k}}\,\big\|\underline{Z}_{t,k}^{-1}\big\|_{op} \xrightarrow{P} 0 \quad (25)$$

We define the random variable $M_t^{(n)} = \mathbb{I}\big(\forall c \in \mathbb{R}^d,\; A_t(H_{t-1}^{(n)}, c) \in [f(n), 1-f(n)]^K\big)$, which indicates whether the conditional clipping condition is satisfied. Note that by our conditional clipping assumption, $M_t^{(n)} \xrightarrow{P} 1$ as n → ∞. The left-hand side of (25) is equal to the following:

$$\frac{2\max(du^2,1)}{\epsilon^2 n^2 P_{t,k}}\,\big\|\underline{Z}_{t,k}^{-1}\big\|_{op}\big(M_t^{(n)} + (1-M_t^{(n)})\big) = \frac{2\max(du^2,1)}{\epsilon^2 n^2 P_{t,k}}\,\big\|\underline{Z}_{t,k}^{-1}\big\|_{op}\, M_t^{(n)} + o_p(1) \quad (26)$$

By our conditional clipping condition and Bayes rule we have that for all c ∈ [−u, u]d,

$$\mathbb{P}(C_{t,i}=c \mid A_{t,i}=k, H_{t-1}^{(n)}, M_t^{(n)}=1) = \frac{\mathbb{P}(A_{t,i}=k \mid C_{t,i}=c, H_{t-1}^{(n)}, M_t^{(n)}=1)\,\mathbb{P}(C_{t,i}=c \mid H_{t-1}^{(n)}, M_t^{(n)}=1)}{\mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)}, M_t^{(n)}=1)} \ge \frac{f(n)\,\mathbb{P}(C_{t,i}=c \mid H_{t-1}^{(n)}, M_t^{(n)}=1)}{1}.$$

Thus, we have that

$$\underline{Z}_{t,k}M_t^{(n)} = \mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k]\,M_t^{(n)} = \mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k, M_t^{(n)}=1]\,M_t^{(n)} \succeq f(n)\,\mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, M_t^{(n)}=1]\,M_t^{(n)} = f(n)\,\mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}]\,M_t^{(n)} = f(n)\,\underline{\Sigma}_t^{(n)}M_t^{(n)}.$$

By applying matrix inverses to both sides of the above inequality, we get that

$$\lambda_{\max}\big(\underline{Z}_{t,k}^{-1}M_t^{(n)}\big) \le \frac{1}{f(n)}\,\lambda_{\max}\big((\underline{\Sigma}_t^{(n)})^{-1}\big)M_t^{(n)} \le \frac{1}{l\, f(n)} \quad (27)$$

where the last inequality above holds for a constant l by Condition 5. Recall that $P_{t,k} = \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k}M_t^{(n)} \ge f(n)M_t^{(n)}$. Thus, equation (26) is bounded above by the following:

$$\frac{2\max(du^2,1)}{\epsilon^2 n^2\, l\, f(n)^2} + o_p(1) \xrightarrow{P} 0$$

where the limit above holds because we assume that f(n) = c for some $0 < c \le \frac{1}{2}$. □

Proving Equation (23): By Condition 5, $\|\frac{1}{n}\underline{C}_{t,k}\|_{\max} \le u$ and $\|\underline{Z}_{t,k}P_{t,k}\|_{\max} \le u$, so $\frac{1}{n}\underline{C}_{t,k}$ and $\underline{Z}_{t,k}P_{t,k}$ lie in a compact set, on which any continuous function is uniformly continuous. For any uniformly continuous function $f: \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ and any ϵ > 0, there exists a δ > 0 such that for any matrices $\underline{A}, \underline{B} \in \mathbb{R}^{d\times d}$ in this set, whenever $\|\underline{A}-\underline{B}\|_{op} < \delta$, then $\|f(\underline{A})-f(\underline{B})\|_{op} < \epsilon$. Thus, for any ϵ > 0, there exists some δ > 0 such that

$$\mathbb{P}\left(\left\|\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top - \underline{Z}_{t,k}P_{t,k}\right\|_{op} > \delta\right) \to 0$$

implies

$$\mathbb{P}\left(\left\|f\!\left(\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top\right) - f\big(\underline{Z}_{t,k}P_{t,k}\big)\right\|_{op} > \epsilon\right) \to 0$$

Thus, by letting f be the matrix square-root function,

$$\left(\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top\right)^{1/2} - \big(\underline{Z}_{t,k}P_{t,k}\big)^{1/2} \xrightarrow{P} \underline{0}.$$

We now want to show that for some constant r > 0, $\mathbb{P}\big(\big\|\underline{Z}_{t,k}^{-1}\frac{1}{P_{t,k}}\big\|_{op} > r\big) \to 0$, because this would imply that

$$\left[\left(\frac{1}{n}\sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top\right)^{1/2} - \big(\underline{Z}_{t,k}P_{t,k}\big)^{1/2}\right]\big(\underline{Z}_{t,k}P_{t,k}\big)^{-1/2} \xrightarrow{P} \underline{0}.$$

Recall that for Mt(n)=I(cd,At(Ht1(n),c)[f(n),1f(n)]K), representing whether the conditional clipping condition is satisfied,

$$\underline{Z}_{t,k}^{-1} = \underline{Z}_{t,k}^{-1}\big(M_t^{(n)} + (1-M_t^{(n)})\big) = \underline{Z}_{t,k}^{-1}M_t^{(n)} + o_p(1).$$

Thus it is sufficient to show that $\mathbb{P}\big(\big\|\underline{Z}_{t,k}^{-1}\frac{1}{P_{t,k}}M_t^{(n)}\big\|_{op} > r\big) \to 0$. Recall that by equation (27) we have that

$$\lambda_{\max}\big(\underline{Z}_{t,k}^{-1}M_t^{(n)}\big) \le \frac{1}{f(n)}\,\lambda_{\max}\big((\underline{\Sigma}_t^{(n)})^{-1}\big)M_t^{(n)} \le \frac{1}{l\, f(n)}$$

Also note that $P_{t,k} = \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k}M_t^{(n)} \ge f(n)M_t^{(n)}$. Thus we have that

$$\mathbb{P}\left(\left\|\underline{Z}_{t,k}^{-1}\frac{1}{P_{t,k}}M_t^{(n)}\right\|_{op} > r\right) \le \mathbb{I}\left(\frac{1}{l\, f(n)^2} > r\right) = 0$$

for $r > \frac{1}{l\, f(n)^2} = \frac{1}{l c^2}$, since we assume that f(n) = c for some $0 < c \le \frac{1}{2}$. □

Proof of Theorem 4: We define Pt,k(At,i=kHt1(n)) and Z_t,kE[Ct,iCt,iHt1(n),At,i=k]. We also define

$$D_t^{(n)} \triangleq \mathrm{Diagonal}\,[\underline{C}_{t,0}, \underline{C}_{t,1}, \ldots, \underline{C}_{t,K-1}]^{1/2}\,(\hat\beta_t - \beta_t) = \sum_{i=1}^n \begin{bmatrix} \underline{C}_{t,0}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=0} \\ \underline{C}_{t,1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=1} \\ \vdots \\ \underline{C}_{t,K-1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=K-1} \end{bmatrix}\epsilon_{t,i}$$

We want to show that $[D_1^{(n)}, D_2^{(n)}, \ldots, D_T^{(n)}] \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd})$. By Lemma 2 and Slutsky's Theorem, it is sufficient to show that as n → ∞, $[Q_1^{(n)}, Q_2^{(n)}, \ldots, Q_T^{(n)}] \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd})$ for

$$Q_t^{(n)} \triangleq \sum_{i=1}^n \begin{bmatrix} \frac{1}{\sqrt{nP_{t,0}}}\,\underline{Z}_{t,0}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=0} \\ \frac{1}{\sqrt{nP_{t,1}}}\,\underline{Z}_{t,1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=1} \\ \vdots \\ \frac{1}{\sqrt{nP_{t,K-1}}}\,\underline{Z}_{t,K-1}^{-1/2}\,C_{t,i}\,\mathbb{I}_{A_{t,i}=K-1} \end{bmatrix}\epsilon_{t,i}$$

By the Cramer-Wold device, it is sufficient to show that for any $b \in \mathbb{R}^{TKd}$ with $\|b\|_2 = 1$, where b = [b₁, b₂, …, b_T] for $b_t \in \mathbb{R}^{Kd}$, as n → ∞,

t=1TbtQt(n)DN(0,σ2) (28)

We can further decompose each $b_t = [b_{t,0}, b_{t,1}, \ldots, b_{t,K-1}]$ with $b_{t,k} \in \mathbb{R}^d$. Thus showing (28) is equivalent to showing that

$$\sum_{t=1}^T\sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\underline{Z}_{t,k}^{-1/2}\sum_{i=1}^n \mathbb{I}_{A_{t,i}=k}\,C_{t,i}\,\epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2)$$

We define $Y_{t,i}^{(n)} \triangleq \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\mathbb{I}_{A_{t,i}=k}\,\underline{Z}_{t,k}^{-1/2}\,C_{t,i}\,\epsilon_{t,i}$. The sequence $Y_{1,1}^{(n)}, Y_{1,2}^{(n)}, \ldots, Y_{1,n}^{(n)}, \ldots, Y_{T,1}^{(n)}, Y_{T,2}^{(n)}, \ldots, Y_{T,n}^{(n)}$ is a martingale difference array with respect to the sequence of histories $\{H_t^{(n)}\}_{t=1}^T$ because $\mathbb{E}[Y_{t,i}^{(n)} \mid H_{t-1}^{(n)}] = \mathbb{E}\big[\mathbb{E}[Y_{t,i}^{(n)} \mid H_{t-1}^{(n)}, A_{t,i}, C_{t,i}] \mid H_{t-1}^{(n)}\big] = 0$ for all i ∈ [1: n] and all t ∈ [1: T]. We then apply the martingale central limit theorem of [8] to $Y_{t,i}^{(n)}$ to show the desired result (see the proof of Theorem 5 in Appendix B for the statement of the martingale CLT conditions). Note that the first condition (a) of the martingale CLT is already satisfied, as we just showed that the $Y_{t,i}^{(n)}$ form a martingale difference array with respect to $H_{t-1}^{(n)}$.

Condition(b): Conditional Variance

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid H_{t-1}^{(n)}] = \sum_{t=1}^T\sum_{i=1}^n \mathbb{E}\left[\left(\sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{I}_{A_{t,i}=k}\,C_{t,i}\,\epsilon_{t,i}\right)^2 \middle|\, H_{t-1}^{(n)}\right] = \sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

By law of iterated expectations (conditioning on Ht1(n), At,i, Ct,i) and Condition 6,

$$= \frac{\sigma^2}{n}\sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{P_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k} = \frac{\sigma^2}{n}\sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{P_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[C_{t,i}C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i}=k]\,P_{t,k}\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$
$$= \frac{\sigma^2}{n}\sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} b_{t,k}^\top I_d\, b_{t,k} = \sigma^2\sum_{t=1}^T\sum_{k=0}^{K-1} b_{t,k}^\top b_{t,k} = \sigma^2$$

Condition(c): Lindeberg Condition Let δ > 0.

$$\sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2\,\mathbb{I}(|Y_{t,i}|>\delta) \mid H_{t-1}^{(n)}] = \sum_{t=1}^T\sum_{i=1}^n\sum_{k=0}^{K-1} \frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

It is sufficient to show that for any t ∈ [1: T] and any k ∈ [0 : K − 1] the following converges in probability to zero:

$$\sum_{i=1}^n \frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}[\mathbb{I}_{A_{t,i}=k}\,C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2\,\mathbb{I}(Y_{t,i}^2>\delta^2) \mid H_{t-1}^{(n)}]\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

Recall that $Y_{t,i} = \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{nP_{t,k}}}\,\mathbb{I}_{A_{t,i}=k}\,\underline{Z}_{t,k}^{-1/2}\,C_{t,i}\,\epsilon_{t,i}$. The above equals

$$\frac{1}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,\mathbb{E}\!\left[C_{t,i}C_{t,i}^\top\,\epsilon_{t,i}^2\,\mathbb{I}\!\left(\frac{1}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1/2}\,C_{t,i}C_{t,i}^\top\,\underline{Z}_{t,k}^{-1/2}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]\underline{Z}_{t,k}^{-1/2}\, b_{t,k}$$

Since $C_{t,i} \in [-u,u]^d$, by the Gershgorin circle theorem we can bound the maximum eigenvalue of $C_{t,i}C_{t,i}^\top$ by some constant a > 0. The above is thus bounded by

$$\frac{a}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\;\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\frac{a}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]$$

We define random variable Mt(n)=I(cd,At(Ht1(n),c)[f(n),1f(n)]K), representing whether the conditional clipping condition is satisfied. Note that by our conditional clipping assumption, Mt(n)P1 as n → ∞.

$$= \frac{a}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\;\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\tfrac{a}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]\big(M_t^{(n)} + (1-M_t^{(n)})\big)$$
$$= \frac{a}{n}\sum_{i=1}^n b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\;\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\tfrac{a}{nP_{t,k}}\, b_{t,k}^\top\,\underline{Z}_{t,k}^{-1}\, b_{t,k}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right]M_t^{(n)} + o_p(1) \quad (29)$$

By equation (27), we have that

$$\lambda_{\max}\big(\underline{Z}_{t,k}^{-1}\big)M_t^{(n)} \le \frac{1}{f(n)}\,\lambda_{\max}\big((\underline{\Sigma}_t^{(n)})^{-1}\big) \le \frac{1}{l\, f(n)}$$

Recall that $P_{t,k} = \mathbb{P}(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k}M_t^{(n)} \ge f(n)M_t^{(n)}$. Thus we have that equation (29) is upper bounded by the following:

$$\frac{a}{n}\sum_{i=1}^n \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\frac{a\, b_{t,k}^\top b_{t,k}}{l\, n f(n)^2}\,\epsilon_{t,i}^2 > \delta^2\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right] + o_p(1)$$
$$= \frac{a}{n}\sum_{i=1}^n \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \frac{\delta^2 l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right] + o_p(1)$$

It is sufficient to show that

$$\lim_{n\to\infty}\max_{i\in[1:n]} \frac{1}{f(n)}\,\mathbb{E}\!\left[\epsilon_{t,i}^2\,\mathbb{I}\!\left(\epsilon_{t,i}^2 > \frac{\delta^2 l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}}\right) \middle|\, H_{t-1}^{(n)}, A_{t,i}=k\right] = 0. \quad (30)$$

By Condition 6, we have that for all n ≥ 1, $\max_{t\in[1:T],\,i\in[1:n]} \mathbb{E}[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=k] < M$. Since we assume that $\lim_{x\to\infty}\varphi(x)/x = \infty$, for all m ≥ 1, there exists a $b_m$ s.t. $\varphi(x) \ge mMx$ for all $x \ge b_m$. So, for all n, t, i,

$$M \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i}=k\big] \ge \mathbb{E}\big[\varphi(\epsilon_{t,i}^2)\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k\big] \ge mM\,\mathbb{E}\big[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k\big]$$

Thus, $\max_{i\in[1:n]} \mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k] \le \frac{1}{m}$, so $\lim_{m\to\infty}\max_{i\in[1:n]} \mathbb{E}[\epsilon_{t,i}^2\,\mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \mid H_{t-1}^{(n)}, A_{t,i}=k] = 0$. Since by our conditional clipping assumption f(n) = c for some $0 < c \le \frac{1}{2}$, we have $n f(n)^2 \to \infty$. So equation (30) holds. □

Corollary 5 (Asymptotic Normality of the Batched OLS for Margin with Context Statistic). Assume the same conditions as Theorem 4. For any two arms x,y ∈ [0: K − 1] for all t ∈ [1: T], we have the BOLS estimator for Δt,x−y := βt,xβt,y. We show that as n → ∞,

$$\begin{bmatrix} \big[\underline{C}_{1,x}^{-1} + \underline{C}_{1,y}^{-1}\big]^{-1/2}\,(\hat\Delta_{1,x-y}^{BOLS} - \Delta_{1,x-y}) \\ \big[\underline{C}_{2,x}^{-1} + \underline{C}_{2,y}^{-1}\big]^{-1/2}\,(\hat\Delta_{2,x-y}^{BOLS} - \Delta_{2,x-y}) \\ \vdots \\ \big[\underline{C}_{T,x}^{-1} + \underline{C}_{T,y}^{-1}\big]^{-1/2}\,(\hat\Delta_{T,x-y}^{BOLS} - \Delta_{T,x-y}) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{Td})$$

where

$$\hat\Delta_{t,x-y}^{BOLS} = \big[\underline{C}_{t,x}^{-1} + \underline{C}_{t,y}^{-1}\big]^{-1}\left(\underline{C}_{t,y}^{-1}\sum_{i=1}^n A_{t,i}\,C_{t,i}\,R_{t,i} - \underline{C}_{t,x}^{-1}\sum_{i=1}^n (1-A_{t,i})\,C_{t,i}\,R_{t,i}\right).$$

Proof: By the Cramer-Wold device, it is sufficient to show that for any fixed vector $d \in \mathbb{R}^{Td}$ with $\|d\|_2 = 1$, where d = [d₁, d₂, …, d_T] for $d_t \in \mathbb{R}^d$, $\sum_{t=1}^T d_t^\top \big[\underline{C}_{t,x}^{-1} + \underline{C}_{t,y}^{-1}\big]^{-1/2}(\hat\Delta_{t,x-y}^{BOLS} - \Delta_{t,x-y}) \xrightarrow{D} \mathcal{N}(0, \sigma^2)$ as n → ∞.

t=1Tdt[C_t,x1+C_t,y1]1/2(Δ^t,xyBOLS Δt,xy)=t=1Tdt[C_t,x1+C_t,y1]1/2(C_t,y1i=1nAt,iCt,iϵt,iC_t,x1i=1n(1At,i)Ct,iϵt,i)

By Lemma 2, as n → ∞, $\frac{1}{nP_{t,x}}\underline{Z}_{t,x}^{-1}\underline{C}_{t,x} \xrightarrow{P} I_d$ and $\frac{1}{nP_{t,y}}\underline{Z}_{t,y}^{-1}\underline{C}_{t,y} \xrightarrow{P} I_d$, so by Slutsky's Theorem it is sufficient to show that as n → ∞,

t=1Tdt[C_t,x1+C_t,y1]1/2(1nPt,yZ_t,y1i=1nAt,iCt,iϵt,i1nPt,xZ_t,x1i=1n(1At,i)Ct,iϵt,i)DN(0,σ2)

We know that [1Pt,xZ_t,x1+1Pt,yZ_t,y1]1/2[1Pt,xZ_t,x1+1Pt,yZ_t,y1]1/2PI_d.

By Lemma 2 and continuous mapping theorem, nPt,xZ_t,xC_t,x1PI_d and nPt,yZ_t,yC_t,y1PI_d. So by Slutsky’s Theorem,

[1Pt,xZ_t,x1+1Pt,yZ_t,y1]1/2[nC_t,x1+nC_t,y1]1/2PI_d

So, returning to our CLT, by Slutsky’s Theorem, it is sufficient to show that as n → ∞,

t=1Tdt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,yZ_t,y1i=1nAt,iCt,iϵt,it=1Tdt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,xZ_t,x1i=1n(1At,i)Ct,iϵt,iDN(0,σ2)

The above sum equals the following:

=t=1Tdt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,xZ_t,x1/2(1nPt,xZ_t,x1/2i=1nAt,iCt,iϵt,i)t=1Tdt[1nPt,xZ_t,x1+1nPt,xZ_t,y1]1/21nPt,yZ_t,y1/2(1nPt,yZ_t,y1/2i=1n(1At,i)Ct,iϵt,i)

Asymptotic normality holds by the same martingale CLT as we used in the proof of Theorem 4. The only difference is that we adjust our bt,k vector from Theorem 4 to the following:

bt,k:={0 if k{x,y}dt[1nPt,xZ_t,x1+1nPt,yZ_t,y1]1/21nPt,xZ_t,x1/2 if k=xdt[1nPt,xZt,x1+1nPt,yZt,y1]1/21nPt,yZ_t,y1/2 if k=y

The proof still goes through with this adjustment because for all k ∈ [0: K − 1], (i) $b_{t,k}$ is measurable with respect to $H_{t-1}^{(n)}$; (ii) $\sum_{t=1}^T\sum_{k=0}^{K-1} b_{t,k}^\top b_{t,k} = \sum_{t=1}^T d_t^\top d_t = 1$; and (iii) the divergence of the threshold $\frac{\delta^2 l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}}$ still holds because $b_{t,k}^\top b_{t,k}$ is bounded above by one. □

F. W-Decorrelated Estimator [6]

To better understand why the W-decorrelated estimator has relatively low power, but is still able to guarantee asymptotic normality, we now investigate the form of the W-decorrelated estimator in the two-arm bandit setting.

F.1. Decorrelation Approach

We now assume we are in the unbatched setting (i.e., a batch size of one), as the W-decorrelated estimator was developed for this setting; however, these results easily translate to the batched setting. We now let n index the total number of samples (previously this was nT) and examine asymptotics as n → ∞. We assume the following model:

Rn=X_nβ+ϵn

where $R_n, \epsilon_n \in \mathbb{R}^n$, $\underline{X}_n \in \mathbb{R}^{n\times p}$, and $\beta \in \mathbb{R}^p$. The W-decorrelated OLS estimator is defined as follows:

β^d=β^OLS+W_n(RnX_nβ^OLS)

With this definition we have that,

$$\hat\beta^d - \beta = \hat\beta^{OLS} + \underline{W}_n(R_n - \underline{X}_n\hat\beta^{OLS}) - \beta = \hat\beta^{OLS} + \underline{W}_n(\underline{X}_n\beta + \epsilon_n) - \underline{W}_n\underline{X}_n\hat\beta^{OLS} - \beta = (I_p - \underline{W}_n\underline{X}_n)(\hat\beta^{OLS} - \beta) + \underline{W}_n\epsilon_n$$

Note that if E[W_nϵn]=E[i=1nWiϵi]=0 (where Wi is the ith column of W_n), then E[(I_pW_nX_n)(β^OLSβ)] would be the bias of the estimator. We assume {ϵi} is a martingale difference sequence w.r.t. filtration {Gi}i=1n. Thus, if we constrain Wi to be Gi1 measurable,

$$\mathbb{E}[\underline{W}_n\epsilon_n] = \mathbb{E}\left[\sum_{i=1}^n W_i\epsilon_i\right] = \sum_{i=1}^n \mathbb{E}\big[\mathbb{E}[W_i\epsilon_i \mid \mathcal{G}_{i-1}]\big] = \sum_{i=1}^n \mathbb{E}\big[W_i\,\mathbb{E}[\epsilon_i \mid \mathcal{G}_{i-1}]\big] = 0$$

Trading off Bias and Variance

While decreasing $\mathbb{E}[(I_p - \underline{W}_n\underline{X}_n)(\hat\beta^{OLS} - \beta)]$ will decrease the bias, making $\underline{W}_n$ larger in norm will increase the variance. So the trade-off between bias and variance can be adjusted with different values of λ in the following optimization problem:

$$\|I_p - \underline{W}_n\underline{X}_n\|_F^2 + \lambda\|\underline{W}_n\|_F^2 = \|I_p - \underline{W}_n\underline{X}_n\|_F^2 + \lambda\,\mathrm{Tr}(\underline{W}_n\underline{W}_n^\top)$$

Optimizing for W_n

The authors propose to optimize for $\underline{W}_n$ in a recursive fashion, so that the ith column $W_i$ only depends on $\{X_j\}_{j\le i} \cup \{\epsilon_j\}_{j\le i-1}$ (so $\sum_{i=1}^n \mathbb{E}[W_i\epsilon_i] = 0$). We let $W_0 = 0$, $X_0 = 0$, and recursively define $\underline{W}_n \triangleq [\underline{W}_{n-1}\; W_n]$, where

$$W_n = \operatorname*{argmin}_{W\in\mathbb{R}^p}\; \big\|I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big\|_F^2 + \lambda\|W\|_2^2$$

where $\underline{W}_{n-1} = [W_1; W_2; \ldots; W_{n-1}] \in \mathbb{R}^{p\times(n-1)}$ and $\underline{X}_{n-1} = [X_1; X_2; \ldots; X_{n-1}] \in \mathbb{R}^{(n-1)\times p}$. Now, let us find the closed-form solution for each step of this minimization:

$$\frac{d}{dW}\Big(\big\|I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big\|_F^2 + \lambda\|W\|_2^2\Big) = -2\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big)X_n + 2\lambda W$$

Since the Hessian is positive definite,

$$\frac{d^2}{dW\,dW^\top}\Big(\big\|I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big\|_F^2 + \lambda\|W\|_2^2\Big) = 2X_n^\top X_n + 2\lambda I_p \succ 0,$$

we can find the minimizing W by setting the first derivative to 0:

$$0 = -2\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big)X_n + 2\lambda W$$
$$\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1} - WX_n^\top\big)X_n = \lambda W$$
$$\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1}\big)X_n = \lambda W + WX_n^\top X_n = \big(\lambda + \|X_n\|_2^2\big)W$$
$$W^* = \frac{\big(I_p - \underline{W}_{n-1}\underline{X}_{n-1}\big)X_n}{\lambda + \|X_n\|_2^2}$$

Proposition 3 (W-decorrelated estimator and time discounting in the two-arm bandit setting). Suppose we have a 2-arm bandit. Ai is an indicator that equals 1 if arm 1 is chosen for the ith sample, and 0 if arm 0 is chosen. We define Xi[1Ai,Ai]2. We assume the following model of rewards:

Ri=Xiβ+ϵi=Aiβ1+(1Ai)β0+ϵi

We further assume that {ϵi}i=1n are a martingale difference sequence with respect to filtration {Gi}i=1n. We also assume that Xi are non-anticipating with respect to filtration {Gi}i=1n. Note the W-decorrelated estimator:

β^d=β^OLS+W_n(RnX_nβ^OLS)

We show that for W_n=[W1;W2;;Wn]p×n and choice of constant λ,

$$W_i = \begin{bmatrix} \frac{1-A_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{0,i-1}} \\[4pt] \frac{A_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{1,i-1}} \end{bmatrix} \in \mathbb{R}^2$$

Moreover, we show that the W-decorrelated estimator for the mean of arm 1, β1, is as follows:

$$\hat\beta_1^d = \left(1 - \sum_{i=1}^n \frac{A_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{1,i-1}}\right)\hat\beta_1^{OLS} + \sum_{i=1}^n \frac{A_iR_i}{\lambda+1}\left(1-\frac{1}{\lambda+1}\right)^{N_{1,i-1}}$$

where $\hat\beta_1^{OLS} = \frac{\sum_{i=1}^n A_iR_i}{N_{1,n}}$ for $N_{1,n} = \sum_{i=1}^n A_i$. Since [6] require that λ ≥ 1 for their CLT results to hold, the W-decorrelated estimator down-weights samples drawn later in the study and up-weights earlier samples.

Proof: Recall the formula for Wi,

Wi=(I_pW_i1X_i1)Xiλ+Xi22

We let $W_i = [W_{0,i}, W_{1,i}]^\top$. For notational simplicity, we let $r = \frac{1}{\lambda+1}$. We now solve for the first few columns:

$$W_{1,1} = (1-0)\,rA_1 = rA_1$$
$$W_{1,2} = (1 - W_{1,1}A_1)\,rA_2 = (1-rA_1)\,rA_2$$
$$W_{1,3} = \Big(1-\sum_{i=1}^2 W_{1,i}A_i\Big)rA_3 = \big(1-rA_1-(1-rA_1)rA_2\big)rA_3 = (1-rA_1)(1-rA_2)\,rA_3$$
$$W_{1,4} = \Big(1-\sum_{i=1}^3 W_{1,i}A_i\Big)rA_4 = (1-rA_1)(1-rA_2)(1-rA_3)\,rA_4$$

We have that for arbitrary n,

$$W_{1,n} = \Big(1-\sum_{i=1}^{n-1} W_{1,i}A_i\Big)rA_n = rA_n\prod_{i=1}^{n-1}(1-rA_i) = rA_n(1-r)^{\sum_{i=1}^{n-1}A_i} = rA_n(1-r)^{N_{1,n-1}}$$

By symmetry, we have that

$$W_{0,n} = \Big(1-\sum_{i=1}^{n-1} W_{0,i}(1-A_i)\Big)r(1-A_n) = r(1-A_n)(1-r)^{N_{0,n-1}}$$

Note the W-decorrelated estimator for β1:

$$\hat\beta_1^d = \hat\beta_1^{OLS} + \sum_{i=1}^n A_i\big(R_i - \hat\beta_1^{OLS}\big)\,r(1-r)^{N_{1,i-1}} = \left(1 - \sum_{i=1}^n A_i\,r(1-r)^{N_{1,i-1}}\right)\hat\beta_1^{OLS} + \sum_{i=1}^n A_iR_i\,r(1-r)^{N_{1,i-1}}$$
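The recursion for $\underline{W}_n$ and Proposition 3's closed form can be cross-checked numerically. The sketch below (our code, with λ and the action sequence chosen arbitrarily) builds the columns $W_i$ from the per-step minimizer $W_i = (I_p - \underline{W}_{i-1}\underline{X}_{i-1})X_i/(\lambda + \|X_i\|_2^2)$ and compares the arm-1 weights against $r A_i (1-r)^{N_{1,i-1}}$ with $r = 1/(\lambda+1)$:

```python
import numpy as np

def w_columns(X, lam):
    """Recursively build W_n column by column from the closed-form
    per-step minimizer W_i = (I - W_{i-1} X_{i-1}) X_i / (lam + ||X_i||^2).
    X: (n, p) design matrix; returns W of shape (p, n)."""
    n, p = X.shape
    W = np.zeros((p, n))
    M = np.eye(p)                      # tracks I_p - W_{i-1} X_{i-1}
    for i in range(n):
        x = X[i]
        W[:, i] = M @ x / (lam + x @ x)
        M = M - np.outer(W[:, i], x)   # update to I_p - W_i X_i
    return W

lam = 2.0
r = 1.0 / (lam + 1.0)
A = np.array([1, 0, 1, 1, 0, 1])
X = np.column_stack([1 - A, A]).astype(float)   # X_i = [1 - A_i, A_i]
W = w_columns(X, lam)

# Closed form from Proposition 3 for the arm-1 weights:
N1_prev = np.concatenate([[0], np.cumsum(A)[:-1]])
closed = r * A * (1 - r) ** N1_prev
```

Because each $\|X_i\|_2^2 = 1$ in the two-arm design, the recursion collapses to the geometric down-weighting shown in the proof, and `W[1]` matches `closed` exactly.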

Footnotes

1. Note that the validity of bootstrap methods relies on uniform convergence [35].

2. Assume an analogous moment condition for the contextual bandit case, where $G_t^{(n)}$ is replaced by $F_t^{(n)}$.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Contributor Information

Kelly W. Zhang, Department of Computer Science, Harvard University

Lucas Janson, Departments of Statistics, Harvard University.

Susan A. Murphy, Departments of Statistics and Computer Science, Harvard University

References

  • [1]. Abbasi-Yadkori Yasin, Pál Dávid, and Szepesvári Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [2]. Agarwal Arpit, Agarwal Shivani, Assadi Sepehr, and Khanna Sanjeev. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75, 2017.
  • [3]. Amemiya Takeshi. Advanced Econometrics. Harvard University Press, 1985.
  • [4]. Brannath W, Gutjahr G, and Bauer P. Probabilistic foundation of confirmatory adaptive designs. Journal of the American Statistical Association, 107(498):824–832, 2012.
  • [5]. Deshpande Yash, Javanmard Adel, and Mehrabi Mohammad. Online debiasing for adaptively collected high-dimensional data. arXiv preprint arXiv:1911.01040, 2019.
  • [6]. Deshpande Yash, Mackey Lester, Syrgkanis Vasilis, and Taddy Matt. Accurate inference for adaptive linear models. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1194–1203, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [7]. Druce Katie L, Dixon William G, and McBeth John. Maximizing engagement in mobile health studies: lessons learned and future directions. Rheumatic Disease Clinics, 45(2):159–172, 2019.
  • [8]. Dvoretzky Aryeh. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
  • [9]. Eysenbach Gunther. The law of attrition. Journal of Medical Internet Research, 7(1):e11, 2005.
  • [10]. Gao Zijun, Han Yanjun, Ren Zhimei, and Zhou Zhengqing. Batched multi-armed bandits problem. Conference on Neural Information Processing Systems, 2019.
  • [11]. Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
  • [12]. Han Yanjun, Zhou Zhengqing, Zhou Zhengyuan, Blanchet Jose, Glynn Peter W, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
  • [13]. Hazelton Martin L. Methods of Moments Estimation, pages 816–817. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
  • [14]. Howard Steven R, Ramdas Aaditya, McAuliffe Jon, and Sekhon Jasjeet. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
  • [15]. Imbens Guido W and Rubin Donald B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • [16]. Jamieson Kevin, Malloy Matthew, Nowak Robert, and Bubeck Sébastien. lil'ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
  • [17]. Jamieson Kevin and Nowak Robert. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
  • [18]. Jun Kwang-Sung, Jamieson Kevin G, Nowak Robert D, and Zhu Xiaojin. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148, 2016.
  • [19]. Kasy Maximilian. Uniformity and the delta method. Journal of Econometric Methods, 8(1), 2019.
  • [20]. Kaufmann Emilie and Koolen Wouter. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419, 2018.
  • [21]. Kizilcec René F, Piech Chris, and Schneider Emily. Deconstructing disengagement: analyzing learner subpopulations in massive open online courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, pages 170–179, 2013.
  • [22]. Kizilcec René F, Reich Justin, Yeomans Michael, Dann Christoph, Brunskill Emma, Lopez Glenn, Turkay Selen, Williams Joseph Jay, and Tingley Dustin. Scaling up behavioral science interventions in online education. Proceedings of the National Academy of Sciences, 117(26):14900–14905, 2020.
  • [23]. Klasnja Predrag, Smith Shawna, Seewald Nicholas J, Lee Andy, Hall Kelly, Luers Brook, Hekler Eric B, and Murphy Susan A. Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of heartsteps. Annals of Behavioral Medicine, 53(6):573–582, 2019.
  • [24]. Lai Tze Leung and Wei Ching Zong. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.
  • [25]. Lattimore Tor and Szepesvári Csaba. Bandit Algorithms. Cambridge University Press, 2020.
  • [26]. Li Chang and De Rijke Maarten. Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. arXiv preprint arXiv:1905.12370, 2019.
  • [27]. Li Lihong, Chu Wei, Langford John, and Schapire Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
  • [28]. Liao Peng, Greenewald Kristjan, Klasnja Predrag, and Murphy Susan. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020.
  • [29]. Liu Qing, Proschan Michael A, and Pledger Gordon W. A unified theory of two-stage adaptive designs. Journal of the American Statistical Association, 97(460):1034–1041, 2002.
  • [30]. Luedtke Alexander R and van der Laan Mark J. Parametric-rate inference for one-sided differentiable parameters. Journal of the American Statistical Association, 113(522):780–788, 2018.
  • [31]. Luedtke Alexander R and Van Der Laan Mark J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2):713, 2016.
  • [32]. Nie Xinkun, Tian Xiaoying, Taylor Jonathan, and Zou James. Why adaptively collected data have negative bias and how to correct for it. International Conference on Artificial Intelligence and Statistics, 2018.
  • [33]. Perchet Vianney, Rigollet Philippe, Chassang Sylvain, Snowberg Erik, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
  • [34]. Rafferty Anna, Ying Huiji, and Williams Joseph. Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. JEDM: Journal of Educational Data Mining, 11(1):47–79, 2019.
  • [35]. Romano Joseph P, Shaikh Azeem M, et al. On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6):2798–2822, 2012.
  • [36]. Schwartz Eric M, Bradlow Eric T, and Fader Peter S. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
  • [37]. Shin Jaehyeok, Ramdas Aaditya, and Rinaldo Alessandro. Are sample means in multi-armed bandits positively or negatively biased? In Advances in Neural Information Processing Systems, pages 7100–7109, 2019.
  • [38]. Tang Liang, Jiang Yexi, Li Lei, and Li Tao. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80, 2014.
  • [39]. Villar Sofía S, Bowden Jack, and Wason James. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
  • [40]. Wassmer Gernot and Brannath Werner. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Springer, 2016.
  • [41]. Yao Jiayu, Brunskill Emma, Pan Weiwei, Murphy Susan, and Doshi-Velez Finale. Power-constrained bandits. arXiv preprint arXiv:2004.06230, 2020.
  • [42]. Yom-Tov Elad, Feraru Guy, Kozdoba Mark, Mannor Shie, Tennenholtz Moshe, and Hochberg Irit. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of Medical Internet Research, 19(10):e338, 2017.
