Statistical Inference with M-Estimators on Adaptively Collected Data

Kelly W Zhang; Lucas Janson; Susan A Murphy

Author manuscript; available in PMC: 2022 Jun 24.

Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;34:7460–7471.

PMCID: PMC9232184 NIHMSID: NIHMS1762664 PMID: 35757490

Abstract

Bandit algorithms are increasingly used in real-world sequential decision-making problems. With this comes a growing desire to use the resulting datasets to answer scientific questions such as: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators, which include estimators based on empirical risk minimization as well as maximum likelihood, on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.

1. Introduction

Due to the need for interventions that are personalized to users, (contextual) bandit algorithms are increasingly used to address sequential decision making problems in health-care [Yom-Tov et al., 2017, Liao et al., 2020], online education [Liu et al., 2014, Shaikh et al., 2019], and public policy [Kasy and Sautmann, 2021, Caria et al., 2020]. Contextual bandits personalize, that is, minimize regret, by learning to choose the best intervention in each context, i.e., the action that leads to the greatest expected reward. Besides the goal of regret minimization, another critical goal in these real-world problems is to be able to use the resulting data collected by bandit algorithms to advance scientific knowledge [Liu et al., 2014, Erraqabi et al., 2017]. By scientific knowledge, we mean information gained by using the data to conduct a variety of statistical analyses, including confidence interval construction and hypothesis testing. While regret minimization is a within-experiment learning objective, gaining scientific knowledge from the resulting adaptively collected data is a between-experiment learning objective, which ultimately helps with regret minimization between deployments of bandit algorithms. Note that the data collected by bandit algorithms are adaptively collected because previously observed contexts, actions, and rewards are used to inform what actions to select in future timesteps.

There are a variety of between-experiment learning questions encountered in real-life applications of bandit algorithms. For example, in real-life sequential decision-making problems there are often a number of additional scientifically interesting outcomes besides the reward that are collected during the experiment. In the online advertising setting, the reward might be whether an ad is clicked on, but one may also be interested in other outcomes, such as the amount of money spent or the subsequent time spent on the advertiser’s website. If it were found that an ad had a high click-through rate, but little money was spent after clicking on the ad, one may redesign the reward used in the next bandit experiment. One type of statistical analysis would be to construct confidence intervals for the relative effect of the actions on multiple outcomes (in addition to the reward) conditional on the context. Furthermore, due to engineering and practical limitations, some of the variables that might be useful as context are often not accessible to the bandit algorithm online. If after-study analyses find that some such contextual variables have a sufficiently strong influence on the relative usefulness of an action, this might lead investigators to ensure these variables are accessible to the bandit algorithm in the next experiment.

As discussed above, we can gain scientific knowledge from data collected with (contextual) bandit algorithms by constructing confidence intervals and performing hypothesis tests for unknown quantities such as the expected outcome for different actions in various contexts. Unfortunately, standard statistical methods developed for i.i.d. data fail to provide valid inference when applied to data collected with common bandit algorithms. For example, assuming the sample mean of rewards for an arm is approximately normal can lead to unreliable confidence intervals and inflated type-1 error; see Section 3.1 for an illustration. Recently statistical inference methods have been developed for data collected using bandit algorithms [Hadad et al., 2019, Deshpande et al., 2018, Zhang et al., 2020]; however, these methods are limited to inference for parameters of simple models. There is a lack of general statistical inference methods for data collected with (contextual) bandit algorithms in more complex data-analytic settings, including parameters in non-linear models for outcomes; for example, there are currently no methods for constructing valid confidence intervals for the parameters of a logistic regression model for binary outcomes or for constructing confidence intervals based on robust estimators like minimizers of the Huber loss function.

In this work we show that a wide variety of estimators which are frequently used both in science and industry on i.i.d. data, namely, M-estimators [Van der Vaart, 2000], can be used to conduct valid inference on data collected with (contextual) bandit algorithms when adjusted with particular adaptive weights, i.e., weights that are a function of previously collected data. Different forms of adaptive weights are used by existing methods for simple models [Deshpande et al., 2018, Hadad et al., 2019, Zhang et al., 2020]. Our work is a step towards developing a general framework for statistical inference on data collected with adaptive algorithms, including (contextual) bandit algorithms.

2. Problem Formulation

We assume that the data we have after running a contextual bandit algorithm is comprised of contexts $\{X_{t}\}_{t = 1}^{T}$ , actions $\{A_{t}\}_{t = 1}^{T}$ , and primary outcomes $\{Y_{t}\}_{t = 1}^{T}$ . T is deterministic and known. We assume that rewards are a deterministic function of the primary outcomes, i.e., R_t = f(Y_t) for some known function f. We are interested in constructing confidence regions for the parameters of the conditional distribution of Y_t given (X_t, A_t). Below we consider T → ∞ in order to derive the asymptotic distributions of estimators and construct asymptotically valid confidence intervals. We allow the action space $A$ to be finite or infinite. We use potential outcome notation [Imbens and Rubin, 2015] and let $\{Y_{t} (a) : a \in A\}$ denote the potential outcomes of the primary outcome and let Y_t ≔ Y_t(A_t) be the observed outcome. We assume a stochastic contextual bandit environment in which $\{X_{t}, Y_{t} (a) : a \in A\} \overset{i.i.d.}{\sim} P \in P$ for t ∈ [1: T]; the contextual bandit environment distribution $P$ is in a space of possible environment distributions P. We define the history $H_{t} ≔ \{X_{t^{'}}, A_{t^{'}}, Y_{t^{'}}\}_{t^{'} = 1}^{t}$ for t ≥ 1 and $H_{0} ≔ \emptyset$ . Actions $A_{t} \in A$ are selected according to policies π ≔ {π_t}_t≥1, which define action selection probabilities $π_{t} (A_{t}, X_{t}, H_{t - 1}) ≔ P (A_{t} ∣ H_{t - 1}, X_{t})$ . Even though the potential outcomes are i.i.d., the observed data $\{X_{t}, A_{t}, Y_{t}\}_{t = 1}^{T}$ are not because the actions are selected using policies π_t which are a function of past data, $H_{t - 1}$ . Non-independence of observations is a key property of adaptively collected data.

We are interested in constructing confidence regions for some unknown $θ^{*} (P) \in Θ \subset R^{d}$ , which is a parameter of the conditional distribution of Y_t given (X_t, A_t). This work focuses on the setting in which we have a well-specified model for Y_t. Specifically, we assume that $θ^{*} (P)$ is a conditionally maximizing value of criterion m_θ, i.e., for all $P \in P$ ,

θ^{*} (P) \in \underset{θ \in Θ}{argmax} E_{P} [m_{θ} (Y_{t}, X_{t}, A_{t}) ∣ X_{t}, A_{t}] w.p. 1 .

(1)

Note that $θ^{*} (P)$ does not depend on (X_t, A_t) and it is an implicit modelling assumption that such a $θ^{*} (P)$ exists for a given m_θ. Note that this formulation includes semi-parametric models, e.g., the model could constrain the conditional mean of Y_t to be linear in some function of the actions and context, but allow the residuals to follow any mean-zero distribution, including ones that depend on the actions and/or contexts.

To estimate $θ^{*} (P)$ , we build on M-estimation [Huber, 1992], which classically selects the estimator $\hat{θ}$ to be the θ ∈ Θ that maximizes the empirical analogue of Equation (1):

{\hat{θ}}_{T} ≔ \underset{θ \in Θ}{argmax} \frac{1}{T} \sum_{t = 1}^{T} m_{θ} (Y_{t}, X_{t}, A_{t}) .

(2)

For example, in a classical linear regression setting with $∣ A ∣ < \infty$ actions, a natural choice for m_θ is the negative of the squared loss function, $m_{θ} (Y_{t}, X_{t}, A_{t}) = - (Y_{t} - X_{t}^{T} θ_{A_{t}})^{2}$ . When Y_t is binary, a natural choice is instead the log-likelihood function for a logistic regression model, i.e., $m_{θ} (Y_{t}, X_{t}, A_{t}) = Y_{t} X_{t}^{T} θ_{A_{t}} - \log (1 + exp (X_{t}^{T} θ_{A_{t}}))$ ; since the estimator maximizes the criterion, it is the log-likelihood itself, not its negative, that plays the role of m_θ. More generally, m_θ is commonly chosen to be a log-likelihood function or the negative of a robust loss function such as the Huber loss. If the data, $\{X_{t}, A_{t}, Y_{t}\}_{t = 1}^{T}$ , were independent across time, classical approaches could be used to prove the consistency and asymptotic normality of M-estimators [Van der Vaart, 2000]. However, on data collected with bandit algorithms, standard M-estimators like the ordinary least-squares estimator fail to provide valid confidence intervals [Hadad et al., 2019, Deshpande et al., 2018, Zhang et al., 2020]. In this work, we show that M-estimators can still be used to provide valid statistical inference on adaptively collected data when adjusted with well-chosen adaptive weights.
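To make the role of the criterion m_θ concrete, the following is a minimal sketch of the two example criteria and of the unweighted empirical criterion of Equation (2). Both criteria are written so that larger values indicate a better fit, as the argmax in Equation (2) requires. The function names and the array layout (one parameter row per action) are assumptions made for illustration only and are not taken from the paper.

```python
import numpy as np

def m_squared_loss(theta, y, x, a):
    """Negative squared loss; theta has shape (num_actions, d)."""
    return -(y - x @ theta[a]) ** 2

def m_logistic(theta, y, x, a):
    """Bernoulli log-likelihood with a logistic link, for y in {0, 1}."""
    eta = x @ theta[a]
    # np.logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow.
    return y * eta - np.logaddexp(0.0, eta)

def empirical_criterion(m, theta, ys, xs, actions):
    """Unweighted empirical criterion: the average of m_theta over t = 1..T."""
    return np.mean([m(theta, y, x, a) for y, x, a in zip(ys, xs, actions)])
```

An M-estimator in the classical sense would then be any maximizer of `empirical_criterion` over θ; how that maximization is carried out (closed form, Newton steps, a generic optimizer) is left open here.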

3. Adaptively Weighted M-Estimators

We consider a weighted M-estimation criterion with adaptive weights $W_{t} \in σ (H_{t - 1}, X_{t}, A_{t})$ given by $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ . Here $\{π_{t}^{sta}\}_{t \geq 1}$ are pre-specified stabilizing policies that do not depend on data {Y_t, X_t, A_t}_t≥1. A default choice for the stabilizing policy when the action space is of size $∣ A ∣ < \infty$ is just $π_{t}^{sta} (a, x) = 1 ∕ ∣ A ∣$ for all x, a, and t; we discuss considerations for the choice of $\{π_{t}^{sta}\}_{t = 1}^{T}$ in Section 3.3. We call these weights square-root importance weights because they are the square-root of the standard importance weights [Hammersley, 2013, Wang et al., 2017]. Our proposed estimator for $θ^{*} (P)$ , ${\hat{θ}}_{T}$ , is the maximizer of a weighted version of the M-estimation criterion of Equation (2):

{\hat{θ}}_{T} ≔ \underset{θ \in Θ}{argmax} \frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ} (Y_{t}, X_{t}, A_{t}) ≕ \underset{θ \in Θ}{argmax} M_{T} (θ) .
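For the squared-loss criterion restricted to a single arm, the weighted criterion can be maximized in closed form by solving weighted normal equations. The sketch below computes the square-root importance weights and the resulting adaptively weighted least-squares estimate. It assumes the action selection probabilities π_t(A_t, X_t, H_{t−1}) were logged during data collection; the function names and argument layout are hypothetical.

```python
import numpy as np

def sqrt_importance_weights(pi_sta, pi):
    """W_t = sqrt(pi_sta(A_t, X_t) / pi_t(A_t, X_t, H_{t-1})), elementwise."""
    return np.sqrt(np.asarray(pi_sta, dtype=float) / np.asarray(pi, dtype=float))

def aw_least_squares(xs, ys, actions, weights, arm):
    """Maximize -sum_t W_t 1{A_t = arm} (Y_t - X_t^T theta)^2 over theta.

    Solves (sum_t W_t X_t X_t^T) theta = sum_t W_t X_t Y_t over the rounds
    in which `arm` was selected.
    """
    mask = actions == arm
    xw = xs[mask] * weights[mask, None]   # each row is W_t * X_t
    gram = xw.T @ xs[mask]                # sum_t W_t X_t X_t^T
    return np.linalg.solve(gram, xw.T @ ys[mask])
```

When the stabilizing policy coincides with the logged policy, every weight equals 1 and the function reduces to ordinary least squares on the selected arm. Rescaling all weights by a common constant leaves the estimate unchanged.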

Note that M_T(θ) defined above depends on both the data $\{X_{t}, A_{t}, Y_{t}\}_{t = 1}^{T}$ and weights $\{W_{t}\}_{t = 1}^{T}$ . We provide asymptotically valid confidence regions for $θ^{*} (P)$ by deriving the asymptotic distribution of ${\hat{θ}}_{T}$ as T → ∞ and by proving that the convergence in distribution is uniform over $P \in P$ . Such convergence allows us to construct a uniformly asymptotically valid 1 − α level confidence region, C_T(α), for $θ^{*} (P)$ , which is a confidence region that satisfies

\underset{T \to \infty}{lim inf} inf_{P \in P} P_{P, π} (θ^{*} (P) \in C_{T} (α)) \geq 1 - α .

(3)

If C_T(α) were not uniformly valid, then there would exist an ϵ > 0 such that for every sample size T, C_T(α)’s coverage would be below 1 − α − ϵ for some worst-case P_T ∈ P. Confidence regions which are asymptotically valid, but not uniformly asymptotically valid, fail to be reliable in practice [Leeb and Pötscher, 2005, Romano et al., 2012]. Note that on i.i.d. data it is generally straightforward to show that estimators that converge in distribution do so uniformly; however, as discussed in Zhang et al. [2020] and Appendix D, this is not the case on data collected with bandit algorithms.

To construct uniformly valid confidence regions for $θ^{*} (P)$ we prove that ${\hat{θ}}_{T}$ is uniformly asymptotically normal in the following sense:

Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P,

(4)

where ${\ddot{M}}_{T} (θ) ≔ \frac{\partial^{2}}{\partial θ^{2}} M_{T} (θ)$ and $Σ_{T} (P) ≔ \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})^{\otimes 2}]$ . We define ${\dot{m}}_{θ} ≔ \frac{\partial}{\partial θ} m_{θ}$ . Similarly, we define ${\ddot{m}}_{θ}$ and ${\overset{⃛}{m}}_{θ}$ as the second and third partial derivatives of m_θ with respect to θ, respectively. For any vector z we define z^⊗2 ≔ zz^⊤.

3.1. Intuition for Square-Root Importance Weights

The critical role of the square-root importance weights $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ is to adjust for instability in the variance of M-estimators due to the bandit algorithm. These weights act akin to standard importance weights when squared and adjust a key term in the variance of M-estimators from depending on adaptive policies $\{π_{t}\}_{t = 1}^{T}$ , which can be ill-behaved, to depending on the pre-specified stabilizing policies $\{π_{t}^{sta}\}_{t = 1}^{T}$ . See Zhang et al. [2020] and Deshpande et al. [2018] for more discussion of the ill-behavior of the action selection probabilities for common bandit algorithms, which occurs particularly when there is no unique optimal policy.

As an illustrative example, consider the least-squares estimators in a finite-arm linear contextual bandit setting. Assume that $E_{P} [Y_{t} ∣ X_{t}, A_{t} = a] = X_{t}^{T} θ_{a}^{*} (P)$ w.p. 1. We focus on estimating $θ_{a}^{*} (P)$ for some $a \in A$ . The least-squares estimator corresponds to an M-estimator with $m_{θ_{a}} (Y_{t}, X_{t}, A_{t}) = - 1_{A_{t} = a} (Y_{t} - X_{t}^{T} θ_{a})^{2}$ . The adaptively weighted least-squares (AW-LS) estimator is ${\hat{θ}}_{T, a}^{AW-LS} ≔ {argmax}_{θ_{a}} \{- \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a} (Y_{t} - X_{t}^{T} θ_{a})^{2}\}$ . For simplicity, suppose that the stabilizing policy does not change with t and drop the index t to get π^sta. Setting the derivative of this criterion to zero, we get $0 = \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a} X_{t} (Y_{t} - X_{t}^{T} {\hat{θ}}_{T, a}^{AW-LS})$ , and rearranging terms gives

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a} X_{t} X_{t}^{T} ({\hat{θ}}_{T, a}^{AW-LS} - θ_{a}^{*} (P)) = \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a} X_{t} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P)) .

(5)

Note that the summands on the right-hand side of Equation (5) form a martingale difference sequence with respect to the history $\{H_{t}\}_{t = 0}^{T}$ because $E_{P, π} [W_{t} 1_{A_{t} = a} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P)) ∣ H_{t - 1}] = 0$ for all t; by the law of iterated expectations and since $W_{t} \in σ (H_{t - 1}, X_{t}, A_{t})$ , $E_{P, π} [W_{t} 1_{A_{t} = a} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P)) ∣ H_{t - 1}]$ equals

E_{P} [W_{t} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [Y_{t} - X_{t}^{T} θ_{a}^{*} (P) ∣ H_{t - 1}, X_{t}, A_{t} = a] ∣ H_{t - 1}] \underset{(i)}{=} E_{P} [W_{t} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [Y_{t} - X_{t}^{T} θ_{a}^{*} (P) ∣ X_{t}, A_{t} = a] ∣ H_{t - 1}] \underset{(ii)}{=} 0 .

(i) holds by our i.i.d. potential outcomes assumption. (ii) holds since $E_{P} [Y_{t} ∣ X_{t}, A_{t} = a] = X_{t}^{T} θ_{a}^{*} (P)$ . We prove that (5) is uniformly asymptotically normal by applying a martingale central limit theorem (Appendix B.4). The key condition in this theorem is that the conditional variance converges uniformly, for which it is sufficient to show that the conditional covariance of $W_{t} 1_{A_{t} = a} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P))$ given $H_{t - 1}$ equals some positive-definite matrix $Σ (P)$ for every t, i.e.,

E_{P, π} [W_{t}^{2} 1_{A_{t} = a} X_{t} X_{t}^{T} {(Y_{t} - X_{t}^{T} θ_{a}^{*} (P))}^{2} ∣ H_{t - 1}] = Σ (P) .

(6)

By law of iterated expectations, $E_{P, π} [W_{t}^{2} 1_{A_{t} = a} X_{t} X_{t}^{T} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P))^{2} ∣ H_{t - 1}]$ equals

E_{P} [E_{P, π} [\frac{π^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})} 1_{A_{t} = a} X_{t} X_{t}^{T} {(Y_{t} - X_{t}^{T} θ_{a}^{*} (P))}^{2} ∣ H_{t - 1}, X_{t}] ∣ H_{t - 1}] \underset{(a)}{=} E_{P} [E_{P, π^{sta}} [1_{A_{t} = a} X_{t} X_{t}^{T} {(Y_{t} - X_{t}^{T} θ_{a}^{*} (P))}^{2} ∣ H_{t - 1}, X_{t}] ∣ H_{t - 1}] \underset{(b)}{=} E_{P} [E_{P, π^{sta}} [1_{A_{t} = a} X_{t} X_{t}^{T} {(Y_{t} - X_{t}^{T} θ_{a}^{*} (P))}^{2} ∣ X_{t}] ∣ H_{t - 1}] \underset{(c)}{=} E_{P} [E_{P, π^{sta}} [1_{A_{t} = a} X_{t} X_{t}^{T} {(Y_{t} - X_{t}^{T} θ_{a}^{*} (P))}^{2} ∣ X_{t}]] \underset{(d)}{=} E_{P, π^{sta}} [1_{A_{t} = a} X_{t} X_{t}^{T} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P))^{2}] ≕ Σ (P) .

(7)

Above, (a) holds because the importance weights change the sampling measure from the adaptive policy π_t to the pre-specified stabilizing policy π^sta. (b) holds by our i.i.d. potential outcomes assumption and because π^sta is a pre-specified policy. (c) holds because X_t does not depend on $H_{t - 1}$ by our i.i.d. potential outcomes assumption. (d) holds by the law of iterated expectations. Note that $Σ (P)$ does not depend on t because π^sta is not time-varying. In contrast, without the adaptive weighting, i.e., when W_t = 1, the conditional covariance of $1_{A_{t} = a} (Y_{t} - X_{t}^{T} θ_{a}^{*} (P))$ given $H_{t - 1}$ is a random variable, because it depends on the adaptive policy π_t.

In Figure 1 we plot the empirical distributions of the z-statistic for the least-squares estimator both with and without adaptive weighting. We consider a two-armed bandit with A_t ∈ {0, 1}. Let $θ_{1}^{*} (P) ≔ E_{P} [Y_{t} (1)]$ and $m_{θ_{1}} (Y_{t}, A_{t}) ≔ - A_{t} (Y_{t} - θ_{1})^{2}$ . The unweighted version, i.e., the ordinary least-squares (OLS) estimator, is ${\hat{θ}}_{T, 1}^{OLS} ≔ {argmax}_{θ_{1}} \frac{1}{T} \sum_{t = 1}^{T} m_{θ_{1}} (Y_{t}, A_{t})$ . The adaptively weighted version is ${\hat{θ}}_{T, 1}^{AW-LS} ≔ {argmax}_{θ_{1}} \frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ_{1}} (Y_{t}, A_{t})$ . We collect data using Thompson Sampling and use a uniform stabilizing policy where π^sta(1) = π^sta(0) = 0.5. It is clear that the least-squares estimator with adaptive weighting has a z-statistic that is much closer to a normal distribution.
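A comparison of this kind can be sketched with a small simulation. The sketch below is not the authors' experimental code and makes several assumptions that the text above does not specify: both arms have identical N(0, 1) outcomes (so there is no unique optimal arm), Thompson Sampling uses independent N(0, 1) priors with known unit noise variance, and the probability of selecting arm 1 is clipped to [0.1, 0.9] so that the importance ratios stay bounded. Because the simulation knows the truth, the weighted z-statistic uses the true value Σ(P) = 4 π^sta(1) σ² = 2 and M̈_T = −(2/T) Σ_t W_t A_t.

```python
import math
import numpy as np

def simulate_z_statistics(T=100, reps=2000, clip=0.1, seed=0):
    """Return (z_ols, z_aw) for arm 1 across `reps` independent bandit runs."""
    rng = np.random.default_rng(seed)
    norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    theta_star, sigma, pi_sta = 0.0, 1.0, 0.5   # assumed environment and pi_sta
    counts = np.zeros((reps, 2))                # number of pulls per arm
    sums = np.zeros((reps, 2))                  # sum of outcomes per arm
    weighted_resid = np.zeros(reps)             # sum_t W_t A_t (Y_t - theta*)
    rows = np.arange(reps)
    for _ in range(T):
        # Gaussian posterior per arm under a N(0, 1) prior and unit noise.
        post_mean = sums / (counts + 1.0)
        post_var = 1.0 / (counts + 1.0)
        gap = post_mean[:, 1] - post_mean[:, 0]
        # Thompson Sampling probability that arm 1's posterior draw is larger.
        p1 = np.clip(norm_cdf(gap / np.sqrt(post_var.sum(axis=1))), clip, 1.0 - clip)
        a = (rng.random(reps) < p1).astype(int)
        y = theta_star + sigma * rng.standard_normal(reps)
        w = np.sqrt(pi_sta / p1)                # weight used when arm 1 is chosen
        weighted_resid += w * a * (y - theta_star)
        counts[rows, a] += 1.0
        sums[rows, a] += y
    # Sigma^{-1/2} * M_ddot * sqrt(T) * (theta_hat_AW - theta*), simplified.
    z_aw = -math.sqrt(2.0) * weighted_resid / (sigma * math.sqrt(T))
    n1 = np.maximum(counts[:, 1], 1.0)          # guard against an unpulled arm
    z_ols = np.sqrt(n1) * (sums[:, 1] / n1 - theta_star) / sigma
    return z_ols, z_aw
```

Plotting histograms of `z_ols` and `z_aw` against a standard normal density gives a rough analogue of the comparison described above. Under these assumptions each summand of the weighted statistic has conditional variance exactly 1/T, which is the stabilization that Equation (7) describes.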

The square-root importance weights are a form of variance stabilizing weights, akin to those introduced in Hadad et al. [2019] for estimating means and differences in means on data collected with multi-armed bandits. In fact, in the special case that $∣ A ∣ < \infty$ and $ϕ (X_{t}, A_{t}) = [1_{A_{t} = 1}, 1_{A_{t} = 2}, \dots, 1_{A_{t} = ∣ A ∣}]^{T}$ , the adaptively weighted least-squares estimator is equivalent to the weighted average estimator of Hadad et al. [2019]. See Section 4 for more on Hadad et al. [2019].

3.2. Asymptotic Normality and Confidence Regions

We now discuss conditions under which the adaptively weighted M-estimators are asymptotically normal in the sense of Equation (4). In general, our conditions differ from those made for standard M-estimators on i.i.d. data because (i) the data is adaptively collected, i.e., π_t can depend on $H_{t - 1}$ and (ii) we ensure uniform convergence over $P \in P$ , which is stronger than guaranteeing convergence pointwise for each $P \in P$ .

Condition 1 (Stochastic Bandit Environment). Potential outcomes $\{X_{t}, Y_{t} (a) : a \in A\} \overset{i.i.d.}{\sim} P \in P$ over t ∈ [1: T].

Condition 1 implies that Y_t is independent of $H_{t - 1}$ given X_t and A_t, and the conditional distribution Y_t ∣ X_t, A_t is invariant over time. Also note that action space $A$ can be finite or infinite.

Condition 2 (Differentiable). The first three derivatives of m_θ (y, x, a) with respect to θ exist for every θ ∈ Θ, every $a \in A$ , and every (x, y) in the joint support of $\{P : P \in P\}$ .

Condition 3 (Bounded Parameter Space). For all $P \in P$ , $θ^{*} (P) \in Θ$ , a bounded open subset of $R^{d}$ .

Condition 4 (Lipschitz). There exists some real-valued function g such that (i) ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}]$ is bounded and (ii) for all θ, θ′ ∈ Θ,

∣ m_{θ} (Y_{t}, X_{t}, A_{t}) - m_{θ^{'}} (Y_{t}, X_{t}, A_{t}) ∣ \leq g (Y_{t}, X_{t}, A_{t}) ‖ θ - θ^{'} ‖_{2} .

Conditions 3 and 4 together restrict the complexity of the function m in order to ensure a martingale law of large numbers result holds uniformly over functions {m_θ : θ ∈ Θ}; this is used to prove the consistency of ${\hat{θ}}_{T}$ . Similar conditions are commonly used to prove consistency of M-estimators based on i.i.d. data, although the boundedness of the parameter space can be dropped when m_θ is a concave function of θ for all Y_t, A_t, X_t (as it is in many canonical examples such as least squares) [Van der Vaart, 2000, Engle, 1994, Bura et al., 2018]; we expect that a similar result would hold for adaptively weighted M-estimators.

Condition 5 (Moments). The fourth moments of $m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})$ , ${\dot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})$ , and ${\ddot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})$ with respect to $P$ and policy $π_{t}^{sta}$ are bounded uniformly over $P \in P$ and t ≥ 1. For all sufficiently large T, the minimum eigenvalue of $Σ_{T} (P) ≔ \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})^{\otimes 2}]$ is bounded below by some constant $δ_{{\dot{m}}^{2}} > 0$ for all $P \in P$ .

Condition 5 is similar to those of Van der Vaart [2000, Theorem 5.41]. However, to guarantee uniform convergence we assume that moment bounds hold uniformly over $P \in P$ and t ≥ 1.

Condition 6 (Third Derivative Domination). For $B \in R^{d \times d \times d}$ , we define $‖ B ‖_{1} ≔ \sum_{i = 1}^{d} \sum_{j = 1}^{d} \sum_{k = 1}^{d} ∣ B_{i, j, k} ∣$ . There exists a function $\overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) \in R^{d \times d \times d}$ such that (i) ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}^{2}]$ is bounded and (ii) for all $P \in P$ there exists some $ϵ_{\overset{⃛}{m}} > 0$ such that the following holds with probability 1,

sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ \leq ϵ_{\overset{⃛}{m}}} ‖ {\overset{⃛}{m}}_{θ} (Y_{t}, X_{t}, A_{t}) ‖_{1} \leq ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} .

Condition 6 is again similar to those in classical M-estimator asymptotic normality proofs [Van der Vaart, 2000, Theorem 5.41].

Condition 7 (Maximizing Solution).

(i) For all $P \in P$ , there exists a $θ^{*} (P) \in Θ$ such that (a) $θ^{*} (P) \in {argmax}_{θ \in Θ} E_{P} [m_{θ} (Y_{t}, X_{t}, A_{t}) ∣ X_{t}, A_{t}]$ w.p. 1, (b) $E_{P} [{\dot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t}) ∣ X_{t}, A_{t}] = 0$ w.p. 1, and (c) $E_{P} [{\ddot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t}) ∣ X_{t}, A_{t}] \preceq 0$ w.p. 1.

(ii) There exists some positive definite matrix H such that $- \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{\ddot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})] \succeq H$ for all $P \in P$ and all sufficiently large T.

For matrices A, B, we define A ⪰ B to mean that A − B is positive semi-definite, as used above. Condition 7 (i) ensures that $θ^{*} (P)$ is a conditionally maximizing solution for all contexts X_t and actions A_t; this ensures that $\{{\dot{m}}_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})\}_{t = 1}^{T}$ is a martingale difference sequence with respect to $\{H_{t}\}_{t = 1}^{T}$ . Note it does not require $θ^{*} (P)$ to always be a conditionally unique optimal solution. Condition 7 (ii) is related to the local curvature at the maximizing solution and the analogous condition in the i.i.d. setting is trivially satisfied; we specifically use this condition to ensure we can replace ${\ddot{M}}_{T} (θ^{*} (P))$ with ${\ddot{M}}_{T} ({\hat{θ}}_{T})$ in our asymptotic normality result, i.e., that ${\ddot{M}}_{T} (θ^{*} (P))^{- 1} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \overset{P}{\to} I_{d}$ uniformly over $P \in P$ .

Condition 8 (Well-Separated Solution). For all sufficiently large T, for any ϵ > 0, there exists some δ > 0 such that for all $P \in P$ ,

inf_{θ \in Θ : ‖ θ - θ^{*} (P) ‖_{2} > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t}) - m_{θ} (Y_{t}, X_{t}, A_{t})]} \geq δ .

A well-separated solution condition akin to Condition 8 is commonly assumed in order to prove consistency of M-estimators, e.g., see Van der Vaart [2000, Theorem 5.7]. Note that the difference between Condition 7 (i) and Condition 8 is that the former is a conditional statement (conditional on X_t, A_t) and the latter is a marginal statement (marginal over X_t, A_t, where A_t is chosen according to stabilizing policies $π_{t}^{sta}$ ). Condition 7 (i) means there is a $θ^{*} (P)$ solution for all contexts X_t and actions A_t that does not need to be unique, however Condition 8 assumes that marginally over X_t, A_t there is a well-separated solution.

Condition 9 (Bounded Importance Ratios). $\{π_{t}^{sta}\}_{t = 1}^{T}$ do not depend on data $\{Y_{t}, X_{t}, A_{t}\}_{t = 1}^{T}$ . For all t ≥ 1, $ρ_{\min} \leq \frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})} \leq ρ_{\max}$ w.p. 1 for some constants 0 < ρ_min ≤ ρ_max < ∞.

Note that Condition 9 implies that for a stabilizing policy that is not time-varying, the action selection probabilities of the bandit algorithm $π_{t} (A_{t}, X_{t}, H_{t - 1})$ must be bounded away from zero w.p. 1. Similar boundedness assumptions are also made in the off-policy evaluation literature [Thomas and Brunskill, 2016, Kallus and Uehara, 2020]. We discuss this condition further in Sections 3.3 and 6.
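One common way to keep action selection probabilities bounded away from zero is to clip them, although clipping is an assumption of this sketch and not something the condition itself prescribes. The hypothetical helpers below clip a two-arm selection probability and report the smallest and largest importance ratios observed in logged data, which can serve as a rough empirical check on ρ_min and ρ_max.

```python
import numpy as np

def clip_two_arm_probability(p1, pi_min):
    """Clip P(A_t = 1) to [pi_min, 1 - pi_min] so both arms keep mass >= pi_min."""
    return np.clip(p1, pi_min, 1.0 - pi_min)

def importance_ratio_bounds(pi_sta, pi):
    """Smallest and largest observed ratio pi_sta(A_t, X_t) / pi_t(A_t, X_t, H_{t-1})."""
    ratios = np.asarray(pi_sta, dtype=float) / np.asarray(pi, dtype=float)
    return ratios.min(), ratios.max()
```

With a uniform stabilizing policy over two arms and clipping at `pi_min`, the ratio for any selected action lies between 0.5 / (1 − pi_min) and 0.5 / pi_min.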

Theorem 1 (Uniform Asymptotic Normality of Adaptively Weighted M-Estimators). Under Conditions 1-9 we have that ${\hat{θ}}_{T} \overset{P}{\to} θ^{*} (P)$ uniformly over $P \in P$ . Additionally,

Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(8)

The asymptotic normality result of equation (8) guarantees that for d-dimensional $θ^{*} (P)$ ,

\underset{T \to \infty}{lim inf} inf_{P \in P} P_{P, π} (‖ Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) ‖_{2}^{2} \leq χ_{d, (1 - α)}^{2}) = 1 - α .

Above $χ_{d, (1 - α)}^{2}$ is the 1 − α quantile of the χ² distribution with d degrees of freedom. Note that the region $C_{T} (α) ≔ \{θ \in Θ : ‖ Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ) ‖_{2}^{2} \leq χ_{d, (1 - α)}^{2}\}$ defines a d-dimensional hyper-ellipsoid confidence region for $θ^{*} (P)$ . Also note that since ${\ddot{M}}_{T} ({\hat{θ}}_{T})$ does not concentrate under standard bandit algorithms, we cannot use standard arguments to justify treating ${\hat{θ}}_{T}$ as multivariate normal with covariance ${\ddot{M}}_{T} ({\hat{θ}}_{T})^{- 1} Σ_{T} (P) {\ddot{M}}_{T} ({\hat{θ}}_{T})^{- 1}$ . Nevertheless, Theorem 1 can be used to guarantee valid confidence regions for a subset of the entries in $θ^{*} (P)$ by using projected confidence regions [Nickerson, 1994]. Projected confidence regions take a confidence region for all parameters $θ^{*} (P)$ and project it onto the lower dimensional space on which the subset of target parameters lies (Appendix A.2).
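As a hedged illustration of how the statistic in Theorem 1 might be evaluated in practice, the sketch below tests whether a candidate θ falls inside the region for the single-arm adaptively weighted least-squares estimator. It reads the region as the set of θ for which the squared norm of Σ^{−1/2} M̈_T(θ̂_T) √T (θ̂_T − θ) is at most the χ² quantile. Since Σ_T(P) is unknown, the code substitutes the plug-in estimate (1/T) Σ_t W_t² ṁ ṁ^⊤ evaluated at θ̂_T, motivated by step (a) of Equation (7); this plug-in choice, the function names, and the requirement that the caller supplies the quantile are all assumptions of this sketch.

```python
import numpy as np

def aw_ls_region_inputs(xs, ys, actions, weights, arm):
    """Return (theta_hat, m_ddot, sigma_hat, T) for m = -1{A = arm}(Y - X^T theta)^2."""
    mask = actions == arm
    x, y, w = xs[mask], ys[mask], weights[mask]
    T = len(ys)                                  # T counts all rounds, not only `arm`
    xw = x * w[:, None]
    theta_hat = np.linalg.solve(xw.T @ x, xw.T @ y)
    # Second derivative of M_T: -(2 / T) * sum_t W_t 1{A_t = arm} X_t X_t^T.
    m_ddot = (-2.0 / T) * (xw.T @ x)
    # First derivative of m at theta_hat for each round where `arm` was chosen.
    score = 2.0 * x * (y - x @ theta_hat)[:, None]
    # Assumed plug-in estimate of Sigma_T(P): (1 / T) * sum_t W_t^2 score score^T.
    sigma_hat = (score * (w ** 2)[:, None]).T @ score / T
    return theta_hat, m_ddot, sigma_hat, T

def in_confidence_region(theta, theta_hat, m_ddot, sigma_hat, T, chi2_quantile):
    """Check whether T * v^T Sigma^{-1} v <= quantile, with v = M_ddot (theta_hat - theta)."""
    v = m_ddot @ (theta_hat - theta)
    stat = T * (v @ np.linalg.solve(sigma_hat, v))
    return bool(stat <= chi2_quantile)
```

For d = 2 and α = 0.05, the quantile to pass is approximately 5.991. The point estimate itself always lies in the region, since the statistic is zero at θ = θ̂_T.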

3.3. Choice of Stabilizing Policy

When the action space is bounded, using weights $W_{t} = 1 ∕ \sqrt{π_{t} (A_{t}, X_{t}, H_{t - 1})}$ is equivalent to using square-root importance weights with a stabilizing policy that selects actions uniformly over $A$ ; this is because weighted M-estimators are invariant to all weights being scaled by the same constant. It can make sense to choose a non-uniform stabilizing policy in order to prevent the square-root importance weights from growing too large and to ensure Condition 9 holds; disproportionately up-weighting a few observations can lead to unstable estimators. Note that an analogue of our stabilizing policy exists in the causal inference literature, namely, “stabilized weights” use a probability density in the numerator of the weights to prevent them from becoming too large [Robins et al., 2000].

We now discuss how to choose stabilizing policies $\{π_{t}^{sta}\}_{t \geq 1}$ in order to minimize the asymptotic variance of adaptively weighted M-estimators. We focus on the adaptively weighted least-squares estimator when we have a linear outcome model $E_{P} [Y_{t} ∣ X_{t}, A_{t}] = X_{t}^{T} θ_{A_{t}}$ :

{\hat{θ}}^{AW-LS} ≔ \underset{θ \in Θ}{argmax} \{- \frac{1}{T} \sum_{t = 1}^{T} W_{t} (Y_{t} - X_{t}^{T} θ_{A_{t}})^{2}\} .

(9)

Recall that our use of adaptive weights is to adjust for instability in the variance of M-estimators induced by the bandit algorithm in order to construct valid confidence regions; note that weighted estimators are not typically used for this reason. On i.i.d. data, the least-squares criterion is weighted like in Equation (9) in order to minimize the variance of estimators under noise heteroskedasticity; in this setting, the best linear unbiased estimator has weights W_t = 1/σ²(A_t, X_t) where $σ^{2} (A_{t}, X_{t}) ≔ E_{P} [(Y_{t} - X_{t}^{T} θ_{A_{t}}^{*} (P))^{2} ∣ X_{t}, A_{t}]$ ; this up-weights the importance of observations with low noise variance. Intuitively, if we do not need to variance stabilize, {W_t}_t≥1 should be determined by the relative importance of minimizing the errors for different observations, i.e., their noise variance.

In light of this observation, we expect that under homoskedastic noise there is no reason to up-weight some observations over others. This would recommend choosing the stabilizing policy to make $W_{t} = \sqrt{π_{t}^{sta} (A_{t}, X_{t}) ∕ π_{t} (A_{t}, X_{t}, H_{t - 1})}$ as close to 1 as possible, subject to the constraint that the stabilizing policies are pre-specified, i.e., $\{π_{t}^{sta}\}_{t \geq 1}$ do not depend on data {Y_t, X_t, A_t}_t≥1 (see Appendix C for details). Since adjusting for heteroskedasticity and variance stabilization are distinct uses of weights, under heteroskedasticity, we recommend that the weights are combined in the following sense: $W_{t} = (1 ∕ σ^{2} (A_{t}, X_{t})) \sqrt{π_{t}^{sta} (A_{t}, X_{t}) ∕ π_{t} (A_{t}, X_{t}, H_{t - 1})}$ . This would mean that to minimize variance, we still want to choose the stabilizing policies to make $π_{t}^{sta} (A_{t}, X_{t}) ∕ π_{t} (A_{t}, X_{t}, H_{t - 1})$ as close to 1 as possible, subject to the constraint that they are pre-specified.
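The combined weights can be written as a short helper. This is only a sketch: it assumes the noise variances σ²(A_t, X_t) are known or have been estimated separately, which the discussion above does not address, and the function name is hypothetical.

```python
import numpy as np

def combined_weights(pi_sta, pi, noise_var=None):
    """Square-root importance weights, optionally divided by the noise variance.

    pi_sta, pi : stabilizing and logged action selection probabilities.
    noise_var  : sigma^2(A_t, X_t) per observation; None means homoskedastic noise.
    """
    w = np.sqrt(np.asarray(pi_sta, dtype=float) / np.asarray(pi, dtype=float))
    if noise_var is not None:
        w = w / np.asarray(noise_var, dtype=float)
    return w
```

Under homoskedastic noise the `noise_var` argument can be omitted, because a common scaling of all weights does not change a weighted M-estimator.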

4. Related Work

Villar et al. [2015] and Rafferty et al. [2019] empirically illustrate that classical ordinary least squares (OLS) inference methods have inflated Type-1 error when used on data collected with a variety of regret-minimizing multi-armed bandit algorithms. Chen et al. [2020] prove that the OLS estimator is asymptotically normal on data collected with an ϵ-greedy algorithm, but their results do not cover settings in which there is no unique optimal policy, e.g., a multi-armed bandit with two identical arms (Appendix E). Recent work has discussed the non-normality of OLS on data collected with bandit algorithms when there is no unique optimal policy and proposed alternative methods for statistical inference. A common thread between these methods is that they all utilize a form of adaptive weighting. Deshpande et al. [2018] introduced the W-decorrelated estimator, which adjusts the OLS estimator with a sum of adaptively weighted residuals. In the multi-armed bandit setting, the W-decorrelated estimator up-weights observations from early in the study and down-weights observations from later in the study [Zhang et al., 2020]. In the batched bandit setting, Zhang et al. [2020] show that the Z-statistics for the OLS estimators computed separately on each batch are jointly asymptotically normal. Standardizing the OLS statistic for each batch effectively adaptively re-weights the observations in each batch.

Hadad et al. [2019] introduce adaptively weighted versions of both the standard augmented-inverse propensity weighted estimator (AW-AIPW) and the sample mean (AWA) for estimating parameters of simple models on data collected with bandit algorithms. They introduce a class of adaptive “variance stabilizing” weights, for which the variance of a normalized version of their estimators converges in probability to a constant. In their discussion section they note open questions, two of which this work addresses: 1) “What additional estimators can be used for normal inference with adaptively collected data?” and 2) How do their results generalize to more complex sampling designs, like data collected with contextual bandit algorithms? We demonstrate that variance stabilizing adaptive weights can be used to modify a large class of M-estimators to guarantee valid inference. This generalization allows us to perform valid inference for a large class of important inferential targets: parameters of models for expected outcomes that are context dependent.

Recently, adaptive weighting has also been used in off-policy evaluation methods for when the behavior policy (policy used to collect the data) is a contextual bandit algorithm [Bibaut et al., 2021, Zhan et al., 2021]. In this literature the estimand is the value, or average expected reward, of a pre-specified policy (note this is a scalar value). In contrast, in our work we are interested in constructing confidence regions for parameters of a model for an outcome (that could be the reward)—for example, this could be parameters of a logistic regression model for a binary outcome. We believe in the future there could be theory that could unify these adaptive weighting methods for these different estimands.

An alternative to using asymptotic approximations to construct confidence intervals is to use high-probability confidence bounds. These bounds provide stronger guarantees than those based on asymptotic approximations, as they are guaranteed to hold for finite samples. The downside is that these bounds are typically much wider, which is why much of classical statistics uses asymptotic approximations. Here we do the same. In Section 5, we empirically compare our method to the self-normalized martingale bound [Abbasi-Yadkori et al., 2011], a high-probability bound commonly used in the bandit literature.

5. Simulation Results

In this section, R_t = Y_t. We consider two settings: a continuous reward setting and a binary reward setting. In the continuous reward setting, the rewards are generated with mean $E_{P} [R_{t} ∣ X_{t}, A_{t}] = {\tilde{X}}_{t}^{T} θ_{0}^{*} (P) + A_{t} {\tilde{X}}_{t}^{T} θ_{1}^{*} (P)$ and noise drawn from a student’s t distribution with five degrees of freedom; here ${\tilde{X}}_{t} = [1, X_{t}] \in R^{3}$ (X_t with intercept term), actions A_t ∈ {0, 1}, and parameters $θ_{0}^{*} (P)$ , $θ_{1}^{*} (P) \in R^{3}$ . In the binary reward setting, the reward R_t is generated as a Bernoulli with success probability $E_{P} [R_{t} ∣ X_{t}, A_{t}] = [1 + exp (- {\tilde{X}}_{t}^{T} θ_{0}^{*} (P) - A_{t} {\tilde{X}}_{t}^{T} θ_{1}^{*} (P))]^{- 1}$ . Furthermore, in both simulation settings we set $θ_{0}^{*} (P) = [0.1, 0.1, 0.1]$ and $θ_{1}^{*} (P) = [0, 0, 0]$ , so there is no unique optimal arm; we call vector parameter $θ_{1}^{*} (P)$ the advantage of selecting A_t = 1 over A_t = 0. Also in both settings, the contexts X_t are drawn i.i.d. from a uniform distribution.

In both simulation settings we collect data using Thompson Sampling with a linear model for the expected reward and normal priors [Agrawal and Goyal, 2013]; the linear model is used even when the reward is binary. We constrain the action selection probabilities with clipping at a rate of 0.05; this means that while typical Thompson Sampling produces action selection probabilities $π_{t}^{TS} (A_{t}, X_{t}, H_{t - 1})$ , we instead use action selection probabilities $π_{t} (A_{t}, X_{t}, H_{t - 1}) = 0.05 \lor (0.95 \land π_{t}^{TS} (A_{t}, X_{t}, H_{t - 1}))$ to select actions. We constrain the action selection probabilities in order to ensure weights W_t are bounded when using a uniform stabilizing policy; see Sections 3.2 and 6 for more discussion on this boundedness assumption. Also note that increasing the amount the algorithm explores (clipping) decreases the expected width of confidence intervals constructed on the resulting data (see Section 6).
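The clipping rule above can be sketched in a few lines. The helper names `clip_probability` and `max_weight` are hypothetical; the second function only illustrates why clipping bounds the weights when the stabilizing policy is uniform over two actions (so that its probability is 0.5).

```python
def clip_probability(pi_ts, clip=0.05):
    """Clip a Thompson Sampling probability to the interval [clip, 1 - clip]."""
    return max(clip, min(1.0 - clip, pi_ts))

def max_weight(clip=0.05, pi_sta=0.5):
    """Largest possible W_t = sqrt(pi_sta / pi_t) once pi_t >= clip."""
    return (pi_sta / clip) ** 0.5

for raw in (0.001, 0.30, 0.999):
    print(raw, "->", clip_probability(raw))
print("bound on W_t:", max_weight())
```

With two actions, clipping the probability of one action to [0.05, 0.95] automatically keeps the probability of the other action in the same interval.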

To analyze the data, in the continuous reward setting, we use least-squares estimators with a correctly specified model for the expected reward, i.e., M-estimators with $m_{θ} (R_{t}, X_{t}, A_{t}) = - (R_{t} - {\tilde{X}}_{t}^{T} θ_{0} - A_{t} {\tilde{X}}_{t}^{T} θ_{1})^{2}$ . We consider both the unweighted and adaptively weighted versions. We also compare to the self-normalized martingale bound [Abbasi-Yadkori et al., 2011] and the W-decorrelated estimator [Deshpande et al., 2018], as they were both developed for the linear expected reward setting. For the self-normalized martingale bound, which requires explicit bounds on the parameter space, we set $Θ = {θ \in R^{6} : ‖ θ ‖_{2} \leq 6}$ . In the binary reward setting, we also assume a correctly specified model for the expected reward. We use both unweighted and adaptively weighted maximum likelihood estimators (MLEs), which correspond to M-estimators with m_θ(R_t, X_t, A_t) set to the log-likelihood of R_t given X_t, A_t. We solve for these estimators using Newton–Raphson optimization and do not put explicit bounds on the parameter space Θ (note in this case m_θ is concave in θ [Agresti, 2015, Chapter 5.4.2]). See Appendix A for additional details and simulation results.

In Figure 2 we plot the empirical coverage probabilities and volumes of 90% confidence regions for $θ^{*} (P) ≔ [θ_{0}^{*} (P), θ_{1}^{*} (P)]$ and $θ_{1}^{*} (P)$ in both the continuous and binary reward settings. While the confidence regions based on the unweighted least-squares estimator (OLS) and the unweighted MLE have significant undercoverage that does not improve as T increases, the confidence regions based on the adaptively weighted versions, AW-LS and AW-MLE, have very reliable coverage. For the confidence regions for $θ_{1}^{*} (P)$ based on the AW-LS and AW-MLE, we include both projected confidence regions (for which we have theoretical guarantees) and non-projected confidence regions. The confidence regions based on projections are conservative but nevertheless have comparable volume to those based on OLS and MLE respectively. We do not prove theoretical guarantees for the non-projected confidence regions for AW-LS and AW-MLE; however, they perform well across our simulations. Both types of confidence regions based on AW-LS have significantly smaller volumes than those constructed using the self-normalized martingale bound and W-decorrelated estimator. Note that the W-decorrelated estimator and self-normalized martingale bounds are designed for linear contextual bandits and are thus not applicable for the logistic regression model setting. The confidence regions constructed using the self-normalized martingale bound have reliable coverage as well, but are very conservative. Empirically, we found that the coverage probabilities of the confidence regions based on the W-decorrelated estimator were very sensitive to the choice of tuning parameters. We use 5,000 Monte-Carlo repetitions and the error bars plotted are standard errors.

6. Discussion

Immediate questions

We assume that ratios $π_{t}^{sta} (A_{t}, X_{t}) ∕ π_{t} (A_{t}, X_{t}, H_{t - 1})$ are bounded for our theoretical results; this precludes $π_{t} (A_{t}, X_{t}, H_{t - 1})$ from going to zero for a fixed stabilizing policy. For simple models, e.g., the AW-LS estimator, we can let these ratios grow at a certain rate and still guarantee asymptotic normality (Appendix B.5); we conjecture similar results hold more generally.

Generality and robustness

This work assumes that we have a well-specified model for the outcome Y_t, i.e., that $θ^{*} (P) \in {argmax}_{θ \in Θ} E_{P} [m_{θ} (Y_{t}, X_{t}, A_{t}) ∣ X_{t}, A_{t}]$ w.p. 1. Our theorems use this assumption to ensure that ${W_{t} {\dot{m}}_{θ} (Y_{t}, X_{t}, A_{t})}_{t \geq 1}$ is a martingale difference sequence with respect to ${H_{t}}_{t \geq 0}$ . On i.i.d. data it is common to define $θ^{*} (P)$ to be the best projected solution, i.e., $θ^{*} (P) \in {argmax}_{θ \in Θ} E_{P, π} [m_{θ} (Y_{t}, X_{t}, A_{t})]$ . Note that the best projected solution, $θ^{*} (P)$ , depends on the distribution of the action selection policy π. It would be ideal to also be able to perform inference for a projected solution on adaptively collected data.

Another natural question is whether adaptive weighting methods work in Markov Decision Processes (MDP) environments. Taking the AW-LS estimator introduced in Section 3.1 as an example, our conditional variance derivation in Equation (7) fails to hold in an MDP setting, specifically equality (c). However, the conditional variance condition can be satisfied if we instead use weights $W_{t} = {[π_{t}^{sta} (A_{t}, X_{t}) p^{sta} (X_{t})] ∕ [π_{t} (A_{t}, X_{t}, H_{t - 1}) P_{P} (X_{t} ∣ X_{t - 1}, A_{t - 1})]}^{1 ∕ 2}$ where $P_{P}$ are the state transition probabilities and p^sta is a pre-specified distribution over states. In general though we do not expect to know the transition probabilities $P_{P}$ and if we tried to estimate them, our theory would require the estimator to have error $o_{p} (1 ∕ \sqrt{T})$ , below the parametric rate.

Trading-off regret minimization and statistical inference objectives

In sequential decision-making problems there is a fundamental trade-off between minimizing regret and minimizing estimation error for parameters of the environment using the resulting data [Bubeck et al., 2009, Dean et al., 2018]. Given this trade-off there are many open problems regarding how to minimize regret while still guaranteeing a certain amount of power or expected confidence interval width, e.g., developing sample size calculators for use in justifying the number of users in a mobile health trial, and developing new adaptive algorithms [Liu et al., 2014, Erraqabi et al., 2017, Yao et al., 2020].

Figure 2: Empirical coverage probabilities (upper row) and volume (lower row) of 90% confidence ellipsoids. The left two columns are for the linear reward model setting (t-distributed rewards) and the right two columns are for the logistic regression model setting (Bernoulli rewards). We consider confidence ellipsoids for all parameters $θ^{*} (P)$ and for advantage parameters $θ_{1}^{*} (P)$ for both settings.

Acknowledgements and Disclosure of Funding

We thank Yash Nair for feedback on early drafts of this work.

Research reported in this paper was supported by National Institute on Alcohol Abuse and Alcoholism (NIAAA) of the National Institutes of Health under award number R01AA23187, National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award number P50DA039838, National Cancer Institute (NCI) of the National Institutes of Health under award number U01CA229437, and by NIH/NIBIB and OD award number P41EB028242. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Appendix

A. Simulations

A.1. Simulation Details

Simulation Environment

Each dimension of X_t is sampled independently from Uniform(0, 5).
$θ^{*} (P) = [θ_{0}^{*} (P), θ_{1}^{*} (P)] = [0.1, 0.1, 0.1, 0, 0, 0]$ , where $θ_{0}^{*} (P)$ , $θ_{1}^{*} (P) \in R^{3}$ .

Below we also include simulations where $[θ_{0}^{*} (P), θ_{1}^{*} (P)] = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ .
t-Distributed rewards: $R_{t} ∣ X_{t}, A_{t} \sim t_{5} + {\tilde{X}}_{t}^{T} θ_{0}^{*} (P) + A_{t} {\tilde{X}}_{t}^{T} θ_{1}^{*} (P)$ , where t₅ is a t-distribution with 5 degrees of freedom.
Bernoulli rewards: R_t∣X_t, A_t ~ Bernoulli(expit(ν_t)) for $ν_{t} = {\tilde{X}}_{t}^{T} θ_{0}^{*} (P) + A_{t} {\tilde{X}}_{t}^{T} θ_{1}^{*} (P)$ and $e x p i t (x) = \frac{1}{1 + exp (- x)}$ .
Poisson rewards: R_t∣X_t, A_t ~ Poisson(exp(ν_t)) for $ν_{t} = {\tilde{X}}_{t}^{T} θ_{0}^{*} (P) + A_{t} {\tilde{X}}_{t}^{T} θ_{1}^{*} (P)$ .
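The three reward models listed above can be sketched as follows. This is a minimal illustration rather than the code used for the reported simulations; the function `draw_reward`, its arguments, and the choice of a two-dimensional context are assumptions made for the example.

```python
import numpy as np

def expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def draw_reward(rng, x, a, theta0, theta1, kind):
    """Draw one reward given context x, action a in {0, 1}, and parameters."""
    x_tilde = np.concatenate(([1.0], x))            # context with intercept term
    nu = x_tilde @ theta0 + a * (x_tilde @ theta1)  # linear predictor nu_t
    if kind == "t":
        return float(nu + rng.standard_t(df=5))     # t-distributed noise, 5 d.o.f.
    if kind == "bernoulli":
        return float(rng.binomial(1, expit(nu)))
    if kind == "poisson":
        return float(rng.poisson(np.exp(nu)))
    raise ValueError(f"unknown reward type: {kind}")

rng = np.random.default_rng(0)
theta0 = np.array([0.1, 0.1, 0.1])
theta1 = np.zeros(3)                                # no unique optimal arm
x = rng.uniform(0.0, 5.0, size=2)                   # each dimension ~ Uniform(0, 5)
print(draw_reward(rng, x, 1, theta0, theta1, "t"))
```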

Algorithm

Thompson Sampling with $N (0, I_{d})$ priors on each arm.
0.05 clipping
Pre-processing of rewards before they are received by the algorithm:
- Bernoulli: 2R_t − 1
- Poisson: 0.6R_t

Compute Time and Resources

All simulations run within a few hours on a MacBook Pro.

A.2. Details on Constructing Confidence Regions

For notational convenience, we define $Z_{t} = [{\tilde{X}}_{t}, A_{t} {\tilde{X}}_{t}]$ .

A.2.1. Least Squares Estimators

${\hat{θ}}_{T} = {(\sum_{t = 1}^{T} W_{t} Z_{t} Z_{t}^{T})}^{- 1} \sum_{t = 1}^{T} W_{t} Z_{t} R_{t}$
- For unweighted least squares, W_t = 1 and we call the estimator ${\hat{θ}}_{T}^{OLS}$ .
- For adaptively weighted least squares, $W_{t} = \frac{1}{\sqrt{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ ; this is equivalent to using square-root importance weights with a uniform stabilizing policy. We call the estimator ${\hat{θ}}_{T}^{AW-LS}$ .
We assume homoskedastic errors and estimate the noise variance σ² as follows:
${\hat{σ}}_{T}^{2} = \frac{1}{T} \sum_{t = 1}^{T} (R_{t} - Z_{t}^{T} {\hat{θ}}_{T})^{2} .$
We use a Hotelling t-squared test statistic to construct confidence regions for $θ^{*} (P)$ :
$C_{T} (α) = {θ \in R^{d} : {[{\hat{Σ}}_{T}^{- 1 ∕ 2} (\frac{1}{T} \sum_{t = 1}^{T} W_{t} Z_{t} Z_{t}^{T}) \sqrt{T} ({\hat{θ}}_{T} - θ)]}^{\otimes 2} \leq \frac{d (T - 1)}{T - d} F_{d, T - d} (1 - α)} .$ (10)
- For the unweighted least-squares estimator we use the following variance estimator: ${\hat{Σ}}_{T} = {\hat{σ}}_{T}^{2} \frac{1}{T} \sum_{t = 1}^{T} Z_{t} Z_{t}^{T}$ .
- For the AW-Least Squares estimator we use the following variance estimator: ${\hat{Σ}}_{T} = {\hat{σ}}_{T}^{2} \frac{1}{T} \sum_{t = 1}^{T} {\frac{1}{π_{t} (A_{t}, X_{t}, H_{t - 1})}}^{A_{t}} {\frac{1}{1 - π_{t} (A_{t}, X_{t}, H_{t - 1})}}^{1 - A_{t}} Z_{t} Z_{t}^{T}$ .
To construct (non-projected) confidence regions for $θ_{1}^{*} (P) \in R^{d_{1}}$ we treat the unweighted least squares / AW-LS estimators, ${\hat{θ}}_{T, 1}$ , as $N (θ_{1}^{*} (P), \frac{1}{T} {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} Z_{t} Z_{t}^{T})}^{- 1} {\hat{Σ}}_{T} {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} Z_{t} Z_{t}^{T})}^{- 1})$ . We use a Hotelling t-squared test statistic to construct confidence regions for $θ_{1}^{*} (P)$ :
$C_{T} (α) = {θ_{1} \in R^{d_{1}} : {[V_{1, T}^{- 1 ∕ 2} \sqrt{T} ({\hat{θ}}_{T, 1} - θ_{1})]}^{\otimes 2} \leq \frac{d_{1} (T - 1)}{T - d_{1}} F_{d_{1}, T - d_{1}} (1 - α)},$
where V_1,T is the lower right d₁ × d₁ block of matrix ${(\frac{1}{T} \sum_{t = 1}^{T} W_{t} Z_{t} Z_{t}^{T})}^{- 1} {\hat{Σ}}_{T} {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} Z_{t} Z_{t}^{T})}^{- 1}$ . Recall that for the unweighted least squares estimator W_t = 1 and for AW-LS $W_{t} = \frac{1}{\sqrt{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ .
For the AW-least squares estimator, we also construct projected confidence regions for $θ_{1}^{*} (P)$ using the confidence region defined in equation (10). See Section A.2.5 below for more details on constructing projected confidence regions.
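The least-squares construction above can be sketched with NumPy as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, `p_taken` is assumed to hold the probability with which the action actually taken was selected at each timestep, and the squared term in Equation (10) is read as a squared Euclidean norm.

```python
import numpy as np

def weighted_least_squares(Z, R, W):
    """theta_hat = (sum_t W_t Z_t Z_t^T)^{-1} sum_t W_t Z_t R_t."""
    ZW = Z * W[:, None]
    return np.linalg.solve(ZW.T @ Z, ZW.T @ R)

def aw_ls_statistic(Z, R, p_taken, theta):
    """Return the AW-LS estimate and the left-hand side of Equation (10) at theta."""
    T = len(R)
    W = 1.0 / np.sqrt(p_taken)                     # square-root importance weights
    theta_hat = weighted_least_squares(Z, R, W)
    sigma2_hat = np.mean((R - Z @ theta_hat) ** 2) # homoskedastic noise estimate
    M = (Z * W[:, None]).T @ Z / T                 # (1/T) sum_t W_t Z_t Z_t^T
    Sigma_hat = sigma2_hat * (Z / p_taken[:, None]).T @ Z / T
    u = M @ (np.sqrt(T) * (theta_hat - theta))
    # ||Sigma_hat^{-1/2} u||^2 equals u^T Sigma_hat^{-1} u for symmetric Sigma_hat.
    return theta_hat, float(u @ np.linalg.solve(Sigma_hat, u))
```

A candidate θ would then be placed in the confidence region when the returned statistic is at most d(T − 1)/(T − d) times the 1 − α quantile of the F distribution with (d, T − d) degrees of freedom; that quantile could be obtained from, for example, `scipy.stats.f.ppf`.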

A.2.2. MLE Estimators

Distribution	ν	b(ν)	b′(ν)	b″(ν)	b‴(ν)
$N (μ, 1)$	μ	$\frac{1}{2} ν^{2}$	ν = μ	1	0
Poisson(λ)	log λ	exp(ν)	exp(ν) = λ	exp(ν) = λ	exp(ν) = λ
Bernoulli(p)	$\log (\frac{p}{1 - p})$	log(1 + e^ν)	$\frac{e^{ν}}{1 + e^{ν}} = p$	$\frac{e^{ν}}{(1 + e^{ν})^{2}} = p (1 - p)$	p(1 − p)(1 − 2p)


${\hat{θ}}_{T}$ is the root of the score function:
$0 = \sum_{t = 1}^{T} W_{t} (R_{t} - b^{'} ({\hat{θ}}_{T}^{T} Z_{t})) Z_{t} .$

We use Newton–Raphson optimization to solve for ${\hat{θ}}_{T}$ .
- For unweighted MLE, W_t = 1.
- For AW-MLE, $W_{t} = \frac{1}{\sqrt{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ ; this is equivalent to using square-root importance weights with a uniform stabilizing policy.
Derivative of the score function with respect to θ: $- \sum_{t = 1}^{T} W_{t} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T}$ .
We use a Hotelling t-squared test statistic to construct confidence regions for $θ^{*} (P)$ :
$C_{T} (α) = {θ \in R^{d} : {[{\hat{Σ}}_{T}^{- 1 ∕ 2} (\frac{1}{T} \sum_{t = 1}^{T} W_{t} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T}) \sqrt{T} ({\hat{θ}}_{T} - θ)]}^{\otimes 2} \leq \frac{d (T - 1)}{T - d} F_{d, T - d} (1 - α)} .$ (11)
- For the MLE variance estimator, we use ${\hat{Σ}}_{T} = \frac{1}{T} \sum_{t = 1}^{T} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T}$ .
- For the AW-MLE variance estimator, we use ${\hat{Σ}}_{T} = \frac{1}{T} \sum_{t = 1}^{T} {\frac{1}{π_{t} (A_{t}, X_{t}, H_{t - 1})}}^{A_{t}} {\frac{1}{1 - π_{t} (A_{t}, X_{t}, H_{t - 1})}}^{1 - A_{t}} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T}$ .
To construct (non-projected) confidence regions for $θ_{1}^{*} (P) \in R^{d_{1}}$ we treat the MLE / AW-MLE estimators, ${\hat{θ}}_{T, 1}$ , as $N (θ_{1}^{*} (P), \frac{1}{T} {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T})}^{- 1} {\hat{Σ}}_{T} {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T})}^{- 1})$ , in agreement with Equation (11) and with the least-squares case in Section A.2.1. We use a Hotelling t-squared test statistic to construct confidence regions for $θ_{1}^{*} (P)$ :
$C_{T} (α) = {θ_{1} \in R^{d_{1}} : {[V_{1, T}^{- 1 ∕ 2} \sqrt{T} ({\hat{θ}}_{T, 1} - θ_{1})]}^{\otimes 2} \leq \frac{d_{1} (T - 1)}{T - d_{1}} F_{d_{1}, T - d_{1}} (1 - α)},$
where V_1,T is the lower right d₁ × d₁ block of matrix ${(\frac{1}{T} \sum_{t = 1}^{T} W_{t} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T})}^{- 1} {\hat{Σ}}_{T} {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} b^{″} ({\hat{θ}}_{T}^{T} Z_{t}) Z_{t} Z_{t}^{T})}^{- 1}$ .
For the AW-MLE estimator, we also construct projected confidence regions for $θ_{1}^{*} (P)$ using the confidence region defined in equation (11). See Section A.2.5 below for more details on constructing projected confidence regions.
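For the Bernoulli case, solving the weighted score equation by Newton–Raphson can be sketched as below. This is an illustrative sketch under stated assumptions: the function name, the zero starting value, the iteration cap, and the tolerance are choices made for the example, and no step-size control is included.

```python
import numpy as np

def expit(v):
    """b'(nu) for the Bernoulli family: exp(nu) / (1 + exp(nu))."""
    return 1.0 / (1.0 + np.exp(-v))

def weighted_logistic_mle(Z, R, W, n_iter=25, tol=1e-10):
    """Solve 0 = sum_t W_t (R_t - b'(theta^T Z_t)) Z_t by Newton-Raphson."""
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = expit(Z @ theta)
        score = Z.T @ (W * (R - p))
        # Negative derivative of the score: sum_t W_t b''(nu_t) Z_t Z_t^T,
        # where b''(nu) = p (1 - p) for the Bernoulli family.
        info = (Z * (W * p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```

Setting all entries of `W` to 1 gives the unweighted MLE; setting `W` to one over the square root of the action selection probabilities gives the adaptively weighted version described above.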

A.2.3. W-Decorrelated

The following is based on Algorithm 1 of Deshpande et al. [2018].

The W-decorrelated estimator for $θ^{*} (P)$ is constructed as follows with adaptive weights for $W_{t} \in R^{d}$ :
${\hat{θ}}_{T}^{WD} = {\hat{θ}}_{T}^{OLS} + \sum_{t = 1}^{T} W_{t} (R_{t} - Z_{t}^{T} {\hat{θ}}_{T}^{OLS}) .$
The weights are set as follows: $W_{1} = 0 \in R^{d}$ and $W_{t} = (I_{d} - \sum_{s = 1}^{t} \sum_{u = 1}^{t} W_{s} Z_{u}^{T}) Z_{t} \frac{1}{λ_{T} + ‖ Z_{t} ‖_{2}^{2}}$ for t > 1.
We choose $λ_{T} = {mineig}_{0.01} (Z_{t} Z_{t}^{T})$ / log T, where ${mineig}_{α} (Z_{t} Z_{t}^{T})$ represents the α quantile of the minimum eigenvalue of $Z_{t} Z_{t}^{T}$ . This is similar to the procedure used in the simulations of Deshpande et al. [2018] and is guided by Proposition 5 in their paper.
We assume homoskedastic errors and estimate the noise variance σ² as follows:
${\hat{σ}}_{T}^{2} = \frac{1}{T} \sum_{t = 1}^{T} (R_{t} - Z_{t}^{T} {\hat{θ}}_{T}^{OLS})^{2} .$
Confidence ellipsoids for $θ^{*} (P)$ are constructed using a Hotelling t-squared statistic:
$C_{T} (α) = {θ \in R^{d} : ({\hat{θ}}_{T}^{WD} - θ)^{T} V_{T}^{- 1} ({\hat{θ}}_{T}^{WD} - θ) \leq \frac{d (T - 1)}{T - d} F_{d, T - d} (1 - α)}$
where $V_{T} = {\hat{σ}}_{T}^{2} \sum_{t = 1}^{T} W_{t} W_{t}^{T}$ .
Confidence ellipsoids for $θ_{1}^{*} (P) \in R^{d_{1}}$ are constructed as follows, where V_T,1 is the lower right d₁ × d₁ block of matrix V_T:
$C_{T} (α) = {θ_{1} \in R^{d_{1}} : ({\hat{θ}}_{T, 1}^{WD} - θ_{1})^{T} V_{T, 1}^{- 1} ({\hat{θ}}_{T, 1}^{WD} - θ_{1}) \leq \frac{d_{1} (T - 1)}{T - d_{1}} F_{d_{1}, T - d_{1}} (1 - α)} .$

A.2.4. Self-Normalized Martingale Bound

We construct a 1 − α confidence region using the following equation taken from Theorem 2 of Abbasi-Yadkori et al. [2011]:

C_{T} (α) = {θ \in Θ : \sqrt{({\hat{θ}}_{T} - θ)^{T} V_{T} ({\hat{θ}}_{T} - θ)} \leq σ \sqrt{2 \log (\frac{det (V_{T})^{1 ∕ 2} det (λ I_{d})^{- 1 ∕ 2}}{α})} + λ^{1 ∕ 2} S} .

${\hat{θ}}_{T} = {(λ I_{d} + \sum_{t = 1}^{T} Z_{t} Z_{t}^{T})}^{- 1} \sum_{t = 1}^{T} Z_{t} R_{t}$ .
$V_{T} = I_{d} λ + \sum_{t = 1}^{T} Z_{t} Z_{t}^{T}$ .
λ = 1 (ridge regression regularization parameter).
σ = 1 (assumes rewards are σ-subgaussian).
S = 6, where it is assumed that $‖ θ^{*} (P) ‖ \leq S$ (recall that in our simulations $θ^{*} (P) \in R^{6}$ ).
$Θ = {θ \in R^{6} : ‖ θ ‖_{2} \leq 6}$ .
For constructing confidence regions for $θ_{1}^{*} (P)$ , we use projected confidence regions.
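The ridge estimator, the matrix V_T, and the right-hand side of the bound above can be computed as in the following sketch. The function name `ridge_and_radius` is hypothetical, and the default arguments simply mirror the values of λ, σ, and S listed above; the log-determinant is used to avoid overflow for large T.

```python
import numpy as np

def ridge_and_radius(Z, R, alpha, lam=1.0, sigma=1.0, S=6.0):
    """Return the ridge estimate, V_T, and the right-hand side of the bound."""
    d = Z.shape[1]
    V = lam * np.eye(d) + Z.T @ Z
    theta_hat = np.linalg.solve(V, Z.T @ R)
    _, logdet_V = np.linalg.slogdet(V)
    # log( det(V)^{1/2} det(lam I_d)^{-1/2} / alpha )
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(lam) - np.log(alpha)
    radius = sigma * np.sqrt(2.0 * log_ratio) + np.sqrt(lam) * S
    return theta_hat, V, radius
```

Because V_T dominates λ I_d, the logarithm is always at least log(1/α), so the square root is well defined for any α in (0, 1).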

A.2.5. Construction of Projected Confidence Regions

We are interested in getting the confidence ellipsoid of the projection of a d-dimensional ellipsoid onto p-dimensional space, for p < d.

Defining the original d-dimensional ellipsoid, for $x \in R^{d}$ and $B \in R^{d \times d}$ :
$x^{T} Bx = 1$
Partitioning the matrix B and vector x:

For $y \in R^{d - p}$ and $z \in R^{p}$ .
$x = [\begin{matrix} y \\ z \end{matrix}]$

For $C \in R^{d - p \times d - p}$ , $E \in R^{p \times p}$ , and $D \in R^{d - p \times p}$ .
$B = [\begin{matrix} C & D \\ D^{T} & E \end{matrix}]$
Gradient of x^⊤Bx with respect to x:
$(B + B^{T}) x = 2 Bx = 2 [\begin{matrix} C & D \\ D^{T} & E \end{matrix}] [\begin{matrix} y \\ z \end{matrix}] .$

Since we are projecting onto the p-dimensional space, our projection is such that the gradient of x^⊤Bx with respect to y is zero, which means
$Cy + Dz = 0 .$

This means in the projection that y = −C⁻¹Dz.
Returning to our definition of the ellipsoid and plugging in y = −C⁻¹Dz, we have that
$1 = x^{T} Bx = [y^{T} z^{T}] [\begin{matrix} C & D \\ D^{T} & E \end{matrix}] [\begin{matrix} y \\ z \end{matrix}] = y^{T} Cy + 2 z^{T} D^{T} y + z^{T} Ez = (C^{- 1} Dz)^{T} C (C^{- 1} Dz) - 2 z^{T} D^{T} (C^{- 1} Dz) + z^{T} Ez = z^{T} D^{T} C^{- 1} Dz - 2 z^{T} D^{T} C^{- 1} Dz + z^{T} Ez = z^{T} (E - D^{T} C^{- 1} D) z .$

Thus the equation for the final projected ellipsoid is
$z^{T} (E - D^{T} C^{- 1} D) z = 1 .$
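The projected ellipsoid matrix derived above is the Schur complement of C in B, which can be computed as in this sketch. The function name `project_ellipsoid` is hypothetical, and B is assumed to be symmetric positive definite with the coordinates of interest ordered last.

```python
import numpy as np

def project_ellipsoid(B, p):
    """Matrix of the projection of {x : x^T B x = 1} onto the last p coordinates."""
    d = B.shape[0]
    C = B[: d - p, : d - p]
    D = B[: d - p, d - p :]
    E = B[d - p :, d - p :]
    return E - D.T @ np.linalg.solve(C, D)   # E - D^T C^{-1} D

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
B = G @ G.T + 6.0 * np.eye(6)                # a symmetric positive definite example
S = project_ellipsoid(B, 3)
# Sanity check: S is the inverse of the lower right block of B^{-1}.
print(np.allclose(S, np.linalg.inv(np.linalg.inv(B)[3:, 3:])))
```

The sanity check reflects a standard block-inverse identity: projecting an ellipsoid onto a subset of coordinates corresponds to taking the matching block of the inverse matrix.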

A.3. Additional Simulation Results

In addition to the continuous reward and binary reward settings, here we also consider a discrete count reward setting. In this discrete count reward setting, the reward R_t is generated from a Poisson distribution with expectation $E_{P} [R_{t} ∣ X_{t}, A_{t}] = exp ({\tilde{X}}_{t}^{T} θ_{0}^{*} (P) + A_{t} {\tilde{X}}_{t}^{T} θ_{1}^{*} (P))$ . All other aspects of data generation are the same as in the other simulation settings. Additionally, we consider the setting in which $θ^{*} (P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ for the continuous reward, binary reward, and discrete count settings.

To analyze the data, in the discrete count reward setting, we assume a correctly specified model for the expected reward. We use both unweighted and adaptively weighted maximum likelihood estimators (MLEs), which correspond to M-estimators with m_θ(R_t, X_t, A_t) set to the log-likelihood of R_t given X_t, A_t. We solve for these estimators using Newton–Raphson optimization and do not put explicit bounds on the parameter space Θ.

Figure 3: Empirical coverage probabilities for 90% confidence ellipsoids for parameters $θ^{*} (P)$ and parameters $θ_{1}^{*} (P)$ (top row). We also plot the volumes of these 90% confidence ellipsoids for $θ^{*} (P)$ and parameters $θ_{1}^{*} (P)$ (bottom row). We set the true parameters to $θ^{*} (P) = [0.1, 0.1, 0.1, 0, 0, 0]$ (left) and to $θ^{*} (P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ (right).

Figure 4: Empirical coverage probabilities (upper row) and volume (lower row) of 90% confidence ellipsoids. In these simulations, $θ^{*} (P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ . The left two columns are for the linear reward model setting (t-distributed rewards) and the right two columns are for the logistic regression model setting (Bernoulli rewards). We consider confidence ellipsoids for all parameters $θ^{*} (P)$ and for advantage parameters $θ_{1}^{*} (P)$ for both settings.

In Figure 5, we plot the mean squared errors of all estimators for all three simulation settings (same simulation hyperparameters as described previously for the respective simulation settings).

B. Asymptotic Results

Throughout, ∥ · ∥ refers to the L₂ norm.

B.1. Definitions

Here we define convergence in probability and in distribution that is uniform over the true parameter. Our definitions are based on those in Kasy [2019] and Van Der Vaart and Wellner [1996, Chapter 1.12].

Definition 1 (Uniform Convergence in Probability). Let ${Z_{T} (P)}_{T \geq 1}$ be a sequence of random variables whose distributions are defined by some $P \in P$ and some nuisance component η. We say that $Z_{T} (P) \overset{P}{\to} c$ uniformly over $P \in P$ as T → ∞ if for any ϵ > 0,

sup_{P \in P} P_{P, η} (‖ Z_{T} (P) - c ‖ > ϵ) \to 0 .

(12)

For simplicity of notation, throughout we denote $Z_{T} (P) - c = o_{P \in P} (1)$ to mean $Z_{T} (P) \overset{P}{\to} c$ uniformly over $P \in P$ as T → ∞.

Definition 2 (Uniformly Stochastically Bounded). Let ${Z_{T} (P)}_{T \geq 1}$ be a sequence of random variables whose distributions are defined by some $P \in P$ and some nuisance component η. We say that $Z_{T} (P)$ is uniformly stochastically bounded over $P \in P$ as T → ∞ if for any ϵ > 0 there exists some k < ∞ such that

\underset{T \to \infty}{lim sup} sup_{P \in P} P_{P, η} (‖ Z_{T} (P) ‖ > k) < ϵ .

Similarly we denote $Z_{T} (P) = O_{P \in P} (1)$ to mean $Z_{T} (P)$ is stochastically bounded uniformly over $P \in P$ as T → ∞.

Definition 3 (Uniform Convergence in Distribution). Let $Z (P) \in R^{d_{Z}}$ and ${Z_{T} (P)}_{T \geq 1} \in R^{d_{Z}}$ be a sequence of random variables whose distributions are defined by some $P \in P$ and some nuisance component η. We say that $Z_{T} (P) \overset{D}{\to} Z (P)$ uniformly over $P \in P$ as T → ∞ if

sup_{P \in P} sup_{f \in B L_{1}} ∣ E_{P, η} [f (Z_{T} (P))] - E_{P, η} [f (Z (P))] ∣ \to 0,

(13)

where BL₁ is the set of functions $f : R^{d_{Z}} \to R$ with ∥f(z)∥_∞ ≤ 1 and ∣f(z) − f(z′)∣ ≤ ∥z − z′∥ for all z, $z^{'} \in R^{d_{Z}}$ .

As discussed in Kasy [2019], Equation (12) holds if and only if for any ϵ > 0 and any sequence ${P_{T}}_{T \geq 1}$ such that $P_{T} \in P$ for all T ≥ 1, $P_{P_{T}, η} (‖ Z_{T} (P_{T}) - c ‖ > ϵ) \to 0$ .

Similarly, Equation (13) holds if and only if for any sequence ${P_{T}}_{T \geq 1}$ such that $P_{T} \in P$ for all T ≥ 1, ${sup}_{f \in B L_{1}} ∣ E_{P_{T}, η} [f (Z_{T} (P_{T}))] - E_{P_{T}, η} [f (Z (P_{T}))] ∣ \to 0$ .

B.2. Consistency

We prove the first part of Theorem 1, i.e., that ${\hat{θ}}_{T} \overset{P}{\to} θ^{*} (P)$ uniformly over $P \in P$ . We abbreviate m_θ(Y_t, X_t, A_t) with m_θ,t. By definition of ${\hat{θ}}_{T}$ ,

\sum_{t = 1}^{T} W_{t} m_{{\hat{θ}}_{T}, t} = sup_{θ \in Θ} \sum_{t = 1}^{T} W_{t} m_{θ, t} \geq \sum_{t = 1}^{T} W_{t} m_{θ^{*} (P), t} .

Note that $‖ {\hat{θ}}_{T} - θ^{*} (P) ‖ > ϵ > 0$ implies that

sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} \sum_{t = 1}^{T} W_{t} m_{θ, t} = sup_{θ \in Θ} \sum_{t = 1}^{T} W_{t} m_{θ, t} .

Thus, the above two results imply the following inequality:

sup_{P \in P} P_{P, π} (‖ {\hat{θ}}_{T} - θ^{*} (P) ‖ > ϵ) \leq sup_{P \in P} P_{P, π} (sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} \sum_{t = 1}^{T} W_{t} m_{θ, t} \geq \sum_{t = 1}^{T} W_{t} m_{θ^{*} (P), t}) = sup_{P \in P} P_{P, π} (sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ, t}} - \frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ^{*} (P), t} \geq 0) = sup_{P \in P} P_{P, π} (sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}] + E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}]} - \frac{1}{T} \sum_{t = 1}^{T} {W_{t} m_{θ^{*} (P), t} - E_{P, π} [W_{t} m_{θ^{*} (P), t} ∣ H_{t - 1}] + E_{P, π} [W_{t} m_{θ^{*} (P), t} ∣ H_{t - 1}]} \geq 0) .

By triangle inequality,

\leq sup_{P \in P} P_{P, π} (\underbrace{sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} (W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}])}}_{(a)} + \underbrace{sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} (m_{θ, t} - m_{θ^{*} (P), t}) ∣ H_{t - 1}]}}_{(b)} - \underbrace{\frac{1}{T} \sum_{t = 1}^{T} {W_{t} m_{θ^{*} (P), t} - E_{P, π} [W_{t} m_{θ^{*} (P), t} ∣ H_{t - 1}]}}_{(c)} \geq 0) \to 0 .

(14)

We now show that the limit in Equation (14) above holds.

Regarding term (c), by moment bounds of Condition 5 and Lemma 1, $\frac{1}{T} \sum_{t = 1}^{T} {W_{t} m_{θ^{*} (P), t} - E_{P, π} [W_{t} m_{θ^{*} (P), t} ∣ H_{t - 1}]} = o_{P \in P} (1)$ .
Regarding term (a), by Lemma 2, ${sup}_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} (W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}])} = o_{P \in P} (1)$ .

Thus it is sufficient to show that term (b) is such that for some δ′ > 0,

sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} (m_{θ, t} - m_{θ^{*} (P), t}) ∣ H_{t - 1}]} \leq - δ^{'} w.p. 1 .

(15)

By law of iterated expectations,

sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} (m_{θ, t} - m_{θ^{*} (P), t}) ∣ H_{t - 1}]} = sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [W_{t} (m_{θ, t} - m_{θ^{*} (P), t}) ∣ H_{t - 1}, X_{t}, A_{t} = a] d a ∣ H_{t - 1}]} .

Since $W_{t} \in σ (H_{t - 1}, X_{t}, A_{t})$ , we have that $E_{P} [W_{t} (m_{θ, t} - m_{θ^{*} (P), t}) ∣ H_{t - 1}, X_{t}, A_{t} = a] = W_{t} E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ H_{t - 1}, X_{t}, A_{t} = a]$ . By Condition 1, we have that $W_{t} E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ H_{t - 1}, X_{t}, A_{t} = a] = W_{t} E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ X_{t}, A_{t} = a]$ . Thus we have,

= sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t} (a, X_{t}, H_{t - 1}) W_{t} E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ X_{t}, A_{t} = a] d a ∣ H_{t - 1}]} .

Since for all θ ∈ Θ, $E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ X_{t}, A_{t}] \leq 0$ with probability 1 by Condition 7 and since $0 < \frac{W_{t}}{\sqrt{ρ_{\max}}} \leq 1$ with probability 1 by Condition 9,

\leq sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T \sqrt{ρ_{\max}}} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t} (a, X_{t}, H_{t - 1}) W_{t}^{2} E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ X_{t}, A_{t} = a] d a ∣ H_{t - 1}]} .

Since $W_{t}^{2} = \frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}$ ,

= sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T \sqrt{ρ_{\max}}} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ X_{t}, A_{t} = a] d a ∣ H_{t - 1}]} .

By Condition 1 and since $π_{t}^{sta}$ is pre-specified, we can drop the conditioning on $H_{t - 1}$ , i.e.,

= sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T \sqrt{ρ_{\max}}} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [m_{θ, t} - m_{θ^{*} (P), t} ∣ X_{t}, A_{t} = a] d a]} .

By law of iterated expectations,

= sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ > ϵ} {\frac{1}{T \sqrt{ρ_{\max}}} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [m_{θ, t} - m_{θ^{*} (P), t}]} \leq - \frac{1}{\sqrt{ρ_{\max}}} δ .

The last inequality above holds for some δ > 0 for all sufficiently large T by Condition 8. Thus Equation (15) holds for $δ^{'} = \frac{1}{\sqrt{ρ_{\max}}} δ$ .

B.3. Asymptotic Normality

We prove the second part of Theorem 1, i.e., that

Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(16)

B.3.1. Main Argument

The three results we show to ensure Equation (16) holds are as follows:

Σ_{T} (P)^{- 1 ∕ 2} \sqrt{T} {\dot{M}}_{T} (θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(17)

For $ϵ_{\overset{⃛}{m}} > 0$ as defined in Condition 6,

sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ \leq ϵ_{\overset{⃛}{m}}} ‖ {\overset{⃛}{M}}_{T} (θ) ‖_{1} = O_{P \in P} (1) .

(18)

For matrix H positive definite,

- {\ddot{M}}_{T} (θ^{*} (P)) \underline{≻} H + o_{P \in P} (1) .

(19)

For a reminder on the notation of $o_{P \in P} (1)$ and $O_{P \in P} (1)$ see Definitions 1 and 2. For now, we assume that Equations (17), (18), and (19) hold; we will show they hold in Sections B.3.2, B.3.3, and B.3.4 respectively. Our argument is based on Van der Vaart [2000, Theorem 5.41].

By differentiability Condition 2, since ${\hat{θ}}_{T}$ is the maximizer of criterion M_T(θ),

0 = {\dot{M}}_{T} ({\hat{θ}}_{T}) .

By differentiability Condition 2 again and Taylor’s theorem we have that for some random ${\tilde{θ}}_{T}$ on the line segment between $θ^{*} (P)$ and ${\hat{θ}}_{T}$ ,

0 = {\dot{M}}_{T} ({\hat{θ}}_{T}) = {\dot{M}}_{T} (θ^{*} (P)) + {\ddot{M}}_{T} (θ^{*} (P)) ({\hat{θ}}_{T} - θ^{*} (P)) + \frac{1}{2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) ({\hat{θ}}_{T} - θ^{*} (P)) .

By rearranging terms and multiplying by $\sqrt{T}$ ,

- \sqrt{T} {\dot{M}}_{T} (θ^{*} (P)) = {\ddot{M}}_{T} (θ^{*} (P)) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) + \frac{1}{2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) = [{\ddot{M}}_{T} (θ^{*} (P)) + \frac{1}{2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T})] \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) .

Note that by the above equation and Equation (17), we have that

Σ_{T} (P)^{- 1 ∕ 2} [{\ddot{M}}_{T} (θ^{*} (P)) + \frac{1}{2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T})] \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(20)

By Equation (19), the probability that ${\ddot{M}}_{T} (θ^{*} (P))$ is invertible goes to 1 uniformly over $P \in P$ . Thus by Equation (20), we have that

Σ_{T} (P)^{- 1 ∕ 2} [I_{d} + \frac{1}{2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1}] {\ddot{M}}_{T} (θ^{*} (P)) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) = [I_{d} + \frac{1}{2} Σ_{T} (P)^{- 1 ∕ 2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} Σ_{T} (P)^{1 ∕ 2}] Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} (θ^{*} (P)) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(21)

We now show that $\frac{1}{2} Σ_{T} (P)^{- 1 ∕ 2} ({\hat{θ}}_{T} - θ^{*} (P))^{T} {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} Σ_{T} (P)^{1 ∕ 2} = o_{P \in P} (1)$ . It is sufficient to show that $‖ Σ_{T} (P)^{- 1 ∕ 2} ‖ ‖ {\hat{θ}}_{T} - θ^{*} (P) ‖ ‖ {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) ‖_{1} ‖ {\ddot{M}}_{T} (θ^{*} (P))^{- 1} ‖ ‖ Σ_{T} (P)^{1 ∕ 2} ‖ = o_{P \in P} (1)$ .

By Condition 5, the minimum eigenvalue of $Σ_{T} (P)$ is bounded uniformly above some constant greater than zero, so ${sup}_{P \in P} ‖ Σ_{T} (P)^{- 1 ∕ 2} ‖ = O (1)$ .
By uniform consistency of ${\hat{θ}}_{T}, ‖ {\hat{θ}}_{T} - θ^{*} (P) ‖ = o_{P \in P} (1)$ .
By uniform consistency of ${\hat{θ}}_{T}$ and since ${\tilde{θ}}_{T}$ lies on the line segment between $θ^{*} (P)$ and ${\hat{θ}}_{T}$ , $1_{‖ {\tilde{θ}}_{T} - θ^{*} (P) ‖ > ϵ_{\overset{⃛}{m}}} = o_{P \in P} (1)$ . Thus by Equation (18), $‖ {\overset{⃛}{M}}_{T} ({\tilde{θ}}_{T}) ‖_{1} = O_{P \in P} (1)$ .
By Equation (19), the minimum eigenvalue of $- {\ddot{M}}_{T} (θ^{*} (P))$ is bounded below by that of the positive definite matrix H, up to an $o_{P \in P} (1)$ term. Thus $‖ {\ddot{M}}_{T} (θ^{*} (P))^{- 1} ‖ = O_{P \in P} (1)$ .
By Condition 5, ${sup}_{P \in P} ‖ Σ_{T} (P)^{1 ∕ 2} ‖ = O (1)$ .

Thus, by Slutsky’s Theorem and Equation (21), we have that

Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} (θ^{*} (P)) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(22)

Lastly, to show our desired result, that $Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) \sqrt{T} ({\hat{θ}}_{T} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d})$ uniformly over $P \in P$ , by Equation (22) and Slutsky’s Theorem it is sufficient to show that $Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} Σ_{T} (P)^{1 ∕ 2} \overset{P}{\to} I_{d}$ uniformly over $P \in P$ . Note if we can show that ${\ddot{M}}_{T} ({\hat{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} \overset{P}{\to} I_{d}$ uniformly over $P \in P$ , then $Σ_{T} (P)^{- 1 ∕ 2} {\ddot{M}}_{T} ({\hat{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} Σ_{T} (P)^{1 ∕ 2} = Σ_{T} (P)^{- 1 ∕ 2} [I_{d} + o_{P \in P} (1)] Σ_{T} (P)^{1 ∕ 2} = I_{d} + Σ_{T} (P)^{- 1 ∕ 2} o_{P \in P} (1) Σ_{T} (P)^{1 ∕ 2} = I_{d} + o_{P \in P} (1)$ . The last limit holds since $‖ Σ_{T} (P)^{- 1 ∕ 2} ‖ = O_{P \in P} (1)$ and $‖ Σ_{T} (P)^{1 ∕ 2} ‖ = O_{P \in P} (1)$ by Condition 5 (use the same argument as that used in the bullet points below Equation (21)).

Thus it is sufficient to show that ${\ddot{M}}_{T} ({\hat{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} \overset{P}{\to} I_{d}$ uniformly over $P \in P$ . By Taylor’s Theorem, for some random ${\bar{θ}}_{T}$ on the line segment between ${\hat{θ}}_{T}$ and $θ^{*} (P)$ ,

{\ddot{M}}_{T} ({\hat{θ}}_{T}) = {\ddot{M}}_{T} (θ^{*} (P)) + {\overset{⃛}{M}}_{T} ({\bar{θ}}_{T}) ({\hat{θ}}_{T} - θ^{*} (P)) .

Recall that the probability the inverse of ${\ddot{M}}_{T} (θ^{*} (P))$ exists goes to 1 by Equation (19) (use the same argument as that used in the bullet points below Equation (21)). Thus we have that ${\ddot{M}}_{T} ({\hat{θ}}_{T}) {\ddot{M}}_{T} (θ^{*} (P))^{- 1}$ equals the following:

[{\ddot{M}}_{T} (θ^{*} (P)) + {\overset{⃛}{M}}_{T} ({\bar{θ}}_{T}) ({\hat{θ}}_{T} - θ^{*} (P))] {\ddot{M}}_{T} (θ^{*} (P))^{- 1} = I_{d} + {\overset{⃛}{M}}_{T} ({\bar{θ}}_{T}) ({\hat{θ}}_{T} - θ^{*} (P)) {\ddot{M}}_{T} (θ^{*} (P))^{- 1}

Note that ${\overset{⃛}{M}}_{T} ({\bar{θ}}_{T}) ({\hat{θ}}_{T} - θ^{*} (P)) {\ddot{M}}_{T} (θ^{*} (P))^{- 1} = o_{P \in P} (1)$ because

By uniform consistency of ${\hat{θ}}_{T}$ and since ${\bar{θ}}_{T}$ lies on the line segment between ${\hat{θ}}_{T}$ and $θ^{*} (P)$ , $1_{‖ {\bar{θ}}_{T} - θ^{*} (P) ‖ > ϵ_{\overset{⃛}{m}}} = o_{P \in P} (1)$ . Thus by Equation (18), $‖ {\overset{⃛}{M}}_{T} ({\bar{θ}}_{T}) ‖_{1} = O_{P \in P} (1)$ .
By uniform consistency of ${\hat{θ}}_{T}, ‖ {\hat{θ}}_{T} - θ^{*} (P) ‖ = o_{P \in P} (1)$ .
By Equation (19), $‖ {\ddot{M}}_{T} (θ^{*} (P))^{- 1} ‖ = O_{P \in P} (1)$ .

B.3.2. Asymptotic Normality of $Σ_{T} (P)^{- 1 ∕ 2} \sqrt{T} {\dot{M}}_{T} (θ^{*} (P))$

We will show that Equation (17) holds by applying a martingale central limit theorem. For notational convenience, we let ${\dot{m}}_{θ, t} ≔ {\dot{m}}_{θ} (Y_{t}, X_{t}, A_{t})$ . Note that by definition $Σ_{T} (P)^{- 1 ∕ 2} \sqrt{T} {\dot{M}}_{T} (θ^{*} (P)) = Σ_{T} (P)^{- 1 ∕ 2} \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} {\dot{m}}_{θ^{*} (P), t}$ . We first show that ${Σ_{T} (P)^{- 1 ∕ 2} \frac{1}{\sqrt{T}} W_{t} {\dot{m}}_{θ^{*} (P), t}}_{t = 1}^{T}$ is a martingale difference sequence with respect to ${H_{t}}_{t = 0}^{T}$ . For any t ∈ [1: T],

E_{P, π} [\frac{1}{\sqrt{T}} Σ_{T} (P)^{- 1 ∕ 2} W_{t} c^{T} {\dot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}]
\underset{(a)}{=} \frac{1}{\sqrt{T}} E_{P, π} [E_{P} [Σ_{T} (P)^{- 1 ∕ 2} W_{t} c^{T} {\dot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}, X_{t}, A_{t}] ∣ H_{t - 1}]
\underset{(b)}{=} \frac{1}{\sqrt{T}} Σ_{T} (P)^{- 1 ∕ 2} E_{P, π} [W_{t} c^{T} E_{P} [{\dot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}, X_{t}, A_{t}] ∣ H_{t - 1}]
\underset{(c)}{=} 0

Above, (a) holds by law of iterated expectations.
(b) holds since $W_{t} \in σ (H_{t - 1}, X_{t}, A_{t})$ and since $Σ_{T} (P)$ is a function of the stabilizing policies ${\{π_{t}^{sta}\}}_{t \geq 1}$ , which are pre-specified.
By Condition 1, $E_{P} [{\dot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}, X_{t}, A_{t}] = E_{P} [{\dot{m}}_{θ^{*} (P), t} ∣ X_{t}, A_{t}]$ . Equality (c) holds because $E_{P} [{\dot{m}}_{θ^{*} (P), t} ∣ X_{t}, A_{t}] = 0$ with probability 1 by Condition 7; note that $θ^{*} (P)$ is a critical point of $E_{P} [m_{θ, t} ∣ X_{t}, A_{t}]$ .

By Cramer-Wold device, to show that Equation (17) holds, it is sufficient to show that for any fixed $c \in R^{d}$ with ∥c∥₂ = 1, that $c^{T} Σ_{T} (P)^{- 1 ∕ 2} \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} {\dot{m}}_{θ^{*} (P), t} \overset{D}{\to} N (0, c^{T} I_{d} c)$ uniformly over $P \in P$ . We now apply Theorem 2, a uniform version of the martingale central limit theorem of Dvoretzky [1972]; while the original theorem holds for any fixed $P$ , we can show uniform convergence in distribution by ensuring that the conditions of the theorem hold uniformly over $P \in P$ (see Definition 3). By Theorem 2, it is sufficient to show that the following two conditions hold:

1. Conditional Variance: $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} W_{t} {\dot{m}}_{θ^{*} (P), t})}^{2} ∣ H_{t - 1}] \overset{P}{\to} σ^{2}$ uniformly over $P \in P$ .

2. Conditional Lindeberg: For any δ > 0, $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} W_{t} {\dot{m}}_{θ^{*} (P), t})}^{2} 1_{∣ c^{T} Σ_{T} (P)^{- 1 ∕ 2} W_{t} {\dot{m}}_{θ^{*} (P), t} ∣ > δ \sqrt{T}} ∣ H_{t - 1}] \overset{P}{\to} 0$ uniformly over $P \in P$ .

1. Conditional Variance

\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [{(c^{T} W_{t} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t})}^{2} ∣ H_{t - 1}] = \frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c ∣ H_{t - 1}]
\underset{(a)}{=} c^{T} Σ_{T} (P)^{- 1 ∕ 2} \{\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ H_{t - 1}]\} Σ_{T} (P)^{- 1 ∕ 2} c
\underset{(b)}{=} c^{T} Σ_{T} (P)^{- 1 ∕ 2} \{\frac{1}{T} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [W_{t}^{2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ H_{t - 1}, X_{t}, A_{t} = a] d a ∣ H_{t - 1}]\} Σ_{T} (P)^{- 1 ∕ 2} c
\underset{(c)}{=} c^{T} Σ_{T} (P)^{- 1 ∕ 2} \{\frac{1}{T} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ H_{t - 1}, X_{t}, A_{t} = a] d a ∣ H_{t - 1}]\} Σ_{T} (P)^{- 1 ∕ 2} c
\underset{(d)}{=} c^{T} Σ_{T} (P)^{- 1 ∕ 2} \{\frac{1}{T} \sum_{t = 1}^{T} E_{P} [E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ X_{t}] ∣ H_{t - 1}]\} Σ_{T} (P)^{- 1 ∕ 2} c
\underset{(e)}{=} c^{T} Σ_{T} (P)^{- 1 ∕ 2} \{\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2}]\} Σ_{T} (P)^{- 1 ∕ 2} c
\underset{(f)}{=} c^{T} Σ_{T} (P)^{- 1 ∕ 2} Σ_{T} (P) Σ_{T} (P)^{- 1 ∕ 2} c = c^{T} I_{d} c

Above, (a) holds since $Σ_{T} (P)$ is a function of the stabilizing policies ${\{π_{t}^{sta}\}}_{t \geq 1}$ , which are pre-specified.
Equality (b) holds by law of iterated expectations.
Equality (c) holds since $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}} \in σ (H_{t - 1}, X_{t}, A_{t})$ .
Equality (d) holds because by Condition 1, $E_{P} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ H_{t - 1}, X_{t}, A_{t} = a] = E_{P} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ X_{t}, A_{t} = a]$ and by law of iterated expectations.
Equality (e) holds because by Condition 1, the distribution of X_t does not depend on $H_{t - 1}$ , so $E_{P} [E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ X_{t}] ∣ H_{t - 1}] = E_{P} [E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2} ∣ X_{t}]] = E_{P, π_{t}^{sta}} [{\dot{m}}_{θ^{*} (P), t}^{\otimes 2}]$ ; the last equality holds by law of iterated expectations.
Equality (f) holds by definition.
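The conditional variance calculation above says that, after square-root importance weighting, each summand has the same conditional variance no matter which action-selection probabilities the bandit algorithm used. The following Monte Carlo sketch illustrates this under assumptions that are not from the paper: a two-armed bandit without contexts, both arms with mean 0 and unit-variance Gaussian rewards, a clipped greedy policy, a uniform stabilizing policy, and the score $1_{A_{t} = 1} Y_{t}$ for the mean of arm 1.

```python
import random
import statistics

PI_STA_ARM1 = 0.5   # assumed stabilizing policy: choose arm 1 with probability 1/2

def normalized_score(T, rng, clip=0.1):
    """One bandit run; returns (T * Sigma)^(-1/2) * sum_t W_t * 1{A_t = 1} * Y_t."""
    sums, counts = [0.0, 0.0], [0, 0]
    score = 0.0
    for _ in range(T):
        mean0 = sums[0] / counts[0] if counts[0] else 0.0
        mean1 = sums[1] / counts[1] if counts[1] else 0.0
        # Adaptive policy: favor the arm with the larger running mean,
        # clipped so that action probabilities stay in [clip, 1 - clip].
        if mean1 > mean0:
            p1 = 1.0 - clip
        elif mean1 < mean0:
            p1 = clip
        else:
            p1 = 0.5
        a = 1 if rng.random() < p1 else 0
        y = rng.gauss(0.0, 1.0)              # both arms: true mean 0, variance 1
        if a == 1:
            w = (PI_STA_ARM1 / p1) ** 0.5    # square-root importance weight W_t
            score += w * y
        sums[a] += y
        counts[a] += 1
    sigma = PI_STA_ARM1 * 1.0                # Sigma_T = E_{pi_sta}[1{A=1} Y^2] = 0.5
    return score / (T * sigma) ** 0.5

rng = random.Random(0)
scores = [normalized_score(100, rng) for _ in range(2000)]
score_mean = statistics.mean(scores)
score_var = statistics.pvariance(scores)
```

In this toy setting the conditional second moment of each weighted term is exactly 0.5 for every history, so `score_var` should be close to 1 and `score_mean` close to 0, up to Monte Carlo error.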

2. Conditional Lindeberg

\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [{(c^{T} W_{t} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t})}^{2} 1_{∣ c^{T} W_{t} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t} ∣ > δ \sqrt{T}} ∣ H_{t - 1}] = \frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c 1_{∣ c^{T} W_{t} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t} ∣ > δ \sqrt{T}} ∣ H_{t - 1}]
\underset{(a)}{\leq} \frac{1}{T^{2} δ^{2}} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{4} {(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2} ∣ H_{t - 1}]
\underset{(b)}{\leq} \frac{ρ_{\max}}{T^{2} δ^{2}} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} {(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2} ∣ H_{t - 1}]
\underset{(c)}{=} \frac{ρ_{\max}}{T^{2} δ^{2}} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [W_{t}^{2} {(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2} ∣ H_{t - 1}, X_{t}, A_{t} = a] d a ∣ H_{t - 1}]
\underset{(d)}{=} \frac{ρ_{\max}}{T^{2} δ^{2}} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2} ∣ H_{t - 1}, X_{t}, A_{t} = a] d a ∣ H_{t - 1}]
\underset{(e)}{=} \frac{ρ_{\max}}{T^{2} δ^{2}} \sum_{t = 1}^{T} E_{P} [E_{P, π_{t}^{sta}} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2} ∣ X_{t}] ∣ H_{t - 1}]
\underset{(f)}{=} \frac{ρ_{\max}}{T^{2} δ^{2}} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2}]
\underset{(g)}{\to} 0

Above, inequality (a) holds because $1_{∣ W_{t} c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t} ∣ > \sqrt{T} δ} = 1$ if and only if $W_{t}^{2} \frac{1}{T δ^{2}} c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c > 1$ .
Inequality (b) holds because by Condition 9, $W_{t}^{2} \leq ρ_{\max}$ with probability 1.
Equality (c) holds by the law of iterated expectations.
Equality (d) holds since $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}} \in σ (H_{t - 1}, X_{t}, A_{t})$ .
Equality (e) holds because by Condition 1, $E_{P} [(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)^{2} ∣ H_{t - 1}, X_{t}, A_{t} = a] = E_{P} [(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)^{2} ∣ X_{t}, A_{t} = a]$ and by law of iterated expectations.
Equality (f) holds since the distribution of X_t does not depend on $H_{t - 1}$ by Condition 1 and by law of iterated expectations.
Regarding limit (g), it is sufficient to show that $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2}]$ is uniformly bounded over $P \in P$ for all sufficiently large T. By Condition 5, the minimum eigenvalue of Σ_T(P) is bounded above zero uniformly over $P \in P$ for all sufficiently large T; this bounds the maximum eigenvalue of Σ_T(P)⁻¹. Also by Condition 5 the fourth moment of ${\dot{m}}_{θ^{*} (P), t}$ with respect to $P$ and policy $π_{t}^{sta}$ is uniformly bounded over $P \in P$ and t ≥ 1. With these two properties we have that $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{(c^{T} Σ_{T} (P)^{- 1 ∕ 2} {\dot{m}}_{θ^{*} (P), t}^{\otimes 2} Σ_{T} (P)^{- 1 ∕ 2} c)}^{2}]$ is uniformly bounded over $P \in P$ for all sufficiently large T.

B.3.3. Showing that ${sup}_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ \leq ϵ_{\overset{⃛}{m}}} ‖ {\overset{⃛}{M}}_{T} (θ) ‖_{1}$ is bounded in probability

Recall that for any $B \in R^{d \times d \times d}$ , we denote $‖ B ‖_{1} = \sum_{i = 1}^{d} \sum_{j = 1}^{d} \sum_{k = 1}^{d} ∣ B_{i, j, k} ∣$ . We abbreviate ${\overset{⃛}{m}}_{θ} (Y_{t}, X_{t}, A_{t})$ with ${\overset{⃛}{m}}_{θ, t}$ .

By triangle inequality, $‖ {\overset{⃛}{M}}_{T} (θ) ‖_{1} = {‖ \frac{1}{T} \sum_{t = 1}^{T} W_{t} {\overset{⃛}{m}}_{θ, t} ‖}_{1} \leq \frac{1}{T} \sum_{t = 1}^{T} W_{t} ‖ {\overset{⃛}{m}}_{θ, t} ‖_{1}$ . Thus we have that

sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ \leq ϵ_{\overset{⃛}{m}}} ‖ {\overset{⃛}{M}}_{T} (θ) ‖_{1} \leq sup_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ \leq ϵ_{\overset{⃛}{m}}} \frac{1}{T} \sum_{t = 1}^{T} W_{t} ‖ {\overset{⃛}{m}}_{θ, t} ‖_{1} .

By Condition 6 (ii), there exists a function $\overset{⃛}{m}$ (note it is not indexed by θ) such that for all $P \in P$ , we have that ${sup}_{θ \in Θ : ‖ θ - θ^{*} (P) ‖ \leq ϵ_{\overset{⃛}{m}}} ‖ {\overset{⃛}{m}}_{θ, t} ‖_{1} \leq ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}$ .

\leq \frac{1}{T} \sum_{t = 1}^{T} W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} .

Adding and subtracting $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} ∣ H_{t - 1}]$ ,

= \frac{1}{T} \sum_{t = 1}^{T} \{W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} - E_{P, π} [W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} ∣ H_{t - 1}] + E_{P, π} [W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} ∣ H_{t - 1}]\} .

By second moment bounds on $‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}$ from Condition 6 (i), by Lemma 1, we have that $\frac{1}{T} \sum_{t = 1}^{T} W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} - E_{P, π} [W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} ∣ H_{t - 1}] = o_{P \in P} (1)$ .

= o_{P \in P} (1) + \frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} ∣ H_{t - 1}]

Since by Condition 9, $\frac{W_{t}}{\sqrt{ρ_{min}}} \geq 1$ with probability 1,

\leq o_{P \in P} (1) + \frac{1}{T \sqrt{ρ_{min}}} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} ‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1} ∣ H_{t - 1}]

Since $W_{t}^{2} = \frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}$ and by Condition 1,

= o_{P \in P} (1) + \frac{1}{T \sqrt{ρ_{min}}} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}] = O_{P \in P} (1) .

Note that by Jensen’s inequality, $E_{P, π_{t}^{sta}} [‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}] \leq \sqrt{E_{P, π_{t}^{sta}} [‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}^{2}]}$ . By Condition 6 (i), ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [‖ \overset{⃛}{m} (Y_{t}, X_{t}, A_{t}) ‖_{1}^{2}]$ is bounded, which implies the final limit above.

B.3.4. Lower bounding $- {\ddot{M}}_{T} (θ^{*} (P))$

We now show that $- {\ddot{M}}_{T} (θ^{*} (P)) \underline{≻} H + o_{P \in P} (1)$ , for positive definite matrix H introduced in Condition 7 (ii).

By Condition 5 and Lemma 1, $\frac{1}{T} \sum_{t = 1}^{T} W_{t} {\ddot{m}}_{θ^{*} (P), t} - E_{P, π} [W_{t} {\ddot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}] = o_{P \in P} (1)$ , so

- {\ddot{M}}_{T} (θ^{*} (P)) = - \frac{1}{T} \sum_{t = 1}^{T} W_{t} {\ddot{m}}_{θ^{*} (P), t} = o_{P \in P} (1) - \frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} {\ddot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}]

By law of iterated expectations,

= o_{P \in P} (1) - \frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} E_{P} [{\ddot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}, X_{t}, A_{t}] ∣ H_{t - 1}]

By Condition 1,

= o_{P \in P} (1) - \frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} E_{P} [{\ddot{m}}_{θ^{*} (P), t} ∣ X_{t}, A_{t}] ∣ H_{t - 1}]

By Condition 7, we have that $E_{P} [{\ddot{m}}_{θ^{*} (P), t} ∣ X_{t}, A_{t}] \underline{≺} 0$ ; recall that $θ^{*} (P)$ is a maximizing value of $E_{P} [m_{θ, t} ∣ X_{t}, A_{t}]$ . Also since $\frac{W_{t}}{\sqrt{ρ_{\max}}} \leq 1$ with probability 1 by Condition 9,

\underline{≻} o_{P \in P} (1) - \frac{1}{T \sqrt{ρ_{\max}}} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} E_{P} [{\ddot{m}}_{θ^{*} (P), t} ∣ X_{t}, A_{t}] ∣ H_{t - 1}]

Since $W_{t}^{2} = \frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}$ ,

= o_{P \in P} (1) - \frac{1}{T \sqrt{ρ_{\max}}} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{\ddot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}]

Note that for any t ≥ 1, $E_{P, π_{t}^{sta}} [{\ddot{m}}_{θ^{*} (P), t} ∣ H_{t - 1}] = E_{P, π_{t}^{sta}} [{\ddot{m}}_{θ^{*} (P), t}]$ because ${π_{t}^{sta}}_{t \geq 1}$ are pre-specified. Recall that by Condition 7 for all sufficiently large T, $- \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [{\ddot{m}}_{θ^{*} (P), t}] \underline{≻} H$ for all $P \in P$ . Thus our final result is that

- {\ddot{M}}_{T} (θ^{*} (P)) \underline{≻} H + o_{P \in P} (1) .

(23)

B.4. Lemmas and Other Helpful Results

Theorem 2 (Uniform Martingale Central Limit Theorem). Let ${Z_{T} (P)}_{T \geq 1}$ be a sequence of random variables whose distributions are defined by some $P \in P$ and some nuisance component η. Moreover, let ${Z_{T} (P)}_{T \geq 1}$ be a martingale difference sequence with respect to $F_{t}$ , meaning $E_{P, η} [Z_{t} (P) ∣ F_{t - 1}] = 0$ for all t ≥ 1 and $P \in P$ .

Suppose the following two conditions hold:

(a) $\frac{1}{T} \sum_{t = 1}^{T} E_{P, η} [Z_{t} (P)^{2} ∣ F_{t - 1}] \overset{P}{\to} σ^{2}$ uniformly over $P \in P$ , where σ² is a constant with 0 < σ² < ∞.
(b) For any ϵ > 0, $\frac{1}{T} \sum_{t = 1}^{T} E_{P, η} [Z_{t} (P)^{2} 1_{∣ Z_{t} (P) ∣ > ϵ \sqrt{T}} ∣ F_{t - 1}] \overset{P}{\to} 0$ uniformly over $P \in P$ .

Under the above conditions,

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} Z_{t} (P) \overset{D}{\to} N (0, σ^{2}) uniformly over P \in P .

Proof: By Kasy [2019, Lemma 1], it is sufficient to show that for any sequence ${\{P_{T}\}}_{T = 1}^{\infty}$ with $P_{T} \in P$ for all T ≥ 1, $\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} Z_{t} (P_{T}) \overset{D}{\to} N (0, σ^{2})$ . In this setting, since $P_{T}$ depends on T, we consider triangular array asymptotics and additionally index by T, e.g., $F_{T, t}$ .

Note that $\frac{1}{T} \sum_{t = 1}^{T} E_{P_{T}, η} [Z_{t} (P_{T})^{2} ∣ F_{T, t - 1}] \overset{P}{\to} σ^{2}$ , by Kasy [2019, Lemma 1] and condition (a) above.

Also, for any ϵ > 0, $\frac{1}{T} \sum_{t = 1}^{T} E_{P_{T}, η} [Z_{t} (P_{T})^{2} 1_{∣ Z_{t} (P_{T}) ∣ > ϵ \sqrt{T}} ∣ F_{T, t - 1}] \overset{P}{\to} 0$ , by Kasy [2019, Lemma 1] and condition (b) above.

Thus by the martingale central limit theorem of Dvoretzky [1972], we have that for the sequence ${P_{T}}_{T = 1}^{\infty}$ ,

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} Z_{t} (P_{T}) \overset{D}{\to} N (0, σ^{2}) .

Since the sequence ${\{P_{T}\}}_{T = 1}^{\infty}$ was chosen arbitrarily from P, the desired result is implied again by Kasy [2019, Lemma 1].

Lemma 1. Let $f (Y_{t}, X_{t}, A_{t}) \in R^{d_{f}}$ be a function such that ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [‖ f (Y_{t}, X_{t}, A_{t}) ‖^{2}] < m$ for some m < ∞. Under Conditions 1 and 9,

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} {W_{t} f (Y_{t}, X_{t}, A_{t}) - E_{P, π} [W_{t} f (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}]} = O_{P \in P} (1) .

(24)

Note that the above equation implies that

\frac{1}{T} \sum_{t = 1}^{T} {W_{t} f (Y_{t}, X_{t}, A_{t}) - E_{P, π} [W_{t} f (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}]} = o_{P \in P} (1) .

Lemma 1 is a type of martingale weak law of large numbers result, and the proof is similar to weak law of large numbers proofs for i.i.d. random variables.

Proof: We denote the k^th ∈ [1: d_f] dimension of vector f(Y_t, X_t, A_t) as f^k(Y_t, X_t, A_t). It is sufficient to show the result for any dimension of vector f(Y_t, X_t, A_t). For notational convenience, let f_t ≔ f^k(Y_t, X_t, A_t). Let ϵ > 0.

sup_{P \in P} P_{P, π} (∣ \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} \{W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]\} ∣ > ϵ)
\underset{(a)}{\leq} \frac{1}{T ϵ^{2}} sup_{P \in P} E_{P, π} [{(\sum_{t = 1}^{T} \{W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]\})}^{2}]
\underset{(b)}{=} \frac{1}{T ϵ^{2}} sup_{P \in P} \sum_{t = 1}^{T} E_{P, π} [{\{W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]\}}^{2}]
\underset{(c)}{\leq} \frac{1}{T ϵ^{2}} sup_{P \in P} \sum_{t = 1}^{T} E_{P, π} [W_{t}^{2} f_{t}^{2}]
\underset{(d)}{=} \frac{1}{T ϵ^{2}} sup_{P \in P} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} W_{t}^{2} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [f_{t}^{2} ∣ H_{t - 1}, X_{t}, A_{t} = a] d a]
\underset{(e)}{=} \frac{1}{T ϵ^{2}} sup_{P \in P} \sum_{t = 1}^{T} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [f_{t}^{2} ∣ H_{t - 1}, X_{t}, A_{t} = a] d a]
\underset{(f)}{=} \frac{1}{T ϵ^{2}} sup_{P \in P} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [f_{t}^{2}]
\underset{(g)}{\leq} \frac{m}{ϵ^{2}}

Above (a) holds by Chebyshev’s inequality.
(b) holds because the above terms form a martingale difference sequence with respect to $H_{t - 1}$ , i.e., $E_{P, π} [W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}] ∣ H_{t - 1}] = 0$ ; this implies that cross terms disappear, i.e., for t > s,
$E_{P, π} [(W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]) (W_{s} f_{s} - E_{P, π} [W_{s} f_{s} ∣ H_{s - 1}])] = E_{P, π} [E_{P, π} [(W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]) (W_{s} f_{s} - E_{P, π} [W_{s} f_{s} ∣ H_{s - 1}]) ∣ H_{t - 1}]]$

Since t > s, the factor indexed by s is measurable with respect to $H_{t - 1}$ and can be pulled out of the inner conditional expectation, so
$= E_{P, π} [(W_{s} f_{s} - E_{P, π} [W_{s} f_{s} ∣ H_{s - 1}]) E_{P, π} [W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}] ∣ H_{t - 1}]] = 0 .$
(c) holds because $E_{P, π} [{W_{t} f_{t} - E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]}^{2}] = E_{P, π} [W_{t}^{2} f_{t}^{2}] - E_{P, π} [E_{P, π} [W_{t} f_{t} ∣ H_{t - 1}]^{2}] \leq E_{P, π} [W_{t}^{2} f_{t}^{2}]$ .
(d) holds by law of iterated expectations.
(e) holds because $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ .
(f) holds since by Condition 1, $E_{P} [f_{t}^{2} ∣ H_{t - 1}, X_{t}, A_{t}] = E_{P} [f_{t}^{2} ∣ X_{t}, A_{t}]$ and by law of iterated expectations $E_{P, π_{t}^{sta}} [f_{t}^{2}] = E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [f_{t}^{2} ∣ X_{t}, A_{t} = a] d a]$ .
(g) holds since ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [f_{t}^{2}] < m < \infty$ .
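The Chebyshev bound in the proof of Lemma 1 can be compared against simulation. The sketch below is an illustration under assumptions not taken from the paper: a two-armed bandit without contexts, Gaussian rewards with hypothetical means `MU`, a clipped greedy policy, a uniform stabilizing policy, and $f (Y_{t}, X_{t}, A_{t}) = Y_{t}$. Without contexts, the conditional mean $E_{P, π} [W_{t} Y_{t} ∣ H_{t - 1}]$ equals $\sum_{a} \sqrt{π_{t} (a) π_{t}^{sta} (a)} μ_{a}$, which the code uses to center each term.

```python
import random

MU = (0.0, 0.5)        # hypothetical arm means; rewards are N(MU[a], 1)
PI_STA = (0.5, 0.5)    # hypothetical stabilizing policy
# Second moment of f = Y under the stabilizing policy, E_{pi_sta}[Y^2].
SECOND_MOMENT = sum(p * (mu * mu + 1.0) for p, mu in zip(PI_STA, MU))

def scaled_martingale_sum(T, rng, clip=0.1):
    """Returns T^(-1/2) * sum_t { W_t Y_t - E[W_t Y_t | H_{t-1}] } for one bandit run."""
    sums, counts = [0.0, 0.0], [0, 0]
    total = 0.0
    for _ in range(T):
        mean0 = sums[0] / counts[0] if counts[0] else 0.0
        mean1 = sums[1] / counts[1] if counts[1] else 0.0
        # Clipped greedy policy based on running means.
        if mean1 > mean0:
            p1 = 1.0 - clip
        elif mean1 < mean0:
            p1 = clip
        else:
            p1 = 0.5
        pi_t = (1.0 - p1, p1)
        # Conditional mean of W_t * Y_t given the history (no contexts).
        cond_mean = sum((pi_t[a] * PI_STA[a]) ** 0.5 * MU[a] for a in (0, 1))
        a = 1 if rng.random() < p1 else 0
        y = rng.gauss(MU[a], 1.0)
        w = (PI_STA[a] / pi_t[a]) ** 0.5
        total += w * y - cond_mean
        sums[a] += y
        counts[a] += 1
    return total / T ** 0.5

rng = random.Random(1)
eps = 2.0
draws = [scaled_martingale_sum(100, rng) for _ in range(2000)]
freq = sum(abs(d) > eps for d in draws) / len(draws)   # empirical exceedance frequency
bound = SECOND_MOMENT / eps ** 2                       # Chebyshev bound from the proof
```

The bound does not depend on T or on the adaptive policy, only on the second moment under the stabilizing policy, so `freq` should sit below `bound` for any horizon in this setup.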

Lemma 2. Let m_θ,t ≔ m_θ(Y_t, X_t, A_t). Under Conditions 1, 3, 4, 5, 7, and 9,

sup_{θ \in Θ} \{\frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}]\} = o_{P \in P} (1) .

(25)

Lemma 2 is a type of martingale functionally uniform law of large numbers result, and the proof is similar to the functionally uniform law of large numbers proofs for i.i.d. random variables; see Van Der Vaart and Wellner [1996, Theorem 2.4.1].

Proof:

Finite Bracketing Number: Let δ > 0. We construct a set B_δ which is made up of pairs of functions (l, u). We show that we can find B_δ that satisfies the following:

(a) For any θ ∈ Θ, we can find (l, u) ∈ B_δ such that
1. l(y, x, a) ≤ m_θ(y, x, a) ≤ u(y, x, a) for all (x, y) in the joint support of ${\{P \in P\}}$ and all $a \in A$ .
2. ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [∣ u (Y_{t}, X_{t}, A_{t}) - l (Y_{t}, X_{t}, A_{t}) ∣] \leq δ$ .
(b) The number of pairs in this set is finite, i.e., ∣B_δ∣ < ∞.
(c) For any (l, u) ∈ B_δ, for some m < ∞ which does not depend on δ, ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [u (Y_{t}, X_{t}, A_{t})^{2}] \leq m$ and ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [l (Y_{t}, X_{t}, A_{t})^{2}] \leq m$ .

Showing that we can find B_δ that satisfy (a), means that ∣B_δ∣ is an upper bound on the bracketing number of {m_θ : θ ∈ Θ}. For more information on bracketing functions, see Van Der Vaart and Wellner [1996] and Van der Vaart [2000].

To construct B_δ, we follow a similar argument to Example 19.7 of Van der Vaart [2000] (page 271). Make a grid over Θ with meshwidth λ/2 > 0 and let the points in this grid be the set G_λ/2 ⊆ Θ; we will specify λ later. Note that by construction, for any θ ∈ Θ we can find a θ′ ∈ G_λ/2 such that ∥θ′ − θ∥ ≤ λ.

By our Lipschitz Condition 4, we have that for any θ, θ′ ∈ Θ, ∣m_θ(Y_t, X_t, A_t) − m_θ′(Y_t, X_t, A_t)∣ ≤ g(Y_t, X_t, A_t)∥θ − θ′∥ for function g such that for some m_g < ∞,

sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}] \leq m_{g} .

(26)

We now show that we can choose B_δ = {(m_θ − g(Y_t, X_t, A_t), m_θ + g(Y_t, X_t, A_t)) : θ ∈ G_λ/2}. Note that by compactness of Θ, Condition 3, the number of points in G_λ/2 is finite, so (b) above holds.

To show that (a) holds for our choice of B_δ, recall that for any θ ∈ Θ we can find a θ′ ∈ G_λ/2 such that ∥θ′ − θ∥ ≤ λ. Also, by the Lipschitz Condition 4, ∣m_θ(Y_t, X_t, A_t) − m_θ′(Y_t, X_t, A_t)∣ ≤ g(Y_t, X_t, A_t)∥θ − θ′∥ ≤ g(Y_t, X_t, A_t)λ. Thus we have that

m_{θ^{'}} (Y_{t}, X_{t}, A_{t}) - g (Y_{t}, X_{t}, A_{t}) λ \leq m_{θ} (Y_{t}, X_{t}, A_{t}) \leq m_{θ^{'}} (Y_{t}, X_{t}, A_{t}) + g (Y_{t}, X_{t}, A_{t}) λ .

Note that

sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [m_{θ^{'}} (Y_{t}, X_{t}, A_{t}) + g (Y_{t}, X_{t}, A_{t}) λ - {m_{θ^{'}} (Y_{t}, X_{t}, A_{t}) - g (Y_{t}, X_{t}, A_{t}) λ}] = 2 λ sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})] \leq 2 λ \sqrt{m_{g}} < \infty .

The inequalities above hold by Equation (26) and since $E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})] \leq \sqrt{E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}]}$ by Jensen’s inequality. (a) above holds for our choice of B_δ by letting meshwidth $λ = δ ∕ (2 \sqrt{m_{g}})$ .

We now show that (c) above holds. Note that

sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [{m_{θ} (Y_{t}, X_{t}, A_{t}) + g (Y_{t}, X_{t}, A_{t})}^{2}] \leq 3 sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [m_{θ} (Y_{t}, X_{t}, A_{t})^{2}] + 3 sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}] .

(27)

Note that the above upper bound, Equation (27), also holds for ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [{m_{θ} (Y_{t}, X_{t}, A_{t}) - g (Y_{t}, X_{t}, A_{t})}^{2}]$ .

Since, $m_{θ} (Y_{t}, X_{t}, A_{t}) = m_{θ} (Y_{t}, X_{t}, A_{t}) - m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t}) + m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})$ ,

\leq 9 sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [{m_{θ} (Y_{t}, X_{t}, A_{t}) - m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})}^{2}] + 9 sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})^{2}] + 3 sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}] .

Note that ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})^{2}]$ is bounded by our moment Condition 5 and that ${sup}_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}]$ is bounded by Equation (26).

By our Lipschitz Condition 4, for any θ ∈ Θ, $∣ m_{θ} (Y_{t}, X_{t}, A_{t}) - m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t}) ∣ \leq g (Y_{t}, X_{t}, A_{t}) ‖ θ - θ^{*} (P) ‖$ . Thus,

sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [{m_{θ} (Y_{t}, X_{t}, A_{t}) - m_{θ^{*} (P)} (Y_{t}, X_{t}, A_{t})}^{2}] \leq sup_{P \in P, t \geq 1} E_{P, π_{t}^{sta}} [g (Y_{t}, X_{t}, A_{t})^{2}] ‖ θ - θ^{*} (P) ‖^{2} .

The above is bounded by Equation (26) and by compactness of Θ, Condition 3. Thus (c) above holds for our choice of B_δ.
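The grid-based bracket construction can be made concrete in a one-dimensional toy case. The sketch below assumes, purely for illustration, Θ = [−B, B] with B = 2, the criterion m_θ(y) = −(y − θ)² with no context or action, and Y ~ N(0, 1); the Lipschitz function g(y) = 2∣y∣ + 2B and the closed-form moments of g follow from these assumptions and are not taken from the paper. The brackets are (m_θ′ − λg, m_θ′ + λg) for grid points θ′.

```python
import math
import random

B = 2.0   # assumed parameter space Theta = [-B, B]

def m(theta, y):
    """Toy criterion m_theta(y) = -(y - theta)^2."""
    return -(y - theta) ** 2

def g(y):
    """Lipschitz function: |m_theta(y) - m_theta'(y)| <= g(y) * |theta - theta'| on Theta."""
    return 2.0 * abs(y) + 2.0 * B

# Moments of g(Y) for Y ~ N(0, 1): E|Y| = sqrt(2 / pi), E[Y^2] = 1.
e_abs_y = math.sqrt(2.0 / math.pi)
e_g = 2.0 * e_abs_y + 2.0 * B                   # E[g(Y)]
m_g = 4.0 + 8.0 * B * e_abs_y + 4.0 * B * B     # E[g(Y)^2]

delta = 0.5
lam = delta / (2.0 * math.sqrt(m_g))            # meshwidth choice from the proof

# Grid over Theta with spacing at most lam / 2; finitely many points, as in (b).
n_cells = math.ceil(2.0 * B / (lam / 2.0))
grid = [-B + i * (2.0 * B) / n_cells for i in range(n_cells + 1)]

# Expected bracket width E[u - l] = 2 * lam * E[g(Y)], which should be at most delta.
expected_width = 2.0 * lam * e_g

# Check the bracketing property (a).1 on random (theta, y) pairs.
rng = random.Random(0)
all_bracketed = True
for _ in range(1000):
    theta = rng.uniform(-B, B)
    y = rng.gauss(0.0, 1.0)
    nearest = min(grid, key=lambda t: abs(t - theta))
    lower = m(nearest, y) - lam * g(y)
    upper = m(nearest, y) + lam * g(y)
    all_bracketed = all_bracketed and (lower <= m(theta, y) <= upper)
```

The inequality `expected_width <= delta` is the Jensen step from the proof, since E[g] ≤ √m_g; the number of brackets, `len(grid)`, grows as δ shrinks but is finite for every δ > 0.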

Main Argument: We now show that for any ϵ > 0,

sup_{P \in P} P_{P, π} (sup_{θ \in Θ} {\frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}]} > ϵ) \to 0 .

(28)

An analogous argument can be made to show that ${sup}_{P \in P} P_{P, π} ({sup}_{θ \in Θ} {- \frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}]} > ϵ) \to 0$ .

Let δ > 0; we will choose δ later. Let B_δ be the set of pairs of functions as constructed earlier.

sup_{θ \in Θ} {\frac{1}{T} \sum_{t = 1}^{T} W_{t} m_{θ, t} - E_{P, π} [W_{t} m_{θ, t} ∣ H_{t - 1}]}

Note that by (a), we get the following upper bound:

\leq \max_{(l, u) \in B_{δ}} {\frac{1}{T} \sum_{t = 1}^{T} W_{t} u (Y_{t}, X_{t}, A_{t}) - E_{P, π} [W_{t} l (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}]} .

By adding and subtracting $E_{P, π} [W_{t} u (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}]$ and triangle inequality,

\leq \max_{(l, u) \in B_{δ}} {\frac{1}{T} \sum_{t = 1}^{T} E_{P, π} [W_{t} {u (Y_{t}, X_{t}, A_{t}) - l (Y_{t}, X_{t}, A_{t})} ∣ H_{t - 1}]} + \max_{(l, u) \in B_{δ}} {\frac{1}{T} \sum_{t = 1}^{T} W_{t} u (Y_{t}, X_{t}, A_{t}) - E_{P, π} [W_{t} u (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}]} .

Note that by Condition 9, $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}} \geq \sqrt{ρ_{min}}$ with probability 1, so $W_{t} \leq \frac{W_{t}^{2}}{\sqrt{ρ_{min}}}$ . Since u − l ≥ 0 by (a), $E_{P, π} [W_{t} \{u (Y_{t}, X_{t}, A_{t}) - l (Y_{t}, X_{t}, A_{t})\} ∣ H_{t - 1}] \leq \frac{1}{\sqrt{ρ_{min}}} E_{P, π} [W_{t}^{2} \{u (Y_{t}, X_{t}, A_{t}) - l (Y_{t}, X_{t}, A_{t})\} ∣ H_{t - 1}] = \frac{1}{\sqrt{ρ_{min}}} E_{P, π_{t}^{sta}} [u (Y_{t}, X_{t}, A_{t}) - l (Y_{t}, X_{t}, A_{t})] \leq \frac{1}{\sqrt{ρ_{min}}} δ$ ; the last equality holds by Condition 1 and the last inequality holds by (a). And since $\max_{i \in [1 : n]} \{a_{i}\} \leq \sum_{i = 1}^{n} ∣ a_{i} ∣$ ,

\leq \frac{1}{\sqrt{ρ_{min}}} δ + \sum_{(l, u) \in B_{δ}} ∣ \frac{1}{T} \sum_{t = 1}^{T} W_{t} u (Y_{t}, X_{t}, A_{t}) - E_{P, π} [W_{t} u (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}] ∣

By Lemma 1 and (c), for any (l, u) ∈ B_δ, $\frac{1}{T} \sum_{t = 1}^{T} W_{t} u (Y_{t}, X_{t}, A_{t}) - E_{P, π} [W_{t} u (Y_{t}, X_{t}, A_{t}) ∣ H_{t - 1}] = o_{P \in P} (1)$ . Since ∣B_δ∣ < ∞ by (b), the convergence holds for all (l, u) ∈ B_δ simultaneously, so

= \frac{1}{\sqrt{ρ_{min}}} δ + o_{P \in P} (1) .

Equation (28) holds by choosing $δ = \sqrt{ρ_{min}} ϵ ∕ 2$ .

B.5. Least-Squares Estimator

We use ϕ(X_t, A_t) to denote a feature vector that is constructed using context X_t and action A_t.

Condition 10 (Linear Expected Outcome). For all $P \in P$ , the following holds w.p. 1,

E_{P} [Y_{t} ∣ X_{t}, A_{t}] = ϕ (X_{t}, A_{t})^{T} θ^{*} (P) .

Condition 11 (Moment Conditions for Least Squares). The fourth moments of $ϕ (X_{t}, A_{t}) (Y_{t} - ϕ (X_{t}, A_{t})^{T} θ^{*} (P))$ and ϕ(X_t, A_t) with respect to $P$ and policy $π_{t}^{sta}$ are respectively bounded uniformly over $P \in P$ and t ≥ 1.

Also the minimum eigenvalues of $Σ_{T} (P) = \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [ϕ (X_{t}, A_{t})^{\otimes 2} {(Y_{t} - ϕ (X_{t}, A_{t})^{T} θ^{*} (P))}^{2}]$ and $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [ϕ (X_{t}, A_{t})^{\otimes 2}]$ are both bounded below by some constant greater than zero for all $P \in P$ .

Condition 12 (Importance Ratios for Least Squares). Let ρ_min > 0 be a constant and let {ρ_max,T}_T≥1 be a nonrandom sequence of positive numbers such that $\frac{ρ_{\max, T}}{T} \to 0$ . The stabilizing policies ${π_{t}^{sta}}_{t = 1}^{T}$ are pre-specified and do not depend on data ${Y_{t}, X_{t}, A_{t}}_{t = 1}^{T}$ . For all $P \in P$ , the following holds w.p. 1,

ρ_{min} \leq \frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})} \leq ρ_{\max, T} .

Note that Condition 12 allows $π_{t} (A_{t}, X_{t}, H_{t - 1})$ to go to zero at some rate for stabilizing policies ${π_{t}^{sta}}_{t \geq 1}$ that are strictly bounded away from 0 and 1.
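As a hedged illustration of the rate that Condition 12 permits, suppose (an assumption made only for this example, not a construction from the paper) that the behavior policy is clipped so that $π_{t} (a, x, H_{t - 1}) \geq c t^{- α}$ for constants c > 0 and α ≥ 0, and that the stabilizing policy satisfies $π_{t}^{sta} (a, x) \geq p_{min} > 0$. Since all probabilities are at most 1, the importance ratio can then be bounded as follows, with ρ_min = p_min:

```latex
p_{\min}
\le \frac{\pi_t^{sta}(A_t, X_t)}{\pi_t(A_t, X_t, H_{t-1})}
\le \frac{1}{c\, t^{-\alpha}}
\le \frac{T^{\alpha}}{c} =: \rho_{\max, T}
\quad \text{for all } 1 \le t \le T,
\qquad
\frac{\rho_{\max, T}}{T} = \frac{T^{\alpha - 1}}{c} \to 0
\iff \alpha < 1 .
```

Under these assumptions, action selection probabilities may decay polynomially, as long as the decay is slower than 1/t.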

We now define the AW-LS estimator for $θ^{*} (P) \in R^{d}$ :

{\hat{θ}}_{T}^{AW-LS} ≔ \underset{θ \in R^{d}}{argmax} {- \sum_{t = 1}^{T} W_{t} (Y_{t} - ϕ (X_{t}, A_{t})^{T} θ)^{2}} .

(29)
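A minimal sketch of how Equation (29) could be computed, assuming the square-root importance weights $W_{t} = \sqrt{π_{t}^{sta} ∕ π_{t}}$ used throughout. The function names `solve_linear_system` and `aw_ls` and the argument layout are hypothetical choices for this illustration. The maximizer of (29) is found by solving the weighted normal equations directly.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def aw_ls(features, rewards, pi_sta, pi_behavior):
    """Adaptively weighted least squares with W_t = sqrt(pi_sta_t / pi_t).

    features[t] is phi(X_t, A_t); pi_sta[t] and pi_behavior[t] are the
    stabilizing and behavior probabilities of the action actually taken
    at time t (hypothetical argument layout for this sketch).
    """
    d = len(features[0])
    gram = [[0.0] * d for _ in range(d)]  # sum_t W_t * phi phi^T
    moment = [0.0] * d                    # sum_t W_t * phi * Y_t
    for phi, y, p_sta, p_beh in zip(features, rewards, pi_sta, pi_behavior):
        w = math.sqrt(p_sta / p_beh)
        for i in range(d):
            moment[i] += w * phi[i] * y
            for j in range(d):
                gram[i][j] += w * phi[i] * phi[j]
    # First-order condition of (29): gram @ theta = moment.
    return solve_linear_system(gram, moment)
```

When the stabilizing and behavior probabilities coincide, every weight equals one and the sketch reduces to ordinary least squares.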

Theorem 3 (Consistency and Asymptotic Normality of Adaptively-Weighted Least Squares Estimator). Under Conditions 1, 10, 11, and 12,

Σ_{T} (P)^{- 1 ∕ 2} (\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2}) ({\hat{θ}}_{T}^{AW-LS} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P,

where $Σ_{T} (P) ≔ \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [ϕ (X_{t}, A_{t})^{\otimes 2} {(Y_{t} - ϕ (X_{t}, A_{t})^{T} θ^{*} (P))}^{2}]$ , as in Condition 11.

Proof: By taking the derivative of Equation (29) with respect to the parameters, we have that

0 = \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t}) (Y_{t} - ϕ (X_{t}, A_{t})^{T} {\hat{θ}}_{T}^{AW-LS}) .

By rearranging terms, we have that

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t}) (Y_{t} - ϕ (X_{t}, A_{t})^{T} θ^{*} (P)) = \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} ({\hat{θ}}_{T}^{AW-LS} - θ^{*} (P)) .

(30)

We first show that the following holds:

Σ_{T} (P)^{- 1 ∕ 2} \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t}) (Y_{t} - ϕ (X_{t}, A_{t})^{T} θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(31)

Equation (31) holds by a similar argument as that used in Section B.3.2, for ${\dot{m}}_{θ} (Y_{t}, X_{t}, A_{t}) = ϕ (X_{t}, A_{t}) (Y_{t} - ϕ (X_{t}, A_{t})^{T} θ^{*} (P))$ by showing that the conditions of Theorem 2 hold. It can be checked that all the arguments hold even when we allow ρ_max,T to grow at a rate such that $\frac{ρ_{\max, T}}{T} \to 0$ .

By Equations (30) and (31),

Σ_{T} (P)^{- 1 ∕ 2} \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} ({\hat{θ}}_{T}^{AW-LS} - θ^{*} (P)) \overset{D}{\to} N (0, I_{d}) uniformly over P \in P .

(32)

By Equation (32), to ensure that ${\hat{θ}}_{T}^{AW-LS} \overset{P}{\to} θ^{*} (P)$ uniformly over $P \in P$ , it is sufficient to show that the minimum eigenvalue of $Σ_{T} (P)^{- 1 ∕ 2} \frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2}$ goes to infinity uniformly over $P \in P$ as T → ∞.

By Condition 11, the maximum eigenvalue of $Σ_{T} (P)$ is bounded uniformly over $P \in P$ , so the minimum eigenvalue of $Σ_{T} (P)^{- 1 ∕ 2}$ is bounded uniformly above 0. Thus it is sufficient to show that the minimum eigenvalue of $\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2}$ goes to infinity uniformly over $P \in P$ as T → ∞.

Note that by Lemma 1 and Condition 11,

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} - E_{P, π} [W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} ∣ H_{t - 1}] = O_{P \in P} (1) .

(33)

Note that by law of iterated expectations,

E_{P, π} [W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} ∣ H_{t - 1}] = E_{P} [\int_{a \in A} π_{t} (a, X_{t}, H_{t - 1}) E_{P} [W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} ∣ H_{t - 1}, X_{t}, a] d a ∣ H_{t - 1}] .

By Condition 1 and since $W_{t} = \sqrt{\frac{π_{t}^{sta} (A_{t}, X_{t})}{π_{t} (A_{t}, X_{t}, H_{t - 1})}}$ ,

= E_{P} [\int_{a \in A} \sqrt{\frac{π_{t} (a, X_{t}, H_{t - 1})}{π_{t}^{sta} (a, X_{t})}} π_{t}^{sta} (a, X_{t}) E_{P} [ϕ (X_{t}, A_{t})^{\otimes 2} ∣ X_{t}, a] d a ∣ H_{t - 1}]

Since by Condition 12, $\sqrt{\frac{π_{t} (a, X_{t}, H_{t - 1})}{π_{t}^{sta} (a, X_{t})}} \geq \frac{1}{\sqrt{ρ_{\max, T}}}$ and $ϕ (X_{t}, A_{t})^{\otimes 2} ⪰ 0$ ,

⪰ \frac{1}{\sqrt{ρ_{\max, T}}} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [ϕ (X_{t}, A_{t})^{\otimes 2} ∣ X_{t}, a] d a ∣ H_{t - 1}] .

Since $π_{t}^{sta}$ are pre-specified and since by our i.i.d. potential outcomes assumption (Condition 1) X_t does not depend on $H_{t - 1}$ ,

= \frac{1}{\sqrt{ρ_{\max, T}}} E_{P} [\int_{a \in A} π_{t}^{sta} (a, X_{t}) E_{P} [ϕ (X_{t}, A_{t})^{\otimes 2} ∣ X_{t}, a] d a] .

By law of iterated expectations,

= \frac{1}{\sqrt{ρ_{\max, T}}} E_{P, π_{t}^{sta}} [ϕ (X_{t}, A_{t})^{\otimes 2}] .

The above result and Equation (33) imply that

\frac{1}{\sqrt{T}} \sum_{t = 1}^{T} W_{t} ϕ (X_{t}, A_{t})^{\otimes 2} ⪰ O_{P \in P} (1) + \sqrt{\frac{T}{ρ_{\max, T}}} \frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [ϕ (X_{t}, A_{t})^{\otimes 2}] .

(34)

By Condition 11, the minimum eigenvalue of $\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [ϕ (X_{t}, A_{t})^{\otimes 2}]$ is bounded below by some constant greater than zero for all $P \in P$ . By Condition 12, $\sqrt{\frac{T}{ρ_{\max, T}}} \to \infty$ . Thus by Equation (32) and Equation (34), we have that ${\hat{θ}}_{T}^{AW-LS} \overset{P}{\to} θ^{*} (P)$ uniformly over $P \in P$ .
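To make the pivot in Theorem 3 concrete, the following is a rough sketch of a normal-approximation interval for a single arm mean in a multi-armed bandit, where ϕ is the indicator of the arm. It assumes a plug-in estimate of Σ_T(P), namely $\frac{1}{T} \sum_{t} W_{t}^{2} 1_{A_{t} = a} (Y_{t} - \hat{θ}_{a})^{2}$, which uses the fact that squared weights convert expectations under π_t into expectations under π_t^sta. That plug-in choice and the function name `aw_ls_arm_interval` are assumptions of this illustration rather than something stated in this section.

```python
import math
from statistics import NormalDist


def aw_ls_arm_interval(actions, rewards, pi_sta, pi_behavior, arm, level=0.95):
    """Normal-approximation interval for one arm mean based on the AW-LS pivot.

    pi_sta[t] and pi_behavior[t] are the stabilizing and behavior
    probabilities of the action taken at time t.
    """
    T = len(actions)
    weights = [math.sqrt(ps / pb) for ps, pb in zip(pi_sta, pi_behavior)]
    pulled = [(w, y) for w, y, a in zip(weights, rewards, actions) if a == arm]
    sum_w = sum(w for w, _ in pulled)
    theta_hat = sum(w * y for w, y in pulled) / sum_w
    # Plug-in estimate of Sigma_T(P) for this arm (an assumption of this sketch).
    sigma_hat = sum(w ** 2 * (y - theta_hat) ** 2 for w, y in pulled) / T
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Invert |sigma_hat^{-1/2} * (sum_w / sqrt(T)) * (theta_hat - theta)| <= z.
    half_width = z * math.sqrt(sigma_hat) / (sum_w / math.sqrt(T))
    return theta_hat - half_width, theta_hat + half_width
```

With all weights equal to one, the half-width reduces to z times the residual standard deviation for the arm divided by the square root of the number of pulls of that arm.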

C. Choice of Stabilizing Policy

C.1. Optimal Stabilizing Policy in Multi-Arm Bandit Setting

Here we consider the multi-armed bandit setting where $E_{P} [Y_{t} (a)] = θ_{a}^{*} (P)$ and ${Var}_{P} (Y_{t} (a)) = σ^{2}$ . We consider the adaptively-weighted least-squares estimator where $m_{θ} (Y_{t}, A_{t}) = - 1_{A_{t} = a} (Y_{t} - θ_{a})^{2}$ . By Theorem 1, we have that

{(\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [1_{A_{t} = a} (Y_{t} - θ_{a}^{*} (P))^{2}])}^{- 1 ∕ 2} (\frac{1}{T} \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a}) \sqrt{T} ({\hat{θ}}_{T, a}^{AW-LS} - θ_{a}^{*} (P)) \overset{D}{\to} N (0, 1) .

While the asymptotic variance of $\sqrt{T} ({\hat{θ}}_{T, a}^{AW-LS} - θ_{a}^{*} (P))$ does not necessarily concentrate, we can examine the following:

{(\frac{1}{T} \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a})}^{- 1} (\frac{1}{T} \sum_{t = 1}^{T} E_{P, π_{t}^{sta}} [1_{A_{t} = a} (Y_{t} - θ_{a}^{*} (P))^{2}]) {(\frac{1}{T} \sum_{t = 1}^{T} W_{t} 1_{A_{t} = a})}^{- 1}

By Lemma 1, we have that $\frac{1}{T} \sum_{t = 1}^{T} {W_{t} 1_{A_{t} = a} - \sqrt{π_{t}^{sta} (a) π_{t} (a, H_{t - 1})}} \overset{P}{\to} 0$ . Thus we have

= (\frac{1}{T} \sum_{t = 1}^{T} π_{t}^{sta} (a) σ^{2}) {(o_{p} (1) + \frac{1}{T} \sum_{t = 1}^{T} \sqrt{π_{t}^{sta} (a) π_{t} (a, H_{t - 1})})}^{- 2} .

As long as $π_{t}^{sta} (a)$ , $π_{t} (a, H_{t - 1})$ are bounded away from zero w.p. 1, the o_p(1) term is asymptotically negligible and we can just consider $(\frac{1}{T} \sum_{t = 1}^{T} π_{t}^{sta} (a) σ^{2}) {(\frac{1}{T} \sum_{t = 1}^{T} \sqrt{π_{t}^{sta} (a) π_{t} (a, H_{t - 1})})}^{- 2}$ .

By the Cauchy-Schwarz inequality,

{(\frac{1}{T} \sum_{t = 1}^{T} \sqrt{π_{t}^{sta} (a) π_{t} (a, H_{t - 1})})}^{2} \leq (\frac{1}{T} \sum_{t = 1}^{T} π_{t}^{sta} (a)) (\frac{1}{T} \sum_{t = 1}^{T} π_{t} (a, H_{t - 1})) .

Thus, $\frac{1}{\frac{1}{T} \sum_{t = 1}^{T} π_{t} (a, H_{t - 1})} \leq \frac{\frac{1}{T} \sum_{t = 1}^{T} π_{t}^{sta} (a)}{{(\frac{1}{T} \sum_{t = 1}^{T} \sqrt{π_{t}^{sta} (a) π_{t} (a, H_{t - 1})})}^{2}}$ , so

\frac{\frac{1}{T} \sum_{t = 1}^{T} π_{t}^{sta} (a)}{{(\frac{1}{T} \sum_{t = 1}^{T} \sqrt{π_{t} (a, H_{t - 1}) π_{t}^{sta} (a)})}^{2}} \geq \frac{1}{\frac{1}{T} \sum_{t = 1}^{T} π_{t} (a, H_{t - 1})} .

Note that this lower bound is achieved when $π_{t}^{sta} (a) = π_{t} (a, H_{t - 1})$ . However, since π_t is a function of $H_{t - 1}$ and stabilizing policies ${π_{t}^{sta}}_{t = 1}^{T}$ are pre-specified, setting $π_{t}^{sta} (a) = π_{t} (a, H_{t - 1})$ is generally infeasible. Thus we want to choose $π_{t}^{sta}$ to be as close to π_t as possible, subject to the constraint that the stabilizing policies are pre-specified, i.e., not a function of the data {Y_t, X_t, A_t}_t≥1.
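The Cauchy-Schwarz bound above can be checked numerically. The sketch below uses hypothetical helper names (`variance_factor`, `lower_bound`) and an arbitrary, made-up sequence of behavior probabilities for one arm. It compares the variance factor for a uniform stabilizing policy with the lower bound, and confirms that the bound is attained when the stabilizing probabilities equal the behavior probabilities.

```python
import math
import random


def variance_factor(pi_sta, pi_behavior):
    """(mean of pi_sta) / (mean of sqrt(pi_sta * pi))^2 for a single arm."""
    T = len(pi_behavior)
    mean_sta = sum(pi_sta) / T
    mean_geo = sum(math.sqrt(s * p) for s, p in zip(pi_sta, pi_behavior)) / T
    return mean_sta / mean_geo ** 2


def lower_bound(pi_behavior):
    """1 / (mean of pi), the Cauchy-Schwarz lower bound on the factor."""
    return len(pi_behavior) / sum(pi_behavior)


rng = random.Random(0)
# Illustrative behavior probabilities; these are not from any real algorithm.
pi_behavior = [rng.uniform(0.05, 0.95) for _ in range(500)]
uniform_sta = [0.5] * len(pi_behavior)

print("uniform stabilizing policy:", variance_factor(uniform_sta, pi_behavior))
print("stabilizing = behavior:    ", variance_factor(pi_behavior, pi_behavior))
print("lower bound:               ", lower_bound(pi_behavior))
```

The second and third printed values agree up to floating-point error, while the first is at least as large as the bound.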

C.2. Approximating the Optimal Stabilizing Policy

One way to approximately choose the optimal stabilizing policy (called the evaluation policy in this section) is to select $π_{t}^{sta} (a, x) = E_{P, π} [π_{t} (a, x, H_{t - 1})]$ . Note that $E_{P, π} [π_{t} (a, x, H_{t - 1})]$ depends on $P$ , which is unknown. Thus it is natural to choose $π_{t}^{sta} (a, x)$ to be $E_{P, π} [π_{t} (a, x, H_{t - 1})]$ weighted by a prior on $P$ . Note that as long as the evaluation policy ensures that weights W_t are bounded, the choice of evaluation policy does not affect the asymptotic validity of the estimator.

In Figure 6, we display the difference in mean squared error for the AW-LS estimator in a two-armed bandit setting for two different choices of evaluation policy: (1) the uniform evaluation policy which selects actions uniformly from $A$ and (2) the expected $π_{t} (a, H_{t - 1})$ evaluation policy for which $π_{t}^{sta} (a) = E_{P, π} [π_{t} (a, H_{t - 1})]$ . We can see in this setting that by setting $π_{t}^{sta} (a) = E_{P, π} [π_{t} (a, H_{t - 1})]$ we are able to decrease the mean squared error of the AW-LS estimator compared to AW-LS with the uniform evaluation policy. Note though that in some cases setting $π_{t}^{sta} (a) = E_{P, π} [π_{t} (a, H_{t - 1})]$ is equivalent to choosing the uniform evaluation policy. For example, in a two-armed bandit with identical arms, common bandit algorithms have $E_{P, π} [π_{t} (a, H_{t - 1})] = 0.5$ for all t ∈ [1: T], which makes the evaluation policy $π_{t}^{sta} (a) = E_{P, π} [π_{t} (a, H_{t - 1})]$ equivalent to the uniform policy.
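One way the expected action selection probabilities could be approximated in practice is by Monte Carlo simulation of the bandit algorithm under an assumed reward distribution. The sketch below does this for an ϵ-greedy two-armed bandit with unit-variance normal rewards. The function names, the tie-breaking rule, and the convention that an unpulled arm has empirical mean zero are assumptions of this example. Averaging over a prior on P would amount to drawing `arm_means` afresh for each run.

```python
import random


def epsilon_greedy_prob_arm1(sums, counts, epsilon):
    """pi_t(1, H_{t-1}) under epsilon-greedy, splitting ties evenly."""
    # Assumed convention: an arm that has not been pulled has empirical mean 0.
    means = [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]
    if means[1] > means[0]:
        return 1 - epsilon / 2
    if means[1] < means[0]:
        return epsilon / 2
    return 0.5


def expected_policy(arm_means, T, n_runs, epsilon=0.1, seed=0):
    """Monte Carlo estimate of E[pi_t(1, H_{t-1})] for t = 1, ..., T."""
    rng = random.Random(seed)
    totals = [0.0] * T
    for _ in range(n_runs):
        sums, counts = [0.0, 0.0], [0, 0]
        for t in range(T):
            p1 = epsilon_greedy_prob_arm1(sums, counts, epsilon)
            totals[t] += p1
            action = 1 if rng.random() < p1 else 0
            reward = rng.gauss(arm_means[action], 1.0)
            sums[action] += reward
            counts[action] += 1
    return [total / n_runs for total in totals]


# Pre-specified stabilizing probabilities for arm 1, computed before any
# real data are collected.
pi_sta_arm1 = expected_policy(arm_means=[0.0, 0.5], T=50, n_runs=2000)
```

Because the resulting sequence is computed from simulated histories only, it does not depend on the observed data, as the pre-specification constraint requires.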

D. Need for Uniformly Valid Inference on Data Collected with Bandit Algorithms

Here we consider the two-armed bandit setting where $E_{P} [R_{t} (a)] = θ_{a}^{*} (P)$ , ${Var}_{P} (R_{t} (a)) = σ^{2}$ , and $E_{P} [R_{t} (a)^{4}] < c < \infty$ for a ∈ {0, 1}. The unweighted least squares estimator is asymptotically normal on adaptively collected data under the following condition of Lai and Wei [1982]: there exists a non-random sequence {b_t}_t≥1 such that

b_{T} \cdot \sum_{t = 1}^{T} A_{t} \overset{P}{\to} 1 .

(35)

Specifically, by Theorem 3 of Lai and Wei [1982], under (35),

\sqrt{\sum_{t = 1}^{T} A_{t}} ({\hat{θ}}_{T, 1}^{OLS} - θ_{1}^{*} (P)) = \frac{\sum_{t = 1}^{T} A_{t} (R_{t} - θ_{1}^{*} (P))}{\sqrt{\sum_{t = 1}^{T} A_{t}}} \overset{D}{\to} N (0, σ^{2}) .

However, as discussed in Deshpande et al. [2018] and Zhang et al. [2020], (35) can fail to hold for common bandit algorithms when there is no unique optimal policy, i.e., when $θ_{0}^{*} (P) - θ_{1}^{*} (P) = 0$ . For example, in Figure 7 we plot $\frac{1}{T} \sum_{t = 1}^{T} A_{t}$ for Thompson Sampling and ϵ-greedy for a bandit with two identical arms.

Figure 7: Above we plot empirical allocations, $\frac{1}{T} \sum_{t = 1}^{T} A_{t}$ , under both Thompson Sampling (standard normal priors, 0.01 clipping) and ϵ-greedy (ϵ = 0.1) under zero margin $θ_{0}^{*} (P) = θ_{1}^{*} (P) = 0$ . For our simulations T = 100, errors are standard normal, and we use 50k Monte Carlo repetitions.
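A simulation in the spirit of Figure 7 can be sketched as follows, for Thompson Sampling only. This is an illustrative reimplementation under the stated setup (standard normal priors, standard normal errors, 0.01 clipping, both arm means zero), not the authors' simulation code. The function names and the default number of repetitions are choices made for this example.

```python
import math
import random
from statistics import NormalDist


def thompson_prob_arm1(sums, counts, clip=0.01):
    """Clipped P(theta_1 > theta_0 | H_{t-1}) with N(0, 1) priors, unit noise."""
    # Conjugate normal posterior for each arm: mean sum/(n+1), variance 1/(n+1).
    post_mean = [s / (n + 1) for s, n in zip(sums, counts)]
    post_var = [1 / (n + 1) for n in counts]
    z = (post_mean[1] - post_mean[0]) / math.sqrt(post_var[0] + post_var[1])
    return min(1 - clip, max(clip, NormalDist().cdf(z)))


def empirical_allocations(T=100, n_reps=1000, seed=0):
    """Return (1/T) * sum_t A_t for each of n_reps simulated bandit runs."""
    rng = random.Random(seed)
    allocations = []
    for _ in range(n_reps):
        sums, counts = [0.0, 0.0], [0, 0]
        for _ in range(T):
            p1 = thompson_prob_arm1(sums, counts)
            action = 1 if rng.random() < p1 else 0
            sums[action] += rng.gauss(0.0, 1.0)  # both arms have mean zero
            counts[action] += 1
        allocations.append(counts[1] / T)
    return allocations
```

A histogram of the returned allocations can then be compared with the spread expected under independent sampling with probability 0.5.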

In order to construct reliable confidence intervals using asymptotic approximations, it is crucial that estimators converge uniformly in distribution. To illustrate the importance of uniformity, consider the following example. We can modify Thompson Sampling to ensure that $\frac{1}{T} \sum_{t = 1}^{T} A_{t} \overset{P}{\to} 0.5$ when $θ_{1}^{*} (P) - θ_{0}^{*} (P) = 0$ . For example, we could do this by using an algorithm we call Thompson Sampling Hodges (inspired by the Hodges estimator; see Van der Vaart [2000, Page 109]), defined below:

π_{t} (1, H_{t - 1}) = P ({\tilde{θ}}_{1} > {\tilde{θ}}_{0} ∣ H_{t - 1}) 1_{∣ μ_{1, t} - μ_{0, t} ∣ > t^{- 4}} + 0.5 1_{∣ μ_{1, t} - μ_{0, t} ∣ \leq t^{- 4}}

Under standard Thompson Sampling, arm one is chosen according to the posterior probability that it is optimal, so $π_{t} (1, H_{t - 1}) = P ({\tilde{θ}}_{1} > {\tilde{θ}}_{0} ∣ H_{t - 1})$ . Above, μ_a,t denotes the posterior mean for the mean reward for arm a at time t. Under TS-Hodges, if the difference between the posterior means, ∣μ_1,t − μ_0,t∣, is at most t⁻⁴, π_t is set to 0.5. Additionally, we clip the action selection probabilities to bound them strictly away from 0 and 1 for some constant π_min in the following sense: clip(π_t) = (1 − π_min) ∧ (π_t ∨ π_min). Under TS-Hodges with clipping, we can show that

\frac{1}{T} \sum_{t = 1}^{T} A_{t} \overset{P}{\to} \begin{cases} 1 - π_{min} & \text{if } θ_{1}^{*} (P) - θ_{0}^{*} (P) > 0 \\ π_{min} & \text{if } θ_{1}^{*} (P) - θ_{0}^{*} (P) < 0 \\ 0.5 & \text{if } θ_{1}^{*} (P) - θ_{0}^{*} (P) = 0 \end{cases}

(36)
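The TS-Hodges action selection rule with clipping can be written as a small function. In this sketch the standard Thompson Sampling probability and the two posterior means are passed in as arguments; the function name and signature are hypothetical.

```python
def ts_hodges_prob_arm1(ts_prob, post_mean_0, post_mean_1, t, pi_min=0.01):
    """Clipped TS-Hodges probability of selecting arm 1 at time t >= 1.

    ts_prob is the standard Thompson Sampling probability
    P(theta_1 > theta_0 | H_{t-1}); post_mean_a is the posterior mean of arm a.
    """
    if abs(post_mean_1 - post_mean_0) > t ** -4:
        prob = ts_prob
    else:
        prob = 0.5  # posterior means are within t^{-4}: sample uniformly
    # clip(pi_t) = (1 - pi_min) ∧ (pi_t ∨ pi_min)
    return min(1 - pi_min, max(pi_min, prob))
```

Because the threshold t⁻⁴ shrinks quickly, the uniform branch is only triggered when the posterior means are nearly identical, which is what drives the pointwise but non-uniform behavior described in this section.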

By equation (36), we satisfy (35) pointwise for every fixed $P$ and we have that the OLS estimator is asymptotically normal pointwise [Lai and Wei, 1982]. However, equation (36) fails to hold uniformly over $P \in P$ . Specifically, it fails to hold for any sequence of ${P_{t}}_{t = 1}^{\infty}$ such that $θ_{1}^{*} (P_{t}) - θ_{0}^{*} (P_{t}) = t^{- 4}$ . In Figure 8, we show that confidence intervals constructed using normal approximations are unreliable, even for very large sample sizes, for the worst-case values of $θ_{1}^{*} (P) - θ_{0}^{*} (P)$ .

Figure 8: Above we construct confidence intervals for $θ_{1}^{*} (P) - θ_{0}^{*} (P)$ using a normal approximation for the OLS estimator. We compare independent sampling (π_t = 0.5) and TS Hodges, both with standard normal priors, 0.01 clipping, standard normal errors, and T = 10,000. We vary the value of $θ_{1}^{*} (P) - θ_{0}^{*} (P)$ in the simulations to demonstrate the non-uniformity of the confidence intervals.

E. Discussion of Chen et al. [2020]

Here we show formally that Theorem 3.1 in Chen et al. [2020], which proves that the OLS estimator is asymptotically normal on data collected with an ϵ-greedy algorithm, does not cover the case in which there is no unique optimal policy.

They assume that for rewards R_t, context vectors X_t, and binary actions A_t ∈ {0, 1},

E [R_{t} ∣ X_{t}, A_{t}] = A_{t} X_{t}^{T} β_{1} + (1 - A_{t}) X_{t}^{T} β_{0} .

They define β ≔ β₁ − β₀.

Specifically at part 1(b) of their proof on page 4 of the supplementary material, they claim that $g ({\hat{β}}_{t}, ϵ) \overset{P}{\to} g (β, ϵ)$ , where ${\hat{β}}_{t}$ is the OLS estimator for β ≔ β₁ − β₀ and g is defined as follows:

g (β, ϵ) = \frac{ϵ}{2} \int v^{T} x x^{T} v d P_{x} + (1 - ϵ) \int 1_{β^{T} x \geq 0} v^{T} x x^{T} v d P_{x}

Above $v \in R^{d}$ is an arbitrary fixed vector and $x \in R^{d}$ are the context vectors. $P_{x}$ is the distribution of the context vectors X_t.

Specifically, they claim that $g ({\hat{β}}_{t}, ϵ) \overset{P}{\to} g (β, ϵ)$ because ${\hat{β}}_{t} \overset{P}{\to} β$ (Corollary 3.1) and by the continuous mapping theorem.

Recall the continuous mapping theorem for convergence in probability [Van der Vaart, 2000, Theorem 2.3]:

Theorem 4 (Continuous Mapping Theorem). Let $g : R^{k} \to R^{m}$ be continuous at every point of a set C such that $P (X \in C) = 1$ . If $X_{n} \overset{P}{\to} X$ , then $g (X_{n}) \overset{P}{\to} g (X)$ .

Note that g is not continuous in β at the value $β = 0 \in R^{d}$ ; this is due to the indicator term $1_{β^{T} x \geq 0}$ . Thus, the standard continuous mapping theorem cannot be applied in this setting. Note that the case β = β₁ − β₀ = 0 is exactly when there is no unique optimal policy. This means that Theorem 3.1 in Chen et al. [2020] does not cover the setting in which there is no unique optimal policy.
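The discontinuity at β = 0 can be seen in a one-dimensional toy case. The sketch below assumes, purely for illustration, scalar contexts distributed uniformly on {−1, +1} and v = 1; the function signature is hypothetical. At β = 0 the indicator equals one for every context, whereas for any nonzero β it equals one on only half of the support.

```python
def g(beta, epsilon, support=(-1.0, 1.0), v=1.0):
    """g(beta, epsilon) for scalar contexts uniform on `support` (assumed P_x)."""
    quad = [(v * x) ** 2 for x in support]  # v^T x x^T v in one dimension
    indicator = [1.0 if beta * x >= 0 else 0.0 for x in support]
    mean_quad = sum(quad) / len(support)
    mean_ind_quad = sum(i * q for i, q in zip(indicator, quad)) / len(support)
    return epsilon / 2 * mean_quad + (1 - epsilon) * mean_ind_quad


epsilon = 0.1
for beta in (1e-2, 1e-4, 1e-8, 0.0):
    print(beta, g(beta, epsilon))
```

In this toy case g equals ϵ/2 + (1 − ϵ)/2 for every nonzero β but jumps to ϵ/2 + (1 − ϵ) at β = 0, so g evaluated along a sequence of nonzero values converging to 0 does not converge to g(0, ϵ).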

Contributor Information

Kelly W. Zhang, Department of Computer Science, Harvard University

Lucas Janson, Departments of Statistics, Harvard University.

Susan A. Murphy, Departments of Statistics and Computer Science, Harvard University

References

Abbasi-Yadkori Yasin, Pál Dávid, and Szepesvári Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Agrawal Shipra and Goyal Navin. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
Agresti Alan. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.
Bibaut Aurélien, Chambaz Antoine, Dimakopoulou Maria, Kallus Nathan, and Laan Mark van der. Post-contextual-bandit inference. NeurIPS 2021, 2021.
Bubeck Sébastien, Munos Rémi, and Stoltz Gilles. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009.
Bura Efstathia, Duarte Sabrina, Forzani Liliana, Smucler Ezequiel, and Sued Mariela. Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models. Statistics, 52(5):1005–1024, 2018.
Caria Stefano, Kasy Maximilian, Quinn Simon, Shami Soha, Teytelboym Alex, et al. An adaptive targeted field experiment: Job search assistance for refugees in jordan. 2020.
Chen Haoyu, Lu Wenbin, and Song Rui. Statistical inference for online decision making: In a contextual bandit setting. Journal of the American Statistical Association, pages 1–16, 2020.
Dean Sarah, Mania Horia, Matni Nikolai, Recht Benjamin, and Tu Stephen. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, 2018.
Deshpande Yash, Mackey Lester, Syrgkanis Vasilis, and Taddy Matt. Accurate inference for adaptive linear models. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1194–1203, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
Dvoretzky Aryeh. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
Engle Robert F. Handbook of econometrics: volume 4. Number 330.015195 E53 v. 4. 1994.
Erraqabi Akram, Lazaric Alessandro, Valko Michal, Brunskill Emma, and Liu Yun-En. Trading off rewards and errors in multi-armed bandits. In Artificial Intelligence and Statistics, pages 709–717. PMLR, 2017.
Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
Hammersley John. Monte carlo methods. Springer Science & Business Media, 2013.
Huber Peter J. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
Imbens Guido W and Rubin Donald B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
Kallus Nathan and Uehara Masatoshi. Double reinforcement learning for efficient off-policy evaluation in markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.
Kasy Maximilian. Uniformity and the delta method. Journal of Econometric Methods, 8(1), 2019.
Kasy Maximilian and Sautmann Anja. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.
Lai Tze Leung and Wei Ching Zong. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.
Leeb Hannes and Pötscher Benedikt M. Model selection and inference: Facts and fiction. Econometric Theory, pages 21–59, 2005.
Liao Peng, Greenewald Kristjan, Klasnja Predrag, and Murphy Susan. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020.
Liu Yun-En, Mandel Travis, Brunskill Emma, and Popovic Zoran. Trading off scientific knowledge and user learning with multi-armed bandits. In EDM, pages 161–168, 2014.
Nickerson David M. Construction of a conservative confidence region from projections of an exact confidence region in multiple linear regression. The American Statistician, 48(2):120–124, 1994.
Rafferty Anna, Ying Huiji, and Williams Joseph. Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. JEDM | Journal of Educational Data Mining, 11(1):47–79, 2019.
Robins James M, Hernan Miguel Angel, and Brumback Babette. Marginal structural models and causal inference in epidemiology, 2000.
Romano Joseph P, Shaikh Azeem M, et al. On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6):2798–2822, 2012.
Shaikh Hammad, Modiri Arghavan, Williams Joseph Jay, and Rafferty Anna N. Balancing student success and inferring personalized effects in dynamic experiments. In EDM, 2019.
Thomas Philip and Brunskill Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.
Van der Vaart Aad W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
Van Der Vaart Aad W and Wellner Jon A. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996.
Villar Sofía S, Bowden Jack, and Wason James. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.
Wang Yu-Xiang, Agarwal Alekh, and Dudik Miroslav. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017.
Yao Jiayu, Brunskill Emma, Pan Weiwei, Murphy Susan, and Doshi-Velez Finale. Power-constrained bandits. arXiv preprint arXiv:2004.06230, 2020.
Yom-Tov Elad, Feraru Guy, Kozdoba Mark, Mannor Shie, Tennenholtz Moshe, and Hochberg Irit. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of medical Internet research, 19(10):e338, 2017.
Zhan Ruohan, Hadad Vitor, Hirshberg David A, and Athey Susan. Off-policy evaluation via adaptive weighting with data from contextual bandits. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021.
Zhang Kelly W, Janson Lucas, and Murphy Susan A. Inference for batched bandits. In Advances in Neural Information Processing Systems, 2020.

[R1] Abbasi-Yadkori Yasin, Pál Dávid, and Szepesvári Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011. [Google Scholar]

[R2] Agrawal Shipra and Goyal Navin. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013. [Google Scholar]

[R3] Agresti Alan. Foundations of linear and generalized linear models. John Wiley & Sons, 2015. [Google Scholar]

[R4] Bibaut Aurélien, Chambaz Antoine, Dimakopoulou Maria, Kallus Nathan, and Laan Mark van der. Post-contextual-bandit inference. NeurIPS 2021, 2021. [PMC free article] [PubMed] [Google Scholar]

[R5] Bubeck Sébastien, Munos Rémi, and Stoltz Gilles. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009. [Google Scholar]

[R6] Bura Efstathia, Duarte Sabrina, Forzani Liliana, Smucler Ezequiel, and Sued Mariela. Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models. Statistics, 52(5):1005–1024, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Caria Stefano, Kasy Maximilian, Quinn Simon, Shami Soha, Teytelboym Alex, et al. An adaptive targeted field experiment: Job search assistance for refugees in jordan. 2020. [Google Scholar]

[R8] Chen Haoyu, Lu Wenbin, and Song Rui. Statistical inference for online decision making: In a contextual bandit setting. Journal of the American Statistical Association, pages 1–16, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Dean Sarah, Mania Horia, Matni Nikolai, Recht Benjamin, and Tu Stephen. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, 2018. [Google Scholar]

[R10] Deshpande Yash, Mackey Lester, Syrgkanis Vasilis, and Taddy Matt. Accurate inference for adaptive linear models. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1194–1203, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. [Google Scholar]

[R11] Dvoretzky Aryeh. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972. [Google Scholar]

[R12] Engle Robert F. Handbook of econometrics: volume 4. Number 330.015195 E53 v. 4. 1994. [Google Scholar]

[R13] Erraqabi Akram, Lazaric Alessandro, Valko Michal, Brunskill Emma, and Liu Yun-En. Trading off rewards and errors in multi-armed bandits. In Artificial Intelligence and Statistics, pages 709–717. PMLR, 2017. [Google Scholar]

[R14] Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Hammersley John. Monte carlo methods. Springer Science & Business Media, 2013. [Google Scholar]

[R16] Huber Peter J. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992. [Google Scholar]

[R17] Imbens Guido W and Rubin Donald B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. [Google Scholar]

[R18] Kallus Nathan and Uehara Masatoshi. Double reinforcement learning for efficient off-policy evaluation in markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.34305477 [Google Scholar]

[R19] Kasy Maximilian. Uniformity and the delta method. Journal of Econometric Methods, 8(1), 2019. [Google Scholar]

[R20] Kasy Maximilian and Sautmann Anja. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021. [Google Scholar]

[R21] Lai Tze Leung and Wei Ching Zong. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982. [Google Scholar]

[R22] Leeb Hannes and Pötscher Benedikt M. Model selection and inference: Facts and fiction. Econometric Theory, pages 21–59, 2005. [Google Scholar]

[R23] Liao Peng, Greenewald Kristjan, Klasnja Predrag, and Murphy Susan. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Liu Yun-En, Mandel Travis, Brunskill Emma, and Popovic Zoran. Trading off scientific knowledge and user learning with multi-armed bandits. In EDM, pages 161–168, 2014. [Google Scholar]

[R25] Nickerson David M. Construction of a conservative confidence region from projections of an exact confidence region in multiple linear regression. The American Statistician, 48(2):120–124, 1994. [Google Scholar]

[R26] Rafferty Anna, Ying Huiji, and Williams Joseph. Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. JEDM∣ Journal of Educational Data Mining, 11(1):47–79, 2019. [Google Scholar]

[R27] Robins James M, Hernan Miguel Angel, and Brumback Babette. Marginal structural models and causal inference in epidemiology, 2000. [DOI] [PubMed] [Google Scholar]

[R28] Romano Joseph P, Shaikh Azeem M, et al. On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6):2798–2822, 2012. [Google Scholar]

[R29] Shaikh Hammad, Modiri Arghavan, Williams Joseph Jay, and Rafferty Anna N. Balancing student success and inferring personalized effects in dynamic experiments. In EDM, 2019. [Google Scholar]

[R30] Thomas Philip and Brunskill Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016. [Google Scholar]

[R31] Van der Vaart Aad W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000. [Google Scholar]

[R32] Van Der Vaart Aad W and Wellner Jon A. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996. [Google Scholar]

[R33] Villar Sofía S, Bowden Jack, and Wason James. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Wang Yu-Xiang, Agarwal Alekh, and Dudik Miroslav. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017. [Google Scholar]

[R35] Yao Jiayu, Brunskill Emma, Pan Weiwei, Murphy Susan, and Doshi-Velez Finale. Power-constrained bandits. arXiv preprint arXiv:2004.06230, 2020. [PMC free article] [PubMed] [Google Scholar]

[R36] Yom-Tov Elad, Feraru Guy, Kozdoba Mark, Mannor Shie, Tennenholtz Moshe, and Hochberg Irit. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of medical Internet research, 19(10):e338, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Zhan Ruohan, Hadad Vitor, Hirshberg David A, and Athey Susan. Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021.

[R38] Zhang Kelly W, Janson Lucas, and Murphy Susan A. Inference for batched bandits. In Advances in Neural Information Processing Systems, 2020.
