Author manuscript; available in PMC: 2022 Jun 24.
Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;34:7460–7471.

Statistical Inference with M-Estimators on Adaptively Collected Data

Kelly W Zhang 1, Lucas Janson 2, Susan A Murphy 3
PMCID: PMC9232184  NIHMSID: NIHMS1762664  PMID: 35757490

Abstract

Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators (which include estimators based on empirical risk minimization as well as maximum likelihood) on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.

1. Introduction

Due to the need for interventions that are personalized to users, (contextual) bandit algorithms are increasingly used to address sequential decision making problems in health-care [Yom-Tov et al., 2017, Liao et al., 2020], online education [Liu et al., 2014, Shaikh et al., 2019], and public policy [Kasy and Sautmann, 2021, Caria et al., 2020]. Contextual bandits personalize, that is, minimize regret, by learning to choose the best intervention in each context, i.e., the action that leads to the greatest expected reward. Besides the goal of regret minimization, another critical goal in these real-world problems is to be able to use the resulting data collected by bandit algorithms to advance scientific knowledge [Liu et al., 2014, Erraqabi et al., 2017]. By scientific knowledge, we mean information gained by using the data to conduct a variety of statistical analyses, including confidence interval construction and hypothesis testing. While regret minimization is a within-experiment learning objective, gaining scientific knowledge from the resulting adaptively collected data is a between-experiment learning objective, which ultimately helps with regret minimization between deployments of bandit algorithms. Note that the data collected by bandit algorithms are adaptively collected because previously observed contexts, actions, and rewards are used to inform what actions to select in future timesteps.

There are a variety of between-experiment learning questions encountered in real-life applications of bandit algorithms. For example, in real-life sequential decision-making problems there are often a number of additional scientifically interesting outcomes besides the reward that are collected during the experiment. In the online advertising setting, the reward might be whether an ad is clicked on, but one may be interested in outcomes such as the amount of money spent or the subsequent time spent on the advertiser's website. If it were found that an ad had a high click-through rate, but little money was spent after clicking on the ad, one may redesign the reward used in the next bandit experiment. One type of statistical analysis would be to construct confidence intervals for the relative effect of the actions on multiple outcomes (in addition to the reward) conditional on the context. Furthermore, due to engineering and practical limitations, some of the variables that might be useful as context are often not accessible to the bandit algorithm online. If after-study analyses find some such contextual variables to have sufficiently strong influence on the relative usefulness of an action, this might lead investigators to ensure these variables are accessible to the bandit algorithm in the next experiment.

As discussed above, we can gain scientific knowledge from data collected with (contextual) bandit algorithms by constructing confidence intervals and performing hypothesis tests for unknown quantities such as the expected outcome for different actions in various contexts. Unfortunately, standard statistical methods developed for i.i.d. data fail to provide valid inference when applied to data collected with common bandit algorithms. For example, assuming the sample mean of rewards for an arm is approximately normal can lead to unreliable confidence intervals and inflated type-1 error; see Section 3.1 for an illustration. Recently statistical inference methods have been developed for data collected using bandit algorithms [Hadad et al., 2019, Deshpande et al., 2018, Zhang et al., 2020]; however, these methods are limited to inference for parameters of simple models. There is a lack of general statistical inference methods for data collected with (contextual) bandit algorithms in more complex data-analytic settings, including parameters in non-linear models for outcomes; for example, there are currently no methods for constructing valid confidence intervals for the parameters of a logistic regression model for binary outcomes or for constructing confidence intervals based on robust estimators like minimizers of the Huber loss function.

In this work we show that a wide variety of estimators which are frequently used both in science and industry on i.i.d. data, namely, M-estimators [Van der Vaart, 2000], can be used to conduct valid inference on data collected with (contextual) bandit algorithms when adjusted with particular adaptive weights, i.e., weights that are a function of previously collected data. Different forms of adaptive weights are used by existing methods for simple models [Deshpande et al., 2018, Hadad et al., 2019, Zhang et al., 2020]. Our work is a step towards developing a general framework for statistical inference on data collected with adaptive algorithms, including (contextual) bandit algorithms.

2. Problem Formulation

We assume that the data we have after running a contextual bandit algorithm is comprised of contexts $\{X_t\}_{t=1}^T$, actions $\{A_t\}_{t=1}^T$, and primary outcomes $\{Y_t\}_{t=1}^T$. $T$ is deterministic and known. We assume that rewards are a deterministic function of the primary outcomes, i.e., $R_t = f(Y_t)$ for some known function $f$. We are interested in constructing confidence regions for the parameters of the conditional distribution of $Y_t$ given $(X_t, A_t)$. Below we consider $T \to \infty$ in order to derive the asymptotic distributions of estimators and construct asymptotically valid confidence intervals. We allow the action space $\mathcal{A}$ to be finite or infinite. We use potential outcome notation [Imbens and Rubin, 2015] and let $\{Y_t(a) : a \in \mathcal{A}\}$ denote the potential outcomes of the primary outcome and let $Y_t := Y_t(A_t)$ be the observed outcome. We assume a stochastic contextual bandit environment in which $\{X_t, Y_t(a) : a \in \mathcal{A}\} \overset{\text{i.i.d.}}{\sim} P \in \mathcal{P}$ for $t \in [1 \colon T]$; the contextual bandit environment distribution $P$ is in a space of possible environment distributions $\mathcal{P}$. We define the history $\mathcal{H}_t := \{X_{t'}, A_{t'}, Y_{t'}\}_{t'=1}^{t}$ for $t \ge 1$ and $\mathcal{H}_0 := \emptyset$. Actions $A_t \in \mathcal{A}$ are selected according to policies $\pi := \{\pi_t\}_{t \ge 1}$, which define action selection probabilities $\pi_t(A_t, X_t, \mathcal{H}_{t-1}) := \mathbb{P}(A_t \mid \mathcal{H}_{t-1}, X_t)$. Even though the potential outcomes are i.i.d., the observed data $\{X_t, A_t, Y_t\}_{t=1}^T$ are not because the actions are selected using policies $\pi_t$ which are a function of past data, $\mathcal{H}_{t-1}$. Non-independence of observations is a key property of adaptively collected data.

We are interested in constructing confidence regions for some unknown $\theta(P) \in \Theta \subset \mathbb{R}^d$, which is a parameter of the conditional distribution of $Y_t$ given $(X_t, A_t)$. This work focuses on the setting in which we have a well-specified model for $Y_t$. Specifically, we assume that $\theta(P)$ is a conditionally maximizing value of criterion $m_\theta$, i.e., for all $P \in \mathcal{P}$,

$$\theta(P) \in \operatorname*{argmax}_{\theta \in \Theta} \mathbb{E}_P\left[ m_\theta(Y_t, X_t, A_t) \mid X_t, A_t \right] \quad \text{w.p. } 1. \tag{1}$$

Note that θ(P) does not depend on (Xt, At) and it is an implicit modelling assumption that such a θ(P) exists for a given mθ. Note that this formulation includes semi-parametric models, e.g., the model could constrain the conditional mean of Yt to be linear in some function of the actions and context, but allow the residuals to follow any mean-zero distribution, including ones that depend on the actions and/or contexts.

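To estimate $\theta(P)$, we build on M-estimation [Huber, 1992], which classically selects the estimator $\hat{\theta}_T$ to be the $\theta \in \Theta$ that maximizes the empirical analogue of Equation (1):

$$\hat{\theta}_T := \operatorname*{argmax}_{\theta \in \Theta} \frac{1}{T} \sum_{t=1}^T m_\theta(Y_t, X_t, A_t). \tag{2}$$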

For example, in a classical linear regression setting with $|\mathcal{A}| < \infty$ actions, a natural choice for $m_\theta$ is the negative of the squared loss function, $m_\theta(Y_t, X_t, A_t) = -(Y_t - X_t^\top \theta_{A_t})^2$. When $Y_t$ is binary, a natural choice is instead the log-likelihood function for a logistic regression model, i.e., $m_\theta(Y_t, X_t, A_t) = Y_t X_t^\top \theta_{A_t} - \log(1 + \exp(X_t^\top \theta_{A_t}))$. More generally, $m_\theta$ is commonly chosen to be a log-likelihood function or the negative of a robust loss function such as the Huber loss. If the data, $\{X_t, A_t, Y_t\}_{t=1}^T$, were independent across time, classical approaches could be used to prove the consistency and asymptotic normality of M-estimators [Van der Vaart, 2000]. However, on data collected with bandit algorithms, standard M-estimators like the ordinary least-squares estimator fail to provide valid confidence intervals [Hadad et al., 2019, Deshpande et al., 2018, Zhang et al., 2020]. In this work, we show that M-estimators can still be used to provide valid statistical inference on adaptively collected data when adjusted with well-chosen adaptive weights.
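As a minimal sketch of the two example criteria (not code from the paper), the functions below evaluate the least-squares and logistic criteria on a single observation, along with the unweighted criterion of Equation (2). The function names and the layout of the parameter as a dictionary with one coefficient list per action are illustrative assumptions. Because the estimator is defined as a maximizer, the logistic criterion is coded as the log-likelihood itself.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Theta = Dict[int, List[float]]             # hypothetical layout: action -> coefficients
Obs = Tuple[float, Sequence[float], int]   # one observation (Y_t, X_t, A_t)


def linear_predictor(theta: Theta, x: Sequence[float], a: int) -> float:
    """Inner product X_t^T theta_{A_t}."""
    return sum(xi * ti for xi, ti in zip(x, theta[a]))


def m_least_squares(theta: Theta, y: float, x: Sequence[float], a: int) -> float:
    """Negative squared loss, so that larger values are better."""
    return -(y - linear_predictor(theta, x, a)) ** 2


def m_logistic(theta: Theta, y: float, x: Sequence[float], a: int) -> float:
    """Bernoulli log-likelihood with a logit link: y * eta - log(1 + exp(eta))."""
    eta = linear_predictor(theta, x, a)
    # log(1 + exp(eta)), computed in a way that does not overflow for large |eta|.
    softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return y * eta - softplus


def m_criterion(theta: Theta, data: Sequence[Obs],
                m: Callable[[Theta, float, Sequence[float], int], float]) -> float:
    """Unweighted criterion (1/T) * sum_t m_theta(Y_t, X_t, A_t)."""
    return sum(m(theta, y, x, a) for (y, x, a) in data) / len(data)
```

Maximizing `m_criterion` over `theta` with any numerical optimizer would give the classical (unweighted) M-estimator; the optimizer itself is left out of this sketch.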

3. Adaptively Weighted M-Estimators

We consider a weighted M-estimation criterion with adaptive weights $W_t \in \sigma(\mathcal{H}_{t-1}, X_t, A_t)$ given by $W_t = \sqrt{\frac{\pi_t^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$. Here $\{\pi_t^{\text{sta}}\}_{t \ge 1}$ are pre-specified stabilizing policies that do not depend on data $\{Y_t, X_t, A_t\}_{t \ge 1}$. A default choice for the stabilizing policy when the action space is of size $|\mathcal{A}| < \infty$ is just $\pi_t^{\text{sta}}(a, x) = 1/|\mathcal{A}|$ for all $x$, $a$, and $t$; we discuss considerations for the choice of $\{\pi_t^{\text{sta}}\}_{t=1}^T$ in Section 3.3. We call these weights square-root importance weights because they are the square-root of the standard importance weights [Hammersley, 2013, Wang et al., 2017]. Our proposed estimator for $\theta(P)$, $\hat{\theta}_T$, is the maximizer of a weighted version of the M-estimation criterion of Equation (2):

$$\hat{\theta}_T := \operatorname*{argmax}_{\theta \in \Theta} \frac{1}{T} \sum_{t=1}^T W_t \, m_\theta(Y_t, X_t, A_t) =: \operatorname*{argmax}_{\theta \in \Theta} M_T(\theta).$$
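The weighting scheme can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: it takes the action selection probabilities as already logged, uses a scalar context, and the names `sqrt_importance_weight` and `weighted_criterion` are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Obs = Tuple[float, float, int]  # (Y_t, X_t, A_t) with a scalar context, for illustration


def sqrt_importance_weight(pi_sta: float, pi_t: float) -> float:
    """W_t = sqrt(pi_sta(A_t, X_t) / pi_t(A_t, X_t, H_{t-1})).

    pi_sta: stabilizing-policy probability of the action that was taken.
    pi_t:   probability with which the bandit algorithm took that action.
    """
    if pi_t <= 0.0 or pi_sta < 0.0:
        raise ValueError("pi_t must be positive and pi_sta non-negative")
    return math.sqrt(pi_sta / pi_t)


def weighted_criterion(theta: float, data: Sequence[Obs], weights: Sequence[float],
                       m: Callable[[float, float, float, int], float]) -> float:
    """M_T(theta) = (1/T) * sum_t W_t * m_theta(Y_t, X_t, A_t)."""
    if len(data) != len(weights):
        raise ValueError("need exactly one weight per observation")
    total = sum(w * m(theta, y, x, a) for w, (y, x, a) in zip(weights, data))
    return total / len(data)


def neg_squared_loss(theta: float, y: float, x: float, a: int) -> float:
    """Example criterion: negative squared loss with a single shared coefficient."""
    return -(y - x * theta) ** 2


if __name__ == "__main__":
    data = [(1.0, 1.0, 1), (3.0, 1.0, 1)]
    logged_probs = [0.5, 0.125]  # pi_t for each taken action (made-up values)
    weights = [sqrt_importance_weight(0.5, p) for p in logged_probs]
    print(weights, weighted_criterion(2.0, data, weights, neg_squared_loss))
```

An action that the algorithm took with small probability receives a larger weight, but only on the square-root scale, which is what distinguishes these weights from standard importance weights.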

Note that $M_T(\theta)$ defined above depends on both the data $\{X_t, A_t, Y_t\}_{t=1}^T$ and weights $\{W_t\}_{t=1}^T$. We provide asymptotically valid confidence regions for $\theta(P)$ by deriving the asymptotic distribution of $\hat{\theta}_T$ as $T \to \infty$ and by proving that the convergence in distribution is uniform over $P \in \mathcal{P}$. Such convergence allows us to construct a uniformly asymptotically valid $1 - \alpha$ level confidence region, $C_T(\alpha)$, for $\theta(P)$, which is a confidence region that satisfies

$$\liminf_{T \to \infty} \inf_{P \in \mathcal{P}} \mathbb{P}_{P,\pi}\big( \theta(P) \in C_T(\alpha) \big) \ge 1 - \alpha. \tag{3}$$

If $C_T(\alpha)$ were not uniformly valid, then there would exist an $\epsilon > 0$ such that for every sample size $T$, $C_T(\alpha)$'s coverage would be below $1 - \alpha - \epsilon$ for some worst-case $P_T \in \mathcal{P}$. Confidence regions which are asymptotically valid, but not uniformly asymptotically valid, fail to be reliable in practice [Leeb and Pötscher, 2005, Romano et al., 2012]. Note that on i.i.d. data it is generally straightforward to show that estimators that converge in distribution do so uniformly; however, as discussed in Zhang et al. [2020] and Appendix D, this is not the case on data collected with bandit algorithms.

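To construct uniformly valid confidence regions for $\theta(P)$ we prove that $\hat{\theta}_T$ is uniformly asymptotically normal in the following sense:

$$\Sigma_T(P)^{-1/2} \, \ddot{M}_T(\hat{\theta}_T) \, \sqrt{T} \, \big( \hat{\theta}_T - \theta(P) \big) \overset{D}{\to} \mathcal{N}(0, I_d) \quad \text{uniformly over } P \in \mathcal{P}, \tag{4}$$

where $\ddot{M}_T(\theta) := \frac{\partial^2}{\partial \theta^2} M_T(\theta)$ and $\Sigma_T(P) := \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P, \pi_t^{\text{sta}}}\left[ \dot{m}_{\theta(P)}(Y_t, X_t, A_t)^{\otimes 2} \right]$. We define $\dot{m}_\theta := \frac{\partial}{\partial \theta} m_\theta$. Similarly we define respectively $\ddot{m}_\theta$ and $\dddot{m}_\theta$ as the second and third partial derivatives of $m_\theta$ with respect to $\theta$. For any vector $z$ we define $z^{\otimes 2} := z z^\top$.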

3.1. Intuition for Square-Root Importance Weights

The critical role of the square-root importance weights $W_t = \sqrt{\frac{\pi_t^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$ is to adjust for instability in the variance of M-estimators due to the bandit algorithm. These weights act akin to standard importance weights when squared and adjust a key term in the variance of M-estimators from depending on adaptive policies $\{\pi_t\}_{t=1}^T$, which can be ill-behaved, to depending on the pre-specified stabilizing policies $\{\pi_t^{\text{sta}}\}_{t=1}^T$. See Zhang et al. [2020] and Deshpande et al. [2018] for more discussion of the ill-behavior of the action selection probabilities for common bandit algorithms, which occurs particularly when there is no unique optimal policy.

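As an illustrative example, consider the least-squares estimators in a finite-arm linear contextual bandit setting. Assume that $\mathbb{E}_P[Y_t \mid X_t, A_t = a] = X_t^\top \theta_a(P)$ w.p. 1. We focus on estimating $\theta_a(P)$ for some $a \in \mathcal{A}$. The least-squares estimator corresponds to an M-estimator with $m_{\theta_a}(Y_t, X_t, A_t) = -\mathbf{1}_{A_t = a} (Y_t - X_t^\top \theta_a)^2$. The adaptively weighted least-squares (AW-LS) estimator is $\hat{\theta}_{T,a}^{\text{AW-LS}} := \operatorname{argmax}_{\theta_a} \left\{ -\sum_{t=1}^T W_t \mathbf{1}_{A_t = a} (Y_t - X_t^\top \theta_a)^2 \right\}$. For simplicity, suppose that the stabilizing policy does not change with $t$ and drop the index $t$ to get $\pi^{\text{sta}}$. Taking the derivative of this criterion, we get $0 = \sum_{t=1}^T W_t \mathbf{1}_{A_t = a} X_t \big( Y_t - X_t^\top \hat{\theta}_{T,a}^{\text{AW-LS}} \big)$, and rearranging terms gives

$$\frac{1}{\sqrt{T}} \sum_{t=1}^T W_t \mathbf{1}_{A_t = a} X_t X_t^\top \big( \hat{\theta}_{T,a}^{\text{AW-LS}} - \theta_a(P) \big) = \frac{1}{\sqrt{T}} \sum_{t=1}^T W_t \mathbf{1}_{A_t = a} X_t \big( Y_t - X_t^\top \theta_a(P) \big). \tag{5}$$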

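Note that the summands on the right hand side of Equation (5) form a martingale difference sequence with respect to the history $\{\mathcal{H}_t\}_{t=0}^T$ because $\mathbb{E}_{P,\pi}\left[ W_t \mathbf{1}_{A_t = a} (Y_t - X_t^\top \theta_a(P)) \mid \mathcal{H}_{t-1} \right] = 0$ for all $t$; by the law of iterated expectations and since $W_t \in \sigma(\mathcal{H}_{t-1}, X_t, A_t)$, $\mathbb{E}_{P,\pi}\left[ W_t \mathbf{1}_{A_t = a} (Y_t - X_t^\top \theta_a(P)) \mid \mathcal{H}_{t-1} \right]$ equals

$$\mathbb{E}_P\Big[ W_t \, \pi_t(a, X_t, \mathcal{H}_{t-1}) \, \mathbb{E}_P\big[ Y_t - X_t^\top \theta_a(P) \mid \mathcal{H}_{t-1}, X_t, A_t = a \big] \,\Big|\, \mathcal{H}_{t-1} \Big] \overset{(i)}{=} \mathbb{E}_P\Big[ W_t \, \pi_t(a, X_t, \mathcal{H}_{t-1}) \, \mathbb{E}_P\big[ Y_t - X_t^\top \theta_a(P) \mid X_t, A_t = a \big] \,\Big|\, \mathcal{H}_{t-1} \Big] \overset{(ii)}{=} 0.$$

(i) holds by our i.i.d. potential outcomes assumption. (ii) holds since $\mathbb{E}_P[Y_t \mid X_t, A_t = a] = X_t^\top \theta_a(P)$. We prove that (5) is uniformly asymptotically normal by applying a martingale central limit theorem (Appendix B.4). The key condition in this theorem is that the conditional variance converges uniformly, for which it is sufficient to show that the conditional covariance of $W_t \mathbf{1}_{A_t = a} (Y_t - X_t^\top \theta_a(P))$ given $\mathcal{H}_{t-1}$ equals some positive-definite matrix $\Sigma(P)$ for every $t$, i.e.,

$$\mathbb{E}_{P,\pi}\left[ W_t^2 \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \mid \mathcal{H}_{t-1} \right] = \Sigma(P). \tag{6}$$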

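By the law of iterated expectations, $\mathbb{E}_{P,\pi}\left[ W_t^2 \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \mid \mathcal{H}_{t-1} \right]$ equals

$$\begin{aligned}
& \mathbb{E}_P\bigg[ \mathbb{E}_{P,\pi}\bigg[ \frac{\pi^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})} \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \,\bigg|\, \mathcal{H}_{t-1}, X_t \bigg] \,\bigg|\, \mathcal{H}_{t-1} \bigg] \\
& \overset{(a)}{=} \mathbb{E}_P\Big[ \mathbb{E}_{P,\pi^{\text{sta}}}\big[ \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \mid \mathcal{H}_{t-1}, X_t \big] \,\Big|\, \mathcal{H}_{t-1} \Big] \\
& \overset{(b)}{=} \mathbb{E}_P\Big[ \mathbb{E}_{P,\pi^{\text{sta}}}\big[ \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \mid X_t \big] \,\Big|\, \mathcal{H}_{t-1} \Big] \\
& \overset{(c)}{=} \mathbb{E}_P\Big[ \mathbb{E}_{P,\pi^{\text{sta}}}\big[ \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \mid X_t \big] \Big] \\
& \overset{(d)}{=} \mathbb{E}_{P,\pi^{\text{sta}}}\big[ \mathbf{1}_{A_t = a} X_t X_t^\top (Y_t - X_t^\top \theta_a(P))^2 \big] =: \Sigma(P).
\end{aligned} \tag{7}$$

Above, (a) holds because the importance weights change the sampling measure from the adaptive policy $\pi_t$ to the pre-specified stabilizing policy $\pi^{\text{sta}}$. (b) holds by our i.i.d. potential outcomes assumption and because $\pi^{\text{sta}}$ is a pre-specified policy. (c) holds because $X_t$ does not depend on $\mathcal{H}_{t-1}$ by our i.i.d. potential outcomes assumption. (d) holds by the law of iterated expectations. Note that $\Sigma(P)$ does not depend on $t$ because $\pi^{\text{sta}}$ is not time-varying. In contrast, without the adaptive weighting, i.e., when $W_t = 1$, the conditional covariance of $\mathbf{1}_{A_t = a} (Y_t - X_t^\top \theta_a(P))$ given $\mathcal{H}_{t-1}$ is a random variable, due to the adaptive policy $\pi_t$.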
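The effect of the weights on the conditional second moment can be checked by direct arithmetic in the simplest case. The sketch below assumes a scalar context fixed at 1 and noise variance that does not depend on the history; the function name and arguments are hypothetical and chosen only for illustration. With the weights, the conditional second moment is the same for every value of the adaptive probability; without them, it moves with that probability.

```python
def conditional_second_moment(pi_t_a: float, pi_sta_a: float,
                              noise_var: float, weighted: bool) -> float:
    """E[W_t^2 * 1{A_t = a} * (Y_t - theta_a)^2 | H_{t-1}] when X_t = 1.

    pi_t_a:    probability that the adaptive policy picks action a at time t.
    pi_sta_a:  probability that the stabilizing policy assigns to action a.
    noise_var: conditional variance of Y_t given A_t = a.
    """
    # On the event A_t = a the squared weight is pi_sta_a / pi_t_a (or 1 if unweighted).
    w_squared = pi_sta_a / pi_t_a if weighted else 1.0
    # The event A_t = a has conditional probability pi_t_a.
    return pi_t_a * w_squared * noise_var


if __name__ == "__main__":
    for pi_t_a in (0.05, 0.3, 0.95):
        with_w = conditional_second_moment(pi_t_a, 0.5, 1.0, weighted=True)
        without_w = conditional_second_moment(pi_t_a, 0.5, 1.0, weighted=False)
        print(pi_t_a, with_w, without_w)
```

The weighted value depends only on the stabilizing probability and the noise variance, which mirrors step (a) of the derivation.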

In Figure 1 we plot the empirical distributions of the z-statistic for the least-squares estimator both with and without adaptive weighting. We consider a two-armed bandit with $A_t \in \{0, 1\}$. Let $\theta_1(P) := \mathbb{E}_P[Y_t(1)]$ and $m_{\theta_1}(Y_t, A_t) := -A_t (Y_t - \theta_1)^2$. The unweighted version, i.e., the ordinary least-squares (OLS) estimator, is $\hat{\theta}_{T,1}^{\text{OLS}} := \operatorname{argmax}_{\theta_1} \frac{1}{T} \sum_{t=1}^T m_{\theta_1}(Y_t, A_t)$. The adaptively weighted version is $\hat{\theta}_{T,1}^{\text{AW-LS}} := \operatorname{argmax}_{\theta_1} \frac{1}{T} \sum_{t=1}^T W_t m_{\theta_1}(Y_t, A_t)$. We collect data using Thompson Sampling and use a uniform stabilizing policy where $\pi^{\text{sta}}(1) = \pi^{\text{sta}}(0) = 0.5$. It is clear that the least-squares estimator with adaptive weighting has a z-statistic that is much closer to a normal distribution.

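Figure 1: The empirical distributions of the weighted and unweighted least-squares estimators for $\theta_1(P) := \mathbb{E}_P[Y_t(1)]$ in a two arm bandit setting where $\mathbb{E}_P[Y_t(1)] = \mathbb{E}_P[Y_t(0)] = 0$. We perform Thompson Sampling with $\mathcal{N}(0, 1)$ priors, $\mathcal{N}(0, 1)$ errors, and $T = 1000$. Specifically, we plot $\sqrt{\sum_{t=1}^T A_t} \, \big( \hat{\theta}_{T,1}^{\text{OLS}} - \theta_1(P) \big)$ on the left and $\Big( \frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{0.5}{\pi_t(1)}} A_t \Big) \big( \hat{\theta}_{T,1}^{\text{AW-LS}} - \theta_1(P) \big)$ on the right.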
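The two-armed setting of Figure 1 can be reproduced approximately with the standard library alone. The following is a sketch under stated assumptions, not the authors' simulation code: it uses the conjugate normal posterior implied by $\mathcal{N}(0,1)$ priors and unit-variance noise, computes the Thompson Sampling probability of arm 1 in closed form, and adds an optional `clip` argument that is not part of the Figure 1 description. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import List, Tuple


def run_thompson_sampling(T: int, rng: random.Random,
                          clip: float = 0.0) -> Tuple[List[int], List[float], List[float]]:
    """Simulate a two-armed Gaussian bandit in which both arm means equal 0.

    Returns the actions A_t, the outcomes Y_t, and the probabilities pi_t(1).
    """
    counts = [0, 0]      # pulls per arm
    sums = [0.0, 0.0]    # sum of outcomes per arm
    std_normal = NormalDist()
    actions: List[int] = []
    outcomes: List[float] = []
    probs: List[float] = []
    for _ in range(T):
        # N(0,1) prior with N(0,1) noise: posterior is N(sum / (n + 1), 1 / (n + 1)).
        post_mean = [sums[a] / (counts[a] + 1) for a in (0, 1)]
        post_var = [1.0 / (counts[a] + 1) for a in (0, 1)]
        # Probability that the posterior draw for arm 1 exceeds the draw for arm 0.
        gap = post_mean[1] - post_mean[0]
        p1 = std_normal.cdf(gap / math.sqrt(post_var[0] + post_var[1]))
        p1 = min(max(p1, clip), 1.0 - clip)  # optional clipping (added assumption)
        a = 1 if rng.random() < p1 else 0
        y = rng.gauss(0.0, 1.0)              # true mean is 0 for both arms
        counts[a] += 1
        sums[a] += y
        actions.append(a)
        outcomes.append(y)
        probs.append(p1)
    return actions, outcomes, probs


def ols_arm1(actions: List[int], outcomes: List[float]) -> float:
    """Unweighted least squares for arm 1, i.e. the sample mean of its outcomes."""
    n1 = sum(actions)
    if n1 == 0:
        raise ValueError("arm 1 was never selected")
    return sum(y for a, y in zip(actions, outcomes) if a == 1) / n1


def aw_ls_arm1(actions: List[int], outcomes: List[float], probs: List[float],
               pi_sta: float = 0.5) -> float:
    """Adaptively weighted least squares for arm 1: sum(W*A*Y) / sum(W*A)."""
    num = 0.0
    den = 0.0
    for a, y, p1 in zip(actions, outcomes, probs):
        if a == 1:  # p1 > 0 whenever arm 1 was actually selected
            w = math.sqrt(pi_sta / p1)
            num += w * y
            den += w
    if den == 0.0:
        raise ValueError("arm 1 was never selected")
    return num / den


if __name__ == "__main__":
    acts, ys, ps = run_thompson_sampling(1000, random.Random(0))
    print("OLS:", ols_arm1(acts, ys), "AW-LS:", aw_ls_arm1(acts, ys, ps))
```

Repeating the run over many seeds and plotting the two normalized statistics from the caption would give histograms comparable in spirit to Figure 1.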

The square-root importance weights are a form of variance stabilizing weights, akin to those introduced in Hadad et al. [2019] for estimating means and differences in means on data collected with multi-armed bandits. In fact, in the special case that $|\mathcal{A}| < \infty$ and $\phi(X_t, A_t) = [\mathbf{1}_{A_t = 1}, \mathbf{1}_{A_t = 2}, \ldots, \mathbf{1}_{A_t = |\mathcal{A}|}]^\top$, the adaptively weighted least-squares estimator is equivalent to the weighted average estimator of Hadad et al. [2019]. See Section 4 for more on Hadad et al. [2019].

3.2. Asymptotic Normality and Confidence Regions

We now discuss conditions under which the adaptively weighted M-estimators are asymptotically normal in the sense of Equation (4). In general, our conditions differ from those made for standard M-estimators on i.i.d. data because (i) the data is adaptively collected, i.e., $\pi_t$ can depend on $\mathcal{H}_{t-1}$ and (ii) we ensure uniform convergence over $P \in \mathcal{P}$, which is stronger than guaranteeing convergence pointwise for each $P \in \mathcal{P}$.

Condition 1 (Stochastic Bandit Environment). Potential outcomes $\{X_t, Y_t(a) : a \in \mathcal{A}\} \overset{\text{i.i.d.}}{\sim} P \in \mathcal{P}$ over $t \in [1 \colon T]$.

Condition 1 implies that $Y_t$ is independent of $\mathcal{H}_{t-1}$ given $X_t$ and $A_t$, and the conditional distribution of $Y_t \mid X_t, A_t$ is invariant over time. Also note that the action space $\mathcal{A}$ can be finite or infinite.

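Condition 2 (Differentiable). The first three derivatives of $m_\theta(y, x, a)$ with respect to $\theta$ exist for every $\theta \in \Theta$, every $a \in \mathcal{A}$, and every $(x, y)$ in the joint support of $\{P : P \in \mathcal{P}\}$.

Condition 3 (Bounded Parameter Space). For all $P \in \mathcal{P}$, $\theta(P) \in \Theta$, a bounded open subset of $\mathbb{R}^d$.

Condition 4 (Lipschitz). There exists some real-valued function $g$ such that (i) $\sup_{P \in \mathcal{P}, t \ge 1} \mathbb{E}_{P, \pi_t^{\text{sta}}}\left[ g(Y_t, X_t, A_t)^2 \right]$ is bounded and (ii) for all $\theta, \theta' \in \Theta$,

$$\left| m_\theta(Y_t, X_t, A_t) - m_{\theta'}(Y_t, X_t, A_t) \right| \le g(Y_t, X_t, A_t) \, \| \theta - \theta' \|_2.$$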

Conditions 3 and 4 together restrict the complexity of the function $m$ in order to ensure a martingale law of large numbers result holds uniformly over functions $\{m_\theta : \theta \in \Theta\}$; this is used to prove the consistency of $\hat{\theta}_T$. Similar conditions are commonly used to prove consistency of M-estimators based on i.i.d. data, although the boundedness of the parameter space can be dropped when $m_\theta$ is a concave function of $\theta$ for all $Y_t$, $A_t$, $X_t$ (as it is in many canonical examples such as least squares) [Van der Vaart, 2000, Engle, 1994, Bura et al., 2018]; we expect that a similar result would hold for adaptively weighted M-estimators.

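Condition 5 (Moments). The fourth moments of $m_{\theta(P)}(Y_t, X_t, A_t)$, $\dot{m}_{\theta(P)}(Y_t, X_t, A_t)$, and $\ddot{m}_{\theta(P)}(Y_t, X_t, A_t)$ with respect to $P$ and policy $\pi_t^{\text{sta}}$ are bounded uniformly over $P \in \mathcal{P}$ and $t \ge 1$. For all sufficiently large $T$, the minimum eigenvalue of $\Sigma_T(P) := \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P, \pi_t^{\text{sta}}}\left[ \dot{m}_{\theta(P)}(Y_t, X_t, A_t)^{\otimes 2} \right]$ is bounded below by some constant $\delta_{\dot{m}^2} > 0$ for all $P \in \mathcal{P}$.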

Condition 5 is similar to those of Van der Vaart [2000, Theorem 5.41]. However, to guarantee uniform convergence we assume that moment bounds hold uniformly over PP and t ≥ 1.

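Condition 6 (Third Derivative Domination). For $B \in \mathbb{R}^{d \times d \times d}$, we define $\|B\|_1 := \sum_{i=1}^d \sum_{j=1}^d \sum_{k=1}^d |B_{i,j,k}|$. There exists a function $\dddot{m}(Y_t, X_t, A_t) \in \mathbb{R}^{d \times d \times d}$ such that (i) $\sup_{P \in \mathcal{P}, t \ge 1} \mathbb{E}_{P, \pi_t^{\text{sta}}}\left[ \|\dddot{m}(Y_t, X_t, A_t)\|_1^2 \right]$ is bounded and (ii) for all $P \in \mathcal{P}$ there exists some $\epsilon_{\dddot{m}} > 0$ such that the following holds with probability 1,

$$\sup_{\theta \in \Theta : \, \|\theta - \theta(P)\| \le \epsilon_{\dddot{m}}} \| \dddot{m}_\theta(Y_t, X_t, A_t) \|_1 \le \| \dddot{m}(Y_t, X_t, A_t) \|_1.$$

Condition 6 is again similar to those in classical M-estimator asymptotic normality proofs [Van der Vaart, 2000, Theorem 5.41].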

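Condition 7 (Maximizing Solution).

(i) For all $P \in \mathcal{P}$, there exists a $\theta(P) \in \Theta$ such that (a) $\theta(P) \in \operatorname{argmax}_{\theta \in \Theta} \mathbb{E}_P\left[ m_\theta(Y_t, X_t, A_t) \mid X_t, A_t \right]$ w.p. 1, (b) $\mathbb{E}_P\left[ \dot{m}_{\theta(P)}(Y_t, X_t, A_t) \mid X_t, A_t \right] = 0$ w.p. 1, and (c) $-\mathbb{E}_P\left[ \ddot{m}_{\theta(P)}(Y_t, X_t, A_t) \mid X_t, A_t \right] \succeq 0$ w.p. 1.

(ii) There exists some positive definite matrix $H$ such that $-\frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P, \pi_t^{\text{sta}}}\left[ \ddot{m}_{\theta(P)}(Y_t, X_t, A_t) \right] \succeq H$ for all $P \in \mathcal{P}$ and all sufficiently large $T$.

For matrices $A$, $B$, we define $A \succeq B$ to mean that $A - B$ is positive semi-definite, as used above. Condition 7 (i) ensures that $\theta(P)$ is a conditionally maximizing solution for all contexts $X_t$ and actions $A_t$; this ensures that $\{\dot{m}_{\theta(P)}(Y_t, X_t, A_t)\}_{t=1}^T$ is a martingale difference sequence with respect to $\{\mathcal{H}_t\}_{t=1}^T$. Note it does not require $\theta(P)$ to always be a conditionally unique optimal solution. Condition 7 (ii) is related to the local curvature at the maximizing solution and the analogous condition in the i.i.d. setting is trivially satisfied; we specifically use this condition to ensure we can replace $\ddot{M}_T(\theta(P))$ with $\ddot{M}_T(\hat{\theta}_T)$ in our asymptotic normality result, i.e., that $\ddot{M}_T(\theta(P))^{-1} \ddot{M}_T(\hat{\theta}_T) \overset{P}{\to} I_d$ uniformly over $P \in \mathcal{P}$.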

Condition 8 (Well-Separated Solution). For all sufficiently large $T$, for any $\epsilon > 0$, there exists some $\delta > 0$ such that for all $P \in \mathcal{P}$,

$$\inf_{\theta \in \Theta : \, \|\theta - \theta(P)\|_2 > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P, \pi_t^{\text{sta}}}\left[ m_{\theta(P)}(Y_t, X_t, A_t) - m_\theta(Y_t, X_t, A_t) \right] \right\} \ge \delta.$$

A well-separated solution condition akin to Condition 8 is commonly assumed in order to prove consistency of M-estimators, e.g., see Van der Vaart [2000, Theorem 5.7]. Note that the difference between Condition 7 (i) and Condition 8 is that the former is a conditional statement (conditional on $X_t$, $A_t$) and the latter is a marginal statement (marginal over $X_t$, $A_t$, where $A_t$ is chosen according to stabilizing policies $\pi_t^{\text{sta}}$). Condition 7 (i) means there is a solution $\theta(P)$ for all contexts $X_t$ and actions $A_t$ that does not need to be unique; in contrast, Condition 8 assumes that marginally over $X_t$, $A_t$ there is a well-separated solution.

Condition 9 (Bounded Importance Ratios). $\{\pi_t^{\text{sta}}\}_{t=1}^T$ do not depend on data $\{Y_t, X_t, A_t\}_{t=1}^T$. For all $t \ge 1$, $\rho_{\min} \le \frac{\pi_t^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})} \le \rho_{\max}$ w.p. 1 for some constants $0 < \rho_{\min} \le \rho_{\max} < \infty$.

Note that Condition 9 implies that for a stabilizing policy that is not time-varying, the action selection probabilities of the bandit algorithm $\pi_t(A_t, X_t, \mathcal{H}_{t-1})$ must be bounded away from zero w.p. 1. Similar boundedness assumptions are also made in the off-policy evaluation literature [Thomas and Brunskill, 2016, Kallus and Uehara, 2020]. We discuss this condition further in Sections 3.3 and 6.

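Theorem 1 (Uniform Asymptotic Normality of Adaptively Weighted M-Estimators). Under Conditions 1-9 we have that $\hat{\theta}_T \overset{P}{\to} \theta(P)$ uniformly over $P \in \mathcal{P}$. Additionally,

$$\Sigma_T(P)^{-1/2} \, \ddot{M}_T(\hat{\theta}_T) \, \sqrt{T} \, \big( \hat{\theta}_T - \theta(P) \big) \overset{D}{\to} \mathcal{N}(0, I_d) \quad \text{uniformly over } P \in \mathcal{P}. \tag{8}$$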

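The asymptotic normality result of Equation (8) guarantees that for $d$-dimensional $\theta(P)$,

$$\liminf_{T \to \infty} \inf_{P \in \mathcal{P}} \mathbb{P}_{P,\pi}\Big( \big\| \Sigma_T(P)^{-1/2} \, \ddot{M}_T(\hat{\theta}_T) \, \sqrt{T} \, ( \hat{\theta}_T - \theta(P) ) \big\|_2^2 \le \chi^2_{d,(1-\alpha)} \Big) = 1 - \alpha.$$

Above $\chi^2_{d,(1-\alpha)}$ is the $1 - \alpha$ quantile of the $\chi^2$ distribution with $d$ degrees of freedom. Note that the region $C_T(\alpha) := \big\{ \theta \in \Theta : \big\| \Sigma_T(P)^{-1/2} \, \ddot{M}_T(\hat{\theta}_T) \, \sqrt{T} \, ( \hat{\theta}_T - \theta ) \big\|_2^2 \le \chi^2_{d,(1-\alpha)} \big\}$ defines a $d$-dimensional hyper-ellipsoid confidence region for $\theta(P)$. Also note that since $\ddot{M}_T(\hat{\theta}_T)$ does not concentrate under standard bandit algorithms, we cannot use standard arguments to justify treating $\hat{\theta}_T$ as multivariate normal with covariance $\ddot{M}_T(\hat{\theta}_T)^{-1} \Sigma_T(P) \ddot{M}_T(\hat{\theta}_T)^{-1}$. Nevertheless, Theorem 1 can be used to guarantee valid confidence regions for a subset of entries in $\theta(P)$ by using projected confidence regions [Nickerson, 1994]. Projected confidence regions take a confidence region for all parameters $\theta(P)$ and project it onto the lower dimensional space on which the subset of target parameters lie (Appendix A.2).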
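In the scalar case the confidence region reduces to an interval, which the sketch below computes for the arm-1 mean of a two-armed bandit. It is an illustration under stated assumptions rather than the authors' procedure: the stabilizing policy is time-invariant, and the noise variance is passed in as a known number (`noise_var`), whereas in practice it would have to be estimated. The function name is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def aw_ls_interval_arm1(actions: Sequence[int], outcomes: Sequence[float],
                        probs: Sequence[float], noise_var: float,
                        pi_sta: float = 0.5, alpha: float = 0.1) -> Tuple[float, float]:
    """Interval for theta_1 = E[Y_t(1)] from the d = 1 case of the normality result.

    With m_theta = -A_t * (Y_t - theta)^2:
      second derivative of M_T:  -(2 / T) * sum_t W_t * A_t
      Sigma_T(P):                4 * pi_sta * noise_var
    so the standardized statistic has absolute value
      sum_t W_t A_t * |theta_hat - theta| / sqrt(T * pi_sta * noise_var).
    """
    T = len(actions)
    sum_w = 0.0
    sum_wy = 0.0
    for a, y, p1 in zip(actions, outcomes, probs):
        if a == 1:
            w = math.sqrt(pi_sta / p1)  # square-root importance weight
            sum_w += w
            sum_wy += w * y
    if sum_w == 0.0:
        raise ValueError("arm 1 was never selected")
    theta_hat = sum_wy / sum_w
    # For d = 1 the chi-square quantile is the square of a normal quantile.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(T * pi_sta * noise_var) / sum_w
    return theta_hat - half_width, theta_hat + half_width


if __name__ == "__main__":
    lo, hi = aw_ls_interval_arm1([1, 0, 1, 0], [1.0, 9.0, 3.0, 9.0],
                                 [0.5, 0.5, 0.5, 0.5], noise_var=1.0)
    print(lo, hi)
```

When every logged probability equals the stabilizing probability, all weights are 1 and the half-width matches the usual normal-theory interval for a sample mean based on about half of the observations.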

3.3. Choice of Stabilizing Policy

When the action space is bounded, using weights $W_t = 1 / \sqrt{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}$ is equivalent to using square-root importance weights with a stabilizing policy that selects actions uniformly over $\mathcal{A}$; this is because weighted M-estimators are invariant to all weights being scaled by the same constant. It can make sense to choose a non-uniform stabilizing policy in order to prevent the square-root importance weights from growing too large and to ensure Condition 9 holds; disproportionately up-weighting a few observations can lead to unstable estimators. Note that an analogue of our stabilizing policy exists in the causal inference literature, namely, “stabilized weights” use a probability density in the numerator of the weights to prevent them from becoming too large [Robins et al., 2000].

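We now discuss how to choose stabilizing policies $\{\pi_t^{\text{sta}}\}_{t \ge 1}$ in order to minimize the asymptotic variance of adaptively weighted M-estimators. We focus on the adaptively weighted least-squares estimator when we have a linear outcome model $\mathbb{E}_P[Y_t \mid X_t, A_t] = X_t^\top \theta_{A_t}(P)$:

$$\hat{\theta}^{\text{AW-LS}} := \operatorname*{argmax}_{\theta \in \Theta} \left\{ -\frac{1}{T} \sum_{t=1}^T W_t (Y_t - X_t^\top \theta_{A_t})^2 \right\}. \tag{9}$$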

Recall that our use of adaptive weights is to adjust for instability in the variance of M-estimators induced by the bandit algorithm in order to construct valid confidence regions; note that weighted estimators are not typically used for this reason. On i.i.d. data, the least-squares criterion is weighted like in Equation (9) in order to minimize the variance of estimators under noise heteroskedasticity; in this setting, the best linear unbiased estimator has weights $W_t = 1 / \sigma^2(A_t, X_t)$ where $\sigma^2(A_t, X_t) := \mathbb{E}_P\left[ (Y_t - X_t^\top \theta_{A_t}(P))^2 \mid X_t, A_t \right]$; this up-weights the importance of observations with low noise variance. Intuitively, if we do not need to variance stabilize, $\{W_t\}_{t \ge 1}$ should be determined by the relative importance of minimizing the errors for different observations, i.e., their noise variance.

In light of this observation, we expect that under homoskedastic noise there is no reason to up-weight some observations over others. This would recommend choosing the stabilizing policy to make $W_t = \sqrt{\frac{\pi_t^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$ as close to 1 as possible, subject to the constraint that the stabilizing policies are pre-specified, i.e., $\{\pi_t^{\text{sta}}\}_{t \ge 1}$ do not depend on data $\{Y_t, X_t, A_t\}_{t \ge 1}$ (see Appendix C for details). Since adjusting for heteroskedasticity and variance stabilization are distinct uses of weights, under heteroskedasticity, we recommend that the weights are combined in the following sense: $W_t = \frac{1}{\sigma^2(A_t, X_t)} \sqrt{\frac{\pi_t^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$. This would mean that to minimize variance, we still want to choose the stabilizing policies to make $\frac{\pi_t^{\text{sta}}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}$ as close to 1 as possible, subject to the pre-specified constraint.

4. Related Work

Villar et al. [2015] and Rafferty et al. [2019] empirically illustrate that classical ordinary least squares (OLS) inference methods have inflated Type-1 error when used on data collected with a variety of regret-minimizing multi-armed bandit algorithms. Chen et al. [2020] prove that the OLS estimator is asymptotically normal on data collected with an ϵ-greedy algorithm, but their results do not cover settings in which there is no unique optimal policy, e.g., a multi-arm bandit with two identical arms (Appendix E). Recent work has discussed the non-normality of OLS on data collected with bandit algorithms when there is no unique optimal policy and proposed alternative methods for statistical inference. A common thread between these methods is that they all utilize a form of adaptive weighting. Deshpande et al. [2018] introduced the W-decorrelated estimator, which adjusts the OLS estimator with a sum of adaptively weighted residuals. In the multi-armed bandit setting, the W-decorrelated estimator up-weights observations from early in the study and down-weights observations from later in the study [Zhang et al., 2020]. In the batched bandit setting, Zhang et al. [2020] show that the Z-statistics for the OLS estimators computed separately on each batch are jointly asymptotically normal. Standardizing the OLS statistic for each batch effectively adaptively re-weights the observations in each batch.

Hadad et al. [2019] introduce adaptively weighted versions of both the standard augmented-inverse propensity weighted estimator (AW-AIPW) and the sample mean (AWA) for estimating parameters of simple models on data collected with bandit algorithms. They introduce a class of adaptive “variance stabilizing” weights, for which the variance of a normalized version of their estimators converges in probability to a constant. In their discussion section they note open questions, two of which this work addresses: 1) “What additional estimators can be used for normal inference with adaptively collected data?” and 2) How do their results generalize to more complex sampling designs, like data collected with contextual bandit algorithms? We demonstrate that variance stabilizing adaptive weights can be used to modify a large class of M-estimators to guarantee valid inference. This generalization allows us to perform valid inference for a large class of important inferential targets: parameters of models for expected outcomes that are context dependent.

Recently, adaptive weighting has also been used in off-policy evaluation methods for when the behavior policy (policy used to collect the data) is a contextual bandit algorithm [Bibaut et al., 2021, Zhan et al., 2021]. In this literature the estimand is the value, or average expected reward, of a pre-specified policy (note this is a scalar value). In contrast, in our work we are interested in constructing confidence regions for parameters of a model for an outcome (that could be the reward)—for example, this could be parameters of a logistic regression model for a binary outcome. We believe in the future there could be theory that could unify these adaptive weighting methods for these different estimands.

An alternative to using asymptotic approximations to construct confidence intervals is to use high-probability confidence bounds. These bounds provide stronger guarantees than those based on asymptotic approximations, as they are guaranteed to hold for finite samples. The downside is that these bounds are typically much wider, which is why much of classical statistics uses asymptotic approximations. Here we do the same. In Section 5, we empirically compare our method to the self-normalized martingale bound [Abbasi-Yadkori et al., 2011], a high-probability bound commonly used in the bandit literature.

5. Simulation Results

In this section, $R_t = Y_t$. We consider two settings: a continuous reward setting and a binary reward setting. In the continuous reward setting, the rewards are generated with mean $\mathbb{E}_P[R_t \mid X_t, A_t] = \tilde{X}_t^\top \theta_0(P) + A_t \tilde{X}_t^\top \theta_1(P)$ and noise drawn from a Student's t distribution with five degrees of freedom; here $\tilde{X}_t = [1, X_t] \in \mathbb{R}^3$ ($X_t$ with intercept term), actions $A_t \in \{0, 1\}$, and parameters $\theta_0(P), \theta_1(P) \in \mathbb{R}^3$. In the binary reward setting, the reward $R_t$ is generated as a Bernoulli with success probability $\mathbb{E}_P[R_t \mid X_t, A_t] = \big[ 1 + \exp\big( -\tilde{X}_t^\top \theta_0(P) - A_t \tilde{X}_t^\top \theta_1(P) \big) \big]^{-1}$. Furthermore, in both simulation settings we set $\theta_0(P) = [0.1, 0.1, 0.1]$ and $\theta_1(P) = [0, 0, 0]$, so there is no unique optimal arm; we call vector parameter $\theta_1(P)$ the advantage of selecting $A_t = 1$ over $A_t = 0$. Also in both settings, the contexts $X_t$ are drawn i.i.d. from a uniform distribution.

In both simulation settings we collect data using Thompson Sampling with a linear model for the expected reward and normal priors [Agrawal and Goyal, 2013] (even when the reward is binary). We constrain the action selection probabilities with clipping at a rate of 0.05; this means that while typical Thompson Sampling produces action selection probabilities $\pi_t^{\text{TS}}(A_t, X_t, \mathcal{H}_{t-1})$, we instead use action selection probabilities $\pi_t(A_t, X_t, \mathcal{H}_{t-1}) = 0.05 \vee \big( 0.95 \wedge \pi_t^{\text{TS}}(A_t, X_t, \mathcal{H}_{t-1}) \big)$ to select actions. We constrain the action selection probabilities in order to ensure weights $W_t$ are bounded when using a uniform stabilizing policy; see Sections 3.2 and 6 for more discussion on this boundedness assumption. Also note that increasing the amount the algorithm explores (clipping) decreases the expected width of confidence intervals constructed on the resulting data (see Section 6).
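The clipping rule, and the bound on the importance ratios that it implies for a two-armed bandit with a uniform stabilizing policy, can be written out directly. This is a small illustrative sketch; the function names and default arguments are assumptions chosen to match the 0.05 clipping rate described here.

```python
from typing import Tuple


def clip_probability(p_ts: float, floor: float = 0.05, ceiling: float = 0.95) -> float:
    """Clipped action selection probability: max(floor, min(ceiling, p_ts))."""
    return max(floor, min(ceiling, p_ts))


def importance_ratio_bounds(pi_sta: float, floor: float = 0.05,
                            ceiling: float = 0.95) -> Tuple[float, float]:
    """Smallest and largest value of pi_sta / pi_t when pi_t lies in [floor, ceiling]."""
    return pi_sta / ceiling, pi_sta / floor


if __name__ == "__main__":
    print(clip_probability(0.999), clip_probability(0.3), clip_probability(0.001))
    print(importance_ratio_bounds(0.5))
```

With two actions, clipping the probability of one action to the interval from 0.05 to 0.95 places the other action's probability in the same interval, so these bounds apply to whichever action is taken.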

To analyze the data, in the continuous reward setting, we use least-squares estimators with a correctly specified model for the expected reward, i.e., M-estimators with $m_\theta(R_t, X_t, A_t) = -(R_t - \tilde{X}_t^\top \theta_0 - A_t \tilde{X}_t^\top \theta_1)^2$. We consider both the unweighted and adaptively weighted versions. We also compare to the self-normalized martingale bound [Abbasi-Yadkori et al., 2011] and the W-decorrelated estimator [Deshpande et al., 2018], as they were both developed for the linear expected reward setting. For the self-normalized martingale bound, which requires explicit bounds on the parameter space, we set $\Theta = \{\theta \in \mathbb{R}^6 : \|\theta\|_2 \le 6\}$. In the binary reward setting, we also assume a correctly specified model for the expected reward. We use both unweighted and adaptively weighted maximum likelihood estimators (MLEs), which correspond to M-estimators with $m_\theta(R_t, X_t, A_t)$ set to the log-likelihood of $R_t$ given $X_t$, $A_t$. We solve for these estimators using Newton–Raphson optimization and do not put explicit bounds on the parameter space $\Theta$ (note in this case $m_\theta$ is concave in $\theta$ [Agresti, 2015, Chapter 5.4.2]). See Appendix A for additional details and simulation results.

In Figure 4 we plot the empirical coverage probabilities and volumes of 90% confidence regions for $\theta^*(P) \triangleq [\theta_0^*(P), \theta_1^*(P)]$ and $\theta_1^*(P)$ in both the continuous and binary reward settings. While the confidence regions based on the unweighted least-squares estimator (OLS) and the unweighted MLE have significant undercoverage that does not improve as $T$ increases, the confidence regions based on the adaptively weighted versions, AW-LS and AW-MLE, have very reliable coverage. For the confidence regions for $\theta_1^*(P)$ based on the AW-LS and AW-MLE, we include both projected confidence regions (for which we have theoretical guarantees) and non-projected confidence regions. The confidence regions based on projections are conservative but nevertheless have comparable volume to those based on OLS and MLE respectively. We do not prove theoretical guarantees for the non-projected confidence regions for AW-LS and AW-MLE; however, they perform well across our simulations. Both types of confidence regions based on AW-LS have significantly smaller volumes than those constructed using the self-normalized martingale bound and W-decorrelated estimator. Note that the W-decorrelated estimator and self-normalized martingale bounds are designed for linear contextual bandits and are thus not applicable to the logistic regression model setting. The confidence regions constructed using the self-normalized martingale bound have reliable coverage as well, but are very conservative. Empirically, we found that the coverage probabilities of the confidence regions based on the W-decorrelated estimator were very sensitive to the choice of tuning parameters. We use 5,000 Monte Carlo repetitions and the error bars plotted are standard errors.

6. Discussion

Immediate questions

We assume that the ratios $\pi_t^{sta}(A_t, X_t) / \pi_t(A_t, X_t, \mathcal{H}_{t-1})$ are bounded for our theoretical results; this precludes $\pi_t(A_t, X_t, \mathcal{H}_{t-1})$ from going to zero for a fixed stabilizing policy. For simple models, e.g., the AW-LS estimator, we can let these ratios grow at a certain rate and still guarantee asymptotic normality (Appendix B.5); we conjecture similar results hold more generally.

Generality and robustness

This work assumes that we have a well-specified model for the outcome $Y_t$, i.e., that $\theta^*(P) \in \operatorname{argmax}_{\theta \in \Theta} \mathbb{E}_P[m_\theta(Y_t, X_t, A_t) \mid X_t, A_t]$ w.p. 1. Our theorems use this assumption to ensure that $\{W_t \dot{m}_\theta(Y_t, X_t, A_t)\}_{t \ge 1}$ is a martingale difference sequence with respect to $\{\mathcal{H}_t\}_{t \ge 0}$. On i.i.d. data it is common to define $\theta^*(P)$ to be the best projected solution, i.e., $\theta^*(P) \in \operatorname{argmax}_{\theta \in \Theta} \mathbb{E}_{P,\pi}[m_\theta(Y_t, X_t, A_t)]$. Note that the best projected solution, $\theta^*(P)$, depends on the distribution of the action selection policy $\pi$. It would be ideal to also be able to perform inference for a projected solution on adaptively collected data.

Another natural question is whether adaptive weighting methods work in Markov Decision Process (MDP) environments. Taking the AW-LS estimator introduced in Section 3.1 as an example, our conditional variance derivation in Equation (7) fails to hold in an MDP setting, specifically equality (c). However, the conditional variance condition can be satisfied if we instead use weights $W_t = \left\{ \frac{\pi_t^{sta}(A_t, X_t)\, p^{sta}(X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})\, \mathbb{P}_P(X_t \mid X_{t-1}, A_{t-1})} \right\}^{1/2}$, where $\mathbb{P}_P$ are the state transition probabilities and $p^{sta}$ is a pre-specified distribution over states. In general, though, we do not expect to know the transition probabilities $\mathbb{P}_P$, and if we tried to estimate them, our theory would require the estimator to have error $o_p(1/\sqrt{T})$, below the parametric rate.

Trading-off regret minimization and statistical inference objectives

In sequential decision-making problems there is a fundamental trade-off between minimizing regret and minimizing estimation error for parameters of the environment using the resulting data [Bubeck et al., 2009, Dean et al., 2018]. Given this trade-off there are many open problems regarding how to minimize regret while still guaranteeing a certain amount of power or expected confidence interval width, e.g., developing sample size calculators for use in justifying the number of users in a mobile health trial, and developing new adaptive algorithms [Liu et al., 2014, Erraqabi et al., 2017, Yao et al., 2020].

Figure 2:

Empirical coverage probabilities (upper row) and volume (lower row) of 90% confidence ellipsoids. The left two columns are for the linear reward model setting (t-distributed rewards) and the right two columns are for the logistic regression model setting (Bernoulli rewards). We consider confidence ellipsoids for all parameters $\theta^*(P)$ and for advantage parameters $\theta_1^*(P)$ for both settings.

Acknowledgements and Disclosure of Funding

We thank Yash Nair for feedback on early drafts of this work.

Research reported in this paper was supported by National Institute on Alcohol Abuse and Alcoholism (NIAAA) of the National Institutes of Health under award number R01AA23187, National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award number P50DA039838, National Cancer Institute (NCI) of the National Institutes of Health under award number U01CA229437, and by NIH/NIBIB and OD award number P41EB028242. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Appendix

A. Simulations

A.1. Simulation Details

Simulation Environment
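  • Each dimension of $X_t$ is sampled independently from Uniform(0, 5).

  • $\theta^*(P) = [\theta_0^*(P), \theta_1^*(P)] = [0.1, 0.1, 0.1, 0, 0, 0]$, where $\theta_0^*(P), \theta_1^*(P) \in \mathbb{R}^3$.

    Below we also include simulations where $[\theta_0^*(P), \theta_1^*(P)] = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$.

  • t-Distributed rewards: $R_t \mid X_t, A_t \sim t_5 + \tilde{X}_t^\top \theta_0^*(P) + A_t \tilde{X}_t^\top \theta_1^*(P)$, where $t_5$ is a t-distribution with 5 degrees of freedom.

  • Bernoulli rewards: $R_t \mid X_t, A_t \sim \text{Bernoulli}(\text{expit}(\nu_t))$ for $\nu_t = \tilde{X}_t^\top \theta_0^*(P) + A_t \tilde{X}_t^\top \theta_1^*(P)$ and $\text{expit}(x) = \frac{1}{1 + \exp(-x)}$.

  • Poisson rewards: $R_t \mid X_t, A_t \sim \text{Poisson}(\exp(\nu_t))$ for $\nu_t = \tilde{X}_t^\top \theta_0^*(P) + A_t \tilde{X}_t^\top \theta_1^*(P)$.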
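As a rough illustration of the three reward models listed above, the following sketch draws a single reward given a context and action. It is not the authors' code: the function names are hypothetical, the augmented context is assumed to be $\tilde{X}_t = [1, X_t]$ with $X_t \in \mathbb{R}^2$ (an assumption made only so that it matches $\theta_0^*(P) \in \mathbb{R}^3$), and the Poisson and $t_5$ samplers are simple standard-library constructions chosen for self-containment.

```python
import math
import random


def expit(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_reward(x_tilde, action, theta0, theta1, kind, rng):
    """Draw one reward; nu_t = x_tilde' theta0 + action * x_tilde' theta1."""
    nu = sum(x * t for x, t in zip(x_tilde, theta0))
    nu += action * sum(x * t for x, t in zip(x_tilde, theta1))
    if kind == "t":
        # t_5 noise built as N(0,1) / sqrt(chi2_5 / 5).
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(5))
        return nu + z / math.sqrt(chi2 / 5.0)
    if kind == "bernoulli":
        return 1.0 if rng.random() < expit(nu) else 0.0
    if kind == "poisson":
        # Knuth's multiplication sampler; adequate for the small means used here.
        limit, k, prod = math.exp(-math.exp(nu)), 0, 1.0
        while True:
            prod *= rng.random()
            if prod <= limit:
                return float(k)
            k += 1
    raise ValueError(f"unknown reward kind: {kind}")


rng = random.Random(0)
x_tilde = [1.0, rng.uniform(0, 5), rng.uniform(0, 5)]  # assumed form [1, X_t]
theta0, theta1 = [0.1, 0.1, 0.1], [0.0, 0.0, 0.0]
print({k: draw_reward(x_tilde, 1, theta0, theta1, k, rng) for k in ("t", "bernoulli", "poisson")})
```

With $\theta_1^*(P) = 0$ the action has no effect on the reward distribution, which is the null-advantage setting used in the first set of simulations.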

Algorithm
  • Thompson Sampling with $\mathcal{N}(0, I_d)$ priors on each arm.

  • 0.05 clipping.

  • Pre-processing of rewards before they are received by the algorithm:
    • Bernoulli: $2R_t - 1$
    • Poisson: $0.6 R_t$
Compute Time and Resources

All simulations run within a few hours on a MacBook Pro.

A.2. Details on Constructing Confidence Regions

For notational convenience, we define $Z_t = [\tilde{X}_t, A_t \tilde{X}_t]$.

A.2.1. Least Squares Estimators
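  • $\hat{\theta}_T = \left( \sum_{t=1}^T W_t Z_t Z_t^\top \right)^{-1} \sum_{t=1}^T W_t Z_t R_t$
    • For unweighted least squares, $W_t = 1$ and we call the estimator $\hat{\theta}_T^{OLS}$.
    • For adaptively weighted least squares, $W_t = \frac{1}{\sqrt{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$; this is equivalent to using square-root importance weights with a uniform stabilizing policy. We call the estimator $\hat{\theta}_T^{AW\text{-}LS}$.
  • We assume homoskedastic errors and estimate the noise variance $\sigma^2$ as follows:
    $\hat{\sigma}_T^2 = \frac{1}{T} \sum_{t=1}^T (R_t - Z_t^\top \hat{\theta}_T)^2$.
  • We use a Hotelling t-squared test statistic to construct confidence regions for $\theta^*(P)$:
    $C_T(\alpha) = \left\{ \theta \in \mathbb{R}^d : \left\| \hat{\Sigma}_T^{-1/2} \left( \frac{1}{T} \sum_{t=1}^T W_t Z_t Z_t^\top \right) \sqrt{T} (\hat{\theta}_T - \theta) \right\|_2^2 \le \frac{d(T-1)}{T-d} F_{d, T-d}(1-\alpha) \right\}$. (10)
    • For the unweighted least-squares estimator we use the following variance estimator: $\hat{\Sigma}_T = \hat{\sigma}_T^2 \frac{1}{T} \sum_{t=1}^T Z_t Z_t^\top$.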
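    • For the AW-Least Squares estimator we use the following variance estimator: $\hat{\Sigma}_T = \hat{\sigma}_T^2 \frac{1}{T} \sum_{t=1}^T \left( \frac{1}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})} \right)^{A_t} \left( \frac{1}{1 - \pi_t(A_t, X_t, \mathcal{H}_{t-1})} \right)^{1 - A_t} Z_t Z_t^\top$.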
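  • To construct (non-projected) confidence regions for $\theta_1^*(P) \in \mathbb{R}^{d_1}$ we treat the unweighted least squares / AW-LS estimators, $\hat{\theta}_{T,1}$, as $\mathcal{N}\left( \theta_1^*(P), \frac{1}{T} \left( \frac{1}{T} \sum_{t=1}^T W_t Z_t Z_t^\top \right)^{-1} \hat{\Sigma}_T \left( \frac{1}{T} \sum_{t=1}^T W_t Z_t Z_t^\top \right)^{-1} \right)$. We use a Hotelling t-squared test statistic to construct confidence regions for $\theta_1^*(P)$:
    $C_T(\alpha) = \left\{ \theta_1 \in \mathbb{R}^{d_1} : \left\| V_{1,T}^{-1/2} \sqrt{T} (\hat{\theta}_{T,1} - \theta_1) \right\|_2^2 \le \frac{d_1(T-1)}{T-d_1} F_{d_1, T-d_1}(1-\alpha) \right\}$,
    where $V_{1,T}$ is the lower right $d_1 \times d_1$ block of the matrix $\left( \frac{1}{T} \sum_{t=1}^T W_t Z_t Z_t^\top \right)^{-1} \hat{\Sigma}_T \left( \frac{1}{T} \sum_{t=1}^T W_t Z_t Z_t^\top \right)^{-1}$. Recall that for the unweighted least squares estimator $W_t = 1$ and for AW-LS $W_t = \frac{1}{\sqrt{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$.
  • For the AW-least squares estimator, we also construct projected confidence regions for $\theta_1^*(P)$ using the confidence region defined in Equation (10). See Section A.2.5 below for more details on constructing projected confidence regions.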
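The least-squares construction above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, `pi` is assumed to hold the probability of the action actually taken at each step, and the AW-LS variance estimator is assumed to weight each $Z_t Z_t^\top$ term by the inverse of that probability.

```python
import numpy as np


def weighted_least_squares(Z, R, pi, adaptive=True):
    """Return (theta_hat, sigma2_hat, M, Sigma_hat) for OLS or AW-LS.

    Z: (T, d) features, R: (T,) rewards, pi: (T,) probability of the action taken.
    """
    Z, R, pi = (np.asarray(a, dtype=float) for a in (Z, R, pi))
    T = len(R)
    W = 1.0 / np.sqrt(pi) if adaptive else np.ones(T)   # square-root weights
    A = (Z * W[:, None]).T @ Z                          # sum_t W_t Z_t Z_t^T
    theta_hat = np.linalg.solve(A, (Z * W[:, None]).T @ R)
    sigma2_hat = np.mean((R - Z @ theta_hat) ** 2)      # homoskedastic noise variance
    W2 = 1.0 / pi if adaptive else np.ones(T)           # assumed weights in Sigma_hat
    Sigma_hat = sigma2_hat * (Z * W2[:, None]).T @ Z / T
    return theta_hat, sigma2_hat, A / T, Sigma_hat


def hotelling_statistic(theta, theta_hat, M, Sigma_hat, T):
    """|| Sigma_hat^{-1/2} M sqrt(T) (theta_hat - theta) ||_2^2."""
    v = M @ (theta_hat - theta)
    return T * float(v @ np.linalg.solve(Sigma_hat, v))


rng = np.random.default_rng(0)
T, theta_true = 500, np.array([0.1, 0.1, 0.1])
Z = np.column_stack([np.ones(T), rng.uniform(0, 5, size=(T, 2))])
pi = rng.uniform(0.05, 0.95, size=T)
R = Z @ theta_true + rng.standard_normal(T)
theta_hat, s2, M, Sigma_hat = weighted_least_squares(Z, R, pi)
print(theta_hat, hotelling_statistic(theta_true, theta_hat, M, Sigma_hat, T))
```

A parameter value $\theta$ belongs to the region in Equation (10) when this statistic is at most $\frac{d(T-1)}{T-d} F_{d,T-d}(1-\alpha)$; the F quantile is left out here to keep the sketch dependency-light (it could be obtained from, e.g., `scipy.stats.f.ppf`).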

A.2.2. MLE Estimators
Distribution | ν | b(ν) | b′(ν) | b″(ν) | b‴(ν)
N(μ, 1) | μ | ν²/2 | ν = μ | 1 | 0
Poisson(λ) | log λ | exp(ν) | exp(ν) = λ | exp(ν) = λ | exp(ν) = λ
Bernoulli(p) | log(p/(1 − p)) | log(1 + e^ν) | e^ν/(1 + e^ν) = p | e^ν/(1 + e^ν)² = p(1 − p) | p(1 − p)(1 − 2p)
  • $\hat{\theta}_T$ is the root of the score function:
    $0 = \sum_{t=1}^T W_t \left( R_t - b'(\hat{\theta}_T^\top Z_t) \right) Z_t$.

    We use Newton–Raphson optimization to solve for $\hat{\theta}_T$.

    • For unweighted MLE, $W_t = 1$.

    • For AW-MLE, $W_t = \frac{1}{\sqrt{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}}$; this is equivalent to using square-root importance weights with a uniform stabilizing policy.

  • Second derivative of score function: t=1Tb(θ^TTZt)ZtZtT.

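  • We use a Hotelling t-squared test statistic to construct confidence regions for $\theta^*(P)$:
    $C_T(\alpha) = \left\{ \theta \in \mathbb{R}^d : \left\| \hat{\Sigma}_T^{-1/2} \left( \frac{1}{T} \sum_{t=1}^T W_t b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top \right) \sqrt{T} (\hat{\theta}_T - \theta) \right\|_2^2 \le \frac{d(T-1)}{T-d} F_{d, T-d}(1-\alpha) \right\}$. (11)
    • For the MLE variance estimator, we use $\hat{\Sigma}_T = \frac{1}{T} \sum_{t=1}^T b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top$.
    • For the AW-MLE variance estimator, we use $\hat{\Sigma}_T = \frac{1}{T} \sum_{t=1}^T \left( \frac{1}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})} \right)^{A_t} \left( \frac{1}{1 - \pi_t(A_t, X_t, \mathcal{H}_{t-1})} \right)^{1 - A_t} b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top$.
  • To construct (non-projected) confidence regions for $\theta_1^*(P) \in \mathbb{R}^{d_1}$ we treat the MLE / AW-MLE estimators, $\hat{\theta}_{T,1}$, as $\mathcal{N}\left( \theta_1^*(P), \frac{1}{T} \left( \frac{1}{T} \sum_{t=1}^T W_t b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top \right)^{-1} \hat{\Sigma}_T \left( \frac{1}{T} \sum_{t=1}^T W_t b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top \right)^{-1} \right)$. We use a Hotelling t-squared test statistic to construct confidence regions for $\theta_1^*(P)$:
    $C_T(\alpha) = \left\{ \theta_1 \in \mathbb{R}^{d_1} : \left\| V_{1,T}^{-1/2} \sqrt{T} (\hat{\theta}_{T,1} - \theta_1) \right\|_2^2 \le \frac{d_1(T-1)}{T-d_1} F_{d_1, T-d_1}(1-\alpha) \right\}$,
    where $V_{1,T}$ is the lower right $d_1 \times d_1$ block of the matrix $\left( \frac{1}{T} \sum_{t=1}^T W_t b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top \right)^{-1} \hat{\Sigma}_T \left( \frac{1}{T} \sum_{t=1}^T W_t b''(\hat{\theta}_T^\top Z_t) Z_t Z_t^\top \right)^{-1}$, in parallel with the least-squares construction in Section A.2.1.
  • For the AW-MLE estimator, we also construct projected confidence regions for $\theta_1^*(P)$ using the confidence region defined in Equation (11). See Section A.2.5 below for more details on constructing projected confidence regions.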
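For the Bernoulli row of the table above, where $b'(\nu) = \text{expit}(\nu)$ and $b''(\nu) = p(1-p)$, the weighted score equation can be solved with a few Newton–Raphson steps. The sketch below is only an illustration under those assumptions; the function name `aw_mle_logistic`, the zero starting point, and the stopping rule are choices made here and are not taken from the paper.

```python
import numpy as np


def aw_mle_logistic(Z, R, pi, adaptive=True, n_iter=50, tol=1e-10):
    """Solve 0 = sum_t W_t (R_t - expit(theta' Z_t)) Z_t by Newton-Raphson.

    Z: (T, d) features, R: (T,) binary rewards, pi: (T,) probability of the action taken.
    """
    Z, R, pi = (np.asarray(a, dtype=float) for a in (Z, R, pi))
    W = 1.0 / np.sqrt(pi) if adaptive else np.ones(len(R))
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))          # b'(nu_t)
        score = Z.T @ (W * (R - p))                     # weighted score
        hess = (Z * (W * p * (1.0 - p))[:, None]).T @ Z  # sum_t W_t b''(nu_t) Z_t Z_t^T
        step = np.linalg.solve(hess, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta


rng = np.random.default_rng(0)
T = 2000
Z = np.column_stack([np.ones(T), rng.uniform(0, 5, size=T)])
p_true = 1.0 / (1.0 + np.exp(-(Z @ np.array([0.1, 0.1]))))
R = (rng.uniform(size=T) < p_true).astype(float)
pi = rng.uniform(0.05, 0.95, size=T)
print(aw_mle_logistic(Z, R, pi))
```

Setting `adaptive=False` recovers the unweighted MLE, so the same routine covers both estimators compared in the simulations.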

A.2.3. W-Decorrelated

The following is based on Algorithm 1 of Deshpande et al. [2018].

  • The W-decorrelated estimator for $\theta^*(P)$ is constructed as follows with adaptive weights $W_t \in \mathbb{R}^d$:
    $\hat{\theta}_T^{WD} = \hat{\theta}_T^{OLS} + \sum_{t=1}^T W_t (R_t - \tilde{X}_t^\top \hat{\theta}_T^{OLS})$.
  • The weights are set as follows:
    $W_1 = 0 \in \mathbb{R}^d$ and $W_t = \left( I_d - \sum_{s=1}^{t} \sum_{u=1}^{t} W_s Z_u^\top \right) Z_t \frac{1}{\lambda_T + \|Z_t\|_2^2}$ for $t > 1$.
  • We choose $\lambda_T = \mathrm{mineig}_{0.01}(Z_t Z_t^\top) / \log T$, where $\mathrm{mineig}_\alpha(Z_t Z_t^\top)$ represents the $\alpha$ quantile of the minimum eigenvalue of $Z_t Z_t^\top$. This is similar to the procedure used in the simulations of Deshpande et al. [2018] and is guided by Proposition 5 in their paper.

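  • We assume homoskedastic errors and estimate the noise variance $\sigma^2$ as follows:
    $\hat{\sigma}_T^2 = \frac{1}{T} \sum_{t=1}^T (R_t - Z_t^\top \hat{\theta}_T^{OLS})^2$.
  • Confidence ellipsoids for $\theta^*(P)$ are constructed using a Hotelling t-squared statistic:
    $C_T(\alpha) = \left\{ \theta \in \mathbb{R}^d : (\hat{\theta}_T^{WD} - \theta)^\top V_T^{-1} (\hat{\theta}_T^{WD} - \theta) \le \frac{d(T-1)}{T-d} F_{d, T-d}(1-\alpha) \right\}$
    where $V_T = \hat{\sigma}_T^2 \sum_{t=1}^T W_t W_t^\top$.
  • We construct confidence ellipsoids for $\theta_1^*(P) \in \mathbb{R}^{d_1}$ as follows, where $V_{T,1}$ is the lower right $d_1 \times d_1$ block of the matrix $V_T$:
    $C_T(\alpha) = \left\{ \theta_1 \in \mathbb{R}^{d_1} : (\hat{\theta}_{T,1}^{WD} - \theta_1)^\top V_{T,1}^{-1} (\hat{\theta}_{T,1}^{WD} - \theta_1) \le \frac{d_1(T-1)}{T-d_1} F_{d_1, T-d_1}(1-\alpha) \right\}$.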
A.2.4. Self-Normalized Martingale Bound

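We construct a $1 - \alpha$ confidence region using the following equation taken from Theorem 2 of Abbasi-Yadkori et al. [2011]:

$C_T(\alpha) = \left\{ \theta \in \Theta : \sqrt{(\hat{\theta}_T - \theta)^\top V_T (\hat{\theta}_T - \theta)} \le \sigma \sqrt{2 \log\left( \frac{\det(V_T)^{1/2} \det(\lambda I_d)^{-1/2}}{\alpha} \right)} + \lambda^{1/2} S \right\}$.
  • $\hat{\theta}_T = \left( \lambda I_d + \sum_{t=1}^T Z_t Z_t^\top \right)^{-1} \sum_{t=1}^T Z_t R_t$.

  • $V_T = I_d \lambda + \sum_{t=1}^T Z_t Z_t^\top$.

  • $\lambda = 1$ (ridge regression regularization parameter).

  • $\sigma = 1$ (assumes rewards are $\sigma$-subgaussian).

  • $S = 6$, where it is assumed that $\|\theta^*(P)\|_2 \le S$ (recall that in our simulations $\theta^*(P) \in \mathbb{R}^6$).

  • $\Theta = \{\theta \in \mathbb{R}^6 : \|\theta\|_2 \le 6\}$.

  • For constructing confidence regions for $\theta^*(P)$, we use projected confidence regions.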
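To make the self-normalized region concrete, the sketch below evaluates the radius $\sigma \sqrt{2 \log(\det(V_T)^{1/2} \det(\lambda I_d)^{-1/2} / \alpha)} + \lambda^{1/2} S$ from Theorem 2 of Abbasi-Yadkori et al. [2011] and checks membership of a candidate $\theta$. The function names are hypothetical, the defaults mirror the settings listed above ($\lambda = 1$, $\sigma = 1$, $S = 6$), and the log-determinant is used in place of the determinant purely for numerical stability.

```python
import numpy as np


def self_normalized_region(Z, R, lam=1.0, sigma=1.0, S=6.0, alpha=0.1):
    """Return (theta_hat, V, radius) for the self-normalized confidence region."""
    Z, R = np.asarray(Z, dtype=float), np.asarray(R, dtype=float)
    d = Z.shape[1]
    V = lam * np.eye(d) + Z.T @ Z                  # V_T = lambda I_d + sum_t Z_t Z_t^T
    theta_hat = np.linalg.solve(V, Z.T @ R)        # ridge estimator
    _, logdet_V = np.linalg.slogdet(V)
    # log( det(V)^{1/2} det(lam I)^{-1/2} / alpha )
    log_term = 0.5 * logdet_V - 0.5 * d * np.log(lam) - np.log(alpha)
    radius = sigma * np.sqrt(2.0 * log_term) + np.sqrt(lam) * S
    return theta_hat, V, radius


def in_region(theta, theta_hat, V, radius, S=6.0):
    """Check ||theta_hat - theta||_V <= radius and the constraint ||theta||_2 <= S."""
    diff = theta_hat - np.asarray(theta, dtype=float)
    return bool(np.sqrt(diff @ V @ diff) <= radius and np.linalg.norm(theta) <= S)


rng = np.random.default_rng(0)
Z = rng.uniform(0, 5, size=(200, 6))
theta_true = np.array([0.1, 0.1, 0.1, 0.0, 0.0, 0.0])
R = Z @ theta_true + rng.standard_normal(200)
theta_hat, V, radius = self_normalized_region(Z, R)
print(radius, in_region(theta_true, theta_hat, V, radius))
```

The additive term $\lambda^{1/2} S$ does not shrink with $T$, which is one way to see why regions built from this bound tend to be conservative relative to the asymptotic ones.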

A.2.5. Construction of Projected Confidence Regions

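We are interested in the projection of a $d$-dimensional confidence ellipsoid onto a $p$-dimensional space, for $p < d$.

  • Defining the original $d$-dimensional ellipsoid, for $x \in \mathbb{R}^d$ and symmetric $B \in \mathbb{R}^{d \times d}$:
    $x^\top B x = 1$
  • Partitioning the matrix $B$ and vector $x$:

    For $y \in \mathbb{R}^{d-p}$ and $z \in \mathbb{R}^p$,
    $x = \begin{bmatrix} y \\ z \end{bmatrix}$.
    For $C \in \mathbb{R}^{(d-p) \times (d-p)}$, $E \in \mathbb{R}^{p \times p}$, and $D \in \mathbb{R}^{(d-p) \times p}$,
    $B = \begin{bmatrix} C & D \\ D^\top & E \end{bmatrix}$.
  • Gradient of $x^\top B x$ with respect to $x$:
    $(B + B^\top) x = 2Bx = 2 \begin{bmatrix} C & D \\ D^\top & E \end{bmatrix} \begin{bmatrix} y \\ z \end{bmatrix}$.
    Since we are projecting onto the $p$-dimensional space, our projection is such that the gradient of $x^\top B x$ with respect to $y$ is zero, which means
    $Cy + Dz = 0$.

    This means that in the projection $y = -C^{-1} D z$.

  • Returning to our definition of the ellipsoid and plugging in $y$, we have that
    $1 = x^\top B x = \begin{bmatrix} y^\top & z^\top \end{bmatrix} \begin{bmatrix} C & D \\ D^\top & E \end{bmatrix} \begin{bmatrix} y \\ z \end{bmatrix} = y^\top C y + 2 z^\top D^\top y + z^\top E z$
    $= (C^{-1} D z)^\top C (C^{-1} D z) - 2 z^\top D^\top (C^{-1} D z) + z^\top E z$
    $= z^\top D^\top C^{-1} D z - 2 z^\top D^\top C^{-1} D z + z^\top E z = z^\top (E - D^\top C^{-1} D) z$.
    Thus the equation for the final projected ellipsoid is
    $z^\top (E - D^\top C^{-1} D) z = 1$.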
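The projection derived above is a Schur complement, which can be computed directly. The following is a small NumPy sketch (the function name `project_ellipsoid` is hypothetical); it also checks numerically a standard identity, namely that $E - D^\top C^{-1} D$ is the inverse of the lower right $p \times p$ block of $B^{-1}$.

```python
import numpy as np


def project_ellipsoid(B, p):
    """Matrix Q with {z : z^T Q z = 1} the projection of {x : x^T B x = 1}
    onto the last p coordinates, i.e. Q = E - D^T C^{-1} D."""
    B = np.asarray(B, dtype=float)
    k = B.shape[0] - p                      # dimension of the coordinates projected out
    C, D, E = B[:k, :k], B[:k, k:], B[k:, k:]
    return E - D.T @ np.linalg.solve(C, D)


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
B = A @ A.T + 6 * np.eye(6)                 # symmetric positive definite example
Q = project_ellipsoid(B, p=3)
# Schur complement identity: Q^{-1} equals the lower-right block of B^{-1}.
print(np.allclose(np.linalg.inv(Q), np.linalg.inv(B)[3:, 3:]))
```

In the setting of this appendix one would take $p = d_1$, so that the projected region is a confidence region for the advantage parameters $\theta_1^*(P)$.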

A.3. Additional Simulation Results

In addition to the continuous reward and binary reward settings, here we also consider a discrete count reward setting. In this discrete reward setting, the reward $R_t$ is generated from a Poisson distribution with expectation $\mathbb{E}_P[R_t \mid X_t, A_t] = \exp(\tilde{X}_t^\top \theta_0^*(P) + A_t \tilde{X}_t^\top \theta_1^*(P))$. All other data generation methods are the same as those used for the other simulation settings. Additionally, we consider the setting in which $\theta^*(P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ for the continuous reward, binary reward, and discrete count settings.

To analyze the data in the discrete count reward setting, we assume a correctly specified model for the expected reward. We use both unweighted and adaptively weighted maximum likelihood estimators (MLEs), which correspond to M-estimators with $m_\theta(R_t, X_t, A_t)$ set to the negative log-likelihood of $R_t$ given $X_t, A_t$. We solve for these estimators using Newton–Raphson optimization and do not put explicit bounds on the parameter space $\Theta$.

Figure 3: Poisson Rewards:

Empirical coverage probabilities for 90% confidence ellipsoids for parameters θ(P) and parameters θ1(P) (top row). We also plot the volumes of these 90% confidence ellipsoids for θ(P) and parameters θ1(P) (bottom row). We set the true parameters to θ(P)=[0.1,0.1,0.1,0,0,0] (left) and to θ(P)=[0.1,0.1,0.1,0.2,0.1,0] (right).

Figure 4:

Empirical coverage probabilities (upper row) and volume (lower row) of 90% confidence ellipsoids. In these simulations, θ(P)=[0.1,0.1,0.1,0.2,0.1,0]. The left two columns are for the linear reward model setting (t-distributed rewards) and the right two columns are for the logistic regression model setting (Bernoulli rewards). We consider confidence ellipsoids for all parameters θ(P) and for advantage parameters θ1(P) for both settings.

In Figure 5, we plot the mean squared errors of all estimators for all three simulation settings (same simulation hyperparameters as described previously for the respective simulation settings).

Figure 5:

Mean squared error of estimators of $\theta^*(P)$ for the linear model (top), logistic regression model (middle), and generalized linear model for Poisson rewards (bottom). We consider simulations with $\theta^*(P) = [0.1, 0.1, 0.1, 0, 0, 0]$ (left) and simulations with $\theta^*(P) = [0.1, 0.1, 0.1, 0.2, 0.1, 0]$ (right).

B. Asymptotic Results

Throughout, ∥ · ∥ refers to the L2 norm.

B.1. Definitions

Here we define convergence in probability and in distribution that is uniform over the true parameter. Our definitions are based on those in Kasy [2019] and Van Der Vaart and Wellner [1996, Chapter 1.12].

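Definition 1 (Uniform Convergence in Probability). Let $\{Z_T(P)\}_{T \ge 1}$ be a sequence of random variables whose distributions are defined by some $P \in \mathbf{P}$ and some nuisance component $\eta$. We say that $Z_T(P) \overset{P}{\to} c$ uniformly over $P \in \mathbf{P}$ as $T \to \infty$ if for any $\epsilon > 0$,

$$\sup_{P \in \mathbf{P}} \mathbb{P}_{P,\eta}\left( \|Z_T(P) - c\| > \epsilon \right) \to 0. \quad (12)$$

For simplicity of notation, throughout we write $Z_T(P) - c = o_{P \in \mathbf{P}}(1)$ to mean $Z_T(P) \overset{P}{\to} c$ uniformly over $P \in \mathbf{P}$ as $T \to \infty$.

Definition 2 (Uniformly Stochastically Bounded). Let $\{Z_T(P)\}_{T \ge 1}$ be a sequence of random variables whose distributions are defined by some $P \in \mathbf{P}$ and some nuisance component $\eta$. We say that $Z_T(P)$ is uniformly stochastically bounded over $P \in \mathbf{P}$ as $T \to \infty$ if for any $\epsilon > 0$ there exists some $k < \infty$ such that

$$\limsup_{T \to \infty} \sup_{P \in \mathbf{P}} \mathbb{P}_{P,\eta}\left( \|Z_T(P)\| > k \right) < \epsilon.$$

Similarly, we write $Z_T(P) = O_{P \in \mathbf{P}}(1)$ to mean $Z_T(P)$ is stochastically bounded uniformly over $P \in \mathbf{P}$ as $T \to \infty$.

Definition 3 (Uniform Convergence in Distribution). Let $Z(P) \in \mathbb{R}^{d_Z}$ and let $\{Z_T(P)\}_{T \ge 1} \subset \mathbb{R}^{d_Z}$ be a sequence of random variables whose distributions are defined by some $P \in \mathbf{P}$ and some nuisance component $\eta$. We say that $Z_T(P) \overset{D}{\to} Z(P)$ uniformly over $P \in \mathbf{P}$ as $T \to \infty$ if

$$\sup_{P \in \mathbf{P}} \sup_{f \in BL_1} \left| \mathbb{E}_{P,\eta}[f(Z_T(P))] - \mathbb{E}_{P,\eta}[f(Z(P))] \right| \to 0, \quad (13)$$

where $BL_1$ is the set of functions $f : \mathbb{R}^{d_Z} \to \mathbb{R}$ with $|f(z)| \le 1$ and $|f(z) - f(z')| \le \|z - z'\|$ for all $z, z' \in \mathbb{R}^{d_Z}$.

As discussed in Kasy [2019], Equation (12) holds if and only if for any $\epsilon > 0$ and any sequence $\{P_T\}_{T \ge 1}$ such that $P_T \in \mathbf{P}$ for all $T \ge 1$, $\mathbb{P}_{P_T,\eta}\left( \|Z_T(P_T) - c\| > \epsilon \right) \to 0$.

Similarly, Equation (13) holds if and only if for any sequence $\{P_T\}_{T \ge 1}$ such that $P_T \in \mathbf{P}$ for all $T \ge 1$, $\sup_{f \in BL_1} \left| \mathbb{E}_{P_T,\eta}[f(Z_T(P_T))] - \mathbb{E}_{P_T,\eta}[f(Z(P_T))] \right| \to 0$.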

B.2. Consistency

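We prove the first part of Theorem 1, i.e., that $\hat{\theta}_T \overset{P}{\to} \theta^*(P)$ uniformly over $P \in \mathbf{P}$. We abbreviate $m_\theta(Y_t, X_t, A_t)$ with $m_{\theta,t}$. By definition of $\hat{\theta}_T$,

$$\sum_{t=1}^T W_t m_{\hat{\theta}_T,t} = \sup_{\theta \in \Theta} \sum_{t=1}^T W_t m_{\theta,t} \ge \sum_{t=1}^T W_t m_{\theta^*(P),t}.$$

Note that $\|\hat{\theta}_T - \theta^*(P)\| > \epsilon > 0$ implies that

$$\sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \sum_{t=1}^T W_t m_{\theta,t} = \sup_{\theta \in \Theta} \sum_{t=1}^T W_t m_{\theta,t}.$$

Thus, the above two results imply the following inequality:

$$\begin{aligned}
\sup_{P \in \mathbf{P}} \mathbb{P}_{P,\pi}\left( \|\hat{\theta}_T - \theta^*(P)\| > \epsilon \right)
&\le \sup_{P \in \mathbf{P}} \mathbb{P}_{P,\pi}\left( \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \sum_{t=1}^T W_t m_{\theta,t} \ge \sum_{t=1}^T W_t m_{\theta^*(P),t} \right) \\
&= \sup_{P \in \mathbf{P}} \mathbb{P}_{P,\pi}\left( \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T W_t m_{\theta,t} \right\} - \frac{1}{T} \sum_{t=1}^T W_t m_{\theta^*(P),t} \ge 0 \right) \\
&= \sup_{P \in \mathbf{P}} \mathbb{P}_{P,\pi}\Bigg( \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \left( W_t m_{\theta,t} - \mathbb{E}_{P,\pi}[W_t m_{\theta,t} \mid \mathcal{H}_{t-1}] + \mathbb{E}_{P,\pi}[W_t m_{\theta,t} \mid \mathcal{H}_{t-1}] \right) \right\} \\
&\qquad\qquad - \frac{1}{T} \sum_{t=1}^T \left\{ W_t m_{\theta^*(P),t} - \mathbb{E}_{P,\pi}[W_t m_{\theta^*(P),t} \mid \mathcal{H}_{t-1}] + \mathbb{E}_{P,\pi}[W_t m_{\theta^*(P),t} \mid \mathcal{H}_{t-1}] \right\} \ge 0 \Bigg).
\end{aligned}$$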

By triangle inequality,

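$$\begin{aligned}
\le \sup_{P \in \mathbf{P}} \mathbb{P}_{P,\pi}\Bigg(
&\underbrace{\sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \left( W_t m_{\theta,t} - \mathbb{E}_{P,\pi}[W_t m_{\theta,t} \mid \mathcal{H}_{t-1}] \right) \right\}}_{(a)} \\
&+ \underbrace{\sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ W_t (m_{\theta,t} - m_{\theta^*(P),t}) \mid \mathcal{H}_{t-1} \right] \right\}}_{(b)} \\
&- \underbrace{\frac{1}{T} \sum_{t=1}^T \left\{ W_t m_{\theta^*(P),t} - \mathbb{E}_{P,\pi}[W_t m_{\theta^*(P),t} \mid \mathcal{H}_{t-1}] \right\}}_{(c)} \ge 0 \Bigg) \to 0. \quad (14)
\end{aligned}$$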

We now show that the limit in Equation (14) above holds.

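  • Regarding term (c), by the moment bounds of Condition 5 and Lemma 1, $\frac{1}{T} \sum_{t=1}^T \left\{ W_t m_{\theta^*(P),t} - \mathbb{E}_{P,\pi}[W_t m_{\theta^*(P),t} \mid \mathcal{H}_{t-1}] \right\} = o_{P \in \mathbf{P}}(1)$.

  • Regarding term (a), by Lemma 2, $\sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \left( W_t m_{\theta,t} - \mathbb{E}_{P,\pi}[W_t m_{\theta,t} \mid \mathcal{H}_{t-1}] \right) \right\} = o_{P \in \mathbf{P}}(1)$.

Thus it is sufficient to show that term (b) is such that for some $\delta' > 0$,

$$\sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ W_t (m_{\theta,t} - m_{\theta^*(P),t}) \mid \mathcal{H}_{t-1} \right] \right\} \le -\delta' \quad \text{w.p. } 1. \quad (15)$$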

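By the law of iterated expectations,

$$\sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ W_t (m_{\theta,t} - m_{\theta^*(P),t}) \mid \mathcal{H}_{t-1} \right] \right\}$$
$$= \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t(a, X_t, \mathcal{H}_{t-1})\, \mathbb{E}_P\left[ W_t (m_{\theta,t} - m_{\theta^*(P),t}) \mid \mathcal{H}_{t-1}, X_t, A_t = a \right] da \,\Big|\, \mathcal{H}_{t-1} \right] \right\}.$$

Since $W_t \in \sigma(\mathcal{H}_{t-1}, X_t, A_t)$, we have that $\mathbb{E}_P[W_t (m_{\theta,t} - m_{\theta^*(P),t}) \mid \mathcal{H}_{t-1}, X_t, A_t = a] = W_t\, \mathbb{E}_P[m_{\theta,t} - m_{\theta^*(P),t} \mid \mathcal{H}_{t-1}, X_t, A_t = a]$. By Condition 1, we have that $W_t\, \mathbb{E}_P[m_{\theta,t} - m_{\theta^*(P),t} \mid \mathcal{H}_{t-1}, X_t, A_t = a] = W_t\, \mathbb{E}_P[m_{\theta,t} - m_{\theta^*(P),t} \mid X_t, A_t = a]$. Thus we have,

$$= \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t(a, X_t, \mathcal{H}_{t-1})\, W_t\, \mathbb{E}_P\left[ m_{\theta,t} - m_{\theta^*(P),t} \mid X_t, A_t = a \right] da \,\Big|\, \mathcal{H}_{t-1} \right] \right\}.$$

Since for all $\theta \in \Theta$, $\mathbb{E}_P[m_{\theta,t} - m_{\theta^*(P),t} \mid X_t, A_t] \le 0$ with probability 1 by Condition 7 and since $0 < W_t / \sqrt{\rho_{\max}} \le 1$ with probability 1 by Condition 9,

$$\le \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T \sqrt{\rho_{\max}}} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t(a, X_t, \mathcal{H}_{t-1})\, W_t^2\, \mathbb{E}_P\left[ m_{\theta,t} - m_{\theta^*(P),t} \mid X_t, A_t = a \right] da \,\Big|\, \mathcal{H}_{t-1} \right] \right\}.$$

Since $W_t^2 = \frac{\pi_t^{sta}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}$,

$$= \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T \sqrt{\rho_{\max}}} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t^{sta}(a, X_t)\, \mathbb{E}_P\left[ m_{\theta,t} - m_{\theta^*(P),t} \mid X_t, A_t = a \right] da \,\Big|\, \mathcal{H}_{t-1} \right] \right\}.$$

By Condition 1 and since $\pi_t^{sta}$ is pre-specified, we can drop the conditioning on $\mathcal{H}_{t-1}$, i.e.,

$$= \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T \sqrt{\rho_{\max}}} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t^{sta}(a, X_t)\, \mathbb{E}_P\left[ m_{\theta,t} - m_{\theta^*(P),t} \mid X_t, A_t = a \right] da \right] \right\}.$$

By the law of iterated expectations,

$$= \sup_{\theta \in \Theta : \|\theta - \theta^*(P)\| > \epsilon} \left\{ \frac{1}{T \sqrt{\rho_{\max}}} \sum_{t=1}^T \mathbb{E}_{P,\pi_t^{sta}}\left[ m_{\theta,t} - m_{\theta^*(P),t} \right] \right\} \le -\frac{1}{\sqrt{\rho_{\max}}} \delta.$$

The last inequality above holds for some $\delta > 0$ for all sufficiently large $T$ by Condition 8. Thus Equation (15) holds for $\delta' = \frac{1}{\sqrt{\rho_{\max}}} \delta$.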

B.3. Asymptotic Normality

We prove the second part of Theorem 1, i.e., that

ΣT(P)12M¨T(θ^T)T(θ^Tθ(P))DN(0,Id)uniformly overPP. (16)
B.3.1. Main Argument

The three results we show to ensure Equation (16) holds are as follows:

ΣT(P)12TM.T(θ(P))DN(0,Id)uniformly overPP. (17)

For ϵm¨>0 as defined in Condition 6,

supθΘ:θθ(P)ϵmMT(θ)1=OPP(1). (18)

For matrix H positive definite,

M¨T(θ(P))¯H+oPP(1). (19)

For a reminder of the notation $o_{P \in \mathbf{P}}(1)$ and $O_{P \in \mathbf{P}}(1)$, see Definitions 1 and 2. For now, we assume that Equations (17), (18), and (19) hold; we will show they hold in Sections B.3.2, B.3.3, and B.3.4 respectively. Our argument is based on Van der Vaart [2000, Theorem 5.41].

By differentiability Condition 2, since θ^T is the maximizer of criterion MT(θ),

0=M.T(θ^T).

By differentiability Condition 2 again and Taylor’s theorem we have that for some random θ~T on the line segment between θ(P) and θ^T,

0=M.T(θ^T)=M.T(θ(P))+M¨T(θ(P))(θ^Tθ(P))+12(θ^Tθ(P))TMT(θ~T)(θ^Tθ(P)).

By rearranging terms and multiplying by T,

TM.T(θ(P))=M¨T(θ(P))T(θ^Tθ(P))+12(θ^Tθ(P))TMT(θ~T)T(θ^Tθ(P))=[M¨T(θ(P))+12(θ^Tθ(P))TMT(θ~T)]T(θ^Tθ(P)).

Note that by the above equation and Equation (17), we have that

ΣT(P)12[M¨T(θ(P))+12(θ^Tθ(P))TMT(θ~T)]T(θ^Tθ(P))DN(0,Id)uniformly overPP. (20)

By Equation (19), the probability that M¨T(θ(P)) is invertible goes to 1 uniformly over PP. Thus by Equation (20), we have that

ΣT(P)12[Id+12(θ^Tθ(P))TMT(θ~T)M¨T(θ(P))1]M¨T(θ(P))T(θ^Tθ(P))=[Id+12ΣT(P)12(θ^Tθ(P))TMT(θ~T)M¨T(θ(P))1ΣT(P)12]ΣT(P)12M¨T(θ(P))T(θ^Tθ(P))DN(0,Id)uniformly overPP. (21)

We now show that 12ΣT(P)12(θ^Tθ(P))TMT(θ~T)M¨T(θ(P))1ΣT(P)12=oPP(1). It is sufficient to show that ΣT(P)12θ^Tθ(P)MT(θ~T)1M¨T(θ(P))1ΣT(P)12=oPP(1).

  • By Condition 5, the minimum eigenvalue of ΣT(P) is bounded uniformly above some constant greater than zero, so supPPΣT(P)12=O(1).

  • By uniform consistency of θ^T,θ^Tθ(P)=oPP(1).

  • By uniform consistency of θ^T,1θ~Tθ(P)ϵm=oPP(1). Thus by Equation (18), MT(θ~T)=OPP(1).

  • By Equation (19), the minimum eigenvalue of M¨T(θ(P))1 is bounded above that of positive definite matrix H. Thus M¨T(θ(P))1=OPP(1).

  • By Condition 5, supPPΣT(P)12=O(1).

Thus, by Slutsky’s Theorem and Equation (21), we have that

ΣT(P)12M¨T(θ(P))T(θ^Tθ(P))DN(0,Id)uniformly overPP. (22)

Lastly, to show our desired result, that ΣT(P)12M¨T(θ^T)T(θ^Tθ(P))DN(0,Id) uniformly over PP, by Equation (22) and Slutsky’s Theorem it is sufficient to show that ΣT(P)12M¨T(θ^T)M¨T(θ(P))1ΣT(P)12PId uniformly over PP. Note if we can show that M¨T(θ^T)M¨T(θ(P))1PId uniformly over PP, then ΣT(P)12M¨T(θ^T)M¨T(θ(P))1ΣT(P)12=ΣT(P)12[Id+oPP(1)]ΣT(P)12=Id+ΣT(P)12oPP(1)ΣT(P)12=Id+oPP(1). The last limit holds since ΣT(P)12=OPP(1) and ΣT(P)12=OPP(1) by Condition 5 (use the same argument as that used in the bullet points below Equation (21)).

Thus it is sufficient to show that M¨T(θ^T)M¨T(θ(P))1PId uniformly over PP. By Taylor’s Theorem, for some random θ¯T on the line segment between θ^T and θ(P),

M¨T(θ^T)=M¨T(θ(P))+MT(θ¯T)(θ^Tθ(P)).

Recall that the probability the inverse of M¨T(θ(P)) exists goes to 1 by Equation (19) (use the same argument as that used in the bullet points below Equation (21)). Thus we have that M¨T(θ^T)M¨T(θ(P))1 equals the following:

[M¨T(θ(P))+MT(θ¯T)(θ^Tθ(P))]M¨T(θ(P))1=Id+MT(θ¯T)(θ^Tθ(P))M¨T(θ(P))1

Note that MT(θ¯T)(θ^Tθ(P))M¨T(θ(P))1=oPP(1) because

  • By uniform consistency of θ^T,1θ~Tθ(P)ϵm=oPP(1). Thus by Equation (18), MT(θ~T)=OPP(1).

  • By uniform consistency of θ^T,θ^Tθ(P)=oPP(1).

  • By Equation (19), M¨T(θ(P))1=OPP(1).

B.3.2. Asymptotic Normality of ΣT(P)12TM.T(θ(P))

We will show that Equation (17) holds by applying a martingale central limit theorem. For notational convenience, we let m.θ,tm.θ(Yt,Xt,At). Note that by definition ΣT(P)12TM.T(θ(P))=ΣT(P)121Tt=1TWtm.θ(P),t. We first show that {ΣT(P)121TWtm.θ(P),t}t=1T is a martingale difference sequence with respect to {Ht}t=0T. For any t ∈ [1: T],

EP,π[1TΣT(P)12WtcTm.θ(P),tHt1][=(a)1TEP,π[EP[ΣT(P)12WtcTm.θ(P),tHt1,Xt,At]Ht1][=(b)1TΣT(P)12EP,π[WtcTEP[m.θ(P),tHt1,Xt,At]Ht1]=(c)0
  • Above, (a) holds by law of iterated expectations.

  • (b) holds since Wtσ(Ht1,Xt,At) and since ΣT(P) are a function of stabilizing policies {πtsta}t1, which are pre-specified.

  • By Condition 1, EP[m.θ(P),tHt1,Xt,At]=EP[m.θ(P),tXt,At]. Equality (c) holds because EP[m.θ(P),tXt,At]=0 with probability 1 by Condition 7; note that θ(P) is a critical point of EP[mθ,tXt,At].

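By the Cramer-Wold device, to show that Equation (17) holds, it is sufficient to show that for any fixed $c \in \mathbb{R}^d$ with $\|c\|_2 = 1$, $c^\top \Sigma_T(P)^{-1/2} \frac{1}{\sqrt{T}} \sum_{t=1}^T W_t \dot{m}_{\theta^*(P),t} \overset{D}{\to} \mathcal{N}(0, c^\top I_d c)$ uniformly over $P \in \mathbf{P}$. We now apply Theorem 2, a uniform version of the martingale central limit theorem of Dvoretzky [1972]; while the original theorem holds for any fixed $P$, we can show uniform convergence in distribution by ensuring that the conditions of the theorem hold uniformly over $P \in \mathbf{P}$ (see Definition 3). By Theorem 2, it is sufficient to show that the following two conditions hold:

1. Conditional Variance: $\frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ \left\{ c^\top \Sigma_T(P)^{-1/2} W_t \dot{m}_{\theta^*(P),t} \right\}^2 \mid \mathcal{H}_{t-1} \right] \overset{P}{\to} \sigma^2$ uniformly over $P \in \mathbf{P}$ (here $\sigma^2 = c^\top I_d c = 1$).

2. Conditional Lindeberg: For any $\delta > 0$, $\frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ \left\{ c^\top \Sigma_T(P)^{-1/2} W_t \dot{m}_{\theta^*(P),t} \right\}^2 \mathbb{1}_{\left| c^\top \Sigma_T(P)^{-1/2} W_t \dot{m}_{\theta^*(P),t} \right| > \delta \sqrt{T}} \mid \mathcal{H}_{t-1} \right] \overset{P}{\to} 0$ uniformly over $P \in \mathbf{P}$.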

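1. Conditional Variance

$$\begin{aligned}
&\frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ \left( c^\top W_t \Sigma_T(P)^{-1/2} \dot{m}_{\theta^*(P),t} \right)^2 \mid \mathcal{H}_{t-1} \right]
= \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ W_t^2 c^\top \Sigma_T(P)^{-1/2} \dot{m}_{\theta^*(P),t}^{\otimes 2} \Sigma_T(P)^{-1/2} c \mid \mathcal{H}_{t-1} \right] \\
&\overset{(a)}{=} c^\top \Sigma_T(P)^{-1/2} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi}\left[ W_t^2 \dot{m}_{\theta^*(P),t}^{\otimes 2} \mid \mathcal{H}_{t-1} \right] \right\} \Sigma_T(P)^{-1/2} c \\
&\overset{(b)}{=} c^\top \Sigma_T(P)^{-1/2} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t(a, X_t, \mathcal{H}_{t-1})\, \mathbb{E}_P\left[ W_t^2 \dot{m}_{\theta^*(P),t}^{\otimes 2} \mid \mathcal{H}_{t-1}, X_t, A_t = a \right] da \,\Big|\, \mathcal{H}_{t-1} \right] \right\} \Sigma_T(P)^{-1/2} c \\
&\overset{(c)}{=} c^\top \Sigma_T(P)^{-1/2} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_P\left[ \int_{a \in \mathcal{A}} \pi_t^{sta}(a, X_t)\, \mathbb{E}_P\left[ \dot{m}_{\theta^*(P),t}^{\otimes 2} \mid \mathcal{H}_{t-1}, X_t, A_t = a \right] da \,\Big|\, \mathcal{H}_{t-1} \right] \right\} \Sigma_T(P)^{-1/2} c \\
&\overset{(d)}{=} c^\top \Sigma_T(P)^{-1/2} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_P\left[ \mathbb{E}_{P,\pi_t^{sta}}\left[ \dot{m}_{\theta^*(P),t}^{\otimes 2} \mid X_t \right] \mid \mathcal{H}_{t-1} \right] \right\} \Sigma_T(P)^{-1/2} c \\
&\overset{(e)}{=} c^\top \Sigma_T(P)^{-1/2} \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{P,\pi_t^{sta}}\left[ \dot{m}_{\theta^*(P),t}^{\otimes 2} \right] \right\} \Sigma_T(P)^{-1/2} c \\
&\overset{(f)}{=} c^\top \Sigma_T(P)^{-1/2} \Sigma_T(P) \Sigma_T(P)^{-1/2} c = c^\top I_d c
\end{aligned}$$

  • Above, (a) holds since $\Sigma_T(P)$ is a function of the stabilizing policies $\{\pi_t^{sta}\}_{t \ge 1}$, which are pre-specified.

  • Equality (b) holds by the law of iterated expectations.

  • Equality (c) holds since $W_t = \sqrt{\frac{\pi_t^{sta}(A_t, X_t)}{\pi_t(A_t, X_t, \mathcal{H}_{t-1})}} \in \sigma(\mathcal{H}_{t-1}, X_t, A_t)$.

  • Equality (d) holds because by Condition 1, $\mathbb{E}_P[\dot{m}_{\theta^*(P),t}^{\otimes 2} \mid \mathcal{H}_{t-1}, X_t, A_t = a] = \mathbb{E}_P[\dot{m}_{\theta^*(P),t}^{\otimes 2} \mid X_t, A_t = a]$ and by the law of iterated expectations.

  • Equality (e) holds because by Condition 1, the distribution of $X_t$ does not depend on $\mathcal{H}_{t-1}$, so $\mathbb{E}_P\left[ \mathbb{E}_{P,\pi_t^{sta}}[\dot{m}_{\theta^*(P),t}^{\otimes 2} \mid X_t] \mid \mathcal{H}_{t-1} \right] = \mathbb{E}_P\left[ \mathbb{E}_{P,\pi_t^{sta}}[\dot{m}_{\theta^*(P),t}^{\otimes 2} \mid X_t] \right] = \mathbb{E}_{P,\pi_t^{sta}}[\dot{m}_{\theta^*(P),t}^{\otimes 2}]$; the last equality holds by the law of iterated expectations.

  • Equality (f) holds by definition.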
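The key step in the conditional variance argument is equality (c): averaging $W_t^2 g(a)$ over actions drawn from the adaptive policy gives the same value as averaging $g(a)$ under the stabilizing policy, whatever the adaptive policy was. The toy check below illustrates this for a two-action problem with a fixed context; the dictionaries and function names are hypothetical, and `g` stands in for a scalar version of the conditional second moment of $\dot{m}_{\theta^*(P),t}$ given the action.

```python
def sqrt_importance_weight(pi_sta, pi, a):
    """Square-root importance weight W = sqrt(pi_sta(a) / pi(a))."""
    return (pi_sta[a] / pi[a]) ** 0.5


def weighted_second_moment(pi_sta, pi, g):
    """sum_a pi(a) * W(a)^2 * g(a): expectation under the adaptive policy pi."""
    return sum(pi[a] * sqrt_importance_weight(pi_sta, pi, a) ** 2 * g[a] for a in pi)


def stabilized_second_moment(pi_sta, g):
    """sum_a pi_sta(a) * g(a): expectation under the stabilizing policy."""
    return sum(pi_sta[a] * g[a] for a in pi_sta)


pi_sta = {0: 0.5, 1: 0.5}          # uniform stabilizing policy
g = {0: 2.0, 1: 7.0}               # illustrative conditional second moments
for pi in ({0: 0.9, 1: 0.1}, {0: 0.05, 1: 0.95}, {0: 0.5, 1: 0.5}):
    print(weighted_second_moment(pi_sta, pi, g), stabilized_second_moment(pi_sta, g))
```

Both columns agree for every choice of the adaptive policy, which is why the conditional variance does not depend on the history once the square-root weights are applied.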

2. Conditional Lindeberg
1Tt=1TEP,π[[(cTWtΣT(P)12m.θ(P),t)21cTWtΣT(P)12m.θ(P),t>δTHt1]=1Tt=1TEP,π[[Wt2cTΣT(P)12m.θ(P),t2ΣT(P)12c1cTWtΣT(P)12m.θ(P),t>δTHt1](a)1T2δ2t=1TEP,π[[Wt4(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Ht1](b)ρmaxT2δ2t=1TEP,π[[Wt4(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Ht1](c)ρmaxT2δ2t=1TEP[[aAπi(a,Xt,Ht1)EP[Wt2(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Ht1,Xt,At=a]daHt1](d)ρmaxT2δ2t=1TEP[[aAπtsta(a,Xt)EP[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Ht1,Xt,At=a]daHt1](e)ρmaxT2δ2t=1TEP[[EP[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Xt]Ht1](f)ρmaxT2δ2t=1TEP,πtsta[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2](g)0
  • Above, inequality (a) holds because 1WtcTΣT(P)12m.θ(P),t>Tδ=1 if and only if Wt21Tδ2cTΣT(P)12m.θ(P),t2ΣT(P)12c>1.

  • Inequality (b) holds because by Condition 9, Wt2ρmax with probability 1.

  • Equality (c) holds by the law of iterated expectations.

  • Equality (d) holds since Wt=πtsta(At,Xt)πt(At,Xt,Ht1)σ(Ht1,Xt,At).

  • Equality (e) holds because by Condition 1, EP[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Ht1,Xt,At=a]=EP[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2Xt] and by law of iterated expectations.

  • Equality (f) holds since the distribution of Xt does not depend on Ht1 by Condition 1 and by law of iterated expectations.

  • Regarding limit (g), it is sufficient to show that 1Tt=1TEP,πtsta[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2] is uniformly bounded over PP for all sufficiently large T. By Condition 5, the minimum eigenvalue of ΣT(P) is bounded above zero uniformly over PP for all sufficiently large T; this bounds the maximum eigenvalue of ΣT(P)−1. Also by Condition 5 the fourth moment of m.θ(P),t with respect to P and policy πtsta is uniformly bounded over PP and t ≥ 1. With these two properties we have that 1Tt=1TEP,πtsta[(cTΣT(P)12m.θ(P),t2ΣT(P)12c)2] is uniformly bounded over PP for all sufficiently large T.

B.3.3. Showing that supθΘ:θθ(P)ϵmMT(θ)1 is bounded in probability

Recall that for any BRd×d×d, we denote B1=i=1dj=1dk=1dBi,j,k. We abbreviate mθ(Yt,Xt,At) with mθ,t.

By triangle inequality, MT(θ)1=1Tt=1TWtmθ,t11Tt=1TWtmθ,t1. Thus we have that

supθΘ:θθ(P)ϵmMT(θ)1supθΘ:θθ(P)ϵm1Tt=1TWtmθ,t1.

By Condition 6 (ii), there exists a function m (note it is not indexed by θ) such that for all PP, we have that supθΘ:θθ(P)ϵmmθ,t1m(Yt,Xt,At)1.

1Tt=1TWtm(Yt,Xt,At)1.

Adding and subtracting 1Tt=1TEP,π[Wtm(Yt,Xt,At)1Ht1],

1Tt=1TEP,π[Wtm(Yt,Xt,At)1Ht1],=1Tt=1TWtm(Yt,Xt,At)1EP,π[Wtm(Yt,Xt,At)1Ht1]+EP,π[Wtm(Yt,Xt,At)1Ht1].

By second moment bounds on m(Yt,Xt,At)1 from Condition 6 (i), by Lemma 1, we have that 1Tt=1TWtm(Yt,Xt,At)1EP,π[Wtm(Yt,Xt,At)1Ht1]=oPP(1).

=oPP(1)+1Tt=1TEP,π[Wtm(Yt,Xt,At)1Ht1]

Since by Condition 9, Wtρmin1 with probability 1,

oPP(1)+1Tρmint=1TEP,π[Wt2m(Yt,Xt,At)1Ht1]

Since Wt2=πtsta(At,Xt)πt(At,Xt,Ht1) and by Condition 1,

=oPP(1)+1Tρmint=1TEP,πtsta[m(Yt,Xt,At)1]=OPP(1).

Note that by Jensen’s inequality, EP,πtsta[m(Yt,Xt,At)1]EP,πtsta[m(Yt,Xt,At)12]. By Condition 6 (i), supPP,t1EP,πtsta[m(Yt,Xt,At)12] is bounded, which implies the final limit above.

B.3.4. Lower bounding – M¨T(θ(P))

We now show that M¨T(θ(P))¯H+oPP(1), for positive definite matrix H introduced in Condition 7 (ii).

By Condition 5 and Lemma 1, 1Tt=1TWtm¨θ(P),tEP,π[Wtm¨θ(P),tHt1]=oPP(1), so

M¨T(θ(P))=1Tt=1TWtm¨θ(P),t=oPP(1)1Tt=1TEP,π[Wtm¨θ(P),tHt1]

By law of iterated expectations,

=oPP(1)1Tt=1TEP,π[WtEP[m¨θ(P),tHt1,Xt,At]Ht1]

By Condition 1,

=oPP(1)1Tt=1TEP,π[WtEP[m¨θ(P),tXt,At]Ht1]

By Condition 7, we have that EP[m¨θ(P),tXt,At]¯0; recall that θ(P) is a maximizing value of EP,π[mθ,tXt,At]. Also since Wtρmax1 with probability 1 by Condition 9,

¯oPP(1)1Tρmaxt=1TEP,π[Wt2EP,π[m¨θ(P),tXt,At]Ht1]

Since Wt2=πtsta(At,Xt)πt(At,Xt,Ht1),

=oPP(1)1Tρmaxt=1TEP,πtsta[m¨θ(P),tHt1]

Note that for any t ≥ 1, EP,πtsta[m¨θ(P),tHt1]=EP,πtsta[m¨θ(P),t] because {πtsta}t1 are pre-specified. Recall that by Condition 7 for all sufficiently large T, 1Tt=1TEP,πtsta[m¨θ(P),t]¯H for all PP. Thus our final result is that

M¨T(θ(P))¯H+oPP(1). (23)

B.4. Lemmas and Other Helpful Results

Theorem 2 (Uniform Martingale Central Limit Theorem). Let {ZT(P)}T1 be a sequence of random variables whose distributions are defined by some PP and some nuisance component η. Moreover, let {ZT(P)}T1 be a martingale difference sequence with respect to Ft, meaning EP,η[Zt(P)Ft1]=0 for all t ≥ 1 and PP.

  1. 1Tt=1TEP,η[Zt(P)2Ft1]Pσ2 uniformly over PP, where σ2 is a constant 0 < σ2 < ∞.

  2. For any ϵ > 0, 1Tt=1TEP,η[Zt(P)21Zt(P)>ϵFt1]P0 uniformly over PP.

Under the above conditions,

1Tt=1TZt(P)DN(0,σ2)uniformly overPP.

Proof: By Kasy [2019, Lemma 1], it is sufficient to show that for any sequence $\{P_T\}_{T=1}^\infty$ with $P_T \in \mathbf{P}$ for all $T \ge 1$, $\frac{1}{\sqrt{T}} \sum_{t=1}^T Z_t(P_T) \overset{D}{\to} \mathcal{N}(0, \sigma^2)$. In this setting, since $P_T$ depends on $T$, we consider triangular array asymptotics and additionally index by $T$, e.g., $\mathcal{F}_{T,t}$.

Note that 1Tt=1TEPT,η[Zt(PT)2FT,t1]Pσ2, by Kasy [2019, Lemma 1] and condition (a) above.

Also, for any ϵ > 0, 1Tt=1TEPT,η[Zt(PT)21Zt(PT)>ϵFT,t1]P0, by Kasy [2019, Lemma 1] and condition (b) above.

Thus by the martingale central limit theorem of Dvoretzky [1972], we have that for the sequence $\{P_T\}_{T=1}^\infty$,

$$\frac{1}{\sqrt{T}} \sum_{t=1}^T Z_t(P_T) \overset{D}{\to} \mathcal{N}(0, \sigma^2).$$

Since the sequence $\{P_T\}_{T=1}^\infty$ was chosen arbitrarily from $\mathbf{P}$, the desired result is implied again by Kasy [2019, Lemma 1].

Lemma 1. Let f(Yt,Xt,At)Rdf be a function such that supPP,t1EP,πtsta[f(Yt,Xt,At)2]<m for some m < ∞. Under Conditions 1 and 9,

1Tt=1T{Wtf(Yt,Xt,At)EP,π[Wtf(Yt,Xt,At)Ht1]}=OPP(1). (24)

Note that the above equation implies that

1Tt=1T{Wtf(Yt,Xt,At)EP,π[Wtf(Yt,Xt,At)Ht1]}=oPP(1).

Lemma 1 is a type of martingale weak law of large number result and the proof is similar to the weak law of large numbers proofs for i.i.d. random variables.

Proof: We denote the kth ∈ [1: df] dimension of vector f(Yt, Xt, At) as fk(Yt, Xt, At). It is sufficient to show the result for any dimension of vector f(Yt, Xt, At). For notational convenience, let ftfk(Yt, Xt, At). Let ϵ > 0.

supPPPP,π(1Tt=1T{WtftEP,π[WtftHt1]}>ϵ)(a)1Tϵ2supPPEP,π[(t=1T{WtftEP,π[WtftHt1]})2]=(b)1Tϵ2supPPt=1TEP,π[{WtftEP,π[WtftHt1]}2](c)1Tϵ2supPPt=1TEP,π[Wt2ft2]=(d)1Tϵ2supPPt=1TEP[aAWt2πt(a,Xt,Ht1)EP[ft2Ht1,Xt,At=a]da]=(e)1Tϵ2supPPt=1TEP[aAπtsta(a,Xt)EP[ft2Ht1,Xt,At=a]da]=(f)1Tϵ2supPPt=1TEP,πtsta[ft2](g)4mϵ2
  • Above (a) holds by Chebyshev’s inequality.

  • (b) holds because the above terms form a martingale difference sequence with respect to Ht1, i.e., EP,π[WtftEP,π[WtftHt1]Ht1]=0; this implies that cross terms disappear, i.e., for t > s,
    EP,π[(WtftEP,π[WtftHt1])(WsfsEP,π[WsfsHs1])]=[EP,π[EP,π[(WtftEP,π[WtftHt1])(WsfsEP,π[WsfsHs1])Ht1]]
    Since t > s,
    =[EP,π[(WsfsEP,π[WsfsHs1])EP,π[WtftEP,π[WtftHt1]Ht1]]=0.
  • (c) holds because EP,π[{WtftEP,π[WtftHt1]}2]=EP,π[Wt2ft2]EP,π[EP,π[WtftHt1]2]EP,π[Wt2ft2].

  • (d) holds by law of iterated expectations.

  • (e) holds because Wt=πtsta(At,Xt)πt(At,Xt,Ht1).

  • (f) holds since by Condition 1, EP[ft2Ht1,Xt,At]=EP[ft2Xt,At] and by law of iterated expectations EP,πtsta[ft2]=EP[aAπtsta(a,Xt)EP[ft2Xt,At=a]da].

  • (g) holds since supPP,t1EP,πtsta[ft2]<m<.
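Steps (d) through (f) rest on the fact that the squared square-root weight cancels the adaptive action probability. The following is a minimal sketch of that cancellation for a finite action set; the probabilities and second moments are hypothetical numbers chosen only for illustration, and the function names are not from the paper.

```python
import math

def sqrt_weight(pi_sta, pi_t):
    """Square-root importance weight W = sqrt(pi_sta / pi_t) for one action."""
    return math.sqrt(pi_sta / pi_t)

def weighted_second_moment(pi_t, pi_sta, f_sq):
    """Expectation under pi_t of W^2 * f^2, summing over a finite action set."""
    return sum(p * sqrt_weight(s, p) ** 2 * f
               for p, s, f in zip(pi_t, pi_sta, f_sq))

def stabilized_second_moment(pi_sta, f_sq):
    """Expectation under pi_sta of f^2 over the same action set."""
    return sum(s * f for s, f in zip(pi_sta, f_sq))

# Hypothetical inputs (assumptions for illustration only).
pi_t = [0.9, 0.1]    # adaptive action probabilities at time t
pi_sta = [0.5, 0.5]  # pre-specified stabilizing policy
f_sq = [2.0, 7.0]    # values of E[f_t^2 | A_t = a] for each action a

lhs = weighted_second_moment(pi_t, pi_sta, f_sq)
rhs = stabilized_second_moment(pi_sta, f_sq)
print(lhs, rhs)
```

The two quantities agree no matter how skewed `pi_t` is, which is why the final bound in the chain depends only on moments under the stabilizing policy.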

Lemma 2. Let $m_{\theta,t}\triangleq m_\theta(Y_t,X_t,A_t)$. Under Conditions 1, 3, 4, 5, 7, and 9,

$$\sup_{\theta\in\Theta}\left\{\frac{1}{T}\sum_{t=1}^{T}\big(W_tm_{\theta,t}-\mathbb{E}_{P,\pi}\left[W_tm_{\theta,t}\mid\mathcal{H}_{t-1}\right]\big)\right\}=O_{P\in\mathcal{P}}(1).\tag{25}$$

Lemma 2 is a type of martingale functionally uniform law of large numbers result, and the proof is similar to the functionally uniform law of large numbers proofs for i.i.d. random variables [Van Der Vaart and Wellner, 1996, Theorem 2.4.1].

Proof:

Finite Bracketing Number: Let $\delta>0$. We construct a set $B_\delta$ which is made up of pairs of functions $(l,u)$. We show that we can find $B_\delta$ that satisfies the following:

  (a) For any $\theta\in\Theta$, we can find $(l,u)\in B_\delta$ such that
    (i) $l(y,x,a)\le m_\theta(y,x,a)\le u(y,x,a)$ for all $(x,y)$ in the joint support of $\{P\in\mathcal{P}\}$ and all $a\in\mathcal{A}$.
    (ii) $\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[u(Y_t,X_t,A_t)-l(Y_t,X_t,A_t)\right]\le\delta$.
  (b) The number of pairs in this set is finite, i.e., $|B_\delta|<\infty$.

  (c) For any $(l,u)\in B_\delta$, for some $m<\infty$ which does not depend on $\delta$, $\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[u(Y_t,X_t,A_t)^2\right]\le m$ and $\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[l(Y_t,X_t,A_t)^2\right]\le m$.

Showing that we can find $B_\delta$ that satisfies (a) means that $|B_\delta|$ is an upper bound on the bracketing number of $\{m_\theta:\theta\in\Theta\}$. For more information on bracketing functions, see Van Der Vaart and Wellner [1996] and Van der Vaart [2000].

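To construct $B_\delta$, we follow a similar argument to Example 19.7 of Van der Vaart [2000] (page 271). Make a grid over $\Theta$ with meshwidth $\lambda/2>0$ and let the points in this grid be the set $G_{\lambda/2}\subseteq\Theta$; we will specify $\lambda$ later. Note that by construction, for any $\theta\in\Theta$ we can find a $\theta'\in G_{\lambda/2}$ such that $\|\theta'-\theta\|\le\lambda$.

By our Lipschitz Condition 4, we have that for any $\theta,\theta'\in\Theta$, $|m_\theta(Y_t,X_t,A_t)-m_{\theta'}(Y_t,X_t,A_t)|\le g(Y_t,X_t,A_t)\|\theta-\theta'\|$ for a function $g$ such that for some $m_g<\infty$,

$$\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)^2\right]\le m_g.\tag{26}$$

We now show that we can choose $B_\delta=\left\{\left(m_\theta-g(Y_t,X_t,A_t),\,m_\theta+g(Y_t,X_t,A_t)\right):\theta\in G_{\lambda/2}\right\}$. Note that by compactness of $\Theta$, Condition 3, the number of points in $G_{\lambda/2}$ is finite, so (b) above holds.

To show that (a) holds for our choice of $B_\delta$, recall that for any $\theta\in\Theta$ we can find a $\theta'\in G_{\lambda/2}$ such that $\|\theta'-\theta\|\le\lambda$. Also, by the Lipschitz Condition 4, $|m_\theta(Y_t,X_t,A_t)-m_{\theta'}(Y_t,X_t,A_t)|\le g(Y_t,X_t,A_t)\|\theta-\theta'\|\le g(Y_t,X_t,A_t)\lambda$. Thus we have that

$$m_{\theta'}(Y_t,X_t,A_t)-g(Y_t,X_t,A_t)\lambda\le m_\theta(Y_t,X_t,A_t)\le m_{\theta'}(Y_t,X_t,A_t)+g(Y_t,X_t,A_t)\lambda.$$

Note that

$$\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[m_{\theta'}(Y_t,X_t,A_t)+g(Y_t,X_t,A_t)\lambda-\left\{m_{\theta'}(Y_t,X_t,A_t)-g(Y_t,X_t,A_t)\lambda\right\}\right]=2\lambda\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)\right]\le2\lambda\sqrt{m_g}<\infty.$$

The inequalities above hold by Equation (26) and since $\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)\right]\le\sqrt{\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)^2\right]}$ by Jensen’s inequality. (a) above holds for our choice of $B_\delta$ by letting meshwidth $\lambda=\delta/(2\sqrt{m_g})$.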

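We now show that (c) above holds. Note that

$$\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\left\{m_\theta(Y_t,X_t,A_t)+g(Y_t,X_t,A_t)\right\}^2\right]\le3\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[m_\theta(Y_t,X_t,A_t)^2\right]+3\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)^2\right].\tag{27}$$

Note that the above upper bound, Equation (27), also holds for $\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\left\{m_\theta(Y_t,X_t,A_t)-g(Y_t,X_t,A_t)\right\}^2\right]$.

Since $m_\theta(Y_t,X_t,A_t)=m_\theta(Y_t,X_t,A_t)-m_{\theta^*(P)}(Y_t,X_t,A_t)+m_{\theta^*(P)}(Y_t,X_t,A_t)$, the right-hand side of Equation (27) is bounded above by

$$9\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\left\{m_\theta(Y_t,X_t,A_t)-m_{\theta^*(P)}(Y_t,X_t,A_t)\right\}^2\right]+9\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[m_{\theta^*(P)}(Y_t,X_t,A_t)^2\right]+3\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)^2\right].$$

Note that $\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[m_{\theta^*(P)}(Y_t,X_t,A_t)^2\right]$ is bounded by our moment Condition 5 and that $\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)^2\right]$ is bounded by Equation (26).

By our Lipschitz Condition 4, for any $\theta\in\Theta$, $|m_\theta(Y_t,X_t,A_t)-m_{\theta^*(P)}(Y_t,X_t,A_t)|\le g(Y_t,X_t,A_t)\|\theta-\theta^*(P)\|$. Thus,

$$\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\left\{m_\theta(Y_t,X_t,A_t)-m_{\theta^*(P)}(Y_t,X_t,A_t)\right\}^2\right]\le\sup_{P\in\mathcal{P},t\ge1}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[g(Y_t,X_t,A_t)^2\right]\|\theta-\theta^*(P)\|^2.$$

The above is bounded by Equation (26) and by compactness of $\Theta$, Condition 3. Thus (c) above holds for our choice of $B_\delta$.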

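Main Argument: We now show that for any $\epsilon>0$,

$$\sup_{P\in\mathcal{P}}\mathbb{P}_{P,\pi}\left(\sup_{\theta\in\Theta}\left\{\frac{1}{T}\sum_{t=1}^{T}\big(W_tm_{\theta,t}-\mathbb{E}_{P,\pi}\left[W_tm_{\theta,t}\mid\mathcal{H}_{t-1}\right]\big)\right\}>\epsilon\right)\to0.\tag{28}$$

An analogous argument can be made to show that $\sup_{P\in\mathcal{P}}\mathbb{P}_{P,\pi}\left(\sup_{\theta\in\Theta}\left\{-\frac{1}{T}\sum_{t=1}^{T}\big(W_tm_{\theta,t}-\mathbb{E}_{P,\pi}\left[W_tm_{\theta,t}\mid\mathcal{H}_{t-1}\right]\big)\right\}>\epsilon\right)\to0$.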

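Let $\delta>0$; we will choose $\delta$ later. Let $B_\delta$ be the set of pairs of functions as constructed earlier.

$$\sup_{\theta\in\Theta}\left\{\frac{1}{T}\sum_{t=1}^{T}\big(W_tm_{\theta,t}-\mathbb{E}_{P,\pi}\left[W_tm_{\theta,t}\mid\mathcal{H}_{t-1}\right]\big)\right\}$$

Note that by (a), we get the following upper bound:

$$\le\max_{(l,u)\in B_\delta}\left\{\frac{1}{T}\sum_{t=1}^{T}\big(W_tu(Y_t,X_t,A_t)-\mathbb{E}_{P,\pi}\left[W_tl(Y_t,X_t,A_t)\mid\mathcal{H}_{t-1}\right]\big)\right\}.$$

By adding and subtracting $\mathbb{E}_{P,\pi}\left[W_tu(Y_t,X_t,A_t)\mid\mathcal{H}_{t-1}\right]$ and the triangle inequality,

$$\le\max_{(l,u)\in B_\delta}\left\{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi}\left[W_t\left\{u(Y_t,X_t,A_t)-l(Y_t,X_t,A_t)\right\}\mid\mathcal{H}_{t-1}\right]\right\}+\max_{(l,u)\in B_\delta}\left\{\left|\frac{1}{T}\sum_{t=1}^{T}\big(W_tu(Y_t,X_t,A_t)-\mathbb{E}_{P,\pi}\left[W_tu(Y_t,X_t,A_t)\mid\mathcal{H}_{t-1}\right]\big)\right|\right\}.$$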

Note that by Condition 9, Wt=πtsta(At,Xt)πt(At,Xt,Ht1)ρmax with probability 1, so EP,π[Wt{u(Yt,Xt,At)l(Yt,Xt,At)}Ht1]1ρmaxEP,π[Wt2{u(Yt,Xt,At)l(Yt,Xt,At)}Ht1]=1ρmaxEP,πtsta[u(Yt,Xt,At)l(Yt,Xt,At)]1ρmaxδ; the last equality holds by Condition 1 and the last inequality holds by (a). And since maxi[1:n]{ai}i=1nai,

1ρmaxδ+(l,u)Bδ1Tt=1TWtu(Yt,Xt,At)EP,π[Wtu(Yt,Xt,At)Ht1]

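By Lemma 1 and (c), for any $(l,u)\in B_\delta$, $\frac{1}{T}\sum_{t=1}^{T}\big(W_tu(Y_t,X_t,A_t)-\mathbb{E}_{P,\pi}\left[W_tu(Y_t,X_t,A_t)\mid\mathcal{H}_{t-1}\right]\big)=o_{P\in\mathcal{P}}(1)$. Since $|B_\delta|<\infty$ by (b), the convergence holds for all $(l,u)\in B_\delta$ simultaneously, so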

=1ρmaxδ+oPP(1).

Equation (28) holds by choosing δ=ρmaxϵ2.

B.5. Least-Squares Estimator

We use $\phi(X_t,A_t)$ to denote a feature vector that is constructed using context $X_t$ and action $A_t$.

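Condition 10 (Linear Expected Outcome). For all $P\in\mathcal{P}$, the following holds w.p. 1,

$$\mathbb{E}_{P}\left[Y_t\mid X_t,A_t\right]=\phi(X_t,A_t)^\top\theta^*(P).$$

Condition 11 (Moment Conditions for Least Squares). The fourth moments of $\phi(X_t,A_t)\left(Y_t-\phi(X_t,A_t)^\top\theta^*(P)\right)$ and $\phi(X_t,A_t)$ with respect to $P$ and policy $\pi_t^{\mathrm{sta}}$ are respectively bounded uniformly over $P\in\mathcal{P}$ and $t\ge1$.

Also, the minimum eigenvalues of $\Sigma_T(P)=\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\phi(X_t,A_t)^{\otimes2}\left(Y_t-\phi(X_t,A_t)^\top\theta^*(P)\right)^2\right]$ and $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\phi(X_t,A_t)^{\otimes2}\right]$ are both bounded below by some constant greater than zero for all $P\in\mathcal{P}$.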

Condition 12 (Importance Ratios for Least Squares). Let $\rho_{\min}>0$ and let $\rho_{\max,T}>0$ be a nonrandom sequence such that $\frac{\rho_{\max,T}}{T}\to0$. $\{\pi_t^{\mathrm{sta}}\}_{t=1}^{T}$ are pre-specified and do not depend on data $\{Y_t,X_t,A_t\}_{t=1}^{T}$. For all $P\in\mathcal{P}$, the following holds w.p. 1,

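$$\rho_{\min}\le\frac{\pi_t^{\mathrm{sta}}(A_t,X_t)}{\pi_t(A_t,X_t,\mathcal{H}_{t-1})}\le\rho_{\max,T}.$$

Note that Condition 12 allows $\pi_t(A_t,X_t,\mathcal{H}_{t-1})$ to go to zero at some rate for stabilizing policies $\{\pi_t^{\mathrm{sta}}\}_{t\ge1}$ that are strictly bounded away from 0 and 1.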

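We now define the AW-LS estimator for $\theta^*(P)\in\mathbb{R}^d$:

$$\hat\theta_T^{\text{AW-LS}}\triangleq\operatorname*{argmax}_{\theta\in\mathbb{R}^d}\left\{-\sum_{t=1}^{T}W_t\left(Y_t-\phi(X_t,A_t)^\top\theta\right)^2\right\}.\tag{29}$$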
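To make the estimator concrete, here is a minimal sketch of AW-LS in the special case of one-hot action features, where the weighted normal equations decouple and each coordinate is a weighted mean of the rewards for that arm, with square-root weights $W_t=\sqrt{\pi_t^{\mathrm{sta}}(A_t)/\pi_t(A_t,\mathcal{H}_{t-1})}$. The function name, argument layout, and toy data are assumptions made for illustration; a general feature map would require solving a weighted linear system instead.

```python
import math

def aw_ls_one_hot(actions, rewards, pi_t, pi_sta, n_arms):
    """AW-LS with one-hot action features: a W_t-weighted mean per arm.

    pi_t[t] and pi_sta[t] are the probabilities that the adaptive policy and
    the stabilizing policy assign to the action actually taken at time t.
    """
    num = [0.0] * n_arms
    den = [0.0] * n_arms
    for a, y, p, s in zip(actions, rewards, pi_t, pi_sta):
        w = math.sqrt(s / p)  # square-root importance weight W_t
        num[a] += w * y
        den[a] += w
    # Arms that were never selected have no estimate.
    return [n / d if d > 0 else float("nan") for n, d in zip(num, den)]

# Toy data (hypothetical values, for illustration only).
actions = [0, 1, 1, 0]
rewards = [1.0, 2.0, 4.0, 3.0]
pi_t = [0.5, 0.5, 0.8, 0.2]
pi_sta = [0.5, 0.5, 0.5, 0.5]

theta_hat = aw_ls_one_hot(actions, rewards, pi_t, pi_sta, n_arms=2)
print(theta_hat)
```

When the adaptive and stabilizing probabilities coincide, every weight equals one and the sketch reduces to the ordinary per-arm sample mean.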

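Theorem 3 (Consistency and Asymptotic Normality of Adaptively-Weighted Least Squares Estimator). Under Conditions 1, 10, 11, and 12,

$$\Sigma_T(P)^{-1/2}\left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)^{\otimes2}\right)\left(\hat\theta_T^{\text{AW-LS}}-\theta^*(P)\right)\overset{D}{\to}\mathcal{N}(0,I_d)\quad\text{uniformly over }P\in\mathcal{P},$$

where $\Sigma_T(P)\triangleq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\phi(X_t,A_t)^{\otimes2}\left(Y_t-\phi(X_t,A_t)^\top\theta^*(P)\right)^2\right]$.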

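Proof: By taking the derivative of Equation (29) with respect to the parameters, we have that

$$0=\sum_{t=1}^{T}W_t\phi(X_t,A_t)\left(Y_t-\phi(X_t,A_t)^\top\hat\theta_T^{\text{AW-LS}}\right).$$

By rearranging terms, we have that

$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)\left(Y_t-\phi(X_t,A_t)^\top\theta^*(P)\right)=\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)^{\otimes2}\left(\hat\theta_T^{\text{AW-LS}}-\theta^*(P)\right).\tag{30}$$

We first show that the following holds:

$$\Sigma_T(P)^{-1/2}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)\left(Y_t-\phi(X_t,A_t)^\top\theta^*(P)\right)\overset{D}{\to}\mathcal{N}(0,I_d)\quad\text{uniformly over }P\in\mathcal{P}.\tag{31}$$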

Equation (31) holds by a similar argument as that used in Section B.3.2, for $\dot{m}_{\theta^*(P)}(Y_t,X_t,A_t)=\phi(X_t,A_t)\left(Y_t-\phi(X_t,A_t)^\top\theta^*(P)\right)$, by showing that the conditions of Theorem 2 hold. It can be checked that all the arguments hold even when we allow $\rho_{\max,T}$ to grow at a rate such that $\frac{\rho_{\max,T}}{T}\to0$.

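By Equations (30) and (31),

$$\Sigma_T(P)^{-1/2}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)^{\otimes2}\left(\hat\theta_T^{\text{AW-LS}}-\theta^*(P)\right)\overset{D}{\to}\mathcal{N}(0,I_d)\quad\text{uniformly over }P\in\mathcal{P}.\tag{32}$$

By Equation (32), to ensure that $\hat\theta_T^{\text{AW-LS}}\overset{P}{\to}\theta^*(P)$ uniformly over $P\in\mathcal{P}$, it is sufficient to show that the minimum eigenvalue of $\Sigma_T(P)^{-1/2}\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)^{\otimes2}$ goes to infinity uniformly over $P\in\mathcal{P}$ as $T\to\infty$.

By Condition 11, the maximum eigenvalue of $\Sigma_T(P)$ is bounded uniformly over $P\in\mathcal{P}$, so the minimum eigenvalue of $\Sigma_T(P)^{-1/2}$ is bounded uniformly above 0. Thus it is sufficient to show that the minimum eigenvalue of $\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)^{\otimes2}$ goes to infinity uniformly over $P\in\mathcal{P}$ as $T\to\infty$.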

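Note that by Lemma 1 and Condition 11,

$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left(W_t\phi(X_t,A_t)^{\otimes2}-\mathbb{E}_{P,\pi}\left[W_t\phi(X_t,A_t)^{\otimes2}\mid\mathcal{H}_{t-1}\right]\right)=O_{P\in\mathcal{P}}(1).\tag{33}$$

Note that by the law of iterated expectations,

$$\mathbb{E}_{P,\pi}\left[W_t\phi(X_t,A_t)^{\otimes2}\mid\mathcal{H}_{t-1}\right]=\mathbb{E}_{P}\left[\int_{a\in\mathcal{A}}\pi_t(a,X_t,\mathcal{H}_{t-1})\,\mathbb{E}_{P}\left[W_t\phi(X_t,A_t)^{\otimes2}\mid\mathcal{H}_{t-1},X_t,a\right]da\,\middle|\,\mathcal{H}_{t-1}\right].$$

By Condition 1 and since $W_t=\sqrt{\frac{\pi_t^{\mathrm{sta}}(A_t,X_t)}{\pi_t(A_t,X_t,\mathcal{H}_{t-1})}}$,

$$=\mathbb{E}_{P}\left[\int_{a\in\mathcal{A}}\sqrt{\frac{\pi_t(a,X_t,\mathcal{H}_{t-1})}{\pi_t^{\mathrm{sta}}(a,X_t)}}\,\pi_t^{\mathrm{sta}}(a,X_t)\,\mathbb{E}_{P}\left[\phi(X_t,A_t)^{\otimes2}\mid X_t,a\right]da\,\middle|\,\mathcal{H}_{t-1}\right]$$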

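Since by Condition 12, $\sqrt{\frac{\pi_t(a,X_t,\mathcal{H}_{t-1})}{\pi_t^{\mathrm{sta}}(a,X_t)}}\ge\frac{1}{\sqrt{\rho_{\max,T}}}$ and $\phi(X_t,A_t)^{\otimes2}\succeq0$,

$$\succeq\frac{1}{\sqrt{\rho_{\max,T}}}\mathbb{E}_{P}\left[\int_{a\in\mathcal{A}}\pi_t^{\mathrm{sta}}(a,X_t)\,\mathbb{E}_{P}\left[\phi(X_t,A_t)^{\otimes2}\mid X_t,a\right]da\,\middle|\,\mathcal{H}_{t-1}\right].$$

Since $\pi_t^{\mathrm{sta}}$ are pre-specified and since by our i.i.d. potential outcomes assumption (Condition 1) $X_t$ do not depend on $\mathcal{H}_{t-1}$,

$$=\frac{1}{\sqrt{\rho_{\max,T}}}\mathbb{E}_{P}\left[\int_{a\in\mathcal{A}}\pi_t^{\mathrm{sta}}(a,X_t)\,\mathbb{E}_{P}\left[\phi(X_t,A_t)^{\otimes2}\mid X_t,a\right]da\right].$$

By the law of iterated expectations,

$$=\frac{1}{\sqrt{\rho_{\max,T}}}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\phi(X_t,A_t)^{\otimes2}\right].$$

The above result and Equation (33) imply that

$$\frac{1}{\sqrt{T}}\sum_{t=1}^{T}W_t\phi(X_t,A_t)^{\otimes2}\succeq O_{P\in\mathcal{P}}(1)+\sqrt{\frac{T}{\rho_{\max,T}}}\,\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\phi(X_t,A_t)^{\otimes2}\right].\tag{34}$$

By Condition 11, the minimum eigenvalue of $\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\phi(X_t,A_t)^{\otimes2}\right]$ is bounded below by some constant greater than zero for all $P\in\mathcal{P}$. By Condition 12, $\sqrt{\frac{T}{\rho_{\max,T}}}\to\infty$. Thus by Equation (32) and Equation (34), we have that $\hat\theta_T^{\text{AW-LS}}\overset{P}{\to}\theta^*(P)$ uniformly over $P\in\mathcal{P}$.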

C. Choice of Stabilizing Policy

C.1. Optimal Stabilizing Policy in Multi-Arm Bandit Setting

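Here we consider the multi-armed bandit setting where $\mathbb{E}_{P}[Y_t(a)]=\theta_a^*(P)$ and $\mathrm{Var}_{P}(Y_t(a))=\sigma^2$. We consider the adaptively-weighted least-squares estimator where $m_\theta(Y_t,A_t)=-\mathbb{1}_{A_t=a}\left(Y_t-\theta_a^*(P)\right)^2$. By Theorem 1, we have that

$$\left(\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\mathbb{1}_{A_t=a}\left(Y_t-\theta_a^*(P)\right)^2\right]\right)^{-1/2}\left(\frac{1}{T}\sum_{t=1}^{T}W_t\mathbb{1}_{A_t=a}\right)\sqrt{T}\left(\hat\theta_{T,a}^{\text{AW-LS}}-\theta_a^*(P)\right)\overset{D}{\to}\mathcal{N}(0,1).$$

While the asymptotic variance of $\sqrt{T}\left(\hat\theta_{T,a}^{\text{AW-LS}}-\theta_a^*(P)\right)$ does not necessarily concentrate, we can examine the following:

$$\left(\frac{1}{T}\sum_{t=1}^{T}W_t\mathbb{1}_{A_t=a}\right)^{-1}\left(\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{P,\pi_t^{\mathrm{sta}}}\left[\mathbb{1}_{A_t=a}\left(Y_t-\theta_a^*(P)\right)^2\right]\right)\left(\frac{1}{T}\sum_{t=1}^{T}W_t\mathbb{1}_{A_t=a}\right)^{-1}$$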

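By Lemma 1, we have that $\frac{1}{T}\sum_{t=1}^{T}\left(W_t\mathbb{1}_{A_t=a}-\sqrt{\pi_t^{\mathrm{sta}}(a)\,\pi_t(a,\mathcal{H}_{t-1})}\right)\overset{P}{\to}0$. Thus we have

$$=\left(\frac{1}{T}\sum_{t=1}^{T}\pi_t^{\mathrm{sta}}(a)\,\sigma^2\right)\left(o_p(1)+\frac{1}{T}\sum_{t=1}^{T}\sqrt{\pi_t^{\mathrm{sta}}(a)\,\pi_t(a,\mathcal{H}_{t-1})}\right)^{-2}.$$

As long as $\pi_t^{\mathrm{sta}}(a)$ and $\pi_t(a,\mathcal{H}_{t-1})$ are bounded away from zero w.p. 1, the $o_p(1)$ term is asymptotically negligible and we can just consider $\left(\frac{1}{T}\sum_{t=1}^{T}\pi_t^{\mathrm{sta}}(a)\,\sigma^2\right)\left(\frac{1}{T}\sum_{t=1}^{T}\sqrt{\pi_t^{\mathrm{sta}}(a)\,\pi_t(a,\mathcal{H}_{t-1})}\right)^{-2}$.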

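By the Cauchy-Schwarz inequality,

$$\left(\frac{1}{T}\sum_{t=1}^{T}\sqrt{\pi_t^{\mathrm{sta}}(a)\,\pi_t(a,\mathcal{H}_{t-1})}\right)^2\le\left(\frac{1}{T}\sum_{t=1}^{T}\pi_t^{\mathrm{sta}}(a)\right)\left(\frac{1}{T}\sum_{t=1}^{T}\pi_t(a,\mathcal{H}_{t-1})\right).$$

Thus, $\frac{1}{\frac{1}{T}\sum_{t=1}^{T}\pi_t(a,\mathcal{H}_{t-1})}\le\frac{\frac{1}{T}\sum_{t=1}^{T}\pi_t^{\mathrm{sta}}(a)}{\left(\frac{1}{T}\sum_{t=1}^{T}\sqrt{\pi_t^{\mathrm{sta}}(a)\,\pi_t(a,\mathcal{H}_{t-1})}\right)^2}$, so

$$\left(\frac{1}{T}\sum_{t=1}^{T}\pi_t^{\mathrm{sta}}(a)\right)\left(\frac{1}{T}\sum_{t=1}^{T}\sqrt{\pi_t(a,\mathcal{H}_{t-1})\,\pi_t^{\mathrm{sta}}(a)}\right)^{-2}\ge\frac{1}{\frac{1}{T}\sum_{t=1}^{T}\pi_t(a,\mathcal{H}_{t-1})}.$$

Note that this lower bound is achieved when $\pi_t^{\mathrm{sta}}(a)=\pi_t(a,\mathcal{H}_{t-1})$. However, since $\pi_t$ is a function of $\mathcal{H}_{t-1}$ and the stabilizing policies $\{\pi_t^{\mathrm{sta}}\}_{t=1}^{T}$ are pre-specified, setting $\pi_t^{\mathrm{sta}}(a)=\pi_t(a,\mathcal{H}_{t-1})$ is generally infeasible. Thus we want to choose $\pi_t^{\mathrm{sta}}$ to be as close to $\pi_t$ as possible, subject to the constraint that the stabilizing policies are pre-specified, i.e., not a function of the data $\{Y_t,X_t,A_t\}_{t\ge1}$.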
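The Cauchy-Schwarz bound above can be checked numerically. The sketch below evaluates the variance factor for one arm with the common factor $\sigma^2$ dropped, once for a uniform stabilizing policy and once for a stabilizing policy that matches the adaptive probabilities. The probability sequence is a hypothetical fixed path chosen for illustration, and the function names are not from the paper.

```python
import math

def variance_factor(pi_sta, pi_t):
    """(mean of pi_sta) / (mean of sqrt(pi_sta * pi_t))^2 for one arm."""
    T = len(pi_t)
    mean_sta = sum(pi_sta) / T
    mean_geo = sum(math.sqrt(s * p) for s, p in zip(pi_sta, pi_t)) / T
    return mean_sta / mean_geo ** 2

def lower_bound(pi_t):
    """Cauchy-Schwarz lower bound: 1 / (mean of pi_t)."""
    return 1.0 / (sum(pi_t) / len(pi_t))

# Hypothetical action probabilities for arm a along one realized history.
pi_t = [0.5, 0.7, 0.9, 0.95]
uniform = [0.5] * len(pi_t)   # uniform stabilizing policy
matched = list(pi_t)          # infeasible oracle choice pi_sta = pi_t

print(variance_factor(uniform, pi_t), variance_factor(matched, pi_t), lower_bound(pi_t))
```

The matched choice attains the bound exactly, while the uniform choice sits above it; the gap shrinks as the stabilizing policy moves closer to the realized action probabilities.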

C.2. Approximating the Optimal Stabilizing Policy

One way to approximately choose the optimal evaluation policy is to select $\pi_t^{\mathrm{sta}}(a,x)=\mathbb{E}_{P,\pi}\left[\pi_t(a,x,\mathcal{H}_{t-1})\right]$. Note that $\mathbb{E}_{P,\pi}\left[\pi_t(a,x,\mathcal{H}_{t-1})\right]$ depends on $P$, which is unknown. Thus it is natural to choose $\pi_t^{\mathrm{sta}}(a,x)$ to be $\mathbb{E}_{P,\pi}\left[\pi_t(a,x,\mathcal{H}_{t-1})\right]$ weighted by a prior on $P$. Note that as long as the evaluation policy ensures that the weights $W_t$ are bounded, the choice of evaluation policy does not affect the asymptotic validity of the estimator.
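As a rough illustration of how the expected action-selection probability could be approximated for a fixed candidate $P$, the sketch below averages the clipped Thompson Sampling probability of arm 1 over simulated histories in a two-armed bandit. It assumes standard normal priors, unit-variance normal rewards, and independent Gaussian posteriors; the function names, the number of replications, and the seed are illustrative choices, not the procedure used for the paper's figures.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_ts_probs(theta, T, n_rep, clip=0.01, seed=0):
    """Monte Carlo estimate of E[pi_t(1, H_{t-1})] for t = 1..T.

    theta: true mean rewards [theta_0, theta_1] under the assumed P.
    """
    rng = random.Random(seed)
    totals = [0.0] * T
    for _ in range(n_rep):
        n = [0, 0]        # pull counts per arm
        s = [0.0, 0.0]    # reward sums per arm
        for t in range(T):
            # Posterior N(mu_a, var_a) under an N(0, 1) prior and unit noise.
            mu = [s[a] / (n[a] + 1) for a in (0, 1)]
            var = [1.0 / (n[a] + 1) for a in (0, 1)]
            p1 = normal_cdf((mu[1] - mu[0]) / math.sqrt(var[0] + var[1]))
            p1 = min(1.0 - clip, max(clip, p1))  # clipping
            totals[t] += p1
            a = 1 if rng.random() < p1 else 0
            y = rng.gauss(theta[a], 1.0)
            n[a] += 1
            s[a] += y
    return [x / n_rep for x in totals]

probs = expected_ts_probs([0.0, 1.0], T=20, n_rep=200)
print(probs[0], probs[-1])
```

Averaging such estimates over draws of `theta` from a prior would give one version of the prior-weighted choice described above.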

In Figure 6, we display the difference in mean squared error for the AW-LS estimator in a two-armed bandit setting for two different choices of evaluation policy: (1) the uniform evaluation policy, which selects actions uniformly from $\mathcal{A}$, and (2) the expected $\pi_t(a,\mathcal{H}_{t-1})$ evaluation policy, for which $\pi_t^{\mathrm{sta}}(a)=\mathbb{E}_{P,\pi}\left[\pi_t(a,\mathcal{H}_{t-1})\right]$. We can see in this setting that by setting $\pi_t^{\mathrm{sta}}(a)=\mathbb{E}_{P,\pi}\left[\pi_t(a,\mathcal{H}_{t-1})\right]$ we are able to decrease the mean squared error of the AW-LS estimator compared to AW-LS with the uniform evaluation policy. Note though that in some cases setting $\pi_t^{\mathrm{sta}}(a)=\mathbb{E}_{P,\pi}\left[\pi_t(a,\mathcal{H}_{t-1})\right]$ is equivalent to choosing the uniform evaluation policy. For example, consider a two-armed bandit with identical arms: under common bandit algorithms, $\mathbb{E}_{P,\pi}\left[\pi_t(a,\mathcal{H}_{t-1})\right]=0.5$ for all $t\in[1\colon T]$, which makes the evaluation policy $\pi_t^{\mathrm{sta}}(a)=\mathbb{E}_{P,\pi}\left[\pi_t(a,\mathcal{H}_{t-1})\right]$ equivalent to the uniform policy.

Figure 6:

Above we plot the mean squared errors for the adaptively-weighted least squares estimator with evaluation policies: (1) uniform evaluation policy, which selects actions uniformly from $\mathcal{A}$, and (2) expected $\pi_t(a,\mathcal{H}_{t-1})$ evaluation policy, for which $\pi_t^{\mathrm{sta}}(a)=\mathbb{E}_{P,\pi}\left[\pi_t(a)\right]$ (oracle quantity). In a two-arm bandit setting we perform Thompson Sampling with standard normal priors, 0.01 clipping, $\theta^*(P)=[\theta_0^*(P),\theta_1^*(P)]=[0,1]$, standard normal errors, and $T=1000$. Error bars denote standard errors computed over 5,000 Monte Carlo simulations.

D. Need for Uniformly Valid Inference on Data Collected with Bandit Algorithms

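Here we consider the two-armed bandit setting where $\mathbb{E}_{P}[R_t(a)]=\theta_a^*(P)$, $\mathrm{Var}_{P}(R_t(a))=\sigma^2$, and $\mathbb{E}_{P}\left[R_t(a)^4\right]<c<\infty$ for $a\in\{0,1\}$. The unweighted least squares estimator is asymptotically normal on adaptively collected data under the following condition of Lai and Wei [1982]: there exists a non-random sequence $\{b_t\}_{t\ge1}$ such that

$$b_T\cdot\sum_{t=1}^{T}A_t\overset{P}{\to}1.\tag{35}$$

Specifically, by Theorem 3 of Lai and Wei [1982], under (35),

$$\sqrt{\sum_{t=1}^{T}A_t}\left(\hat\theta_{T,1}^{\text{OLS}}-\theta_1^*(P)\right)=\frac{\sum_{t=1}^{T}A_t\left(R_t-\theta_1^*(P)\right)}{\sqrt{\sum_{t=1}^{T}A_t}}\overset{D}{\to}\mathcal{N}(0,\sigma^2).$$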

However, as discussed in Deshpande et al. [2018] and Zhang et al. [2020], (35) can fail to hold for common bandit algorithms when there is no unique optimal policy, i.e., when $\theta_0^*(P)-\theta_1^*(P)=0$. For example, in Figure 7 we plot $\frac{1}{T}\sum_{t=1}^{T}A_t$ for Thompson Sampling and $\epsilon$-greedy for a bandit with two identical arms.

Figure 7:

Above we plot empirical allocations, $\frac{1}{T}\sum_{t=1}^{T}A_t$, under both Thompson Sampling (standard normal priors, 0.01 clipping) and $\epsilon$-greedy ($\epsilon=0.1$) under zero margin $\theta_0^*(P)=\theta_1^*(P)=0$. For our simulations $T=100$, errors are standard normal, and we use 50k Monte Carlo repetitions.

In order to construct reliable confidence intervals using asymptotic approximations, it is crucial that estimators converge uniformly in distribution. To illustrate the importance of uniformity, consider the following example. We can modify Thompson Sampling to ensure that $\frac{1}{T}\sum_{t=1}^{T}A_t\overset{P}{\to}0.5$ when $\theta_1^*(P)-\theta_0^*(P)=0$. For example, we could do this by using an algorithm we call Thompson Sampling Hodges (inspired by the Hodges estimator; see Van der Vaart [2000, Page 109]), defined below:

$$\pi_t(1,\mathcal{H}_{t-1})=\mathbb{P}\left(\tilde\theta_1>\tilde\theta_0\mid\mathcal{H}_{t-1}\right)\mathbb{1}_{|\mu_{1,t}-\mu_{0,t}|>t^{-4}}+0.5\cdot\mathbb{1}_{|\mu_{1,t}-\mu_{0,t}|\le t^{-4}}$$

Under standard Thompson Sampling, arm one is chosen according to the posterior probability that it is optimal, so $\pi_t(1,\mathcal{H}_{t-1})=\mathbb{P}\left(\tilde\theta_1>\tilde\theta_0\mid\mathcal{H}_{t-1}\right)$. Above, $\mu_{a,t}$ denotes the posterior mean for the mean reward for arm $a$ at time $t$. Under TS-Hodges, if the difference between the posterior means, $|\mu_{1,t}-\mu_{0,t}|$, is less than $t^{-4}$, $\pi_t$ is set to 0.5. Additionally, we clip the action selection probabilities to bound them strictly away from 0 and 1 for some constant $\pi_{\min}$ in the following sense: $\mathrm{clip}(\pi_t)=(1-\pi_{\min})\wedge(\pi_t\vee\pi_{\min})$. Under TS-Hodges with clipping, we can show that

$$\frac{1}{T}\sum_{t=1}^{T}A_t\overset{P}{\to}\begin{cases}1-\pi_{\min}&\text{if }\theta_1^*(P)-\theta_0^*(P)>0\\\pi_{\min}&\text{if }\theta_1^*(P)-\theta_0^*(P)<0\\0.5&\text{if }\theta_1^*(P)-\theta_0^*(P)=0\end{cases}\tag{36}$$

By Equation (36), we satisfy (35) pointwise for every fixed $P$ and we have that the OLS estimator is asymptotically normal pointwise [Lai and Wei, 1982]. However, Equation (36) fails to hold uniformly over $P\in\mathcal{P}$. Specifically, it fails to hold for any sequence $\{P_t\}_{t=1}^{\infty}$ such that $\theta_1^*(P_t)-\theta_0^*(P_t)=t^{-4}$. In Figure 8, we show that confidence intervals constructed using normal approximations fail to provide reliable confidence intervals, even for very large sample sizes, for the worst case values of $\theta_1^*(P)-\theta_0^*(P)$.
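The TS-Hodges action-selection rule with clipping can be sketched as follows. This is a minimal illustration that assumes independent Gaussian posteriors for the two arm means, so that the posterior probability that arm 1 is better has a closed form; the function names and the posterior parameterization are assumptions, not the simulation code behind Figure 8.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ts_hodges_prob(mu1, mu0, var1, var0, t, pi_min=0.01):
    """Probability of selecting arm 1 at time t under TS-Hodges with clipping.

    mu_a, var_a: posterior mean and variance for arm a (assumed Gaussian).
    """
    if abs(mu1 - mu0) > t ** -4:
        # Standard Thompson Sampling: posterior probability arm 1 is better.
        p = normal_cdf((mu1 - mu0) / math.sqrt(var1 + var0))
    else:
        # Hodges-style override when posterior means are within t^{-4}.
        p = 0.5
    # clip(p) = (1 - pi_min) ∧ (p ∨ pi_min)
    return min(1.0 - pi_min, max(p, pi_min))

print(ts_hodges_prob(0.00001, 0.0, 1.0, 1.0, t=10))  # gap below 10^{-4}
print(ts_hodges_prob(5.0, 0.0, 0.01, 0.01, t=10))    # clipped at 1 - pi_min
```

Because the threshold $t^{-4}$ shrinks so quickly, the override only triggers when the posterior means are nearly identical, which is what makes the resulting allocation converge pointwise but not uniformly.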

Figure 8:

Above we construct confidence intervals for $\theta_1^*(P)-\theta_0^*(P)$ using a normal approximation for the OLS estimator. We compare independent sampling ($\pi_t=0.5$) and TS-Hodges, both with standard normal priors, 0.01 clipping, standard normal errors, and $T=10{,}000$. We vary the value of $\theta_1^*(P)-\theta_0^*(P)$ in the simulations to demonstrate the non-uniformity of the confidence intervals.

E. Discussion of Chen et al. [2020]

Here we show formally that Theorem 3.1 in Chen et al. [2020], which proves that the OLS estimator is asymptotically normal on data collected with an ϵ-greedy algorithm, does not cover the case in which there is no unique optimal policy.

They assume that for rewards Rt, context vectors Xt, and binary actions At ∈ {0, 1},

$$\mathbb{E}\left[R_t\mid X_t,A_t\right]=A_tX_t^\top\beta_1+(1-A_t)X_t^\top\beta_0.$$

They define $\beta\triangleq\beta_1-\beta_0$.

Specifically, at part 1(b) of their proof on page 4 of the supplementary material, they claim that $g(\hat\beta_t,\epsilon)\overset{P}{\to}g(\beta,\epsilon)$, where $\hat\beta_t$ is the OLS estimator for $\beta\triangleq\beta_1-\beta_0$ and $g$ is defined as follows:

$$g(\beta_0,\beta_1,\epsilon)=\frac{\epsilon}{2}\int v^\top xx^\top v\,dP_x+(1-\epsilon)\int\mathbb{1}_{\beta^\top x\ge0}\,v^\top xx^\top v\,dP_x$$

Above, $v\in\mathbb{R}^d$ is an arbitrary fixed vector and $x\in\mathbb{R}^d$ are the context vectors. $P_x$ is the distribution of the context vectors $X_t$.

Specifically, they claim that $g(\hat\beta_t,\epsilon)\overset{P}{\to}g(\beta,\epsilon)$ because $\hat\beta_t\overset{P}{\to}\beta$ (Corollary 3.1) and by the continuous mapping theorem.

Recall the continuous mapping theorem for convergence in probability [Van der Vaart, 2000, Theorem 2.3]:

Theorem 4 (Continuous Mapping Theorem). Let $g:\mathbb{R}^k\to\mathbb{R}^m$ be continuous at every point of a set $C$ such that $\mathbb{P}(X\in C)=1$. If $X_n\overset{P}{\to}X$, then $g(X_n)\overset{P}{\to}g(X)$.

Note that $g$ is not continuous in $\beta$ at the value $\beta=0\in\mathbb{R}^d$; this is due to the indicator term $\mathbb{1}_{\beta^\top x\ge0}$. Thus, the standard continuous mapping theorem cannot be applied in this setting. Note that the case $0=\beta=\beta_1-\beta_0$ is exactly when there is no unique optimal policy. This means that Theorem 3.1 in Chen et al. [2020] does not cover the setting in which there is no unique optimal policy.
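The discontinuity at $\beta=0$ can be seen in a one-dimensional toy case. The sketch below assumes $d=1$, $v=1$, and a context distribution that puts equal mass on $-1$ and $+1$; these choices, the function name, and the use of a finite average in place of the integral are all illustrative assumptions.

```python
def g(beta, eps, xs, v=1.0):
    """One-dimensional version of g with contexts uniform over the list xs."""
    n = len(xs)
    explore = sum((v * x) ** 2 for x in xs) / n                    # integral of v'xx'v
    exploit = sum((v * x) ** 2 for x in xs if beta * x >= 0) / n   # indicator-restricted integral
    return eps / 2 * explore + (1 - eps) * exploit

xs = [-1.0, 1.0]   # hypothetical context support
eps = 0.1

along_sequence = [g(1.0 / n, eps, xs) for n in (1, 10, 100, 1000)]
at_zero = g(0.0, eps, xs)
print(along_sequence, at_zero)
```

Along the sequence $\beta_n=1/n\to0$ the value of `g` stays fixed, yet it differs from the value at $\beta=0$, where the indicator equals one for every context; this jump is what rules out a direct appeal to the continuous mapping theorem.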

Contributor Information

Kelly W. Zhang, Department of Computer Science, Harvard University

Lucas Janson, Departments of Statistics, Harvard University.

Susan A. Murphy, Departments of Statistics and Computer Science, Harvard University

References

  1. Abbasi-Yadkori Yasin, Pál Dávid, and Szepesvári Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  2. Agrawal Shipra and Goyal Navin. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
  3. Agresti Alan. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.
  4. Bibaut Aurélien, Chambaz Antoine, Dimakopoulou Maria, Kallus Nathan, and Laan Mark van der. Post-contextual-bandit inference. NeurIPS 2021, 2021.
  5. Bubeck Sébastien, Munos Rémi, and Stoltz Gilles. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009.
  6. Bura Efstathia, Duarte Sabrina, Forzani Liliana, Smucler Ezequiel, and Sued Mariela. Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models. Statistics, 52(5):1005–1024, 2018.
  7. Caria Stefano, Kasy Maximilian, Quinn Simon, Shami Soha, Teytelboym Alex, et al. An adaptive targeted field experiment: Job search assistance for refugees in jordan. 2020.
  8. Chen Haoyu, Lu Wenbin, and Song Rui. Statistical inference for online decision making: In a contextual bandit setting. Journal of the American Statistical Association, pages 1–16, 2020.
  9. Dean Sarah, Mania Horia, Matni Nikolai, Recht Benjamin, and Tu Stephen. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, 2018.
  10. Deshpande Yash, Mackey Lester, Syrgkanis Vasilis, and Taddy Matt. Accurate inference for adaptive linear models. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1194–1203, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  11. Dvoretzky Aryeh. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
  12. Engle Robert F. Handbook of econometrics: volume 4. Number 330.015195 E53 v. 4. 1994.
  13. Erraqabi Akram, Lazaric Alessandro, Valko Michal, Brunskill Emma, and Liu Yun-En. Trading off rewards and errors in multi-armed bandits. In Artificial Intelligence and Statistics, pages 709–717. PMLR, 2017.
  14. Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
  15. Hammersley John. Monte carlo methods. Springer Science & Business Media, 2013.
  16. Huber Peter J. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
  17. Imbens Guido W and Rubin Donald B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
  18. Kallus Nathan and Uehara Masatoshi. Double reinforcement learning for efficient off-policy evaluation in markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.
  19. Kasy Maximilian. Uniformity and the delta method. Journal of Econometric Methods, 8(1), 2019.
  20. Kasy Maximilian and Sautmann Anja. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.
  21. Lai Tze Leung and Wei Ching Zong. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.
  22. Leeb Hannes and Pötscher Benedikt M. Model selection and inference: Facts and fiction. Econometric Theory, pages 21–59, 2005.
  23. Liao Peng, Greenewald Kristjan, Klasnja Predrag, and Murphy Susan. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020.
  24. Liu Yun-En, Mandel Travis, Brunskill Emma, and Popovic Zoran. Trading off scientific knowledge and user learning with multi-armed bandits. In EDM, pages 161–168, 2014.
  25. Nickerson David M. Construction of a conservative confidence region from projections of an exact confidence region in multiple linear regression. The American Statistician, 48(2):120–124, 1994.
  26. Rafferty Anna, Ying Huiji, and Williams Joseph. Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. JEDM | Journal of Educational Data Mining, 11(1):47–79, 2019.
  27. Robins James M, Hernan Miguel Angel, and Brumback Babette. Marginal structural models and causal inference in epidemiology, 2000.
  28. Romano Joseph P, Shaikh Azeem M, et al. On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6):2798–2822, 2012.
  29. Shaikh Hammad, Modiri Arghavan, Williams Joseph Jay, and Rafferty Anna N. Balancing student success and inferring personalized effects in dynamic experiments. In EDM, 2019.
  30. Thomas Philip and Brunskill Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.
  31. Van der Vaart Aad W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  32. Van Der Vaart Aad W and Wellner Jon A. Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer, 1996.
  33. Villar Sofía S, Bowden Jack, and Wason James. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.
  34. Wang Yu-Xiang, Agarwal Alekh, and Dudik Miroslav. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017.
  35. Yao Jiayu, Brunskill Emma, Pan Weiwei, Murphy Susan, and Doshi-Velez Finale. Power-constrained bandits. arXiv preprint arXiv:2004.06230, 2020.
  36. Yom-Tov Elad, Feraru Guy, Kozdoba Mark, Mannor Shie, Tennenholtz Moshe, and Hochberg Irit. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of medical Internet research, 19(10):e338, 2017.
  37. Zhan Ruohan, Hadad Vitor, Hirshberg David A, and Athey Susan. Off-policy evaluation via adaptive weighting with data from contextual bandits. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021.
  38. Zhang Kelly W, Janson Lucas, and Murphy Susan A. Inference for batched bandits. In Advances in Neural Information Processing Systems, 2020.
