Abstract
Empirical risk minimization (ERM) is the workhorse of machine learning, whether for classification and regression or for off-policy policy learning, but its model-agnostic guarantees can fail when we use adaptively collected data, such as the result of running a contextual bandit algorithm. We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class and provide first-of-their-kind generalization guarantees and fast convergence rates. Our results are based on a new maximal inequality that carefully leverages the importance sampling structure to obtain rates with good dependence on the exploration rate in the data. For regression, we provide fast rates that leverage the strong convexity of squared-error loss. For policy learning, we provide regret guarantees that close an open gap in the existing literature whenever exploration decays to zero, as is the case for bandit-collected data. An empirical investigation validates our theory.
1. Introduction
Adaptive experiments, wherein intervention policies are continually updated as in the case of contextual bandit algorithms, offer benefits in learning efficiency and better outcomes for participants in the experiment. They also make the collected data dependent and complicate standard machine learning approaches for model-agnostic risk minimization, such as empirical risk minimization (ERM). Given a loss function and a hypothesis class, ERM seeks the hypothesis that minimizes the sample average loss. This can be used for regression, classification, and even off-policy policy optimization. An extensive literature has shown that, for independent data, ERM enjoys model-agnostic, best-in-class risk guarantees and even fast rates under certain convexity and/or margin assumptions [e.g. 5, 6, 35, among others]. However, these guarantees fail under contextual-bandit-collected data, both because of covariate shift due to using a context-dependent logging policy and because of the policy's data-adaptive evolution as more data are collected. A straightforward and popular approach to dealing with the covariate shift is importance sampling (IS) weighting, whereby we weight samples by the inverse of the policy's probability of choosing the observed action. Unfortunately, applying standard maximal inequalities for sequentially dependent data to study guarantees for this approach leads to poor dependence on these weights, and therefore incorrect rates whenever exploration is decaying and the weights diverge to infinity, as happens when collecting data using a contextual bandit algorithm.
In this paper, we provide a thorough theoretical analysis of IS weighted ERM (ISWERM; pronounced “ice worm”) that yields the correct rates on the convergence of excess risk under decaying exploration. To achieve this, we present a novel localized maximal inequality for IS weighted sequential empirical processes (Section 2) that carefully leverages their IS structure to avoid a bad dependence on the size of IS weights, as compared to applying standard results to an IS weighted process (Remark 2). We then apply this result to obtain generic slow rates for ISWERM for both Donsker-like and non-Donsker-like entropy conditions, as well as fast rates when a variance bound applies (Section 3). We instantiate these results for regression (Section 4) and for policy learning (Section 5), where we can express entropy conditions in terms of the hypothesis class and obtain variance bounds from convexity and margin assumptions. In particular, our results for policy learning close an open gap between existing lower and upper bounds in the literature (Remark 3). We end with an empirical investigation of ISWERM that sheds light on our theory (Section 6).
1.1. Setting
We consider data consisting of T observations, O1, …, OT, where each observation consists of a state, action, and outcome, Ot = (Xt, At, Yt) ∈ 𝒳 × 𝒜 × 𝒴. The spaces 𝒳, 𝒜, 𝒴 are general measurable spaces, each endowed with a base measure λX, λA, λY, respectively; in particular, actions can be finite or continuous (e.g., λA can be counting or Lebesgue). We assume the data were generated sequentially in a stochastic-contextual-bandit fashion. Specifically, we assume that the distribution of O1:T has a density p(T) with respect to (wrt) (λX × λA × λY)T, which can be decomposed as
p(T)(o1:T) = ∏t=1,…,T pX(xt) gt(at | xt) pY(yt | xt, at),
where we write o1:t = (o1, …, ot), using lower case for dummy values and upper case for random variables. Here gt is a random, context-dependent policy that is measurable with respect to the past observations O1:t−1; it represents the policy that the agent has devised at the beginning of round t, which they then proceed to employ when observing Xt. Since we run the adaptive experiment, we assume that gt is known, as it is actually computed at stage t of the experiment. We do not assume that (pX, pY) is known.
Remark 1 (Counterfactual interpretation). We can also interpret this data collection from a counterfactual perspective. At the beginning of each round, the context and potential outcomes (Xt, (Yt(a))a∈𝒜) are drawn from some stationary (i.e., time-independent) distribution P*, Xt is revealed, and after acting with a non-anticipatory action At, meaning it only depends on past data and the current context, we observe Yt = Yt(At). This corresponds to the above with pX being the marginal of Xt under P* and pY(· | x, a) the conditional distribution of Yt(a) given Xt = x.
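To make this data-collection process concrete, here is a minimal simulation sketch; the linear outcome model, the ϵt-greedy rule, and all variable names are our own illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, d = 1000, 3, 5                       # rounds, arms, context dimension
theta = rng.normal(size=(K, d))            # illustrative true mean-outcome model

X, A, Y, props = [], [], [], []            # logged contexts, actions, outcomes, propensities
beta_hat = np.zeros((K, d))                # crude running estimate of theta
for t in range(1, T + 1):
    x = rng.normal(size=d)                 # context drawn i.i.d. from a stationary P*
    eps = t ** (-1 / 3)                    # decaying exploration rate
    greedy = int(np.argmax(beta_hat @ x))
    g_t = np.full(K, eps / K)              # epsilon-greedy logging policy g_t:
    g_t[greedy] += 1 - eps                 #   computable from past data and the context only
    a = int(rng.choice(K, p=g_t))
    y = theta[a] @ x + rng.normal()        # observed outcome Y_t = Y_t(A_t)
    X.append(x); A.append(a); Y.append(y); props.append(g_t[a])
    rows = [i for i, ai in enumerate(A) if ai == a]
    if len(rows) >= d:                     # naive per-arm least-squares refit
        Xa, Ya = np.array(X)[rows], np.array(Y)[rows]
        beta_hat[a] = np.linalg.lstsq(Xa, Ya, rcond=None)[0]
X, A, Y, props = map(np.array, (X, A, Y, props))
```

Because gt is computed by the experimenter at round t, the logged propensities in props are known exactly, as assumed in the setting above.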
1.2. Importance Sampling Weighted Empirical Risk Minimization
Consider a class of hypotheses ℱ, a loss function ℓ : ℱ × 𝒪 → ℝ where 𝒪 ≔ 𝒳 × 𝒜 × 𝒴, and some fixed reference g*(a | x), any function, for example, a conditional density. As we will see in Examples 1 to 3 below, we will often simply use g*(a | x) = 1. Define the population reference risk as
R*(f) ≔ ∫ ℓ(f, (x, a, y)) pX(x) g*(a | x) pY(y | x, a) dλX(x) dλA(a) dλY(y).
We are interested in finding f with low risk R*(f). We consider doing so using ISWERM, which is ERM where we weight each term by the density ratio between the reference and the policy at time t:
f̂T ∈ argminf∈ℱ (1/T) ∑t=1,…,T (g*(At | Xt)/gt(At | Xt)) ℓ(f, Ot).
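Continuing the simulation sketch above, here is a minimal illustration of the ISWERM objective, taking g*(a | x) = 1 (so the weight is 1/gt(At | Xt)), squared-error loss as in Example 1 below, and a toy finite hypothesis class of constant predictors; all of these choices are ours for illustration.

```python
def isw_risk(loss_per_obs, props, g_star=1.0):
    """IS-weighted empirical risk: the average of (g*/g_t) * loss over the T observations."""
    return float(np.mean((g_star / props) * loss_per_obs))

# ISWERM over a toy finite hypothesis class of constant predictors f(x, a) = c.
candidates = np.linspace(Y.min(), Y.max(), 50)
risks = [isw_risk((Y - c) ** 2, props) for c in candidates]
f_hat = candidates[int(np.argmin(risks))]
```

Over this toy class, ISWERM is just a weighted mean: the grid minimizer approximates the IS-weighted average of the outcomes.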
Example 1 (Regression). Consider 𝒴 ⊆ ℝ, ℱ ⊆ {f : 𝒳 × 𝒜 → ℝ}, and ℓ(f, o) = (y − f(x, a))2. Then f with small R*(f) is good at predicting outcomes from context and action. In particular, for any g*, we have that the conditional mean outcome μ(x, a) ≔ ∫ y pY(y | x, a) dλY(y) solves the minimization of R*(f) over all (measurable) functions f. And, we can write R*(f) = R*(μ) + ∫ (f(x, a) − μ(x, a))2 pX(x) g*(a | x) dλX(x) dλA(a).
Consider the counterfactual interpretation in Remark 1. Then R*(f) = E[∫ (Y(a) − f(X, a))2 g*(a | X) dλA(a)]. For example, if 𝒜 is finite, λA is the counting measure, and g*(a | x) = 1, then R*(f) = ∑a∈𝒜 E[(Y(a) − f(X, a))2] is the total counterfactual prediction error. Alternatively, if g*(a | x) = 1(a = a*) for some given a* ∈ 𝒜 and we let f(x, a) = f̃(x) for f̃ : 𝒳 → ℝ, then we have R*(f) = E[(Y(a*) − f̃(X))2], that is, the regression risk for predicting the counterfactual outcome Y(a*) from X.
Example 2 (Classification). In the same setting as Example 1, suppose 𝒴 = {−1, +1} and λY is the counting measure. Then μ(x, a) = 2pY(1 | x, a) − 1. And, if we restrict ℱ ⊆ {f : 𝒳 × 𝒜 → {−1, +1}}, letting ℓ(f, o) = 1(y ≠ f(x, a)) leads to R*(f) being the misclassification rate, an unrestricted minimizer of which is sign(μ(x, a)). Focusing on misclassification of sign(f(x, a)) for real-valued f, we can also use a classification-calibrated loss [6], such as the logistic loss ℓ(f, o) = log(1 + exp(−yf(x, a))), the hinge loss ℓ(f, o) = (1 − yf(x, a))+, etc.
Example 3 (Policy learning). Consider 𝒴 ⊆ ℝ and let ℱ be a class of stochastic policies, that is, functions f : 𝒳 × 𝒜 → [0, ∞) with ∫ f(x, a) dλA(a) = 1 for every x. Take g*(a | x) = 1 and ℓ(f, o) = yf(x, a). Then R*(f) = ∫ y f(x, a) pX(x) pY(y | x, a) dλX(x) dλA(a) dλY(y) is the average outcome under a policy f. If we interpret outcomes as costs (or, negative rewards), then seeking to minimize R*(f) means to seek a policy with least risk (or, highest value).
Consider in particular the counterfactual interpretation in Remark 1 with 𝒜 finite and λA the counting measure. Consider deterministic policies: given h : 𝒳 → 𝒜, let fh(x, a) = 1(h(x) = a). Then we have R*(fh) = E[Y(h(X))], that is, the average counterfactual outcome under h.
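Continuing the same sketch, here is ISWERM for policy learning as in Example 3, with deterministic policies fh(x, a) = 1(h(x) = a), g*(a | x) = 1 so the weight reduces to 1/gt(At | Xt), and a toy policy class of constant-arm policies (again our own illustrative choices).

```python
def isw_policy_risk(h, X, A, Y, props):
    """IS-weighted estimate of the average outcome (cost) under the policy h."""
    chosen = (np.array([h(x) for x in X]) == A).astype(float)
    return float(np.mean((1.0 / props) * Y * chosen))

# A toy policy class: always play a fixed arm a.
policy_class = [lambda x, a=a: a for a in range(K)]
risks = [isw_policy_risk(h, X, A, Y, props) for h in policy_class]
h_hat = policy_class[int(np.argmin(risks))]   # ISWERM policy: minimizes IS-weighted cost
```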
1.3. Related Literature
Contextual bandits.
A rich literature studies how to design adaptive experiments to optimize regret, simple regret, or the chance of identifying best interventions [see 12, 36, and bibliographies therein]. Such adaptive experiments can significantly improve upon randomized trials (aka A/B tests), which is why they are seeing increased use in practice in a variety of settings, from e-commerce to policymaking [3, 4, 29, 32, 37, 43, 44, 53]. However, while randomized trials produce iid data, adaptive experiments do not, complicating post-experiment analysis, which motivates our current study. Many stochastic contextual bandit algorithms (stochastic meaning the context and response models are stationary, as in our setting) need to tackle learning from adaptively collected data to fit regression estimates of mean reward functions, but for the most part this is based on models such as linear [7, 14, 23, 37] or Hölder class [26, 47], rather than on doing model-agnostic risk minimization and nonparametric learning with general function classes as we do here. Foster and Rakhlin [19] use generic regression models but require online oracles with guarantees for adversarial sequences of data. Simchi-Levi and Xu [50] use offline least-squares ERM but bypass the issue of adaptivity by using epochs of geometrically growing size in each of which data are collected iid. Other stochastic contextual bandit algorithms are based on direct policy learning using ERM [1, 8, 16, 20]; by carefully designing exploration strategies, they obtain good regret rates that are even better than the minimax-optimal guarantees attainable given only the exploration rates, which is what we obtain (Remark 3).
Inference with adaptive data.
A stream of recent literature tackles how to construct confidence intervals after an adaptive experiment. While standard estimators like inverse-propensity weighting (IPW) and doubly robust estimation remain unbiased under adaptive data collection, they may no longer be asymptotically normal, making inference difficult. To fix this, Hadad et al. [24] use and generalize a stabilization trick originally developed by Luedtke and van der Laan [38] for a non-adaptive setting with different inferential challenges. Their stabilized estimator, however, only works for data collected by non-contextual bandits. Bibaut et al. [9] extend this to a contextual-bandit setting. Our focus is different from these: risk minimization and guarantees rather than inference.
Policy learning with adaptive data.
Zhan et al. [58] study policy learning from contextual-bandit data by optimizing a doubly robust policy value estimator stabilized by a deterministic lower bound on IS weights. They provide regret guarantees for this algorithm based on invoking the results of Rakhlin et al. [45]. However, these guarantees do not match the algorithm-agnostic lower bound they provide whenever the lower bounds on IS weights decay to zero, as they do when data are generated by a bandit algorithm. For example, for an epsilon-greedy bandit algorithm with an exploration rate of ϵt = t−β, their lower bound on expected regret is Ω(T−(1−β)/2) while their upper bound is O(T−(1/2−β)). We close this gap by providing an upper bound of O(T−(1−β)/2) for our simpler IS weighted algorithm. See Remark 3. Our results for policy learning also extend to fast rates under margin conditions, non-Donsker-like policy classes, and learning via convex surrogate losses.
IS weighted ERM.
The use of IS weighting to deal with covariate shift, including when induced by a covariate-dependent policy, is standard. For estimation of causal effects from observational data this usually takes the form of inverse propensity weighting [28]. The same is often used for ERM for regression [15, 22, 48] and for policy learning [34, 52, 59]. When regressions are plugged into causal effect estimators, weighted regression with weights that depend on IS weights minimizes the resulting estimation variance over a hypothesis class [13, 18, 30, 31, 49]. All of these approaches, however, have been studied in the independent-data setting, where historical logging policies do not depend on the same observed data available for training; guarantees without this independence are precisely our focus herein.
Sequential maximal inequalities.
There are essentially two strands in the literature on maximal inequalities for sequential empirical processes. One expresses bounds in terms of sequential bracketing numbers as introduced by van de Geer [55], generalizing standard bracketing numbers. Another uses sequential covering numbers, introduced by Rakhlin et al. [46]. These are in general not comparable. Foster and Krishnamurthy [20] and Zhan et al. [58] use sequential L∞ and Lp covering numbers, respectively, to obtain maximal inequalities. van de Geer [55, Chapter 8] gives guarantees for ERM over nonparametric classes of controlled sequential bracketing entropy. However, applying her generic result as-is to IS weighted processes provides bad dependence on the exploration rate in the case of larger-than-Donsker hypothesis classes (see Remark 2). We also use sequential bracketing numbers, but we develop a new maximal inequality specifically for IS weighted sequential empirical processes, where we use the special structure when truncating the chaining to avoid a bad dependence on the size of the IS weights. Equipped with our new maximal inequality, we obtain first-of-their-kind guarantees for ISWERM, including fast rates that have not previously been derived in adaptive settings.
2. A Maximal Inequality for IS Weighted Sequential Empirical Processes
A key building block for our results is a novel maximal inequality for IS weighted sequential empirical processes. For any sequence of objects (xt)t≥1, we introduce the shorthand x1:T to denote the sequence (x1, …, xT). We say that a sequence of random variables ζ1:T is predictable if, for every t ∈ [T] = {1, …, T}, ζt is σ(O1:t−1)-measurable, i.e., is some function of O1:t−1.
IS weighted sequential empirical processes.
Let Pg denote the distribution on 𝒪 = 𝒳 × 𝒜 × 𝒴 with density w.r.t. λX × λA × λY given by pX × g × pY and let us use the notation Pg h ≔ ∫ h dPg. Let us also define the norm ∥h∥p,g = (Pg(hp))1/p. Consider a sequence of ℱ-indexed random processes of the form ΞT = {ξ1:T(f) : f ∈ ℱ} where, for every f ∈ ℱ, ξ1:T(f) is a predictable sequence of functions 𝒪 → ℝ. The IS-weighted sequential empirical process induced by ΞT is the ℱ-indexed random process
MT(f) ≔ (1/T) ∑t=1,…,T ((g*(At | Xt)/gt(At | Xt)) ξt(f)(Ot) − Pg* ξt(f)).
Sequential bracketing entropy.
For any predictable sequence ζ1:T of functions 𝒪 → ℝ, we introduce the pseudonorm ρT,g*(ζ1:T) ≔ ((1/T) ∑t=1,…,T ∥ζt∥2,g*2)1/2.
We say that a collection {(l(k)1:T, u(k)1:T) : k ∈ [N]} of 2N-many predictable sequences of functions is an (ϵ, ρT,g*)-sequential bracketing of ΞT if (a) for every f ∈ ℱ, there exists k ∈ [N] such that l(k)t ≤ ξt(f) ≤ u(k)t for every t ∈ [T], and (b) for every k ∈ [N], ρT,g*(u(k)1:T − l(k)1:T) ≤ ϵ. We denote by N[](ϵ, ΞT, ρT,g*) the minimal cardinality of an (ϵ, ρT,g*)-sequential bracketing of ΞT.
The special case of classes of deterministic functions.
Consider the special case ξt(f) ≔ ξ(f), where Ξ ≔ {ξ(f) : f ∈ ℱ} is a class of functions such that, for every f ∈ ℱ, ξ(f) is a deterministic function. Observe that, for a fixed function ζ, letting ζt ≔ ζ, we have that ρT,g*(ζ1:T) = ∥ζ∥2,g*. Therefore, N[](ϵ, ΞT, ρT,g*), the (ϵ, ρT,g*)-sequential bracketing number of ΞT, reduces to N[](ϵ, Ξ, ∥ · ∥2,g*), the usual ϵ-bracketing number of Ξ in the ∥ · ∥2,g* norm.
The maximal inequality.
Our maximal inequality will crucially depend on the decay rate of the IS weights, that is, on the exploration rate of the adaptive data collection.
Assumption 1. There exists a deterministic sequence of positive numbers (γt) such that, for any t ≥ 1, ∥g*/gt∥∞ ≤ γt, almost surely. Define γ1:T ≔ (γ1, …, γT) and γ̄1:T ≔ (1/T) ∑t=1,…,T γt.
For example, if the data were collected under an ϵt-greedy contextual bandit algorithm over K arms and g*(a | x) = 1, then we may take γt = K/ϵt. If we have ϵt = t−β for β ∈ (0, 1), then γ̄1:T = O(Tβ). Note that having γt < ∞ in Assumption 1 does restrict us to contextual bandit algorithms that are conditionally random, such as ϵ-greedy and Thompson sampling, and rules out algorithms with deterministic policies given the history, such as UCB.
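For completeness, the growth of the average IS-weight bound in this example follows from a standard sum-integral comparison; the specific choice g*(a | x) = 1 with K arms and γt = Ktβ is the illustrative assumption above.

```latex
\frac{1}{T}\sum_{t=1}^{T} \gamma_t
  = \frac{K}{T}\sum_{t=1}^{T} t^{\beta}
  \le \frac{K}{T}\int_{0}^{T+1} s^{\beta}\,\mathrm{d}s
  = \frac{K\,(T+1)^{\beta+1}}{(\beta+1)\,T}
  = O\!\left(K\,T^{\beta}\right).
```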
Theorem 1. Consider as defined above. Suppose that Assumption 1 holds, and that there exists B > 0 such that . In the special case where ξt(f) = ξ(f), ξ1:T ∈ ΞT, are deterministic functions, we let . Otherwise, in the general case, we let . Let r > 0. Let . For any r− ∈ [0, r/2], and any x > 0, it holds with probability at least 1 − 2e−x that
Remark 2 (Leveraging IS structure). Theorem 1 is based on a finite-depth adaptive chaining device, in which we leverage the IS-weighted structure to carefully bound the size of the tip of the chains. In contrast, applying Theorem 8.13 of van de Geer [55] to the IS weighted sequential empirical process would lead to suboptimal dependence on γt. The crucial point is to work with IS-weighted chains built from upper brackets uj,f of the unweighted class Ξ at scales ϵ1 > … > ϵJ (we simplify the chaining decomposition here a bit for ease of presentation compared with the proof). In adaptive chaining, the tip is bounded by the L1 norm of the corresponding bracket. In our case, denoting by lJ,f the lower bracket corresponding to uJ,f, the tip is bounded by Pg*(uJ,f − lJ,f), in which we integrate out the IS ratio, thereby paying no price for it. Directly applying Theorem 8.13 of van de Geer [55], we would instead be working with a bracketing of the IS weighted class itself. When working with generic L2 brackets of the weighted class, the IS-weighting structure is lost, and we cannot do better than bounding the L1 norm of the tip by its L2 norm, which depends on γt. Since in sequential settings γt generally diverges to ∞, good dependence is paramount to obtaining tight, informative results. Our proof technique otherwise follows the same general outlines as those of van de Geer [55, Theorem 8.13] and van Handel [57, Theorem A.4] (or Massart [40, Theorem 6.8] in the iid setting). Like these, we too leverage an adaptive chaining device, as pioneered by Ossiander [41].
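The change-of-measure identity behind the tip bound, spelled out here in the notation of Section 2 for a generic integrable function h (a standard calculation, included for clarity), is:

```latex
\mathbb{E}\!\left[\frac{g^*(A_t \mid X_t)}{g_t(A_t \mid X_t)}\, h(O_t) \,\middle|\, O_{1:t-1}\right]
  = \int h(x, a, y)\, \frac{g^*(a \mid x)}{g_t(a \mid x)}\, p_X(x)\, g_t(a \mid x)\, p_Y(y \mid x, a)\,
      \mathrm{d}\lambda_X(x)\, \mathrm{d}\lambda_A(a)\, \mathrm{d}\lambda_Y(y)
  = P_{g^*} h ,
```

so the conditional L1 norm of an IS-weighted bracket difference equals its unweighted L1(Pg*) norm, with no factor of γt appearing.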
3. Applications to Guarantees for ISWERM
We now return to ISWERM and use Theorem 1 to obtain generic guarantees for ISWERM. We will start with so-called slow rates that give generic generalization results and then present so-called fast rates that will apply in certain settings, where a so-called variance bound is available. Let f1 be a minimizer of the population risk R* over ℱ, that is, f1 ∈ argminf∈ℱ R*(f).
Assumption 2 (Entropy on the loss class). Define the loss class ℓ(ℱ) ≔ {ℓ(f, ·) : f ∈ ℱ}. There exist an envelope function of ℓ(ℱ) and p > 0 such that, for any ϵ > 0,
The case p < 2 corresponds to the Donsker case, and p ≥ 2 to the (possibly) non-Donsker case.
Assumption 3 (Diameters of the loss class). There exist b0 > 0 and ρ0 > 0 such that
Theorem 2 (Slow Rates for ISWERM). Suppose Assumptions 1 to 3 hold. Then for any δ ∈ (0, 1/2), we have that, with probability at least 1 − δ,
For p = 2 the bound is similar to the second case but with polylog terms; for brevity we omit the p = 2 case in this paper. Theorem 2 bounds the excess risk of ISWERM in terms of the IS-weight bounds γ1:T and the entropy exponent p. For example, if γt = O(tβ) and p < 2, we obtain a rate of order T−(1−β)/2 up to logarithmic factors. For β = 0 this matches the familiar T−1/2 slow rate of iid settings. However, in many cases we can obtain faster rates.
Assumption 4 (Variance Bound). For some α ∈ (0, 1], we have
As we will see in Lemmas 2 and 3, we can ensure Assumption 4 holds for least-squares regression and for policy learning with a margin condition.
Assumption 5 (Convexity). ℱ is convex and ℓ(·, O) is almost surely convex.
Theorem 3 (Fast Rates for ISWERM). Suppose Assumptions 1 to 5 hold with p < 2. Then for any δ ∈ (0, 1/2), we have that, with probability at least 1 − δ,
The entropy condition.
Assumption 2 assumes an entropy bound on the loss class. For many loss functions, we can easily satisfy this condition by assuming an entropy condition on ℱ itself.
Assumption 6 (Entropy on ℱ). There exist p > 0 and an envelope function F of ℱ such that
Lemma 1 (Lemma 4 in Bibaut and van der Laan [10]). Suppose that is a set of unimodal functions that are equi-Lipschitz. Then Assumption 6 implies Assumption 2.
There are many examples of ℱ for which bracketing entropy conditions are known. The class of β-Hölder smooth functions (meaning having derivatives of orders up to ⌊β⌋, with the ⌊β⌋-order derivatives being (β − ⌊β⌋)-Hölder continuous) on a compact domain in ℝd has p = d/β [56, Corollary 2.7.2]. The class of convex Lipschitz functions on a compact domain in ℝd has p = d/2 [56, Corollary 2.7.10]. The class of bounded monotone functions on ℝ has p = 1 [56, Theorem 2.7.5]. If ℱ = {f(·; θ) : θ ∈ Θ}, f(o; θ) is Lipschitz in θ, and Θ ⊂ ℝd is compact, then Assumption 6 holds for any p > 0 [56, Theorem 2.7.11]. The class of càdlàg functions with sectional variation norm (aka Hardy-Krause variation) no larger than M > 0 has envelope-scaled bracketing entropy O(ϵ−1|log(1/ϵ)|2(d−1)) [10], so Assumption 6 holds with any p > 1 (or, we can track the log terms). Since trees with bounded output range and finite depth fall in the class of càdlàg functions with bounded sectional variation norm, decision tree classes also satisfy Assumption 6 with any p > 1.
4. Least squares regression using ISWERM
We now instantiate ISWERM for least squares regression. Consider 𝒴 ⊆ [−M, M] for some M > 0, ℱ ⊆ {f : 𝒳 × 𝒜 → [−M, M]}, and ℓ(f, o) = (y − f(x, a))2. If ℱ is convex, strongly convex losses such as ℓ always yield a variance bound with respect to any population risk minimizer over ℱ (see, e.g., Lemma 15 in [6]). Let f1 be such a population risk minimizer. We present in the lemma below the properties relevant for the application of Theorems 2 and 3.
Lemma 2 (Properties of the square loss). Consider the setting of the current section. The square loss ℓ over ℱ satisfies the following variance bound:
and the following Lipschitz property:
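For intuition, here is a sketch of the standard argument behind such a variance bound, under the assumptions of this section (𝒴 ⊆ [−M, M], ℱ ⊆ {f : ∥f∥∞ ≤ M}, ℱ convex); the constant 16M2 is ours and need not match the one in Lemma 2.

```latex
\big(\ell(f, o) - \ell(f_1, o)\big)^2
  = \big(f_1(x,a) - f(x,a)\big)^2 \big(2y - f(x,a) - f_1(x,a)\big)^2
  \le 16 M^2 \big(f(x,a) - f_1(x,a)\big)^2 ,
\quad\text{so}\quad
P_{g^*}\big(\ell(f) - \ell(f_1)\big)^2 \le 16 M^2 \,\|f - f_1\|_{2,g^*}^2 .
```

Combining this with the projection inequality R*(f) − R*(f1) ≥ ∥f − f1∥2,g*2, which holds because f1 is the ∥ · ∥2,g*-projection of μ onto the convex set ℱ, yields a variance bound of the form in Assumption 4 with α = 1.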
Theorem 4 (Least squares regression). Suppose Assumption 1 holds. Suppose Assumption 6 holds with the envelope taken to be a constant equal to the range of the regression functions. Then for any δ ∈ (0, 1/2), we have that, with probability at least 1 − δ,
5. Policy Learning using ISWERM
We next instantiate ISWERM for policy learning. Consider ℱ, a class of policies as in Example 3. Let ℓ(f, o) = yf(x, a) and g*(a | x) = 1 so that R*(f) is exactly the average outcome under the policy f (or, the negative of its value); the regret of f is then defined as R*(f) − inff′∈ℱ R*(f′).
We first give specification-agnostic slow rates, which also close an open gap in the literature.
Theorem 5 (ISWERM Policy Learning: slow rates). Suppose Assumption 1 holds and suppose that Assumption 6 holds with the envelope constant equal to 1 (which is the maximal range of policies). Then for any δ ∈ (0, 1/2), we have that, with probability at least 1 − δ,
Remark 3 (Comparison to Zhan et al. [58]). Given a deterministic policy class (fh(x, a) = 1(h(x) = a)) with finite Natarajan dimension, Zhan et al. [58] show a lower bound on the expected regret of any policy-learning algorithm for some logging policy satisfying Assumption 1 (see their Theorem 1) that scales as Ω(T−(1−β)/2) when ϵt = t−β. Zhan et al. [58] also provide an upper bound for their particular algorithm (see their Corollary 2.1) that scales as O(T−1/2+β) when ϵt = t−β, assuming that the Hamming covering entropy log NH grows polynomially with exponent p < 1, where NH is the Hamming covering number; this would be implied by having finite Natarajan dimension. This is a potentially significant gap in the regret rate in T when exploration is diminishing, β > 0, as is often the case with bandit-collected data. For example, for β = 1/2, this yields a vanishing lower bound and a non-vanishing upper bound.
In comparison, our Theorem 5 gives a rate of O(T−(1−β)/2) on the expected regret of policy learning with ISWERM in the case of p < 2 and ϵt = t−β (obtained by integrating the tail inequality in Theorem 5). This matches the rate of the lower bound of Zhan et al. [58], seemingly closing the gap. While Natarajan dimension, Hamming covering entropy with p < 1, and bracketing entropy with p < 2 are generally incomparable conditions (aside from the first implying the second), they all generally hold for policy classes parametrized by finite-dimensional parameters and for tree policy classes, for which we therefore definitively close the gap. It remains an open question how to close the gap for general policy classes that only satisfy a Hamming covering entropy condition.
The gap arose from the specific technical route Zhan et al. [58] followed (not from their algorithm). For the sake of exposition, we explain the phenomenon in a non-sequential i.i.d. setting, under a stationary logging policy g1 and in our own notation; the same phenomenon translates to the sequential setting. Since they use a symmetrization- and covering-based approach, they need to work with uniform covering-type entropies1 of the form supQ log N(ϵ, 𝒢, L2(Q)) for a certain class 𝒢, where the supremum is over all finitely supported distributions Q. Their approach amounts to taking 𝒢 to be the weighted loss class {(g*/g1)ℓ(π)}. While under the data-generating distribution the IS weights can be integrated out, for a general Q the best bound on the distance between weighted losses scales with the bound on the IS weights times the Hamming distance dH between the policies. By contrast, when working with bracketing entropy, one only needs to control the size of brackets in terms of the L2-norm under the distribution of the data, which allows one to save a factor of the IS-weight bound. Our results also show that a simple IS weighting algorithm suffices to obtain optimal rates, and the stabilization by γt employed by Zhan et al. [58], which is inspired by the stabilization employed by Hadad et al. [24], Luedtke and van der Laan [38] for inference purposes, may not be necessary for policy learning purposes. The doubly-robust-style centering may still be beneficial in practice for reducing variance but it does not affect the rate.
Remark 4 (Comparison to [20]). Foster and Krishnamurthy [20], albeit in a slightly different setting, derive a maximal inequality under sequential covering entropy that exhibits the same correct dependence on the exploration rate as ours. This shows in particular that the suboptimal dependence on the exploration rate in Zhan et al. [58] is not a necessary consequence of using sequential covering entropy. Analogously to us, Foster and Krishnamurthy [20] exploit the specific IS-weighted structure of the loss process and work with covers of the unweighted policy class directly. Using an L∞ sequential cover of the unweighted class and Hölder's inequality, they are able to factor out the L1 norm of the IS ratios. This allows them to circumvent the type of sequential cover of the weighted class that Zhan et al. [58] need, and yields optimal γ scaling. One caveat of this approach is that the entropy integral in the corresponding bound is expressed in terms of L∞ sequential covering entropy, which makes it hard to obtain fast rates via localization. Indeed, while variance bounds that allow for localization in L2 norm are common, it is in general much harder to obtain localization in L∞ norm.
In well-specified cases, much faster rates of regret convergence are possible. We focus on finitely many actions, 𝒜 = {1, …, K} with λA the counting measure. Define the optimal (smallest, as outcomes are costs) mean outcome μ*(X) ≔ mina∈𝒜 μ(X, a) and fix a*(X) with μ(X, a*(X)) = μ*(X).
Assumption 7 (Margin). For a constant ν ∈ [0, ∞], we have for all u ≥ 0,
where we define 01/∞ = 0 and x1/∞ = 1 for x ∈ (0, 1].
This type of margin condition was originally considered in the case of binary classification [39, 54]. The condition we use is more similar to that used in multi-arm contextual bandits [25, 26, 42]. The condition controls the density of the arm gap near zero. It generally holds with ν = 1 for sufficiently well-behaved μ and continuous X and with ν = ∞ for discrete X [see, e.g., 27, Lemmas 4 and 5].
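As a concrete instance (the uniform-gap setup here is our own illustrative assumption, stated for the cost-minimization convention of this section), suppose X is discrete and the best arm is unique with gap at least Δ > 0 almost surely. Then the arm gap has no mass near zero:

```latex
\min_{a \ne a^*(X)} \mu(X, a) - \mu^*(X) \ge \Delta \;\; \text{a.s.}
\quad\Longrightarrow\quad
\Pr\!\Big( 0 < \min_{a \ne a^*(X)} \mu(X, a) - \mu^*(X) \le u \Big) = 0
\;\; \text{for all } u < \Delta ,
```

which satisfies Assumption 7 with ν = ∞; in contrast, a continuously distributed gap with a bounded density near zero typically gives ν = 1.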
Lemma 3. Suppose Assumption 7 holds and 𝒴 ⊆ [−M, M]. Then Assumption 4 holds for α = ν/(ν + 1) and Λ : o ↦ M.
Theorem 6 (ISWERM Policy Learning: fast rates). Suppose Assumptions 1, 6 and 7 hold with p < 2 and 𝒴 ⊆ [−M, M]. Then for any δ ∈ (0, 1/2), with probability at least 1 − δ,
Remark 5 (Classification using ISWERM). The above results can easily be rephrased for the classification analogue of the regression problem in Section 4, where 𝒴 = {−1, +1} and we want a classifier based on the features (x, a) to minimize misclassification error. Because the policy learning problem is of both greater interest and greater generality, we focus our presentation on policy learning.
6. Empirical Study
Next, we empirically investigate various risk minimization approaches using data collected by a contextual bandit algorithm, including both ISWERM and unweighted ERM, among others. We take 51 different multi-class classification datasets from OpenML-CC18 [11] and transform each into a multi-arm contextual bandit problem (following the approach of [15, 17, 51]). We then run an ϵ-greedy algorithm for T = 100000 rounds, where we explore uniformly with probability ϵt = t−1/3 and otherwise pull the arm that maximizes an estimate of μ(x, a) based on the data so far. Details are given in Appendix F.1.
We then consider using these data to regress Yt on (Xt, At) using different methods, where each observation is weighted by wt under one of the following schemes: (1) Unweighted ERM: wt = 1; (2) ISWERM; (3) ISFloorWERM, where, inspired by [58], we weight by the inverse of the (nonrandom) floor of the propensity scores; (4) SqrtISWERM, which applies the stabilization of [24, 38] to ISWERM; (5) SqrtISFloorWERM; (6) MRDRWERM, which uses the weights of Farajtabar et al. [18]; (7) MRDRFloorWERM, which is like MRDRWERM but uses the propensity score floors γt. With these sample weights, we run either Ridge regression, LASSO, or CART using sklearn's RidgeCV(cv=4), LassoCV(cv=4), or DecisionTreeRegressor, each with default parameters. For Ridge and LASSO we pass as features the intercept-augmented contexts concatenated with the product of the one-hot encoding of arms with the intercept-augmented contexts. For CART, we use the concatenation of the contexts with the one-hot encoding of arms. To evaluate, we play our bandit anew for Ttest = 1000 rounds using a uniform exploration policy, g*(a | x) = 1/K, and record the mean-squared error (MSE) of the regression fits on these data. We repeat the whole process 64 times and report the estimated average MSE and standard error in Fig. 1.
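As a concrete illustration of the weighted-regression step, here is a minimal sketch using sklearn's sample_weight; the variable names, the feature-construction helper, and the restriction to only the unweighted-ERM and ISWERM weights are our own simplifications of the setup described above, with X, A, Y, props denoting the logged contexts, arms, outcomes, and propensities gt(At | Xt).

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def make_features(X, A, K):
    """Intercept-augmented contexts [1, x] concatenated with onehot(arm) x [1, x]."""
    Xa = np.hstack([np.ones((len(X), 1)), X])                     # (T, d + 1)
    onehot = np.eye(K)[A]                                         # (T, K)
    interactions = (onehot[:, :, None] * Xa[:, None, :]).reshape(len(X), -1)
    return np.hstack([Xa, interactions])

def fit_weighted_ridge(X, A, Y, weights, K):
    """Weighted ERM for ridge regression via sklearn's sample_weight."""
    model = RidgeCV(cv=4)
    model.fit(make_features(X, A, K), Y, sample_weight=weights)
    return model

weights = {
    "ERM": np.ones(len(Y)),       # (1) unweighted ERM
    "ISWERM": 1.0 / props,        # (2) ISWERM with reference g*(a | x) = 1
}
models = {name: fit_weighted_ridge(X, A, Y, w, K) for name, w in weights.items()}
```

The other weighting schemes in the list above would simply swap in different weight vectors.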
Figure 1: Comparison of weighted regression run on contextual-bandit-collected data. Each dot is one of 51 OpenML-CC18 datasets. Lines denote ±1 standard error. Dots are blue when ISWERM is clearly better, red when clearly worse, and black when indistinguishable within one standard error.
Results.
Figures 1a and 1b show that ISWERM clearly outperforms unweighted ERM and all other weighted-ERM schemes for linear regression, with ISWERM's advantage even more pronounced for LASSO. Intuitively, since a linear model is misspecified, this can be attributed to ISWERM's ability to provide agnostic best-in-class risk guarantees. In contrast, for a better-specified model such as CART, all ERM methods perform similarly, as seen in Fig. 1c. We highlight that our focus is not methodological improvement per se; the aim of our experiments is to explore the implications of our theory, not to provide state-of-the-art results. We provide additional empirical results in Appendix F.2, the conclusions from which are qualitatively the same.
7. Conclusions and Future Work
We provided first-of-their-kind guarantees for risk minimization from adaptively collected data using ISWERM. Most crucially, our guarantees provided good dependence on the size of IS weights, leading to correct convergence rates when exploration diminishes with time, as happens when we collect data using a contextual bandit algorithm. This was made possible by a new maximal inequality specifically for IS weighted sequential empirical processes. There are several important avenues for future work. We focused on a fixed hypothesis class; one important next question is how to do effective model selection in adaptive settings. We also focused on IS weighted regression and policy learning, but recent work in the iid setting highlights the benefits of using doubly-robust-style centering [2, 21, 33]. These benefits matter most for avoiding rate deterioration when IS weights are estimated; our IS weights are known, but there are still benefits in reducing the loss variance in the leading constant. Therefore, exploring such methods in adaptive settings is another important next question.
8. Societal Impact
Our work provides guarantees for learning from adaptively collected data. While the methods (IS weighting) are standard, our novel guarantees lend credibility to the use of adaptive experiments. Adaptive experiments hold great promise for better, more efficient, and even more ethical experiments. At the same time, adaptive experiments, especially when all arms are always being explored (γt < ∞), even if at vanishing rates (γt = ω(1)), must still be subject to the same ethical guidelines as classic randomized experiments regarding the favorable risk-benefit ratio of any arm, informed consent, and other protections of participants. There are also several potential dangers to be aware of in supervised and policy learning generally, such as training data that may be unrepresentative of the population to which predictions and policies will be applied, leading to potential disparities, as well as the focus on average welfare rather than on prediction error or policy value for each individual or group. These remain concerns in the adaptive setting, and while ways to tackle these challenges in non-adaptive settings might be applicable in adaptive ones, a rigorous study of their applicability requires future work.
Footnotes
1. See van der Vaart and Wellner [56, Chapter 2.3] for an explanation of why uniform covering entropy is natural for bounding symmetrized Rademacher processes.
Contributor Information
Aurélien Bibaut, Netflix.
Nathan Kallus, Cornell University and Netflix.
Maria Dimakopoulou, Netflix.
Antoine Chambaz, Université de Paris.
Mark van der Laan, University of California, Berkeley.
References
- [1] Agarwal Alekh, Hsu Daniel, Kale Satyen, Langford John, Li Lihong, and Schapire Robert. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
- [2] Athey Susan and Wager Stefan. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.
- [3] Athey Susan, Baird Sarah, Jamison Julian, McIntosh Craig, Özler Berk, and Sama Dohbit. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in Cameroon, 2018. URL http://pubdocs.worldbank.org/en/606341582906195532/Study-Protocol-Adaptive-experiment-on-FP-counseling-and-uptake-of-MCs.pdf. Study protocol.
- [4] Bakshy Eytan, Dworkin Lili, Karrer Brian, Kashin Konstantin, Letham Benjamin, Murthy Ashwin, and Singh Shaun. AE: A domain-agnostic platform for adaptive experimentation. In Workshop on Systems for ML, 2018.
- [5] Bartlett Peter L, Bousquet Olivier, and Mendelson Shahar. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
- [6] Bartlett Peter L, Jordan Michael I, and McAuliffe Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- [7] Bastani Hamsa and Bayati Mohsen. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
- [8] Bibaut AF, Chambaz A, and van der Laan MJ. Generalized policy elimination: an efficient algorithm for nonparametric contextual bandits, 2020.
- [9] Bibaut Aurelien, Dimakopoulou Maria, Kallus Nathan, Chambaz Antoine, and van der Laan Mark. Post-contextual-bandit inference for policy evaluation. 2021.
- [10] Bibaut Aurélien F and van der Laan Mark J. Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244, 2019.
- [11] Bischl Bernd, Casalicchio Giuseppe, Feurer Matthias, Hutter Frank, Lang Michel, Mantovani Rafael G, van Rijn Jan N, and Vanschoren Joaquin. OpenML benchmarking suites. arXiv preprint arXiv:1708.03731, 2017.
- [12] Bubeck Sébastien and Cesa-Bianchi Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. ISSN 1935-8237. doi: 10.1561/2200000024.
- [13] Cao Weihua, Tsiatis Anastasios A, and Davidian Marie. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
- [14] Chu Wei, Li Lihong, Reyzin Lev, and Schapire Robert. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011.
- [15] Dimakopoulou Maria, Zhou Zhengyuan, Athey Susan, and Imbens Guido. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017.
- [16] Dudik Miroslav, Hsu Daniel, Kale Satyen, Karampatziakis Nikos, Langford John, Reyzin Lev, and Zhang Tong. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.
- [17] Dudík Miroslav, Erhan Dumitru, Langford John, and Li Lihong. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
- [18] Farajtabar Mehrdad, Chow Yinlam, and Ghavamzadeh Mohammad. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
- [19] Foster Dylan and Rakhlin Alexander. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210. PMLR, 2020.
- [20] Foster Dylan J and Krishnamurthy Akshay. Contextual bandits with surrogate losses: Margin bounds and efficient algorithms. arXiv preprint arXiv:1806.10745, 2018.
- [21] Foster Dylan J and Syrgkanis Vasilis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
- [22] Freedman David A and Berk Richard A. Weighting regressions by propensity scores. Evaluation Review, 32(4):392–409, 2008.
- [23] Goldenshluger Alexander and Zeevi Assaf. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
- [24] Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
- [25] Hu Yichun, Kallus Nathan, and Mao Xiaojie. Fast rates for contextual linear optimization. arXiv preprint arXiv:2011.03030, 2020.
- [26] Hu Yichun, Kallus Nathan, and Mao Xiaojie. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. In Conference on Learning Theory, pages 2007–2010, 2020.
- [27] Hu Yichun, Kallus Nathan, and Uehara Masatoshi. Fast rates for the regret of offline reinforcement learning. In Conference on Learning Theory, 2021.
- [28] Imbens Guido W and Rubin Donald B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
- [29] Kallus Nathan and Udell Madeleine. Dynamic assortment personalization in high dimensions. Operations Research, 68(4):1020–1037, 2020.
- [30] Kallus Nathan and Uehara Masatoshi. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
- [31] Kallus Nathan, Saito Yuta, and Uehara Masatoshi. Optimal off-policy evaluation from multiple logging policies. In International Conference on Machine Learning, 2021.
- [32] Kasy Maximilian and Sautmann Anja. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021.
- [33] Kennedy Edward H. Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020.
- [34] Kitagawa Toru and Tetenov Aleksey. Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616, 2018.
- [35] Koltchinskii Vladimir. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006.
- [36] Lattimore Tor and Szepesvári Csaba. Bandit Algorithms. Cambridge University Press, 2020.
- [37] Li Lihong, Chu Wei, Langford John, and Schapire Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
- [38] Luedtke Alexander R and van der Laan Mark J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. The Annals of Statistics, 44(2):713–742, 2016. doi: 10.1214/15-AOS1384.
- [39] Mammen Enno and Tsybakov Alexandre B. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
- [40] Massart Pascal. Concentration Inequalities and Model Selection. 2007.
- [41] Ossiander Mina. A central limit theorem under metric entropy with L2 bracketing. Annals of Probability, 15(3):897–919, 1987.
- [42] Perchet Vianney and Rigollet Philippe. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.
- [43] Qiang Sheng and Bayati Mohsen. Dynamic pricing with demand covariates. arXiv preprint arXiv:1604.07463, 2016.
- [44] Quinn Simon, Teytelboym Alex, Kasy Maximilian, Gordon Grant, and Caria Stefano. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in Cameroon, 2019. URL https://www.socialscienceregistry.org/trials/3870. Study registration.
- [45] Rakhlin Alexander, Sridharan Karthik, and Tewari Ambuj. Sequential complexities and uniform martingale laws of large numbers. 2013.
- [46] Rakhlin Alexander, Sridharan Karthik, and Tewari Ambuj. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1–2):111–153, 2015.
- [47] Rigollet Philippe and Zeevi Assaf. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.
- [48] Robins James M, Hernan Miguel Angel, and Brumback Babette. Marginal structural models and causal inference in epidemiology, 2000.
- [49] Rubin Daniel B and van der Laan Mark J. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics, 4(1), 2008.
- [50] Simchi-Levi David and Xu Yunzong. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN, 2020.
- [51] Su Yi, Wang Lequn, Santacatterina Michele, and Joachims Thorsten. CAB: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, pages 6005–6014. PMLR, 2019.
- [52] Swaminathan Adith and Joachims Thorsten. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research, 16(1):1731–1755, 2015.
- [53] Tewari Ambuj and Murphy Susan A. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
- [54] Tsybakov Alexander B. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
- [55] van de Geer Sara. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.
- [56] van der Vaart Aad W and Wellner Jon A. Weak Convergence and Empirical Processes. Springer, 1996.
- [57] van Handel Ramon. On the minimal penalty for Markov order estimation. Probability Theory and Related Fields, 150(3–4):709–738, 2011.
- [58] Zhan Ruohan, Ren Zhimei, Athey Susan, and Zhou Zhengyuan. Policy learning with adaptively collected data. arXiv preprint arXiv:2105.02344, 2021.
- [59] Zhao Yingqi, Zeng Donglin, Rush A John, and Kosorok Michael R. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.