Post-Contextual-Bandit Inference

Aurélien Bibaut; Antoine Chambaz; Maria Dimakopoulou; Nathan Kallus; Mark van der Laan

. Author manuscript; available in PMC: 2022 Jul 1.

Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;34:28548–28559.

Post-Contextual-Bandit Inference

Aurélien Bibaut ¹, Antoine Chambaz ², Maria Dimakopoulou ³, Nathan Kallus ⁴, Mark van der Laan ⁵

PMCID: PMC9249103 NIHMSID: NIHMS1816256 PMID: 35785105

Abstract

Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking because they can both improve outcomes for study participants and increase the chance of identifying good or even best policies. To support credible inference on novel interventions at the end of the study, nonetheless, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or value of new policies. The adaptive nature of the data collected by contextual bandit algorithms, however, makes this difficult: standard estimators are no longer asymptotically normally distributed and classic confidence intervals fail to provide correct coverage. While this has been addressed in non-contextual settings by using stabilized estimators, the contextual setting poses unique challenges that we tackle for the first time in this paper. We propose the Contextual Adaptive Doubly Robust (CADR) estimator, the first estimator for policy value that is asymptotically normal under contextual adaptive data collection. The main technical challenge in constructing CADR is designing adaptive and consistent conditional standard deviation estimators for stabilization. Extensive numerical experiments using 57 OpenML datasets demonstrate that confidence intervals based on CADR uniquely provide correct coverage.

1. Introduction

Contextual bandits, where personalized decisions are made sequentially and simultaneously with data collection, are increasingly used to address important decision-making problems where data is limited and/or expensive to collect, with applications in product recommendation [Li et al., 2010], revenue management [Kallus and Udell, 2020, Qiang and Bayati, 2016], and personalized medicine [Tewari and Murphy, 2017]. Adaptive experiments, whether based on bandit algorithms or Bayesian optimization, are increasingly being considered in place of classic randomized trials in order to improve both the outcomes for study participants and the chance of identifying the best treatment allocations [Athey et al., 2018, Quinn et al., 2019, Kasy and Sautmann, 2021, Bakshy et al., 2018].

But, at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new personalized interventions. Such confidence intervals are, for example, crucial for enabling credible inference on the presence or absence of improvement of novel policies. However, due to the adaptive nature of the data collection, unlike classic randomized trials, standard estimates and their confidence intervals actually fail to provide correct coverage, that is, contain the true parameter with the desired confidence probability (e.g., 95%). A variety of recent work has recognized this and offered remedies [Hadad et al., 2019, Luedtke and van der Laan, 2016], but only for the case of non-contextual adaptive data collection. Like classic confidence intervals, when data comes from a contextual bandit – or any other context-dependent adaptive data collection – these intervals also fail to provide correct coverage. In this paper, we propose the first asymptotically normal estimator for the value of a (possibly contextual) policy from context-dependent adaptively collected data. This asymptotic normality leads directly to the construction of valid confidence intervals.

Our estimator takes the form of a stabilized doubly robust estimator, that is, a weighted time average of an estimate of the so-called canonical gradient using plug in estimators for the outcome model, where each time point is inversely weighted by its estimated conditional standard deviation given the past. We term this the Contextual Adaptive Doubly Robust (CADR) estimator. We show that, given consistent conditional variance estimates which at each time point only depend on previous data, the CADR estimator is asymptotically normal, and as a result we can easily construct asymptotically valid confidence intervals. This normality is in fact robust to misspecifying the outcome model. A significant technical challenge is actually constructing such variance estimators. We resolve this using an adaptive variance estimator based on the importance-sampling ratio of current to past (adaptive) policies at each time point. We also show that we can reliably estimate outcome models from the adaptively-collected data so that we can plug them in. Extensive experiments using 57 OpenML datasets demonstrate the failure of previous approaches and the success of ours at constructing confidence intervals with correct coverage.

1.1. Problem Statement and Notation

The data.

Our data consists of a sequence of observations indexed t = 1, …, T comprising of context $X (t) \in X$ , action $A (t) \in A$ , and outcome $Y (t) \in Y \subset ℝ$ generated by an adaptive experiment, such as a contextual bandit algorithm. Roughly, at each round t = 1, 2, …, T, an agent formed a contextual policy g_t(a | x) based on all past observations, then observed an independently drawn context vector X(t) ~ Q_0,X, carried out an action A(t) drawn from its current policy g_t(· | X(t)), and observed an outcome Y (t) ~ Q_0,Y (· | A(t), X(t)) depending only on the present context and action. The action and context measurable spaces $X$ , $A$ are arbitrary, e.g., finite or continuous.

More formally, we let O(t) ≔ (X(t), A(t), Y (t)) and make the following assumptions about the sequence O(1), …, O(T) comprising our dataset. First, we assume X(t) is independent of all else given A(t) and has a time-independent marginal distribution that we denote by Q_0,X. Second, we assume A(t) is independent of all else given O(1), …, O(t − 1), X(t) and we set g_t(· | X(t)) to its (random) conditional distribution given O(1), …, O(t − 1), X(t). Third, we assume Y (t) is independent of all else given X(t), A(t) and has a time-independent conditional distribution given X(t) = x, A(t) = a that is denoted by Q_0,Y (· | A, X). The distributions Q_0,X and Q_0,Y are unknown, while the policies g_t(a | x) are known, as would be the case when running an adaptive experiment. To simplify presentation we endow $A$ with a base measure $μ_{A}$ (e.g., counting for finite actions or Lebesgue for continuous actions) and identify policies g_t with conditional densities with respect to (w.r.t.) $μ_{A}$ . In the case of K < ∞ actions, policies are maps from $X$ to the K-simplex.

Note that, as the agent updates its policy based on already collected observations, g_t is a random O(1), …, O(t − 1)-measurable object. This is the major departure from the setting considered in other literature on off-policy evaluation, which only consider a fixed logging policy, g_t = g, that is independent of the data. See Section 1.2.

The target parameter.

We are interested in inference on a generalized average causal effect expressed as a functional of the unknown distributions above, Ψ₀ = Ψ(Q_0,X, Q_0,Y), where for any distributions Q_X, Q_Y, we define

Ψ (Q_{X}, Q_{Y}) ≔ \int y Q_{X} (d x) g (a ∣ x) d μ_{A} (a) Q_{Y} (d y ∣ a, x),

where $g^{*} (a ∣ x) : A \times X \to [- G, G]$ is a given fixed, bounded function. Two examples are: (a) when g* is a policy (conditional density), then Ψ₀ is its value; (b) when g* is the difference between two policies then Ψ₀ is the difference between their values. A prominent example of the latter is when $A = {+ 1, - 1}$ and g*(a | x) = a, which is known as the average treatment effect. If we include an indicator for x being in some set, then we get the subgroup effect.

Defining the conditional mean outcome,

{\bar{Q}}_{0} (a, x) ≔ E_{Q_{0, Y} (\cdot ∣ x, a)} [Y] = \int y Q_{0, Y} (d y ∣ a, x),

we note that the target parameter only depends on Q_0,Y via ${\bar{Q}}_{0}$ , so we also overload notation and write $Ψ (Q_{X}, \bar{Q}) = \int \bar{Q} (a, x) Q_{X} (d x) g (a ∣ x) d μ_{A} (a)$ for any function $\bar{Q} : A \times X \to Y$ . Note that when $| A | < \infty$ and $μ_{A}$ is the counting measure, the integral over a is a simple sum.

Canonical gradient.

We will make repeated use of the following function: for any conditional density (a, x) ↦ g(a | x), any probability distribution Q_X over the context space $X$ , and any function $\bar{Q} : A \times X$ , we define the function $D^{'} (g, \bar{Q}) : O \to ℝ$ by

D^{'} (g, \bar{Q}) (x, a, y) ≔ \frac{g^{*} (a ∣ x)}{g (a ∣ x)} (y - \bar{Q} (a, x)) + \int \bar{Q} (a^{'}, x) g^{*} (a^{'} ∣ x) d μ_{A} (a^{'}) .

Further, define $D (g, Q_{X}, \bar{Q}) = D^{'} (g, Q_{X}, \bar{Q}) - Ψ (Q_{X}, \bar{Q})$ , which coincides with the so-called canonical gradient of the target parameter Ψ w.r.t. the usual nonparametric statistical model comprising all joint distributions over $O$ [van der Vaart, 2000, van der Laan and Robins, 2003].

Integration operator notation.

For any policy g and distributions Q_X, Q_Y, denote by P_Q,g the induced distribution on $O$ . For any function $f : O \to ℝ$ , we use the integration operator notation

P_{Q, g} f = \int f (x, a, y) Q_{X} (d x) g (a ∣ x) d μ_{A} (a) Q_{Y} (d y ∣ a, x),

that is, the expectation w.r.t. P_Q,g alone. Then, for example, for any O(1), …, O(s − 1)-measurable random function $f : O \to ℝ$ , we have that $P_{Q_{0}, g_{s}} f = E_{Q_{0}, g_{s}} [f (O (s)) ∣ O (1), \dots, O (s - 1)]$ .

1.2. Related Literature and Challenges for Post-Contextual-Bandit Inference

Off-policy evaluation.

In non-adaptive settings, where g_t = g is fixed and does not depend on previous observations, common off-the shelf estimators for the mean outcome under g* include the Inverse Propensity Scoring (IPS) estimator [Beygelzimer and Langford, 2009, Li et al., 2011] and and the Doubly Robust (DR) estimator [Dudík et al., 2011, Robins et al., 1994]:

{\hat{Ψ}}^{IPS} ≔ \frac{1}{T} \sum_{t = 1}^{T} D^{'} (g, 0), {\hat{Ψ}}^{DR} ≔ \frac{1}{T} \sum_{t = 1}^{T} D^{'} (g, \hat{\bar{Q}})

where $\hat{\bar{Q}}$ is an estimator of the outcome model ${\bar{Q}}_{0} (a, x)$ . If we use cross-fitting to estimate $\hat{\bar{Q}}$ [Chernozhukov et al., 2018], then both the IPS and DR estimators are unbiased and asymptotically normal, permitting straightforward inference using Wald confidence intervals (i.e., ±1.96 of the estimated standard error). There also exist many variants of the IPS and DR estimators that, rather than plugging in the importance sampling (IS) ratios (g*/g_t)(A(t) | X(t)) and/or outcome-model estimators, instead choose them directly with the aim to minimize error [e.g. Kallus, 2018, Farajtabar et al., 2018, Thomas and Brunskill, 2016, Wang et al., 2017, Kallus and Uehara, 2019b].

Inference challenges in adaptive settings.

In the adaptive setting, it is easy to see that, if in the tth term for DR we use an outcome model ${\hat{\bar{Q}}}_{t - 1}$ fit using only the observations O(1), …, O(t − 1), then both the IPS and DR estimators both remain unbiased. However, neither generally converges to a normal distribution. One key difference between the non-adaptive and adaptive settings is that the IS ratios (g*/g_t)(A(t) | X(t)) can both diverge to infinity or converge to zero. As a result of this, the above two estimators may either be dominated by their first terms or their last terms. At a more theoretical level, this violates the classical condition of martingale central limit theorems that the conditional variance of the terms given previous observations stabilizes asymptotically.

Stabilized DR estimators in non-contextual settings.

The issue for inference due to instability of the DR estimator terms was recognized by Luedtke and van der Laan [2016] in another setting. They work in the non-adaptive setting but consider the problem of inferring the maximum mean outcome over all policies when the optimal policy is non-unique. Their proposal is a so-called stabilized estimator, in which each term is inversely weighted by an estimate of its conditional standard deviation given the previous terms. This stabilization trick has been also been reused for off-policy inference from non-contextual bandit data by Hadad et al. [2019], as the stabilized estimator remains asymptotically normal, permitting inference. In their non-contextual setting, an estimate of the conditional standard deviation of the terms can easily be obtained by the inverse square root propensities. In contrast, in our contextual setting, obtaining valid stabilization weights is more challenging and requires a construction involving adaptive training on past data.

1.3. Contributions

In this paper, we construct and analyze a stabilized estimator for policy evaluation from context-dependent adaptively collected data, such as the result of running a contextual bandit algorithm. This then immediately enables inference. After constructing a generic extension of the stabilization trick, the main technical challenge is to construct a sequence of estimators ${\hat{σ}}_{1}, \dots, {\hat{σ}}_{T}$ of the conditional standard deviations that are both consistent and such that for each t, ${\hat{σ}}_{t}$ only uses the previous data points O(1), …, O(t − 1). We show in extensive experiments across a large set of contextual bandit environments that our confidence intervals uniquely achieve close to nominal coverage.

2. Construction and Analysis of the Generic Contextual Stabilized Estimator

In this section, we give a generic construction of a stabilized estimator in our contextual and adaptive setting. That is, given generic plug-ins for outcome model and conditional standard deviation. We then provide conditions under which the estimator is asymptotically normal, as desired. To develop CADR, we will then proceed to construct appropriate plug in estimators in the proceeding sections.

2.1. Construction of the Estimator

Outcome and variance estimators.

Our estimator uses a sequence ${({\hat{\bar{Q}}}_{t})}_{t \geq 1}$ of estimators of the outcome model ${\bar{Q}}_{0}$ , such that, for every t, ${\hat{\bar{Q}}}_{t}$ is O(1), …, O(t)-measurable, that is, is trained using only the data up to time t. A key part of our estimator are the conditional variance estimators.

Additionally, we require estimates of the conditional standard deviation of the canonical gradient. Define

σ_{0, t} ≔ σ_{0, t} (g_{t})

where $σ_{0, t}^{2} (g) ≔ {Var}_{Q_{0}, g} (D^{'} (g, {\hat{\bar{Q}}}_{t - 1}) (O (t)) ∣ O (1), \dots, O (t - 1))$ .

Let ${({\hat{σ}}_{t})}_{t \geq 1}$ be a given sequence of estimates of σ_0,t such that ${\hat{σ}}_{t}$ is O(1), …, O(t − 1)-measurable, that is, is estimated using only the data up to time t.

The generic form of the estimator.

The generic contextual stabilized estimator is then defined as:

{\hat{Ψ}}_{T} ≔ {(\frac{1}{T} \sum_{t = 1}^{T} {\hat{σ}}_{t}^{- 1})}^{- 1} \frac{1}{T} \sum_{t = 1}^{T} {\hat{σ}}_{t}^{- 1} D^{'} (g, {\hat{\bar{Q}}}_{t - 1}) .

(1)

2.2. Asymptotic normality guarantees

We next characterize the asymptotic distribution of ${\hat{Ψ}}_{T}$ under some assumptions.

Assumption 1 (Non degenerate efficiency bound). ${inf}_{g} P_{Q_{0}, g} D^{2} (g, {\bar{Q}}_{0}, Q_{0, X}) > 0$ .

Assumption 1 states that there is no fixed logging policy g such that the efficiency bound for estimation of $Ψ ({\bar{Q}}_{0}, Q_{0, X})$ in the nonparametric model, from i.i.d. draws of $P_{Q_{0}, g}$ , is zero. If assumption 1 does not hold, there exists a logging policy g such that, if $O = (X, A, Y) ~ P_{Q_{0}, g}$ , then (g*(A | X)/g(A | X))Y equals $Ψ ({\bar{Q}}_{0}, Q_{0, X})$ with probability 1. In other words, if assumption 1 does not hold, there exists a logging policy g such that $Ψ ({\bar{Q}}_{0}, Q_{0, X})$ can be estimated with no error with probability 1 from a single draw of $P_{Q_{0}, g}$ . Thus, it is very lax. An easy sufficient condition for Assumption 1 is that the outcome model has nontrivial variance in that ${Var}_{Q_{0, X}} (\int \bar{Q} (a, X) g^{*} (a ∣ x) d μ_{A} (a)) > 0$ .

Assumption 2 (Consistent standard deviation estimators.). ${\hat{σ}}_{t} - σ_{0, t} \overset{t \to \infty}{\to} 0$ almost surely.

In the next section we will proceed to construct specific estimators ${\hat{σ}}_{t}$ that satisfy Assumption 2, leading to our proposed CADR estimator and confidence intervals.

Assumption 3 (Exploration rate). For any t ≥ 1, we have that ${inf}_{a \in A, x \in X} g_{t} (a ∣ x) ≳ t^{- 1 / 2}$ almost surely.

Here, a_t ≳ b_t means that for some constant c > 0, we have a_t ≥ cb_t for all t ≥ 1. Assumption 3 requires that the exploration rate of the adaptive experiment does not decay too quickly.

Based on these assumptions, we have the following asymptotic normality result:

Theorem 1. Denote $Γ_{T} ≔ {(T^{- 1} \sum_{t = 1}^{T} {\hat{σ}}_{t}^{- 1})}^{- 1}$ . Under Assumptions 1 to 3, it holds that $Γ_{T}^{- 1} \sqrt{T} ({\hat{Ψ}}_{T} - Ψ_{0}) \overset{d}{\to} N (0, 1)$ .

Remark 1. Theorem 1 does not require the outcome model estimator to converge at all. As we will see in Section 3, our conditional variance estimator does require that the outcome model converges to a fixed limit ${\bar{Q}}_{1}$ , but this limit does not have to be the true outcome model ${\bar{Q}}_{0}$ . In other words, consistency of the outcome model is not required at any point of our analysis.

3. Construction of the Conditional Variance Estimator and CADR

We now tackle the construction of ${\hat{σ}}_{t}$ satisfying our assumptions; namely, they must be adaptively trained only on past data at each t and they must be consistent. Observe that $σ_{0, t}^{2} = σ_{0}^{2} (g_{t}, {\hat{\bar{Q}}}_{t - 1})$ , where we define

σ_{0}^{2} (g, \bar{Q}) ≔ Φ_{0, 1} (g, \bar{Q}) - {(Φ_{0, 2} (g, \bar{Q}))}^{2}, Φ_{0, i} (g, \bar{Q}) ≔ P_{Q_{0}, g} {(D^{'})}^{i} (g, \bar{Q}), i = 1, 2.

Designing an O(1), …, O(t − 1)-measurable estimator of $σ_{0, t}^{2}$ presents several challenges. First, while we can only use observations O(1), …, O(t − 1) to estimate it, $σ_{0, t}^{2}$ is defined as a function of integrals w.r.t. $P_{Q_{0}, g_{t}}$ , from which we have only one observation, namely O(t). Second, our estimation target $σ_{0, t}^{2} = σ_{0} (g_{t}, {\hat{\bar{Q}}}_{t - 1})$ is random as it depends on g_t and ${\hat{\bar{Q}}}_{t}$ . Third, g_t, ${\hat{\bar{Q}}}_{t}$ depend on the same observations O(1), …, O(t − 1) that we have at our disposal to estimate $σ_{0, t}^{2}$ .

Representation via importance sampling.

We can overcome the first difficulty via importance sampling, which allows us to write $Φ_{0, i} (g, \bar{Q})$ , i = 1, 2 as integrals w.r.t. $P_{Q_{0}, g_{s}}$ , s = 1, …, t − 1, i.e., the conditional distributions of observations O(s), s = 1, …, t − 1 given their respective past. Namely, for any s ≥ 1, i = 1, 2, we have that

Φ_{0, i} (g, \bar{Q}) = P_{Q_{0}, g_{s}} \frac{g}{g_{s}} {(D^{'})}^{i} (g, \bar{Q}) .

(2)

Dealing with the randomness of the estimation target.

We now turn to second challenge. Since $σ_{0, t}^{2}$ can be written in terms of $Φ_{0, i} (g_{t}, {\hat{\bar{Q}}}_{t - 1})$ for i = 1, 2, Eq. (2) suggests perhaps an approach based on sample averages of $(g_{t} / g_{s}) {(D^{'})}^{i} (g_{t}, {\hat{\bar{Q}}}_{t - 1})$ over s. However, whenever s < t, the latter is an O(1), …, O(t − 1)-measurable function due to the dependence on g_t and ${\hat{\bar{Q}}}_{t}$ . Namely, $P_{Q_{0}, g_{s}} {(g_{t} / g_{s}) {(D^{'})}^{i} (g_{t}, {\hat{\bar{Q}}}_{t - 1})}$ does not coincide in general with the conditional expectation $E_{Q_{0}, g_{s}} [((g_{t} / g_{s}) {(D^{'})}^{i} (g_{t}, {\hat{\bar{Q}}}_{t - 1})) (O (s)) ∣ \bar{O} (s - 1)]$ , as would arise from a sample average. We now look at solutions to overcome this difficulty, considering first ${\hat{\bar{Q}}}_{t - 1}$ and then g_t.

Dealing with the randomness of ${\hat{\bar{Q}}}_{t - 1}$ .

We propose an estimator of $σ_{0}^{2} (g, {\hat{\bar{Q}}}_{t - 1})$ for any fixed g. While requiring that ${\hat{\bar{Q}}}_{t - 1}$ converges to the true outcome regression function ${\bar{Q}}_{0}$ is a strong requirement, most reasonable estimators will at least converge to some fixed limit ${\bar{Q}}_{1}$ . As a result, under an appropriate stochastic convergence condition on $({\hat{\bar{Q}}}_{t - 1}) t \geq 1$ , $Φ_{0, i} (g, {\hat{\bar{Q}}}_{t - 1})$ can be reasonably approximated by the corresponding Cesaro averages, defined for i = 1, 2 as

{\bar{Φ}}_{0, i, t} (g) ≔ \frac{1}{t - 1} \sum_{s = 1}^{t - 1} Φ_{0, i} (g, {\hat{\bar{Q}}}_{s - 1}) = \frac{1}{t - 1} \sum_{s = 1}^{t} E_{Q_{0}, g_{s}} [((g / g_{s}) {(D^{'})}^{i} (g, {\hat{\bar{Q}}}_{s - 1})) (O (s)) ∣ \bar{O} (s - 1)] .

These are easy to estimate from the corresponding sample averages, defined for i = 1, 2 as

{\hat{Φ}}_{i, t} (g) ≔ \frac{1}{t - 1} \sum_{s = 1}^{t} ((g / g_{s}) {(D^{'})}^{i} (g, {\hat{\bar{Q}}}_{s - 1})) (O (s)),

since for each i = 1, 2, the difference ${\hat{Φ}}_{i, t} (g) - {\bar{Φ}}_{0, i, t} (g)$ is the average of a martingale difference sequence (MDS). We then define our estimator of $σ_{0}^{2} (g, {\hat{\bar{Q}}}_{t - 1})$ as

{\hat{σ}}_{t} (g) ≔ {\hat{Φ}}_{1, t} (g) - {({\hat{Φ}}_{2, t} (g))}^{2} .

(3)

From fixed g to random g_t.

So far, we have proposed and justified the construction of ${\hat{σ}}_{t} (g)$ as an estimator of $σ_{0, t} (g, {\hat{\bar{Q}}}_{t - 1})$ for a fixed g. We now discuss conditions under which ${\hat{σ}}_{t} (g_{t})$ is valid estimator of $σ_{0, t} (g_{t}, {\hat{\bar{Q}}}_{t - 1})$ . When g is fixed, for each i = 1, 2, the error ${\hat{Φ}}_{i, t} (g) - Φ_{0, i, t} (g, {\hat{\bar{Q}}}_{t - 1})$ decomposes as the sum of the MDS average ${\hat{Φ}}_{i, t} (g) - {\bar{Φ}}_{0, i, t} (g)$ and of the Cesaro approximation error ${\bar{Φ}}_{0, i, t} (g) - Φ_{0, i} (g, {\hat{\bar{Q}}}_{t - 1})$ . Both differences are straightforward to bound. For a random g_t, the term ${\hat{Φ}}_{i, t} (g_{t}) - {\bar{Φ}}_{0, i, t} (g_{t})$ is no longer an MDS average. Fortunately, under a complexity condition on the logging policy class $G$ , we can bound the supremum of the martingale empirical processes ${| {\hat{Φ}}_{i, t} (g) - {\bar{Φ}}_{0, i, t} (g) | : g \in G}$ , which in turn gives us a bound on $| {\hat{Φ}}_{i, t} (g_{t}) - {\bar{Φ}}_{0, i, t} (g_{t}) |$ .

Consistency guarantee for ${\hat{σ}}_{t}^{2}$ .

Our formal consistency result relies on the following assumptions.

Assumption 4 (Outcome regression estimator convergence). There exists β > 0, and a fixed function ${\bar{Q}}_{1} : A \times X \to ℝ$ such that ${‖ {\hat{\bar{Q}}}_{t} - {\bar{Q}}_{1} ‖}_{1, Q_{0, X}, g^{*}} = O (t^{- β})$ almost surely.

The next assumption is a bound on the bracketing entropy (see, e.g., [van der Vaart and Wellner, 1996] for definition) of the logging policy class.

Assumption 5 (Complexity of the logging policy class). There exists a class of conditional densities $G$ such that $g_{t} \in G \forall t \geq 1$ almost surely, there exists G > 0 such that ${sup}_{g \in G} {‖ g / g^{ref} ‖}_{\infty} \leq G$ , and for some p > 0

log N_{[]} (ϵ, G / g^{ref}, ∥ \cdot ∥_{2, Q_{0, X}, g^{ref}}) ≲ ϵ^{- p},

where $G / g^{ref} ≔ {g / g^{ref} : g \in G}$ .

Next, we require a condition on the exploration rate that is stronger than Assumption 3.

Assumption 6 (Exploration rate (stronger)). For ant t ≥ 1, we have that ${inf}_{a \in A, x \in X} g_{t} (a ∣ x) / g^{ref} (a ∣ x) ≳ t^{- α (β, p)}$ almost surely, where α(β, p) ≔ min(1/((3 + p)), 1/(1 + 2p), β).

Theorem 2. Suppose that Assumptions 4 to 6 hold. Then, ${\hat{σ}}_{t}^{2} - σ_{0, t}^{2} = o (1)$ almost surely.

Remark 2. While we theoretically require the existence of a logging policy class $G$ with controlled complexity, we do not actually need to know $G$ to construct our estimator. Moreover, while we require a bound on the bracketing entropy of the logging policy class $G$ , we impose no restriction on the outcome regression model complexity, permitting us to use flexible black-box regression methods.

Remark 3. Assumption 4 requires $({\hat{\bar{Q}}}_{t})$ to be a sequence of regression estimator, such that for every t ≥ 1, ${\hat{\bar{Q}}}_{t}$ is fitted on O(1), …, O(t) and for which we can guarantee a rate of convergence to some fixed limit ${\bar{Q}}_{1}$ . Note that this can at first glance pose a challenge since observations O(1), …, O(t) are adaptively collected. In the appendix, we give guarantees for outcome regression estimation over a nonparametric model using an importance sampling weighted empirical risk minimization.

CADR asymptotics.

Our proposed CADR estimator is now given by plugging our estimates ${\hat{σ}}_{t}$ from Eq. (3) into Eq. (1), as summarized in Algorithm 1 As an immediate corollary of Theorems 1 and 2 we have our main guarantee for this final estimator, showing CADR is asymptotically normal, whence we immediately obtain asymptotically valid confidence intervals.

Corollary 1 (CADR Asymptotics and Inference). Suppose that Assumptions 1 and 4 to 6 hold. Let ${\hat{σ}}_{t}$ be given as in Eq. (3). Denote $Γ_{T} ≔ {(T^{- 1} \sum_{t = 1}^{T} {\hat{σ}}_{t}^{- 1})}^{- 1}$ . Then, $Γ_{T}^{- 1} \sqrt{T} ({\hat{Ψ}}_{T} - Ψ_{0}) \overset{d}{\to} N (0, 1) .$

Moreover, letting ζ_α denote the α-quantile of the standard normal distribution,

Pr [Ψ (Q_{0, X}, {\bar{Q}}_{0}) \in [{\hat{Ψ}}_{T} \pm ζ_{1 - α / 2} Γ_{T} / \sqrt{T}]] \overset{T \to \infty}{\to} 1 - α .

4. Empirical Evaluation

We next present computational results on public datasets that demonstrate the robustness of CADR confidence intervals using contextual bandit data with comparison to several baselines. Our experiments focus on the case of finitely-many actions, $A = {1, \dots, K}$ .

4.1. Baseline Estimators

We compare CADR to several benchmarks. All take the following form for a choice of w_t, ω_t, ${\hat{\bar{Q}}}_{t}$ :

{\hat{Ψ}}_{T} = {(\frac{1}{T} w_{t})}^{- 1} \frac{1}{T} \sum_{t = 1}^{T} w_{t} {\tilde{D}}_{t}^{'}, {CI}_{α} = [{\hat{Ψ}}_{T} \pm ζ_{1 - α / 2} \sqrt{\frac{\sum_{t = 1}^{T} w_{t}^{2} {({\tilde{D}}_{t}^{'} - \hat{Ψ})}^{2}}{{(\sum_{t = 1}^{T} w_{t})}^{2}}}],

where ${\tilde{D}}_{t}^{'} = ω_{t} (Y (t) - {\hat{\bar{Q}}}_{t - 1} (A (t), X (t))) + \sum_{a = 1}^{K} {\hat{\bar{Q}}}_{t - 1} (a, X (t)) g^{*} (a ∣ X (t))$ .

The Direct Method (DM) sets w_t = 1, ω_t = 0 and fits ${\hat{\bar{Q}}}_{t - 1} (a, \cdot)$ by running some regression method for each a on the data {(X(s), Y (s)) : 1 ≤ s ≤ t − 1, A(s) = a}. We will use either linear regression or decision-tree regression, both using default sklearn parameters. Note that even in non-contextual settings, where ${\hat{\bar{Q}}}_{t - 1}$ is a simple per-arm sample average, ${\hat{\bar{Q}}}_{t - 1}$ may be biased due to adaptive data collection [Xu et al., 2013, Luedtke and van der Laan, 2016, Bowden and Trippa, 2017, Nie et al., 2018, Hadad et al., 2019, Shin et al., 2019]. Inverse Propensity Score Weighting (IPW) sets w_t = 1, ω_t = (g*/g_t)(A(t) | X(t)), ${\hat{\bar{Q}}}_{t} = 0$ . Doubly Robust (DR) sets w_t = 1, ω_t = (g*/g_t)(A(t) | X(t)) and fits ${\hat{\bar{Q}}}_{t - 1}$ as in DM. More Robust Doubly Robust (MRDR) [Farajtabar et al., 2018] is the same as DR but when fitting ${\hat{\bar{Q}}}_{t - 1}$ we reweight each data point by $\frac{g^{*} (A (s) ∣ X (s)) (1 - g_{s} (A (s) ∣ X (s)))}{g_{s} {(A (s) ∣ X (s))}^{2}}$ . None of the above are generally asymptotically normal under adaptive data collection [Hadad et al., 2019]. Adaptive Doubly Robust (ADR; a.k.a. stabilized onestep estimator for multi-armed bandit data) [Luedtke and van der Laan, 2016, Hadad et al., 2019] is the same as DR but sets $w_{t} = g_{t}^{- 1 / 2} (A (t) ∣ X (t))$ . ADR is unbiased and asymptotically normal for multi-armed bandit logging policies but is biased for context-measurable adaptive logging policies, which is the focus of this paper. Finally, note that our proposal CADR takes the same form as DR but with $w_{t} = {\hat{σ}}_{t}^{- 1}$ using our adaptive conditional standard deviation estimators ${\hat{σ}}_{t}$ in Eq. (3).

4.2. Contextual Bandit Data from Multiclass Classification Data

To construct our data, we turn K-class classification tasks into a K-armed contextual bandit problems [Dudík et al., 2014, Dimakopoulou et al., 2017, Su et al., 2019], which has the benefits of reproducibility using public datasets and being able to make uncontroversial comparisons using actual ground truth data with counterfactuals. We use the public OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18; BSD 3-Clause license) [Bischl et al., 2017], which has datasets that vary in domain, number of observations, number of classes and number of features. Among these, we select the classification datasets which have less than 100 features. This results in 57 classification datasets from OpenML-CC18 used for evaluation and Table 1 summarizes the characteristics of these datasets.

Table 1:

Characteristics of the 57 OpenML-CC18 datasets used for evaluation.

Samples	Count	Classes	Count	Features	Count
< 1000	17	= 2	31	≥ 2 and < 10	14
≥ 1000 and < 10000	30	> 2 and < 10	17	≥ 10 and < 50	34
≥ 10000	10	≥ 10	9	≥ 50 and ≤ 100	9

Open in a new tab

Each dataset is a collection of pairs of covariates X and labels L ∈ {1, …, K}. We transform each dataset to the contextual bandit problem as follows. At each round, we draw X(t), L(t) uniformly at random with replacement from the dataset. We reveal the context X(t) to the agent, and given an arm pull A(t), we draw and return the reward $Y (t) ~ N (1 {A (t) = L (t)}, 1)$ . To generate our data, we set T = 10000 and use the following ϵ-greedy procedure. We pull arms uniformly at random until each arm has been pulled at least once. Then at each subsequent round t, we fit ${\hat{\bar{Q}}}_{t - 1}$ using the data up to that time in the same fashion as used for the DM estimator above using decision-tree regressions. We set ${\tilde{A}}_{x} (t) = arg {max}_{a = 1, \dots, K} {\hat{\bar{Q}}}_{t - 1} (a, X (t))$ and ϵ_t = 0.01 · t^−1/3. We then let g_t(a | x) = ϵ_t/K for $a \neq {\tilde{A}}_{x} (t)$ and $g_{t} ({\tilde{A}}_{x} (t) ∣ x) = 1 - ϵ_{t} + ϵ_{t} / K$ . That is, with probability ϵ_t we pull a random arm, and otherwise we pull ${\tilde{A}}_{X (t)} (t)$ .

We then consider four candidate policies to evaluate: (1) “arm 1 non-contextual”: g*(1 | x) = 1 and otherwise g*(a | x) = 0 (note that the meaning of label “1” changes by dataset), (2) “arm 2 non-contextual”: g*(2 | x) = 1 and otherwise g*(a | x) = 0, (3) “linear contextual”: we sample a new dataset of size T using a uniform exploration policy, then fit ${\hat{\bar{Q}}}_{T}$ as above using linear regression, fix $a^{*} = arg {max}_{a \in {1, \dots, K}} {\hat{\bar{Q}}}_{T} (a, x)$ , and set g*(a* | x) = 1 and otherwise g*(a | x) = 0, (4) “tree contextual”: same as “linear contextual” but fit ${\hat{\bar{Q}}}_{T}$ using decision-tree regression.

4.3. Results

Figure 1 shows the comparison of CADR estimator against DM, IPW, DR, ADR, and MRDR w.r.t. coverage, that is, the frequency over 64 replications of the 95% confidence interval covering the true Ψ₀, for each of the 57 OpenML-CC18 datasets and 4 target policies. In each subfigure, each dot represents a dataset, the y-axis corresponds to the coverage of the CADR estimator and the x-axis corresponds to the coverage of one of the baseline estimators. The lines represent one standard error over the 64 replications. The dot is depicted in blue if for that dataset CADR has significantly better coverage than the baseline estimator, in red if it has significantly worse coverage, and in black if the difference in coverage of both estimators is within one standard error. In Fig. 1, outcome models for CADR, DM, DR, ADR, and MRDR are fit using linear regression (with default sklearn parameters). In the appendix, we provide additional empirical results where we use decision-tree regressions, or where we use the MRDR outcome model for CADR, or where we use cross-fold estimation across time.

Across all of our experiments, we observe that the confidence interval of CADR has better coverage of the ground truth than any other baseline, which can be attributed to its asymptotic normality. The second best estimator in terms of coverage is DR. The advantages of CADR over DR are most pronounced when either (a) there is a mismatch between the logging policy and the target policy (e.g., compare the 1st and 2nd rows in Fig. 1; the tree target policy is most similar to the logging policy, which also uses trees) or (b) when the outcome model is bad (either due to model misspecification such as with a linear model on real data or due to small sample size).

5. Conclusions

Adaptive experiments hold great promise for better, more efficient, and even more ethical experiments. However, they complicate post-experiment inference, which is a cornerstone of drawing credible conclusions from controlled experiments. We provided here the first asymptotically normal estimator for policy value and causal effects when data were generated from a contextual adaptive experiment, such as a contextual bandit algorithm. This led to simple and effective confidence intervals given by adding and subtracting multiples of the standard error, making contextual adaptive experiments a more viable option for experimentation in practice.

6. Societal Impact and Limitations

Adaptive experiments hold particular promise in settings where experimentation is costly and/or dangerous, such as in medicine and policymaking. By adapting treatment allocation, harmful interventions can be avoided, outcomes for study participants improved, and smaller studies enabled. Being able to draw credible conclusions from such experiments make them viable replacements for classic randomized trials. Our confidence intervals offer one way to do so. At the same time, and especially subject to our assumption of vanishing but nonzero exploration, these experiments must be subject to the same ethical guidelines as classic randomized experiments. Additionally, the usual caveats of frequentist confidence intervals hold here, such as its interpretation only as a guarantee over data collection, this guarantee only being approximate in finite samples when we rely on asymptotic normality, and the risks of multiple comparisons and of p-hacking. Finally, we note that our inference focused on an average quantity, as such it focuses on social welfare and need not capture the risk to individuals or groups. Subgroup analyses may therefore be helpful in complementing the analysis; these can be conducted by setting g*(a | x) to zero for some x’s. Future work may be necessary to further extend our results to conducting inference on risk metrics such as quantiles of outcomes.

Supplementary Material

NIHMS1816256-supplement-1.pdf^{(1.7MB, pdf)}

Contributor Information

Aurélien Bibaut, Netflix.

Antoine Chambaz, Université Paris Descartes.

Maria Dimakopoulou, Netflix.

Nathan Kallus, Cornell University and Netflix.

Mark van der Laan, University of California, Berkeley.

References

Athey Susan, Baird Sarah, Jamison Julian, McIntosh Craig, Özler Berk, and Sama Dohbit. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in cameroon, 2018. URL http://pubdocs.worldbank.org/en/606341582906195532/Study-Protocol-Adaptive-experiment-on-FP-counseling-and-uptake-of-MCs.pdf. Study protocol.
Bakshy Eytan, Dworkin Lili, Karrer Brian, Kashin Konstantin, Letham Benjamin, Murthy Ashwin, and Singh Shaun. Ae: A domain-agnostic platform for adaptive experimentation. In Workshop on System for ML, 2018. [Google Scholar]
Beygelzimer Alina and Langford John. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138, 2009. [Google Scholar]
Bibaut Aurelien, Dimakopoulou Maria, Chambaz Antoine, Kallus Nathan, and van der Laan Mark. Risk minimization from adaptively collected data: Guarantees for supervised and policy learning. 2021. [PMC free article] [PubMed]
Bischl Bernd, Casalicchio Giuseppe, Feurer Matthias, Hutter Frank, Lang Michel, Mantovani Rafael G, van Rijn Jan N, and Vanschoren Joaquin. Openml benchmarking suites. arXiv preprint arXiv:1708.03731, 2017. [Google Scholar]
Bowden Jack and Trippa Lorenzo. Unbiased estimation for response adaptive clinical trials. Statistical methods in medical research, 26(5):2376–2388, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chernozhukov Victor, Chetverikov Denis, Demirer Mert, Duflo Esther, Hansen Christian, Newey Whitney, and Robins James. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. [Google Scholar]
Dimakopoulou Maria, Zhou Zhengyuan, Athey Susan, and Imbens Guido. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017. [Google Scholar]
Dudík Miroslav, Langford John, and Li Lihong. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011. [Google Scholar]
Dudík Miroslav, Erhan Dumitru, Langford John, Li Lihong, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014. [Google Scholar]
Farajtabar Mehrdad, Chow Yinlam, and Ghavamzadeh Mohammad. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018. [Google Scholar]
Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kallus Nathan. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895–8906, 2018. [Google Scholar]
Kallus Nathan and Udell Madeleine. Dynamic assortment personalization in high dimensions. Operations Research, 68(4):1020–1037, 2020. [Google Scholar]
Kallus Nathan and Uehara Masatoshi. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. arXiv preprint arXiv:1909.05850, 2019a. [Google Scholar]
Kallus Nathan and Uehara Masatoshi. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. Advances in neural information processing systems, 32, 2019b. [Google Scholar]
Kasy Maximilian and Sautmann Anja. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021. [Google Scholar]
Li Lihong, Chu Wei, Langford John, and Schapire Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010. [Google Scholar]
Li Lihong, Chu Wei, Langford John, and Wang Xuanhui. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011. [Google Scholar]
Luedtke Alexander R. and van der Laan Mark J.. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. The Annals of Statistics, 44(2):713–742, 2016. Doi: 10.1214/15-AOS1384. URL 10.1214/15-AOS1384. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nie Xinkun, Tian Xiaoying, Taylor Jonathan, and Zou James. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pages 1261–1269. PMLR, 2018. [Google Scholar]
Qiang Sheng and Bayati Mohsen. Dynamic pricing with demand covariates. arXiv preprint arXiv:1604.07463, 2016. [Google Scholar]
Quinn Simon, Teytelboym Alex, Kasy Maximilian, Gordon Grant, and Caria Stefano. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in cameroon, 2019. URL https://www.socialscienceregistry.org/trials/3870. Study registration.
Robins James M, Rotnitzky Andrea, and Zhao Lue Ping. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89 (427):846–866, 1994. [Google Scholar]
Shin Jaehyeok, Ramdas Aaditya, and Rinaldo Alessandro. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019. [Google Scholar]
Su Yi, Wang Lequn, Santacatterina Michele, and Joachims Thorsten. Cab: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, pages 6005–6014. PMLR, 2019. [Google Scholar]
Tewari Ambuj and Murphy Susan A. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017. [Google Scholar]
Thomas Philip and Brunskill Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016. [Google Scholar]
van der Laan Mark J and Robins James M. Unified methods for censored longitudinal data and causality. Springer Science & Business Media, 2003. [Google Scholar]
van der Vaart A and Wellner J. Weak Convergence and Empirical Processes. Springer-Verlag; New York, 03 1996. ISBN 9781475725452. [Google Scholar]
van der Vaart Aad W. Asymptotic statistics. Cambridge university press, 2000. [Google Scholar]
van Handel R. On the minimal penalty for Markov order estimation. Probability Theory and Related Fields, 150:709–738, 2011. [Google Scholar]
Wang Yu-Xiang, Agarwal Alekh, and Dudık Miroslav. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017. [Google Scholar]
Xu Min, Qin Tao, and Liu Tie-Yan. Estimation bias in multi-armed bandit algorithms for search advertising. Advances in Neural Information Processing Systems, 26:2400–2408, 2013. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1816256-supplement-1.pdf^{(1.7MB, pdf)}

[R1] Athey Susan, Baird Sarah, Jamison Julian, McIntosh Craig, Özler Berk, and Sama Dohbit. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in cameroon, 2018. URL http://pubdocs.worldbank.org/en/606341582906195532/Study-Protocol-Adaptive-experiment-on-FP-counseling-and-uptake-of-MCs.pdf. Study protocol.

[R2] Bakshy Eytan, Dworkin Lili, Karrer Brian, Kashin Konstantin, Letham Benjamin, Murthy Ashwin, and Singh Shaun. Ae: A domain-agnostic platform for adaptive experimentation. In Workshop on System for ML, 2018. [Google Scholar]

[R3] Beygelzimer Alina and Langford John. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138, 2009. [Google Scholar]

[R4] Bibaut Aurelien, Dimakopoulou Maria, Chambaz Antoine, Kallus Nathan, and van der Laan Mark. Risk minimization from adaptively collected data: Guarantees for supervised and policy learning. 2021. [PMC free article] [PubMed]

[R5] Bischl Bernd, Casalicchio Giuseppe, Feurer Matthias, Hutter Frank, Lang Michel, Mantovani Rafael G, van Rijn Jan N, and Vanschoren Joaquin. Openml benchmarking suites. arXiv preprint arXiv:1708.03731, 2017. [Google Scholar]

[R6] Bowden Jack and Trippa Lorenzo. Unbiased estimation for response adaptive clinical trials. Statistical methods in medical research, 26(5):2376–2388, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Chernozhukov Victor, Chetverikov Denis, Demirer Mert, Duflo Esther, Hansen Christian, Newey Whitney, and Robins James. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. [Google Scholar]

[R8] Dimakopoulou Maria, Zhou Zhengyuan, Athey Susan, and Imbens Guido. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017. [Google Scholar]

[R9] Dudík Miroslav, Langford John, and Li Lihong. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011. [Google Scholar]

[R10] Dudík Miroslav, Erhan Dumitru, Langford John, Li Lihong, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014. [Google Scholar]

[R11] Farajtabar Mehrdad, Chow Yinlam, and Ghavamzadeh Mohammad. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018. [Google Scholar]

[R12] Hadad Vitor, Hirshberg David A, Zhan Ruohan, Wager Stefan, and Athey Susan. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Kallus Nathan. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895–8906, 2018. [Google Scholar]

[R14] Kallus Nathan and Udell Madeleine. Dynamic assortment personalization in high dimensions. Operations Research, 68(4):1020–1037, 2020. [Google Scholar]

[R15] Kallus Nathan and Uehara Masatoshi. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. arXiv preprint arXiv:1909.05850, 2019a. [Google Scholar]

[R16] Kallus Nathan and Uehara Masatoshi. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. Advances in neural information processing systems, 32, 2019b. [Google Scholar]

[R17] Kasy Maximilian and Sautmann Anja. Adaptive treatment assignment in experiments for policy choice. Econometrica, 89(1):113–132, 2021. [Google Scholar]

[R18] Li Lihong, Chu Wei, Langford John, and Schapire Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010. [Google Scholar]

[R19] Li Lihong, Chu Wei, Langford John, and Wang Xuanhui. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011. [Google Scholar]

[R20] Luedtke Alexander R. and van der Laan Mark J.. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. The Annals of Statistics, 44(2):713–742, 2016. Doi: 10.1214/15-AOS1384. URL 10.1214/15-AOS1384. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Nie Xinkun, Tian Xiaoying, Taylor Jonathan, and Zou James. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pages 1261–1269. PMLR, 2018. [Google Scholar]

[R22] Qiang Sheng and Bayati Mohsen. Dynamic pricing with demand covariates. arXiv preprint arXiv:1604.07463, 2016. [Google Scholar]

[R23] Quinn Simon, Teytelboym Alex, Kasy Maximilian, Gordon Grant, and Caria Stefano. A sequential and adaptive experiment to increase the uptake of long-acting reversible contraceptives in cameroon, 2019. URL https://www.socialscienceregistry.org/trials/3870. Study registration.

[R24] Robins James M, Rotnitzky Andrea, and Zhao Lue Ping. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89 (427):846–866, 1994. [Google Scholar]

[R25] Shin Jaehyeok, Ramdas Aaditya, and Rinaldo Alessandro. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019. [Google Scholar]

[R26] Su Yi, Wang Lequn, Santacatterina Michele, and Joachims Thorsten. Cab: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, pages 6005–6014. PMLR, 2019. [Google Scholar]

[R27] Tewari Ambuj and Murphy Susan A. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017. [Google Scholar]

[R28] Thomas Philip and Brunskill Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016. [Google Scholar]

[R29] van der Laan Mark J and Robins James M. Unified methods for censored longitudinal data and causality. Springer Science & Business Media, 2003. [Google Scholar]

[R30] van der Vaart A and Wellner J. Weak Convergence and Empirical Processes. Springer-Verlag; New York, 03 1996. ISBN 9781475725452. [Google Scholar]

[R31] van der Vaart Aad W. Asymptotic statistics. Cambridge university press, 2000. [Google Scholar]

[R32] van Handel R. On the minimal penalty for Markov order estimation. Probability Theory and Related Fields, 150:709–738, 2011. [Google Scholar]

[R33] Wang Yu-Xiang, Agarwal Alekh, and Dudık Miroslav. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017. [Google Scholar]

[R34] Xu Min, Qin Tao, and Liu Tie-Yan. Estimation bias in multi-armed bandit algorithms for search advertising. Advances in Neural Information Processing Systems, 26:2400–2408, 2013. [Google Scholar]

PERMALINK

Post-Contextual-Bandit Inference

Aurélien Bibaut

Antoine Chambaz

Maria Dimakopoulou

Nathan Kallus

Mark van der Laan

Abstract

1. Introduction

1.1. Problem Statement and Notation

The data.

The target parameter.

Canonical gradient.

Integration operator notation.

1.2. Related Literature and Challenges for Post-Contextual-Bandit Inference

Off-policy evaluation.

Inference challenges in adaptive settings.

Stabilized DR estimators in non-contextual settings.

1.3. Contributions

2. Construction and Analysis of the Generic Contextual Stabilized Estimator

2.1. Construction of the Estimator

Outcome and variance estimators.

The generic form of the estimator.

2.2. Asymptotic normality guarantees

3. Construction of the Conditional Variance Estimator and CADR

Representation via importance sampling.

Dealing with the randomness of the estimation target.

Dealing with the randomness of Q¯^t−1.

From fixed g to random gt.

Consistency guarantee for σ^t2.

CADR asymptotics.

4. Empirical Evaluation

4.1. Baseline Estimators

4.2. Contextual Bandit Data from Multiclass Classification Data

Table 1:

4.3. Results

Figure 1:

5. Conclusions

6. Societal Impact and Limitations

Supplementary Material

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Dealing with the randomness of ${\hat{\bar{Q}}}_{t - 1}$ .

From fixed g to random g_t.

Consistency guarantee for ${\hat{σ}}_{t}^{2}$ .