Author manuscript; available in PMC: 2020 Aug 11.
Published in final edited form as: Proc Mach Learn Res. 2018;80:5353–5362.

Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization

Hang Wu 1, May D Wang 1
PMCID: PMC7419136  NIHMSID: NIHMS1595660  PMID: 32789292

Abstract

Off-policy learning, the task of evaluating and improving policies using historic data collected from a logging policy, is important because on-policy evaluation is usually expensive and has adverse impacts. One of the major challenges of off-policy learning is to derive counterfactual estimators that also have low variance and thus low generalization error. In this work, inspired by learning bounds for importance sampling problems, we present a new counterfactual learning principle for off-policy learning with bandit feedback. Our method regularizes the generalization error by minimizing the distribution divergence between the logging policy and the new policy, and removes the need for iterating through all training samples to compute the sample variance regularization used in prior work. With neural network policies, our end-to-end training algorithms using variational divergence minimization show significant improvement over conventional baseline algorithms and are also consistent with our theoretical results.

Introduction

Off-policy learning refers to evaluating and improving a deterministic policy using historic data collected from a stationary policy, which is important because in real-world scenarios on-policy evaluation is often expensive and has adverse impacts. For instance, evaluating a new treatment option, a clinical policy, by administering it to patients requires rigorous human clinical trials, in which patients are exposed to risks of serious side effects. As another example, online advertising A/B testing can incur high costs for advertisers and bring them few gains. Therefore, we need to utilize historic data to perform off-policy evaluation and learning that can enable safe exploration of the hypothesis space of policies before deploying them.

There have been extensive studies on off-policy learning in the context of reinforcement learning and contextual bandits, including methods such as Q-learning (Sutton and Barto 1998), the doubly robust estimator (Dudík et al. 2014), and the self-normalized estimator (Swaminathan and Joachims 2015b). A recently emerging direction of off-policy learning involves the use of logged interaction data with bandit feedback. However, in this setting, we can only observe limited feedback, often in the form of a scalar reward or loss, for every action; a larger amount of information about other possibilities is never revealed, such as what reward we could have obtained had we taken another action, the best action we should have taken, and the relationship between the change in policy and the change in reward. For example, after an item is suggested to a user by an online recommendation system, although we can observe the user’s subsequent interactions with this particular item, we cannot anticipate the user’s reaction to other items that could have been better options.

Using historic data to perform off-policy learning in the bandit feedback case faces a common challenge in counterfactual inference: how do we handle the distribution mismatch between the logging policy and a new policy, and the induced generalization error? To answer this question, (Swaminathan and Joachims 2015a) derived the counterfactual risk minimization framework, which adds the sample variance as a regularization term to the conventional empirical risk minimization objective. However, the parametrization of policies in their work as linear stochastic models has limited representation power, and the computation of the sample variance regularization requires iterating through all training samples. Although a first-order approximation technique was proposed in the paper, deriving accurate and efficient end-to-end training algorithms under this framework remains a challenging task.

Our contribution in this paper is three-fold:

  1. By drawing a connection to the generalization error bound of importance sampling (Cortes, Mansour, and Mohri 2010), we propose a new learning principle for off-policy learning with bandit feedback. We explicitly regularize the generalization error of the new policy by minimizing the distribution divergence between it and the logging policy. The proposed learning objective automatically trades off between empirical risk and sample variance.

  2. To enable end-to-end training, we propose to parametrize the policy as a neural network, and solve the divergence minimization problem using recent work on variational divergence minimization (Nowozin, Cseke, and Tomioka 2016) and Gumbel soft-max (Jang, Gu, and Poole 2016) sampling.

  3. Our experimental evaluation on benchmark datasets shows significant improvement in performance over conventional baselines, and case studies also corroborate the soundness of our theoretical proofs.

Background

Problem Framework

We first review the framework of off-policy learning with logged bandit feedback introduced in (Swaminathan and Joachims 2015a). A policy maps an input $x \in \mathcal{X}$ to a structured (discrete) output $y \in \mathcal{Y}$. For example, the input x can be profiles of users, and we recommend movies of relevance to the users as the output y; or, in the reinforcement learning setting, the input is the trajectory of the agent, and the output is the action the agent should take at the next time point. We use a family of stochastic policies, where each policy defines a posterior distribution over the output space given the input x, parametrized by some θ, i.e., $h_\theta(\mathcal{Y}|x)$. Note that a distribution which has all its probability mass on one action corresponds to a deterministic policy. With the distribution $h(\mathcal{Y}|x)$, we take actions by sampling from it, and each action y has a probability of $h(y|x)$ of being selected. In the later discussion, we use h and $h(y|x)$ interchangeably when this does not create confusion.

In online systems, we observe feedback $\delta(x, y; y^*)$ for the action y sampled from $h(\mathcal{Y}|x)$ by comparing it to some underlying ‘best’ $y^*$ that was not revealed to the system. For example, in a recommendation system, we can use a scalar loss function $\delta(x, y; y^*) \in [0, L]$, with smaller values indicating higher satisfaction with the recommended items.

The risk of a policy $h(\mathcal{Y}|x)$ is defined as

$$R(h) = \mathbb{E}_{x \sim P(\mathcal{X}),\, y \sim h(\mathcal{Y}|x)}[\delta(x, y)],$$

and the goal of off-policy learning is to find a policy with minimum risk on test data.

In the offline logged learning setting, we only have data collected from a logging policy $h_0(\mathcal{Y}|x)$, and we aim to find an improved policy $h(\mathcal{Y}|x)$ that has lower risk, $R(h) < R(h_0)$. Specifically, the data we use are

$$\mathcal{D} = \{x_i,\ y_i,\ \delta_i = \delta_i(x_i, y_i),\ p_i = h_0(y_i|x_i)\}, \quad i = 1, \dots, N,$$

where $\delta_i$ and $p_i$ are the observed loss feedback and the logging probability (also called the propensity score), and N is the number of training samples.

Two main challenges are associated with this task: 1) If the distribution of a logging policy is skewed towards a specific region of the whole space and does not have support everywhere, feedback for certain actions cannot be obtained, and improvement for these actions is not possible as a result. 2) Since we cannot compute the expectation exactly, we need to resort to empirical estimation using finite samples, which creates generalization error and needs additional regularization.

A vanilla approach to this problem is propensity scoring using importance sampling (Rosenbaum and Rubin 1983), which accounts for the distribution mismatch between h and $h_0$. Specifically, we can rewrite the risk w.r.t. h as the risk w.r.t. $h_0$ using importance reweighting:

$$R(h) = \mathbb{E}_{x \sim P(\mathcal{X}),\, y \sim h(y|x)}[\delta(x, y)] = \mathbb{E}_{x \sim P(\mathcal{X}),\, y \sim h_0(y|x)}\left[\frac{h(y|x)}{h_0(y|x)}\,\delta(x, y)\right]$$

With the collected historic dataset $\mathcal{D}$, we can estimate the empirical risk $\hat{R}_{\mathcal{D}}(h)$, written $\hat{R}(h)$ for short:

$$\hat{R}(h) = \frac{1}{N}\sum_{i=1}^{N} \frac{h(y_i|x_i)}{h_0(y_i|x_i)}\,\delta_i(x_i, y_i)$$
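The importance-weighted estimator can be sketched in a few lines. This is a minimal illustration, assuming the log is stored as parallel Python lists and the new policy is exposed as a function `h(y, x)` returning a probability; the names `ips_risk`, `new_policy`, and the toy log are hypothetical and not taken from the authors' code.

```python
def ips_risk(h, xs, ys, deltas, props):
    """Importance-weighted empirical risk: mean of h(y_i|x_i) / p_i * delta_i."""
    total = 0.0
    for x, y, delta, p in zip(xs, ys, deltas, props):
        total += h(y, x) / p * delta
    return total / len(xs)


# Hypothetical toy log: two actions, uniform logging policy (p_i = 0.5).
xs = [0, 0, 0, 0]
ys = [0, 1, 0, 1]
deltas = [1.0, 0.0, 1.0, 0.0]   # action 0 always incurs loss 1, action 1 loss 0
props = [0.5, 0.5, 0.5, 0.5]


def new_policy(y, x):
    # Illustrative policy that picks action 1 with probability 0.9.
    return 0.9 if y == 1 else 0.1


risk = ips_risk(new_policy, xs, ys, deltas, props)  # approximately 0.1
```

In this toy log the estimate agrees with the true risk of the new policy (it plays the lossy action with probability 0.1), even though the data were collected under a different policy.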

Counterfactual Risk Minimization

(Swaminathan and Joachims 2015a) pointed out several flaws of the vanilla approach, namely that it is not invariant to loss scaling and that it has large and potentially unbounded variance. To regularize the variance, the authors proposed a regularization term for the sample variance derived from empirical Bernstein bounds.

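The modified objective function to minimize is now

$$\hat{R}(h) = \frac{1}{N}\sum_{i=1}^{N} u_i + \lambda\sqrt{\frac{\mathrm{Var}(\bar{u})}{N}},$$

where $u_i = \frac{h(y_i|x_i)}{h_0(y_i|x_i)}\,\delta_i$, $\bar{u} = \frac{1}{N}\sum_{i=1}^{N} u_i$ is the average of $\{u_i\}$ obtained from training data, and $\mathrm{Var}(\bar{u})$ is the sample variance of $\{u_i\}$.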

Because the variance term depends on the whole dataset, stochastic training is difficult. The authors therefore approximated the regularization term via a first-order Taylor expansion and obtained a stochastic optimization algorithm. Despite its simplicity, this first-order approximation neglects the non-linear terms of second order and above, and introduces approximation errors while trying to reduce the sample variance.
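To make the dependence on the full dataset concrete, here is a minimal sketch of a sample-variance regularized objective of the form mean(u) + λ·sqrt(Var(u)/N). The function name `crm_objective` and the toy numbers are illustrative assumptions, and the unbiased (N − 1) variance is one possible choice of sample variance.

```python
import math


def crm_objective(weights, deltas, lam):
    """Sample-variance regularized risk: mean(u) + lam * sqrt(Var(u) / N).

    weights[i] is the importance weight h(y_i|x_i) / h0(y_i|x_i).
    """
    u = [w * d for w, d in zip(weights, deltas)]
    n = len(u)
    mean_u = sum(u) / n
    # The variance needs every u_i, so it cannot be computed from one mini-batch.
    var_u = sum((ui - mean_u) ** 2 for ui in u) / (n - 1)
    return mean_u + lam * math.sqrt(var_u / n)


weights = [0.2, 1.8, 0.2, 1.8]
deltas = [1.0, 0.0, 1.0, 0.0]
plain = crm_objective(weights, deltas, lam=0.0)      # pure importance-weighted risk
penalized = crm_objective(weights, deltas, lam=1.0)  # adds the variance penalty
```

Both the mean and the variance run over all N samples, which is the obstacle to mini-batch training that the paragraph above describes.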

Variance Regularization Objective

Theoretical Motivation

Instead of estimating variance empirically from the samples, which prohibits direct stochastic training, the fact that we have a parametrized version of the policy h(Y|x) motivates us to think: can we derive a variance bound directly from the parametrized distribution?

We first note that the empirical risk term $\hat{R}(h)$ is the average loss reweighted by the importance sampling weight $w(y|x) = \frac{h(y|x)}{h_0(y|x)}$, and a general learning bound exists for the second moment of the importance weighted loss:

Let X be a random variable distributed according to distribution P with density p(x), Y be a random variable, and $\delta(x, y)$ be a loss function over (x, y) that is bounded in [0, L]. For two sampling distributions of y, $h(y|x)$ and $h_0(y|x)$, define their conditional divergence as $d_2(h(y|x)\|h_0(y|x); P(x))$. Then we have

$$\mathbb{E}_{x \sim P(X),\, y \sim h_0(y|x)}\left[w^2(y|x)\,\delta^2(x, y)\right] \le L^2\, d_2(h(y|x)\|h_0(y|x); P(x))$$

The bound is similar to that of (Cortes, Mansour, and Mohri 2010) for a single random variable, except that we are working with a joint distribution over x, y here. Detailed proofs can be found in the Appendix.
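The second-moment bound can be checked numerically on a small discrete example. The sketch below assumes finite sets of contexts and actions, with the conditional divergence computed as the P-weighted sum over x of Σ_y h(y|x)²/h0(y|x); the helper names and all numbers are made up for illustration.

```python
def conditional_d2(h, h0, px):
    """d2(h || h0; P) = sum_x P(x) * sum_y h(y|x)^2 / h0(y|x)."""
    return sum(p * sum(hy * hy / h0y for hy, h0y in zip(hx, h0x))
               for p, hx, h0x in zip(px, h, h0))


def weighted_second_moment(h, h0, px, delta):
    """E_{x~P, y~h0}[w(y|x)^2 * delta(x, y)^2] with w = h / h0."""
    return sum(p * sum(h0y * (hy / h0y) ** 2 * d * d
                       for hy, h0y, d in zip(hx, h0x, dx))
               for p, hx, h0x, dx in zip(px, h, h0, delta))


px = [0.5, 0.5]                      # distribution over two contexts
h = [[0.9, 0.1], [0.2, 0.8]]         # new policy h(y|x)
h0 = [[0.5, 0.5], [0.5, 0.5]]        # logging policy h0(y|x)
delta = [[0.0, 2.0], [1.0, 0.5]]     # losses bounded in [0, L]
L = 2.0

lhs = weighted_second_moment(h, h0, px, delta)
rhs = L * L * conditional_d2(h, h0, px)
```

Here `lhs` stays below `rhs`, as the bound requires; the gap is large because most losses in this example are well below L.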

Theorem. Let $R_h$ be the risk of the new policy on loss function δ, and $\hat{R}_h$ be the empirical risk. We additionally assume the divergence is bounded by M, i.e., $d_2(h\|h_0) \le d_\infty(h\|h_0) = M$.

Then with probability at least $1 - \eta$,

$$R(h) \le \hat{R}(h) + \frac{2LM\log(1/\eta)}{3N} + L\sqrt{\frac{2\, d_2(h\|h_0; P(x))\log(1/\eta)}{N}}$$

The proof of this theorem is an application of the Bernstein inequality and the second moment bound; the detailed proof is in the Appendix. This result highlights the bias-variance trade-off seen in empirical risk minimization (ERM) problems, where $\hat{R}_h$ approximates the empirical risk (bias), and the third term characterizes the variance of the solution through the distribution divergence. It thus motivates us, in the bandit learning setting, to minimize a variance regularized objective instead of directly optimizing the reweighted loss and suffering huge variance at test time:

$$\min_{h = h(\mathcal{Y}|x)} \hat{R}(h) + \lambda\sqrt{\frac{1}{N}\, d_2(h\|h_0; P(x))}$$

Here $\lambda = \sqrt{2L^2\log(1/\eta)}$ is a model hyper-parameter controlling the trade-off between empirical risk and model variance, but we are still faced with the challenge of setting λ empirically and the difficulty of optimizing the objective (see the Appendix for a comparison). Thus, in light of the recent success of distributionally robust learning, we explore an alternative formulation of the above regularized ERM in the next subsection.
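As a small worked illustration of how the regularizer scales, the sketch below evaluates an objective of the form R̂(h) + λ·sqrt(d2/N) with λ = sqrt(2L² log(1/η)), which is how we read the bound above. The function names and input values are hypothetical.

```python
import math


def bound_lambda(L, eta):
    """lambda = sqrt(2 * L^2 * log(1 / eta))."""
    return math.sqrt(2.0 * L * L * math.log(1.0 / eta))


def regularized_objective(emp_risk, d2, n, L, eta):
    """Empirical risk plus the divergence-based variance penalty."""
    return emp_risk + bound_lambda(L, eta) * math.sqrt(d2 / n)


eta = math.exp(-0.5)   # chosen so that log(1/eta) = 0.5 and lambda = 1 when L = 1
small_n = regularized_objective(0.1, d2=4.0, n=100, L=1.0, eta=eta)
large_n = regularized_objective(0.1, d2=4.0, n=10000, L=1.0, eta=eta)
```

The penalty shrinks at rate 1/sqrt(N), so for a fixed divergence the regularized objective approaches the empirical risk as more logged samples become available.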

Robustly Regularized Formulation

Instead of solving a ‘loss + regularizer’ objective function, we here study a closely related constrained optimization formulation, whose intuition comes from the method of Lagrangian multipliers for constrained optimization.

The new formulation is:

$$\min_{h} \frac{1}{N}\sum_{i=1}^{N} \frac{h(y_i|x_i)}{p_i}\,\delta_i \quad \text{s.t.} \quad d_2(h\|h_0; P(X)) \le \rho/N,$$

where ρ is a pre-determined constant as the regularization hyper-parameter.

We analyze the generalization property of the minimizer $h^*$ of the constrained problem: we first bound the empirical risk of $h^*$ relative to that of $h_0$, and then move to the expectation setting.

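Proposition. The minimizing value $\hat{R}(h^*)$ satisfies

$$-\sqrt{\frac{2\rho}{N}\widehat{\mathrm{Var}}} \;\le\; \hat{R}(h^*) - \hat{R}(h_0) \;\le\; -\left(\sqrt{\frac{2\rho}{N}\widehat{\mathrm{Var}}} - \frac{2\rho L}{N}\right)_{+}$$

where $\widehat{\mathrm{Var}} = \widehat{\mathrm{Var}}_{h_0}\left[\delta_i \frac{h(y_i|x_i)}{h_0(y_i|x_i)}\right]$ is the empirical variance of the reweighted loss. The results come from algebraic computation using KKT conditions for a Lagrange multiplier argument, and we show the calculation in the Appendix.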

With the above proposition, we can prove the following bound for a minimizer h* to the optimization problem

For a minimizer $h^*$, we have that with probability at least $1 - \exp(-t)$,

R(h*)R(h0)2(ρt)NVar^+(4+6ρ)L3Nt+6tL23N+2tVar^h0[δi]N

where $\widehat{\mathrm{Var}}$ is the empirical variance of the policy-reweighted loss, and the constant $\widehat{\mathrm{Var}}_{h_0}[\delta_i]$ is the variance of the original losses.

We show the detailed proof in the Appendix, which combines Bennett’s inequality and the proposition above. These results suggest that the risk of our minimizer is a good surrogate for the best risk plus a variance term, with their difference controlled by the regularization hyper-parameter ρ and approaching 0 as $N \to \infty$. Moreover, in some cases, e.g., when $\widehat{\mathrm{Var}}$ is considerably large and $\rho > t$, we are guaranteed to have an improvement in the losses.

At first glance, the new objective function removes the need to compute the sample variance in existing bounds. However, when we have a parametrized distribution $h(y|x)$ and finite samples $\{x_i, y_i\}_{i=1}^{N}$ from $h_0(y_i|x_i)$, estimating the divergence is not an easy task. In the next subsection, we present how recent f-GAN networks for variational divergence minimization (Nowozin, Cseke, and Tomioka 2016) and Gumbel soft-max sampling (Jang, Gu, and Poole 2016) can help solve the task.

Discussion: Possibility of Counterfactual Learning:

One interesting aspect of our bounds is that they also stress the need for stochasticity of the logging policy (Langford, Strehl, and Wortman 2008). For a deterministic logging policy, where the corresponding probability distribution has only some peaked masses and is zero elsewhere in its domain, intuition suggests that learning will be difficult, as those regions are never explored. Our theory reflects this intuition in the calculation of the divergence term, an integral of the form $\int_y h^2(y|x)/h_0(y|x)\,dy$. A deterministic policy has a region of non-zero measure in which $h_0(y|x) = 0$, while the corresponding $h(y|x)$ can take finite values in that region. The resulting integral is thus unbounded, and in turn induces an unbounded generalization bound, making counterfactual learning in this case impossible.
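The blow-up of the divergence for near-deterministic logging policies is easy to see in the discrete case, where the integral becomes a sum over actions. The following sketch uses two actions and a made-up new policy; `d2` is a hypothetical helper name.

```python
def d2(h, h0):
    """Sum over actions of h(y)^2 / h0(y); infinite if h0 misses an action h uses."""
    total = 0.0
    for hy, h0y in zip(h, h0):
        if hy == 0.0:
            continue
        if h0y == 0.0:
            return float("inf")
        total += hy * hy / h0y
    return total


h = [0.5, 0.5]   # new policy spreads its mass over both actions
# Logging policy puts mass eps on the second action; eps = 0 is deterministic.
values = [d2(h, [1.0 - eps, eps]) for eps in (0.5, 0.1, 0.01, 0.0)]
```

As `eps` shrinks, the divergence grows without bound, and it is infinite once the logging policy never plays an action that the new policy uses.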

Adversarial Training Algorithm

Adversarial Learning of the Divergence

The derived variance regularized objective requires us to minimize the square root of the conditional divergence, $d_2(h\|h_0; P(\mathcal{X})) = \mathbb{E}_x \int_y \frac{h^2(y|x)}{h_0(y|x)}\,dy$.

For simplicity, we first examine the term inside the expectation. A simple calculation gives

$$\int_y \frac{h(y|x)^2}{h_0(y|x)}\,dy = \int_y h_0(y|x)\left[\left(\frac{h(y|x)}{h_0(y|x)}\right)^2 - 1 + 1\right]dy = D_f(h(\mathcal{Y}|x)\|h_0(\mathcal{Y}|x)) + 1,$$

where $f(t) = t^2 - 1$ is a convex function on the domain $\{t : t \ge 0\}$ with $f(1) = 0$. Combining with the expectation operator gives a minimization objective of $D_f(h\|h_0; P(\mathcal{X}))$ (the +1 is omitted as a constant).

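The above calculation draws a connection between our divergence and the f-divergence measure (Nguyen, Wainwright, and Jordan 2010). Following the f-GAN method for variational divergence minimization proposed in (Nowozin, Cseke, and Tomioka 2016), we can reach a lower bound of the above objective as

$$
\begin{aligned}
D_f(h(\mathcal{Y}|x)\|h_0(\mathcal{Y}|x); P(\mathcal{X}))
&= \mathbb{E}_x\left[\int_y f\left(\frac{h(y|x)}{h_0(y|x)}\right) dh_0(y|x)\right] \\
&= \mathbb{E}_x\left[\sup_T \left\{\mathbb{E}_{y \sim h}[T(y)] - \mathbb{E}_{y \sim h_0}[f^*(T(y))]\right\}\right] \\
&= \sup_T \left\{\mathbb{E}_x \mathbb{E}_{y \sim h}[T(x, y)] - \mathbb{E}_x \mathbb{E}_{y \sim h_0}[f^*(T(x, y))]\right\} \\
&\ge \sup_{T \in \mathcal{T}} \left\{\mathbb{E}_x \mathbb{E}_{y \sim h}[T(x, y)] - \mathbb{E}_x \mathbb{E}_{y \sim h_0}[f^*(T(x, y))]\right\} \\
&= \sup_{T \in \mathcal{T}} \left\{\mathbb{E}_{x, y \sim h}\, T(x, y) - \mathbb{E}_{x, y \sim h_0}\, f^*(T(x, y))\right\} = \sup_{T \in \mathcal{T}} F(T, h),
\end{aligned}
$$

where $F(T, h)$ denotes the expression inside the last supremum.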

For the second equality, since f is a convex function, applying Fenchel convex duality ($f^*(v) = \sup_u\{uv - f(u)\}$) gives the dual formulation. Because the expectation is taken w.r.t. x while the supremum is taken over all functions T, we can safely swap the two operators. We note that the bound is tight when $T_0(x) = f'(h/h_0)$, where $f'$ is the first-order derivative of f, $f'(t) = 2t$ (Nguyen, Wainwright, and Jordan 2010).
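The tightness condition can be verified numerically for a single fixed context with finitely many actions. The sketch below assumes f(t) = t² − 1, whose conjugate on t ≥ 0 works out to f*(v) = v²/4 + 1 for v ≥ 0 (the supremum of uv − u² + 1 is attained at u = v/2) and f*(v) = 1 for v < 0. Function names and the two distributions are illustrative.

```python
def f_conj(v):
    # Conjugate of f(t) = t^2 - 1 on t >= 0: sup_u {u*v - u^2 + 1}.
    # For v >= 0 the maximizer is u = v / 2; for v < 0 it is u = 0.
    return v * v / 4.0 + 1.0 if v >= 0 else 1.0


def variational_objective(T, h, h0):
    """F(T, h) = E_{y~h}[T(y)] - E_{y~h0}[f*(T(y))] for one fixed context."""
    return (sum(p * t for p, t in zip(h, T))
            - sum(q * f_conj(t) for q, t in zip(h0, T)))


def f_divergence(h, h0):
    """D_f(h || h0) = sum_y h0(y) * f(h(y) / h0(y)) with f(t) = t^2 - 1."""
    return sum(q * ((p / q) ** 2 - 1.0) for p, q in zip(h, h0))


h = [0.7, 0.2, 0.1]
h0 = [0.4, 0.4, 0.2]
T_opt = [2.0 * p / q for p, q in zip(h, h0)]   # T = f'(h / h0) = 2 h / h0

exact = f_divergence(h, h0)
tight = variational_objective(T_opt, h, h0)            # matches `exact`
loose = variational_objective([1.0, 1.0, 1.0], h, h0)  # any other T gives a lower value
```

Any other choice of T, such as the constant function used for `loose`, yields a value no larger than the exact divergence, which is what makes maximizing over a discriminator a valid way to estimate it.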

The third inequality follows because we restrict T to a family of functions instead of all functions. Luckily, the universal approximation theorem of neural networks (Hornik, Stinchcombe, and White 1989) states that neural networks with an arbitrary number of hidden units can approximate continuous functions on a compact set to any desired precision. Thus, by choosing the family of T to be the family of neural networks, the equality condition of the second equality can be satisfied theoretically.

The final objective is a saddle point of a function $T(x, y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ that maps input pairs to a scalar value, and the policy we want to learn, $h(\mathcal{Y}|x)$, acts as a sampling distribution. Although the objective is only a lower bound (with achievable equality conditions), this saddle point trained with mini-batch estimation is theoretically a consistent estimator of the true divergence (proof in the Appendix).

We use $D_f = \sup_{T \in \mathcal{T}} \int T\,dh\,dx - \int f^*(T)\,dh_0\,dx$ to denote the true divergence, and $\hat{D}_f = \sup_{T \in \mathcal{T}} \int T(x_i, y_i)\,d\hat{h}\,dx - \int f^*(T(x_j, y_j))\,d\hat{h}_0\,dx$ to denote the empirical estimator we use, where $\hat{h}$ and $\hat{h}_0$ are the empirical distributions obtained by sampling from the two distributions respectively.

$\hat{D}_f$ is a consistent estimator of $D_f$.

Again, a generative-adversarial approach (Goodfellow et al. 2014) can be applied. Toward this end, we represent the T function as a discriminator network parametrized as $T_w(x, y)$. We then parametrize the distribution of our policy $h(y|x)$ as another generator neural network $h_\theta(y|x)$ mapping x to the probability of sampling y. For structured output problems with discrete values of y, to allow the gradients of sampled outputs to be backpropagated to all other parameters, we use Gumbel soft-max sampling (Jang, Gu, and Poole 2016) for differentiable sampling from the distribution $h(y|x)$. We list the complete training procedure in Algorithm 1 for completeness. As shown by (Nowozin, Cseke, and Tomioka 2016), when the saddle point solution exists, under some mild conditions, Algorithm 1 decreases the objective value at a geometric rate.
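For readers unfamiliar with Gumbel soft-max sampling, the following is a framework-free sketch of a single draw, assuming a categorical distribution given by unnormalized log-probabilities. The function name, the temperature argument `tau`, and the clamping constant are illustrative choices; in an autograd framework the soft sample would carry the gradient, which this plain-Python version does not show.

```python
import math
import random


def gumbel_softmax(logits, tau, rng, hard=False):
    """Draw one relaxed sample from the categorical distribution softmax(logits)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft
    # Straight-through style output: a one-hot vector at the argmax of the soft sample.
    k = soft.index(max(soft))
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


rng = random.Random(0)
soft_sample = gumbel_softmax([1.0, 0.5, -1.0], tau=0.5, rng=rng)
hard_sample = gumbel_softmax([1.0, 0.5, -1.0], tau=0.5, rng=rng, hard=True)
```

Lower temperatures push the soft sample toward a one-hot vector, while the argmax of the perturbed logits is distributed according to softmax(logits) regardless of the temperature.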

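Algorithm 1

Input: $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$ sampled from the logging policy $h_0$; a predefined threshold $D_0$; an initial generator distribution $h_{\theta_0}(y|x)$; an initial discriminator function $T_{w_0}(x, y)$; max iteration I

Output: An optimized generator distribution $h_{\theta^*}(y|x)$ that has minimum divergence to $h_0$

  1. Initialization.

  2. Repeat until the estimated divergence falls below $D_0$ or I iterations are reached:

     a. Sample a mini-batch of ‘real’ samples $(x_i, y_i)$ from $\mathcal{D}$.

     b. Sample a mini-batch of x from $\mathcal{D}$, and construct ‘fake’ samples $(x_i, \hat{y}_i)$ by sampling $\hat{y}$ from $h_{\theta_t}(y|x)$ with Gumbel soft-max.

     c. Update $w_{t+1} = w_t + \eta_w \nabla_w F(T_w, h_\theta)$.

     d. Update $\theta_{t+1} = \theta_t - \eta_\theta \nabla_\theta F(T_w, h_\theta)$.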

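For our purpose of minimizing the variance regularization term, we can similarly derive a training algorithm, as the gradient of the map $t \mapsto \sqrt{t + 1}$ (which takes the f-divergence to the square root of the divergence $d_2$) can also be backpropagated.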

Training algorithm

With the above two components, we are now ready to present the full treatment of our end-to-end learning for counterfactual risk minimization from logged data. The following algorithm solves the robust regularized formulation. For completeness, training for the original ERM formulation in the Theoretical Motivation subsection (referred to as the co-training version in the later experiment sections) is included in the Appendix.

Algorithm 2

Input: $\mathcal{D} = \{x_i, y_i, p_i, \delta_i\}_{i=1}^{N}$ sampled from $h_0$; regularization hyper-parameter ρ; maximum number of divergence minimization steps I; max epochs for the whole algorithm MAX

Output: An optimized generator $h_{\theta^*}(y|x)$ that is an approximate minimizer of $R(h)$

  1. Initialization.

  2. Repeat for up to MAX epochs:

     a. Sample a mini-batch of m samples from $\mathcal{D}$.

     b. Estimate the reweighted loss as $\hat{R}^t = \frac{1}{m}\sum_{i=1}^{m} \frac{h_{\theta_t}(y_i|x_i)}{p_i}\,\delta_i$ and get the gradient as $g_1 = \nabla_\theta \hat{R}^t$.

     c. Update $\theta_{t+1} = \theta_t - \eta_\theta g_1$.

     d. Call Algorithm 1 to minimize the divergence $d_2(h\|h_0; P(\mathcal{X}))$ with threshold ρ and max iterations set to I.

The algorithm works in two separate training steps: 1) update the parameters of the policy h to minimize the reweighted loss; 2) update the parameters of the policy (generator) and the discriminator to regularize the variance and thus improve the generalization performance of the new policy.
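The alternation between the two steps can be illustrated on a deliberately simplified toy problem. The sketch below assumes a context-free softmax policy over two actions and, instead of the adversarial discriminator, uses the exact divergence d2 = Σ_y h(y)²/h0(y) and its analytic gradient; this swap is made purely to keep the example short and is not the authors' training procedure. All names and hyper-parameter values are hypothetical.

```python
import math


def softmax(theta):
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def ips_grad(theta, batch):
    """Gradient of (1/m) * sum_i h(y_i) / p_i * delta_i w.r.t. the logits."""
    h = softmax(theta)
    grad = [0.0] * len(theta)
    for y, delta, p in batch:
        coeff = delta / p * h[y] / len(batch)
        for k in range(len(theta)):
            grad[k] += coeff * ((1.0 if k == y else 0.0) - h[k])
    return grad


def d2_value(theta, h0):
    return sum(hk * hk / qk for hk, qk in zip(softmax(theta), h0))


def d2_grad(theta, h0):
    """Gradient of d2 = sum_y h(y)^2 / h0(y) w.r.t. the logits."""
    h = softmax(theta)
    d2 = sum(hk * hk / qk for hk, qk in zip(h, h0))
    return [2.0 * hk * (hk / qk - d2) for hk, qk in zip(h, h0)]


def train(batch, h0, rho, epochs=200, inner_iters=5, lr=0.5):
    theta = [0.0] * len(h0)
    for _ in range(epochs):
        # Step 1: one descent step on the reweighted loss.
        g = ips_grad(theta, batch)
        theta = [t - lr * gk for t, gk in zip(theta, g)]
        # Step 2: descend the divergence until it is below the threshold
        # or the iteration budget is used up.
        for _ in range(inner_iters):
            if d2_value(theta, h0) <= rho:
                break
            g = d2_grad(theta, h0)
            theta = [t - lr * gk for t, gk in zip(theta, g)]
    return theta


h0 = [0.5, 0.5]
batch = [(0, 1.0, 0.5), (1, 0.0, 0.5), (0, 1.0, 0.5), (1, 0.0, 0.5)]  # (y, delta, p)
theta = train(batch, h0, rho=1.3)
policy = softmax(theta)
```

In this toy run the learned policy shifts probability toward the zero-loss action, while the divergence step keeps it from collapsing onto that action entirely.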

Experiments

Experiment Setups

For empirical evaluation of our proposed algorithms, we follow the method of converting supervised learning to bandit feedback (Agarwal et al. 2014). For a given supervised dataset $\mathcal{D}^* = \{(x_i, y_i^*)\}_{i=1}^{N}$, we first construct a logging policy $h_0(\mathcal{Y}|x)$, and then for each sample $x_i$, we sample a prediction $y_i \sim h_0(y|x_i)$ and collect the feedback as $\delta(y_i^*, y_i)$. For the purpose of benchmarks, we also use the conditional random field (CRF) policy trained on 5% of $\mathcal{D}^*$ as the logging policy $h_0$, and use Hamming loss, the number of misclassified labels between $y_i$ and $y_i^*$, as the loss function δ (Swaminathan and Joachims 2015a). To create bandit feedback datasets $\mathcal{D} = \{x_i, y_i, \delta_i, p_i\}$, each sample $x_i$ was passed four times to the logging policy $h_0$, and the sampled actions $y_i$ were recorded along with the loss value $\delta_i$ and the propensity score $p_i = h_0(y_i|x_i)$.
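The supervised-to-bandit conversion can be sketched as follows. For brevity, this example assumes a logging policy that factorizes over labels as independent Bernoulli variables, standing in for the CRF used in the experiments; `label_probs`, `to_bandit_feedback`, and the toy dataset are hypothetical.

```python
import random


def hamming_loss(y, y_star):
    """Number of label positions where y and y_star disagree."""
    return sum(1 for a, b in zip(y, y_star) if a != b)


def to_bandit_feedback(dataset, label_probs, replays, rng):
    """Sample `replays` actions per input and record (x, y, delta, propensity)."""
    log = []
    for x, y_star in dataset:
        probs = label_probs(x)          # P(label_k = 1 | x) for each label k
        for _ in range(replays):
            y, p = [], 1.0
            for q in probs:
                bit = 1 if rng.random() < q else 0
                y.append(bit)
                p *= q if bit == 1 else 1.0 - q
            log.append((x, tuple(y), hamming_loss(y, y_star), p))
    return log


# Toy supervised data: two inputs with three binary labels each.
dataset = [("a", (1, 0, 1)), ("b", (0, 0, 1))]


def label_probs(x):
    # Illustrative logging policy that ignores the input.
    return [0.8, 0.3, 0.6]


log = to_bandit_feedback(dataset, label_probs, replays=4, rng=random.Random(0))
```

Each logged tuple keeps only the loss of the sampled action and its propensity; the true labels are discarded after the loss is computed, which is what makes the feedback bandit-style.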

In evaluation, we use two types of evaluation metrics for the probabilistic policy $h(\mathcal{Y}|x)$. The first is the expected loss (referred to as ‘EXP’ later), $R(h) = \frac{1}{N_{test}}\sum_i \mathbb{E}_{y \sim h(y|x_i)}\,\delta(y_i^*, y)$, a direct measure of the generalization performance of the learned policy. The second is the average Hamming loss of the maximum a posteriori (MAP) prediction $y_{MAP} = \arg\max_y h(y|x)$ derived from the learned policy, as MAP is a faster way to generate predictions without the need for sampling in practice. However, since MAP predictions depend only on the regions with highest probability and do not take into account the diversity of predictions, two policies with the same MAP performance could have very different generalization performance. Thus, a model with high MAP performance but low EXP performance might be over-fitting, as it may be concentrating most of its probability mass in the regions where the $h_0$ policy obtained good performance.
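The difference between the two metrics can be seen on a single test example. The sketch below again assumes a policy that factorizes over labels as independent Bernoulli variables, so that the expected Hamming loss has a closed form; the function names and probabilities are made up.

```python
def expected_hamming(probs, y_star):
    """Expected Hamming loss when label k is 1 with probability probs[k]."""
    return sum((1.0 - q) if t == 1 else q for q, t in zip(probs, y_star))


def map_hamming(probs, y_star):
    """Hamming loss of the most probable label vector."""
    y_map = [1 if q > 0.5 else 0 for q in probs]
    return sum(1 for a, b in zip(y_map, y_star) if a != b)


y_star = (1, 0, 1)
confident = [0.9, 0.1, 0.9]   # mass concentrated on the correct labels
hesitant = [0.6, 0.4, 0.6]    # same MAP prediction, much flatter distribution

map_losses = (map_hamming(confident, y_star), map_hamming(hesitant, y_star))
exp_losses = (expected_hamming(confident, y_star), expected_hamming(hesitant, y_star))
```

Both policies have a MAP loss of zero, yet their expected losses differ by a factor of four, which is why the two metrics are reported side by side.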

Benchmark Comparison

Baselines

Vanilla importance sampling algorithms using inverse propensity score (IPS), and the counterfactual risk minimization algorithm from (Swaminathan and Joachims 2015a) (POEM) are compared, with both L-BFGS optimization and stochastic optimization solvers. The hyper-parameters are selected by performance on validation set and more details of their methods can be found in the original paper (Swaminathan and Joachims 2015a).

Neural network policies without divergence regularization (NN-NoReg in later discussions) are also compared as baselines, to verify the effectiveness of variance regularization.

Dataset

We use four multi-label classification datasets collected in the UCI machine learning repository (Asuncion and Newman 2007), and perform the supervised-to-bandit conversion. We report the statistics in Table 2 in the Appendix.

Table 2:

Dataset Statistics

Name # Features # Labels # Train # Test
Yeast 103 14 1500 917
Scene 294 6 1211 1196
TMC 30438 22 21519 7077
LYRL 47236 4 23149 781265

For these datasets, we choose a three-layer feed-forward neural network for our policy distribution, and a two or three layer feed-forward neural network as the discriminator for divergence minimization. Detailed configurations can be found in the Appendix 6.

For benchmark comparison, we use the separate training version (Algorithm 2), as it has faster convergence and better performance (see the section in the Appendix for an empirical comparison).

The networks are trained with Adam (Kingma and Ba 2014), with learning rates of 0.001 and 0.01 for the reweighted loss and the divergence minimization part, respectively. We used PyTorch to implement the pipelines and trained networks with Nvidia K80 GPU cards.1

Results are averaged over 10 experiment runs, and we report the two evaluation metrics in Table 1. We report the regularized neural network policies with two Gumbel soft-max sampling schemes: soft Gumbel soft-max (NN-Soft) and straight-through Gumbel soft-max (NN-Hard).

Table 1:

Benchmark Comparison Results.

Dataset                          Scene         Yeast         TMC           LYRL
Evaluation Metrics               MAP    EXP    MAP    EXP    MAP    EXP    MAP    EXP
Logging Policy h0 (5% data CRF)  1.069  1.887  3.255  5.485  4.995  5.053  1.047  1.949

NN-NoReg                         1.465  1.981  3.223  4.705  1.706  1.724  0.247  0.263
NN-Hard                          1.303  1.463  3.047  3.788  1.694  1.720  0.248  0.255
NN-Soft                          1.347  1.457  3.097  3.789  1.683  1.707  0.247  0.255

IPS                              1.350  1.350  4.256  4.521  4.601  4.416  1.240  1.240
POEM                             1.169  1.169  4.238  4.508  4.611  4.505  1.169  1.306
IPS(Stochastic)                  1.291  1.291  4.090  4.605  2.812  2.737  1.149  1.479
POEM(Stochastic)                 1.322  1.323  4.140  4.570  3.601  3.561  1.237  1.237

Supervised Learning (NN)         0.943  2.238  3.101  4.300  1.530  3.786  0.217  0.519
Supervised Learning (CRF)        1.110  1.423  2.807  4.047  1.344  1.241  0.240  0.437

As we can see from the results, by introducing a neural network parametrization of the policies, we are able to improve the test performance by a large margin compared to the baseline CRF policies, as the representation power of networks is often reported to be stronger than that of other models. With the introduction of the additional variance regularization term (comparing NN-Hard/Soft to NN-NoReg), we observe an additional improvement in both testing loss and MAP prediction loss. We observe no significant difference between the two Gumbel soft-max sampling schemes.

Effect of Variance Regularization

To study the effectiveness of variance regularization quantitatively, we vary the maximum number of iterations (I in Algorithm 2) we take in each divergence minimization sub-loop. For example, ‘NN-Hard-10’ indicates that we use straight-through Gumbel soft-max and set the maximum number of iterations to 10. Here we set the thresholds for divergence slightly larger so that the maximum number of iterations is executed and the results are more comparable. We plot the expected loss on the test set against epochs, averaged over 10 runs with error bars, using the Yeast dataset.

As we can see from the figure, models with no regularization (gray lines in the figure) have higher loss and a slower convergence rate. As the maximum number of iterations for divergence minimization increases, the test loss decreases faster and the final test loss is also lower. This behavior suggests that by adding the regularization term, our learned policies are able to generalize better to test sets, and that the stronger the regularization we impose by taking more divergence minimization steps, the better the test performance is. The regularization also helps the training algorithm converge faster, as shown by the trend.

Generalization Performance

Our theoretical bounds imply that the generalization performance of our algorithm improves as the number of training samples increases. We vary the number of times each training input x is passed to the logging policy to sample an action y, in the range $2^{[1, 2, \dots, 8]}$ on a log scale.

When the number of training samples in the bandit dataset increases, models both with and without regularization show improving test performance in the expected loss and reach a relatively stable level in the end. Moreover, regularized policies consistently generalize better than the model without regularization. This matches our theoretical intuition that explicitly regularizing the variance can help improve the generalization ability, and that stronger regularization induces better generalization performance. However, as indicated by the MAP performance, once the number of replays of training samples exceeds $2^4$, MAP prediction performance starts to decrease, which suggests the models may already be starting to over-fit.

Effect of logging policy vs results

In this section, we discuss how the logging policy, in terms of stochasticity and quality, affects the learning performance. Additional visualizations of other metrics can be found in the Appendix.

As discussed before, the ability of our algorithm to learn an improved policy relies on the stochasticity of the logging policy. To test how this stochasticity affects our learning, we modify the parameter of $h_0$ by introducing a temperature multiplier α. For CRF logging policies, the prediction is made by normalizing values of $w^T\phi(x, y)$, where w is the model parameter and can be modified by α with $w \to \alpha w$. As α becomes higher, $h_0$ will have a more peaked distribution, and ultimately become a deterministic policy as $\alpha \to \infty$.

We varied α in the range of $2^{[-1, 1, \dots, 8]}$, and report the average ratio of the expected test loss to the logging policy loss of our algorithms (Y-axis in Fig. 3a, where smaller values indicate a larger improvement). We can see that NN policies perform better than the logging policy when the stochasticity of $h_0$ is sufficient, while after the temperature parameter increases beyond $2^3$, it is much harder and even impossible (ratio > 1) to learn improved NN policies. We also note here that the stochasticity does not affect the expected loss values themselves, and the drop in the ratios mainly results from the decreased loss of the logging policy $h_0$. In addition, comparing within NN policies, policies with stronger regularization perform slightly better than those with weaker regularization, which to some extent shows the robustness of our learning principle.
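The effect of the temperature multiplier can be illustrated with a plain softmax over a handful of scores. The scores below are hypothetical stand-ins for the values of w^T φ(x, y), and the α values are arbitrary.

```python
import math


def softmax(scores, alpha=1.0):
    """Normalize exp(alpha * score); larger alpha gives a more peaked distribution."""
    scaled = [alpha * s for s in scores]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


scores = [1.0, 0.5, 0.0]   # hypothetical scores for three candidate outputs
# Probability assigned to the highest-scoring output at increasing alpha.
peak = [max(softmax(scores, alpha)) for alpha in (0.5, 1, 2, 8, 64)]
```

The probability of the top output rises monotonically toward 1 as α grows, so the logging policy explores the remaining outputs less and less.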

Figure 3:

a) The decreasing stochasticity of h0 makes it harder to obtain an improved NN policy, and our regularization can help the model be more robust and achieve better generalization performance.

b) As h0 improves, the models constantly outperform the baselines, however, the difficulty is increasing with the quality of h0. Note: more visualizations of other metrics can be found in the appendix 6.

Finally, we discuss the impact of logging policies on our learned improved policies. Intuitively, a better policy that has lower Hamming loss can produce bandit datasets with more correct predictions; however, it is also possible that the sampling biases introduced by the logging policy are larger, such that some predictions might not be available for feedback. To study the trade-off between better policy accuracy and the sampling biases, we vary the proportion of training data points used to train the logging policy from 0.05 to 1, and compare the performance of our improved policies in Fig. 3b. We can see that as the logging policy improves gradually, both NN and NN-Reg policies outperform the logging policy, indicating that they are able to address the sampling biases. The increasing ratios of test expected loss to $h_0$ performance, as a proxy for relative policy improvement, also match our intuition that an $h_0$ of better quality is harder to beat.

Conclusion

In this paper, we showed how divergence regularization can help improve off-policy learning both theoretically via importance sampling bounds and empirically by our adversarial learning algorithms on real-world datasets.

Limitations of the work mainly lie in the need for the propensity scores (the probability that an action is taken by the logging policy), which may not always be available. Learning to estimate propensity scores and plugging the estimates into our training framework will increase the applicability of our algorithms. For example, as suggested by (Cortes, Mansour, and Mohri 2010), directly learning importance weights (the ratio of the new policy probability to the logging policy probability) has comparable theoretical guarantees, which might be a good extension for the proposed algorithm.

Although the work focuses on off-policy learning from logged data, the techniques and theorems may be extended to general supervised learning and reinforcement learning. As a future direction of research, it will be interesting to study how this training algorithm can work for empirical risk minimization and what generalization bounds it may have.

Supplementary Material

Algorithm 1
Algorithm 2
appendix

Figure 1:

Stronger regularization can help obtain faster convergence and better test performance.

Figure 2:

Neural network policies both have increasing performance with increasing number of training data, while models with regularization have faster convergence rate and better performance.

Figure 4:

Results from different training schemes suggest that minimizing the reweighted loss and the divergence separately works better than training the two losses together.

Figure 5:

As the logging policy becomes more deterministic, NN policies are still able to find improvement over h0 in a) expected loss and b) loss with MAP predictions. c) We cannot observe a clear trend in terms of the performance of MAP predictions.

We hypothesize that this is because the h0 policy already has good MAP prediction performance by centering some of the masses. While NN policies can easily pick up the patterns, it is difficult for them to beat the baselines. We believe this phenomenon is worth further investigation.

Figure 6:

a) As the quality of the logging policy increases, NN policies are still able to find improvement over h0 in expected loss. b), c) For MAP predictions, however, it is very difficult for NN policies to beat h0 if the logging policy was already exposed to the full training data and trained in a supervised fashion.

Acknowledgements

This work has been supported in part by grants from the National Science Foundation NSF1651360, National Institutes of Health (NIH) UL1TR000454 and NIH R01CA163256, CDC HHSD2002015F62550B, the Children’s Healthcare of Atlanta, and Microsoft Research. We also thank Dr. Adith Swaminathan for answering questions regarding the experimental procedure of their paper.

Footnotes

1. Code for reproducing the results can be found at https://github.com/hang-wu/VRCRM.

References

  • 1. Agarwal Alekh, Hsu Daniel, Kale Satyen, Langford John, Li Lihong, and Schapire Robert. 2014. “Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits.” In International Conference on Machine Learning, 1638–46.
  • 2. Asuncion Arthur, and Newman David. 2007. “UCI Machine Learning Repository.”
  • 3. Bertsimas Dimitris, Brown David B, and Caramanis Constantine. 2011. “Theory and Applications of Robust Optimization.” SIAM Review 53 (3). SIAM: 464–501.
  • 4. Bertsimas Dimitris, Gupta Vishal, and Kallus Nathan. 2013. “Data-Driven Robust Optimization.” Mathematical Programming. Springer, 1–58.
  • 5. Beygelzimer Alina, and Langford John. 2009. “The Offset Tree for Learning with Partial Labels.” In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 129–38. ACM.
  • 6. Bottou Léon, Peters Jonas, Quiñonero-Candela Joaquin, Charles Denis X, Chickering D Max, Portugaly Elon, Ray Dipankar, Simard Patrice, and Snelson Ed. 2013. “Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising.” The Journal of Machine Learning Research 14 (1). JMLR.org: 3207–60.
  • 7. Cortes Corinna, Mansour Yishay, and Mohri Mehryar. 2010. “Learning Bounds for Importance Weighting.” In Advances in Neural Information Processing Systems, 442–50.
  • 8. Duchi John, and Namkoong Hongseok. 2016. “Variance-Based Regularization with Convex Objectives.” arXiv Preprint arXiv:1610.02581.
  • 9. Dudík Miroslav, Erhan Dumitru, Langford John, Li Lihong, and others. 2014. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science 29 (4). Institute of Mathematical Statistics: 485–511.
  • 10. Esfahani Peyman Mohajerin, and Kuhn Daniel. 2015. “Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations.” arXiv Preprint arXiv:1505.05116.
  • 11. Gabrel Virginie, Murat Cécile, and Thiele Aurélie. 2014. “Recent Advances in Robust Optimization: An Overview.” European Journal of Operational Research 235 (3). Elsevier: 471–83.
  • 12. Gao Rui, and Kleywegt Anton J. 2016. “Distributionally Robust Stochastic Optimization with Wasserstein Distance.” arXiv Preprint arXiv:1604.02199.
  • 13. Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2672–80.
  • 14. Gretton Arthur, Smola Alexander J, Huang Jiayuan, Schmittfull Marcel, Borgwardt Karsten M, and Schölkopf Bernhard. 2009. “Covariate Shift by Kernel Mean Matching.” MIT Press.
  • 15. Gu Shixiang, Holly Ethan, Lillicrap Timothy, and Levine Sergey. 2017. “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates.” In Robotics and Automation (ICRA), 2017 IEEE International Conference on, 3389–96. IEEE.
  • 16. Gu Shixiang, Lillicrap Timothy, Ghahramani Zoubin, Turner Richard E, Schölkopf Bernhard, and Levine Sergey. 2017. “Interpolated Policy Gradient: Merging on-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning.” arXiv Preprint arXiv:1706.00387.
  • 17. Hornik Kurt, Stinchcombe Maxwell, and White Halbert. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5). Elsevier: 359–66.
  • 18. Horvitz Daniel G, and Thompson Donovan J. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260). Taylor & Francis Group: 663–85.
  • 19. Imbens Guido W. 2004. “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.” The Review of Economics and Statistics 86 (1). MIT Press: 4–29.
  • 20. Jang Eric, Gu Shixiang, and Poole Ben. 2016. “Categorical Reparameterization with Gumbel-Softmax.” arXiv Preprint arXiv:1611.01144.
  • 21. Jiang Nan, and Li Lihong. 2015. “Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning.” arXiv Preprint arXiv:1511.03722.
  • 22. Kingma Diederik, and Ba Jimmy. 2014. “Adam: A Method for Stochastic Optimization.” arXiv Preprint arXiv:1412.6980.
  • 23. Langford John, Strehl Alexander, and Wortman Jennifer. 2008. “Exploration Scavenging.” In Proceedings of the 25th International Conference on Machine Learning, 528–35. ACM.
  • 24. Li Lihong, Munos Rémi, and Szepesvári Csaba. 2015. “Toward Minimax Off-Policy Value Estimation.”
  • 25. Mahmood Ashique Rupam, Yu Huizhen, and Sutton Richard S. 2017. “Multi-Step Off-Policy Learning Without Importance Sampling Ratios.” arXiv Preprint arXiv:1702.03006.
  • 26. Maurer Andreas. 2009. “Empirical Bernstein Bounds and Sample Variance Penalization.” In COLT. Citeseer.
  • 27. Munos Rémi, Stepleton Tom, Harutyunyan Anna, and Bellemare Marc. 2016. “Safe and Efficient Off-Policy Reinforcement Learning.” In Advances in Neural Information Processing Systems, 1054–62.
  • 28. Nguyen XuanLong, Wainwright Martin J, and Jordan Michael I. 2010. “Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization.” IEEE Transactions on Information Theory 56 (11). IEEE: 5847–61.
  • 29. Nowozin Sebastian, Cseke Botond, and Tomioka Ryota. 2016. “f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization.” In Advances in Neural Information Processing Systems, 271–79.
  • 30. Rényi Alfréd, and others. 1961. “On Measures of Entropy and Information.” In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California.
  • 31. Rosenbaum Paul R, and Rubin Donald B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1). Oxford University Press: 41–55.
  • 32. Shafieezadeh-Abadeh Soroosh, Esfahani Peyman Mohajerin, and Kuhn Daniel. 2015. “Distributionally Robust Logistic Regression.” In Advances in Neural Information Processing Systems, 1576–84.
  • 33. Shivaswamy Pannagadatta, and Joachims Thorsten. 2012. “Multi-Armed Bandit Problems with History.” In Artificial Intelligence and Statistics, 1046–54.
  • 34. Strehl Alex, Langford John, Li Lihong, and Kakade Sham M. 2010. “Learning from Logged Implicit Exploration Data.” In Advances in Neural Information Processing Systems, 2217–25.
  • 35. Sugiyama Masashi, Krauledat Matthias, and Müller Klaus-Robert. 2007. “Covariate Shift Adaptation by Importance Weighted Cross Validation.” Journal of Machine Learning Research 8 (May): 985–1005.
  • 36. Sutton Richard S, and Barto Andrew G. 1998. Reinforcement Learning: An Introduction. Vol. 1. MIT Press Cambridge.
  • 37. Swaminathan Adith, and Joachims Thorsten. 2015a. “Counterfactual Risk Minimization: Learning from Logged Bandit Feedback.” In International Conference on Machine Learning, 814–23.
  • 38. Swaminathan Adith, and Joachims Thorsten. 2015b. “The Self-Normalized Estimator for Counterfactual Learning.” In Advances in Neural Information Processing Systems, 3231–9.
  • 39. Thomas Philip, and Brunskill Emma. 2016. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” In International Conference on Machine Learning, 2139–48.
  • 40. Vapnik Vladimir. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
  • 41. Wang Yu-Xiang, Agarwal Alekh, and Dudík Miroslav. 2017. “Optimal and Adaptive Off-Policy Evaluation in Contextual Bandits.” In International Conference on Machine Learning, 3589–97.
  • 42. Xu Huan, Caramanis Constantine, and Mannor Shie. 2009. “Robust Regression and Lasso.” In Advances in Neural Information Processing Systems, 1801–8.
  • 43. Zadrozny Bianca, Langford John, and Abe Naoki. 2003. “Cost-Sensitive Learning by Cost-Proportionate Example Weighting.” In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 435–42. IEEE.
