Entropy. 2020 Aug 25;22(9):929. doi: 10.3390/e22090929

Probability Forecast Combination via Entropy Regularized Wasserstein Distance

Ryan Cumings-Menon 1, Minchul Shin 2,*
PMCID: PMC7597186  PMID: 33286698

Abstract

We propose probability and density forecast combination methods that are defined using the entropy regularized Wasserstein distance. First, we provide a theoretical characterization of the combined density forecast based on the regularized Wasserstein distance under a Gaussian assumption. More specifically, we show that the regularized Wasserstein barycenter between multivariate Gaussian input densities is multivariate Gaussian, and provide a simple way to compute its mean and variance–covariance matrix. Second, we show how this type of regularization can improve the predictive power of the resulting combined density. Third, we provide a method for choosing the tuning parameter that governs the strength of regularization. Lastly, we apply our proposed method to U.S. inflation rate density forecasting, and illustrate how the entropy regularization can improve the quality of the predictive density relative to its unregularized counterpart.

Keywords: entropy regularization, Wasserstein distance, optimal transport, density forecasting, forecast combination, model combination, quantile aggregation

1. Introduction

In this paper, we study a class of density forecast combination methods based on a Wasserstein metric. In the univariate case, an equally weighted centroid defined by a Wasserstein metric corresponds to a quantile averaging or vincentized center, in which the quantiles of the forecast densities are averaged. The resulting combined density tends to be narrower than that of the linear opinion rule [1,2,3], which may or may not be desirable, depending on the context.

We propose to use the entropy regularized Wasserstein metric to construct a combined density forecast. Like its unregularized counterpart, this combined probability/density can be defined by an optimization problem, but the optimization problem in this case includes an additional regularization term that penalizes densities with low entropy, which ensures the combined density forecast is smooth. One advantage of this approach is that the entropy regularized Wasserstein barycenter can be found in a much more computationally efficient manner than its unregularized counterpart when the input densities are multi-dimensional [4].

While computational efficiency is the most commonly cited reason for using entropy regularization, this paper demonstrates that there is an additional advantage of regularization when it comes to the density combination problem. It provides a way to tune the degree of dispersion of the combined density forecast. To the best of our knowledge, this regularized metric has not been explored in the context of the density forecasting combination problem.

As a part of our discussion, we provide a theoretical characterization of the regularized Wasserstein distance under the Gaussian assumption. More specifically, we show that the regularized Wasserstein barycenter between two multivariate Gaussian inputs is multivariate Gaussian. Our proof complements Theorem 1 of [5], which characterizes the regularized Wasserstein barycenter among an arbitrary number of univariate normal densities. In addition, our result also provides a simple recursive equation that is guaranteed to converge to the variance–covariance matrix.

We proceed as follows. Section 2 formulates a density forecast combination problem with a general metric. Several existing aggregation methods in the literature can be formulated with the choice of a specific metric within this unified framework. After discussing these existing approaches, we introduce our proposal of using the entropy regularized Wasserstein barycenter. Section 3 provides theoretical results that describe the impact of entropy regularization on the combined density under a Gaussian assumption and discusses how this helps improve the quality of the combined density prediction. Section 4 discusses how to set the strength of the entropy regularization in practice and shows that our proposed selection rule achieves a certain notion of optimality. Section 5 provides an empirical exercise that illustrates how entropy regularization improves the quality of density prediction of the U.S. inflation rate relative to the unregularized combined density forecast. Section 6 concludes the article.

2. Regularized Wasserstein Barycenter for Density Forecast Combination

This section introduces the density combination problem; see, for example, [6]. We assume that agent $i \in \{1,\dots,N\}$ at time $t \in \mathbb{N}_+$ provides a forecast of the density function $p_{it}: \mathbb{R}^d \to \mathbb{R}_+$, with distribution function denoted by $P_{it}: \mathbb{R}^d \to \mathbb{R}_+$, of the random variable $y_{t+h}$ with $h \in \mathbb{N}_+$. We are interested in aggregating the information contained in the $N$ agents’ forecasts to generate a better predictive distribution for $y_{t+h}$.

Throughout the paper, we shall focus on density combinations that can be viewed as a type of average over probability densities. Specifically, those that can be defined as

$$\bar{p}_t = \arg\min_{p_t \in \mathcal{P}} \sum_{i=1}^{N} D(p_{it}, p_t), \tag{1}$$

where $D(p_i,p_j)$ is a measure of the discrepancy between the densities $p_i$ and $p_j$. When $D(\cdot)$ satisfies the usual properties of a distance metric, which is the case when $D(\cdot)$ is defined as the Euclidean or an unregularized Wasserstein metric, $\bar{p}_t$ is known as a Fréchet mean, which is a generalization of the average for real numbers. We will refer to $\bar{p}_t$ as a barycenter to also encompass the more general case in which $D(\cdot)$ is not a metric. As described in Equation (1), we restrict our attention to the case in which $\bar{p}_t$ is a density forecast with each input density having equal weight, which is known to perform quite well as a combination forecast [7].

Each choice of metric, $D(p_i,p_j)$, leads to a different combined density, $\bar{p}_t$. Before introducing our proposed definition of $D(\cdot)$, the entropy regularized Wasserstein metric, the next two subsections introduce choices for $D(p_i,p_j)$ that lead to well-known density forecast combination methods.

2.1. Equal-Weighted Linear Opinion Rule

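As a starting point, let us consider $D(p_i,p_j) := \lVert p_i - p_j \rVert_2^2$. Then, Equation (1) becomes

$$\bar{p}_t = \arg\min_{p_t \in \mathcal{P}} \sum_{i=1}^{N} \int (p_{it} - p_t)^2, \tag{2}$$

which results in the following solution

$$\bar{p}_t = \frac{1}{N} \sum_{i=1}^{N} p_{it}. \tag{3}$$

This can be derived using the first-order condition with respect to $p_t$, which is $\sum_{i=1}^{N} (p_{it} - \bar{p}_t) = 0$.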

This solution is known as the linear opinion rule with equal weighting. This is the prototypical aggregation method both in the forecasting literature and in practice; see, for example, [1]. It is a particularly tractable density combination method, as it is equivalent to a mixture density and is cheap to compute. However, one disadvantage is that it does not preserve the shape of the individual forecast densities. For example, when combining two uni-modal densities, the resulting solution can be bi-modal.
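The linear opinion rule in Equation (3) can be illustrated with a minimal sketch. The two Gaussian forecasts below are hypothetical inputs chosen for illustration, and the helper names `linear_pool_pdf` and `linear_pool_moments` are not from the paper.

```python
from statistics import NormalDist

def linear_pool_pdf(y, forecasts):
    """Equal-weight linear opinion pool: average of the input densities at y."""
    return sum(f.pdf(y) for f in forecasts) / len(forecasts)

def linear_pool_moments(forecasts):
    """Mean and variance of the equal-weight mixture of the inputs."""
    n = len(forecasts)
    mean = sum(f.mean for f in forecasts) / n
    # Second moment of the mixture minus the squared mixture mean.
    second_moment = sum(f.variance + f.mean ** 2 for f in forecasts) / n
    return mean, second_moment - mean ** 2

# Hypothetical forecasts (assumed for illustration): N(-2, 1) and N(2, 1).
forecasts = [NormalDist(-2.0, 1.0), NormalDist(2.0, 1.0)]
pool_mean, pool_var = linear_pool_moments(forecasts)

# With well-separated means the pooled density dips between the two modes.
density_at_center = linear_pool_pdf(0.0, forecasts)
density_at_mode = linear_pool_pdf(2.0, forecasts)
```

In this example the pooled density is lower at the midpoint than near either input mean, which is the loss of uni-modality described above.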

2.2. Quantile Aggregation and the Wasserstein Barycenter

In this section we consider the case in which D(·) is defined as the p-Wasserstein metric, which is defined as

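$$W_p(p_{it},p_{jt}) = \left( \inf_{\varphi \in \Omega(p_{it},p_{jt})} \int \lVert z_i - z_j \rVert^p \, d\varphi(z_i,z_j) \right)^{1/p}, \tag{4}$$

where $\Omega(p_{it},p_{jt})$ is the set of all joint distributions $\varphi(z_i,z_j)$ that have marginal densities given by $p_{it}$ and $p_{jt}$, respectively. Formally, we write

$$\Omega(p_{it},p_{jt}) = \left\{ \varphi: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+^1 \;\middle|\; \forall A \subset \mathbb{R}^d,\ \varphi(A,\mathbb{R}^d) = p_{it}(A) \text{ and } \varphi(\mathbb{R}^d,A) = p_{jt}(A) \right\}. \tag{5}$$

In other words, each $\varphi \in \Omega(p_{it},p_{jt})$ is a coupling between the distributions $p_{it}$ and $p_{jt}$. In the optimal transport literature, the minimizer of (4) is also known as the optimal transport plan. This is because, for any $A,B \subset \mathbb{R}^d$, $\varphi(A,B)$ can be interpreted as the amount of mass that is moved from $A$ to $B$ in order to minimize $E\lVert z_i - z_j \rVert^p$, where $z_i \sim p_{it}$ and $z_j \sim p_{jt}$. For more detail on the field of optimal transport, see [8,9].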

A special case of this Wasserstein barycenter has a close relation to a recently proposed probability/density forecast combination method in the forecasting literature. More specifically, suppose that the input densities are univariate and $D(\cdot)$ is defined as the squared 2-Wasserstein metric, $D(\cdot) := W_2^2(\cdot)$; in this case, we have

$$\bar{P}_t^{-1}(\tau) = \frac{1}{N} \sum_{i=1}^{N} P_{it}^{-1}(\tau), \quad \text{for all } \tau \in (0,1), \tag{6}$$

where $P_{it}^{-1}(\cdot)$ and $\bar{P}_t^{-1}(\cdot)$ are the quantile functions of agent $i$ and of the combination method, respectively. This forecast aggregation rule is also known as “quantile aggregation” or “Vincentized distribution” [2,3,10]. We prefer the representation of Equation (1) because, unlike quantile aggregation, this definition can be easily extended to higher dimensional densities or mixed data types (e.g., when some inputs are continuous and others are discrete).
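Quantile aggregation as in Equation (6) can be sketched in a few lines. The two Gaussian inputs are hypothetical, and the helper name `quantile_average` is introduced here for illustration only. For Gaussian inputs, the averaged quantile function is again Gaussian, with the average mean and the average standard deviation, which the sketch uses as a reference.

```python
from statistics import NormalDist

def quantile_average(tau, forecasts):
    """Equal-weight average of the input quantile functions at level tau."""
    return sum(f.inv_cdf(tau) for f in forecasts) / len(forecasts)

# Hypothetical forecasts (assumed for illustration): N(-2, 1^2) and N(2, 3^2).
forecasts = [NormalDist(-2.0, 1.0), NormalDist(2.0, 3.0)]
taus = [0.05, 0.25, 0.50, 0.75, 0.95]
combined_quantiles = [quantile_average(tau, forecasts) for tau in taus]

# Reference: Gaussian with averaged mean (0) and averaged standard deviation (2).
reference = NormalDist(0.0, 2.0)
reference_quantiles = [reference.inv_cdf(tau) for tau in taus]
```

The combined quantiles coincide with those of the reference distribution, so the combination stays uni-modal and Gaussian in this case.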

The Wasserstein barycenter is known to preserve the shape of input densities, such as log-concavity [11]. For example, [12] show that the Wasserstein barycenter of the inputs $N(\mu_1,S_1)$ and $N(\mu_2,S_2)$ is $N((\mu_1+\mu_2)/2, S)$, where $S$ is the solution of

$$S = \left( S^{1/2} S_1 S^{1/2} \right)^{1/2}/2 + \left( S^{1/2} S_2 S^{1/2} \right)^{1/2}/2; \tag{7}$$

see also [13]. This is different from the linear opinion rule, which, in the univariate case, leads to a mixture of two normal densities with mean $(\mu_1+\mu_2)/2$ and variance $\frac{\sigma_1^2+\sigma_2^2}{2} + \frac{(\mu_1-\mu_2)^2}{4}$, and which, in contrast, can be bi-modal when $\mu_1$ and $\mu_2$ are sufficiently far apart.

Another difference between these two aggregation methods is that the variance of the Wasserstein barycenter is smaller than that of the combined density resulting from a linear opinion rule. This holds for a more general class of input densities as shown in [2] in the univariate case. Of course, a narrow (i.e., sharp) predictive density can be good or bad depending on the underlying distribution of the target variable. It may be desirable to have an ability to flexibly adjust the dispersion of the combined density.

2.3. Regularized Wasserstein Barycenter

Now, we turn to our proposal. In this paper, we use a regularized Wasserstein distance [14,15] to combine individual probability forecasts. The regularization term used in this approximation of the Wasserstein metric is given by the negative differential entropy, which, when $\varphi$ is an absolutely continuous measure, we define as $h(\varphi) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \log\left( \frac{d\varphi}{d\lambda} \right) d\varphi$, where $\lambda$ is the Lebesgue measure, and as infinity otherwise. We will use $h(\varphi)$ to define the regularized Wasserstein metric as

$$W_{p,\gamma}(p_{it},p_{jt}) = \left( \inf_{\varphi \in \Omega(p_{it},p_{jt})} \int \lVert z_i - z_j \rVert^p \, d\varphi(z_i,z_j) + \gamma h(\varphi) \right)^{1/p}, \tag{8}$$

where $\gamma > 0$ controls the strength of regularization. Note that $\varphi$ is constrained by the same two marginal restrictions as its unregularized counterpart, as described in the definition of $\Omega(p_{it},p_{jt})$. This form of regularization was originally introduced by [14] in order to estimate the Wasserstein metric in a computationally efficient manner using the iterative proportional fitting procedure (IPFP) provided by [16].

When $\gamma = 0$, there is no regularization, so we have $W_{p,0}(p_{it},p_{jt}) = W_p(p_{it},p_{jt})$. One can also show that the optimal coupling, say $\varphi_\gamma^\star$, satisfies $\lim_{\gamma \to 0^+} \varphi_\gamma^\star = \varphi_0^\star$ when $\varphi_0^\star$ is uniquely defined, and otherwise this limiting value is given by the element of the set of optimal unregularized couplings with maximum entropy [15]. Higher values of $\gamma$ place more weight on the second term in the objective function, which results in optimal couplings that are smoother and more dispersed than their unregularized counterparts.
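The IPFP idea mentioned above can be sketched for discretized densities. This is a minimal illustration under assumptions, not the authors' implementation: the inputs are histograms on a hypothetical grid, the cost is the squared distance between grid points, and the function name `sinkhorn_coupling` and all parameter values are chosen here for illustration.

```python
import numpy as np

def sinkhorn_coupling(a, b, cost, gamma, n_iter=1000):
    """Entropy regularized coupling between histograms a and b via IPFP scaling."""
    K = np.exp(-cost / gamma)      # Gibbs kernel implied by the entropy penalty
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)          # rescale so the column marginal equals b
        u = a / (K @ v)            # rescale so the row marginal equals a
    return u[:, None] * K * v[None, :]

def gaussian_hist(x, mean, std):
    """Gaussian weights on the grid x, normalized to sum to one."""
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()

# Hypothetical grid and inputs (assumed for illustration).
x = np.linspace(-4.0, 4.0, 81)
a = gaussian_hist(x, -1.0, 1.0)
b = gaussian_hist(x, 1.0, 1.0)
cost = (x[:, None] - x[None, :]) ** 2
plan = sinkhorn_coupling(a, b, cost, gamma=0.5)
```

Each pass alternately rescales rows and columns of the kernel, so the returned plan is nonnegative and its marginals match the two inputs once the iterations have converged. Larger values of `gamma` spread the plan's mass further from the diagonal.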

Defining $D(p_{it},p_{jt})$ by $W_{2,\gamma}^2(p_{it},p_{jt})$ results in the combined density

$$\bar{p}_t = \arg\min_{p_t \in \mathcal{P}} \sum_{i=1}^{N} W_{2,\gamma}^2(p_{it},p_t), \tag{9}$$

which is known as the regularized Wasserstein barycenter. The authors of [4] provided a generalization of the IPFP procedure to find this barycenter that is more computationally efficient than the unregularized case. While computational efficiency is the commonly cited reason for using entropy regularization, as we will see in the later sections, our motivation for regularization is not entirely computational.
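For discretized densities, a barycenter of this type can be approximated with iterative scaling updates in the spirit of the generalized IPFP of [4]. The sketch below is an assumption-laden illustration rather than the algorithm used in the paper: it uses equal weights, a hypothetical one-dimensional grid, and a helper name `entropic_barycenter` introduced here.

```python
import numpy as np

def entropic_barycenter(hists, cost, gamma, n_iter=2000):
    """Equal-weight entropy regularized barycenter of histograms on a common grid."""
    K = np.exp(-cost / gamma)
    f = [np.ones_like(h) for h in hists]
    for _ in range(n_iter):
        # Match each coupling's second marginal to the corresponding input.
        g = [h / (K.T @ fs) for h, fs in zip(hists, f)]
        Kg = [K @ gs for gs in g]
        # Common first marginal: geometric mean across inputs (equal weights).
        bary = np.exp(np.mean([np.log(k) for k in Kg], axis=0))
        f = [bary / k for k in Kg]
    return bary

def gaussian_hist(x, mean, std):
    """Gaussian weights on the grid x, normalized to sum to one."""
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()

# Hypothetical inputs (assumed for illustration): N(-1, 1) and N(1, 1).
x = np.linspace(-5.0, 5.0, 201)
hists = [gaussian_hist(x, -1.0, 1.0), gaussian_hist(x, 1.0, 1.0)]
cost = (x[:, None] - x[None, :]) ** 2
gamma = 0.5
bary = entropic_barycenter(hists, cost, gamma)

bary_mean = float((x * bary).sum() / bary.sum())
bary_var = float(((x - bary_mean) ** 2 * bary).sum() / bary.sum())
```

In this setup the barycenter is centered at the midpoint of the two input means, and its variance exceeds the common input variance of one, which illustrates the smoothing effect of the entropy penalty.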

For the rest of the paper, we study this regularized Wasserstein barycenter, which is p¯t defined in Equation (1) using (8). First, we present analytical results under a parametric assumption that broadens our understanding about the role of the regularization in forecast density combination. Then, we discuss how one can empirically choose the strength of the regularization that would achieve a certain notion of optimality.

3. Analytical Results: The Impact of Entropy Regularization

In this section, we provide analytical results that describe the impact of entropy regularization on the shape of the barycenter. To better compare this barycenter with its unregularized counterpart in the Gaussian case, as defined above, we will focus on the regularized barycenter when $p_1$ and $p_2$ are $d$-dimensional multivariate Gaussians ($d \geq 1$). The regularized Wasserstein barycenter in this case is defined as

$$\bar{p} \in \arg\min_{q} W_\gamma^2(p_1,q) + W_\gamma^2(p_2,q). \tag{10}$$

The following theorem completely characterizes the resulting barycenter in this case. Like the unregularized case, the theorem shows that regularization does not impact the mean of the barycenter; however, it does have an impact on its variance–covariance matrix.

Theorem 1.

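Let $p_1$ and $p_2$ be Gaussian density functions with means $\mu_1,\mu_2 \in \mathbb{R}^d$ and variance matrices $S_1,S_2 \in \mathbb{R}^{d \times d}$. The regularized Wasserstein barycenter between $p_1$ and $p_2$ is given by the density function of $N(\mu_B,S_B)$, where $\mu_B \in \mathbb{R}^d$ and $S_B \in \mathbb{R}^{d \times d}$ are defined by

$$\begin{aligned} \mu_B &:= (\mu_1+\mu_2)/2, \\ S_B &:= \left( V/\gamma + I \right)^{-1} \left( V/2 + I\gamma/2 + S_2 \right) \left( V/\gamma + I \right)^{-1} \\ &= \left( -V/\gamma + I \right)^{-1} \left( -V/2 + I\gamma/2 + S_1 \right) \left( -V/\gamma + I \right)^{-1}, \end{aligned}$$

where $V \in \mathbb{R}^{d \times d}$ is the unique symmetric matrix that satisfies these equalities and $-I\gamma < V < I\gamma$.

Also, the iterates of the following recursion converge to $V$ when $V^{(0)} := 0_{d \times d}$,

$$V^{(k+1)} = S_2 - S_1 + S_1 \left( S_1 + I\gamma/2 - V^{(k)}/2 \right)^{-1} S_1 - S_2 \left( S_2 + I\gamma/2 + V^{(k)}/2 \right)^{-1} S_2.$$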

The proof of this result is included in Appendix A. We prove a slightly more general version of the theorem where the objective function in Equation (10) is a weighted average of $W_\gamma^2(p_1,q)$ and $W_\gamma^2(p_2,q)$. The proof first derives a system of equations that characterizes the barycenter in the case in which the regularized barycenter is Gaussian. Afterward, a fixed point theorem provided by [17] for mappings on partially ordered sets is used to show that this system has a unique solution, and this, along with the convexity of Equation (10), implies that the regularized barycenter is Gaussian.
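The recursion in Theorem 1 is straightforward to run numerically. The following sketch assumes the equal-weight case, iterates $V^{(k+1)} = S_2 - S_1 + S_1(S_1 + I\gamma/2 - V^{(k)}/2)^{-1}S_1 - S_2(S_2 + I\gamma/2 + V^{(k)}/2)^{-1}S_2$ from $V^{(0)} = 0$, and then evaluates both expressions for $S_B$ as a consistency check. The function name, the fixed iteration count, and the example matrices are assumptions made for illustration.

```python
import numpy as np

def gaussian_barycenter_cov(S1, S2, gamma, n_iter=500):
    """Iterate the recursion for V, then return V and both expressions for S_B."""
    d = S1.shape[0]
    I = np.eye(d)
    V = np.zeros((d, d))                       # V^(0) := 0
    for _ in range(n_iter):
        V = (S2 - S1
             + S1 @ np.linalg.solve(S1 + I * gamma / 2 - V / 2, S1)
             - S2 @ np.linalg.solve(S2 + I * gamma / 2 + V / 2, S2))
    M_plus = np.linalg.inv(V / gamma + I)
    SB = M_plus @ (V / 2 + I * gamma / 2 + S2) @ M_plus
    M_minus = np.linalg.inv(-V / gamma + I)
    SB_alt = M_minus @ (-V / 2 + I * gamma / 2 + S1) @ M_minus
    return V, SB, SB_alt

# Hypothetical positive definite inputs (assumed for illustration).
S1 = np.array([[1.0, 0.3], [0.3, 0.5]])
S2 = np.array([[2.0, -0.4], [-0.4, 1.0]])
V, SB, SB_alt = gaussian_barycenter_cov(S1, S2, gamma=1.0)

# Equal covariances: V stays at zero and S_B = S + I * gamma / 2.
V_eq, SB_eq, _ = gaussian_barycenter_cov(np.eye(2), np.eye(2), gamma=1.0)
```

A fixed iteration count keeps the sketch short; a practical version would more likely stop once successive iterates of `V` differ by less than a tolerance.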

Now, we discuss our theoretical results and their implications for the density forecast combination problem.

Remark 1.

(on location). Regularization does not affect the mean of the resulting barycenter, which is a property that may not hold in the more general setting that does not include a normality assumption. For example, suppose the domain of $p$ is $[0,1]$ and $E_{x \sim p}(x) \neq 1/2$, and consider the barycenter between $p$ and itself. For any fixed density function $q$, as $\gamma \to \infty$ the optimal coupling of the optimization problem that defines $W_\gamma^2(p,q)$ converges to $d\varphi(z_1,z_2)/d\lambda = q(z_1)p(z_2)$, as this is the coupling with maximum entropy that has marginals given by $q$ and $p$; see, for example, [15]. However, the negative entropy of $d\varphi(z_1,z_2)/d\lambda = p(z_2)$ is less than or equal to that of $d\varphi(z_1,z_2)/d\lambda = q(z_1)p(z_2)$ for any such fixed density $q$. We can also ensure these couplings are feasible by defining $q$ to be a uniform density function, so we have $\lim_{\gamma \to \infty} q = 1$. This implies that $\lim_{\gamma \to \infty} E_{x \sim q}(x) = 1/2$, regardless of $E_{x \sim p}(x)$. Since the unregularized density is given by $q = p$, and $E_{x \sim p}(x) \neq 1/2$, the regularization parameter does impact the mean of the barycenter.

Remark 2.

(on dispersion) Regularization tends to smooth the resulting barycenter, leading to a more dispersed combined density. To understand this point, let us consider a simple example below.

Example 1.

Consider a case with univariate $p_{it} = N(\mu_{it},\sigma^2)$ and $N = 2$. Then, the original Wasserstein barycenter (quantile averaging) is $\bar{p}_t = N((\mu_{1t}+\mu_{2t})/2, \sigma^2)$. On the other hand, the regularized Wasserstein barycenter is $\bar{p}_t(\gamma) = N((\mu_{1t}+\mu_{2t})/2, \sigma^2+\gamma/2)$.

As this case exemplifies, the strength of the regularization controls the dispersion of the combined density. The heavier the regularization, the more dispersed (or the smoother) the density we obtain. This result highlights that the entropy regularization offers extra flexibility to control the dispersion of the combined density. In the next section, we propose a data-driven way to select the value of $\gamma$, the strength of the regularization.
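The variance in Example 1 can be checked directly against Theorem 1. The short derivation below is a sketch that assumes the univariate, equal-variance case $S_1 = S_2 = \sigma^2$ and starts the recursion for $V$ at zero.

```latex
% First step of the recursion with S_1 = S_2 = \sigma^2 and V^{(0)} = 0:
V^{(1)} = \sigma^2 - \sigma^2
        + \frac{\sigma^4}{\sigma^2 + \gamma/2}
        - \frac{\sigma^4}{\sigma^2 + \gamma/2} = 0
\quad\Longrightarrow\quad V = 0 .
% Substituting V = 0 into the expression for S_B:
S_B = (0/\gamma + 1)^{-1}\,\bigl(0/2 + \gamma/2 + \sigma^2\bigr)\,(0/\gamma + 1)^{-1}
    = \sigma^2 + \gamma/2 .
```

The two correction terms cancel when the input variances are equal, so zero is the fixed point and regularization adds $\gamma/2$ to the common variance.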

Remark 3.

The normality assumption that we made to obtain the closed-form solution for the barycenter is not needed in practice. The regularized barycenter of probability/density forecasts is well-defined and computationally tractable in a much broader context. One can have multiple inputs, non-Gaussian densities, and discrete, continuous, or mixed distributions. This includes many interesting and empirically relevant situations in economic forecasting, such as macroeconomic and financial forecasting. The efficient computation of the regularized Wasserstein distance and barycenter with non-Gaussian input densities is still an active area of research. There is a large literature on computing the regularized barycenter in practice; see, for example, [4,18,19,20,21,22,23].

Remark 4.

During the review process for this paper, we became aware of a similar result that was proved independently of ours by [5]. There are two primary differences between these results. First, our result provides the regularized barycenter between two multivariate normal densities, while Theorem 1 in [5] provides the barycenter between an arbitrary number of univariate normal densities. Second, our result also provides a recursive formula to compute the variance–covariance matrix of the barycenter, which is guaranteed to converge to the desired solution. We thank the referee who pointed out the relevant papers.

There have also been a number of recent results on a few related barycenters, including those that are modified to avoid the increase in the dispersion of the barycenter caused by regularization using one of the following two techniques. First, a Kullback–Leibler divergence penalty term can be used, with a reference measure given by the product of the input densities, rather than differential entropy. Second, a technique known as debiasing can also be used. For example, the remaining results in [5], as well as the results provided by [24,25], characterize these types of regularized Wasserstein barycenters between Gaussian densities. In contrast to the barycenter we consider, which can be viewed as the original discrete entropy regularized Wasserstein barycenter in the limit as the number of bins diverges, increasing the regularization parameter of these alternative barycenters either decreases or does not change the variance of the barycenter.

4. On Choosing the Strength of the Regularization

This section discusses how to choose the strength of the penalization. Our empirical strategy is to select $\gamma$ as the value that most accurately fits the observed data. To economize on notation, we restrict our discussion to the 1-step-ahead prediction (i.e., $h = 1$). To do so, we regard the regularized barycenter computed at time $t$, $\bar{p}_t$, as a predictive likelihood for $y_{t+1}$. This predictive likelihood interpretation of the barycenter can be formally justified by a principal-agent framework similar to the one developed by [26]. Suppose we have collected the regularized barycenters and the realized values of the target variable from the initial period ($1$) to the present ($t$). We write this collection as $I_t$. Then, we can define a maximum likelihood estimator for $\gamma$ at $t$ with $I_t$ as

$$\hat{\gamma}_{1:t}^{mle} \in \arg\max_{\gamma \geq 0} \sum_{\tau=1}^{t-1} \log \bar{p}_\tau(y_{\tau+1};\gamma), \tag{11}$$

and the combined density prediction for $y_{t+1}$ at time $t$ is

$$\hat{p}(y_{t+1} \mid I_t) = \bar{p}_t(y_{t+1};\hat{\gamma}_{1:t}^{mle}). \tag{12}$$

There is a notion in which this combined density with $\hat{\gamma}$ is optimal. Suppose that $y_t \overset{i.i.d.}{\sim} p^*(y)$, and assume that forecasters report a sequence of predictive densities, $p_i(y)$ for $y_t$, $t = 1,2,\dots,T$ and $i = 1,2,\dots,N$. These forecasts are reported before the realization of $y_t$, and the barycenter $\bar{p}(y;\gamma)$ is defined by the $p_i(y)$’s and $\gamma > 0$. Then, the following can be shown under regularity conditions,

$$\frac{1}{T} \sum_{t=1}^{T} \log \bar{p}(y_t;\gamma) \overset{p}{\to} \int \log \bar{p}(y;\gamma)\, p^*(y)\, dy \quad \text{as } T \to \infty,$$

for $\gamma \in \Gamma \subset \mathbb{R}_+$. In turn, a maximizer of the left-hand-side term also converges to the maximizer of the right-hand-side term, which is a minimizer of

$$KL(\bar{p}(y;\gamma), p^*(y)) = -\int \log\left(\bar{p}(y;\gamma)\right) p^*(y)\, dy + \int \log\left(p^*(y)\right) p^*(y)\, dy.$$

Therefore, $\hat{\gamma}$ converges to the pseudo-true parameter that minimizes the Kullback–Leibler (KL) divergence from the regularized barycenter to the true data generating process. In other words, we find the $\gamma$ that makes the resulting barycenter close to the true data generating process in the limit. This asymptotic thought experiment can be justified under quite general conditions, allowing for a range of serial dependence in $y_t$ as well as a flexible form of the regularized Wasserstein barycenter implied by the $p_{i,t-1}(y_t)$’s. We can operationalize this by recognizing that $\bar{p}_{t-1}(y;\gamma)$ can be viewed as a predictive likelihood for $y_t$ formed at time $t-1$. Then, quasi-MLE theory can be invoked, e.g., [27,28]. We provide a simple example in which the true data generating process follows an autoregressive (AR) process.

Example 2.

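Suppose that forecasters 1 and 2 use a mean-zero Gaussian AR(1) process to construct their density predictions. The two forecasts differ only by the mean reversion parameter. That is, the means of the predictive distributions for forecasters 1 and 2 are $\mu_{1t} = \rho_1 y_{t-1}$ and $\mu_{2t} = \rho_2 y_{t-1}$, respectively. Based on our theory in the previous section, the barycenter is $\bar{p}_{t-1}(y;\gamma) = N(\bar{\mu}_t, \sigma^2+\gamma/2)$, where $\bar{\mu}_t = (\mu_{1t}+\mu_{2t})/2$, and the log density of the regularized barycenter at $\tau$ for $y_{\tau+1}$ is

$$\log\left(\bar{p}_\tau(y_{\tau+1};\gamma)\right) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2+\gamma/2) - \frac{1}{2}\left( \frac{y_{\tau+1}-\bar{\mu}_{\tau+1}}{\sqrt{\sigma^2+\gamma/2}} \right)^2, \tag{13}$$

and the ML estimator for $\gamma$ at time $t$ is

$$\hat{\gamma}_{1:t}^{mle} \in \arg\max_{\gamma \geq 0} \sum_{\tau=1}^{t-1} \left[ -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2+\gamma/2) - \frac{1}{2}\left( \frac{y_{\tau+1}-\bar{\mu}_{\tau+1}}{\sqrt{\sigma^2+\gamma/2}} \right)^2 \right], \tag{14}$$

which leads to

$$\hat{\gamma}_{1:t}^{mle} = 2 \times \max\left( \frac{1}{t-1} \sum_{\tau=1}^{t-1} (y_{\tau+1}-\bar{\mu}_{\tau+1})^2 - \sigma^2,\ 0 \right). \tag{15}$$

Now, suppose that the actual data generating process is

$$y_t = \rho^* y_{t-1} + v_t, \quad v_t \overset{i.i.d.}{\sim} N(0,\sigma_*^2). \tag{16}$$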

When the simple average of the two forecasters’ autoregressive parameters equals $\rho^*$, the ML estimate for $\gamma$ depends on the true conditional variance, $\sigma_*^2$, and the forecasters’ conditional variance. If the sample variance is larger than that of the forecasters, then $\gamma$ is chosen so that the resulting regularized barycenter has the same variance as the sample variance. On the other hand, if the sample variance is smaller than that of the forecasters, then $\gamma$ is set to 0. Note that there is an asymmetry in adjusting the variance of the barycenter. This is natural in that the regularization only makes the resulting density smoother. In practice, this may not be a problem if the practitioner’s concern is the combined density being too sharp (e.g., relative to the linear opinion rule).

Note that $\hat{\gamma}_{1:t}^{mle}$ converges in probability to $\gamma_0 := 2\max(\sigma_*^2 - \sigma^2, 0)$. The KL divergence between $\bar{p}(y_{t+1};\gamma)$ and the true conditional density of $y_{t+1}$ at $t$ is minimized at $\gamma = \gamma_0$. This confirms that our selection rule for $\gamma$ aims to fit the data well by shaping the regularized barycenter as close as possible to the data generating process.
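The closed-form estimator in Equation (15) can be tried on simulated data. The sketch below assumes the setting of Example 2, with the average of the two forecasters' autoregressive parameters equal to the true one. All parameter values, the seed, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def gamma_mle(y, rho1, rho2, sigma2):
    """Closed-form estimator: twice the excess of the mean squared error over sigma2."""
    mu_bar = 0.5 * (rho1 + rho2) * y[:-1]      # barycenter mean for the next period
    resid = y[1:] - mu_bar
    return 2.0 * max(float(np.mean(resid ** 2)) - sigma2, 0.0)

def simulate_ar1(rho, sigma, T, seed=0):
    """Simulate a mean-zero Gaussian AR(1) process started at zero."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = rho * y[t - 1] + sigma * rng.standard_normal()
    return y

# Assumed values: true rho* = 0.5 and sigma*^2 = 2; forecasters use rho = 0.3 and 0.7.
y = simulate_ar1(rho=0.5, sigma=np.sqrt(2.0), T=20000)

# Forecasters too sharp (sigma^2 = 1): the estimate should be near 2 * (2 - 1).
gamma_hat_sharp = gamma_mle(y, 0.3, 0.7, sigma2=1.0)

# Forecasters too dispersed (sigma^2 = 3): the estimate is truncated at zero.
gamma_hat_wide = gamma_mle(y, 0.3, 0.7, sigma2=3.0)
```

The second call shows the asymmetry discussed above: regularization can only add dispersion, so the estimate is zero whenever the forecasters' variance already exceeds the sample variance of the forecast errors.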

5. Empirical Illustration

In this section, we illustrate our proposed method using macroeconomic data for the U.S. We consider 14 hypothetical forecasters who produce their own 1-step-ahead forecasts of the U.S. inflation rate based on the following vector autoregression (VAR) with three variables,

$$Y_t = \Phi_0 + \sum_{i=1}^{4} \Phi_i Y_{t-i} + e_t, \quad e_t \overset{i.i.d.}{\sim} N(0,\Sigma), \tag{17}$$

where $Y_t$ is a $3 \times 1$ vector that consists of three quarterly macroeconomic variables, $\Phi_0$ is a $3 \times 1$ vector, and $\Phi_1,\Phi_2,\Phi_3,\Phi_4$ are $3 \times 3$ matrices. The first two elements of $Y_t$ are common to all 14 forecasters: the annualized quarter-over-quarter inflation rate and the real GDP growth rate. The forecasters differ in the third element of $Y_t$. We assign each forecaster a different macroeconomic variable from the FRED-QD database by [29]. A detailed description of the variables used in this exercise is in Table 1.

Table 1.

Variables used in empirical exercises.

| Y(i)=[Y1,Y2,Y3(i)] | Used by | Variable Description | FRED-QD Mnemonic |
|---|---|---|---|
| Variable 1 (Y1) | All | Inflation rate | GDPCTPI |
| Variable 2 (Y2) | All | Real GDP growth rate | GDPC1 |
| Variable 3 (Y3(i)) | Forecaster 1 | Real Personal Consumption Expenditures | PCECC96 |
| | Forecaster 2 | Industrial Production Index | INDPRO |
| | Forecaster 3 | All Employees: Total Nonfarm | PAYEMS |
| | Forecaster 4 | Housing Starts: Total Privately Owned Housing Units Started | HOUST |
| | Forecaster 5 | Real Manufacturing and Trade Industries Sales | CMRMTSPLx |
| | Forecaster 6 | Real Crude Oil Prices: West Texas Intermediate (WTI) | OILPRICEx |
| | Forecaster 7 | Real Average Hourly Earnings: Manufacturing | CES3000000008x |
| | Forecaster 8 | 10-Year Treasury Constant Maturity Minus 3-Month Treasury Bill | GS10TB3Mx |
| | Forecaster 9 | Real Commercial and Industrial Loans | BUSLOANSx |
| | Forecaster 10 | Real Total Assets of Households and Nonprofit Organizations | TABSHNOx |
| | Forecaster 11 | U.S. / U.K. Foreign Exchange Rate | EXUSUKx |
| | Forecaster 12 | Consumer Sentiment (University of Michigan) | UMCSENTx |
| | Forecaster 13 | S&P’s Common Stock Price Index: Composite | S&P 500 |
| | Forecaster 14 | Real Disposable Business Income | CNCFx |

Note: All variables are obtained from the FRED-QD database [29]. The inflation rate is computed as a log difference of the GDP deflator (GDPCTPI). The real GDP growth rate is computed as a log difference of real GDP (GDPC1). All other variables are transformed following [29]. We use the 2019-11 vintage data. Each forecaster constructs a predictive distribution using their own vector autoregression with three variables $Y^{(i)} = [Y_1,Y_2,Y_3^{(i)}]$, where $i = 1,2,\dots,14$.

We compute each forecaster’s 1-step-ahead predictive distribution for the inflation rate at time $t$ as $\pi_{t+1|t} \sim N([\mu_{t+1|t}]_{(1,1)}, [\Sigma_{t+1|t}]_{(1,1)})$, where $[x]_{(i,j)}$ denotes the $(i,j)$ element of the vector/matrix $x$. These forecasters assume that the 1-step-ahead predictive distribution of $Y_{t+1}$ at $t$ is Gaussian, and they use their best guess about the predictive mean and variance to construct the predictive distribution. More specifically, they set these two moments as

$$\mu_{t+1|t} = \hat{\Phi}_{0,t} + \sum_{p=1}^{4} \hat{\Phi}_{p,t} Y_{t-p+1}, \quad \text{and} \quad \Sigma_{t+1|t} = \hat{\Sigma}_t, \tag{18}$$

where $(\hat{\Phi}_{0,t},\hat{\Phi}_{1,t},\hat{\Phi}_{2,t},\hat{\Phi}_{3,t},\hat{\Phi}_{4,t},\hat{\Sigma}_t)$ is the posterior mean of $p(\Phi_0,\Phi_1,\Phi_2,\Phi_3,\Phi_4,\Sigma \mid Y_{t:(t-R+1)})$ with a flat prior. We set $R = 80$, meaning that they use the most recent 20 years of quarterly data to construct the predictive distribution.

We let the forecasters generate their 1-step-ahead predictive distributions for the inflation rate from 2001Q1 to 2018Q4. This leaves us with 72 quarters for the forecast evaluation sample. At each point in time, we also combine these 14 predictive densities based on the regularized Wasserstein barycenter with 20 different values of the regularization parameter $\gamma$ on $[0.3,10]$. As we explained in the previous section, a larger value of this parameter implies stronger regularization, and the resulting combined predictive density becomes smoother with a larger variance. We also compute the combined density with $\gamma = 0$, which leads to “quantile aggregation” or the “Vincentized distribution”. Our computation of the regularized barycenter is based on the algorithm proposed by [19]. The MATLAB toolbox that implements this algorithm is available from https://github.com/gpeyre/2015-SIGGRAPH-convolutional-ot.

We evaluate each individual forecaster and each forecast aggregation method by the sum of the log predictive score, which is the logarithm of the predictive density evaluated at the realized value, over the evaluation sample. These results are presented in Figure 1. The left panel presents the sum of the log score for individual forecasters sorted by their performance. There is a sizeable difference in their historical performance. The solid line represents the performance of quantile aggregation, which aggregates all forecasters in the pool. As found by other research papers, e.g., [2,3], the quantile aggregation method generates a decent predictive distribution, which performs slightly better than the ex-post top 4 forecaster.

Figure 1. Sum of log predictive score for U.S. inflation rate (2000Q1–2018Q4).

The right panel in Figure 1 shows the historical performance of our proposed approach with various choices of regularization parameter, γ. For a wide range of values for γ the regularized barycenter performs better than the quantile aggregation. It does even better than the best individual. This is interesting because we cannot identify the best forecaster a priori.

The optimal value of $\gamma$ defined in Equation (11) at the end of the evaluation sample would be the value of $\gamma$ that corresponds to the peak of the curve, which is about $\hat{\gamma}_{2018Q4} \approx 1.3$. If we were to use this value at the beginning of the evaluation sample, then the mean difference in the log predictive score between the regularized Wasserstein barycenter and the quantile aggregation would have been 0.12, with a heteroscedasticity and autocorrelation consistent (HAC) standard error of 0.07. This implies that the difference between the peak of the curve and the solid line is statistically significant at the 10% significance level.
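The significance statement can be reproduced from the reported numbers, and a HAC standard error for a mean score difference can be sketched as follows. The paper does not state which HAC estimator or lag length was used, so the Newey–West (Bartlett kernel) form and the function name `newey_west_se` below are assumptions for illustration only.

```python
import numpy as np

def newey_west_se(d, lags):
    """Newey-West standard error of the sample mean of d (Bartlett kernel)."""
    d = np.asarray(d, dtype=float)
    T = d.size
    e = d - d.mean()
    long_run_var = e @ e / T                   # lag-0 autocovariance
    for l in range(1, lags + 1):
        weight = 1.0 - l / (lags + 1.0)        # Bartlett weights
        long_run_var += 2.0 * weight * (e[l:] @ e[:-l]) / T
    return float(np.sqrt(long_run_var / T))

# With zero lags the estimator reduces to the usual standard error of the mean.
se_no_lags = newey_west_se([1.0, 2.0, 3.0, 4.0], lags=0)

# t-ratio implied by the reported mean difference and HAC standard error.
t_ratio = 0.12 / 0.07
significant_at_10 = t_ratio > 1.645            # two-sided normal critical value
significant_at_5 = t_ratio > 1.960
```

The reported numbers give a t-ratio of about 1.71, which exceeds the two-sided 10% normal critical value but not the 5% one, in line with the statement above.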

To make the $\gamma$ selection fully adaptive, we also compute the optimal $\gamma$ sequentially from the beginning to the end of the evaluation sample. That is, we set the predictive density for $y_{t+1}$ as the regularized barycenter with the value of $\gamma$ that maximizes the objective function defined in Equation (11), using only the information available from the beginning of the sample up to $t$. In this way, we do not use any future information when choosing the value of $\gamma$. Even in this case, the regularized Wasserstein barycenter performs better than the best individual forecaster and the quantile aggregation. The sum of the log predictive score is −93.09, and the mean difference in the log predictive score relative to the quantile aggregation is 0.11, with a HAC standard error of 0.06. This suggests that the regularized Wasserstein barycenter with the adaptively chosen (i.e., estimated online) $\gamma$ performs statistically better than its unregularized counterpart, the quantile aggregation, at the 10% significance level. This superior predictive performance of the regularized Wasserstein barycenter relative to the quantile aggregation remains unchanged even when we split the evaluation sample into two. The mean difference in the log predictive score is 0.13 and 0.09 for the first half and the second half of the evaluation sample, respectively.
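The sequential selection rule can be sketched on a grid of candidate values. The code below is an illustration under assumptions, not the authors' code: it takes a matrix of log predictive scores (one row per period, one column per candidate $\gamma$) and, for each period, picks the grid value with the highest cumulative score over earlier periods only. The simulated Gaussian scores, the grid, and the fallback used in the first period are hypothetical.

```python
import numpy as np

def adaptive_gamma_path(log_scores, gammas):
    """For each period, choose the gamma with the best cumulative past log score."""
    T = log_scores.shape[0]
    cumulative = np.cumsum(log_scores, axis=0)
    chosen = np.empty(T)
    chosen[0] = gammas[0]                      # no history yet: assumed fallback
    for t in range(1, T):
        # Only rows 0..t-1 are used, so no future information enters.
        chosen[t] = gammas[np.argmax(cumulative[t - 1])]
    return chosen

# Simulated example: forecast errors with variance 2 and forecaster variance 1.
rng = np.random.default_rng(0)
sigma2 = 1.0
gammas = np.array([0.0, 1.0, 2.0, 4.0])
errors = rng.normal(0.0, np.sqrt(2.0), size=5000)
variances = sigma2 + gammas / 2.0              # barycenter variance for each gamma
log_scores = (-0.5 * np.log(2.0 * np.pi * variances)[None, :]
              - 0.5 * errors[:, None] ** 2 / variances[None, :])
path = adaptive_gamma_path(log_scores, gammas)
```

With enough observations the selected value settles on the grid point whose implied variance matches the error variance, here $\gamma = 2$.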

6. Concluding Remarks

This paper proposes to use the entropy regularized Wasserstein barycenter to combine several probability and density forecasts. The entropy regularization smooths the resulting combined forecast, and it offers a flexible way to adjust the dispersion of the predictive density when it is needed. We study the effect of the regularization on the combined density forecast and provide an exact relationship between the strength of the regularization and the variance–covariance matrix of the combined density when input densities are Gaussian. We then provide a way to select the strength of regularization by choosing the regularized barycenter that most closely matches the data. We apply our proposed methodology to the U.S. inflation density forecasting and show how the entropy regularization can improve the quality of the density forecast relative to its unregularized counterpart.

In this article, we restrict the weight of each input density in the final combined density to be pre-determined (i.e., equal weighting). This choice was intentional, in order to focus on the role of entropy regularization. In practice, however, it is possible that a subset of input densities is superior to the others, and one may wish to put different weights on each input density. Alternatively, it may be desirable to include only a subset of the input densities in the combined density and set the other weights to zero; see, for example, [30]. For those cases, it would be fruitful to develop a data-dependent method that chooses both the regularization strength and those weights simultaneously, which is a topic for future research.

Acknowledgments

We thank Frank Diebold, Roger Koenker, Frank Schorfheide, R. Miyauchi Lee, and two anonymous referees for their insightful comments.

Appendix A

The authors of [17] provide the following fixed point theorem, which we will use in the proof of Theorem 1.

Lemma A1 (Ran and Reurings, 2004).

Let $T$ be a partially ordered set such that every pair $x, y \in T$ has a lower bound and an upper bound. Furthermore, let $d$ be a metric on $T$ such that $(T, d)$ is a complete metric space. If $F: T \to T$ is a continuous, monotone (i.e., either order-preserving or order-reversing) map from $T$ into $T$ such that

$$\exists\, c \in (0,1): \quad d(F(x), F(y)) < c\, d(x, y), \quad \forall\, x > y$$

and

$$\exists\, x_0 \in T: \quad F(x_0) > x_0 \ \text{ or } \ F(x_0) < x_0,$$

then $F$ has a unique fixed point, $x^\star \in T$. Also, for all $x \in T$,

$$\lim_{n \to \infty} F^n(x) = x^\star.$$

The following result follows from Lemma A1.

Lemma A2.

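Suppose $\lambda \in (0,1)$, $T \subset \mathbb{R}^{d \times d}$ is the set of symmetric matrices with all eigenvalues in the range $\left(-\frac{\gamma}{2\lambda}, \frac{\gamma}{2(1-\lambda)}\right)$, and $S_1, S_2 \in \mathbb{R}^{d \times d}$ are positive definite matrices. Then there is a unique $V^\star \in T$ such that $F(V^\star) = V^\star$, where

$$F(V) := S_2 - S_1 + S_1\left(S_1 + I\gamma/2 - V(1-\lambda)\right)^{-1}S_1 - S_2\left(S_2 + I\gamma/2 + V\lambda\right)^{-1}S_2.$$

Also, for any $V \in T$, $\lim_{n \to \infty} F^n(V) = V^\star$.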

Proof. 

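Suppose $A, B \in T$ and $A > B$. First we will establish that $F(\cdot)$ is order-preserving, which is equivalent to $F(A) > F(B)$. Note that,

$$\begin{aligned}
& S_1\left[\left(S_1 + I\gamma/2 - A(1-\lambda)\right)^{-1} - \left(S_1 + I\gamma/2 - B(1-\lambda)\right)^{-1}\right]S_1 > 0 \\
&\iff \left(S_1 + I\gamma/2 - A(1-\lambda)\right)^{-1} > \left(S_1 + I\gamma/2 - B(1-\lambda)\right)^{-1} \\
&\iff -A < -B \iff A > B.
\end{aligned}$$

Similar logic implies that for all such $A, B \in T$,

$$S_2\left[\left(S_2 + I\gamma/2 + B\lambda\right)^{-1} - \left(S_2 + I\gamma/2 + A\lambda\right)^{-1}\right]S_2 > 0 \iff A > B,$$

and since $F(A) - F(B)$ is the sum of both of these order-preserving functions, $F(\cdot)$ is also order-preserving.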

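Clearly our bounds on the eigenvalues imply that $F(V)$ is continuous for all $V \in T$. To show that $F$ is a mapping from $T$ into $T$, note that matrix symmetry is preserved over addition and inversion, so $F(V)$ is symmetric for all $V \in T$. Also, note that,

$$\begin{aligned}
& F(-I\gamma/(2\lambda)) = -S_1 + S_1\left(S_1 + I\gamma/(2\lambda)\right)^{-1}S_1 > -I\gamma/(2\lambda) \\
&\iff -S_1^{1/2}\left(I - \left(I + S_1^{-1}\gamma/(2\lambda)\right)^{-1}\right)S_1^{1/2} > -I\gamma/(2\lambda) \\
&\iff S_1^{1/2}\left(I - \left(I + S_1^{-1}\gamma/(2\lambda)\right)^{-1}\right)S_1^{1/2} < I\gamma/(2\lambda) \\
&\iff \left(I + S_1 2\lambda/\gamma\right)^{-1} < S_1^{-1}\gamma/(2\lambda) \\
&\iff I > 0.
\end{aligned}$$

Similar logic can be used to show that $F(I\gamma/(2(1-\lambda))) < I\gamma/(2(1-\lambda))$. This also implies the final requirement of Lemma A1.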

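The only remaining requirement of Lemma A1 is the penultimate one (the contraction condition), which we will establish for $A, B \in T$ such that $A > B$, using the metric $d(A, B) = \mathrm{Tr}(A - B)$. Also, let $\alpha := \{1, -1\}$, $\beta := \{\lambda - 1, \lambda\}$, and $\|C\|$ denote the spectral norm of $C \in \mathbb{R}^{d \times d}$. We will use the property $\mathrm{Tr}(CD) \leq \|C\|\,\mathrm{Tr}(D)$, where $C, D \in \mathbb{R}^{d \times d}$ and $C, D > 0$; see, for example, [17]. Note that,

$$\begin{aligned}
\mathrm{Tr}(F(A) - F(B)) &= \sum_i \alpha_i \mathrm{Tr}\left(S_i\left[\left(S_i + I\gamma/2 + A\beta_i\right)^{-1} - \left(S_i + I\gamma/2 + B\beta_i\right)^{-1}\right]S_i\right) \\
&= \sum_i \alpha_i\beta_i \mathrm{Tr}\left(S_i\left(S_i + I\gamma/2 + A\beta_i\right)^{-1}(B - A)\left(S_i + I\gamma/2 + B\beta_i\right)^{-1}S_i\right) \\
&= \sum_i \alpha_i\beta_i \mathrm{Tr}\left(\left(S_i + I\gamma/2 + B\beta_i\right)^{-1}S_i S_i\left(S_i + I\gamma/2 + A\beta_i\right)^{-1}(B - A)\right) \\
&\leq \sum_i \alpha_i\beta_i \left\|\left(S_i + I\gamma/2 + B\beta_i\right)^{-1}S_i S_i\left(S_i + I\gamma/2 + A\beta_i\right)^{-1}\right\| \mathrm{Tr}(B - A) \\
&< c\,\mathrm{Tr}(B - A)\sum_i \alpha_i\beta_i = c\,\mathrm{Tr}(A - B),
\end{aligned}$$

where $c \in (0,1)$. The second inequality follows from the matrix $S_i\left(S_i + I\gamma/2 + A\beta_i\right)^{-1}$ (respectively, $S_i\left(S_i + I\gamma/2 + B\beta_i\right)^{-1}$) being similar to a symmetric matrix, and with eigenvalues contained in $(0,1)$ because $A \in T$ ($B \in T$) implies $I\gamma/2 + A\beta_i > 0$ ($I\gamma/2 + B\beta_i > 0$). □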
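The conditions verified in this proof can also be checked numerically in the univariate case. The following is a small sketch, assuming $d = 1$ so that $S_1$, $S_2$ and $V$ are scalars; the function names and the parameter values are illustrative choices, not taken from the paper. In this scalar setting the trace metric reduces to an ordinary difference, so a finite-difference slope strictly between 0 and 1 corresponds to the order-preserving and contraction properties.

```python
def F(v, s1, s2, lam, gamma):
    """Scalar (d = 1) version of the map F from Lemma A2."""
    return (s2 - s1
            + s1 * s1 / (s1 + gamma / 2 - v * (1 - lam))
            - s2 * s2 / (s2 + gamma / 2 + v * lam))


def check_lemma_conditions(s1, s2, lam, gamma, n_grid=200):
    """Evaluate F on an interior grid of T and report three diagnostics."""
    lower = -gamma / (2 * lam)
    upper = gamma / (2 * (1 - lam))
    grid = [lower + (upper - lower) * k / n_grid for k in range(1, n_grid)]
    values = [F(v, s1, s2, lam, gamma) for v in grid]
    pairs = list(zip(grid, values))
    # Finite-difference slopes between neighbouring grid points.
    slopes = [(f_b - f_a) / (v_b - v_a)
              for (v_a, f_a), (v_b, f_b) in zip(pairs, pairs[1:])]
    return {
        "order_preserving": all(s > 0 for s in slopes),
        "maps_into_T": all(lower < f < upper for f in values),
        "max_slope": max(slopes),  # below 1 indicates a contraction
    }


report = check_lemma_conditions(s1=1.0, s2=4.0, lam=0.5, gamma=1.0)
```

A grid check of this kind is only a diagnostic for one parameter configuration and does not replace the proof.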

Next we will establish Theorem 1, which is restated below. This is a slightly more general version of the theorem in the main text where the objective function in Equation (10) is a weighted average of $W_\gamma^2(p_1, q)$ and $W_\gamma^2(p_2, q)$.

Theorem A1.

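Let $\lambda \in (0,1)$ and $p_1$ and $p_2$ be Gaussian density functions with means $\mu_1, \mu_2 \in \mathbb{R}^d$, and variance matrices $S_1, S_2 \in \mathbb{R}^{d \times d}$. The regularized Wasserstein barycenter between $p_1$ and $p_2$ is given by the density function of $N(\mu_B, S_B)$, where $\mu_B \in \mathbb{R}^d$ and $S_B \in \mathbb{R}^{d \times d}$ are defined by,

$$\begin{aligned}
\mu_B &:= \lambda\mu_1 + (1-\lambda)\mu_2 \\
S_B &:= \left(V\,2\lambda/\gamma + I\right)^{-1}\left(V\lambda + I\gamma/2 + S_2\right)\left(V\,2\lambda/\gamma + I\right)^{-1} \\
&= \left(V\,2(\lambda-1)/\gamma + I\right)^{-1}\left(V(\lambda-1) + I\gamma/2 + S_1\right)\left(V\,2(\lambda-1)/\gamma + I\right)^{-1},
\end{aligned}$$

where $V \in \mathbb{R}^{d \times d}$ is the unique symmetric matrix that satisfies these equalities and $-I\gamma/(2\lambda) < V < I\gamma/(2(1-\lambda))$.

Also, the iterates of the following series converge to $V$ when $V^{(0)} := 0_{d \times d}$,

$$V^{(k+1)} = S_2 - S_1 + S_1\left(S_1 + I\gamma/2 - V^{(k)}(1-\lambda)\right)^{-1}S_1 - S_2\left(S_2 + I\gamma/2 + V^{(k)}\lambda\right)^{-1}S_2.$$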
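For intuition, the iteration in Theorem A1 can be written out in the univariate case. The sketch below assumes $d = 1$, so every matrix is a scalar variance; the function name `gaussian_barycenter_1d`, the tolerance, and the iteration cap are illustrative choices rather than part of the paper. It returns both expressions for $S_B$ so that their agreement at the fixed point can be inspected.

```python
def gaussian_barycenter_1d(mu1, s1, mu2, s2, lam, gamma,
                           tol=1e-13, max_iter=100_000):
    """Regularized barycenter of N(mu1, s1) and N(mu2, s2) for d = 1.

    s1 and s2 are variances. Returns (mu_b, s_b, s_b_alt, v), where s_b and
    s_b_alt are the two expressions for the barycenter variance.
    """
    v = 0.0  # V^(0) := 0
    for _ in range(max_iter):
        v_next = (s2 - s1
                  + s1 * s1 / (s1 + gamma / 2 - v * (1 - lam))
                  - s2 * s2 / (s2 + gamma / 2 + v * lam))
        converged = abs(v_next - v) < tol
        v = v_next
        if converged:
            break
    mu_b = lam * mu1 + (1 - lam) * mu2
    # First expression for S_B (written in terms of S_2).
    s_b = (v * lam + gamma / 2 + s2) / (v * 2 * lam / gamma + 1) ** 2
    # Second expression for S_B (written in terms of S_1).
    s_b_alt = ((v * (lam - 1) + gamma / 2 + s1)
               / (v * 2 * (lam - 1) / gamma + 1) ** 2)
    return mu_b, s_b, s_b_alt, v


mu_b, s_b, s_b_alt, v = gaussian_barycenter_1d(
    mu1=0.0, s1=1.0, mu2=2.0, s2=4.0, lam=0.5, gamma=1.0)
```

For small γ the scalar map has a slope close to one near the fixed point, so the iteration may need many more steps than in this example.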

Proof. 

Let $\phi: \mathbb{R}^d \to \mathbb{R}$ be defined as $\phi(z) := \exp(-\|z\|_2^2/\gamma)$, and, for a given function $f: \mathbb{R}^d \to \mathbb{R}$, we will denote the convolution of $f(z)$ and $\phi(z)$ as $f(z) * \phi(z) := \int_{\mathbb{R}^d} f(t)\phi(z - t)\,dt$. When there is little risk of confusion, we will omit the input $z \in \mathbb{R}^d$ of functions supported on $\mathbb{R}^d$ in the remainder of the proof.

We will characterize the barycenter using the fact that it is the minimizer of the following optimization problem.

$$\min_q \; \lambda W_\gamma^2(q, p_1) + (1-\lambda)W_\gamma^2(q, p_2). \tag{A1}$$

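To do so, note that the optimal coupling corresponding to $W_\gamma^2(q, p_i)$ can be defined by instead solving the dual of (8), which is

$$(w_i, u_i) = \arg\max_{w_i, u_i} \; E_{p_i}(\log(w_i)) + E_q(\log(u_i)) - \gamma\int_{\mathbb{R}^d \times \mathbb{R}^d} w_i(z_1)u_i(z_2)\exp\left(-\|z_1 - z_2\|_2^2/\gamma\right)dz_1\,dz_2, \tag{A2}$$

and the optimal coupling can be defined in terms of the dual variables as $d\varphi_i(z_1, z_2)/d\lambda = u_i(z_1)\phi(z_1)\phi(z_2)w_i(z_2)$. The first order conditions of (A2) are

$$p_i = w_i\,(u_i * \phi) \tag{A3}$$

$$q = u_i\,(w_i * \phi). \tag{A4}$$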

Also, since the objective function of (A2) is differentiable, an application of the envelope theorem implies

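$$\frac{\delta W_\gamma^2(q, p_i)}{\delta q} = \log(u_i).$$

Thus, the optimum of (A1) can be characterized by the following functional derivative being zero.

$$\frac{\delta}{\delta q}\left(\lambda W_\gamma^2(q, p_1) + (1-\lambda)W_\gamma^2(q, p_2)\right) = 0 \implies \lambda\log(u_1) + (1-\lambda)\log(u_2) = 0.$$

After combining this equality with (A3) and (A4), we have that the barycenter can be characterized by the system

$$p_1 = w_1\,(u_1 * \phi_{\gamma/2}), \quad p_2 = w_2\,(u_2 * \phi_{\gamma/2}), \quad q = u_1\,(w_1 * \phi_{\gamma/2}) = u_2\,(w_2 * \phi_{\gamma/2}), \quad \text{and} \quad 1 = u_1^{\lambda}u_2^{1-\lambda}.$$

This system can be reduced to two equalities after noting that $p_i = w_i\,(u_i * \phi_{\gamma/2})$ and $q = u_i\,(w_i * \phi_{\gamma/2})$ imply

$$q = u_i\left(\frac{p_i}{u_i * \phi_{\gamma/2}} * \phi_{\gamma/2}\right).$$

After combining both equalities, and noting $u_1 = u_2^{(\lambda-1)/\lambda}$, we have

$$q = u_2^{(\lambda-1)/\lambda}\left(\frac{p_1}{u_2^{(\lambda-1)/\lambda} * \phi_{\gamma/2}} * \phi_{\gamma/2}\right) = u_2\left(\frac{p_2}{u_2 * \phi_{\gamma/2}} * \phi_{\gamma/2}\right). \tag{A5}$$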

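Let $G$ be defined as the set of functions $g: \mathbb{R}^d \to \mathbb{R}_{+}^1$ of the form

$$g(z) = a\exp\left(-(z - \mu_g)^\top V_g^{-1}(z - \mu_g)/2\right),$$

where $\mu_g \in \mathbb{R}^d$, $V_g \in \mathbb{R}^{d \times d}$ is a symmetric and invertible matrix, and $a \in \mathbb{R}_{++}^1$. It will also be convenient to let $C: G \to \mathbb{R}^{d \times d}$ be defined so that $C(g) = V_g$ and $M: G \to \mathbb{R}^d$ be defined so that $M(g) = \mu_g$. It is well known that if $g, h \in G$ are Gaussian density functions, then $g^b, cg, gh, g * h \in G$, where $b, c \in \mathbb{R}^1$ and $b \neq 0$, and it is also straightforward to show

$$C(g^b) = V_g/b, \quad C(cg) = V_g, \quad C(gh) = \left(V_g^{-1} + V_h^{-1}\right)^{-1}, \quad \text{and} \quad C(g * h) = V_g + V_h.$$

Likewise, in the case of $M(\cdot)$, we will also use the properties

$$M(g^b) = \mu_g, \quad M(cg) = \mu_g, \quad M(gh) = C(gh)\left(V_g^{-1}\mu_g + V_h^{-1}\mu_h\right), \quad \text{and} \quad M(g * h) = \mu_g + \mu_h.$$

Note that $V_g^{-1} + V_h^{-1} > 0$ is the necessary and sufficient condition for $g * h$ to be well defined, and it is straightforward to verify that the properties above also hold over all pairs of $g, h \in G$ when this is the case; for the case of normal density functions, see, for example, [31].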
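The maps $C(\cdot)$ and $M(\cdot)$ can be illustrated in the scalar case. The sketch below assumes $d = 1$ and represents an element of $G$ only by its pair (mean, variance parameter); the helper names are hypothetical. Because $V_g$ only needs to be invertible, a quotient $g/h$ can be handled as the product of $g$ with $h^{-1}$, whose variance parameter is $-V_h$.

```python
import math


def gauss(z, mu, v, a=1.0):
    """Evaluate g(z) = a * exp(-(z - mu)^2 / (2 v)) for scalar z."""
    return a * math.exp(-(z - mu) ** 2 / (2 * v))


def power_params(mu_g, v_g, b):
    """Mean and variance parameter of g^b: M(g^b) = mu_g, C(g^b) = V_g / b."""
    return mu_g, v_g / b


def product_params(mu_g, v_g, mu_h, v_h):
    """Mean and variance parameter of the pointwise product g h."""
    v = 1.0 / (1.0 / v_g + 1.0 / v_h)
    mu = v * (mu_g / v_g + mu_h / v_h)
    return mu, v


def convolution_params(mu_g, v_g, mu_h, v_h):
    """Mean and variance parameter of the convolution g * h."""
    return mu_g + mu_h, v_g + v_h


# The product g h should be proportional to a single element of G,
# so this ratio should not depend on the evaluation point z.
mu_gh, v_gh = product_params(1.0, 2.0, -1.0, 0.5)
ratios = [gauss(z, 1.0, 2.0) * gauss(z, -1.0, 0.5) / gauss(z, mu_gh, v_gh)
          for z in (-1.0, 0.0, 0.7, 2.0)]
```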

Next, we will suppose that $u_2$ is in $G$, which, due to (A5), also implies $q, u_1, w_1, w_2 \in G$, and then show that there exists a unique $u_2 \in G$ that satisfies (A5). Since (A1) is a strictly convex optimization problem, when a solution to (A1) exists, it can be characterized uniquely by its first-order conditions. Note that, for any pair $u_i, w_i$ that solves (A1), we have that $u_i a, w_i/a$, where $a \in \mathbb{R}_{++}^1$, are also solutions. We avoid complications from this issue by placing the additional restriction on these dual variables that $w_i(0) = 1$, as this ensures strict convexity over this set of dual functions. To see that this is also without loss of generality, note that rescaling the dual variables by $u_i a, w_i/a$ would not impact the objective function in (A2) because $\int_{\mathbb{R}^d} q(z)\,dz = \int_{\mathbb{R}^d} p_i(z)\,dz = 1$. Also, $a$ would not impact the first order conditions (A3) and (A4), so it would also not have an impact on $q$. Thus, after providing $u_2 \in G$ that solves (A5), we will have also shown that this solution is unique even when not restricted to $G$.

Since $\phi$, $p_1$, and $p_2$ are elements of $G$, and $G$ is closed under multiplication, division, convolution, and exponentiation to the (non-zero) power of $(\lambda-1)/\lambda$, if $u_2 \in G$ then the functions on both sides of the equality (A5) will also be elements of $G$. Let $U_i := C(u_i)$ and $\mu_u := M(u_2)$. As noted above, the convolutions in (A5) are only well defined if the following matrix inequalities hold, so we will also require the solution to satisfy these inequalities.

$$I\,2/\gamma + U_i^{-1} > 0 \quad \text{and} \quad I\,2/\gamma + U_i^{-1}(\lambda-1)/\lambda > 0,$$

which hold if and only if

$$-(2/\gamma)I < U_2^{-1} < 2\lambda/(\gamma(1-\lambda))\,I. \tag{A6}$$

It is straightforward to verify that these inequalities are identical to the ones that ensure the optimal coupling is integrable, as this coupling is given by $d\varphi_i(z_1, z_2)/d\lambda = u_i(z_1)\phi(z_1)\phi(z_2)w_i(z_2)$. Thus, Fubini's theorem implies that they are also sufficient conditions for $q$ to be integrable.

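We can find $S_B$ by applying $C(\cdot)$ to (A5), which implies

$$S_B^{-1} = U_2^{-1} + \left(\left(S_2^{-1} - \left(U_2 + I\gamma/2\right)^{-1}\right)^{-1} + I\gamma/2\right)^{-1} \tag{A7}$$

$$\phantom{S_B^{-1}} = U_2^{-1}(\lambda-1)/\lambda + \left(\left(S_1^{-1} - \left(U_2\lambda/(\lambda-1) + I\gamma/2\right)^{-1}\right)^{-1} + I\gamma/2\right)^{-1}. \tag{A8}$$

Let $b_i \in \{\lambda/(\lambda-1), 1\}$, that is, $b_1 = \lambda/(\lambda-1)$ and $b_2 = 1$. After three applications of the matrix inversion lemma and simplifying we have that, for each $i \in \{1, 2\}$,

$$\begin{aligned}
S_B^{-1} - U_2^{-1}/b_i &= \left(\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1} + I\gamma/2\right)^{-1} \\
&= \left(\left(S_i^{-1} - I\,2/\gamma + 4/\gamma^2\left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\right)^{-1} + I\gamma/2\right)^{-1} \\
&= I\,2/\gamma - 4/\gamma^2\left(S_i^{-1} + 4/\gamma^2\left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\right)^{-1} \\
&= I\,2/\gamma - 4/\gamma^2 S_i + 4/\gamma^2 S_i\left(\gamma^2/4\,U_2^{-1}/b_i + I\gamma/2 + S_i\right)^{-1}S_i.
\end{aligned} \tag{A9}$$

This, along with Equations (A7) and (A8), implies that $U_2$ can be characterized by

$$\gamma^2/4\,U_2^{-1} - S_2 + S_2\left(\gamma^2/4\,U_2^{-1} + I\gamma/2 + S_2\right)^{-1}S_2 = \gamma^2/4\,U_2^{-1}(\lambda-1)/\lambda - S_1 + S_1\left(\gamma^2/4\,U_2^{-1}(\lambda-1)/\lambda + I\gamma/2 + S_1\right)^{-1}S_1.$$

After defining $V$ as $\gamma^2/(4\lambda)\,U_2^{-1}$, this implies

$$V = S_2 - S_1 + S_1\left(S_1 + I\gamma/2 - V(1-\lambda)\right)^{-1}S_1 - S_2\left(S_2 + I\gamma/2 + V\lambda\right)^{-1}S_2.$$

Note that our requirement that $U_2^{-1}$ satisfy (A6) can be written in terms of $V$ as $-\gamma/(2\lambda)\,I < V < \gamma/(2(1-\lambda))\,I$, and Lemma A2 implies that there is a unique solution that satisfies these conditions.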

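The functional form for $S_B$ from the statement of this theorem follows from an alternative ordering of the matrix inversion lemma. Specifically, starting from (A9),

$$\begin{aligned}
S_B^{-1} - U_2^{-1}/b_i &= I\,2/\gamma - 4/\gamma^2\left(S_i^{-1} + 4/\gamma^2\left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\right)^{-1} \\
&= -U_2^{-1}/b_i + 4/\gamma^2\left(\gamma^2/4\,U_2^{-1}/b_i + I\gamma/2\right)\left(\gamma^2/4\,U_2^{-1}/b_i + I\gamma/2 + S_i\right)^{-1} \\
&\qquad \times \left(\gamma^2/4\,U_2^{-1}/b_i + I\gamma/2\right) \\
&= -U_2^{-1}/b_i + \left(2\lambda/(\gamma b_i)\,V + I\right)\left(\lambda/b_i\,V + \gamma/2\,I + S_i\right)^{-1}\left(2\lambda/(\gamma b_i)\,V + I\right).
\end{aligned}$$

Thus, after adding $U_2^{-1}/b_i$ to both sides and inverting,

$$\begin{aligned}
S_B &= \left(V\,2\lambda/\gamma + I\right)^{-1}\left(S_2 + V\lambda + I\gamma/2\right)\left(V\,2\lambda/\gamma + I\right)^{-1} \\
&= \left(V\,2(\lambda-1)/\gamma + I\right)^{-1}\left(S_1 + V(\lambda-1) + I\gamma/2\right)\left(V\,2(\lambda-1)/\gamma + I\right)^{-1}.
\end{aligned}$$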
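As a quick consistency check on these expressions (a worked special case, not part of the original proof), suppose the two inputs share the same variance matrix, $S_1 = S_2 = S$. Then $V = 0$ solves the fixed-point equation for $V$, and since $0$ lies in the admissible range for $V$ it is the unique solution. Both expressions for $S_B$ then reduce to $S + I\gamma/2$:

```latex
\begin{aligned}
F(0) &= S - S + S\,(S + I\gamma/2)^{-1}S - S\,(S + I\gamma/2)^{-1}S = 0, \\
S_B  &= (0 + I)^{-1}\,(0 + I\gamma/2 + S)\,(0 + I)^{-1} = S + I\gamma/2 .
\end{aligned}
```

In this special case the regularization adds $\gamma/2$ to every variance, which is in line with the smoothing role of the entropy term.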

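After applying $M(\cdot)$ to both sides of (A5), and writing $u_i = u_2^{1/b_i}$, we have

$$M\left(u_2^{1/b_i}\left(\frac{p_i}{u_2^{1/b_i} * \phi_{\gamma/2}} * \phi_{\gamma/2}\right)\right) = S_B\Big(U_2^{-1}\mu_u/b_i + \left(S_B^{-1} - U_2^{-1}/b_i\right)\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1}\left(S_i^{-1}\mu_i - \left(U_2 b_i + I\gamma/2\right)^{-1}\mu_u\right)\Big). \tag{A10}$$

To simplify this expression, we will first establish three intermediate equalities. First, Equations (A7) and (A8) imply

$$\begin{aligned}
& S_B^{-1} - U_2^{-1}/b_i = \left(\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1} + I\gamma/2\right)^{-1} \\
&\implies \left(S_B^{-1} - U_2^{-1}/b_i\right)\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1}\left(U_2 b_i + I\gamma/2\right)^{-1} \\
&\qquad = \left(U_2 b_i + \gamma/2\left(U_2 b_i + I\gamma/2\right)S_i^{-1}\right)^{-1} \\
&\qquad = \left(I + \gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}U_2^{-1}/b_i.
\end{aligned} \tag{A11}$$

Second, (A11) in turn implies

$$\left(S_B^{-1} - U_2^{-1}/b_i\right)\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1}S_i^{-1} = \left(I + \gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}. \tag{A12}$$

Third, after an application of the matrix inverse identity to (A7) and (A8),

$$S_B^{-1} = U_2^{-1}/b_i + \left(\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1} + I\gamma/2\right)^{-1} \tag{A13}$$

$$\phantom{S_B^{-1}} = U_2^{-1}/b_i + I\,2/\gamma - I\,4/\gamma^2\left(I\,2/\gamma + S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1}, \tag{A14}$$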

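which implies

$$\begin{aligned}
S_B &= \left(U_2^{-1}/b_i + I\,2/\gamma - 4/\gamma^2\left(\left(U_2 b_i + I\gamma/2\right)\left(I\,2/\gamma + S_i^{-1}\right) - I\right)^{-1}\left(U_2 b_i + I\gamma/2\right)\right)^{-1} \\
&= \left(U_2^{-1}/b_i + I\,2/\gamma - 4/\gamma^2\left(I\,2/\gamma + \left(I + U_2^{-1}\gamma/(2b_i)\right)S_i^{-1}\right)^{-1}U_2^{-1}/b_i\left(U_2 b_i + I\gamma/2\right)\right)^{-1}.
\end{aligned}$$

Thus,

$$S_B = \left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\left(I - \left(I + \gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}\right)^{-1}. \tag{A15}$$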

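We will start with the coefficient on $\mu_u$ in (A10). The equalities (A11) and (A15) imply that this term is equal to

$$\begin{aligned}
& S_B\left(U_2^{-1}/b_i - \left(S_B^{-1} - U_2^{-1}/b_i\right)\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1}\left(U_2 b_i + I\gamma/2\right)^{-1}\right)\mu_u \\
&= \left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\left(I - \left(I + \gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}\right)^{-1} \\
&\qquad \times \left(I - \left(I + \gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}\right)U_2^{-1}/b_i\,\mu_u \\
&= \left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}U_2^{-1}/b_i\,\mu_u \\
&= \left(I + U_2\,2b_i/\gamma\right)^{-1}\mu_u.
\end{aligned}$$

The equalities (A12) and (A15) imply that the coefficient on $\mu_i$ in (A10) can be written as

$$\begin{aligned}
& S_B\left(S_B^{-1} - U_2^{-1}/b_i\right)\left(S_i^{-1} - \left(U_2 b_i + I\gamma/2\right)^{-1}\right)^{-1}S_i^{-1}\mu_i \\
&= \left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\left(I - \left(I + S_i^{-1}\gamma/2 + U_2^{-1}S_i^{-1}/b_i\,\gamma^2/4\right)^{-1}\right)^{-1} \\
&\qquad \times \left(I + \gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\mu_i \\
&= \left(U_2^{-1}/b_i + I\,2/\gamma\right)^{-1}\left(\gamma/2\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\right)^{-1}\left(I + \gamma/(2b_i)\,U_2^{-1}\right)S_i^{-1}\mu_i \\
&= \left(U_2^{-1}\gamma/(2b_i) + I\right)^{-1}\mu_i.
\end{aligned}$$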

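After combining these terms, we can define (A10) as the solution to

$$\begin{aligned}
& \mu_q = \left(I + U_2\,2b_i/\gamma\right)^{-1}\mu_u + \left(U_2^{-1}\gamma/(2b_i) + I\right)^{-1}\mu_i \\
&\implies \left(I + U_2\,2b_i/\gamma\right)\left(\mu_q - \left(U_2^{-1}\gamma/(2b_i) + I\right)^{-1}\mu_i\right) = \mu_u \\
&\implies \left(I + U_2\,2b_1/\gamma\right)\left(\mu_q - \left(U_2^{-1}\gamma/(2b_1) + I\right)^{-1}\mu_1\right) = \left(I + U_2\,2/\gamma\right)\left(\mu_q - \left(U_2^{-1}\gamma/2 + I\right)^{-1}\mu_2\right).
\end{aligned}$$

Since the matrix inverse identity also implies

$$\left(U_2^{-1}\gamma/(2b_i) + I\right)^{-1} = I - \left(U_2\,2b_i/\gamma + I\right)^{-1},$$

we have

$$\begin{aligned}
& \left(I + U_2\,2/\gamma\right)\mu_q - U_2\,2/\gamma\,\mu_2 = \left(I + U_2\,2b_1/\gamma\right)\mu_q - U_2\,2b_1/\gamma\,\mu_1 \\
&\implies (1 - b_1)\mu_q = \mu_2 - b_1\mu_1 \\
&\implies \left(1 + \lambda/(1-\lambda)\right)\mu_q = \mu_2 + \lambda/(1-\lambda)\,\mu_1 \\
&\implies \mu_q = \mu_2(1-\lambda) + \lambda\mu_1.
\end{aligned}$$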

 □

Author Contributions

The authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Disclaimer

The views expressed in these papers are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia, the Federal Reserve System, or the Census Bureau. Any errors or omissions are the responsibility of the authors. There are no sensitive data in this paper.

References

  • 1. Geweke J., Amisano G. Optimal prediction pools. J. Econom. 2011;164:130–141. doi: 10.1016/j.jeconom.2011.02.017.
  • 2. Lichtendahl K.C., Grushka-Cockayne Y., Winkler R. Is it better to average probabilities or quantiles. Manag. Sci. 2013;59:1594–1611. doi: 10.1287/mnsc.1120.1667.
  • 3. Busetti F. Quantile aggregation of density forecasts. Oxf. Bullet. Econom. Stat. 2017;79:495–512. doi: 10.1111/obes.12163.
  • 4. Benamou J., Carlier G., Cuturi M., Nenna L., Peyre G. Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 2015;37:1111–1138. doi: 10.1137/141000439.
  • 5. Janati H., Cuturi M., Gramfort A. Debiased Sinkhorn barycenters. arXiv. 2020. arXiv:2006.02575.
  • 6. Timmermann A. Handbook of Economic Forecasting. Volume 1. Elsevier; Amsterdam, The Netherlands: 2006. Forecast combinations; pp. 135–196.
  • 7. Clemen R. Combining forecasts: A review and annotated bibliography. Int. J. Forecast. 1989;5:559–583. doi: 10.1016/0169-2070(89)90012-5.
  • 8. Villani C. Topics in Optimal Transportation. Volume 58. American Mathematical Soc.; Providence, RI, USA: 2003.
  • 9. Galichon A. Optimal Transport Methods in Economics. Princeton University Press; Princeton, NJ, USA: 2018.
  • 10. Ratcliff R. Group reaction time distributions and an analysis of distribution statistics. Psychol. Bullet. 1979;86:446–461. doi: 10.1037/0033-2909.86.3.446.
  • 11. Genest C. Vincentization revisited. Ann. Stat. 1992;20:1137–1142. doi: 10.1214/aos/1176348676.
  • 12. Agueh M., Carlier G. Barycenters in the Wasserstein space. SIAM J. Math. Anal. 2011;43:904–924. doi: 10.1137/100805741.
  • 13. Knott M., Smith C.S. On a generalization of cyclic monotonicity and distances among random vectors. Linear Algebra Appl. 1994;199:363–371. doi: 10.1016/0024-3795(94)90359-X.
  • 14. Cuturi M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport; Proceedings of the 27th Annual Conference on Neural Information Processing Systems; Lake Tahoe, NV, USA. 5–10 December 2013.
  • 15. Peyré G., Cuturi M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 2019;11:355–607. doi: 10.1561/2200000073.
  • 16. Sinkhorn R. Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 1967;74:402–405. doi: 10.2307/2314570.
  • 17. Ran A.C., Reurings M.C. A fixed point theorem in partially ordered sets and some applications to matrix equations. Proc. Am. Math. Soc. 2004;132:1435–1443. doi: 10.1090/S0002-9939-03-07220-4.
  • 18. Cuturi M., Doucet A. Fast computation of Wasserstein barycenters; Proceedings of the 31st International Conference on Machine Learning; Beijing, China. 21–26 June 2014.
  • 19. Solomon J., De Goes F., Peyré G., Cuturi M., Butscher A., Nguyen A., Du T., Guibas L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 2015;34:66. doi: 10.1145/2766963.
  • 20. Dvurechensky P., Dvinskikh D., Gasnikov A., Uribe C., Nedic A. Decentralize and randomize: Faster algorithm for Wasserstein barycenters; Proceedings of the Annual Conference on Neural Information Processing Systems 2018; Montreal, QC, Canada. 3–8 December 2018.
  • 21. Kroshnin A., Tupitsa N., Dvinskikh D., Dvurechensky P., Gasnikov A., Uribe C. On the complexity of approximating Wasserstein barycenters; Proceedings of the 36th International Conference on Machine Learning; Long Beach, CA, USA. 10–15 June 2019.
  • 22. Lin T., Ho N., Chen X., Cuturi M., Jordan M.I. Fixed-support Wasserstein barycenters: Computational hardness and fast algorithm. arXiv. 2020. arXiv:2002.04783v4.
  • 23. Lin T., Ho N., Cuturi M., Jordan M.I. On the complexity of approximating multimarginal optimal transport. arXiv. 2020. arXiv:1910.00152v2.
  • 24. Janati H., Muzellec B., Peyré G., Cuturi M. Entropic optimal transport between (unbalanced) Gaussian measures has a closed form. arXiv. 2020. arXiv:2006.02572.
  • 25. Mallasto A., Gerolin A., Minh H.Q. Entropy-regularized 2-Wasserstein distance between Gaussian measures. arXiv. 2020. arXiv:2006.03416.
  • 26. Del Negro M., Hasegawa R., Schorfheide F. Dynamic prediction pools: An investigation of financial frictions and forecasting performance. J. Econom. 2016;192:391–405. doi: 10.1016/j.jeconom.2016.02.006.
  • 27. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. doi: 10.2307/1912526.
  • 28. Bollerslev T., Wooldridge J.M. Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econom. Rev. 1992;11:143–172. doi: 10.1080/07474939208800229.
  • 29. McCracken M., Ng S. FRED-QD: A Quarterly Database for Macroeconomic Research. National Bureau of Economic Research; Cambridge, MA, USA: 2020. Working Paper No. 26872.
  • 30. Diebold F.X., Shin M. Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives. Int. J. Forecast. 2019;35:1679–1691. doi: 10.1016/j.ijforecast.2018.09.006.
  • 31. Bromiley P. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo. 2003;3:1.

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
