Abstract
We propose probability and density forecast combination methods that are defined via the entropy regularized Wasserstein distance. First, we provide a theoretical characterization of the combined density forecast based on the regularized Wasserstein distance under a Gaussian assumption. More specifically, we show that the regularized Wasserstein barycenter between multivariate Gaussian input densities is multivariate Gaussian, and provide a simple way to compute its mean and variance–covariance matrix. Second, we show how this type of regularization can improve the predictive power of the resulting combined density. Third, we provide a method for choosing the tuning parameter that governs the strength of the regularization. Lastly, we apply our proposed method to density forecasting of the U.S. inflation rate, and illustrate how entropy regularization can improve the quality of the predictive density relative to its unregularized counterpart.
Keywords: entropy regularization, Wasserstein distance, optimal transport, density forecasting, forecast combination, model combination, quantile aggregation
1. Introduction
In this paper, we study a class of density forecast combination methods based on a Wasserstein metric. In the univariate case, an equally weighted centroid defined by a Wasserstein metric corresponds to quantile averaging, or a Vincentized center, in which the quantiles of the forecast densities are averaged. The resulting combined density tends to be narrower than that of the linear opinion rule [1,2,3], which may or may not be desirable, depending on the context.
We propose to use the entropy regularized Wasserstein metric to construct a combined density forecast. Like its unregularized counterpart, this combined probability/density forecast can be defined by an optimization problem, but the objective in this case includes an additional regularization term that penalizes densities with low entropy, which ensures that the combined density forecast is smooth. One advantage of this approach is that the entropy regularized Wasserstein barycenter can be computed much more efficiently than its unregularized counterpart when the input densities are multi-dimensional [4].
While computational efficiency is the most commonly cited reason for using entropy regularization, this paper demonstrates that regularization has an additional advantage in the density combination problem: it provides a way to tune the degree of dispersion of the combined density forecast. To the best of our knowledge, this regularized metric has not been explored in the context of the density forecast combination problem.
As a part of our discussion, we provide a theoretical characterization of the regularized Wasserstein distance under the Gaussian assumption. More specifically, we show that the regularized Wasserstein barycenter between two multivariate Gaussian inputs is multivariate Gaussian. Our proof complements Theorem 1 of [5], which characterizes the regularized Wasserstein barycenter among an arbitrary number of univariate normal densities. In addition, our result also provides a simple recursive equation that is guaranteed to converge to the variance–covariance matrix.
We proceed as follows. Section 2 formulates a density forecast combination problem with a general metric. Several existing aggregation methods in the literature can be formulated with the choice of a specific metric within this unified framework. After discussing these existing approaches, we introduce our proposal of using the entropy regularized Wasserstein barycenter. Section 3 provides theoretical results that describe the impact of entropy regularization on the combined density under a Gaussian assumption and discusses how this helps improve the quality of the combined density prediction. Section 4 discusses how to set the strength of the entropy regularization in practice and shows that our proposed selection rule achieves a certain notion of optimality. Section 5 provides an empirical exercise that illustrates how entropy regularization improves the quality of density prediction of the U.S. inflation rate relative to the unregularized combined density forecast. Section 6 concludes the article.
2. Regularized Wasserstein Barycenter for Density Forecast Combination
This section introduces the density combination problem; see, for example, [6]. We assume that each of N agents provides, at each point in time, a forecast of the density (equivalently, the distribution function) of the target random variable. We are interested in aggregating the information contained in the N agents’ forecasts to generate a better predictive distribution.
Throughout the paper, we shall focus on density combinations that can be viewed as a type of average over probability densities. Specifically, those that can be defined as
| (1) |
where the first argument of the minimization is a measure of the discrepancy between two densities. When this discrepancy measure satisfies the usual properties of a distance metric, which is the case when it is the Euclidean or an unregularized Wasserstein metric, the minimizer is known as a Fréchet mean, a generalization of the average of real numbers. We will refer to the minimizer as a barycenter to also encompass the more general case in which the discrepancy measure is not a metric. As described in Equation (1), we restrict our attention to the case in which the combination is a density forecast with each input density receiving equal weight, which is known to perform quite well as a combination forecast [7].
A specific choice of metric will lead to a different combined density. Before introducing our proposed entropy regularized Wasserstein metric, the next two sections introduce choices of discrepancy measure that lead to well-known density forecast combination methods.
2.1. Equal-Weighted Linear Opinion Rule
As a starting point, let us consider the squared Euclidean distance between densities. Then, Equation (1) becomes
| (2) |
which results in the following solution
| (3) |
This can be derived from the first-order condition of the minimization with respect to the combined density.
This solution is known as the linear opinion rule with equal weighting. It is the prototypical aggregation method both in the forecasting literature and in practice; see, for example, [1]. It is a particularly tractable density combination method, as it is simply a mixture density and is therefore cheap to evaluate. However, one disadvantage is that it does not preserve the shape of the individual forecast densities. For example, when combining two uni-modal densities with sufficiently different means, the resulting mixture is bi-modal.
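To make the mixture interpretation concrete, the following sketch (our illustration, not from the original analysis; the grid, means, and variances are arbitrary choices) evaluates the equal-weighted linear pool of two Gaussian forecasts and checks that it integrates to one while being bi-modal:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-6.0, 6.0, 1201)
f1 = normal_pdf(x, -2.0, 1.0)       # forecaster 1: N(-2, 1)
f2 = normal_pdf(x, 2.0, 1.0)        # forecaster 2: N(2, 1)
pool = 0.5 * f1 + 0.5 * f2          # equal-weighted linear opinion rule

mass = np.sum(pool) * (x[1] - x[0])          # integrates to approximately one
dip = pool[x.size // 2]                      # density at the midpoint x = 0
peak = pool[np.argmin(np.abs(x + 2.0))]      # density near the mode at x = -2
print(round(mass, 4), peak > dip)            # the pool is bi-modal
```

The dip at the midpoint illustrates the shape-preservation failure discussed above: both inputs are uni-modal, yet the combination is not.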
2.2. Quantile Aggregation and the Wasserstein Barycenter
In this section we consider the case in which the discrepancy measure is the p-Wasserstein metric, defined as
| (4) |
where the minimization is over the set of all joint distributions whose marginals are the two input distributions. Formally, we write
| (5) |
In other words, each element of this set is a coupling of the two distributions. In the optimal transport literature, the minimizer of (4) is also known as the optimal transport plan. This is because the coupling evaluated at a pair of sets A and B can be interpreted as the amount of mass moved from A to B in order to minimize the total transportation cost. For more detail on the field of optimal transport, see [8,9].
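As a small illustration of the coupling definition (our addition; the support points and weights are arbitrary), the discrete optimal transport problem with squared-distance cost can be solved as a linear program, recovering both the optimal transport plan and the squared 2-Wasserstein distance:

```python
import numpy as np
from scipy.optimize import linprog

xa = np.array([0.0, 1.0, 2.0])          # support of distribution A
xb = np.array([0.5, 1.5])               # support of distribution B
a = np.array([0.25, 0.5, 0.25])         # marginal weights of A
b = np.array([0.5, 0.5])                # marginal weights of B

C = (xa[:, None] - xb[None, :]) ** 2    # squared-distance cost matrix (3 x 2)

# Equality constraints: row sums of the coupling equal a, column sums equal b.
n, m = C.shape
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0    # i-th row-marginal constraint
for j in range(m):
    A_eq[n + j, j::m] = 1.0             # j-th column-marginal constraint
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
plan = res.x.reshape(n, m)              # optimal transport plan
w2_squared = res.fun                    # squared 2-Wasserstein distance
print(plan.round(3), round(w2_squared, 4))
```

In this example the optimal (comonotone) plan moves each quarter of the mass a distance of 0.5, giving a squared distance of 0.25.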
A special case of this Wasserstein barycenter is closely related to a recently proposed probability/density forecast combination method in the forecasting literature. More specifically, suppose that the input densities are univariate and the discrepancy measure is the squared 2-Wasserstein metric; in this case, we have
| (6) |
where the two quantile functions are those of agent i and of the combination method, respectively. This forecast aggregation rule is also known as “quantile aggregation” or the “Vincentized distribution” [2,3,10]. We prefer the representation of Equation (1) because, unlike quantile aggregation, this definition can be easily extended to higher dimensional densities or mixed data types (e.g., when some inputs are continuous and others are discrete).
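The quantile-averaging representation can be sketched numerically as follows (our illustration; the two Gaussian inputs are arbitrary). For Gaussian inputs, the averaged quantile function is itself Gaussian, with mean and standard deviation equal to the averages of the inputs’ means and standard deviations:

```python
import numpy as np
from scipy.stats import norm

mu1, s1 = -1.0, 1.0     # forecaster 1: N(-1, 1)
mu2, s2 = 2.0, 2.0      # forecaster 2: N(2, 4)

u = np.linspace(0.001, 0.999, 999)   # interior quantile levels
# Equal-weight quantile averaging ("Vincentization"):
q_bar = 0.5 * (norm.ppf(u, mu1, s1) + norm.ppf(u, mu2, s2))

# The averaged quantile function matches N((mu1+mu2)/2, ((s1+s2)/2)^2).
expected = norm.ppf(u, 0.5 * (mu1 + mu2), 0.5 * (s1 + s2))
print(np.max(np.abs(q_bar - expected)))   # numerically zero
```

Note that the implied barycenter variance, (1.5)^2 = 2.25, is much smaller than the variance of the equal-weight linear pool of the same inputs, 0.5(1 + 1) + 0.5(4 + 4) − 0.25 = 4.75, consistent with the narrowing discussed below.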
The Wasserstein barycenter is known to preserve shape properties of the input densities, such as log-concavity [11]. For example, [12] show that the Wasserstein barycenter of two Gaussian inputs is again Gaussian, where its covariance matrix S is the solution of
| (7) |
see also [13]. This differs from the linear opinion rule, which leads to a mixture of the two normal densities and, in contrast, can be expected to be bi-modal whenever the two means are sufficiently far apart.
Another difference between the two aggregation methods is that the variance of the Wasserstein barycenter is smaller than that of the combined density resulting from the linear opinion rule. As shown in [2], this holds for a more general class of input densities in the univariate case. Of course, a narrow (i.e., sharp) predictive density can be good or bad depending on the underlying distribution of the target variable, so it is desirable to be able to adjust the dispersion of the combined density flexibly.
2.3. Regularized Wasserstein Barycenter
Now, we turn to our proposal. In this paper, we use a regularized Wasserstein distance [14,15] to combine individual probability forecasts. The regularization term used in this approximation of the Wasserstein metric is the negative differential entropy of the coupling, defined for measures that are absolutely continuous with respect to the Lebesgue measure, and taken to be infinite otherwise. We use this term to define the regularized Wasserstein metric as
| (8) |
where the parameter γ ≥ 0 controls the strength of the regularization. Note that the coupling is constrained by the same two marginal restrictions as its unregularized counterpart, as described in the definition of the Wasserstein metric. This form of regularization was originally introduced by [14] in order to estimate the Wasserstein metric in a computationally efficient manner using the iterative proportional fitting procedure (IPFP) of [16].
When γ = 0, there is no regularization, so we recover the unregularized metric. One can also show that the optimal coupling converges, as γ tends to zero, to the unregularized optimal coupling when the latter is uniquely defined; otherwise, the limit is the element of the set of optimal unregularized couplings with maximum entropy [15]. Higher values of γ place more weight on the second term of the objective function, which results in optimal couplings that are smoother and more dispersed than their unregularized counterparts.
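The IPFP/Sinkhorn iteration referenced above can be sketched on a discrete grid as follows (our illustration; the grid, the marginals, and the values of γ are arbitrary choices). The iteration alternately rescales the rows and columns of the Gibbs kernel so that the coupling matches both marginals:

```python
import numpy as np

def sinkhorn(a, b, C, gamma, n_iter=2000):
    """Entropy regularized optimal coupling via IPFP/Sinkhorn scaling."""
    K = np.exp(-C / gamma)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                  # rescale columns to match b
        u = a / (K @ v)                    # rescale rows to match a
    return u[:, None] * K * v[None, :]     # optimal coupling

x = np.linspace(-3.0, 3.0, 61)
a = np.exp(-0.5 * (x + 1.0) ** 2); a /= a.sum()   # discretized N(-1, 1)
b = np.exp(-0.5 * (x - 1.0) ** 2); b /= b.sum()   # discretized N(+1, 1)
C = (x[:, None] - x[None, :]) ** 2                # squared-distance cost

P_small = sinkhorn(a, b, C, gamma=0.1)
P_large = sinkhorn(a, b, C, gamma=1.0)

entropy = lambda P: -np.sum(P[P > 0] * np.log(P[P > 0]))
# Both couplings satisfy the marginal constraints; the larger gamma produces a
# smoother (higher-entropy) coupling, as described in the text.
print(np.abs(P_small.sum(axis=1) - a).max() < 1e-9,
      entropy(P_small) < entropy(P_large))
```

In practice log-domain stabilization is preferred for very small γ; the plain scaling above is adequate for this illustrative grid.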
Defining the discrepancy measure in Equation (1) by this regularized metric results in the combined density
| (9) |
which is known as the regularized Wasserstein barycenter. The authors of [4] provide a generalization of the IPFP to find this barycenter, which is far more computationally efficient than in the unregularized case. While computational efficiency is the most commonly cited reason for using entropy regularization, as we will see in later sections, our motivation for regularization is not entirely computational.
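A minimal sketch of this kind of fixed-point scheme for the equal-weight barycenter (our illustration, in the spirit of the iterative-projection approach of the cited literature; the grid and inputs are arbitrary) is:

```python
import numpy as np

def entropic_barycenter(ps, C, gamma, n_iter=1000):
    """Equal-weight entropy regularized barycenter of the rows of ps."""
    K = np.exp(-C / gamma)
    v = np.ones_like(ps)
    for _ in range(n_iter):
        u = ps / (K @ v.T).T                       # fit each input marginal
        Ktu = (K.T @ u.T).T
        q = np.exp(np.mean(np.log(Ktu), axis=0))   # geometric-mean projection
        v = q[None, :] / Ktu                       # fit the common barycenter
    return q / q.sum()

x = np.linspace(-4.0, 4.0, 81)
p1 = np.exp(-0.5 * (x + 1.0) ** 2); p1 /= p1.sum()   # discretized N(-1, 1)
p2 = np.exp(-0.5 * (x - 1.0) ** 2); p2 /= p2.sum()   # discretized N(+1, 1)
C = (x[:, None] - x[None, :]) ** 2

q = entropic_barycenter(np.vstack([p1, p2]), C, gamma=0.2)
mean_q = np.sum(x * q)    # by symmetry, the average of the input means (zero)
print(round(mean_q, 6))
```

The mean of the computed barycenter equals the average of the input means, previewing the location result established in Section 3.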
For the rest of the paper, we study this regularized Wasserstein barycenter, which is defined in Equation (1) using (8). First, we present analytical results under a parametric assumption that broadens our understanding about the role of the regularization in forecast density combination. Then, we discuss how one can empirically choose the strength of the regularization that would achieve a certain notion of optimality.
3. Analytical Results: The Impact of Entropy Regularization
In this section we provide analytical results that describe the impact of entropy regularization on the shape of the barycenter. To compare this barycenter with its unregularized counterpart in the Gaussian case, we focus on the regularized barycenter when the two inputs are d-dimensional multivariate Gaussians. The regularized Wasserstein barycenter in this case is defined as
| (10) |
The following theorem completely characterizes the resulting barycenter in this case. Like the unregularized case, the theorem shows that regularization does not impact the mean of the barycenter; however, it does have an impact on its variance–covariance matrix.
Theorem 1.
Let and be Gaussian density functions with means and variance matrices, The regularized Wasserstein barycenter between and is given by the density function of where and are defined by,
where is the unique symmetric matrix that satisfies these equalities and
Also, the iterates of the following series converge to V when
The proof of this result is included in Appendix A. We prove a slightly more general version of the theorem in which the objective function in Equation (10) is a weighted average of the two regularized distances. The proof first derives a system of equations that characterizes the barycenter under the assumption that the regularized barycenter is Gaussian. A fixed point theorem of [17] for mappings on partially ordered sets is then used to show that this system has a unique solution, which, along with the convexity of Equation (10), implies that the regularized barycenter is Gaussian.
Now, we discuss our theoretical results and their implications for the density forecast combination problem.
Remark 1.
(on location). Regularization does not affect the mean of the resulting barycenter, which is a property that may not hold in the more general setting that does not include a normality assumption. For example, suppose the domain of p is and and consider the barycenter between p and itself. For any fixed density function the optimal coupling of the optimization problem that defines converges to as this is the coupling with maximum entropy that has marginals given by q and see for example, [15]. However, the negative entropy of is less than or equal to that of for any such fixed density We can also ensure these couplings are feasible by defining q to be a uniform density function, so we have This implies that regardless of the Since the unregularized density is given by and the regularization parameter does impact the mean of the barycenter.
Remark 2.
(on dispersion). Regularization tends to smooth the resulting barycenter, leading to a more dispersed combined density. To understand this point, consider the simple example below.
Example 1.
Consider a case with univariate and . Then, the original Wasserstein barycenter (quantile averaging) is . On the other hand, the regularized Wasserstein barycenter is
As this case exemplifies, the strength of the regularization controls the dispersion of the combined density: the heavier the regularization, the more dispersed (or smoother) the resulting density. This result highlights that entropy regularization offers extra flexibility to control the dispersion of the combined density. In the next section, we propose a data-driven way to select the strength of the regularization.
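The remark can also be checked numerically. The following sketch (our illustration; the grid, the Gaussian inputs, and the γ values are arbitrary choices) computes discrete regularized barycenters of two Gaussians for increasing γ and reports their standard deviations:

```python
import numpy as np

def entropic_barycenter(ps, C, gamma, n_iter=1000):
    """Equal-weight entropy regularized barycenter of the rows of ps."""
    K = np.exp(-C / gamma)
    v = np.ones_like(ps)
    for _ in range(n_iter):
        u = ps / (K @ v.T).T
        Ktu = (K.T @ u.T).T
        q = np.exp(np.mean(np.log(Ktu), axis=0))
        v = q[None, :] / Ktu
    return q / q.sum()

x = np.linspace(-4.0, 4.0, 81)
p1 = np.exp(-0.5 * (x + 1.0) ** 2); p1 /= p1.sum()
p2 = np.exp(-0.5 * (x - 1.0) ** 2); p2 /= p2.sum()
ps = np.vstack([p1, p2])
C = (x[:, None] - x[None, :]) ** 2

def std_of(q):
    m = np.sum(x * q)
    return np.sqrt(np.sum((x - m) ** 2 * q))

stds = [std_of(entropic_barycenter(ps, C, g)) for g in (0.1, 0.5, 2.0)]
print([round(s, 3) for s in stds])   # standard deviation grows with gamma
```

The computed standard deviations increase monotonically in γ, in line with the dispersion result of Theorem 1.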
Remark 3.
The normality assumption that we made to obtain the closed-form solution for the barycenter is not needed in practice. The regularized barycenter of probability/density forecasts is well-defined and computationally tractable in much broader settings: one can have multiple inputs, non-Gaussian densities, and discrete, continuous, or mixed distributions. This covers many interesting and empirically relevant situations in economic forecasting, such as macroeconomic and financial forecasting. The efficient computation of the regularized Wasserstein distance and barycenter with non-Gaussian input densities is still an active area of research, and there is a large literature on computing the regularized barycenter in practice; see, for example, [4,18,19,20,21,22,23].
Remark 4.
During the review process for this paper, we became aware of a similar result proved independently by [5]. There are two primary differences between the results. First, our result provides the regularized barycenter between two multivariate normal densities, while Theorem 1 of [5] provides the barycenter between an arbitrary number of univariate normal densities. Second, our result also provides a recursive formula for the variance–covariance matrix of the barycenter that is guaranteed to converge to the desired solution. We thank one of the referees for pointing out these relevant papers.
There have also been a number of recent results on related barycenters, including barycenters modified to avoid the increase in dispersion caused by regularization using one of two techniques. First, the differential entropy penalty can be replaced by a Kullback–Leibler divergence penalty with a reference measure given by the product of the input densities. Second, a technique known as debiasing can be used. For example, the remaining results in [5], as well as the results of [24,25], characterize these types of regularized Wasserstein barycenters between Gaussian densities. In contrast to the barycenter we consider, which can be viewed as the limit of the original discrete entropy regularized Wasserstein barycenter as the number of bins diverges, increasing the regularization parameter of these alternative barycenters either decreases or does not change the variance of the barycenter.
4. On Choosing the Strength of the Regularization
This section discusses how to choose the strength of the penalization. Our empirical strategy is to select γ as the value that most accurately fits the observed data. To economize on notation, we restrict our discussion to the 1-step-ahead prediction. To do so, we regard the regularized barycenter computed at time t as a predictive likelihood for the next observation. This predictive likelihood interpretation of the barycenter can be formally justified within a principal–agent framework similar to the one developed by [26]. Suppose we have collected the regularized barycenters and the realized values of the target variable from the initial period to the present. Then, we can define a maximum likelihood estimator for γ at time t as
| (11) |
and the combined density prediction for at time t is
| (12) |
There is a notion in which this combined density with is optimal. Suppose that , and assume that forecasters report a sequence of predictive densities, for , and . These forecasts are reported before the realization of , and the barycenter is defined by ’s and . Then, the following can be shown under regularity conditions,
for . In turn, a maximizer of the left-hand-side term also converges to the maximizer of the right-hand-side term, which is a minimizer of
Therefore, the estimator converges to the pseudo-true parameter that minimizes the Kullback–Leibler (KL) divergence from the regularized barycenter to the true data generating process. In other words, we find the γ that makes the resulting barycenter as close as possible to the true data generating process in the limit. This asymptotic thought experiment can be justified under quite general conditions, allowing for a range of serial dependence in the data as well as a flexible form of the regularized Wasserstein barycenter implied by the input forecasts. We can operationalize this by recognizing that the barycenter can be viewed as a predictive likelihood formed in the previous period; quasi-MLE theory can then be invoked, e.g., [27,28]. We provide a simple example in which the true data generating process follows an autoregressive (AR) process.
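The selection rule can be sketched schematically as follows (our illustration). Since the closed-form barycenter is not reproduced here, we use a hypothetical stand-in family N(0, 1 + γ), whose dispersion grows with γ in a way that mimics the effect of the regularization; the rule picks the γ whose implied predictive density best fits the realized values by log score:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, np.sqrt(2.0), size=5000)   # realized values, Var(y) = 2

def sum_log_score(y, var):
    """Sum of log predictive scores under a N(0, var) predictive density."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * y ** 2 / var)

grid = np.linspace(0.0, 3.0, 301)              # candidate gamma values
scores = [sum_log_score(y, 1.0 + g) for g in grid]
gamma_hat = grid[int(np.argmax(scores))]
print(round(gamma_hat, 2))   # close to 1.0, so that 1 + gamma matches Var(y)
```

As in Example 2 below, the selected γ inflates the stand-in variance until it matches the sample variance of the realized data.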
Example 2.
Suppose that forecasters 1 and 2 use a mean-zero Gaussian AR(1) process to construct their density predictions. The two forecasts differ only by the mean-reversion parameter. That is, the means of the predictive distributions for forecasters 1 and 2 are and , respectively. Based on our theory in the previous section, the barycenter is where , and the log density of the regularized barycenter at τ for is
| (13) |
and the ML estimator for γ at time t is
| (14) |
which leads to
| (15) |
Now, suppose that the actual data generating process is
| (16) |
When the simple average of both forecasters’ autoregressive parameters equals , the ML estimate for γ depends on the true conditional variance, , and the forecasters’ conditional variance. If the sample variance is larger than that of the forecasters, then γ is chosen so that the resulting regularized barycenter has the same variance as the sample variance. On the other hand, if the sample variance is smaller than that of the forecasters, then γ is set to 0. Note that there is an asymmetry in adjusting the variance of the barycenter; this is natural in that the regularization can only make the resulting density smoother. In practice, this may not be a problem if the practitioner’s concern is that the combined density is too sharp (e.g., relative to the linear opinion rule).
Note that converges in probability to . The KL divergence between and the true conditional density of at t is minimized at . This confirms that our selection rule for γ aims to fit the data well by shaping the regularized barycenter as close as possible to the data generating process.
5. Empirical Illustration
In this section, we illustrate our proposed method using macroeconomic data for the U.S. We consider 14 hypothetical forecasters who produce their own 1-step-ahead forecast of the U.S. inflation rate based on the following vector autoregression (VAR) with three variables,
| (17) |
where the dependent variable is a vector consisting of three quarterly macroeconomic variables, the intercept is a vector, and the remaining coefficients are matrices. The first two variables are common to all 14 forecasters: the annualized quarter-over-quarter inflation rate and the real GDP growth rate. The forecasters differ by the third variable, for which we assign each forecaster a different macroeconomic variable from the FRED-QD database of [29]. A detailed description of the variables used in this exercise is in Table 1.
Table 1.
Variables used in empirical exercises.
| | Used by | Variable Description | FRED-QD Mnemonic |
| Variable 1 | All | Inflation rate | GDPCTPI |
| Variable 2 | All | Real GDP growth rate | GDPC1 |
| Variable 3 | Forecaster 1 | Real Personal Consumption Expenditures | PCECC96 |
| | Forecaster 2 | Industrial Production Index | INDPRO |
| | Forecaster 3 | All Employees: Total Nonfarm | PAYEMS |
| | Forecaster 4 | Housing Starts: Total Privately Owned Housing Units Started | HOUST |
| | Forecaster 5 | Real Manufacturing and Trade Industries Sales | CMRMTSPLx |
| | Forecaster 6 | Real Crude Oil Prices: West Texas Intermediate (WTI) | OILPRICEx |
| | Forecaster 7 | Real Average Hourly Earnings: Manufacturing | CES3000000008x |
| | Forecaster 8 | 10-Year Treasury Constant Maturity Minus 3-Month Treasury Bill | GS10TB3Mx |
| | Forecaster 9 | Real Commercial and Industrial Loans | BUSLOANSx |
| | Forecaster 10 | Real Total Assets of Households and Nonprofit Organizations | TABSHNOx |
| | Forecaster 11 | U.S. / U.K. Foreign Exchange Rate | EXUSUKx |
| | Forecaster 12 | Consumer Sentiment (University of Michigan) | UMCSENTx |
| | Forecaster 13 | S&P’s Common Stock Price Index: Composite | S&P 500 |
| | Forecaster 14 | Real Disposable Business Income | CNCFx |
Note: All variables are obtained from the FRED-QD database [29]. The inflation rate is computed as a log difference of the GDP deflator (GDPCTPI). The real GDP growth rate is computed as a log difference of real GDP (GDPC1). All other variables are transformed following [29]. We use the 2019-11 vintage data. Each forecaster constructs a predictive distribution using their own three-variable vector autoregression.
We compute each forecaster's 1-step-ahead predictive distribution for the inflation rate at time t, where the subscript denotes the corresponding element of a vector/matrix. These forecasters assume that the 1-step-ahead predictive distribution of inflation is Gaussian, and they use their best guesses of the predictive mean and variance to construct the predictive distribution. More specifically, they set these two moments as
| (18) |
where the coefficient estimate is the posterior mean under a flat prior. We set the rolling estimation window so that the forecasters use the most recent 20 years of data to construct the predictive distribution.
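The construction of each forecaster's Gaussian predictive density can be sketched as follows (our illustration with simulated data; the VAR dimensions, sample size, and coefficient values are arbitrary, and the OLS estimate stands in for the flat-prior posterior mean):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 3
Phi_true = 0.5 * np.eye(d)
Y = np.zeros((T, d))
for t in range(1, T):                          # simulate a simple VAR(1)
    Y[t] = Y[t - 1] @ Phi_true.T + rng.normal(size=d)

X = np.column_stack([np.ones(T - 1), Y[:-1]])  # intercept plus one lag
B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)  # OLS = flat-prior posterior mean
resid = Y[1:] - X @ B
Sigma = resid.T @ resid / (T - 1 - X.shape[1]) # residual covariance estimate

x_last = np.concatenate([[1.0], Y[-1]])
pred_mean = x_last @ B                          # one-step-ahead predictive mean
# Gaussian predictive density for the first variable (e.g., inflation):
mu_pi, var_pi = pred_mean[0], Sigma[0, 0]
print(round(float(mu_pi), 3), round(float(var_pi), 3))
```

Each of the 14 hypothetical forecasters would repeat this construction with their own third variable and a rolling 20-year window.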
We let the forecasters generate their 1-step-ahead predictive distributions for the inflation rate from 2001Q1 to 2018Q4, which leaves us with 72 quarters for the forecast evaluation sample. At each point in time, we also combine these 14 predictive densities using the regularized Wasserstein barycenter with 20 different values of the regularization parameter. As explained in the previous section, a larger value of this parameter implies stronger regularization, and the resulting combined predictive density becomes smoother, with a larger variance. We also compute the combined density without regularization, which corresponds to “quantile aggregation” or the “Vincentized distribution”. Our computation of the regularized barycenter is based on the algorithm proposed by [19]; the MATLAB toolbox that implements this algorithm is available from https://github.com/gpeyre/2015-SIGGRAPH-convolutional-ot.
We evaluate each individual forecaster, as well as the forecast aggregation methods, by the sum of log predictive scores over the evaluation sample, where the log predictive score is the logarithm of the predictive density evaluated at the realized value. These results are presented in Figure 1. The left panel presents the sum of the log scores for the individual forecasters, sorted by their performance; there is a sizeable difference in their historical performance. The solid line represents the performance of quantile aggregation, which aggregates all forecasters in the pool. As found in other studies, e.g., [2,3], quantile aggregation generates a decent predictive distribution, performing slightly better than the ex post top-4 forecaster.
Figure 1.
Sum of log predictive score for U.S. inflation rate (2000Q1–2018Q4).
The right panel in Figure 1 shows the historical performance of our proposed approach for various choices of the regularization parameter. For a wide range of values, the regularized barycenter performs better than quantile aggregation; it even outperforms the best individual forecaster. This is notable because the best forecaster cannot be identified a priori.
The optimal value of γ defined in Equation (11) at the end of the evaluation sample is the value of γ that corresponds to the peak of the curve. If we had used this value at the beginning of the evaluation sample, the mean difference in the log predictive score between the regularized Wasserstein barycenter and quantile aggregation would have been 0.12, with a heteroscedasticity and autocorrelation consistent (HAC) standard error of 0.07. This implies that the difference between the peak of the curve and the solid line is statistically significant at the 10% significance level.
To make the selection fully adaptive, we also compute the optimal γ sequentially from the beginning to the end of the evaluation sample. That is, we set the predictive density to the regularized barycenter with the value of γ that maximizes the objective function in Equation (11), using only the information available from the beginning of the sample up to time t. In this way, we do not use any future information when choosing γ. Even in this case, the regularized Wasserstein barycenter performs better than the best individual forecaster and quantile aggregation: the sum of the log predictive scores is −93.09, and the mean difference in the log predictive score relative to quantile aggregation is 0.11, with a HAC standard error of 0.06. This suggests that the regularized Wasserstein barycenter with adaptively chosen (i.e., estimated online) γ performs statistically better than its unregularized counterpart, quantile aggregation, at the 10% significance level. This superior predictive performance remains unchanged when we split the evaluation sample in two: the mean difference in the log predictive score is 0.13 for the first half and 0.09 for the second half of the evaluation sample.
6. Concluding Remarks
This paper proposes to use the entropy regularized Wasserstein barycenter to combine several probability and density forecasts. The entropy regularization smooths the resulting combined forecast, and it offers a flexible way to adjust the dispersion of the predictive density when needed. We study the effect of the regularization on the combined density forecast and provide an exact relationship between the strength of the regularization and the variance–covariance matrix of the combined density when the input densities are Gaussian. We then provide a way to select the strength of the regularization by choosing the regularized barycenter that most closely matches the data. We apply our proposed methodology to density forecasting of U.S. inflation and show how entropy regularization can improve the quality of the density forecast relative to its unregularized counterpart.
In this article, we restrict the weights of the input densities in the final combination to be pre-determined (i.e., equal weighting). This choice was intentional, to focus on the role of entropy regularization. In practice, however, a subset of input densities might be superior to the others, and one may wish to put different weights on each input density. Alternatively, it may be desirable to include only a subset of input densities in the combination and set the other weights to zero; see, for example, [30]. For those cases, it would be fruitful to develop a data-dependent method that chooses both the regularization strength and the weights simultaneously, which is a topic for future research.
Acknowledgments
We thank Frank Diebold, Roger Koenker, Frank Schorfheide, R. Miyauchi Lee, and two anonymous referees for their insightful comments.
Appendix A
The authors of [17] provide the following fixed point theorem, which we will use in the proof of Theorem 1.
Lemma A1 (Ran and Reurings, 2004).
Let T be a partially ordered set such that every pair of elements has a lower bound and an upper bound. Furthermore, let d be a metric on T such that (T, d) is a complete metric space. If we have a continuous, monotone (i.e., either order-preserving or order-reversing) map from T into T such that,
and
then has a unique fixed point, Also, for all
The following result follows from Lemma A1.
Lemma A2.
Suppose is the set of symmetric matrices with all eigenvalues in the range and are positive definite matrices. Then there is a unique such that where
Also, for any
Proof.
Suppose and First we will establish that is order-preserving, which is equivalent to Note that,
Similar logic implies that for all such
and since is the sum of both of these order-preserving functions, is also order-preserving.
Clearly our bounds on the eigenvalues imply that is continuous for all To show that is a mapping from T into note that matrix symmetry is preserved over addition and inversion, so is symmetric for all Also, note that,
Similar logic can be used to show that This also implies the final requirement of Lemma A1.
The only remaining requirement of Lemma A1 is the penultimate one, which we will establish for such that and using the norm, Also, let and denote the spectral norm of We will use the property where and see, for example, [17]. Note that,
where The second inequality follows from the matrix (respectively, being similar to a symmetric matrix, and with eigenvalues contained in because implies □
Next we will establish Theorem 1, which is restated below. This is a slightly more general version of the theorem in the main text where the objective function in Equation (10) is a weighted average of and .
Theorem A1.
Let and and be Gaussian density functions with means and variance matrices, The regularized Wasserstein barycenter between and is given by the density function of where and are defined by,
where is the unique symmetric matrix that satisfies these equalities and
Also, the iterates of the following series converge to V when
Proof.
Let be defined as, and, for a given function we will denote the convolution of and as, When there is little risk of confusion, we will omit the input of functions supported on in the remainder of the proof.
We will characterize the barycenter using the fact that it is the minimizer of the following optimization problem.
(A1) To do so, note that the optimal coupling corresponding to can be defined by instead solving the dual of (8), which is
(A2) and the optimal coupling can be defined in terms of the dual variables as The first-order conditions of (A2) are
(A3)
(A4) Also, since the objective function of (A2) is differentiable, an application of the envelope theorem implies
Thus, the optimum of (A1) can be characterized by the following functional derivative being zero.
After combining this equality with (A3) and (A4), we have that the barycenter can be characterized by the system
This system can be reduced to two equalities after noting that, and implies
After combining both equalities, and noting we have
(A5) Let be defined as the set of functions of the form
where is a symmetric and invertible matrix, and It will also be convenient to let be defined so that and be defined so that It is well known that if are Gaussian density functions, then where and and it is also straightforward to show
Likewise, in the case of we will also use the properties
Note that is the necessary and sufficient condition for to be well defined, and it is straightforward to verify that the properties above also hold over all pairs of when this is the case; for the case of normal density functions, see, for example, [31].
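For concreteness, the standard identities for products and convolutions of Gaussian densities invoked here (see [31]) are:

```latex
% Product of Gaussian densities: Gaussian up to a constant factor
\[
\mathcal{N}(x;\mu_1,\Sigma_1)\,\mathcal{N}(x;\mu_2,\Sigma_2)
  = Z\,\mathcal{N}(x;\mu_\times,\Sigma_\times),
\qquad
\Sigma_\times=\bigl(\Sigma_1^{-1}+\Sigma_2^{-1}\bigr)^{-1},\quad
\mu_\times=\Sigma_\times\bigl(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\bigr),
\]
where the normalizing constant is $Z=\mathcal{N}(\mu_1;\mu_2,\Sigma_1+\Sigma_2)$.
% Convolution of Gaussian densities: Gaussian with summed parameters
\[
\bigl(\mathcal{N}(\cdot\,;\mu_1,\Sigma_1)*\mathcal{N}(\cdot\,;\mu_2,\Sigma_2)\bigr)(x)
  = \mathcal{N}(x;\mu_1+\mu_2,\Sigma_1+\Sigma_2).
\]
```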
Next, we will suppose that is in which, due to (A5), also implies and then show that there exists a unique that satisfies (A5). Since (A1) is a strictly convex optimization problem, when a solution to (A1) exists, it can be characterized uniquely by its first-order conditions. Note that, for any pair that solves (A1), we have that where are also solutions. We avoid complications from this issue by placing the additional restriction on these dual variables that as this ensures strict convexity over this set of dual functions. To see that this is also without loss of generality, note that rescaling the dual variables by would not impact the objective function in (A2) because Also, a would not impact the first order conditions (A3) and (A4), so it would also not have an impact on q. Thus, after providing that solves (A5), we will have also shown that this solution is unique even when not restricted to
Since and are elements of and is closed under multiplication, division, convolution, and exponentiation to the (non-zero) power of if then the functions on both sides of the equality (A5) will also be elements of Let and As noted above, the convolutions in (A5) are only well defined if the following matrix inequalities hold, so we will also require the solution to satisfy these inequalities.
which hold if and only if
(A6) It is straightforward to verify that these inequalities are identical to the ones that ensure the optimal coupling is integrable, as this coupling is given by, Thus, Fubini’s theorem implies that they are also sufficient conditions for q to be integrable.
We can find by applying to (A5), which implies
(A7)
(A8) Let After three applications of the matrix inversion lemma and simplifying we have that, for each
(A9) This, along with Equations (A7) and (A8), implies that can be characterized by
After defining V as this implies
Note that our requirement that satisfy (A6) can be written in terms of V as, and Lemma 2 implies that there is a unique solution that satisfies these conditions.
The functional form for from the statement of this theorem follows from an alternative ordering of the matrix inversion lemma. Specifically, starting from (A9)
Thus,
After applying to both sides of (A5), we have
(A10) To simplify this expression, we will first establish three intermediate equalities. First, Equations (A7) and (A8) imply
(A11) Second, (A11) in turn implies
(A12) Third, after an application of the matrix inverse identity to (A7) and (A8)
(A13)
(A14) which implies
Thus,
(A15) We will start with the coefficient on in (A10). The equalities (A11) and (A15) imply that this term is equal to
The equalities (A12) and (A15) imply that the coefficient on in (A10) can be written as
After combining these terms, we can reduce (A10) to
Since the matrix inverse identity also implies
we have
□
Author Contributions
The authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
Disclaimer
The views expressed in this paper are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia, the Federal Reserve System, or the Census Bureau. Any errors or omissions are the responsibility of the authors. There are no sensitive data in this paper.
References
- 1.Geweke J., Amisano G. Optimal prediction pools. J. Econom. 2011;164:130–141. doi: 10.1016/j.jeconom.2011.02.017. [DOI] [Google Scholar]
- 2.Lichtendahl K.C., Grushka-Cockayne Y., Winkler R. Is it better to average probabilities or quantiles. Manag. Sci. 2013;59:1594–1611. doi: 10.1287/mnsc.1120.1667. [DOI] [Google Scholar]
- 3.Busetti F. Quantile aggregation of density forecasts. Oxf. Bull. Econ. Stat. 2017;79:495–512. doi: 10.1111/obes.12163. [DOI] [Google Scholar]
- 4.Benamou J., Carlier G., Cuturi M., Nenna L., Peyre G. Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 2015;37:1111–1138. doi: 10.1137/141000439. [DOI] [Google Scholar]
- 5.Janati H., Cuturi M., Gramfort A. Debiased Sinkhorn barycenters. arXiv 2020, arXiv:2006.02575. [Google Scholar]
- 6.Timmermann A. Handbook of Economic Forecasting. Volume 1. Elsevier; Amsterdam, The Netherlands: 2006. Forecast combinations; pp. 135–196. [Google Scholar]
- 7.Clemen R. Combining forecasts: A review and annotated bibliography. Int. J. Forecast. 1989;5:559–583. doi: 10.1016/0169-2070(89)90012-5. [DOI] [Google Scholar]
- 8.Villani C. Topics in Optimal Transportation. Volume 58 American Mathematical Soc.; Providence, RI, USA: 2003. [Google Scholar]
- 9.Galichon A. Optimal Transport Methods in Economics. Princeton University Press; Princeton, NJ, USA: 2018. [Google Scholar]
- 10.Ratcliff R. Group reaction time distributions and an analysis of distribution statistics. Psychol. Bull. 1979;86:446–461. doi: 10.1037/0033-2909.86.3.446. [DOI] [PubMed] [Google Scholar]
- 11.Genest C. Vincentization revisited. Ann. Stat. 1992;20:1137–1142. doi: 10.1214/aos/1176348676. [DOI] [Google Scholar]
- 12.Agueh M., Carlier G. Barycenters in the Wasserstein space. SIAM J. Math. Anal. 2011;43:904–924. doi: 10.1137/100805741. [DOI] [Google Scholar]
- 13.Knott M., Smith C.S. On a generalization of cyclic monotonicity and distances among random vectors. Linear Algebra Appl. 1994;199:363–371. doi: 10.1016/0024-3795(94)90359-X. [DOI] [Google Scholar]
- 14.Cuturi M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport; Proceedings of the 27th Annual Conference on Neural Information Processing Systems; Lake Tahoe, NV, USA. 5–10 December 2013. [Google Scholar]
- 15.Peyré G., Cuturi M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 2019;11:355–607. doi: 10.1561/2200000073. [DOI] [Google Scholar]
- 16.Sinkhorn R. Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 1967;74:402–405. doi: 10.2307/2314570. [DOI] [Google Scholar]
- 17.Ran A.C., Reurings M.C. A fixed point theorem in partially ordered sets and some applications to matrix equations. Proc. Am. Math. Soc. 2004;132:1435–1443. doi: 10.1090/S0002-9939-03-07220-4. [DOI] [Google Scholar]
- 18.Cuturi M., Doucet A. Fast computation of Wasserstein barycenters; Proceedings of the 31st International Conference on Machine Learning; Beijing, China. 21–26 June 2014. [Google Scholar]
- 19.Solomon J., De Goes F., Peyré G., Cuturi M., Butscher A., Nguyen A., Du T., Guibas L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 2015;34:66. doi: 10.1145/2766963. [DOI] [Google Scholar]
- 20.Dvurechensky P., Dvinskikh D., Gasnikov A., Uribe C., Nedic A. Decentralize and randomize: Faster algorithm for Wasserstein barycenters; Proceedings of the Annual Conference on Neural Information Processing Systems 2018; Montreal, QC, Canada. 3–8 December 2018. [Google Scholar]
- 21.Kroshnin A., Tupitsa N., Dvinskikh D., Dvurechensky P., Gasnikov A., Uribe C. On the complexity of approximating Wasserstein barycenters; Proceedings of the 36th International Conference on Machine Learning; Long Beach, CA, USA. 10–15 June 2019. [Google Scholar]
- 22.Lin T., Ho N., Chen X., Cuturi M., Jordan M.I. Fixed-support Wasserstein barycenters: Computational hardness and fast algorithm. arXiv 2020, arXiv:2002.04783v4. [Google Scholar]
- 23.Lin T., Ho N., Cuturi M., Jordan M.I. On the complexity of approximating multimarginal optimal transport. arXiv 2020, arXiv:1910.00152v2. [Google Scholar]
- 24.Janati H., Muzellec B., Peyré G., Cuturi M. Entropic optimal transport between (unbalanced) Gaussian measures has a closed form. arXiv 2020, arXiv:2006.02572. [Google Scholar]
- 25.Mallasto A., Gerolin A., Minh H.Q. Entropy-regularized 2-Wasserstein distance between Gaussian measures. arXiv 2020, arXiv:2006.03416. [Google Scholar]
- 26.Del Negro M., Hasegawa R., Schorfheide F. Dynamic prediction pools: An investigation of financial frictions and forecasting performance. J. Econom. 2016;192:391–405. doi: 10.1016/j.jeconom.2016.02.006. [DOI] [Google Scholar]
- 27.White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. doi: 10.2307/1912526. [DOI] [Google Scholar]
- 28.Bollerslev T., Wooldridge J.M. Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econom. Rev. 1992;11:143–172. doi: 10.1080/07474939208800229. [DOI] [Google Scholar]
- 29.McCracken M., Ng S. FRED-QD: A Quarterly Database for Macroeconomic Research. National Bureau of Economic Research; Cambridge, MA, USA: 2020. Working Paper No. 26872. [DOI] [Google Scholar]
- 30.Diebold F.X., Shin M. Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives. Int. J. Forecast. 2019;35:1679–1691. doi: 10.1016/j.ijforecast.2018.09.006. [DOI] [Google Scholar]
- 31.Bromiley P. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo. 2003;3:1. [Google Scholar]