Abstract
We present a particular formulation of optimal transport for matrix-valued density functions. Our aim is to devise a geometry which is suitable for comparing power spectral densities of multivariable time series. More specifically, the value of a power spectral density at a given frequency, which in the matricial case encodes power as well as directionality, is thought of as a proxy for a “matrix-valued mass density.” Optimal transport aims at establishing a natural metric in the space of such matrix-valued densities which takes into account differences between power across frequencies as well as misalignment of the corresponding principle axes. Thus, our transportation cost includes a cost of transference of power between frequencies together with a cost of rotating the principle directions of matrix densities. The two endpoint matrix-valued densities can be thought of as marginals of a joint matrix-valued density on a tensor product space. This joint density, very much as in the classical Monge–Kantorovich setting, can be thought to specify the transportation plan. Contrary to the classical setting, the optimal transport plan for matrices is no longer supported on a thin zero-measure set.
Index Terms: Convex optimization, matrix-valued density functions, optimal mass-transport
I. Introduction
The problem of optimal mass transport (OMT) dates back to the work of G. Monge in 1781 [1] while its modern formulation is due to L. Kantorovich [2]. In recent years the subject is developing rapidly due to its intrinsic significance and range of applications in physics, economics, and probability [3]–[5].
Our motivation for studying “matrix-valued transport” originates in the spectral analysis of multi-variable time series. Just as in scalar time series, spectral content is assessed based on (estimated) statistics of the underlying process, where these are simply moments of the corresponding power spectral density (PSD). Different metrics have been proposed to compare PSD’s for purposes of spectral approximation, estimation, and system modeling (see [6], [7] and the references therein). However, since spectra are estimated based on integrals, weak* metrics1 are preferable since they provide continuity of statistics to perturbations in the PSD. Earlier metrics and so called, divergence measures, typically fail in this respect (see [7]). Hence, for this reason, optimal mass transport which endows the space of (scalar) probability/mass/power densities with a natural weak* metric—the Wasserstein metric, is of particular interest. Our aim in this paper is to develop one possible such generalization of the Wasserstein metric that allows comparison of matrix-valued density functions in a similar spirit.
The scalar OMT theory has been adapted in [8] to model slowly time-varying changes in power spectra of time series and has been used for statistical estimation, data assimilation, and morphing. While in scalar time series, the power spectral content may drift across frequencies over time (e.g., when considering Doppler effects, echolocation of a moving target, etc.), in vector-valued time series the power spectral content may shift principle directions as well. In fact, such a rotation of the power-specral content is typical in general antenna-arrays when a scatterer changes position with respect to array elements. Therefore, a concept of transport between matrix-valued densities requires that we take into account both, the cost of shifting power across frequencies as well as the cost of rotating the corresponding principle axes. Besides our particular formulation of a “non-commutative” Monge–Kantorovich transportation problem and of a correponding metric, the main results in this paper are i) that the optimal transport can be cast as a convex-optimization problem, ii) the geodesics and transport paths can be determined using convex programming, and iii) the optimal transport plan has support which, in contrast to the classical Monge–Kantorovich setting, is no longer contained on a thin zero-measure set. The relevance of the proposed metric is highlighted in examples on spectral morphing and spectral tracking in the final section of the paper.
II. Preliminaries on Optimal Mass Transport
Consider two probability density functions μ0 and μ1 supported on ℝ and let ℳ(μ0, μ1) be the set of probability measures m on ℝ × ℝ with μ0 and μ1 as marginals, i.e.,
Clearly, ℳ(μ0, μ1) is not empty since already the product μ0(x)μ1(y) ∈ ℳ(μ0, μ1). Probability densities are thought of as distributions of mass and the optimal mass transport problem is to determine
(1) |
where c(x, y) is the cost of transporting one unit of mass from location x to y. In particular, when c(x, y) = |x − y|2, the optimal cost gives rise to the 2-Wasserstein metric
where
(2) |
In general, (1) is a linear program with dual
(3) |
where ϕ0, ϕ1 are continuous, see [3]. For the quadratic cost c(x, y) = |x − y|2 and in one spatial dimension, 𝓣2(μ0, μ1) can also be written explicitly in terms of the cumulative distributions functions
in the form
(4) |
In this case, the optimal joint probability density m ∈ ℳ(μ0, μ1) has support on (x, T (x)) where T (x) is the sub-differential of a convex lower semi-continuous function, see [3, p. 75]. More specifically, T (x) is uniquely defined by
(5) |
Interestingly, a geodesic μτ (τ ∈ [0, 1]) between μ0 and μ1 can be written explicitly as well in terms of a corresponding cumulative function Mτ, for each τ, defined via
(6) |
Indeed, it readily follows that:
and that μτ (τ ∈ [0, 1]) is a geodesic.
III. Matrix-Valued Optimal Mass Transport
We consider the one-dimensional family of matrix-valued functions
These are Hermitian, positive semi-definite matrix-valued functions on ℝ normalized so that their trace integrates to 1. They will be referred to as matrix-valued densities and can be thought of as a generalization of probability density functions. The scalar-valued tr(μ) represents mass at location x. Thus, all elements in ℱ have the same total mass over the support. Below, we motivate a particular cost of transportation between such matrix-valued functions and introduce a suitable generalization of the Monge–Kantorovich OMT to matrix-valued densities.
A. Tensor Product and Partial Trace
Consider two n-dimensional real or complex (Hilbert) spaces ℋ0 and ℋ1, let ℒ(ℋ0) and ℒ(ℋ1) denote the space of linear operators on ℋ0 and ℋ1, respectively, and let μ0 ∈ ℒ(ℋ0) and μ1 ∈ ℒ(ℋ1). Thus, in the present subsection, μi (i ∈ {0, 1}) are fixed matrices. We denote their tensor product by μ0 ⊗ μ1 ∈ ℒ(ℋ0 ⊗ ℋ1) which is formally defined via
Since our spaces are finite-dimensional this can be identified with the Kronecker product of the corresponding matrix representation of the two operators. The space ℒ(ℋ0 ⊗ ℋ1) is the span of all products μ0 ⊗ μ1 with μi ∈ ℒ(ℋi) for i ∈ {0, 1}.
Consider μ ∈ ℒ(ℋ0 ⊗ ℋ1). The partial traces trℋ0 and trℋ1, or tr0 and tr1 for brevity, are linear maps
defined uniquely by the property that on simple products they act as follows:
for any μ0 ∈ ℒ(ℋ0) and μ1 ∈ ℒ(ℋ1). Alternatively, μ ∈ ℒ(ℋ0 ⊗ ℋ1) can be represented by a matrix [μik,ℓm] of size n2 × n2 as it maps a basis element ui ⊗ vk ∈ ℋ0 ⊗ ℋ1 to Σℓ,m μik,ℓmuℓ ⊗ vm. Then, the partial trace e.g., tr1(μ) is the represented by the n × n matrix with (i, ℓ)-th entry Σkμik,ℓk, for 1 ≤ i, ℓ ≤ n. Likewise the (k, m)-th entry of tr0(μ) is Σi μik,im, for 1 ≤ k, m ≤ n. See [9] for the significance of partial trace in the context of quantum mechanics.
B. Joint Matrix-Valued Density
We now return to considering matrix-valued density functions μ0, μ1 ∈ ℱ. A naive attempt is to seek a joint density m ≥ 0 having support on ℝ × ℝ and having μ0, μ1 as “marginals,” i.e., so that
(7) |
However, in contrast to the scalar case, such an m does not exist in general. To see this, consider the case of matrix valued measures
where δ(·) denotes the Dirac delta. If m(x, y) were to exist, its support would be contained in {(1, 1), (1, 2) (2, 1), (2, 2)}. It is easy to see that there cannot be a consistent selection of four 2 × 2 matrices so that, in pairs, they sum up to the coefficients making up μ0(x) and μ1(y).
Thus, any natural definition of a transportation plan requires that the joint density lives in a bigger space. A particular formulation is as follows: we seek
(8a) |
for (x, y) ∈ ℝ × ℝ, such that
(8b) |
(8c) |
and denote
Since μ0 ⊗ μ1 ∈ M(μ0, μ1), the set M(μ0, μ1) is clearly not empty.
We next motivate a suitable transportation cost. This is a functional on the joint density M(μ0, μ1), just as in the scalar case. However, besides penalizing transport of mass between two points x and y, we also impose a penalty on a corresponding rotation as well.
C. Transportation Cost
As indicated earlier, we interpret tr(m(x, y)) as the total “mass” that is being transferred from x to y. We consider a scalar cost2 c(x, y) = (x − y)2 and the “mass transference” cost
(9) |
This coincides with the optimal transportation cost between scalar-valued densities tr(μ0) and tr(μ1). Thus, if tr(μ0(x)) = tr(μ1(x)), the optimal value of (9) is zero since it reduces to optimal transport between identical scalar marginals. Thus, (9) fails to quantify mismatch of directionality between the given matrix-valued marginals. Below, we introduce a term that penalizes directionality missmatch.
We assume throughout that the marginals are positive definite pointwise. Then, for i ∈ {0, 1}, tr(μi(x)) represents the total mass at x while μi(x)/tr(μi(x)), normalized to have trace 1, encapsulates directional information. Likewise, for the joint density m(x, y), assuming that m(x, y) ≠ 0, we define the normalized partial traces
Their difference captures the directional mismatch between the two partial traces. Hence, we introduce
to quantify the rotational mismatch and we consider the cost functional
with λ > 0, to weigh in the relative significance of the linear and rotational penalties.
D. Optimal Transportation Problem
In view of the above, we define
(10) |
with c(x, y) = (x − y)2, and show next that (10) is in fact a convex optimization problem.
From the definition
and hence
Now let m(x, y) = tr(m(x, y)) and m0(x, y), m1(x, y) be as in (8). The expression for the optimal cost in (10) is lower bounded by
(11) |
For an optimal triple m̂, m̂0, m̂1 of (11), m̂:= m̂0 ⊗ m̂1 is a minimizer of (10) that gives the same optimal value as (11). Thus, the optimal cost in (10) is equivalently written as (11).
For x > 0, the expression (y − z)2/x is jointly convex in the arguments x, y, z, see e.g., [10, p. 72]. It readily follows that the integral in (11) is a jointly convex functional of its arguments. All additional constraints in (11) are convex as well and, therefore, so is the optimization problem.
IV. On the Geometry of Optimal Mass Transport
An important result in the (scalar) OMT theory is that the transportation plan is the sub-differential of a convex function and has support on a thin zero-measure set, see e.g., [3, p. 92]. This property is not shared by the optimal transportation plan between matrix-valued density functions as we explain next.
In standard scalar OMT with convex transportation cost, the optimal transportation plan has a certain cyclically monotonic property [3]. More specifically, if (x1, y1), (x2, y2) are two points where the transportation plan has support (i.e., m(x, y) ≠ 0), then x2 > x1 implies y2 ≥ y1. The interpretation is that optimal transportation paths of mass elements do not cross. For the case of matrix-valued distributions as in (3), this property may not hold in the same way. However, interestingly, a weaker monotonicity property holds for the supporting set of the optimal matrix transportation plan. The property is defined next and the precise statement is given in Proposition 2 below.
Definition 1
A set 𝓢 ⊂ ℝ2 is called a ρ-monotonically nondecreasing, for ρ > 0, if for any two points (x1, y1), (x2, y2) ∈ 𝓢, it holds that
A geometric interpretation for a ρ-monotonically non-decreasing set is that if (x1, y1), (x2, y2) ∈ 𝓢 and x2 > x1, y1 > y2, then the area of the rectangle with vertices (xi, yj) (i, j ∈ {1, 2}) is not larger than ρ. The transportation plan of the scalar-valued optimal transportation problem with a quadratic cost has support on a 0-monotonically non-decreasing set.
Proposition 2
Given μ0, μ1 ∈ ℱ, let m be the optimal transportation plan in (10) with c(x, y) = (x − y)2 and λ > 0. Then m has support on at most a (4 · λ)-monotonically nondecreasing set.
Proof
See the Appendix.
Further, the optimal transportation cost 𝓣2, λ(μ0, μ1) satisfies:
𝓣2,λ (μ0, μ1) = 𝓣2,λ(μ1, μ0),
𝓣2,λ (μ0, μ1) ≥ 0,
𝓣2,λ (μ0, μ1) = 0 if and only if μ0 = μ1.
Thus, although 𝓣2,λ(μ0, μ1) can be used to compare matrix-valued densities, it is not a metric and neither is since the triangular inequality does not hold in general. We will introduce a slightly different formulation of a transportation problem which does give rise to a metric.
A. Optimal Transport on a Subset
In this subsection, we restrict attention to a certain subset of transport plans M(μ0, μ1) and show that the corresponding optimal transportation cost induces a metric. More specifically, let
For m(x, y) ∈ M0(μ0, μ1),
Given μ0 and μ1, the “orientation” of the mass of m(x, y) is fixed. Thus, in this case, the optimal transportation cost is
(12) |
Proposition 3
For 𝓣2,λ as in (12) with λ > 0 and μ0, μ1 ∈ ℱ,
(13) |
defines a metric on ℱ.
Proof
It is straightforward to see that
and that d2,λ (μ0, μ1) = 0 if and only if μ0 = μ1. We now show that the triangle inequality holds as well. For μ0, μ1, μ2 ∈ ℱ, let
denote the optimal transportation plan for the pairs (μ0, μ1) and (μ1, μ2), respectively, where m01 and m12 are two (scalar-valued) joint densities on ℝ2 with marginals tr(μ0), tr(μ1) and tr(μ1), tr(μ2), respectively. Given m01(x, y) and m12(y, z) there is a joint density function m(x, y, z) on ℝ3 with m01 and m12 as the marginals on the corresponding subspaces [3, p. 208]. We set
and note that it has m01 and m12 as matrix-valued marginal distributions. Now, let m02(x, z) = (μ0(x)/tr μ0(x)) ⊗ (μ2(z)/tr μ2(z))m02(x, z) be the marginal of m(x, y, z) when tracing out the y-component. This m02(x, z) is a possible transportation plan between μ0 and μ2. Hence
where the last inequality is due to the metric property of L2.
If λ = 0, then 𝓣̃2,0(μ0, μ1) is exactly the OMT cost between the scalar-valued densities tr(μ0) and tr(μ1) as was explained earlier. In particular, for an optimal transportation plan m(x, y) between tr(μ0) and tr(μ1), the matrix-valued transportation plan m(x, y) = (μ0(x)/tr(μ0(x))) ⊗ (μ1(y)/tr(μ1(y)))m(x, y) is optimal between μ0 and μ1 which satisfies that 𝓣̃2,0(μ0, μ1) = 𝓣2(tr(μ0), tr(μ1)). Thus, 𝓣̃2,0(μ0, μ1) = 0 if and only if tr(μ0) = tr(μ1). Hence, 𝓣̃2,0 fails to be a metric. Moreover, since for any λ ≥ 0 it holds that 𝓣2,λ ≤ 𝓣̃2,λ, if tr(μ0) = tr(μ1) then 𝓣2,0(μ0, μ1) also equals to zero.
Proposition 4
Given μ0, μ1 ∈ ℱ, let m be the optimal transportation plan in (13), then m has support on at most a (2 · λ)-monotonically non-decreasing set.
Proof
We need to prove that if m(x1, y1) ≠ 0 and m(x2, y2) ≠ 0, then x2 > x1, y1 > y2 implies
(14) |
Assume that m evaluated at the four points (xi, yj) with i, j ∈ {1, 2}, is as follows:
with
and m11, m22 > 0. The steps of the proof are similar to those of Proposition 2 detailed in the Appendix: first, we assume that Proposition 4 fails and that
Then we show that a smaller cost can be incurred by rearranging the “mass.” Consider the situation when m22 ≥ m11 first and let m̂ be a new transportation plan with
Then, m̂, m have the same marginals at the four points, the cost incurred by m is
(15) |
and the cost incurred by m̂ is
(16) |
To show that (15) is larger than (16), after canceling common terms, it suffices to show that
However, the above holds true since
The last inequality follows from:
The case m11 > m22 proceeds similarly.
V. Examples
We give two different examples where matrix-valued OMT can be directly applied. Both relate to spectral analysis of multivariable time series.3
A. Spectral Morphing
We first highlight the relevance of matrix-valued OMT to spectral analysis with a numerical example on spectral morphing. The idea is to model slowly time-varying changes in the spectral domain by geodesics in a suitable geometry (see e.g., [7], [8]). The use of geodesic interpolation can be thought of as a regularization technique. Indeed, geodesics smoothly shift spectral power across frequencies lessening the possibility of a fade-in fade-out artifacts and OMT, for scalar power spectra, has been used to this end in [7], [8]. Below we exemplify how geodesics appear in matrix-valued OMT.
Starting with μ0, μ1 ∈ ℱ we approximate the geodesic between the two by constructing N − 1 points intermediate matrix densities. To this end, we set μτ0 = μ0 and μτN = μ1, and determine μτk ∈ ℱ for k = 1, …, N − 1 as the solution to
(17) |
As noted in Section III-D, this can be obtained numerically via convex programming. The present example uses
with
shown in Fig. 1. Since the value of a power spectral density at each point in frequency is a 2 × 2 Hermitian matrix, we have used the (1, 1), (1, 2), and (2, 2) subplots to display the magnitude of the corresponding entries, i.e., |μ(1, 1)|, |μ(1, 2)|, (= |μ(2, 1)|), and |μ(2, 2)|, respectively, and the (2,1) subplot to display the phase ∠μ(1, 2) (= −∠μ(2, 1)).
Fig. 1.
Subplots (1,1), (1,2), and (2,2) show μi(1, 1), |μi(1, 2)| (same as |μi(2, 1)|) and μi(2, 2). Subplot (2,1) shows ∠(μi(2, 1)) for i ∈ {0, 1} in blue and red, respectively.
The 3-D plots in Fig. 2 refer to (17), with λ = 0.1, for an approximation of a geodesic. The two boundary plots represent the power spectra μ0 and μ1 shown in blue and red, respectively, using the same convention about magnitudes and phases explained above. There are in total 7 power spectra μτk, k = 1, …, 7 shown along the geodesic between μ0 and μ1, and the time-indices correspond to τk = k/8. It is interesting to observe the smooth shift of the energy over the geodesic path from the one “channel” to the other while, at the same time, the corresponding peak shifts from one frequency to another. One should bear in mind that the so-constructed geodesic is a non-parametric path interpolating/linking the given spectra.
Fig. 2.
Interpolated results μτk for k = 0, …, 8 computed from (17) with μ0 and μ1 as the two boundary points: subplots (1, 1), (1, 2), and (2, 2) show μτk (1, 1), |μτk (1, 2)| (same as |μτk (2, 1)|) and μτk (2, 2), subplot (2, 1) shows ∠(μτk (2, 1)).
B. Regularization Using Geodesics
Consider two time series
for t = 1, …, 2000, both consisting of sinusoidal signals with time-varying amplitude and frequency (chirp-like) with added white noise w1(t) and w2(t). The amplitude a1(t) decreases from 1.2 to 0.1 while a2(t) increases from 0.1 to 1.2. Frequency θ1(t) decreases from (π/4) to (π/4) − (π/30) while θ2(t) increases from (π/3) to (π/3) + (π/30). Then [w1(t), w2(t)]′ is white, with independent components, and sampled from a zero-mean Gaussian distribution with covariance . The initial phases of the sinusoids are randomly selected in [0, 2π].
Since we are dealing with non-stationary time series, we truncate the observed time series, to segments of length equal to 200, to retain resolution. Thus, we let xi,k(t) := xi(200k+t) with i = 1, 2, k = 0, …, 9 and process separately the segments {xi,k(1), xi,k(2), …, xi,k(200)}. We obtain matrix-valued sample covariances {Rk,−20, …, Rk,0, …, Rk,20} for each. We then determine autoregressive models based on these sample covariances and, thereby, the corresponding power spectral density functions. More specifically, for ℓ = 0, 1, …, 20, we compute
and let . We then solve the Yule-Walker equations
for the autoregressive (matricial) coefficients A1, …, A20. We let be the corresponding innovation variance. The estimated power spectral density function for the kth segment, denoted as μ̂τk, is given as μ̂τk(θ) = Ak(ejθ)−1Ω(Ak(ejθ)*)−1 with
We scale μ̂τk so that the integral of its trace is normalized to one. Thus, the observation record is used to obtain 10 PSD’s denoted as μ̂τk, for k = 1, …, 10. These represent estimates of the spectral power at intermediary points in time. We change the time scale so that τ1 = 0 and τ10 = 1. The spectrogram is shown in Fig. 3(a).
Fig. 3.
(a) Shows the estimated spectrogram of the observed time series and (b) corresponds to the geodesic-fitted spectrogram.
We construct an OMT-geodesic to regularize the estimated PSD’s. This idea was proposed and carried out in [8] for scalar time series and scalar PSD’s. For the present matrix-valued setting, the geodesic is obtained by solving
(18) |
An explicit formula of the OMT geodesic is not available. However, in light of Proposition 2, (18) can be approximated for small λ as follows. Let μ̂τk = tr(μ̂τk) for k = 1, …, 10. These are scalar-valued PSD’s. Let M̂τk denote the corresponding cumulative distribution functions. For λ small, following [8], we compute μ0 := tr(μ0) and μ1 := tr(μ1) via solving:
with M0 and M1 representing the cumulative distribution function of μ0 and μ1, respectively. Then, as was shown in [8], the μτk’s for 1 < k < 10 can be computed via
The matrix-valued PSD’s μ0 and μ1 are obtained by solving
and the μτk’s for 1 < k < 10 are computed via
We display this geodesic-fitted spectrogram in Fig. 3(b). It can be seen that the shift of energy from one channel to another and between resonant frequencies is smoother than that shown in Fig. 3(a) (which is a spectrogram based on matricial auto-regressive models).
In order to compare the resolution between the two techniques (spectrogram based on AR-modeling vs. geodesic regularization), we identify the frequency and directionality of peak power for the two power spectral densities and compare principal direction. This we explain next as the result is quite revealing and suggesting.
For each μ̂τk (θ), we find two frequencies θ1 and θ2 where the power spectral densities (PSD’s) have locally maximal power, i.e., the two frequencies where tr(μ̂τk (θ)) has the largest peaks. Then we compute the (normalized) eigenvectors corresponding to the dominant eigenvalues of μτk (θ1) and μτk (θ2), respectively. These eigenvectors are shown in Fig. 4(a) using black dashed lines. The red and green plots in Fig. 4(a) represent the path of the two eigenvectors as τk increases from 0 to 1. The axes in Fig. 4(a) correspond to the two channels/components of the time series and τk. (Should all the power be present in one of the two channels, the eigenvector would line up, accordingly, to one of the axes.) The values of the eigenvector when projected onto the two channels/axes, reflect the energy of the signals in the corresponding channels. Thus, in antenna-array applications, the direction of eigenvectors corresponds to the direction of a scatterer relative to the array. Statistical errors are reflected in the jagged nature of the paths when these are based on a spectrogram as in Fig. 4(a). However, when comparing with the eigenvectors of the OMT regularized spectrogram/AR-models μτk, the corresponding paths shown in Fig. 4(b) are smooth. Direct comparison between Fig. 4(a) and (b) highlights the potential advantages of using geodesics as a means to regularize power distribution in non-stationary time series.
Fig. 4.
In (a), the trajectories of the dominant eigenvector of μ̂τk (θ1) and μ̂τk (θ2) are shown in red and blue, respectively. The corresponding trajectories of μτk (θ1) and μτk (θ2) are shown in (b).
VI. Conclusion
The geometry of Monge–Kantorovich optimal mass transport provides scalar densities with a natural metric structure (see [11], [12]; also [13] for a systems viewpoint and connections to image analysis and power spectra). Our interest has been in extending such a geometric structure to matrix-valued densities. To this end, we formulated one possible matrix-valued version of the Monge–Kantorovich transportation problem. Computations require convex programming and the framework directly extends the scalar case. An alternative generalization of the Monge–Kantorovich theory to a “non-commutative” setting has been given in the context of the theory of free-probabilities [14]. However, this may not be suitable for matrix-valued power distributions as it is not weak* continuous. Alternative non-commutative generalizations of the Wasserstein metric are given in [15]–[17]. Possible connections between the formulation herein and these alternative viewpoints is the subject of current investigation.
Biographies
Lipeng Ning received the B.S. and M.S. degrees in control science and engineering from Beijing Institute of Technology, Beijing, China, in 2006 and 2008, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Minnesota, Minneapolis, in 2013.
He is currently a Postdoc Research Fellow in Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA. His research interests include system identification, spectral estimation, sparse signal recovery, and diffusion magnetic resonance imaging.
Tryphon T. Georgiou (F’00) received the Diploma in mechanical and electrical engineering from the National Technical University of Athens, Athens, Greece, in 1979 and the Ph.D. degree from the University of Florida, Gainesville, in 1983.
He is a faculty member in the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, and the Vincentine Hermes-Luh Chair. He has served as a Co-Director of the Control Science and Dynamical Systems Center at the University of Minnesota (1990–2014), and on the Board of Governors of the Control Systems Society of the IEEE (2002–2005).
Dr. Georgiou has been a recipient of the George S. Axelby Outstanding Paper Award of the IEEE Control Systems Society for the years 1992, 1999, and 2003. He is a Foreign Member of the Royal Swedish Academy of Engineering Sciences (IVA).
Allen Tannenbaum (F’08) is a faculty member in computer science and applied mathematics at Stony Brook University, Stony Brook, NY, USA. He works in control, computer vision, and medical imaging.
Appendix
Appendix: Proof of Proposition 2
We need to show that if m(x1, y1) ≠ 0 and m(x2, y2) ≠ 0, then x2 > x1, y1 > y2 implies
(19) |
Without loss of generality, let
(20) |
with Aij, Bij ≥ 0, tr(Aij) = tr(Bij) = 1 and i, j ∈ {1, 2}. Note that m12 and m21 could be zero if m does not have support on the particular point. We assume that the condition in the proposition fails and that
(21) |
We then show that by rearranging the mass, the cost can be reduced.
We first consider the situation when m22 ≥ m11. By rearranging the value of m at the four points (xi, yj) with i, j ∈ {1, 2}, we construct a new transportation plan m̃ at these four locations as follows:
(22a) |
(22b) |
(22c) |
(22d) |
where
This new transportation plan m̃ has the same marginals as m at x1, x2 and y1, y2. The original cost incurred by m at these four locations is
(23) |
while the cost incurred by m̃ is
(24) |
After simplification, to show that (23) is larger than (24), it suffices to show that
(25) |
is larger than
(26a) |
(26b) |
(26c) |
From (21), it follows that the value in (25) is greater than 20λm11. We derive upper bounds for each term in (26). First,
where the last inequality follows from the fact that:
for any A, B ≥ 0 with tr(A) = tr(B) = 1. Now consider
where the second equality follows from the definitions of Ã12 and B̃12 while the last inequality is obtained by bounding the terms in the trace. Thus, referring to expressions by the respective equation numbering
In a similar manner, (26c) ≤ 2λm11. Therefore
which implies that the cost incurred by m̃ is smaller than the cost incurred by m.
For the case where m11 > m22, we can prove the claim by constructing a new transportation plan m̂ with values
with
The rest of the proof is carried out in a similar manner.
Footnotes
A sequence of measures dμn converges weak* to dμ if and only if ∫fdμn → ∫fdμ for all continuous and bounded f.
It is interesting to consider functionals of the form tr(c(x, y)m(x, y)), with c(x, y) being matricial, and how to utilize such so as to reflect transportation cost with practical relevance.
Matlab code is available at http://www.ece.umn.edu/users/ningx015/research.html.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Contributor Information
Lipeng Ning, Email: lning@bwh.harvard.edu, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115 USA.
Tryphon T. Georgiou, Email: tryphon@umn.edu, Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA.
Allen Tannenbaum, Email: allen.tannenbaum@stonybrook.edu, Departments of Computer Science and Applied Mathematics, Stony Brook University, Stony Brook, NY 11794 USA.
References
- 1.Monge G. Open Library. De l’Imprimerie Royale; 1781. Mémoire sur la théorie des déblais et des remblais. [Google Scholar]
- 2.Kantorovich L. On the transfer of masses. Dokl Akad Nauk SSSR. 1942;37:227–229. [Google Scholar]
- 3.Villani C. Topics in optimal transportation. Vol. 58. Providence, RI: American Mathematical Society; 2003. [Google Scholar]
- 4.Ambrosio L. Lecture notes on optimal transport problems. Mathematical aspects of evolving interfaces. 2003:1–52. [Google Scholar]
- 5.Rachev S, Rüschendorf L. Mass Transportation Problems: Theory. Vol. 1. New York: Springer-Verlag; 1998. [Google Scholar]
- 6.Ferrante A, Pavon M, Ramponi F. Hellinger versus Kullback–Leibler multivariable spectrum approximation. IEEE Trans Autom Control. 2008 Apr;53(4):954–967. [Google Scholar]
- 7.Jiang X, Ning L, Georgiou TT. Distances and Riemannian metrics for multivariate spectral densities. IEEE Trans Autom Control. 2012 Jul;57(7):1723–1735. [Google Scholar]
- 8.Jiang X, Luo Z, Georgiou T. Geometric methods for spectral analysis. IEEE Trans Signal Process. 2012 Mar;60(3):1064–1074. [Google Scholar]
- 9.Petz D. Quantum Information Theory and Quantum Statistics (Theoretical and Mathematical Physics) Berlin, Germany: Springer; 2008. [Google Scholar]
- 10.Boyd S, Vandenberghe L. Convex optimization. Cambridge, MA: Cambridge Univ. Press; 2004. [Google Scholar]
- 11.Benamou J, Brenier Y. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik. 2000;84(3):375–393. [Google Scholar]
- 12.Jordan R, Kinderlehrer D, Otto F. The variational formulation of the Fokker-Planck equation. SIAM J Mathemat Anal. 1998;29(1):1–17. [Google Scholar]
- 13.Tannenbaum E, Georgiou T, Tannenbaum A. Signals and control aspects of optimal mass transport and the Boltzmann entropy. Proc. 49th IEEE Conf. Decision and Control; 2010. pp. 1885–1890. [Google Scholar]
- 14.Biane P, Voiculescu D. A free probability analogue of the Wasserstein metric on the trace-state space. Geometric and Funct Anal. 2001;11(6):1125–1138. [Google Scholar]
- 15.Rieffel M. Metrics on state spaces. Doc Math J DMV. 1999;4:559–600. [Google Scholar]
- 16.Andrea FD, Martinetti P. A view on optimal transport from noncommutative geometry. SIGMA. 2010;6(057):24. [Google Scholar]
- 17.Martinetti P. Towards a Monge–Kantorovich metric in noncommutative geometry. Zap Nauch Semin POMI. 2013:411. arXiv preprint arXiv:1210.6573. [Google Scholar]