Entropy. 2020 Feb 16;22(2):221. doi: 10.3390/e22020221

On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid

Frank Nielsen 1
PMCID: PMC7516653  PMID: 33285995

Abstract

The Jensen–Shannon divergence is a renowned bounded symmetrization of the Kullback–Leibler divergence which does not require probability densities to have matching supports. In this paper, we introduce a vector-skew generalization of the scalar α-Jensen–Bregman divergences and derive from it the vector-skew α-Jensen–Shannon divergences. We prove that the vector-skew α-Jensen–Shannon divergences are f-divergences and study the properties of these novel divergences. Finally, we report an iterative algorithm to numerically compute the Jensen–Shannon-type centroids for a set of probability densities belonging to a mixture family: This includes the case of the Jensen–Shannon centroid of a set of categorical distributions or normalized histograms.

Keywords: Bregman divergence, f-divergence, Jensen–Bregman divergence, Jensen diversity, Jensen–Shannon divergence, capacitory discrimination, Jensen–Shannon centroid, mixture family, information geometry, difference of convex (DC) programming

1. Introduction

Let $(\mathcal{X},\mathcal{F},\mu)$ be a measure space [1] where $\mathcal{X}$ denotes the sample space, $\mathcal{F}$ the $\sigma$-algebra of measurable events, and $\mu$ a positive measure; for example, the measure space defined by the Lebesgue measure $\mu_L$ with Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$ for $\mathcal{X}=\mathbb{R}^d$, or the measure space defined by the counting measure $\mu_c$ with the power set $\sigma$-algebra $2^{\mathcal{X}}$ on a finite alphabet $\mathcal{X}$. Denote by $L^1(\mathcal{X},\mathcal{F},\mu)$ the Lebesgue space of measurable functions, by $\mathcal{P}_1$ the subspace of positive integrable functions $f$ such that $\int_{\mathcal{X}}f(x)\,d\mu(x)=1$ and $f(x)>0$ for all $x\in\mathcal{X}$, and by $\bar{\mathcal{P}}_1$ the subspace of non-negative integrable functions $f$ such that $\int_{\mathcal{X}}f(x)\,d\mu(x)=1$ and $f(x)\geq 0$ for all $x\in\mathcal{X}$.

We refer to the book of Deza and Deza [2] and the survey of Basseville [3] for an introduction to the many types of statistical divergences met in information sciences and their justifications. The Kullback–Leibler Divergence (KLD) $\mathrm{KL}:\mathcal{P}_1\times\mathcal{P}_1\to[0,\infty]$ is an oriented statistical distance (commonly called the relative entropy in information theory [4]) defined between two densities $p$ and $q$ (i.e., the Radon–Nikodym densities of $\mu$-absolutely continuous probability measures $P$ and $Q$) by

$\mathrm{KL}(p:q):=\int p\log\frac{p}{q}\,d\mu$. (1)

Although $\mathrm{KL}(p:q)\geq 0$ with equality iff. $p=q$ $\mu$-almost everywhere (Gibbs' inequality [4]), the KLD may diverge to infinity depending on the underlying densities. Since the KLD is asymmetric, several symmetrizations [5] have been proposed in the literature.

A well-grounded symmetrization of the KLD is the Jensen–Shannon Divergence [6] (JSD), also called capacitory discrimination in the literature (e.g., see [7]):

$\mathrm{JS}(p,q):=\frac{1}{2}\left(\mathrm{KL}\!\left(p:\frac{p+q}{2}\right)+\mathrm{KL}\!\left(q:\frac{p+q}{2}\right)\right)$, (2)
$=\frac{1}{2}\int\left(p\log\frac{2p}{p+q}+q\log\frac{2q}{p+q}\right)d\mu=\mathrm{JS}(q,p)$. (3)

The Jensen–Shannon divergence can be interpreted as the total KL divergence to the average distribution $\frac{p+q}{2}$. It was historically introduced implicitly in [8] (Equation (19)) to calculate distances between random graphs. A nice feature of the Jensen–Shannon divergence is that it can be applied to densities with arbitrary supports (i.e., $p,q\in\bar{\mathcal{P}}_1$, with the conventions that $0\log 0=0$ and $\log\frac{0}{0}=0$); moreover, the JSD is always upper-bounded by $\log 2$. Let $\mathcal{X}_p=\mathrm{supp}(p)$ and $\mathcal{X}_q=\mathrm{supp}(q)$ denote the supports of the densities $p$ and $q$, respectively, where $\mathrm{supp}(p):=\{x\in\mathcal{X}:p(x)>0\}$. The JSD saturates to $\log 2$ whenever the supports $\mathcal{X}_p$ and $\mathcal{X}_q$ are disjoint. We can rewrite the JSD as

$\mathrm{JS}(p,q)=h\!\left(\frac{p+q}{2}\right)-\frac{h(p)+h(q)}{2}$, (4)

where $h(p)=-\int p\log p\,d\mu$ denotes Shannon's entropy. Thus, the JSD can also be interpreted as the entropy of the average distribution minus the average of the entropies.
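For discrete distributions, both the definition of Equation (2) and the entropy-gap form of Equation (4) are straightforward to evaluate numerically. The following minimal Python sketch (an illustration written for this text, not code from the paper) checks that the two formulas agree and that the JSD stays within $[0,\log 2]$ even when the supports differ:

```python
import numpy as np

def entropy(p):
    # Shannon entropy h(p) = -sum p log p, with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    # Kullback-Leibler divergence; assumes q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def js(p, q):
    # Jensen-Shannon divergence, Eq. (2): average KL to the mid-density.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mid = 0.5 * (p + q)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def js_entropy_gap(p, q):
    # Equivalent entropy-gap form, Eq. (4).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return entropy(0.5 * (p + q)) - 0.5 * (entropy(p) + entropy(q))

p = np.array([0.7, 0.2, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.4, 0.5])
assert abs(js(p, q) - js_entropy_gap(p, q)) < 1e-12
assert 0.0 <= js(p, q) <= np.log(2)
```

On two distributions with disjoint supports, both formulas return exactly $\log 2$, the saturation value mentioned above.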

The square root of the JSD is a metric [9] satisfying the triangle inequality, but the square root of the Jeffreys divergence is not a metric (nor is any positive power of the Jeffreys divergence, see [10]). In fact, the JSD can be interpreted as a Hilbert metric distance, meaning that there exists some isometric embedding of $(\mathcal{X},\sqrt{\mathrm{JS}})$ into a Hilbert space [11,12]. Other principled symmetrizations of the KLD have been proposed in the literature: For example, Naghshvar et al. [13] proposed the extrinsic Jensen–Shannon divergence and demonstrated its use for variable-length coding over a discrete memoryless channel (DMC).

Another symmetrization of the KLD sometimes met in the literature [14,15,16] is the Jeffreys divergence [17,18] (JD) defined by

$J(p,q):=\mathrm{KL}(p:q)+\mathrm{KL}(q:p)=\int(p-q)\log\frac{p}{q}\,d\mu=J(q,p)$. (5)

However, we point out that this Jeffreys divergence lacks sound information-theoretical justifications.

For two positive but not necessarily normalized densities p˜ and q˜, we define the extended Kullback–Leibler divergence as follows:

$\mathrm{KL}^+(\tilde p:\tilde q):=\mathrm{KL}(\tilde p:\tilde q)+\int\tilde q\,d\mu-\int\tilde p\,d\mu$, (6)
$=\int\left(\tilde p\log\frac{\tilde p}{\tilde q}+\tilde q-\tilde p\right)d\mu$. (7)

The Jensen–Shannon divergence and the Jeffreys divergence can both be extended to positive (unnormalized) densities without changing their formula expressions:

$\mathrm{JS}^+(\tilde p,\tilde q):=\frac{1}{2}\left(\mathrm{KL}^+\!\left(\tilde p:\frac{\tilde p+\tilde q}{2}\right)+\mathrm{KL}^+\!\left(\tilde q:\frac{\tilde p+\tilde q}{2}\right)\right)$, (8)
$=\frac{1}{2}\left(\mathrm{KL}\!\left(\tilde p:\frac{\tilde p+\tilde q}{2}\right)+\mathrm{KL}\!\left(\tilde q:\frac{\tilde p+\tilde q}{2}\right)\right)=\mathrm{JS}(\tilde p,\tilde q)$, (9)
$J^+(\tilde p,\tilde q):=\mathrm{KL}^+(\tilde p:\tilde q)+\mathrm{KL}^+(\tilde q:\tilde p)=\int(\tilde p-\tilde q)\log\frac{\tilde p}{\tilde q}\,d\mu=J(\tilde p,\tilde q)$. (10)

However, the extended $\mathrm{JS}^+$ divergence is upper-bounded by $\left(\frac{1}{2}\int(\tilde p+\tilde q)\,d\mu\right)\log 2=\frac{1}{2}(\mu(\tilde p)+\mu(\tilde q))\log 2$ (writing $\mu(\tilde p):=\int\tilde p\,d\mu$) instead of $\log 2$, the bound for normalized densities (i.e., when $\mu(\tilde p)+\mu(\tilde q)=2$).

Let $(pq)_\alpha(x):=(1-\alpha)p(x)+\alpha q(x)$ denote the statistical weighted mixture with component densities $p$ and $q$ for $\alpha\in[0,1]$. The asymmetric $\alpha$-skew Jensen–Shannon divergence can be defined for a scalar parameter $\alpha\in(0,1)$ by considering the weighted mixture $(pq)_\alpha$ as follows:

$\mathrm{JS}_a^\alpha(p:q):=(1-\alpha)\,\mathrm{KL}(p:(pq)_\alpha)+\alpha\,\mathrm{KL}(q:(pq)_\alpha)$, (11)
$=(1-\alpha)\int p\log\frac{p}{(pq)_\alpha}\,d\mu+\alpha\int q\log\frac{q}{(pq)_\alpha}\,d\mu$. (12)

Let us introduce the α-skew K-divergence [6,19] Kα(p:q) by:

$K_\alpha(p:q):=\mathrm{KL}(p:(1-\alpha)p+\alpha q)=\mathrm{KL}(p:(pq)_\alpha)$. (13)

Then, both the Jensen–Shannon divergence and the Jeffreys divergence can be rewritten [20] using Kα as follows:

$\mathrm{JS}(p,q)=\frac{1}{2}\left(K_{\frac{1}{2}}(p:q)+K_{\frac{1}{2}}(q:p)\right)$, (14)
$J(p,q)=K_1(p:q)+K_1(q:p)$, (15)

since $(pq)_1=q$, $\mathrm{KL}(p:q)=K_1(p:q)$ and $(pq)_{\frac{1}{2}}=(qp)_{\frac{1}{2}}$.

We can thus define the symmetric α-skew Jensen–Shannon divergence [20] for α(0,1) as follows:

$\mathrm{JS}_\alpha(p,q):=\frac{1}{2}K_\alpha(p:q)+\frac{1}{2}K_\alpha(q:p)=\mathrm{JS}_\alpha(q,p)$. (16)

The ordinary Jensen–Shannon divergence is recovered for $\alpha=\frac{1}{2}$.
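As a quick numerical illustration (a sketch written for this text, not from the paper), the symmetric $\alpha$-skew divergence of Equation (16) can be implemented directly from the $K_\alpha$ divergence of Equation (13); at $\alpha=\frac{1}{2}$ it coincides with the ordinary JSD:

```python
import numpy as np

def kl(p, q):
    # KL divergence; assumes q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def K(alpha, p, q):
    # alpha-skew K-divergence, Eq. (13): KL(p : (1-alpha) p + alpha q).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return kl(p, (1 - alpha) * p + alpha * q)

def js_alpha(alpha, p, q):
    # Symmetric alpha-skew Jensen-Shannon divergence, Eq. (16).
    return 0.5 * K(alpha, p, q) + 0.5 * K(alpha, q, p)

p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
mid = [0.5 * (a + b) for a, b in zip(p, q)]
# alpha = 1/2 recovers the ordinary JSD of Eq. (2):
ordinary = 0.5 * kl(p, mid) + 0.5 * kl(q, mid)
assert abs(js_alpha(0.5, p, q) - ordinary) < 1e-12
# symmetric for every alpha by construction:
assert abs(js_alpha(0.3, p, q) - js_alpha(0.3, q, p)) < 1e-12
```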

In general, skewing divergences (e.g., using the divergence Kα instead of the KLD) have been experimentally shown to perform better in applications like in some natural language processing (NLP) tasks [21].

The α-Jensen–Shannon divergences are Csiszár f-divergences [22,23,24]. An f-divergence is defined for a convex function $f$, strictly convex at $1$ and satisfying $f(1)=0$, as:

$I_f(p:q)=\int q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)d\mu(x)\geq f(1)=0$. (17)

We can always symmetrize f-divergences by taking the conjugate convex function $f^*(x)=xf\!\left(\frac{1}{x}\right)$ (related to the perspective function): $I_{f+f^*}(p,q)$ is a symmetric divergence. The f-divergences are convex statistical distances which are provably the only separable invariant divergences in information geometry [25], except for binary alphabets $\mathcal{X}$ (see [26]).

The Jeffreys divergence is an f-divergence for the generator $f(x)=(x-1)\log x$, and the $\alpha$-Jensen–Shannon divergences are f-divergences for the generator family $f_\alpha(x)=\frac{1}{2}\left(x\log\frac{x}{(1-\alpha)x+\alpha}-\log((1-\alpha)+\alpha x)\right)$ (recovering the JSD generator of Equation (41) for $\alpha=\frac{1}{2}$). The f-divergences are upper-bounded by $f(0)+f^*(0)$. Thus, the f-divergences are finite when $f(0)+f^*(0)<\infty$.

The main contributions of this paper are summarized as follows:

  • First, we generalize the Jensen–Bregman divergence by skewing a weighted separable Jensen–Bregman divergence with a $k$-dimensional vector $\alpha\in[0,1]^k$ in Section 2. This yields a generalization of the symmetric skew $\alpha$-Jensen–Shannon divergences to a vector-skew parameter. This extension retains the key properties of being upper-bounded and of applying to densities with potentially different supports. The proposed generalization also affords a better understanding of the “mechanism” of the Jensen–Shannon divergence itself. We also show how to directly obtain the weighted vector-skew Jensen–Shannon divergence from the decomposition of the KLD as the cross-entropy minus the entropy (i.e., the KLD as the relative entropy).

  • Second, we prove that weighted vector-skew Jensen–Shannon divergences are f-divergences (Theorem 1), and show how to build families of symmetric Jensen–Shannon-type divergences which can be controlled by a vector of parameters in Section 2.3, generalizing the work of [20] from scalar skewing to vector skewing. This may prove useful in applications by providing additional tuning parameters (which can be set, for example, by using cross-validation techniques).

  • Third, we consider the calculation of the Jensen–Shannon centroids in Section 3 for densities belonging to mixture families. Mixture families include the family of categorical distributions and the family of statistical mixtures sharing the same prescribed components. Mixture families are well-studied manifolds in information geometry [25]. We show how to compute the Jensen–Shannon centroid using a concave–convex numerical iterative optimization procedure [27]. The experimental results graphically compare the Jeffreys centroid with the Jensen–Shannon centroid for grey-valued image histograms.

2. Extending the Jensen–Shannon Divergence

2.1. Vector-Skew Jensen–Bregman Divergences and Jensen Diversities

Recall our notational shortcut: $(ab)_\alpha:=(1-\alpha)a+\alpha b$. For a $k$-dimensional vector $\alpha\in[0,1]^k$, a weight vector $w$ belonging to the $(k-1)$-dimensional open simplex $\Delta_k$, and a scalar $\gamma\in(0,1)$, let us define the following vector-skew $\alpha$-Jensen–Bregman divergence ($\alpha$-JBD) following [28]:

$\mathrm{JB}_F^{\alpha,\gamma,w}(\theta_1:\theta_2):=\sum_{i=1}^k w_i\,B_F\!\left((\theta_1\theta_2)_{\alpha_i}:(\theta_1\theta_2)_\gamma\right)\geq 0$, (18)

where BF is the Bregman divergence [29] induced by a strictly convex and smooth generator F:

$B_F(\theta_1:\theta_2):=F(\theta_1)-F(\theta_2)-\langle\theta_1-\theta_2,\nabla F(\theta_2)\rangle$, (19)

with $\langle\cdot,\cdot\rangle$ denoting the Euclidean inner product $\langle x,y\rangle=x^\top y$ (dot product). Expanding the Bregman divergence formulas in the expression of the $\alpha$-JBD and using the fact that

$(\theta_1\theta_2)_{\alpha_i}-(\theta_1\theta_2)_\gamma=(\gamma-\alpha_i)(\theta_1-\theta_2)$, (20)

we get the following expression:

$\mathrm{JB}_F^{\alpha,\gamma,w}(\theta_1:\theta_2)=\sum_{i=1}^k w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_\gamma\right)-\left\langle\sum_{i=1}^k w_i(\gamma-\alpha_i)(\theta_1-\theta_2),\nabla F\!\left((\theta_1\theta_2)_\gamma\right)\right\rangle$. (21)

The inner product term of Equation (21) vanishes when

$\gamma=\sum_{i=1}^k w_i\alpha_i=:\bar\alpha$. (22)

Thus, when $\gamma=\bar\alpha$ (assuming at least two distinct components in $\alpha$ so that $\bar\alpha\in(0,1)$), we get the simplified formula for the vector-skew $\alpha$-JBD:

$\mathrm{JB}_F^{\alpha,w}(\theta_1:\theta_2)=\sum_{i=1}^k w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_{\bar\alpha}\right)$. (23)

This vector-skew Jensen–Bregman divergence is always finite and amounts to a Jensen diversity [30] $\mathcal{J}_F$ induced by Jensen's inequality gap:

$\mathrm{JB}_F^{\alpha,w}(\theta_1:\theta_2)=\mathcal{J}_F\!\left((\theta_1\theta_2)_{\alpha_1},\ldots,(\theta_1\theta_2)_{\alpha_k};w_1,\ldots,w_k\right):=\sum_{i=1}^k w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_{\bar\alpha}\right)\geq 0$. (24)

The Jensen diversity is a quantity which arises as a generalization of the cluster variance when clustering with Bregman divergences instead of the ordinary squared Euclidean distance; see [29,30] for details. In the context of Bregman clustering, the Jensen diversity has been called the Bregman information [29] and motivated by rate distortion theory: Bregman information measures the minimum expected loss when encoding a set of points using a single point when the loss is measured using a Bregman divergence. In general, a k-point measure is called a diversity measure (for k>2), while a distance/divergence is the special case of a 2-point measure.

Conversely, in 1D, we may start from Jensen’s inequality for a strictly convex function F:

$\sum_{i=1}^k w_i F(\theta_i)\geq F\!\left(\sum_{i=1}^k w_i\theta_i\right)$. (25)

Let us write $[k]:=\{1,\ldots,k\}$, and define $\theta_m:=\min_{i\in[k]}\theta_i$ and $\theta_M:=\max_{i\in[k]}\theta_i>\theta_m$ (i.e., assuming at least two distinct values). The barycenter $\bar\theta=\sum_i w_i\theta_i=:(\theta_m\theta_M)_\gamma$ can be interpreted as the linear interpolation of the extremal values for some $\gamma\in(0,1)$. Let us write $\theta_i=(\theta_m\theta_M)_{\alpha_i}$ for $i\in[k]$ and proper values of the $\alpha_i$'s. Then, it follows that

$\bar\theta=\sum_i w_i\theta_i$, (26)
$=\sum_i w_i(\theta_m\theta_M)_{\alpha_i}$, (27)
$=\sum_i w_i\left((1-\alpha_i)\theta_m+\alpha_i\theta_M\right)$, (28)
$=\left(1-\sum_i w_i\alpha_i\right)\theta_m+\left(\sum_i w_i\alpha_i\right)\theta_M$, (29)
$=(\theta_m\theta_M)_{\sum_i w_i\alpha_i}=(\theta_m\theta_M)_\gamma$, (30)

so that $\gamma=\sum_i w_i\alpha_i=\bar\alpha$.

2.2. Vector-Skew Jensen–Shannon Divergences

Let $f(x)=x\log x-x$ be a strictly convex and smooth function on $(0,\infty)$. Then, the Bregman divergence induced by this univariate generator is

$B_f(p:q)=p\log\frac{p}{q}+q-p=\mathrm{kl}^+(p:q)$, (31)

the extended scalar Kullback–Leibler divergence.

We extend the scalar-skew Jensen–Shannon divergence as follows: $\mathrm{JS}^{\alpha,w}(p:q):=\mathrm{JB}_{-h}^{\alpha,\bar\alpha,w}(p:q)$, where $h$ denotes Shannon's entropy [4], so that the generator $F=-h$ (the negentropy) is strictly convex.

Definition 1

(Weighted vector-skew $(\alpha,w)$-Jensen–Shannon divergence). For a vector $\alpha\in[0,1]^k$ and a unit positive weight vector $w\in\Delta_k$, the $(\alpha,w)$-Jensen–Shannon divergence between two densities $p,q\in\bar{\mathcal{P}}_1$ is defined by:

$\mathrm{JS}^{\alpha,w}(p:q):=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)=h\!\left((pq)_{\bar\alpha}\right)-\sum_{i=1}^k w_i\,h\!\left((pq)_{\alpha_i}\right)$,

with $\bar\alpha=\sum_{i=1}^k w_i\alpha_i$, where $h(p)=-\int p(x)\log p(x)\,d\mu(x)$ denotes the Shannon entropy [4] (i.e., $-h$ is strictly convex).

This definition generalizes the ordinary JSD; we recover the ordinary Jensen–Shannon divergence when $k=2$, $\alpha_1=0$, $\alpha_2=1$, and $w_1=w_2=\frac{1}{2}$ with $\bar\alpha=\frac{1}{2}$: $\mathrm{JS}(p,q)=\mathrm{JS}^{(0,1),(\frac{1}{2},\frac{1}{2})}(p:q)$.
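Definition 1 is easy to check numerically. The sketch below (illustrative Python written for this text) evaluates both the KL-sum form and the entropy-gap form of the $(\alpha,w)$-JSD, and verifies that $\alpha=(0,1)$, $w=(\frac{1}{2},\frac{1}{2})$ recovers the ordinary JSD:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def mix(p, q, a):
    # (pq)_a = (1 - a) p + a q
    return (1 - a) * np.asarray(p, float) + a * np.asarray(q, float)

def js_vec(alphas, ws, p, q):
    # Definition 1, KL-sum form: sum_i w_i KL((pq)_{a_i} : (pq)_{abar}).
    abar = float(np.dot(ws, alphas))
    return sum(w * kl(mix(p, q, a), mix(p, q, abar)) for a, w in zip(alphas, ws))

def js_vec_gap(alphas, ws, p, q):
    # Equivalent entropy-gap form: h((pq)_{abar}) - sum_i w_i h((pq)_{a_i}).
    abar = float(np.dot(ws, alphas))
    return entropy(mix(p, q, abar)) - sum(w * entropy(mix(p, q, a)) for a, w in zip(alphas, ws))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.1, 0.3, 0.6])
alphas, ws = [0.0, 1.0, 1.0 / 3], [1.0 / 3] * 3  # the example of Section 2.3
assert abs(js_vec(alphas, ws, p, q) - js_vec_gap(alphas, ws, p, q)) < 1e-12
# alpha = (0, 1), w = (1/2, 1/2) recovers the ordinary JSD:
mid = 0.5 * (p + q)
assert abs(js_vec([0.0, 1.0], [0.5, 0.5], p, q) - (0.5 * kl(p, mid) + 0.5 * kl(q, mid))) < 1e-12
```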

Let $\mathrm{KL}_{\alpha,\beta}(p:q):=\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)$. Then, we have $\mathrm{KL}_{\alpha,\beta}(q:p)=\mathrm{KL}_{1-\alpha,1-\beta}(p:q)$. Using this $(\alpha,\beta)$-KLD, we have the following identity:

$\mathrm{JS}^{\alpha,w}(p:q)=\sum_{i=1}^k w_i\,\mathrm{KL}_{\alpha_i,\bar\alpha}(p:q)$, (32)
$=\sum_{i=1}^k w_i\,\mathrm{KL}_{1-\alpha_i,1-\bar\alpha}(q:p)=\mathrm{JS}^{1_k-\alpha,w}(q:p)$, (33)

since $\sum_{i=1}^k w_i(1-\alpha_i)=\overline{1_k-\alpha}=1-\bar\alpha$, where $1_k=(1,\ldots,1)$ is a $k$-dimensional vector of ones.

A very interesting property is that the vector-skew Jensen–Shannon divergences are f-divergences [22].

Theorem 1.

The vector-skew Jensen–Shannon divergences $\mathrm{JS}^{\alpha,w}(p:q)$ are f-divergences for the generator $f^{\alpha,w}(u)=\sum_{i=1}^k w_i\left(\alpha_i u+(1-\alpha_i)\right)\log\frac{(1-\alpha_i)+\alpha_i u}{(1-\bar\alpha)+\bar\alpha u}$ with $\bar\alpha=\sum_{i=1}^k w_i\alpha_i$.

Proof. 

First, let us observe that a positively weighted sum of f-divergences is an f-divergence: $\sum_{i=1}^k w_i I_{f_i}(p:q)=I_f(p:q)$ for the generator $f(u)=\sum_{i=1}^k w_i f_i(u)$.

Now, let us express the divergence KLα,β(p:q) as an f-divergence:

$\mathrm{KL}_{\alpha,\beta}(p:q)=I_{f_{\alpha,\beta}}(p:q)$, (34)

with generator

$f_{\alpha,\beta}(u)=(\alpha u+1-\alpha)\log\frac{(1-\alpha)+\alpha u}{(1-\beta)+\beta u}$. (35)

Thus, it follows that

$\mathrm{JS}^{\alpha,w}(p:q)=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)$, (36)
$=\sum_{i=1}^k w_i\,I_{f_{\alpha_i,\bar\alpha}}(p:q)$, (37)
$=I_{\sum_{i=1}^k w_i f_{\alpha_i,\bar\alpha}}(p:q)$. (38)

Therefore, the vector-skew Jensen–Shannon divergence is an f-divergence for the following generator:

$f^{\alpha,w}(u)=\sum_{i=1}^k w_i\left(\alpha_i u+(1-\alpha_i)\right)\log\frac{(1-\alpha_i)+\alpha_i u}{(1-\bar\alpha)+\bar\alpha u}$, (39)

where $\bar\alpha=\sum_{i=1}^k w_i\alpha_i$.

When α=(0,1) and w=(12,12), we recover the f-divergence generator for the JSD:

$f_{\mathrm{JS}}(u)=\frac{1}{2}\log\frac{1}{\frac{1}{2}+\frac{1}{2}u}+\frac{1}{2}u\log\frac{u}{\frac{1}{2}+\frac{1}{2}u}$, (40)
$=\frac{1}{2}\left(\log\frac{2}{1+u}+u\log\frac{2u}{1+u}\right)$. (41)

Observe that $(f^{\alpha,w})^*(u)=uf^{\alpha,w}(1/u)=f^{1-\alpha,w}(u)$, where $1-\alpha:=(1-\alpha_1,\ldots,1-\alpha_k)$.

We also refer the reader to Theorem 4.1 of [31], which defines skew f-divergences from any f-divergence.  □
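For the special case $\alpha=(0,1)$ and $w=(\frac{1}{2},\frac{1}{2})$, the generator of Equation (41) can be checked numerically against the entropy-based JSD formula. The following illustrative sketch (restricted to strictly positive histograms so that the ratio $p/q$ is well defined) is our own, not the paper's code:

```python
import numpy as np

def f_js(u):
    # f-divergence generator of the JSD, Eq. (41).
    return 0.5 * (np.log(2 / (1 + u)) + u * np.log(2 * u / (1 + u)))

def I_f(f, p, q):
    # Discrete Csiszar f-divergence I_f(p:q) = sum_x q(x) f(p(x)/q(x)),
    # assuming strictly positive bins.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

def js(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
assert abs(I_f(f_js, p, q) - js(p, q)) < 1e-12  # Theorem 1, JSD case
assert abs(f_js(1.0)) < 1e-15                   # f(1) = 0, as required
```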

Remark 1.

Since the vector-skew Jensen divergence is an f-divergence, we easily obtain Fano and Pinsker inequalities following [32], or reverse Pinsker inequalities following [33,34] (i.e., upper bounds for the vector-skew Jensen divergences using the total variation metric distance), data processing inequalities using [35], etc.

Next, we show that $\mathrm{KL}_{\alpha,\beta}$ (and $\mathrm{JS}^{\alpha,w}$) are separable convex divergences. Since the f-divergences are separable convex, the $\mathrm{KL}_{\alpha,\beta}$ divergences and the $\mathrm{JS}^{\alpha,w}$ divergences are separable convex. For the sake of completeness, we report a simple explicit proof below.

Theorem 2

(Separable convexity). The divergence $\mathrm{KL}_{\alpha,\beta}(p:q)$ is strictly separable convex for $\alpha\neq\beta$ and $x\in\mathcal{X}_p\cap\mathcal{X}_q$.

Proof. 

Let us calculate the second partial derivative of KLα,β(x:y) with respect to x, and show that it is strictly positive:

$\frac{\partial^2}{\partial x^2}\mathrm{KL}_{\alpha,\beta}(x:y)=\frac{(\beta-\alpha)^2y^2}{(xy)_\alpha\,(xy)_\beta^2}>0$, (42)

for $x,y>0$. Thus, $\mathrm{KL}_{\alpha,\beta}$ is strictly convex in its left argument. Similarly, since $\mathrm{KL}_{\alpha,\beta}(y:x)=\mathrm{KL}_{1-\alpha,1-\beta}(x:y)$, we deduce that $\mathrm{KL}_{\alpha,\beta}$ is strictly convex in its right argument. Therefore, the divergence $\mathrm{KL}_{\alpha,\beta}$ is separable convex.  □

It follows that the divergence $\mathrm{JS}^{\alpha,w}(p:q)$ is strictly separable convex, since it is a convex combination of weighted $\mathrm{KL}_{\alpha_i,\bar\alpha}$ divergences.

Another way to derive the vector-skew JSD is to decompose the KLD as the difference of the cross-entropy h× minus the entropy h (i.e., KLD is also called the relative entropy):

$\mathrm{KL}(p:q)=h^\times(p:q)-h(p)$, (43)

where $h^\times(p:q):=-\int p\log q\,d\mu$ denotes the cross-entropy and $h(p):=h^\times(p:p)$ the (self cross-)entropy. Since $\alpha_1h^\times(p_1:q)+\alpha_2h^\times(p_2:q)=h^\times(\alpha_1p_1+\alpha_2p_2:q)$ (for $\alpha_2=1-\alpha_1$), it follows that

$\mathrm{JS}^{\alpha,w}(p:q):=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_\gamma\right)$, (44)
$=\sum_{i=1}^k w_i\left(h^\times\!\left((pq)_{\alpha_i}:(pq)_\gamma\right)-h\!\left((pq)_{\alpha_i}\right)\right)$, (45)
$=h^\times\!\left(\sum_{i=1}^k w_i(pq)_{\alpha_i}:(pq)_\gamma\right)-\sum_{i=1}^k w_i\,h\!\left((pq)_{\alpha_i}\right)$. (46)

Here, the “trick” is to choose $\gamma=\bar\alpha$ in order to “convert” the cross-entropy into an entropy: since $\sum_{i=1}^k w_i(pq)_{\alpha_i}=(pq)_{\bar\alpha}$, we have $h^\times\!\left(\sum_{i=1}^k w_i(pq)_{\alpha_i}:(pq)_\gamma\right)=h\!\left((pq)_{\bar\alpha}\right)$ when $\gamma=\bar\alpha$. Then, we end up with

$\mathrm{JS}^{\alpha,w}(p:q)=h\!\left((pq)_{\bar\alpha}\right)-\sum_{i=1}^k w_i\,h\!\left((pq)_{\alpha_i}\right)$. (47)

When $\alpha=(\alpha_1,\alpha_2)$ with $\alpha_1=0$ and $\alpha_2=1$ and $w=(w_1,w_2)=\left(\frac{1}{2},\frac{1}{2}\right)$, we have $\bar\alpha=\frac{1}{2}$, and we recover the Jensen–Shannon divergence:

$\mathrm{JS}(p:q)=h\!\left(\frac{p+q}{2}\right)-\frac{h(p)+h(q)}{2}$. (48)

Notice that Equation (2) is the usual definition of the Jensen–Shannon divergence, while Equation (48) is the reduced formula of the JSD, which can be interpreted as a Jensen gap for Shannon entropy; hence its name: The Jensen–Shannon divergence.

Moreover, if we consider the cross-entropy/entropy extended to positive densities p˜ and q˜:

$h_+^\times(\tilde p:\tilde q)=\int(-\tilde p\log\tilde q+\tilde q)\,d\mu,\qquad h_+(\tilde p)=h_+^\times(\tilde p:\tilde p)=\int(-\tilde p\log\tilde p+\tilde p)\,d\mu$, (49)

we get:

$\mathrm{JS}_+^{\alpha,w}(\tilde p:\tilde q)=\sum_{i=1}^k w_i\,\mathrm{KL}^+\!\left((\tilde p\tilde q)_{\alpha_i}:(\tilde p\tilde q)_{\bar\alpha}\right)=h_+\!\left((\tilde p\tilde q)_{\bar\alpha}\right)-\sum_{i=1}^k w_i\,h_+\!\left((\tilde p\tilde q)_{\alpha_i}\right)$. (50)

Next, we shall prove that our generalization of the skew Jensen–Shannon divergence to vector-skewing is always bounded. We first start by a lemma bounding the KLD between two mixtures sharing the same components:

Lemma 1

(KLD between two $w$-mixtures). For $\alpha\in[0,1]$ and $\beta\in(0,1)$, we have:

$\mathrm{KL}_{\alpha,\beta}(p:q)=\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$.

Proof. 

For p(x),q(x)>0, we have

$\frac{(1-\alpha)p(x)+\alpha q(x)}{(1-\beta)p(x)+\beta q(x)}\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$. (51)

Indeed, by considering the two cases $\alpha\leq\beta$ (or equivalently, $1-\alpha\geq 1-\beta$) and $\alpha\geq\beta$ (or equivalently, $1-\alpha\leq 1-\beta$), we check that $(1-\alpha)p(x)\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}(1-\beta)p(x)$ and $\alpha q(x)\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}\beta q(x)$. Thus, we have $(1-\alpha)p(x)+\alpha q(x)\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}\left((1-\beta)p(x)+\beta q(x)\right)$. Therefore, it follows that:

$\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\int(pq)_\alpha\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}d\mu=\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$. (52)

Notice that we can interpret $\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}=\max\left\{\log\frac{1-\alpha}{1-\beta},\log\frac{\alpha}{\beta}\right\}$ as the $\infty$-Rényi divergence [36,37] between the two two-point distributions $(\alpha,1-\alpha)$ and $(\beta,1-\beta)$. See Theorem 6 of [36].

A weaker upper bound is $\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\log\frac{1}{\beta(1-\beta)}$. Indeed, let us form a partition of the sample space $\mathcal{X}$ into two dominance regions:

  • $R_p:=\{x\in\mathcal{X}:q(x)\leq p(x)\}$ and

  • $R_q:=\{x\in\mathcal{X}:q(x)>p(x)\}$.

We have $(pq)_\alpha(x)=(1-\alpha)p(x)+\alpha q(x)\leq p(x)$ for $x\in R_p$ and $(pq)_\alpha(x)\leq q(x)$ for $x\in R_q$. It follows that

$\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\int_{R_p}(pq)_\alpha(x)\log\frac{p(x)}{(1-\beta)p(x)}\,d\mu(x)+\int_{R_q}(pq)_\alpha(x)\log\frac{q(x)}{\beta q(x)}\,d\mu(x)$.

That is, $\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq-\log(1-\beta)-\log\beta=\log\frac{1}{\beta(1-\beta)}$. Notice that we allow $\alpha\in\{0,1\}$ but do not allow $\beta$ to take the extreme values (i.e., $\beta\in(0,1)$).  □

In fact, it is known that for both $\alpha,\beta\in(0,1)$, computing $\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)$ amounts to computing a Bregman divergence for the Shannon negentropy generator, since $\{(pq)_\gamma:\gamma\in(0,1)\}$ defines a mixture family [38] of order 1 in information geometry. Hence, it is always finite, as Bregman divergences are always finite (but not necessarily bounded).

By using the fact that

$\mathrm{JS}^{\alpha,w}(p:q)=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)$, (53)

we conclude that the vector-skew Jensen–Shannon divergence is upper-bounded:

Lemma 2

(Bounded $(w,\alpha)$-Jensen–Shannon divergence). $\mathrm{JS}^{\alpha,w}$ is bounded by $\log\frac{1}{\bar\alpha(1-\bar\alpha)}$, where $\bar\alpha=\sum_{i=1}^k w_i\alpha_i\in(0,1)$.

Proof. 

We have $\mathrm{JS}^{\alpha,w}(p:q)=\sum_i w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)$. Since $0\leq\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)\leq\log\frac{1}{\bar\alpha(1-\bar\alpha)}$, it follows that we have

$0\leq\mathrm{JS}^{\alpha,w}(p:q)\leq\log\frac{1}{\bar\alpha(1-\bar\alpha)}$.

Notice that we also have

$\mathrm{JS}^{\alpha,w}(p:q)\leq\sum_i w_i\log\max\left\{\frac{1-\alpha_i}{1-\bar\alpha},\frac{\alpha_i}{\bar\alpha}\right\}$.

 □
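The bounds of Lemmas 1 and 2 can be stress-tested numerically on random categorical distributions. The following sketch (our own illustration, using NumPy's Dirichlet sampler and arbitrary tolerances) draws random pairs and checks every inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def mix(p, q, a):
    # (pq)_a = (1 - a) p + a q
    return (1 - a) * p + a * q

for _ in range(200):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    a = rng.uniform(0.0, 1.0)
    b = rng.uniform(0.05, 0.95)  # beta must stay away from {0, 1}
    lhs = kl(mix(p, q, a), mix(p, q, b))
    # Lemma 1: KL_{a,b}(p:q) <= log max{(1-a)/(1-b), a/b}
    assert lhs <= np.log(max((1 - a) / (1 - b), a / b)) + 1e-10
    # weaker bound: KL_{a,b}(p:q) <= log 1/(b(1-b))
    assert lhs <= np.log(1.0 / (b * (1 - b))) + 1e-10
    # Lemma 2 for a random vector-skew JSD
    alphas = rng.uniform(0.0, 1.0, size=3)
    ws = rng.dirichlet(np.ones(3))
    abar = float(np.dot(ws, alphas))
    jsv = sum(w * kl(mix(p, q, ai), mix(p, q, abar)) for ai, w in zip(alphas, ws))
    assert -1e-12 <= jsv <= np.log(1.0 / (abar * (1 - abar))) + 1e-10
```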

The vector-skew Jensen–Shannon divergence is symmetric if and only if for each index $i\in[k]$ there exists a matching index $\sigma(i)$ such that $\alpha_{\sigma(i)}=1-\alpha_i$ and $w_{\sigma(i)}=w_i$.

For example, we may define the symmetric scalar α-skew Jensen–Shannon divergence as

$\mathrm{JS}_s^\alpha(p,q)=\frac{1}{2}\mathrm{KL}\!\left((pq)_\alpha:(pq)_{\frac{1}{2}}\right)+\frac{1}{2}\mathrm{KL}\!\left((pq)_{1-\alpha}:(pq)_{\frac{1}{2}}\right)$, (54)
$=\frac{1}{2}\int(pq)_\alpha\log\frac{(pq)_\alpha}{(pq)_{\frac{1}{2}}}\,d\mu+\frac{1}{2}\int(pq)_{1-\alpha}\log\frac{(pq)_{1-\alpha}}{(pq)_{\frac{1}{2}}}\,d\mu$, (55)
$=\frac{1}{2}\int(qp)_{1-\alpha}\log\frac{(qp)_{1-\alpha}}{(qp)_{\frac{1}{2}}}\,d\mu+\frac{1}{2}\int(qp)_\alpha\log\frac{(qp)_\alpha}{(qp)_{\frac{1}{2}}}\,d\mu$, (56)
$=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((pq)_{1-\alpha}\right)}{2}$, (57)
$=:\mathrm{JS}_s^\alpha(q,p)$, (58)

since it holds that $(ab)_c=(ba)_{1-c}$ for any $a,b$ and $c\in[0,1]$. Note that $\mathrm{JS}_s^\alpha(p,q)\neq\mathrm{JS}_\alpha(p,q)$.

Remark 2.

We can always symmetrize a vector-skew Jensen–Shannon divergence by doubling the dimension of the skewing vector. Let $\alpha=(\alpha_1,\ldots,\alpha_k)$ and $w$ be the vector parameters of an asymmetric vector-skew JSD, and consider $\alpha'=(1-\alpha_1,\ldots,1-\alpha_k)$ and $w$ to be the parameters of $\mathrm{JS}^{\alpha',w}$. Then, $\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}$ is a symmetric vector-skew JSD:

$\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}(p:q):=\frac{1}{2}\mathrm{JS}^{\alpha,w}(p:q)+\frac{1}{2}\mathrm{JS}^{\alpha',w}(p:q)$, (59)
$=\frac{1}{2}\mathrm{JS}^{\alpha,w}(p:q)+\frac{1}{2}\mathrm{JS}^{\alpha,w}(q:p)=\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}(q:p)$. (60)

Since the vector-skew Jensen–Shannon divergence is an f-divergence for the generator $f^{\alpha,w}$ (Theorem 1), we can take the generator $f_s^{\alpha,w}(u)=\frac{f^{\alpha,w}(u)+(f^{\alpha,w})^*(u)}{2}$ to define the symmetrized f-divergence, where $(f^{\alpha,w})^*(u)=uf^{\alpha,w}\!\left(\frac{1}{u}\right)$ denotes the convex conjugate function. When $f^{\alpha,w}$ yields a symmetric f-divergence $I_{f^{\alpha,w}}$, we can apply the generic upper bound of f-divergences (i.e., $I_f\leq f(0)+f^*(0)$) to get the upper bound on the symmetric vector-skew Jensen–Shannon divergences:

$I_{f^{\alpha,w}}(p:q)\leq f^{\alpha,w}(0)+(f^{\alpha,w})^*(0)$, (61)
$=\sum_{i=1}^k w_i\left((1-\alpha_i)\log\frac{1-\alpha_i}{1-\bar\alpha}+\alpha_i\log\frac{\alpha_i}{\bar\alpha}\right)$, (62)

since

$(f^{\alpha,w})^*(u)=uf^{\alpha,w}\!\left(\frac{1}{u}\right)$, (63)
$=\sum_{i=1}^k w_i\left((1-\alpha_i)u+\alpha_i\right)\log\frac{(1-\alpha_i)u+\alpha_i}{(1-\bar\alpha)u+\bar\alpha}$. (64)

For example, consider the ordinary Jensen–Shannon divergence with $w=\left(\frac{1}{2},\frac{1}{2}\right)$ and $\alpha=(0,1)$. Then, we find $\mathrm{JS}(p,q)=I_{f^{(0,1),(\frac{1}{2},\frac{1}{2})}}(p:q)\leq\frac{1}{2}\log 2+\frac{1}{2}\log 2=\log 2$, the usual upper bound of the JSD.

As a side note, let us notice that our notation (pq)α allows one to compactly write the following property:

Property 1.

We have $q=(qq)_\lambda$ for any $\lambda\in[0,1]$, and $\left((p_1p_2)_\lambda(q_1q_2)_\lambda\right)_\alpha=\left((p_1q_1)_\alpha(p_2q_2)_\alpha\right)_\lambda$ for any $\alpha,\lambda\in[0,1]$.

Proof. 

Clearly, $q=(1-\lambda)q+\lambda q=:(qq)_\lambda$ for any $\lambda\in[0,1]$. Now, we have

$\left((p_1p_2)_\lambda(q_1q_2)_\lambda\right)_\alpha=(1-\alpha)(p_1p_2)_\lambda+\alpha(q_1q_2)_\lambda$, (65)
$=(1-\alpha)\left((1-\lambda)p_1+\lambda p_2\right)+\alpha\left((1-\lambda)q_1+\lambda q_2\right)$, (66)
$=(1-\lambda)\left((1-\alpha)p_1+\alpha q_1\right)+\lambda\left((1-\alpha)p_2+\alpha q_2\right)$, (67)
$=(1-\lambda)(p_1q_1)_\alpha+\lambda(p_2q_2)_\alpha$, (68)
$=\left((p_1q_1)_\alpha(p_2q_2)_\alpha\right)_\lambda$. (69)

 □

2.3. Building Symmetric Families of Vector-Skewed Jensen–Shannon Divergences

We can build infinitely many vector-skew Jensen–Shannon divergences. For example, consider $\alpha=\left(0,1,\frac{1}{3}\right)$ and $w=\left(\frac{1}{3},\frac{1}{3},\frac{1}{3}\right)$. Then, $\bar\alpha=\frac{1}{3}+\frac{1}{9}=\frac{4}{9}$, and

$\mathrm{JS}^{\alpha,w}(p:q)=h\!\left((pq)_{\frac{4}{9}}\right)-\frac{h(p)+h(q)+h\!\left((pq)_{\frac{1}{3}}\right)}{3}\neq\mathrm{JS}^{\alpha,w}(q:p)$. (70)

Interestingly, we can also build infinitely many families of symmetric vector-skew Jensen–Shannon divergences. For example, consider these two examples that illustrate the construction process:

  • Consider $k=2$. Let $(w,1-w)$ denote the weight vector, and $\alpha=(\alpha_1,\alpha_2)$ the skewing vector. We have $\bar\alpha=w\alpha_1+(1-w)\alpha_2=\alpha_2+w(\alpha_1-\alpha_2)$. The vector-skew JSD is symmetric iff. $w=1-w=\frac{1}{2}$ (with $\bar\alpha=\frac{\alpha_1+\alpha_2}{2}$) and $\alpha_2=1-\alpha_1$. In that case, we have $\bar\alpha=\frac{1}{2}$, and we obtain the following family of symmetric Jensen–Shannon divergences:
    $\mathrm{JS}^{(\alpha,1-\alpha),(\frac{1}{2},\frac{1}{2})}(p,q)=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((pq)_{1-\alpha}\right)}{2}$, (71)
    $=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((qp)_\alpha\right)}{2}=\mathrm{JS}^{(\alpha,1-\alpha),(\frac{1}{2},\frac{1}{2})}(q,p)$. (72)
  • Consider $k=4$, weight vector $w=\left(\frac{1}{3},\frac{1}{3},\frac{1}{6},\frac{1}{6}\right)$, and skewing vector $\alpha=(\alpha_1,1-\alpha_1,\alpha_2,1-\alpha_2)$ for $\alpha_1,\alpha_2\in(0,1)$. Then, $\bar\alpha=\frac{1}{2}$, and we get the following family of symmetric vector-skew JSDs:
    $\mathrm{JS}^{(\alpha_1,\alpha_2)}(p,q)=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{2h\!\left((pq)_{\alpha_1}\right)+2h\!\left((pq)_{1-\alpha_1}\right)+h\!\left((pq)_{\alpha_2}\right)+h\!\left((pq)_{1-\alpha_2}\right)}{6}$, (73)
    $=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{2h\!\left((pq)_{\alpha_1}\right)+2h\!\left((qp)_{\alpha_1}\right)+h\!\left((pq)_{\alpha_2}\right)+h\!\left((qp)_{\alpha_2}\right)}{6}$, (74)
    $=\mathrm{JS}^{(\alpha_1,\alpha_2)}(q,p)$. (75)
  • We can similarly carry on the construction of such symmetric JSDs by increasing the dimensionality of the skewing vector.

In fact, we can define

$\mathrm{JS}_s^{\alpha,w}(p,q):=h\!\left((pq)_{\frac{1}{2}}\right)-\sum_{i=1}^k w_i\,\frac{h\!\left((pq)_{\alpha_i}\right)+h\!\left((pq)_{1-\alpha_i}\right)}{2}=\sum_{i=1}^k w_i\,\mathrm{JS}_s^{\alpha_i}(p,q)$, (76)

with

$\mathrm{JS}_s^\alpha(p,q):=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((pq)_{1-\alpha}\right)}{2}$. (77)

3. Jensen–Shannon Centroids on Mixture Families

3.1. Mixture Families and Jensen–Shannon Divergences

Consider a mixture family in information geometry [25]. That is, let us give a prescribed set of $D+1$ linearly independent probability densities $p_0(x),\ldots,p_D(x)$ defined on the sample space $\mathcal{X}$. A mixture family $\mathcal{M}$ of order $D$ consists of all strictly convex combinations of these component densities:

$\mathcal{M}:=\left\{m(x;\theta):=\sum_{i=1}^D\theta_ip_i(x)+\left(1-\sum_{i=1}^D\theta_i\right)p_0(x)\;:\;\theta_i>0,\;\sum_{i=1}^D\theta_i<1\right\}$. (78)

For example, the family of categorical distributions (sometimes called “multinoulli” distributions) is a mixture family [25]:

$\mathcal{M}=\left\{m_\theta(x)=\sum_{i=1}^D\theta_i\delta(x-x_i)+\left(1-\sum_{i=1}^D\theta_i\right)\delta(x-x_0)\right\}$, (79)

where $\delta(x)$ is the Dirac distribution (i.e., $\delta(x)=1$ for $x=0$ and $\delta(x)=0$ for $x\neq 0$). Note that the mixture family of categorical distributions can also be interpreted as an exponential family.

Notice that the linear independence assumption on the probability densities ensures an identifiable model: the map $\theta\mapsto m(x;\theta)$ is one-to-one.

The KL divergence between two densities of a mixture family $\mathcal{M}$ amounts to a Bregman divergence for the Shannon negentropy generator $F(\theta)=-h(m_\theta)$ (see [38]):

$\mathrm{KL}(m_{\theta_1}:m_{\theta_2})=B_F(\theta_1:\theta_2)=B_{-h(m_\theta)}(\theta_1:\theta_2)$. (80)

On a mixture manifold $\mathcal{M}$, the mixture density $(1-\alpha)m_{\theta_1}+\alpha m_{\theta_2}$ of two mixtures $m_{\theta_1}$ and $m_{\theta_2}$ of $\mathcal{M}$ also belongs to $\mathcal{M}$:

$(1-\alpha)m_{\theta_1}+\alpha m_{\theta_2}=m_{(\theta_1\theta_2)_\alpha}\in\mathcal{M}$, (81)

where we extend the notation $(\theta_1\theta_2)_\alpha:=(1-\alpha)\theta_1+\alpha\theta_2$ to vectors $\theta_1$ and $\theta_2$ componentwise: $\left((\theta_1\theta_2)_\alpha\right)^i=\left(\theta_1^i\theta_2^i\right)_\alpha$.

Thus, the vector-skew JSD amounts to a vector-skew Jensen diversity for the Shannon negentropy convex function $F(\theta)=-h(m_\theta)$:

$\mathrm{JS}^{\alpha,w}(m_{\theta_1}:m_{\theta_2})=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((m_{\theta_1}m_{\theta_2})_{\alpha_i}:(m_{\theta_1}m_{\theta_2})_{\bar\alpha}\right)$, (82)
$=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left(m_{(\theta_1\theta_2)_{\alpha_i}}:m_{(\theta_1\theta_2)_{\bar\alpha}}\right)$, (83)
$=\sum_{i=1}^k w_i\,B_F\!\left((\theta_1\theta_2)_{\alpha_i}:(\theta_1\theta_2)_{\bar\alpha}\right)$, (84)
$=\mathrm{JB}_F^{\alpha,\bar\alpha,w}(\theta_1:\theta_2)$, (85)
$=\sum_{i=1}^k w_i\,F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_{\bar\alpha}\right)$, (86)
$=h\!\left(m_{(\theta_1\theta_2)_{\bar\alpha}}\right)-\sum_{i=1}^k w_i\,h\!\left(m_{(\theta_1\theta_2)_{\alpha_i}}\right)$. (87)

3.2. Jensen–Shannon Centroids

Given a set of $n$ mixture densities $m_{\theta_1},\ldots,m_{\theta_n}$ of $\mathcal{M}$, we seek to calculate the skew-vector Jensen–Shannon centroid (or barycenter for non-uniform weights), defined as $m_{\theta^*}$, where $\theta^*$ is the minimizer of the following objective function (or loss function):

$L(\theta):=\sum_{j=1}^n\omega_j\,\mathrm{JS}^{\alpha,w}(m_{\theta_j}:m_\theta)$, (88)

where $\omega\in\Delta_n$ is the weight vector of the densities (uniform weights for the centroid and non-uniform weights for a barycenter). This definition of the skew-vector Jensen–Shannon centroid is a generalization of the Fréchet mean (the Fréchet mean may not be unique, as is the case on the sphere for two antipodal points, whose Fréchet means with respect to the geodesic metric distance form a great circle) [39] to non-metric spaces. Since the divergence $\mathrm{JS}^{\alpha,w}$ is strictly separable convex, the Jensen–Shannon-type centroids are unique when they exist.

Plugging Equation (82) into Equation (88), we get that the calculation of the Jensen–Shannon centroid amounts to the following minimization problem:

$L(\theta)=\sum_{j=1}^n\omega_j\left(\sum_{i=1}^k w_i\,F\!\left((\theta_j\theta)_{\alpha_i}\right)-F\!\left((\theta_j\theta)_{\bar\alpha}\right)\right)$. (89)

This optimization is a Difference of Convex (DC) programming optimization, for which we can use the ConCave–Convex procedure [27,40] (CCCP). Indeed, let us define the following two convex functions:

$A(\theta)=\sum_{j=1}^n\sum_{i=1}^k\omega_jw_i\,F\!\left((\theta_j\theta)_{\alpha_i}\right)$, (90)
$B(\theta)=\sum_{j=1}^n\omega_j\,F\!\left((\theta_j\theta)_{\bar\alpha}\right)$. (91)

Both functions A(θ) and B(θ) are convex since F is convex. Then, the minimization problem of Equation (89) to solve can be rewritten as:

$\min_\theta A(\theta)-B(\theta)$. (92)

This is a DC programming optimization problem which can be solved iteratively by initializing θ to an arbitrary value θ(0) (say, the centroid of the θis), and then by updating the parameter at step t using the CCCP [27] as follows:

$\theta^{(t+1)}=(\nabla B)^{-1}\!\left(\nabla A(\theta^{(t)})\right)$. (93)

Compared to a gradient descent local optimization, there is no required step size (also called “learning” rate) in CCCP.

We have $\nabla A(\theta)=\sum_{j=1}^n\sum_{i=1}^k\omega_jw_i\alpha_i\nabla F\!\left((\theta_j\theta)_{\alpha_i}\right)$ and $\nabla B(\theta)=\sum_{j=1}^n\omega_j\bar\alpha\nabla F\!\left((\theta_j\theta)_{\bar\alpha}\right)$.

The CCCP converges to a local optimum $\theta^*$ where the support hyperplanes of the function graphs of $A$ and $B$ at $\theta^*$ are parallel to each other, as depicted in Figure 1. The set of stationary points is $\{\theta:\nabla A(\theta)=\nabla B(\theta)\}$. In practice, the delicate step is to invert $\nabla B$. Next, we show how to implement this algorithm for the Jensen–Shannon centroid of a set of categorical distributions (i.e., normalized histograms with all non-empty bins).

Figure 1.

Figure 1

The ConCave–Convex Procedure (CCCP) iteratively updates the parameter $\theta$ by aligning the support hyperplanes at $\theta$. In the limit case of convergence to $\theta^*$, the support hyperplanes at $\theta^*$ are parallel to each other. CCCP finds a local minimum.

3.2.1. Jensen–Shannon Centroids of Categorical Distributions

To illustrate the method, let us consider the mixture family of categorical distributions [25]:

$\mathcal{M}=\left\{m_\theta(x)=\sum_{i=1}^D\theta_i\delta(x-x_i)+\left(1-\sum_{i=1}^D\theta_i\right)\delta(x-x_0)\right\}$. (94)

The Shannon negentropy is

$F(\theta)=-h(m_\theta)=\sum_{i=1}^D\theta_i\log\theta_i+\left(1-\sum_{i=1}^D\theta_i\right)\log\left(1-\sum_{i=1}^D\theta_i\right)$. (95)

We have the partial derivatives

$\nabla F(\theta)=\left[\frac{\partial F}{\partial\theta_i}\right]_i,\qquad\frac{\partial F}{\partial\theta_i}=\log\frac{\theta_i}{1-\sum_{j=1}^D\theta_j}$. (96)

Inverting the gradient $\nabla F$ requires us to solve the equation $\nabla F(\theta)=\eta$ so that we get $\theta=(\nabla F)^{-1}(\eta)$. We find that

$\nabla F^*(\eta)=(\nabla F)^{-1}(\eta)=\frac{1}{1+\sum_{j=1}^D\exp(\eta_j)}\left[\exp(\eta_i)\right]_i,\qquad\theta_i=\left((\nabla F)^{-1}(\eta)\right)_i=\frac{\exp(\eta_i)}{1+\sum_{j=1}^D\exp(\eta_j)},\quad i\in[D]$. (97)
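In code, the dual conversion of Equation (97) is a softmax-like map inverting the gradient of Equation (96). A minimal sketch (illustrative, with our own function names):

```python
import numpy as np

def grad_F(theta):
    # Eq. (96): eta_i = dF/dtheta_i = log(theta_i / (1 - sum_j theta_j)).
    theta = np.asarray(theta, float)
    return np.log(theta / (1.0 - theta.sum()))

def grad_F_inv(eta):
    # Eq. (97): theta_i = exp(eta_i) / (1 + sum_j exp(eta_j)).
    e = np.exp(np.asarray(eta, float))
    return e / (1.0 + e.sum())

theta = np.array([0.2, 0.3, 0.4])  # D = 3, so theta_0 = 0.1
eta = grad_F(theta)
# the two maps are inverse of each other:
assert np.allclose(grad_F_inv(eta), theta)
```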

Table 1 summarizes the dual view of the family of categorical distributions, either interpreted as an exponential family or as a mixture family.

Table 1.

Two views of the family of categorical distributions with d choices: An exponential family or a mixture family of order D=d1. Note that the Bregman divergence associated to the exponential family view corresponds to the reverse Kullback–Leibler (KL) divergence, while the Bregman divergence associated to the mixture family view corresponds to the KL divergence.

Exponential Family | Mixture Family
pdf: $p_\theta(x)=\prod_{i=1}^dp_i^{t_i(x)}$, $p_i=\Pr(x=e_i)$, $t_i(x)\in\{0,1\}$, $\sum_{i=1}^dt_i(x)=1$ | $m_\theta(x)=\sum_{i=1}^dp_i\,\delta_{e_i}(x)$
primal $\theta$: $\theta_i=\log\frac{p_i}{p_d}$ | $\theta_i=p_i$
$F(\theta)$: $\log\left(1+\sum_{i=1}^D\exp(\theta_i)\right)$ | $\sum_{i=1}^D\theta_i\log\theta_i+\left(1-\sum_{i=1}^D\theta_i\right)\log\left(1-\sum_{i=1}^D\theta_i\right)$
dual $\eta=\nabla F(\theta)$: $\eta_i=\frac{e^{\theta_i}}{1+\sum_{j=1}^De^{\theta_j}}$ | $\eta_i=\log\frac{\theta_i}{1-\sum_{j=1}^D\theta_j}$
primal $\theta=\nabla F^*(\eta)$: $\theta_i=\log\frac{\eta_i}{1-\sum_{j=1}^D\eta_j}$ | $\theta_i=\frac{e^{\eta_i}}{1+\sum_{j=1}^De^{\eta_j}}$
$F^*(\eta)$: $\sum_{i=1}^D\eta_i\log\eta_i+\left(1-\sum_{j=1}^D\eta_j\right)\log\left(1-\sum_{j=1}^D\eta_j\right)$ | $\log\left(1+\sum_{i=1}^D\exp(\eta_i)\right)$
Bregman divergence: $B_F(\theta:\theta')=\mathrm{KL}^*(p_\theta:p_{\theta'})=\mathrm{KL}(p_{\theta'}:p_\theta)$ | $B_F(\theta:\theta')=\mathrm{KL}(m_\theta:m_{\theta'})$

We have $\mathrm{JS}(p_1,p_2)=J_F(\theta_1,\theta_2)$ for $p_1=m_{\theta_1}$ and $p_2=m_{\theta_2}$, where

$J_F(\theta_1:\theta_2)=\frac{F(\theta_1)+F(\theta_2)}{2}-F\!\left(\frac{\theta_1+\theta_2}{2}\right)$, (98)

is the Jensen divergence [40]. Thus, to compute the Jensen–Shannon centroid of a set of $n$ densities $p_1,\ldots,p_n$ of a mixture family (with $p_i=m_{\theta_i}$), we need to solve the following optimization problem for a density $p=m_\theta$:

$\min_p\sum_i\mathrm{JS}(p_i,p)$, (99)
$\equiv\min_\theta\sum_iJ_F(\theta_i,\theta)$, (100)
$\equiv\min_\theta\sum_i\frac{F(\theta_i)+F(\theta)}{2}-F\!\left(\frac{\theta_i+\theta}{2}\right)$, (101)
$\equiv\min_\theta\frac{1}{2}F(\theta)-\frac{1}{n}\sum_iF\!\left(\frac{\theta_i+\theta}{2}\right)=:E(\theta)$. (102)

The CCCP algorithm for the Jensen–Shannon centroid proceeds by initializing $\theta^{(0)}=\frac{1}{n}\sum_i\theta_i$ (center of mass of the natural parameters), and iteratively updates as follows:

$\theta^{(t+1)}=(\nabla F)^{-1}\!\left(\frac{1}{n}\sum_i\nabla F\!\left(\frac{\theta_i+\theta^{(t)}}{2}\right)\right)$. (103)

We iterate until the absolute difference $|E(\theta^{(t)})-E(\theta^{(t+1)})|$ between two successive iterates $\theta^{(t)}$ and $\theta^{(t+1)}$ goes below a prescribed threshold value. The convergence of the CCCP algorithm is linear [41] to a local minimum that is a fixed point of the equation

$\theta=M_{\nabla F}\!\left(\frac{\theta_1+\theta}{2},\ldots,\frac{\theta_n+\theta}{2}\right)$, (104)

where $M_H(v_1,\ldots,v_n):=H^{-1}\!\left(\frac{1}{n}\sum_{i=1}^nH(v_i)\right)$ is a vector generalization of the formula of the quasi-arithmetic means [30,40] obtained for the generator $H=\nabla F$. Algorithm 1 summarizes the method for approximating the Jensen–Shannon centroid of a given set of categorical distributions (given a prescribed number of iterations). In the pseudo-code, we used the notation $^{(t+1)}\theta$ instead of $\theta^{(t+1)}$ in order to highlight the conversion procedures of the natural parameters to/from the mixture weight parameters by using superscript notations for coordinates.

Algorithm 1: The CCCP algorithm for computing the Jensen–Shannon centroid of a set of categorical distributions.
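The core of Algorithm 1, the CCCP iteration of Equation (103) for categorical distributions, can be sketched in a few lines of Python (an illustrative implementation with our own function names and a fixed iteration count, not the paper's reference code):

```python
import numpy as np

def F(theta):
    # Shannon negentropy of a categorical distribution, Eq. (95).
    t0 = 1.0 - theta.sum()
    return float(np.sum(theta * np.log(theta)) + t0 * np.log(t0))

def grad_F(theta):
    # Eq. (96): eta_i = log(theta_i / theta_0).
    return np.log(theta / (1.0 - theta.sum()))

def grad_F_inv(eta):
    # Eq. (97): softmax-like inverse gradient.
    e = np.exp(eta)
    return e / (1.0 + e.sum())

def js_centroid(thetas, iters=100):
    # CCCP update of Eq. (103), initialized at the center of mass.
    theta = np.mean(thetas, axis=0)
    for _ in range(iters):
        theta = grad_F_inv(np.mean([grad_F(0.5 * (ti + theta)) for ti in thetas], axis=0))
    return theta

def energy(theta, thetas):
    # E(theta) of Eq. (102), monotonically decreased by the CCCP.
    return 0.5 * F(theta) - float(np.mean([F(0.5 * (ti + theta)) for ti in thetas]))

thetas = [np.array([0.7, 0.2]), np.array([0.1, 0.6])]  # two trinomial distributions (D = 2)
c = js_centroid(thetas)
# the CCCP never increases the objective from its initialization:
assert energy(c, thetas) <= energy(np.mean(thetas, axis=0), thetas) + 1e-12
```

Each update stays inside the open simplex by construction, since the inverse gradient always outputs strictly positive coordinates summing to less than one.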

Figure 2 displays the results of the calculations of the Jeffreys centroid [18] and the Jensen–Shannon centroid for two normalized histograms obtained from grey-valued images of Lena and Barbara. Figure 3 shows the Jeffreys centroid and the Jensen–Shannon centroid for the Barbara image and its negative image. Figure 4 demonstrates that the Jensen–Shannon centroid is well defined even if the input histograms do not have coinciding supports. Notice that on the parts of the support where only one distribution is defined, the JS centroid is a scaled copy of that defined distribution.

Figure 2.

Figure 2

The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for two grey normalized histograms of the Lena image (red histogram) and the Barbara image (blue histogram). Although these Jeffreys and Jensen–Shannon centroids look quite similar, observe that there is a major difference between them in the range [0,20] where the blue histogram is zero.

Figure 3.

Figure 3

The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for the grey normalized histogram of the Barbara image (red histogram) and its negative image (blue histogram which corresponds to the reflection around the vertical axis x=128 of the red histogram).

Figure 4.

Figure 4

Jensen–Shannon centroid (black histogram) for the clamped grey normalized histogram of the Lena image (red histograms) and the clamped gray normalized histogram of Barbara image (blue histograms). Notice that on the part of the sample space where only one distribution is non-zero, the JS centroid scales that histogram portion.

3.2.2. Special Cases

Let us now consider two special cases:

  • For the special case of $D=1$, the categorical family is the Bernoulli family, and we have $F(\theta)=\theta\log\theta+(1-\theta)\log(1-\theta)$ (binary negentropy), $F'(\theta)=\log\frac{\theta}{1-\theta}$ (with $F''(\theta)=\frac{1}{\theta(1-\theta)}>0$) and $(F')^{-1}(\eta)=\frac{e^\eta}{1+e^\eta}$. The CCCP update rule to compute the binary Jensen–Shannon centroid becomes
    $\theta^{(t+1)}=(F')^{-1}\!\left(\sum_iw_iF'\!\left(\frac{\theta^{(t)}+\theta_i}{2}\right)\right)$. (105)
  • Since the skew-vector Jensen–Shannon divergence formula holds for positive densities:
    JS⁺_{α,w}(p̃:q̃)=∑_{i=1}^k wi KL⁺((p̃q̃)_{αi}:(p̃q̃)_{ᾱ}), (106)
    =∑_{i=1}^k wi KL((p̃q̃)_{αi}:(p̃q̃)_{ᾱ})+∫(p̃q̃)_{ᾱ}dμ−∑_{i=1}^k wi ∫(p̃q̃)_{αi}dμ (where ∑_{i=1}^k wi ∫(p̃q̃)_{αi}dμ=∫(p̃q̃)_{ᾱ}dμ), (107)
    =JS_{α,w}(p̃:q̃), (108)
    we can relax the computation of the Jensen–Shannon centroid by considering 1D separable minimization problems. We then normalize the positive JS centroids to get an approximation of the probability JS centroids. This approach was also considered when dealing with the Jeffreys centroid [18]. In 1D, we have F(θ)=θ log θ−θ, F′(θ)=log θ, and (F′)⁻¹(η)=e^η.
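The relaxed CCCP iteration above can be sketched numerically. Since F′(θ)=log θ acts coordinate-wise on positive measures, the update θ ← (F′)⁻¹(∑_i wi F′((θ+θi)/2)) reduces to a per-bin weighted geometric mean of bin averages, followed by a final normalization. Below is a minimal NumPy sketch; the function name, initialization at the mixture, and fixed iteration count are illustrative choices (not from the paper), and full-support histograms are assumed to keep the logarithms finite.

```python
import numpy as np

def js_centroid(hists, weights=None, iters=100):
    """Approximate Jensen-Shannon centroid of normalized histograms.

    Uses the separable relaxation over positive measures with
    F(theta) = theta*log(theta) - theta, F'(theta) = log(theta),
    (F')^{-1}(eta) = exp(eta).  The CCCP fixed-point step then is a
    per-bin weighted geometric mean of the averages (theta + theta_i)/2,
    and the result is renormalized to approximate the probability centroid.
    """
    hists = np.asarray(hists, dtype=float)      # shape (k, bins), all entries > 0
    k = len(hists)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    theta = np.average(hists, axis=0, weights=w)  # initialize at the mixture
    for _ in range(iters):
        # theta <- exp( sum_i w_i log((theta + h_i)/2) )
        theta = np.exp(np.sum(w[:, None] * np.log((theta + hists) / 2), axis=0))
    return theta / theta.sum()
```

For identical inputs the iteration is a fixed point at that histogram, and symmetric inputs yield a symmetric centroid.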

In general, calculating the negentropy for a mixture family with continuous densities sharing the same support is not tractable because of the log-sum term of the differential entropy. However, the following remark emphasizes an extension of the mixture family of categorical distributions:

3.2.3. Some Remarks and Properties

Remark 3.

Consider a mixture family m(θ)=∑_{i=1}^D θi pi(x)+(1−∑_{i=1}^D θi) p0(x) (for a parameter θ belonging to the D-dimensional standard simplex) of probability densities p0(x),…,pD(x) defined respectively on the supports X0,X1,…,XD. Let θ0:=1−∑_{i=1}^D θi. Assume that the supports Xi of the pi are mutually non-intersecting (Xi∩Xj=∅ for all i≠j, implying that the D+1 densities are linearly independent) so that mθ(x)=θi pi(x) for all x∈Xi, and let X=∪i Xi. Consider the Shannon negentropy F(θ)=−h(mθ) as a strictly convex function. Then, we have

F(θ)=−h(mθ)=∫X mθ(x) log mθ(x) dμ(x), (109)
=∑_{i=0}^D θi ∫Xi pi(x) log(θi pi(x)) dμ(x), (110)
=∑_{i=0}^D θi log θi−∑_{i=0}^D θi h(pi). (111)

Note that the term −∑_i θi h(pi) is affine in θ, and Bregman divergences are defined up to affine terms, so that the Bregman generator F is equivalent to the Bregman generator of the family of categorical distributions. This example generalizes the ordinary mixture family of categorical distributions, where the pi are distinct Dirac distributions. Note that when the supports of the component distributions are not pairwise disjoint, the (neg)entropy may not be analytic [42] (e.g., a mixture of the convex weighting of two prescribed distinct Gaussian distributions). This contrasts with the fact that the cumulant function of an exponential family is always real-analytic [43]. Observe that the term ∑_i θi h(pi) can be interpreted as a conditional entropy: ∑_i θi h(pi)=h(X|Θ), where Pr(Θ=i)=θi and Pr(X∈S|Θ=i)=∫S pi(x)dμ(x).
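The closed form F(θ)=∑_i θi log θi−∑_i θi h(pi) of Remark 3 can be checked numerically on a discrete stand-in with pairwise disjoint supports. The alphabet, components, and mixture weights below are arbitrary illustrative choices; for simplicity the sketch parameterizes by the full weight vector (θ0,θ1,θ2) rather than the D free coordinates.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

# Three components with pairwise disjoint supports on a 6-letter alphabet
# (a discrete stand-in for the densities p_i of the remark):
p = [np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0]),   # uniform on {0, 1}
     np.array([0.0, 0.0, 1/3, 1/3, 1/3, 0.0]),   # uniform on {2, 3, 4}
     np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])]   # point mass on {5}

def negentropy_F(theta):
    """F(theta) = sum_i theta_i log theta_i - sum_i theta_i h(p_i)."""
    theta = np.asarray(theta, dtype=float)
    component_entropies = np.array([entropy(pi) for pi in p])
    return np.sum(theta * np.log(theta)) - np.sum(theta * component_entropies)
```

On disjoint supports the mixture entropy splits exactly into the weight entropy plus the weighted component entropies, so −h(mθ) matches F(θ) to machine precision.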

Notice that we can truncate an exponential family [25] to get a (potentially non-regular [44]) exponential family for defining the pi on mutually non-intersecting domains Xi. The entropy of a natural exponential family {e(x:θ)=exp(x⊤θ−ψ(θ)) : θ∈Θ} with cumulant function ψ(θ) and natural parameter space Θ is −ψ*(η), where η=∇ψ(θ) and ψ* is the Legendre convex conjugate [45]: h(e(x:θ))=−ψ*(∇ψ(θ)).
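The identity h(e(x:θ))=−ψ*(∇ψ(θ)) can be verified on the Bernoulli family, whose cumulant is ψ(θ)=log(1+e^θ) and whose conjugate ψ*(η)=η log η+(1−η) log(1−η) is the binary negentropy. This is a hedged sketch with our own function names, not an API from the paper.

```python
import numpy as np

def psi(theta):
    """Cumulant of the Bernoulli exponential family e(x:theta) = exp(x*theta - psi(theta))."""
    return np.log1p(np.exp(theta))

def psi_star(eta):
    """Legendre conjugate of psi: the binary negentropy."""
    return eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

def entropy_bernoulli(mu):
    """Shannon entropy of a Bernoulli(mu) variable (in nats)."""
    return -mu * np.log(mu) - (1 - mu) * np.log(1 - mu)

theta = 0.7
eta = np.exp(theta) / (1 + np.exp(theta))   # eta = psi'(theta), the mean parameter
```

The check confirms both h = −ψ*(η) and the Legendre relation ψ*(η)=θη−ψ(θ) at η=ψ′(θ).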

In general, the entropy and cross-entropy between densities of a mixture family (whether the distributions have disjoint supports or not) can be calculated in closed-form.

Property 2.

The entropy of a density belonging to a mixture family M is h(mθ)=−F(θ), and the cross-entropy between two mixture densities mθ1 and mθ2 is h×(mθ1:mθ2)=−F(θ2)−(θ1−θ2)⊤η2=F*(η2)−θ1⊤η2, where η2=∇F(θ2).

Proof. 

Let us write the KLD as the cross-entropy minus the entropy [4]:

KL(mθ1:mθ2)=h×(mθ1:mθ2)−h(mθ1), (112)
=BF(θ1:θ2), (113)
=F(θ1)−F(θ2)−(θ1−θ2)⊤∇F(θ2). (114)

Following [45], we deduce that h(mθ)=−F(θ)+c and h×(mθ1:mθ2)=−F(θ2)−(θ1−θ2)⊤η2+c for a constant c. Since F(θ)=−h(mθ) by definition, it follows that c=0 and that h×(mθ1:mθ2)=−F(θ2)−(θ1−θ2)⊤η2=F*(η2)−θ1⊤η2, where η=∇F(θ).  □
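Property 2 can be sanity-checked on the mixture family of categorical distributions, for which F, ∇F, and the cross-entropy are all available in closed form. The coordinate convention below (the free parameters θ1,…,θD first, with θ0=1−∑θi appended last) is an illustrative choice.

```python
import numpy as np

def full(theta):
    """Categorical density m_theta: D free coordinates plus theta_0 = 1 - sum(theta)."""
    theta = np.asarray(theta, dtype=float)
    return np.append(theta, 1.0 - theta.sum())

def F(theta):
    """Bregman generator: Shannon negentropy F(theta) = -h(m_theta)."""
    m = full(theta)
    return np.sum(m * np.log(m))

def gradF(theta):
    """Gradient of F: (grad F)_i = log(theta_i / theta_0)."""
    m = full(theta)
    return np.log(m[:-1] / m[-1])

def bregman(t1, t2):
    """B_F(theta_1 : theta_2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>."""
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

def kl(p, q):
    """Kullback-Leibler divergence between discrete positive distributions."""
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    """Shannon cross-entropy h_x(p:q) = -sum p log q."""
    return -np.sum(p * np.log(q))
```

The assertions check KL(mθ1:mθ2)=BF(θ1:θ2) (Equations (112)–(114)) and the cross-entropy formula h×=−F(θ2)−(θ1−θ2)⊤∇F(θ2) of Property 2.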

Thus, we can numerically compute the Jensen–Shannon centroids (or barycenters) of a set of densities belonging to a mixture family. This includes the case of categorical distributions and the case of Gaussian Mixture Models (GMMs) with prescribed Gaussian components [38] (although in this case, the negentropy needs to be stochastically approximated using Monte Carlo techniques [46]). When the densities do not belong to a mixture family (say, the Gaussian family, which is an exponential family [25]), we face the problem that the mixture of two densities does not belong to the family anymore. One way to tackle this problem is to project the mixture onto the Gaussian family. This corresponds to an m-projection (mixture projection), which can be interpreted as a Maximum Entropy projection of the mixture [25,47].

Notice that we can perform fast k-means clustering without centroid calculations by using a generalization of the k-means++ probabilistic initialization [48,49]; see [50] for its extension to an arbitrary divergence.

Finally, let us notice some decompositions of the Jensen–Shannon divergence and the skew Jensen divergences.

Remark 4.

We have the following decomposition for the Jensen–Shannon divergence:

JS(p1,p2)=h((p1+p2)/2)−(h(p1)+h(p2))/2, (115)
=hJS×(p1:p2)−hJS(p2)≥0, (116)

where

hJS×(p1:p2)=h((p1+p2)/2)−(1/2)h(p1), (117)

and hJS(p2)=hJS×(p2:p2)=h(p2)−(1/2)h(p2)=(1/2)h(p2). This decomposition bears some similarity with the KLD decomposition viewed as the cross-entropy minus the entropy (with the cross-entropy always upper-bounding the entropy).
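The decomposition (115)–(117) can be checked numerically on categorical distributions. The names below mirror the remark (hJS× for the JS cross-entropy) but are our own illustrative identifiers.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a positive discrete distribution."""
    return -np.sum(p * np.log(p))

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: average KLD to the mid-mixture."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_cross_entropy(p, q):
    """hJSx(p1:p2) = h((p1+p2)/2) - h(p1)/2, as in Equation (117)."""
    return entropy((p + q) / 2) - 0.5 * entropy(p)
```

Both the entropic form (115) and the cross-entropy-minus-entropy form (116) agree with the KLD-based definition of the JSD.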

Similarly, the α-skew Jensen divergence

JFα(θ1:θ2):=(F(θ1)F(θ2))α−F((θ1θ2)α), α∈(0,1), (118)

can be decomposed as the information IFα(θ1)=(1−α)F(θ1) minus the cross-information CFα(θ1:θ2):=F((θ1θ2)α)−αF(θ2):

JFα(θ1:θ2)=IFα(θ1)−CFα(θ1:θ2)≥0. (119)

Notice that the information IFα(θ1) is the self cross-information: IFα(θ1)=CFα(θ1:θ1)=(1−α)F(θ1). Recall that the convex information is the negentropy, where the entropy is concave. For the Jensen–Shannon divergence on the mixture family of categorical distributions, the convex generator F(θ)=−h(mθ)=∑_{i=1}^D θi log θi is the Shannon negentropy.
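The decomposition (119) and the self-cross-information identity hold for any convex generator F; the sketch below checks them with the Shannon negentropy on the interior of the simplex (generic sketch, identifiers ours).

```python
import numpy as np

def mix(a, b, alpha):
    """Linear interpolation (ab)_alpha = (1 - alpha) * a + alpha * b."""
    return (1 - alpha) * a + alpha * b

def skew_jensen(F, t1, t2, alpha):
    """J_F^alpha(t1:t2) = (F(t1)F(t2))_alpha - F((t1 t2)_alpha)."""
    return mix(F(t1), F(t2), alpha) - F(mix(t1, t2, alpha))

def information(F, t1, alpha):
    """I_F^alpha(t1) = (1 - alpha) F(t1)."""
    return (1 - alpha) * F(t1)

def cross_information(F, t1, t2, alpha):
    """C_F^alpha(t1:t2) = F((t1 t2)_alpha) - alpha F(t2)."""
    return F(mix(t1, t2, alpha)) - alpha * F(t2)
```

With these definitions, J = I − C ≥ 0 and I is the self cross-information, exactly as in the remark.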

Finally, let us briefly mention the Jensen–Shannon diversity [30], which extends the Jensen–Shannon divergence to a weighted set of densities as follows:

JS(p1,…,pk;w1,…,wk):=∑_{i=1}^k wi KL(pi:p̄), (120)

where p̄=∑_{i=1}^k wi pi. The Jensen–Shannon diversity plays the role of the variance of a cluster with respect to the KLD. Indeed, let us state the compensation identity [51]: for any q, we have

∑_{i=1}^k wi KL(pi:q)=∑_{i=1}^k wi KL(pi:p̄)+KL(p̄:q). (121)

Thus, the cluster center defined as the minimizer of ∑_{i=1}^k wi KL(pi:q) is the centroid p̄, and

∑_{i=1}^k wi KL(pi:p̄)=JS(p1,…,pk;w1,…,wk). (122)
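The compensation identity (121) and its corollary (122) hold exactly for discrete distributions, as a short check confirms (the data below are arbitrary illustrative choices).

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def js_diversity(ps, w):
    """Jensen-Shannon diversity: weighted average KLD to the barycenter p-bar."""
    pbar = np.average(ps, axis=0, weights=w)
    return sum(wi * kl(pi, pbar) for wi, pi in zip(w, ps))
```

The test checks that the weighted KLD to any q splits into the diversity plus KL(p̄:q), so p̄ is indeed the minimizer.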

4. Conclusions and Discussion

The Jensen–Shannon divergence [6] is a renowned symmetrization of the oriented Kullback–Leibler divergence that enjoys the following three essential properties:

  1. It is always bounded,

  2. it applies to densities with potentially different supports, and

  3. it extends to unnormalized densities while retaining the same formula.

This JSD plays an important role in machine learning and in deep learning for studying Generative Adversarial Networks (GANs) [52]. Traditionally, the JSD has been skewed with a scalar parameter α∈(0,1) [19,53]. In practice, it has been experimentally demonstrated that skewing divergences may significantly improve the performance of some tasks (e.g., [21,54]).

In general, we can symmetrize the KLD KL(p:q) by taking an abstract mean M (we require a symmetric mean M(x,y)=M(y,x) with the in-betweenness property min{x,y}≤M(x,y)≤max{x,y}) between the two orientations KL(p:q) and KL(q:p):

KLM(p,q):=M(KL(p:q),KL(q:p)). (123)

We recover the Jeffreys divergence by taking twice the arithmetic mean (i.e., J(p,q)=2A(KL(p:q),KL(q:p)) where A(x,y)=(x+y)/2), and the resistor average divergence [55] by taking the harmonic mean (i.e., RKL(p,q)=H(KL(p:q),KL(q:p))=2KL(p:q)KL(q:p)/(KL(p:q)+KL(q:p)), where H(x,y)=2/(1/x+1/y)). When we take the limits of Hölder power means, we get the following extremal symmetrizations of the KLD:

KLmin(p:q)=min{KL(p:q),KL(q:p)}=KLmin(q:p), (124)
KLmax(p:q)=max{KL(p:q),KL(q:p)}=KLmax(q:p). (125)
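The abstract-mean symmetrizations (123)–(125) can be implemented generically; by the harmonic–arithmetic mean inequality, the four variants are ordered KLmin ≤ RKL ≤ J/2 ≤ KLmax. A sketch with our own illustrative names:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def symmetrized_kl(p, q, mean):
    """KL_M(p, q) = M(KL(p:q), KL(q:p)) for a symmetric mean M."""
    return mean(kl(p, q), kl(q, p))

arithmetic = lambda x, y: (x + y) / 2            # half the Jeffreys divergence
harmonic = lambda x, y: 2 * x * y / (x + y)      # the resistor average divergence
kl_min = min                                     # extremal power-mean limits
kl_max = max
```

The test verifies the Jeffreys identity J=2A(KL(p:q),KL(q:p)) and the ordering of the four symmetrizations.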

In this work, we showed how to vector-skew the JSD while preserving the above three properties. These new families of weighted vector-skew Jensen–Shannon divergences may allow one to fine-tune the dissimilarity in applications by replacing the skewing scalar parameter of the JSD by a vector parameter (informally, adding some “knobs” for tuning a divergence). We then considered computing the Jensen–Shannon centroids of a set of densities belonging to a mixture family [25] by using the convex–concave procedure [27].

In general, we can vector-skew any arbitrary divergence D using two k-dimensional vectors α∈[0,1]^k and β∈[0,1]^k (with α≠β), building a weighted separable divergence as follows:

Dα,β,w(p:q):=∑_{i=1}^k wi D((pq)_{αi}:(pq)_{βi})=D_{1k−α,1k−β,w}(q:p), α≠β. (126)

This bi-vector-skew divergence unifies the Jeffreys divergence with the Jensen–Shannon α-skew divergence by setting the following parameters:

KL_{(0,1),(1,0),(1,1)}(p:q)=KL(p:q)+KL(q:p)=J(p,q), (127)
KL_{(0,1),(α,α),(1/2,1/2)}(p:q)=(1/2)KL(p:(pq)α)+(1/2)KL(q:(pq)α). (128)
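Definition (126) is straightforward to implement. The sketch below (function names ours) checks the Jeffreys special case (127) and the reference-duality symmetry Dα,β,w(p:q)=D_{1k−α,1k−β,w}(q:p), which follows from (pq)α=(qp)_{1−α}.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def mixw(p, q, a):
    """Statistical mixture (pq)_a = (1 - a) * p + a * q."""
    return (1 - a) * p + a * q

def bivector_skew_kl(p, q, alpha, beta, w):
    """D_{alpha,beta,w}(p:q) = sum_i w_i KL((pq)_{alpha_i} : (pq)_{beta_i}), Equation (126)."""
    return sum(wi * kl(mixw(p, q, ai), mixw(p, q, bi))
               for ai, bi, wi in zip(alpha, beta, w))
```

With α=(0,1), β=(1,0), and w=(1,1), the sum collapses to KL(p:q)+KL(q:p), i.e., the Jeffreys divergence.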

We have shown in this paper that interesting properties may occur when the skewing vector β is purposely correlated to the skewing vector α: Namely, for the bi-vector-skew Bregman divergences with β=(ᾱ,…,ᾱ) and ᾱ=∑_i wi αi, we obtain an equivalent Jensen diversity for the Jensen–Bregman divergence and, as a byproduct, a vector-skew generalization of the Jensen–Shannon divergence.

Acknowledgments

The author is very grateful to the two Reviewers and the Academic Editor for their careful reading, helpful comments, and suggestions which led to this improved manuscript. In particular, Reviewer 2 kindly suggested the stronger bound of Lemma 1 and hinted at Theorem 1.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

  • 1.Billingsley P. Probability and Measure. John Wiley & Sons; Hoboken, NJ, USA: 2008. [Google Scholar]
  • 2.Deza M.M., Deza E. Encyclopedia of Distances. Springer; Berlin/Heidelberg, Germany: 2009. [Google Scholar]
  • 3.Basseville M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013;93:621–633. doi: 10.1016/j.sigpro.2012.09.003. [DOI] [Google Scholar]
  • 4.Cover T.M., Thomas J.A. Elements of Information Theory. John Wiley & Sons; Hoboken, NJ, USA: 2012. [Google Scholar]
  • 5.Nielsen F. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy. 2019;21:485. doi: 10.3390/e21050485. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Lin J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory. 1991;37:145–151. doi: 10.1109/18.61115. [DOI] [Google Scholar]
  • 7.Sason I. Tight bounds for symmetric divergence measures and a new inequality relating f-divergences; Proceedings of the 2015 IEEE Information Theory Workshop (ITW); Jerusalem, Israel. 26 April–1 May 2015; pp. 1–5. [Google Scholar]
  • 8.Wong A.K., You M. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 1985;7:599–609. doi: 10.1109/TPAMI.1985.4767707. [DOI] [PubMed] [Google Scholar]
  • 9.Endres D.M., Schindelin J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory. 2003;49:1858–1860. doi: 10.1109/TIT.2003.813506. [DOI] [Google Scholar]
  • 10.Kafka P., Österreicher F., Vincze I. On powers of f-divergences defining a distance. Stud. Sci. Math. Hung. 1991;26:415–422. [Google Scholar]
  • 11.Fuglede B. Spirals in Hilbert space: With an application in information theory. Expo. Math. 2005;23:23–45. doi: 10.1016/j.exmath.2005.01.014. [DOI] [Google Scholar]
  • 12.Acharyya S., Banerjee A., Boley D. Bregman divergences and triangle inequality; Proceedings of the 2013 SIAM International Conference on Data Mining; Austin, TX, USA. 2–4 May 2013; pp. 476–484. [Google Scholar]
  • 13.Naghshvar M., Javidi T., Wigger M. Extrinsic Jensen–Shannon divergence: Applications to variable-length coding. IEEE Trans. Inf. Theory. 2015;61:2148–2164. doi: 10.1109/TIT.2015.2401004. [DOI] [Google Scholar]
  • 14.Bigi B. European Conference on Information Retrieval. Springer; Berlin/Heidelberg, Germany: 2003. Using Kullback-Leibler distance for text categorization; pp. 305–319. [Google Scholar]
  • 15.Chatzisavvas K.C., Moustakidis C.C., Panos C. Information entropy, information distances, and complexity in atoms. J. Chem. Phys. 2005;123:174111. doi: 10.1063/1.2121610. [DOI] [PubMed] [Google Scholar]
  • 16.Yurdakul B. Ph.D. Thesis. Western Michigan University; Kalamazoo, MI, USA: 2018. Statistical Properties of Population Stability Index. [Google Scholar]
  • 17.Jeffreys H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A. 1946;186:453–461. doi: 10.1098/rspa.1946.0056. [DOI] [PubMed] [Google Scholar]
  • 18.Nielsen F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013;20:657–660. doi: 10.1109/LSP.2013.2260538. [DOI] [Google Scholar]
  • 19.Lee L. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99. Association for Computational Linguistics; Stroudsburg, PA, USA: 1999. Measures of Distributional Similarity; pp. 25–32. [DOI] [Google Scholar]
  • 20.Nielsen F. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv. 2010, arXiv:1009.4004. [Google Scholar]
  • 21.Lee L. On the effectiveness of the skew divergence for statistical language analysis; Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AISTATS 2001); Key West, FL, USA. 4–7 January 2001. [Google Scholar]
  • 22.Csiszár I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967;2:229–318. [Google Scholar]
  • 23.Ali S.M., Silvey S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966;28:131–142. doi: 10.1111/j.2517-6161.1966.tb00626.x. [DOI] [Google Scholar]
  • 24.Sason I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy. 2018;20:383. doi: 10.3390/e20050383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Amari S.I. Information Geometry and Its Applications. Springer; Berlin/Heidelberg, Germany: 2016. [Google Scholar]
  • 26.Jiao J., Courtade T.A., No A., Venkat K., Weissman T. Information measures: The curious case of the binary alphabet. IEEE Trans. Inf. Theory. 2014;60:7616–7626. doi: 10.1109/TIT.2014.2360184. [DOI] [Google Scholar]
  • 27.Yuille A.L., Rangarajan A. The concave-convex procedure (CCCP); Proceedings of the Neural Information Processing Systems 2002; Vancouver, BC, Canada. 9–14 December 2002; pp. 1033–1040. [Google Scholar]
  • 28.Nielsen F., Nock R. Transactions on Computational Science XIV. Springer; Berlin/Heidelberg, Germany: 2011. Skew Jensen-Bregman Voronoi diagrams; pp. 102–128. [Google Scholar]
  • 29.Banerjee A., Merugu S., Dhillon I.S., Ghosh J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005;6:1705–1749. [Google Scholar]
  • 30.Nielsen F., Nock R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory. 2009;55:2882–2904. doi: 10.1109/TIT.2009.2018176. [DOI] [Google Scholar]
  • 31.Melbourne J., Talukdar S., Bhaban S., Madiman M., Salapaka M.V. On the Entropy of Mixture distributions. [(accessed on 16 February 2020)]; Available online: http://box5779.temp.domains/~jamesmel/publications/
  • 32.Guntuboyina A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory. 2011;57:2386–2399. doi: 10.1109/TIT.2011.2110791. [DOI] [Google Scholar]
  • 33.Sason I., Verdu S. f-divergence Inequalities. IEEE Trans. Inf. Theory. 2016;62:5973–6006. doi: 10.1109/TIT.2016.2603151. [DOI] [Google Scholar]
  • 34.Melbourne J., Madiman M., Salapaka M.V. Relationships between certain f-divergences; Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton); Monticello, IL, USA . 24–27 September 2019; pp. 1068–1073. [Google Scholar]
  • 35.Sason I. On Data-Processing and Majorization Inequalities for f-Divergences with Applications. Entropy. 2019;21:1022. doi: 10.3390/e21101022. [DOI] [Google Scholar]
  • 36.Van Erven T., Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory. 2014;60:3797–3820. doi: 10.1109/TIT.2014.2320500. [DOI] [Google Scholar]
  • 37.Xu P., Melbourne J., Madiman M. Infinity-Rényi entropy power inequalities; Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT); Aachen, Germany. 25–30 June 2017; pp. 2985–2989. [Google Scholar]
  • 38.Nielsen F., Nock R. On the geometry of mixtures of prescribed distributions; Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Calgary, AB, Canada. 15–20 April 2018; pp. 2861–2865. [Google Scholar]
  • 39.Fréchet M. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. De L’institut Henri PoincarÉ. 1948;10:215–310. [Google Scholar]
  • 40.Nielsen F., Boltz S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory. 2011;57:5455–5466. doi: 10.1109/TIT.2011.2159046. [DOI] [Google Scholar]
  • 41.Lanckriet G.R., Sriperumbudur B.K. On the convergence of the concave-convex procedure; Proceedings of the Advances in Neural Information Processing Systems 22 (NIPS 2009); Vancouver, BC, Canada. 7–10 December 2009; pp. 1759–1767. [Google Scholar]
  • 42.Nielsen F., Sun K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy. 2016;18:442. doi: 10.3390/e18120442. [DOI] [Google Scholar]
  • 43.Springer Verlag GmbH, European Mathematical Society Encyclopedia of Mathematics. [(accessed on 19 December 2019)]; Available online: https://www.encyclopediaofmath.org/
  • 44.Del Castillo J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994;46:57–66. doi: 10.1007/BF00773592. [DOI] [Google Scholar]
  • 45.Nielsen F., Nock R. Entropies and cross-entropies of exponential families; Proceedings of the 2010 IEEE International Conference on Image Processing; Hong Kong, China. 26–29 September 2010; pp. 3621–3624. [Google Scholar]
  • 46.Nielsen F., Hadjeres G. Monte Carlo information geometry: The dually flat case. arXiv. 2018, arXiv:1803.07225. [Google Scholar]
  • 47.Schwander O., Nielsen F. Matrix Information Geometry. Springer; Berlin/Heidelberg, Germany: 2013. Learning mixtures by simplifying kernel density estimators; pp. 403–426. [Google Scholar]
  • 48.Arthur D., Vassilvitskii S. k-means++: The advantages of careful seeding; Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07); New Orleans, LA, USA. 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  • 49.Nielsen F., Nock R., Amari S.I. On clustering histograms with k-means by using mixed α-divergences. Entropy. 2014;16:3273–3301. doi: 10.3390/e16063273. [DOI] [Google Scholar]
  • 50.Nielsen F., Nock R. Total Jensen divergences: Definition, properties and clustering; Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Brisbane, QLD, Australia. 19–24 April 2015; pp. 2016–2020. [Google Scholar]
  • 51.Topsøe F. Basic concepts, identities and inequalities-the toolkit of information theory. Entropy. 2001;3:162–190. doi: 10.3390/e3030162. [DOI] [Google Scholar]
  • 52.Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative adversarial nets; Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014); Montreal, QC, Canada. 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  • 53.Yamano T. Some bounds for skewed α-Jensen-Shannon divergence. Results Appl. Math. 2019;3:100064. doi: 10.1016/j.rinam.2019.100064. [DOI] [Google Scholar]
  • 54.Kotlerman L., Dagan I., Szpektor I., Zhitomirsky-Geffet M. Directional distributional similarity for lexical inference. Nat. Lang. Eng. 2010;16:359–389. doi: 10.1017/S1351324910000124. [DOI] [Google Scholar]
  • 55.Johnson D., Sinanovic S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory. 2001:1–8. [Google Scholar]

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
