Entropy. 2020 Feb 16;22(2):221. doi: 10.3390/e22020221

On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid

Frank Nielsen 1
PMCID: PMC7516653  PMID: 33285995

Abstract

The Jensen–Shannon divergence is a renowned bounded symmetrization of the Kullback–Leibler divergence which does not require probability densities to have matching supports. In this paper, we introduce a vector-skew generalization of the scalar α-Jensen–Bregman divergences and derive from it the vector-skew α-Jensen–Shannon divergences. We prove that the vector-skew α-Jensen–Shannon divergences are f-divergences and study the properties of these novel divergences. Finally, we report an iterative algorithm to numerically compute the Jensen–Shannon-type centroids for a set of probability densities belonging to a mixture family: This includes the case of the Jensen–Shannon centroid of a set of categorical distributions or normalized histograms.

Keywords: Bregman divergence, f-divergence, Jensen–Bregman divergence, Jensen diversity, Jensen–Shannon divergence, capacitory discrimination, Jensen–Shannon centroid, mixture family, information geometry, difference of convex (DC) programming

1. Introduction

Let $(\mathcal{X},\mathcal{F},\mu)$ be a measure space [1] where $\mathcal{X}$ denotes the sample space, $\mathcal{F}$ the $\sigma$-algebra of measurable events, and $\mu$ a positive measure; for example, the measure space defined by the Lebesgue measure $\mu_L$ with Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$ for $\mathcal{X}=\mathbb{R}^d$, or the measure space defined by the counting measure $\mu_c$ with the power set $\sigma$-algebra $2^{\mathcal{X}}$ on a finite alphabet $\mathcal{X}$. Denote by $L^1(\mathcal{X},\mathcal{F},\mu)$ the Lebesgue space of measurable functions, by $\mathcal{P}_1$ the subspace of positive integrable functions $f$ such that $\int_{\mathcal{X}}f(x)\,d\mu(x)=1$ and $f(x)>0$ for all $x\in\mathcal{X}$, and by $\bar{\mathcal{P}}_1$ the subspace of non-negative integrable functions $f$ such that $\int_{\mathcal{X}}f(x)\,d\mu(x)=1$ and $f(x)\geq 0$ for all $x\in\mathcal{X}$.

We refer to the book of Deza and Deza [2] and the survey of Basseville [3] for an introduction to the many types of statistical divergences met in information sciences and their justifications. The Kullback–Leibler Divergence (KLD) $\mathrm{KL}:\mathcal{P}_1\times\mathcal{P}_1\to[0,\infty]$ is an oriented statistical distance (commonly called the relative entropy in information theory [4]) defined between two densities $p$ and $q$ (i.e., the Radon–Nikodym densities of $\mu$-absolutely continuous probability measures $P$ and $Q$) by

$\mathrm{KL}(p:q):=\int p\log\frac{p}{q}\,d\mu$. (1)

Although $\mathrm{KL}(p:q)\geq 0$ with equality iff. $p=q$ $\mu$-almost everywhere (Gibbs' inequality [4]), the KLD may diverge to infinity depending on the underlying densities. Since the KLD is asymmetric, several symmetrizations [5] have been proposed in the literature.

A well-grounded symmetrization of the KLD is the Jensen–Shannon Divergence [6] (JSD), also called capacitory discrimination in the literature (e.g., see [7]):

$\mathrm{JS}(p,q):=\frac{1}{2}\left(\mathrm{KL}\!\left(p:\frac{p+q}{2}\right)+\mathrm{KL}\!\left(q:\frac{p+q}{2}\right)\right)$, (2)
$=\frac{1}{2}\int\left(p\log\frac{2p}{p+q}+q\log\frac{2q}{p+q}\right)d\mu=\mathrm{JS}(q,p)$. (3)

The Jensen–Shannon divergence can be interpreted as the total KL divergence to the average distribution $\frac{p+q}{2}$. It was historically introduced implicitly in [8] (Equation (19)) to calculate distances between random graphs. A nice feature of the Jensen–Shannon divergence is that it can be applied to densities with arbitrary supports (i.e., $p,q\in\bar{\mathcal{P}}_1$, with the conventions that $0\log 0=0$ and $\log\frac{0}{0}=0$); moreover, the JSD is always upper-bounded by $\log 2$. Let $\mathcal{X}_p=\mathrm{supp}(p)$ and $\mathcal{X}_q=\mathrm{supp}(q)$ denote the supports of the densities $p$ and $q$, respectively, where $\mathrm{supp}(p):=\{x\in\mathcal{X}:p(x)>0\}$. The JSD saturates to $\log 2$ whenever the supports $\mathcal{X}_p$ and $\mathcal{X}_q$ are disjoint. We can rewrite the JSD as

$\mathrm{JS}(p,q)=h\!\left(\frac{p+q}{2}\right)-\frac{h(p)+h(q)}{2}$, (4)

where $h(p)=-\int p\log p\,d\mu$ denotes Shannon's entropy. Thus, the JSD can also be interpreted as the entropy of the average distribution minus the average of the entropies.
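For discrete distributions, both the definition of Equation (2) and the entropy-gap form of Equation (4) are straightforward to evaluate numerically. The following minimal Python sketch (an illustration written for this text, not code from the paper) checks that the two formulas agree and that the JSD stays within $[0,\log 2]$ even when the supports differ:

```python
import numpy as np

def entropy(p):
    # Shannon entropy h(p) = -sum p log p, with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    # Kullback-Leibler divergence; assumes q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def js(p, q):
    # Jensen-Shannon divergence, Eq. (2): average KL to the mid-density.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mid = 0.5 * (p + q)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def js_entropy_gap(p, q):
    # Equivalent entropy-gap form, Eq. (4).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return entropy(0.5 * (p + q)) - 0.5 * (entropy(p) + entropy(q))

p = np.array([0.7, 0.2, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.4, 0.5])
assert abs(js(p, q) - js_entropy_gap(p, q)) < 1e-12
assert 0.0 <= js(p, q) <= np.log(2)
```

On two distributions with disjoint supports, both formulas return exactly $\log 2$, the saturation value mentioned above.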

The square root of the JSD is a metric [9] satisfying the triangle inequality, but the square root of the Jeffreys divergence is not a metric (nor is any positive power of the Jeffreys divergence, see [10]). In fact, the JSD can be interpreted as a Hilbert metric distance, meaning that there exists some isometric embedding of $(\mathcal{X},\sqrt{\mathrm{JS}})$ into a Hilbert space [11,12]. Other principled symmetrizations of the KLD have been proposed in the literature: For example, Naghshvar et al. [13] proposed the extrinsic Jensen–Shannon divergence and demonstrated its use for variable-length coding over a discrete memoryless channel (DMC).

Another symmetrization of the KLD sometimes met in the literature [14,15,16] is the Jeffreys divergence [17,18] (JD) defined by

$J(p,q):=\mathrm{KL}(p:q)+\mathrm{KL}(q:p)=\int(p-q)\log\frac{p}{q}\,d\mu=J(q,p)$. (5)

However, we point out that this Jeffreys divergence lacks sound information-theoretical justifications.

For two positive but not necessarily normalized densities p˜ and q˜, we define the extended Kullback–Leibler divergence as follows:

$\mathrm{KL}^+(\tilde p:\tilde q):=\mathrm{KL}(\tilde p:\tilde q)+\int\tilde q\,d\mu-\int\tilde p\,d\mu$, (6)
$=\int\left(\tilde p\log\frac{\tilde p}{\tilde q}+\tilde q-\tilde p\right)d\mu$. (7)

The Jensen–Shannon divergence and the Jeffreys divergence can both be extended to positive (unnormalized) densities without changing their formula expressions:

$\mathrm{JS}^+(\tilde p,\tilde q):=\frac{1}{2}\left(\mathrm{KL}^+\!\left(\tilde p:\frac{\tilde p+\tilde q}{2}\right)+\mathrm{KL}^+\!\left(\tilde q:\frac{\tilde p+\tilde q}{2}\right)\right)$, (8)
$=\frac{1}{2}\left(\mathrm{KL}\!\left(\tilde p:\frac{\tilde p+\tilde q}{2}\right)+\mathrm{KL}\!\left(\tilde q:\frac{\tilde p+\tilde q}{2}\right)\right)=\mathrm{JS}(\tilde p,\tilde q)$, (9)
$J^+(\tilde p,\tilde q):=\mathrm{KL}^+(\tilde p:\tilde q)+\mathrm{KL}^+(\tilde q:\tilde p)=\int(\tilde p-\tilde q)\log\frac{\tilde p}{\tilde q}\,d\mu=J(\tilde p,\tilde q)$. (10)

However, the extended $\mathrm{JS}^+$ divergence is upper-bounded by $\left(\frac{1}{2}\int(\tilde p+\tilde q)\,d\mu\right)\log 2=\frac{1}{2}(\mu(\tilde p)+\mu(\tilde q))\log 2$ (writing $\mu(\tilde p):=\int\tilde p\,d\mu$) instead of $\log 2$, the bound for normalized densities (i.e., when $\mu(\tilde p)+\mu(\tilde q)=2$).

Let $(pq)_\alpha(x):=(1-\alpha)p(x)+\alpha q(x)$ denote the statistical weighted mixture with component densities $p$ and $q$ for $\alpha\in[0,1]$. The asymmetric $\alpha$-skew Jensen–Shannon divergence can be defined for a scalar parameter $\alpha\in(0,1)$ by considering the weighted mixture $(pq)_\alpha$ as follows:

$\mathrm{JS}_a^\alpha(p:q):=(1-\alpha)\,\mathrm{KL}(p:(pq)_\alpha)+\alpha\,\mathrm{KL}(q:(pq)_\alpha)$, (11)
$=(1-\alpha)\int p\log\frac{p}{(pq)_\alpha}\,d\mu+\alpha\int q\log\frac{q}{(pq)_\alpha}\,d\mu$. (12)

Let us introduce the α-skew K-divergence [6,19] Kα(p:q) by:

$K_\alpha(p:q):=\mathrm{KL}(p:(1-\alpha)p+\alpha q)=\mathrm{KL}(p:(pq)_\alpha)$. (13)

Then, both the Jensen–Shannon divergence and the Jeffreys divergence can be rewritten [20] using Kα as follows:

$\mathrm{JS}(p,q)=\frac{1}{2}\left(K_{\frac{1}{2}}(p:q)+K_{\frac{1}{2}}(q:p)\right)$, (14)
$J(p,q)=K_1(p:q)+K_1(q:p)$, (15)

since $(pq)_1=q$, $\mathrm{KL}(p:q)=K_1(p:q)$ and $(pq)_{\frac{1}{2}}=(qp)_{\frac{1}{2}}$.

We can thus define the symmetric α-skew Jensen–Shannon divergence [20] for α(0,1) as follows:

$\mathrm{JS}_\alpha(p,q):=\frac{1}{2}K_\alpha(p:q)+\frac{1}{2}K_\alpha(q:p)=\mathrm{JS}_\alpha(q,p)$. (16)

The ordinary Jensen–Shannon divergence is recovered for $\alpha=\frac{1}{2}$.
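As a quick numerical illustration (a sketch written for this text, not from the paper), the symmetric $\alpha$-skew divergence of Equation (16) can be implemented directly from the $K_\alpha$ divergence of Equation (13); at $\alpha=\frac{1}{2}$ it coincides with the ordinary JSD:

```python
import numpy as np

def kl(p, q):
    # KL divergence; assumes q > 0 wherever p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def K(alpha, p, q):
    # alpha-skew K-divergence, Eq. (13): KL(p : (1-alpha) p + alpha q).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return kl(p, (1 - alpha) * p + alpha * q)

def js_alpha(alpha, p, q):
    # Symmetric alpha-skew Jensen-Shannon divergence, Eq. (16).
    return 0.5 * K(alpha, p, q) + 0.5 * K(alpha, q, p)

p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
mid = [0.5 * (a + b) for a, b in zip(p, q)]
# alpha = 1/2 recovers the ordinary JSD of Eq. (2):
ordinary = 0.5 * kl(p, mid) + 0.5 * kl(q, mid)
assert abs(js_alpha(0.5, p, q) - ordinary) < 1e-12
# symmetric for every alpha by construction:
assert abs(js_alpha(0.3, p, q) - js_alpha(0.3, q, p)) < 1e-12
```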

In general, skewing divergences (e.g., using the divergence Kα instead of the KLD) have been experimentally shown to perform better in applications like in some natural language processing (NLP) tasks [21].

The α-Jensen–Shannon divergences are Csiszár f-divergences [22,23,24]. An f-divergence is defined for a convex function $f$, strictly convex at $1$ and satisfying $f(1)=0$, as:

$I_f(p:q)=\int q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)d\mu(x)\geq f(1)=0$. (17)

We can always symmetrize f-divergences by taking the conjugate convex function $f^*(x)=xf\!\left(\frac{1}{x}\right)$ (related to the perspective function): $I_{f+f^*}(p,q)$ is a symmetric divergence. The f-divergences are convex statistical distances which are provably the only separable invariant divergences in information geometry [25], except for binary alphabets $\mathcal{X}$ (see [26]).

The Jeffreys divergence is an f-divergence for the generator $f(x)=(x-1)\log x$, and the $\alpha$-Jensen–Shannon divergences are f-divergences for the generator family $f_\alpha(x)=\frac{1}{2}\left(x\log\frac{x}{(1-\alpha)x+\alpha}-\log((1-\alpha)+\alpha x)\right)$ (recovering the JSD generator of Equation (41) for $\alpha=\frac{1}{2}$). The f-divergences are upper-bounded by $f(0)+f^*(0)$. Thus, the f-divergences are finite when $f(0)+f^*(0)<\infty$.

The main contributions of this paper are summarized as follows:

  • First, we generalize the Jensen–Bregman divergence by skewing a weighted separable Jensen–Bregman divergence with a $k$-dimensional vector $\alpha\in[0,1]^k$ in Section 2. This yields a generalization of the symmetric skew $\alpha$-Jensen–Shannon divergences to a vector-skew parameter. This extension retains the key properties of being upper-bounded and of applying to densities with potentially different supports. The proposed generalization also affords a better understanding of the “mechanism” of the Jensen–Shannon divergence itself. We also show how to directly obtain the weighted vector-skew Jensen–Shannon divergence from the decomposition of the KLD as the cross-entropy minus the entropy (i.e., the KLD as the relative entropy).

  • Second, we prove that weighted vector-skew Jensen–Shannon divergences are f-divergences (Theorem 1), and show how to build families of symmetric Jensen–Shannon-type divergences which can be controlled by a vector of parameters in Section 2.3, generalizing the work of [20] from scalar skewing to vector skewing. This may prove useful in applications by providing additional tuning parameters (which can be set, for example, by using cross-validation techniques).

  • Third, we consider the calculation of the Jensen–Shannon centroids in Section 3 for densities belonging to mixture families. Mixture families include the family of categorical distributions and the family of statistical mixtures sharing the same prescribed components. Mixture families are well-studied manifolds in information geometry [25]. We show how to compute the Jensen–Shannon centroid using a concave–convex numerical iterative optimization procedure [27]. The experimental results graphically compare the Jeffreys centroid with the Jensen–Shannon centroid for grey-valued image histograms.

2. Extending the Jensen–Shannon Divergence

2.1. Vector-Skew Jensen–Bregman Divergences and Jensen Diversities

Recall our notational shortcut: $(ab)_\alpha:=(1-\alpha)a+\alpha b$. For a $k$-dimensional vector $\alpha\in[0,1]^k$, a weight vector $w$ belonging to the $(k-1)$-dimensional open simplex $\Delta_k$, and a scalar $\gamma\in(0,1)$, let us define the following vector-skew $\alpha$-Jensen–Bregman divergence ($\alpha$-JBD) following [28]:

$\mathrm{JB}_F^{\alpha,\gamma,w}(\theta_1:\theta_2):=\sum_{i=1}^k w_i\,B_F\!\left((\theta_1\theta_2)_{\alpha_i}:(\theta_1\theta_2)_\gamma\right)\geq 0$, (18)

where BF is the Bregman divergence [29] induced by a strictly convex and smooth generator F:

$B_F(\theta_1:\theta_2):=F(\theta_1)-F(\theta_2)-\langle\theta_1-\theta_2,\nabla F(\theta_2)\rangle$, (19)

with $\langle\cdot,\cdot\rangle$ denoting the Euclidean inner product $\langle x,y\rangle=x^\top y$ (dot product). Expanding the Bregman divergence formulas in the expression of the $\alpha$-JBD and using the fact that

$(\theta_1\theta_2)_{\alpha_i}-(\theta_1\theta_2)_\gamma=(\gamma-\alpha_i)(\theta_1-\theta_2)$, (20)

we get the following expression:

$\mathrm{JB}_F^{\alpha,\gamma,w}(\theta_1:\theta_2)=\sum_{i=1}^k w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_\gamma\right)-\left\langle\sum_{i=1}^k w_i(\gamma-\alpha_i)(\theta_1-\theta_2),\nabla F\!\left((\theta_1\theta_2)_\gamma\right)\right\rangle$. (21)

The inner product term of Equation (21) vanishes when

$\gamma=\sum_{i=1}^k w_i\alpha_i=:\bar\alpha$. (22)

Thus, when $\gamma=\bar\alpha$ (assuming at least two distinct components in $\alpha$ so that $\bar\alpha\in(0,1)$), we get the simplified formula for the vector-skew $\alpha$-JBD:

$\mathrm{JB}_F^{\alpha,w}(\theta_1:\theta_2)=\sum_{i=1}^k w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_{\bar\alpha}\right)$. (23)

This vector-skew Jensen–Bregman divergence is always finite and amounts to a Jensen diversity [30] $\mathcal{J}_F$ induced by Jensen's inequality gap:

$\mathrm{JB}_F^{\alpha,w}(\theta_1:\theta_2)=\mathcal{J}_F\!\left((\theta_1\theta_2)_{\alpha_1},\ldots,(\theta_1\theta_2)_{\alpha_k};w_1,\ldots,w_k\right):=\sum_{i=1}^k w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_{\bar\alpha}\right)\geq 0$. (24)

The Jensen diversity is a quantity which arises as a generalization of the cluster variance when clustering with Bregman divergences instead of the ordinary squared Euclidean distance; see [29,30] for details. In the context of Bregman clustering, the Jensen diversity has been called the Bregman information [29] and motivated by rate distortion theory: Bregman information measures the minimum expected loss when encoding a set of points using a single point when the loss is measured using a Bregman divergence. In general, a k-point measure is called a diversity measure (for k>2), while a distance/divergence is the special case of a 2-point measure.

Conversely, in 1D, we may start from Jensen’s inequality for a strictly convex function F:

$\sum_{i=1}^k w_i F(\theta_i)\geq F\!\left(\sum_{i=1}^k w_i\theta_i\right)$. (25)

Let us write $[k]:=\{1,\ldots,k\}$, and define $\theta_m:=\min_{i\in[k]}\theta_i$ and $\theta_M:=\max_{i\in[k]}\theta_i>\theta_m$ (i.e., assuming at least two distinct values). The barycenter $\bar\theta=\sum_i w_i\theta_i=:(\theta_m\theta_M)_\gamma$ can be interpreted as the linear interpolation of the extremal values for some $\gamma\in(0,1)$. Let us write $\theta_i=(\theta_m\theta_M)_{\alpha_i}$ for $i\in[k]$ and proper values of the $\alpha_i$'s. Then, it follows that

$\bar\theta=\sum_i w_i\theta_i$, (26)
$=\sum_i w_i(\theta_m\theta_M)_{\alpha_i}$, (27)
$=\sum_i w_i\left((1-\alpha_i)\theta_m+\alpha_i\theta_M\right)$, (28)
$=\left(1-\sum_i w_i\alpha_i\right)\theta_m+\left(\sum_i w_i\alpha_i\right)\theta_M$, (29)
$=(\theta_m\theta_M)_{\sum_i w_i\alpha_i}=(\theta_m\theta_M)_\gamma$, (30)

so that $\gamma=\sum_i w_i\alpha_i=\bar\alpha$.

2.2. Vector-Skew Jensen–Shannon Divergences

Let $f(x)=x\log x-x$ be a strictly convex and smooth function on $(0,\infty)$. Then, the Bregman divergence induced by this univariate generator is

$B_f(p:q)=p\log\frac{p}{q}+q-p=\mathrm{kl}^+(p:q)$, (31)

the extended scalar Kullback–Leibler divergence.

We extend the scalar-skew Jensen–Shannon divergence as follows: $\mathrm{JS}^{\alpha,w}(p:q):=\mathrm{JB}_{-h}^{\alpha,\bar\alpha,w}(p:q)$, where $h$ denotes Shannon's entropy [4], so that the generator $F=-h$ (the negentropy) is strictly convex.

Definition 1

(Weighted vector-skew $(\alpha,w)$-Jensen–Shannon divergence). For a vector $\alpha\in[0,1]^k$ and a unit positive weight vector $w\in\Delta_k$, the $(\alpha,w)$-Jensen–Shannon divergence between two densities $p,q\in\bar{\mathcal{P}}_1$ is defined by:

$\mathrm{JS}^{\alpha,w}(p:q):=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)=h\!\left((pq)_{\bar\alpha}\right)-\sum_{i=1}^k w_i\,h\!\left((pq)_{\alpha_i}\right)$,

with $\bar\alpha=\sum_{i=1}^k w_i\alpha_i$, where $h(p)=-\int p(x)\log p(x)\,d\mu(x)$ denotes the Shannon entropy [4] (i.e., $-h$ is strictly convex).

This definition generalizes the ordinary JSD; we recover the ordinary Jensen–Shannon divergence when $k=2$, $\alpha_1=0$, $\alpha_2=1$, and $w_1=w_2=\frac{1}{2}$ with $\bar\alpha=\frac{1}{2}$: $\mathrm{JS}(p,q)=\mathrm{JS}^{(0,1),(\frac{1}{2},\frac{1}{2})}(p:q)$.
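Definition 1 is easy to check numerically. The sketch below (illustrative Python written for this text) evaluates both the KL-sum form and the entropy-gap form of the $(\alpha,w)$-JSD, and verifies that $\alpha=(0,1)$, $w=(\frac{1}{2},\frac{1}{2})$ recovers the ordinary JSD:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def mix(p, q, a):
    # (pq)_a = (1 - a) p + a q
    return (1 - a) * np.asarray(p, float) + a * np.asarray(q, float)

def js_vec(alphas, ws, p, q):
    # Definition 1, KL-sum form: sum_i w_i KL((pq)_{a_i} : (pq)_{abar}).
    abar = float(np.dot(ws, alphas))
    return sum(w * kl(mix(p, q, a), mix(p, q, abar)) for a, w in zip(alphas, ws))

def js_vec_gap(alphas, ws, p, q):
    # Equivalent entropy-gap form: h((pq)_{abar}) - sum_i w_i h((pq)_{a_i}).
    abar = float(np.dot(ws, alphas))
    return entropy(mix(p, q, abar)) - sum(w * entropy(mix(p, q, a)) for a, w in zip(alphas, ws))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.1, 0.3, 0.6])
alphas, ws = [0.0, 1.0, 1.0 / 3], [1.0 / 3] * 3  # the example of Section 2.3
assert abs(js_vec(alphas, ws, p, q) - js_vec_gap(alphas, ws, p, q)) < 1e-12
# alpha = (0, 1), w = (1/2, 1/2) recovers the ordinary JSD:
mid = 0.5 * (p + q)
assert abs(js_vec([0.0, 1.0], [0.5, 0.5], p, q) - (0.5 * kl(p, mid) + 0.5 * kl(q, mid))) < 1e-12
```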

Let $\mathrm{KL}_{\alpha,\beta}(p:q):=\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)$. Then, we have $\mathrm{KL}_{\alpha,\beta}(q:p)=\mathrm{KL}_{1-\alpha,1-\beta}(p:q)$. Using this $(\alpha,\beta)$-KLD, we have the following identity:

$\mathrm{JS}^{\alpha,w}(p:q)=\sum_{i=1}^k w_i\,\mathrm{KL}_{\alpha_i,\bar\alpha}(p:q)$, (32)
$=\sum_{i=1}^k w_i\,\mathrm{KL}_{1-\alpha_i,1-\bar\alpha}(q:p)=\mathrm{JS}^{1_k-\alpha,w}(q:p)$, (33)

since $\sum_{i=1}^k w_i(1-\alpha_i)=\overline{1_k-\alpha}=1-\bar\alpha$, where $1_k=(1,\ldots,1)$ is a $k$-dimensional vector of ones.

A very interesting property is that the vector-skew Jensen–Shannon divergences are f-divergences [22].

Theorem 1.

The vector-skew Jensen–Shannon divergences $\mathrm{JS}^{\alpha,w}(p:q)$ are f-divergences for the generator $f^{\alpha,w}(u)=\sum_{i=1}^k w_i\left(\alpha_i u+(1-\alpha_i)\right)\log\frac{(1-\alpha_i)+\alpha_i u}{(1-\bar\alpha)+\bar\alpha u}$ with $\bar\alpha=\sum_{i=1}^k w_i\alpha_i$.

Proof. 

First, let us observe that a positively weighted sum of f-divergences is an f-divergence: $\sum_{i=1}^k w_i I_{f_i}(p:q)=I_f(p:q)$ for the generator $f(u)=\sum_{i=1}^k w_i f_i(u)$.

Now, let us express the divergence KLα,β(p:q) as an f-divergence:

$\mathrm{KL}_{\alpha,\beta}(p:q)=I_{f_{\alpha,\beta}}(p:q)$, (34)

with generator

$f_{\alpha,\beta}(u)=(\alpha u+1-\alpha)\log\frac{(1-\alpha)+\alpha u}{(1-\beta)+\beta u}$. (35)

Thus, it follows that

$\mathrm{JS}^{\alpha,w}(p:q)=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)$, (36)
$=\sum_{i=1}^k w_i\,I_{f_{\alpha_i,\bar\alpha}}(p:q)$, (37)
$=I_{\sum_{i=1}^k w_i f_{\alpha_i,\bar\alpha}}(p:q)$. (38)

Therefore, the vector-skew Jensen–Shannon divergence is an f-divergence for the following generator:

$f^{\alpha,w}(u)=\sum_{i=1}^k w_i\left(\alpha_i u+(1-\alpha_i)\right)\log\frac{(1-\alpha_i)+\alpha_i u}{(1-\bar\alpha)+\bar\alpha u}$, (39)

where $\bar\alpha=\sum_{i=1}^k w_i\alpha_i$.

When α=(0,1) and w=(12,12), we recover the f-divergence generator for the JSD:

$f_{\mathrm{JS}}(u)=\frac{1}{2}\log\frac{1}{\frac{1}{2}+\frac{1}{2}u}+\frac{1}{2}u\log\frac{u}{\frac{1}{2}+\frac{1}{2}u}$, (40)
$=\frac{1}{2}\left(\log\frac{2}{1+u}+u\log\frac{2u}{1+u}\right)$. (41)

Observe that $(f^{\alpha,w})^*(u)=uf^{\alpha,w}(1/u)=f^{1-\alpha,w}(u)$, where $1-\alpha:=(1-\alpha_1,\ldots,1-\alpha_k)$.

We also refer the reader to Theorem 4.1 of [31], which defines skew f-divergences from any f-divergence.  □
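For the special case $\alpha=(0,1)$ and $w=(\frac{1}{2},\frac{1}{2})$, the generator of Equation (41) can be checked numerically against the entropy-based JSD formula. The following illustrative sketch (restricted to strictly positive histograms so that the ratio $p/q$ is well defined) is our own, not the paper's code:

```python
import numpy as np

def f_js(u):
    # f-divergence generator of the JSD, Eq. (41).
    return 0.5 * (np.log(2 / (1 + u)) + u * np.log(2 * u / (1 + u)))

def I_f(f, p, q):
    # Discrete Csiszar f-divergence I_f(p:q) = sum_x q(x) f(p(x)/q(x)),
    # assuming strictly positive bins.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

def js(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
assert abs(I_f(f_js, p, q) - js(p, q)) < 1e-12  # Theorem 1, JSD case
assert abs(f_js(1.0)) < 1e-15                   # f(1) = 0, as required
```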

Remark 1.

Since the vector-skew Jensen divergence is an f-divergence, we easily obtain Fano and Pinsker inequalities following [32], or reverse Pinsker inequalities following [33,34] (i.e., upper bounds for the vector-skew Jensen divergences using the total variation metric distance), data processing inequalities using [35], etc.

Next, we show that $\mathrm{KL}_{\alpha,\beta}$ (and $\mathrm{JS}^{\alpha,w}$) are separable convex divergences. Since the f-divergences are separable convex, the $\mathrm{KL}_{\alpha,\beta}$ divergences and the $\mathrm{JS}^{\alpha,w}$ divergences are separable convex. For the sake of completeness, we report a simple explicit proof below.

Theorem 2

(Separable convexity). The divergence $\mathrm{KL}_{\alpha,\beta}(p:q)$ is strictly separable convex for $\alpha\neq\beta$ and $x\in\mathcal{X}_p\cap\mathcal{X}_q$.

Proof. 

Let us calculate the second partial derivative of KLα,β(x:y) with respect to x, and show that it is strictly positive:

$\frac{\partial^2}{\partial x^2}\mathrm{KL}_{\alpha,\beta}(x:y)=\frac{(\beta-\alpha)^2y^2}{(xy)_\alpha\,(xy)_\beta^2}>0$, (42)

for $x,y>0$. Thus, $\mathrm{KL}_{\alpha,\beta}$ is strictly convex in its left argument. Similarly, since $\mathrm{KL}_{\alpha,\beta}(y:x)=\mathrm{KL}_{1-\alpha,1-\beta}(x:y)$, we deduce that $\mathrm{KL}_{\alpha,\beta}$ is strictly convex in its right argument. Therefore, the divergence $\mathrm{KL}_{\alpha,\beta}$ is separable convex.  □

It follows that the divergence $\mathrm{JS}^{\alpha,w}(p:q)$ is strictly separable convex, since it is a convex combination of weighted $\mathrm{KL}_{\alpha_i,\bar\alpha}$ divergences.

Another way to derive the vector-skew JSD is to decompose the KLD as the difference of the cross-entropy h× minus the entropy h (i.e., KLD is also called the relative entropy):

$\mathrm{KL}(p:q)=h^\times(p:q)-h(p)$, (43)

where $h^\times(p:q):=-\int p\log q\,d\mu$ denotes the cross-entropy and $h(p):=h^\times(p:p)$ the (self cross-)entropy. Since $\alpha_1h^\times(p_1:q)+\alpha_2h^\times(p_2:q)=h^\times(\alpha_1p_1+\alpha_2p_2:q)$ (for $\alpha_2=1-\alpha_1$), it follows that

$\mathrm{JS}^{\alpha,w}(p:q):=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_\gamma\right)$, (44)
$=\sum_{i=1}^k w_i\left(h^\times\!\left((pq)_{\alpha_i}:(pq)_\gamma\right)-h\!\left((pq)_{\alpha_i}\right)\right)$, (45)
$=h^\times\!\left(\sum_{i=1}^k w_i(pq)_{\alpha_i}:(pq)_\gamma\right)-\sum_{i=1}^k w_i\,h\!\left((pq)_{\alpha_i}\right)$. (46)

Here, the “trick” is to choose $\gamma=\bar\alpha$ in order to “convert” the cross-entropy into an entropy: since $\sum_{i=1}^k w_i(pq)_{\alpha_i}=(pq)_{\bar\alpha}$, we have $h^\times\!\left(\sum_{i=1}^k w_i(pq)_{\alpha_i}:(pq)_\gamma\right)=h\!\left((pq)_{\bar\alpha}\right)$ when $\gamma=\bar\alpha$. Then, we end up with

$\mathrm{JS}^{\alpha,w}(p:q)=h\!\left((pq)_{\bar\alpha}\right)-\sum_{i=1}^k w_i\,h\!\left((pq)_{\alpha_i}\right)$. (47)

When $\alpha=(\alpha_1,\alpha_2)$ with $\alpha_1=0$ and $\alpha_2=1$ and $w=(w_1,w_2)=\left(\frac{1}{2},\frac{1}{2}\right)$, we have $\bar\alpha=\frac{1}{2}$, and we recover the Jensen–Shannon divergence:

$\mathrm{JS}(p:q)=h\!\left(\frac{p+q}{2}\right)-\frac{h(p)+h(q)}{2}$. (48)

Notice that Equation (2) is the usual definition of the Jensen–Shannon divergence, while Equation (48) is the reduced formula of the JSD, which can be interpreted as a Jensen gap for Shannon entropy; hence its name: The Jensen–Shannon divergence.

Moreover, if we consider the cross-entropy/entropy extended to positive densities p˜ and q˜:

$h_+^\times(\tilde p:\tilde q)=\int(-\tilde p\log\tilde q+\tilde q)\,d\mu,\qquad h_+(\tilde p)=h_+^\times(\tilde p:\tilde p)=\int(-\tilde p\log\tilde p+\tilde p)\,d\mu$, (49)

we get:

$\mathrm{JS}_+^{\alpha,w}(\tilde p:\tilde q)=\sum_{i=1}^k w_i\,\mathrm{KL}^+\!\left((\tilde p\tilde q)_{\alpha_i}:(\tilde p\tilde q)_{\bar\alpha}\right)=h_+\!\left((\tilde p\tilde q)_{\bar\alpha}\right)-\sum_{i=1}^k w_i\,h_+\!\left((\tilde p\tilde q)_{\alpha_i}\right)$. (50)

Next, we shall prove that our generalization of the skew Jensen–Shannon divergence to vector-skewing is always bounded. We first start by a lemma bounding the KLD between two mixtures sharing the same components:

Lemma 1

(KLD between two $w$-mixtures). For $\alpha\in[0,1]$ and $\beta\in(0,1)$, we have:

$\mathrm{KL}_{\alpha,\beta}(p:q)=\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$.

Proof. 

For p(x),q(x)>0, we have

$\frac{(1-\alpha)p(x)+\alpha q(x)}{(1-\beta)p(x)+\beta q(x)}\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$. (51)

Indeed, by considering the two cases $\alpha\leq\beta$ (or equivalently, $1-\alpha\geq 1-\beta$) and $\alpha\geq\beta$ (or equivalently, $1-\alpha\leq 1-\beta$), we check that $(1-\alpha)p(x)\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}(1-\beta)p(x)$ and $\alpha q(x)\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}\beta q(x)$. Thus, we have $(1-\alpha)p(x)+\alpha q(x)\leq\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}\left((1-\beta)p(x)+\beta q(x)\right)$. Therefore, it follows that:

$\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\int(pq)_\alpha\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}d\mu=\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$. (52)

Notice that we can interpret $\log\max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}=\max\left\{\log\frac{1-\alpha}{1-\beta},\log\frac{\alpha}{\beta}\right\}$ as the $\infty$-Rényi divergence [36,37] between the two two-point distributions $(\alpha,1-\alpha)$ and $(\beta,1-\beta)$. See Theorem 6 of [36].

A weaker upper bound is $\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\log\frac{1}{\beta(1-\beta)}$. Indeed, let us form a partition of the sample space $\mathcal{X}$ into two dominance regions:

  • $R_p:=\{x\in\mathcal{X}:q(x)\leq p(x)\}$ and

  • $R_q:=\{x\in\mathcal{X}:q(x)>p(x)\}$.

We have $(pq)_\alpha(x)=(1-\alpha)p(x)+\alpha q(x)\leq p(x)$ for $x\in R_p$ and $(pq)_\alpha(x)\leq q(x)$ for $x\in R_q$. It follows that

$\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq\int_{R_p}(pq)_\alpha(x)\log\frac{p(x)}{(1-\beta)p(x)}\,d\mu(x)+\int_{R_q}(pq)_\alpha(x)\log\frac{q(x)}{\beta q(x)}\,d\mu(x)$.

That is, $\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)\leq-\log(1-\beta)-\log\beta=\log\frac{1}{\beta(1-\beta)}$. Notice that we allow $\alpha\in\{0,1\}$ but do not allow $\beta$ to take the extreme values (i.e., $\beta\in(0,1)$).  □

In fact, it is known that for both $\alpha,\beta\in(0,1)$, computing $\mathrm{KL}\!\left((pq)_\alpha:(pq)_\beta\right)$ amounts to computing a Bregman divergence for the Shannon negentropy generator, since $\{(pq)_\gamma:\gamma\in(0,1)\}$ defines a mixture family [38] of order 1 in information geometry. Hence, it is always finite, as Bregman divergences are always finite (but not necessarily bounded).

By using the fact that

$\mathrm{JS}^{\alpha,w}(p:q)=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)$, (53)

we conclude that the vector-skew Jensen–Shannon divergence is upper-bounded:

Lemma 2

(Bounded $(w,\alpha)$-Jensen–Shannon divergence). $\mathrm{JS}^{\alpha,w}$ is bounded by $\log\frac{1}{\bar\alpha(1-\bar\alpha)}$, where $\bar\alpha=\sum_{i=1}^k w_i\alpha_i\in(0,1)$.

Proof. 

We have $\mathrm{JS}^{\alpha,w}(p:q)=\sum_i w_i\,\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)$. Since $0\leq\mathrm{KL}\!\left((pq)_{\alpha_i}:(pq)_{\bar\alpha}\right)\leq\log\frac{1}{\bar\alpha(1-\bar\alpha)}$, it follows that we have

$0\leq\mathrm{JS}^{\alpha,w}(p:q)\leq\log\frac{1}{\bar\alpha(1-\bar\alpha)}$.

Notice that we also have

$\mathrm{JS}^{\alpha,w}(p:q)\leq\sum_i w_i\log\max\left\{\frac{1-\alpha_i}{1-\bar\alpha},\frac{\alpha_i}{\bar\alpha}\right\}$.

 □
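The bounds of Lemmas 1 and 2 can be stress-tested numerically on random categorical distributions. The following sketch (our own illustration, using NumPy's Dirichlet sampler and arbitrary tolerances) draws random pairs and checks every inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def mix(p, q, a):
    # (pq)_a = (1 - a) p + a q
    return (1 - a) * p + a * q

for _ in range(200):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    a = rng.uniform(0.0, 1.0)
    b = rng.uniform(0.05, 0.95)  # beta must stay away from {0, 1}
    lhs = kl(mix(p, q, a), mix(p, q, b))
    # Lemma 1: KL_{a,b}(p:q) <= log max{(1-a)/(1-b), a/b}
    assert lhs <= np.log(max((1 - a) / (1 - b), a / b)) + 1e-10
    # weaker bound: KL_{a,b}(p:q) <= log 1/(b(1-b))
    assert lhs <= np.log(1.0 / (b * (1 - b))) + 1e-10
    # Lemma 2 for a random vector-skew JSD
    alphas = rng.uniform(0.0, 1.0, size=3)
    ws = rng.dirichlet(np.ones(3))
    abar = float(np.dot(ws, alphas))
    jsv = sum(w * kl(mix(p, q, ai), mix(p, q, abar)) for ai, w in zip(alphas, ws))
    assert -1e-12 <= jsv <= np.log(1.0 / (abar * (1 - abar))) + 1e-10
```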

The vector-skew Jensen–Shannon divergence is symmetric if and only if for each index $i\in[k]$ there exists a matching index $\sigma(i)$ such that $\alpha_{\sigma(i)}=1-\alpha_i$ and $w_{\sigma(i)}=w_i$.

For example, we may define the symmetric scalar α-skew Jensen–Shannon divergence as

$\mathrm{JS}_s^\alpha(p,q)=\frac{1}{2}\mathrm{KL}\!\left((pq)_\alpha:(pq)_{\frac{1}{2}}\right)+\frac{1}{2}\mathrm{KL}\!\left((pq)_{1-\alpha}:(pq)_{\frac{1}{2}}\right)$, (54)
$=\frac{1}{2}\int(pq)_\alpha\log\frac{(pq)_\alpha}{(pq)_{\frac{1}{2}}}\,d\mu+\frac{1}{2}\int(pq)_{1-\alpha}\log\frac{(pq)_{1-\alpha}}{(pq)_{\frac{1}{2}}}\,d\mu$, (55)
$=\frac{1}{2}\int(qp)_{1-\alpha}\log\frac{(qp)_{1-\alpha}}{(qp)_{\frac{1}{2}}}\,d\mu+\frac{1}{2}\int(qp)_\alpha\log\frac{(qp)_\alpha}{(qp)_{\frac{1}{2}}}\,d\mu$, (56)
$=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((pq)_{1-\alpha}\right)}{2}$, (57)
$=:\mathrm{JS}_s^\alpha(q,p)$, (58)

since it holds that $(ab)_c=(ba)_{1-c}$ for any $a,b$ and $c\in[0,1]$. Note that $\mathrm{JS}_s^\alpha(p,q)\neq\mathrm{JS}_\alpha(p,q)$.

Remark 2.

We can always symmetrize a vector-skew Jensen–Shannon divergence by doubling the dimension of the skewing vector. Let $\alpha=(\alpha_1,\ldots,\alpha_k)$ and $w$ be the vector parameters of an asymmetric vector-skew JSD, and consider $\alpha'=(1-\alpha_1,\ldots,1-\alpha_k)$ and $w$ to be the parameters of $\mathrm{JS}^{\alpha',w}$. Then, $\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}$ is a symmetric vector-skew JSD:

$\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}(p:q):=\frac{1}{2}\mathrm{JS}^{\alpha,w}(p:q)+\frac{1}{2}\mathrm{JS}^{\alpha',w}(p:q)$, (59)
$=\frac{1}{2}\mathrm{JS}^{\alpha,w}(p:q)+\frac{1}{2}\mathrm{JS}^{\alpha,w}(q:p)=\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}(q:p)$. (60)

Since the vector-skew Jensen–Shannon divergence is an f-divergence for the generator $f^{\alpha,w}$ (Theorem 1), we can take the generator $f_s^{\alpha,w}(u)=\frac{f^{\alpha,w}(u)+(f^{\alpha,w})^*(u)}{2}$ to define the symmetrized f-divergence, where $(f^{\alpha,w})^*(u)=uf^{\alpha,w}\!\left(\frac{1}{u}\right)$ denotes the convex conjugate function. When $f^{\alpha,w}$ yields a symmetric f-divergence $I_{f^{\alpha,w}}$, we can apply the generic upper bound of f-divergences (i.e., $I_f\leq f(0)+f^*(0)$) to get the upper bound on the symmetric vector-skew Jensen–Shannon divergences:

$I_{f^{\alpha,w}}(p:q)\leq f^{\alpha,w}(0)+(f^{\alpha,w})^*(0)$, (61)
$=\sum_{i=1}^k w_i\left((1-\alpha_i)\log\frac{1-\alpha_i}{1-\bar\alpha}+\alpha_i\log\frac{\alpha_i}{\bar\alpha}\right)$, (62)

since

$(f^{\alpha,w})^*(u)=uf^{\alpha,w}\!\left(\frac{1}{u}\right)$, (63)
$=\sum_{i=1}^k w_i\left((1-\alpha_i)u+\alpha_i\right)\log\frac{(1-\alpha_i)u+\alpha_i}{(1-\bar\alpha)u+\bar\alpha}$. (64)

For example, consider the ordinary Jensen–Shannon divergence with $w=\left(\frac{1}{2},\frac{1}{2}\right)$ and $\alpha=(0,1)$. Then, we find $\mathrm{JS}(p,q)=I_{f^{(0,1),(\frac{1}{2},\frac{1}{2})}}(p:q)\leq\frac{1}{2}\log 2+\frac{1}{2}\log 2=\log 2$, the usual upper bound of the JSD.

As a side note, let us notice that our notation (pq)α allows one to compactly write the following property:

Property 1.

We have $q=(qq)_\lambda$ for any $\lambda\in[0,1]$, and $\left((p_1p_2)_\lambda(q_1q_2)_\lambda\right)_\alpha=\left((p_1q_1)_\alpha(p_2q_2)_\alpha\right)_\lambda$ for any $\alpha,\lambda\in[0,1]$.

Proof. 

Clearly, $q=(1-\lambda)q+\lambda q=:(qq)_\lambda$ for any $\lambda\in[0,1]$. Now, we have

$\left((p_1p_2)_\lambda(q_1q_2)_\lambda\right)_\alpha=(1-\alpha)(p_1p_2)_\lambda+\alpha(q_1q_2)_\lambda$, (65)
$=(1-\alpha)\left((1-\lambda)p_1+\lambda p_2\right)+\alpha\left((1-\lambda)q_1+\lambda q_2\right)$, (66)
$=(1-\lambda)\left((1-\alpha)p_1+\alpha q_1\right)+\lambda\left((1-\alpha)p_2+\alpha q_2\right)$, (67)
$=(1-\lambda)(p_1q_1)_\alpha+\lambda(p_2q_2)_\alpha$, (68)
$=\left((p_1q_1)_\alpha(p_2q_2)_\alpha\right)_\lambda$. (69)

 □

2.3. Building Symmetric Families of Vector-Skewed Jensen–Shannon Divergences

We can build infinitely many vector-skew Jensen–Shannon divergences. For example, consider $\alpha=\left(0,1,\frac{1}{3}\right)$ and $w=\left(\frac{1}{3},\frac{1}{3},\frac{1}{3}\right)$. Then, $\bar\alpha=\frac{1}{3}+\frac{1}{9}=\frac{4}{9}$, and

$\mathrm{JS}^{\alpha,w}(p:q)=h\!\left((pq)_{\frac{4}{9}}\right)-\frac{h(p)+h(q)+h\!\left((pq)_{\frac{1}{3}}\right)}{3}\neq\mathrm{JS}^{\alpha,w}(q:p)$. (70)

Interestingly, we can also build infinitely many families of symmetric vector-skew Jensen–Shannon divergences. For example, consider these two examples that illustrate the construction process:

  • Consider $k=2$. Let $(w,1-w)$ denote the weight vector, and $\alpha=(\alpha_1,\alpha_2)$ the skewing vector. We have $\bar\alpha=w\alpha_1+(1-w)\alpha_2=\alpha_2+w(\alpha_1-\alpha_2)$. The vector-skew JSD is symmetric iff. $w=1-w=\frac{1}{2}$ (with $\bar\alpha=\frac{\alpha_1+\alpha_2}{2}$) and $\alpha_2=1-\alpha_1$. In that case, we have $\bar\alpha=\frac{1}{2}$, and we obtain the following family of symmetric Jensen–Shannon divergences:
    $\mathrm{JS}^{(\alpha,1-\alpha),(\frac{1}{2},\frac{1}{2})}(p,q)=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((pq)_{1-\alpha}\right)}{2}$, (71)
    $=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((qp)_\alpha\right)}{2}=\mathrm{JS}^{(\alpha,1-\alpha),(\frac{1}{2},\frac{1}{2})}(q,p)$. (72)
  • Consider $k=4$, weight vector $w=\left(\frac{1}{3},\frac{1}{3},\frac{1}{6},\frac{1}{6}\right)$, and skewing vector $\alpha=(\alpha_1,1-\alpha_1,\alpha_2,1-\alpha_2)$ for $\alpha_1,\alpha_2\in(0,1)$. Then, $\bar\alpha=\frac{1}{2}$, and we get the following family of symmetric vector-skew JSDs:
    $\mathrm{JS}^{(\alpha_1,\alpha_2)}(p,q)=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{2h\!\left((pq)_{\alpha_1}\right)+2h\!\left((pq)_{1-\alpha_1}\right)+h\!\left((pq)_{\alpha_2}\right)+h\!\left((pq)_{1-\alpha_2}\right)}{6}$, (73)
    $=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{2h\!\left((pq)_{\alpha_1}\right)+2h\!\left((qp)_{\alpha_1}\right)+h\!\left((pq)_{\alpha_2}\right)+h\!\left((qp)_{\alpha_2}\right)}{6}$, (74)
    $=\mathrm{JS}^{(\alpha_1,\alpha_2)}(q,p)$. (75)
  • We can similarly carry on the construction of such symmetric JSDs by increasing the dimensionality of the skewing vector.

In fact, we can define

$\mathrm{JS}_s^{\alpha,w}(p,q):=h\!\left((pq)_{\frac{1}{2}}\right)-\sum_{i=1}^k w_i\,\frac{h\!\left((pq)_{\alpha_i}\right)+h\!\left((pq)_{1-\alpha_i}\right)}{2}=\sum_{i=1}^k w_i\,\mathrm{JS}_s^{\alpha_i}(p,q)$, (76)

with

$\mathrm{JS}_s^\alpha(p,q):=h\!\left((pq)_{\frac{1}{2}}\right)-\frac{h\!\left((pq)_\alpha\right)+h\!\left((pq)_{1-\alpha}\right)}{2}$. (77)

3. Jensen–Shannon Centroids on Mixture Families

3.1. Mixture Families and Jensen–Shannon Divergences

Consider a mixture family in information geometry [25]. That is, let us give a prescribed set of $D+1$ linearly independent probability densities $p_0(x),\ldots,p_D(x)$ defined on the sample space $\mathcal{X}$. A mixture family $\mathcal{M}$ of order $D$ consists of all strictly convex combinations of these component densities:

$\mathcal{M}:=\left\{m(x;\theta):=\sum_{i=1}^D\theta_ip_i(x)+\left(1-\sum_{i=1}^D\theta_i\right)p_0(x)\;:\;\theta_i>0,\;\sum_{i=1}^D\theta_i<1\right\}$. (78)

For example, the family of categorical distributions (sometimes called “multinoulli” distributions) is a mixture family [25]:

$\mathcal{M}=\left\{m_\theta(x)=\sum_{i=1}^D\theta_i\delta(x-x_i)+\left(1-\sum_{i=1}^D\theta_i\right)\delta(x-x_0)\right\}$, (79)

where $\delta(x)$ is the Dirac distribution (i.e., $\delta(x)=1$ for $x=0$ and $\delta(x)=0$ for $x\neq 0$). Note that the mixture family of categorical distributions can also be interpreted as an exponential family.

Notice that the linear independence assumption on the probability densities ensures an identifiable model: the map $\theta\mapsto m(x;\theta)$ is one-to-one.

The KL divergence between two densities of a mixture family $\mathcal{M}$ amounts to a Bregman divergence for the Shannon negentropy generator $F(\theta)=-h(m_\theta)$ (see [38]):

$\mathrm{KL}(m_{\theta_1}:m_{\theta_2})=B_F(\theta_1:\theta_2)=B_{-h(m_\theta)}(\theta_1:\theta_2)$. (80)

On a mixture manifold $\mathcal{M}$, the mixture density $(1-\alpha)m_{\theta_1}+\alpha m_{\theta_2}$ of two mixtures $m_{\theta_1}$ and $m_{\theta_2}$ of $\mathcal{M}$ also belongs to $\mathcal{M}$:

$(1-\alpha)m_{\theta_1}+\alpha m_{\theta_2}=m_{(\theta_1\theta_2)_\alpha}\in\mathcal{M}$, (81)

where we extend the notation $(\theta_1\theta_2)_\alpha:=(1-\alpha)\theta_1+\alpha\theta_2$ to vectors $\theta_1$ and $\theta_2$ componentwise: $\left((\theta_1\theta_2)_\alpha\right)^i=\left(\theta_1^i\theta_2^i\right)_\alpha$.

Thus, the vector-skew JSD amounts to a vector-skew Jensen diversity for the Shannon negentropy convex function $F(\theta)=-h(m_\theta)$:

$\mathrm{JS}^{\alpha,w}(m_{\theta_1}:m_{\theta_2})=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left((m_{\theta_1}m_{\theta_2})_{\alpha_i}:(m_{\theta_1}m_{\theta_2})_{\bar\alpha}\right)$, (82)
$=\sum_{i=1}^k w_i\,\mathrm{KL}\!\left(m_{(\theta_1\theta_2)_{\alpha_i}}:m_{(\theta_1\theta_2)_{\bar\alpha}}\right)$, (83)
$=\sum_{i=1}^k w_i\,B_F\!\left((\theta_1\theta_2)_{\alpha_i}:(\theta_1\theta_2)_{\bar\alpha}\right)$, (84)
$=\mathrm{JB}_F^{\alpha,\bar\alpha,w}(\theta_1:\theta_2)$, (85)
$=\sum_{i=1}^k w_i\,F\!\left((\theta_1\theta_2)_{\alpha_i}\right)-F\!\left((\theta_1\theta_2)_{\bar\alpha}\right)$, (86)
$=h\!\left(m_{(\theta_1\theta_2)_{\bar\alpha}}\right)-\sum_{i=1}^k w_i\,h\!\left(m_{(\theta_1\theta_2)_{\alpha_i}}\right)$. (87)

3.2. Jensen–Shannon Centroids

Given a set of $n$ mixture densities $m_{\theta_1},\ldots,m_{\theta_n}$ of $\mathcal{M}$, we seek to calculate the skew-vector Jensen–Shannon centroid (or barycenter for non-uniform weights), defined as $m_{\theta^*}$, where $\theta^*$ is the minimizer of the following objective function (or loss function):

$L(\theta):=\sum_{j=1}^n\omega_j\,\mathrm{JS}^{\alpha,w}(m_{\theta_j}:m_\theta)$, (88)

where $\omega\in\Delta_n$ is the weight vector of the densities (uniform weights for the centroid and non-uniform weights for a barycenter). This definition of the skew-vector Jensen–Shannon centroid is a generalization of the Fréchet mean (the Fréchet mean may not be unique, as is the case on the sphere for two antipodal points, whose Fréchet means with respect to the geodesic metric distance form a great circle) [39] to non-metric spaces. Since the divergence $\mathrm{JS}^{\alpha,w}$ is strictly separable convex, the Jensen–Shannon-type centroids are unique when they exist.

Plugging Equation (82) into Equation (88), we get that the calculation of the Jensen–Shannon centroid amounts to the following minimization problem:

$L(\theta)=\sum_{j=1}^n\omega_j\left(\sum_{i=1}^k w_i\,F\!\left((\theta_j\theta)_{\alpha_i}\right)-F\!\left((\theta_j\theta)_{\bar\alpha}\right)\right)$. (89)

This optimization is a Difference of Convex (DC) programming optimization, for which we can use the ConCave–Convex procedure [27,40] (CCCP). Indeed, let us define the following two convex functions:

$A(\theta)=\sum_{j=1}^n\sum_{i=1}^k\omega_jw_i\,F\!\left((\theta_j\theta)_{\alpha_i}\right)$, (90)
$B(\theta)=\sum_{j=1}^n\omega_j\,F\!\left((\theta_j\theta)_{\bar\alpha}\right)$. (91)

Both functions A(θ) and B(θ) are convex since F is convex. Then, the minimization problem of Equation (89) to solve can be rewritten as:

$\min_\theta A(\theta)-B(\theta)$. (92)

This is a DC programming optimization problem which can be solved iteratively by initializing θ to an arbitrary value θ(0) (say, the centroid of the θis), and then by updating the parameter at step t using the CCCP [27] as follows:

$\theta^{(t+1)}=(\nabla B)^{-1}\!\left(\nabla A(\theta^{(t)})\right)$. (93)

Compared to a gradient descent local optimization, there is no required step size (also called “learning” rate) in CCCP.

We have $\nabla A(\theta)=\sum_{j=1}^n\sum_{i=1}^k\omega_jw_i\alpha_i\nabla F\!\left((\theta_j\theta)_{\alpha_i}\right)$ and $\nabla B(\theta)=\sum_{j=1}^n\omega_j\bar\alpha\nabla F\!\left((\theta_j\theta)_{\bar\alpha}\right)$.

The CCCP converges to a local optimum $\theta^*$ where the support hyperplanes of the function graphs of $A$ and $B$ at $\theta^*$ are parallel to each other, as depicted in Figure 1. The set of stationary points is $\{\theta:\nabla A(\theta)=\nabla B(\theta)\}$. In practice, the delicate step is to invert $\nabla B$. Next, we show how to implement this algorithm for the Jensen–Shannon centroid of a set of categorical distributions (i.e., normalized histograms with all non-empty bins).

Figure 1.

Figure 1

The ConCave–Convex Procedure (CCCP) iteratively updates the parameter $\theta$ by aligning the support hyperplanes at $\theta$. In the limit case of convergence to $\theta^*$, the support hyperplanes at $\theta^*$ are parallel to each other. CCCP finds a local minimum.

3.2.1. Jensen–Shannon Centroids of Categorical Distributions

To illustrate the method, let us consider the mixture family of categorical distributions [25]:

$\mathcal{M}=\left\{m_\theta(x)=\sum_{i=1}^D\theta_i\delta(x-x_i)+\left(1-\sum_{i=1}^D\theta_i\right)\delta(x-x_0)\right\}$. (94)

The Shannon negentropy is

$F(\theta)=-h(m_\theta)=\sum_{i=1}^D\theta_i\log\theta_i+\left(1-\sum_{i=1}^D\theta_i\right)\log\left(1-\sum_{i=1}^D\theta_i\right)$. (95)

We have the partial derivatives

$\nabla F(\theta)=\left[\frac{\partial F}{\partial\theta_i}\right]_i,\qquad\frac{\partial F}{\partial\theta_i}=\log\frac{\theta_i}{1-\sum_{j=1}^D\theta_j}$. (96)

Inverting the gradient $\nabla F$ requires us to solve the equation $\nabla F(\theta)=\eta$ so that we get $\theta=(\nabla F)^{-1}(\eta)$. We find that

$\nabla F^*(\eta)=(\nabla F)^{-1}(\eta)=\frac{1}{1+\sum_{j=1}^D\exp(\eta_j)}\left[\exp(\eta_i)\right]_i,\qquad\theta_i=\left((\nabla F)^{-1}(\eta)\right)_i=\frac{\exp(\eta_i)}{1+\sum_{j=1}^D\exp(\eta_j)},\quad i\in[D]$. (97)
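In code, the dual conversion of Equation (97) is a softmax-like map inverting the gradient of Equation (96). A minimal sketch (illustrative, with our own function names):

```python
import numpy as np

def grad_F(theta):
    # Eq. (96): eta_i = dF/dtheta_i = log(theta_i / (1 - sum_j theta_j)).
    theta = np.asarray(theta, float)
    return np.log(theta / (1.0 - theta.sum()))

def grad_F_inv(eta):
    # Eq. (97): theta_i = exp(eta_i) / (1 + sum_j exp(eta_j)).
    e = np.exp(np.asarray(eta, float))
    return e / (1.0 + e.sum())

theta = np.array([0.2, 0.3, 0.4])  # D = 3, so theta_0 = 0.1
eta = grad_F(theta)
# the two maps are inverse of each other:
assert np.allclose(grad_F_inv(eta), theta)
```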

Table 1 summarizes the dual view of the family of categorical distributions, either interpreted as an exponential family or as a mixture family.

Table 1.

Two views of the family of categorical distributions with d choices: An exponential family or a mixture family of order D=d1. Note that the Bregman divergence associated to the exponential family view corresponds to the reverse Kullback–Leibler (KL) divergence, while the Bregman divergence associated to the mixture family view corresponds to the KL divergence.

Exponential Family | Mixture Family
pdf: $p_\theta(x)=\prod_{i=1}^dp_i^{t_i(x)}$, $p_i=\Pr(x=e_i)$, $t_i(x)\in\{0,1\}$, $\sum_{i=1}^dt_i(x)=1$ | $m_\theta(x)=\sum_{i=1}^dp_i\,\delta_{e_i}(x)$
primal $\theta$: $\theta_i=\log\frac{p_i}{p_d}$ | $\theta_i=p_i$
$F(\theta)$: $\log\left(1+\sum_{i=1}^D\exp(\theta_i)\right)$ | $\sum_{i=1}^D\theta_i\log\theta_i+\left(1-\sum_{i=1}^D\theta_i\right)\log\left(1-\sum_{i=1}^D\theta_i\right)$
dual $\eta=\nabla F(\theta)$: $\eta_i=\frac{e^{\theta_i}}{1+\sum_{j=1}^De^{\theta_j}}$ | $\eta_i=\log\frac{\theta_i}{1-\sum_{j=1}^D\theta_j}$
primal $\theta=\nabla F^*(\eta)$: $\theta_i=\log\frac{\eta_i}{1-\sum_{j=1}^D\eta_j}$ | $\theta_i=\frac{e^{\eta_i}}{1+\sum_{j=1}^De^{\eta_j}}$
$F^*(\eta)$: $\sum_{i=1}^D\eta_i\log\eta_i+\left(1-\sum_{j=1}^D\eta_j\right)\log\left(1-\sum_{j=1}^D\eta_j\right)$ | $\log\left(1+\sum_{i=1}^D\exp(\eta_i)\right)$
Bregman divergence: $B_F(\theta:\theta')=\mathrm{KL}^*(p_\theta:p_{\theta'})=\mathrm{KL}(p_{\theta'}:p_\theta)$ | $B_F(\theta:\theta')=\mathrm{KL}(m_\theta:m_{\theta'})$

We have $\mathrm{JS}(p_1,p_2)=J_F(\theta_1,\theta_2)$ for $p_1=m_{\theta_1}$ and $p_2=m_{\theta_2}$, where

$J_F(\theta_1:\theta_2)=\frac{F(\theta_1)+F(\theta_2)}{2}-F\!\left(\frac{\theta_1+\theta_2}{2}\right)$, (98)

is the Jensen divergence [40]. Thus, to compute the Jensen–Shannon centroid of a set of $n$ densities $p_1,\ldots,p_n$ of a mixture family (with $p_i=m_{\theta_i}$), we need to solve the following optimization problem for a density $p=m_\theta$:

$\min_p\sum_i\mathrm{JS}(p_i,p)$, (99)
$\equiv\min_\theta\sum_iJ_F(\theta_i,\theta)$, (100)
$\equiv\min_\theta\sum_i\frac{F(\theta_i)+F(\theta)}{2}-F\!\left(\frac{\theta_i+\theta}{2}\right)$, (101)
$\equiv\min_\theta\frac{1}{2}F(\theta)-\frac{1}{n}\sum_iF\!\left(\frac{\theta_i+\theta}{2}\right)=:E(\theta)$. (102)

The CCCP algorithm for the Jensen–Shannon centroid proceeds by initializing $\theta^{(0)}=\frac{1}{n}\sum_i\theta_i$ (center of mass of the natural parameters), and iteratively updates as follows:

$\theta^{(t+1)}=(\nabla F)^{-1}\!\left(\frac{1}{n}\sum_i\nabla F\!\left(\frac{\theta_i+\theta^{(t)}}{2}\right)\right)$. (103)

We iterate until the absolute difference $|E(\theta^{(t)})-E(\theta^{(t+1)})|$ between two successive iterates $\theta^{(t)}$ and $\theta^{(t+1)}$ goes below a prescribed threshold value. The convergence of the CCCP algorithm is linear [41] to a local minimum that is a fixed point of the equation

$\theta=M_{\nabla F}\!\left(\frac{\theta_1+\theta}{2},\ldots,\frac{\theta_n+\theta}{2}\right)$, (104)

where $M_H(v_1,\ldots,v_n):=H^{-1}\!\left(\frac{1}{n}\sum_{i=1}^nH(v_i)\right)$ is a vector generalization of the formula of the quasi-arithmetic means [30,40] obtained for the generator $H=\nabla F$. Algorithm 1 summarizes the method for approximating the Jensen–Shannon centroid of a given set of categorical distributions (given a prescribed number of iterations). In the pseudo-code, we used the notation $^{(t+1)}\theta$ instead of $\theta^{(t+1)}$ in order to highlight the conversion procedures of the natural parameters to/from the mixture weight parameters by using superscript notations for coordinates.

Algorithm 1: The CCCP algorithm for computing the Jensen–Shannon centroid of a set of categorical distributions.
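The core of Algorithm 1, the CCCP iteration of Equation (103) for categorical distributions, can be sketched in a few lines of Python (an illustrative implementation with our own function names and a fixed iteration count, not the paper's reference code):

```python
import numpy as np

def F(theta):
    # Shannon negentropy of a categorical distribution, Eq. (95).
    t0 = 1.0 - theta.sum()
    return float(np.sum(theta * np.log(theta)) + t0 * np.log(t0))

def grad_F(theta):
    # Eq. (96): eta_i = log(theta_i / theta_0).
    return np.log(theta / (1.0 - theta.sum()))

def grad_F_inv(eta):
    # Eq. (97): softmax-like inverse gradient.
    e = np.exp(eta)
    return e / (1.0 + e.sum())

def js_centroid(thetas, iters=100):
    # CCCP update of Eq. (103), initialized at the center of mass.
    theta = np.mean(thetas, axis=0)
    for _ in range(iters):
        theta = grad_F_inv(np.mean([grad_F(0.5 * (ti + theta)) for ti in thetas], axis=0))
    return theta

def energy(theta, thetas):
    # E(theta) of Eq. (102), monotonically decreased by the CCCP.
    return 0.5 * F(theta) - float(np.mean([F(0.5 * (ti + theta)) for ti in thetas]))

thetas = [np.array([0.7, 0.2]), np.array([0.1, 0.6])]  # two trinomial distributions (D = 2)
c = js_centroid(thetas)
# the CCCP never increases the objective from its initialization:
assert energy(c, thetas) <= energy(np.mean(thetas, axis=0), thetas) + 1e-12
```

Each update stays inside the open simplex by construction, since the inverse gradient always outputs strictly positive coordinates summing to less than one.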

Figure 2 displays the results of the calculations of the Jeffreys centroid [18] and the Jensen–Shannon centroid for two normalized histograms obtained from grey-valued images of Lena and Barbara. Figure 3 shows the Jeffreys centroid and the Jensen–Shannon centroid for the Barbara image and its negative image. Figure 4 demonstrates that the Jensen–Shannon centroid is well defined even if the input histograms do not have coinciding supports. Notice that on the parts of the support where only one distribution is defined, the JS centroid is a scaled copy of that defined distribution.

Figure 2.

Figure 2

The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for two grey normalized histograms of the Lena image (red histogram) and the Barbara image (blue histogram). Although these Jeffreys and Jensen–Shannon centroids look quite similar, observe that there is a major difference between them in the range [0,20] where the blue histogram is zero.

Figure 3.

Figure 3

The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for the grey normalized histogram of the Barbara image (red histogram) and its negative image (blue histogram which corresponds to the reflection around the vertical axis x=128 of the red histogram).

Figure 4.

Figure 4

Jensen–Shannon centroid (black histogram) for the clamped grey normalized histogram of the Lena image (red histograms) and the clamped gray normalized histogram of Barbara image (blue histograms). Notice that on the part of the sample space where only one distribution is non-zero, the JS centroid scales that histogram portion.

3.2.2. Special Cases

Let us now consider two special cases:

  • For the special case of $D=1$, the categorical family is the Bernoulli family, and we have $F(\theta)=\theta\log\theta+(1-\theta)\log(1-\theta)$ (binary negentropy), $F'(\theta)=\log\frac{\theta}{1-\theta}$ (with $F''(\theta)=\frac{1}{\theta(1-\theta)}>0$) and $(F')^{-1}(\eta)=\frac{e^\eta}{1+e^\eta}$. The CCCP update rule to compute the binary Jensen–Shannon centroid becomes
    $\theta^{(t+1)}=(F')^{-1}\!\left(\sum_iw_iF'\!\left(\frac{\theta^{(t)}+\theta_i}{2}\right)\right)$. (105)
  • Since the skew-vector Jensen–Shannon divergence formula holds for positive densities:
    JS⁺_{α,w}(p̃:q̃)=∑_{i=1}^k wi KL⁺((p̃q̃)_{αi}:(p̃q̃)_{ᾱ}), (106)
    =∑_{i=1}^k wi KL((p̃q̃)_{αi}:(p̃q̃)_{ᾱ})+∫(p̃q̃)_{ᾱ}dμ−∑_{i=1}^k wi ∫(p̃q̃)_{αi}dμ (where ∑_{i=1}^k wi ∫(p̃q̃)_{αi}dμ=∫(p̃q̃)_{ᾱ}dμ), (107)
    =JS_{α,w}(p̃:q̃), (108)
    we can relax the computation of the Jensen–Shannon centroid by considering 1D separable minimization problems. We then normalize the positive JS centroids to get an approximation of the probability JS centroids. This approach was also considered when dealing with the Jeffreys centroid [18]. In 1D, we have F(θ)=θ log θ−θ, F′(θ)=log θ, and (F′)⁻¹(η)=e^η.
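The relaxed CCCP iteration above can be sketched numerically. Since F′(θ)=log θ acts coordinate-wise on positive measures, the update θ ← (F′)⁻¹(∑_i wi F′((θ+θi)/2)) reduces to a per-bin weighted geometric mean of bin averages, followed by a final normalization. Below is a minimal NumPy sketch; the function name, initialization at the mixture, and fixed iteration count are illustrative choices (not from the paper), and full-support histograms are assumed to keep the logarithms finite.

```python
import numpy as np

def js_centroid(hists, weights=None, iters=100):
    """Approximate Jensen-Shannon centroid of normalized histograms.

    Uses the separable relaxation over positive measures with
    F(theta) = theta*log(theta) - theta, F'(theta) = log(theta),
    (F')^{-1}(eta) = exp(eta).  The CCCP fixed-point step then is a
    per-bin weighted geometric mean of the averages (theta + theta_i)/2,
    and the result is renormalized to approximate the probability centroid.
    """
    hists = np.asarray(hists, dtype=float)      # shape (k, bins), all entries > 0
    k = len(hists)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    theta = np.average(hists, axis=0, weights=w)  # initialize at the mixture
    for _ in range(iters):
        # theta <- exp( sum_i w_i log((theta + h_i)/2) )
        theta = np.exp(np.sum(w[:, None] * np.log((theta + hists) / 2), axis=0))
    return theta / theta.sum()
```

For identical inputs the iteration is a fixed point at that histogram, and symmetric inputs yield a symmetric centroid.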

In general, calculating the negentropy for a mixture family with continuous densities sharing the same support is not tractable because of the log-sum term of the differential entropy. However, the following remark emphasizes an extension of the mixture family of categorical distributions:

3.2.3. Some Remarks and Properties

Remark 3.

Consider a mixture family m(θ)=∑_{i=1}^D θi pi(x)+(1−∑_{i=1}^D θi) p0(x) (for a parameter θ belonging to the D-dimensional standard simplex) of probability densities p0(x),…,pD(x) defined respectively on the supports X0,X1,…,XD. Let θ0:=1−∑_{i=1}^D θi. Assume that the supports Xi of the pi are mutually non-intersecting (Xi∩Xj=∅ for all i≠j, implying that the D+1 densities are linearly independent) so that mθ(x)=θi pi(x) for all x∈Xi, and let X=∪i Xi. Consider the Shannon negentropy F(θ)=−h(mθ) as a strictly convex function. Then, we have

F(θ)=−h(mθ)=∫X mθ(x) log mθ(x) dμ(x), (109)
=∑_{i=0}^D θi ∫Xi pi(x) log(θi pi(x)) dμ(x), (110)
=∑_{i=0}^D θi log θi−∑_{i=0}^D θi h(pi). (111)

Note that the term −∑_i θi h(pi) is affine in θ, and Bregman divergences are defined up to affine terms, so that the Bregman generator F is equivalent to the Bregman generator of the family of categorical distributions. This example generalizes the ordinary mixture family of categorical distributions, where the pi are distinct Dirac distributions. Note that when the supports of the component distributions are not pairwise disjoint, the (neg)entropy may not be analytic [42] (e.g., a mixture of the convex weighting of two prescribed distinct Gaussian distributions). This contrasts with the fact that the cumulant function of an exponential family is always real-analytic [43]. Observe that the term ∑_i θi h(pi) can be interpreted as a conditional entropy: ∑_i θi h(pi)=h(X|Θ), where Pr(Θ=i)=θi and Pr(X∈S|Θ=i)=∫S pi(x)dμ(x).
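The closed form F(θ)=∑_i θi log θi−∑_i θi h(pi) of Remark 3 can be checked numerically on a discrete stand-in with pairwise disjoint supports. The alphabet, components, and mixture weights below are arbitrary illustrative choices; for simplicity the sketch parameterizes by the full weight vector (θ0,θ1,θ2) rather than the D free coordinates.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

# Three components with pairwise disjoint supports on a 6-letter alphabet
# (a discrete stand-in for the densities p_i of the remark):
p = [np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0]),   # uniform on {0, 1}
     np.array([0.0, 0.0, 1/3, 1/3, 1/3, 0.0]),   # uniform on {2, 3, 4}
     np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])]   # point mass on {5}

def negentropy_F(theta):
    """F(theta) = sum_i theta_i log theta_i - sum_i theta_i h(p_i)."""
    theta = np.asarray(theta, dtype=float)
    component_entropies = np.array([entropy(pi) for pi in p])
    return np.sum(theta * np.log(theta)) - np.sum(theta * component_entropies)
```

On disjoint supports the mixture entropy splits exactly into the weight entropy plus the weighted component entropies, so −h(mθ) matches F(θ) to machine precision.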

Notice that we can truncate an exponential family [25] to get a (potentially non-regular [44]) exponential family for defining the pi on mutually non-intersecting domains Xi. The entropy of a natural exponential family {e(x:θ)=exp(x⊤θ−ψ(θ)) : θ∈Θ} with cumulant function ψ(θ) and natural parameter space Θ is −ψ*(η), where η=∇ψ(θ) and ψ* is the Legendre convex conjugate [45]: h(e(x:θ))=−ψ*(∇ψ(θ)).
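The identity h(e(x:θ))=−ψ*(∇ψ(θ)) can be verified on the Bernoulli family, whose cumulant is ψ(θ)=log(1+e^θ) and whose conjugate ψ*(η)=η log η+(1−η) log(1−η) is the binary negentropy. This is a hedged sketch with our own function names, not an API from the paper.

```python
import numpy as np

def psi(theta):
    """Cumulant of the Bernoulli exponential family e(x:theta) = exp(x*theta - psi(theta))."""
    return np.log1p(np.exp(theta))

def psi_star(eta):
    """Legendre conjugate of psi: the binary negentropy."""
    return eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

def entropy_bernoulli(mu):
    """Shannon entropy of a Bernoulli(mu) variable (in nats)."""
    return -mu * np.log(mu) - (1 - mu) * np.log(1 - mu)

theta = 0.7
eta = np.exp(theta) / (1 + np.exp(theta))   # eta = psi'(theta), the mean parameter
```

The check confirms both h = −ψ*(η) and the Legendre relation ψ*(η)=θη−ψ(θ) at η=ψ′(θ).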

In general, the entropy and cross-entropy between densities of a mixture family (whether the distributions have disjoint supports or not) can be calculated in closed-form.

Property 2.

The entropy of a density belonging to a mixture family M is h(mθ)=−F(θ), and the cross-entropy between two mixture densities mθ1 and mθ2 is h×(mθ1:mθ2)=−F(θ2)−(θ1−θ2)⊤η2=F*(η2)−θ1⊤η2, where η2=∇F(θ2).

Proof. 

Let us write the KLD as the cross-entropy minus the entropy [4]:

KL(mθ1:mθ2)=h×(mθ1:mθ2)−h(mθ1), (112)
=BF(θ1:θ2), (113)
=F(θ1)−F(θ2)−(θ1−θ2)⊤∇F(θ2). (114)

Following [45], we deduce that h(mθ)=−F(θ)+c and h×(mθ1:mθ2)=−F(θ2)−(θ1−θ2)⊤η2+c for a constant c. Since F(θ)=−h(mθ) by definition, it follows that c=0 and that h×(mθ1:mθ2)=−F(θ2)−(θ1−θ2)⊤η2=F*(η2)−θ1⊤η2, where η=∇F(θ).  □
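Property 2 can be sanity-checked on the mixture family of categorical distributions, for which F, ∇F, and the cross-entropy are all available in closed form. The coordinate convention below (the free parameters θ1,…,θD first, with θ0=1−∑θi appended last) is an illustrative choice.

```python
import numpy as np

def full(theta):
    """Categorical density m_theta: D free coordinates plus theta_0 = 1 - sum(theta)."""
    theta = np.asarray(theta, dtype=float)
    return np.append(theta, 1.0 - theta.sum())

def F(theta):
    """Bregman generator: Shannon negentropy F(theta) = -h(m_theta)."""
    m = full(theta)
    return np.sum(m * np.log(m))

def gradF(theta):
    """Gradient of F: (grad F)_i = log(theta_i / theta_0)."""
    m = full(theta)
    return np.log(m[:-1] / m[-1])

def bregman(t1, t2):
    """B_F(theta_1 : theta_2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>."""
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

def kl(p, q):
    """Kullback-Leibler divergence between discrete positive distributions."""
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    """Shannon cross-entropy h_x(p:q) = -sum p log q."""
    return -np.sum(p * np.log(q))
```

The assertions check KL(mθ1:mθ2)=BF(θ1:θ2) (Equations (112)–(114)) and the cross-entropy formula h×=−F(θ2)−(θ1−θ2)⊤∇F(θ2) of Property 2.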

Thus, we can numerically compute the Jensen–Shannon centroids (or barycenters) of a set of densities belonging to a mixture family. This includes the case of categorical distributions and the case of Gaussian Mixture Models (GMMs) with prescribed Gaussian components [38] (although in this case, the negentropy needs to be stochastically approximated using Monte Carlo techniques [46]). When the densities do not belong to a mixture family (say, the Gaussian family, which is an exponential family [25]), we face the problem that the mixture of two densities does not belong to the family anymore. One way to tackle this problem is to project the mixture onto the Gaussian family. This corresponds to an m-projection (mixture projection), which can be interpreted as a Maximum Entropy projection of the mixture [25,47].

Notice that we can perform fast k-means clustering without centroid calculations by using a generalization of the k-means++ probabilistic initialization [48,49]; see [50] for its extension to an arbitrary divergence.

Finally, let us notice some decompositions of the Jensen–Shannon divergence and the skew Jensen divergences.

Remark 4.

We have the following decomposition for the Jensen–Shannon divergence:

JS(p1,p2)=h((p1+p2)/2)−(h(p1)+h(p2))/2, (115)
=hJS×(p1:p2)−hJS(p2)≥0, (116)

where

hJS×(p1:p2)=h((p1+p2)/2)−(1/2)h(p1), (117)

and hJS(p2)=hJS×(p2:p2)=h(p2)−(1/2)h(p2)=(1/2)h(p2). This decomposition bears some similarity with the KLD decomposition viewed as the cross-entropy minus the entropy (with the cross-entropy always upper-bounding the entropy).
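The decomposition (115)–(117) can be checked numerically on categorical distributions. The names below mirror the remark (hJS× for the JS cross-entropy) but are our own illustrative identifiers.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a positive discrete distribution."""
    return -np.sum(p * np.log(p))

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: average KLD to the mid-mixture."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_cross_entropy(p, q):
    """hJSx(p1:p2) = h((p1+p2)/2) - h(p1)/2, as in Equation (117)."""
    return entropy((p + q) / 2) - 0.5 * entropy(p)
```

Both the entropic form (115) and the cross-entropy-minus-entropy form (116) agree with the KLD-based definition of the JSD.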

Similarly, the α-skew Jensen divergence

JFα(θ1:θ2):=(F(θ1)F(θ2))α−F((θ1θ2)α), α∈(0,1), (118)

can be decomposed as the information IFα(θ1)=(1−α)F(θ1) minus the cross-information CFα(θ1:θ2):=F((θ1θ2)α)−αF(θ2):

JFα(θ1:θ2)=IFα(θ1)−CFα(θ1:θ2)≥0. (119)

Notice that the information IFα(θ1) is the self cross-information: IFα(θ1)=CFα(θ1:θ1)=(1−α)F(θ1). Recall that the convex information is the negentropy, where the entropy is concave. For the Jensen–Shannon divergence on the mixture family of categorical distributions, the convex generator F(θ)=−h(mθ)=∑_{i=1}^D θi log θi is the Shannon negentropy.
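The decomposition (119) and the self-cross-information identity hold for any convex generator F; the sketch below checks them with the Shannon negentropy on the interior of the simplex (generic sketch, identifiers ours).

```python
import numpy as np

def mix(a, b, alpha):
    """Linear interpolation (ab)_alpha = (1 - alpha) * a + alpha * b."""
    return (1 - alpha) * a + alpha * b

def skew_jensen(F, t1, t2, alpha):
    """J_F^alpha(t1:t2) = (F(t1)F(t2))_alpha - F((t1 t2)_alpha)."""
    return mix(F(t1), F(t2), alpha) - F(mix(t1, t2, alpha))

def information(F, t1, alpha):
    """I_F^alpha(t1) = (1 - alpha) F(t1)."""
    return (1 - alpha) * F(t1)

def cross_information(F, t1, t2, alpha):
    """C_F^alpha(t1:t2) = F((t1 t2)_alpha) - alpha F(t2)."""
    return F(mix(t1, t2, alpha)) - alpha * F(t2)
```

With these definitions, J = I − C ≥ 0 and I is the self cross-information, exactly as in the remark.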

Finally, let us briefly mention the Jensen–Shannon diversity [30], which extends the Jensen–Shannon divergence to a weighted set of densities as follows:

JS(p1,…,pk;w1,…,wk):=∑_{i=1}^k wi KL(pi:p̄), (120)

where p̄=∑_{i=1}^k wi pi. The Jensen–Shannon diversity plays the role of the variance of a cluster with respect to the KLD. Indeed, let us state the compensation identity [51]: for any q, we have

∑_{i=1}^k wi KL(pi:q)=∑_{i=1}^k wi KL(pi:p̄)+KL(p̄:q). (121)

Thus, the cluster center defined as the minimizer of ∑_{i=1}^k wi KL(pi:q) is the centroid p̄, and

∑_{i=1}^k wi KL(pi:p̄)=JS(p1,…,pk;w1,…,wk). (122)
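The compensation identity (121) and its corollary (122) hold exactly for discrete distributions, as a short check confirms (the data below are arbitrary illustrative choices).

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def js_diversity(ps, w):
    """Jensen-Shannon diversity: weighted average KLD to the barycenter p-bar."""
    pbar = np.average(ps, axis=0, weights=w)
    return sum(wi * kl(pi, pbar) for wi, pi in zip(w, ps))
```

The test checks that the weighted KLD to any q splits into the diversity plus KL(p̄:q), so p̄ is indeed the minimizer.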

4. Conclusions and Discussion

The Jensen–Shannon divergence [6] is a renowned symmetrization of the oriented Kullback–Leibler divergence that enjoys the following three essential properties:

  1. It is always bounded,

  2. it applies to densities with potentially different supports, and

  3. it extends to unnormalized densities while retaining the same formula.

This JSD plays an important role in machine learning and in deep learning for studying Generative Adversarial Networks (GANs) [52]. Traditionally, the JSD has been skewed with a scalar parameter α∈(0,1) [19,53]. In practice, it has been experimentally demonstrated that skewing divergences may significantly improve the performance of some tasks (e.g., [21,54]).

In general, we can symmetrize the KLD KL(p:q) by taking an abstract mean M (we require a symmetric mean M(x,y)=M(y,x) with the in-betweenness property min{x,y}≤M(x,y)≤max{x,y}) between the two orientations KL(p:q) and KL(q:p):

KLM(p,q):=M(KL(p:q),KL(q:p)). (123)

We recover the Jeffreys divergence by taking twice the arithmetic mean (i.e., J(p,q)=2A(KL(p:q),KL(q:p)) where A(x,y)=(x+y)/2), and the resistor average divergence [55] by taking the harmonic mean (i.e., RKL(p,q)=H(KL(p:q),KL(q:p))=2KL(p:q)KL(q:p)/(KL(p:q)+KL(q:p)), where H(x,y)=2/(1/x+1/y)). When we take the limits of Hölder power means, we get the following extremal symmetrizations of the KLD:

KLmin(p:q)=min{KL(p:q),KL(q:p)}=KLmin(q:p), (124)
KLmax(p:q)=max{KL(p:q),KL(q:p)}=KLmax(q:p). (125)
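The abstract-mean symmetrizations (123)–(125) can be implemented generically; by the harmonic–arithmetic mean inequality, the four variants are ordered KLmin ≤ RKL ≤ J/2 ≤ KLmax. A sketch with our own illustrative names:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def symmetrized_kl(p, q, mean):
    """KL_M(p, q) = M(KL(p:q), KL(q:p)) for a symmetric mean M."""
    return mean(kl(p, q), kl(q, p))

arithmetic = lambda x, y: (x + y) / 2            # half the Jeffreys divergence
harmonic = lambda x, y: 2 * x * y / (x + y)      # the resistor average divergence
kl_min = min                                     # extremal power-mean limits
kl_max = max
```

The test verifies the Jeffreys identity J=2A(KL(p:q),KL(q:p)) and the ordering of the four symmetrizations.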

In this work, we showed how to vector-skew the JSD while preserving the above three properties. These new families of weighted vector-skew Jensen–Shannon divergences may allow one to fine-tune the dissimilarity in applications by replacing the skewing scalar parameter of the JSD by a vector parameter (informally, adding some “knobs” for tuning a divergence). We then considered computing the Jensen–Shannon centroids of a set of densities belonging to a mixture family [25] by using the convex–concave procedure [27].

In general, we can vector-skew any arbitrary divergence D using two k-dimensional vectors α∈[0,1]^k and β∈[0,1]^k (with α≠β), building a weighted separable divergence as follows:

Dα,β,w(p:q):=∑_{i=1}^k wi D((pq)_{αi}:(pq)_{βi})=D_{1k−α,1k−β,w}(q:p), α≠β. (126)

This bi-vector-skew divergence unifies the Jeffreys divergence with the Jensen–Shannon α-skew divergence by setting the following parameters:

KL_{(0,1),(1,0),(1,1)}(p:q)=KL(p:q)+KL(q:p)=J(p,q), (127)
KL_{(0,1),(α,α),(1/2,1/2)}(p:q)=(1/2)KL(p:(pq)α)+(1/2)KL(q:(pq)α). (128)
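Definition (126) is straightforward to implement. The sketch below (function names ours) checks the Jeffreys special case (127) and the reference-duality symmetry Dα,β,w(p:q)=D_{1k−α,1k−β,w}(q:p), which follows from (pq)α=(qp)_{1−α}.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def mixw(p, q, a):
    """Statistical mixture (pq)_a = (1 - a) * p + a * q."""
    return (1 - a) * p + a * q

def bivector_skew_kl(p, q, alpha, beta, w):
    """D_{alpha,beta,w}(p:q) = sum_i w_i KL((pq)_{alpha_i} : (pq)_{beta_i}), Equation (126)."""
    return sum(wi * kl(mixw(p, q, ai), mixw(p, q, bi))
               for ai, bi, wi in zip(alpha, beta, w))
```

With α=(0,1), β=(1,0), and w=(1,1), the sum collapses to KL(p:q)+KL(q:p), i.e., the Jeffreys divergence.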

We have shown in this paper that interesting properties may occur when the skewing vector β is purposely correlated to the skewing vector α: Namely, for the bi-vector-skew Bregman divergences with β=(ᾱ,…,ᾱ) and ᾱ=∑_i wi αi, we obtain an equivalent Jensen diversity for the Jensen–Bregman divergence and, as a byproduct, a vector-skew generalization of the Jensen–Shannon divergence.

Acknowledgments

The author is very grateful to the two Reviewers and the Academic Editor for their careful reading, helpful comments, and suggestions which led to this improved manuscript. In particular, Reviewer 2 kindly suggested the stronger bound of Lemma 1 and hinted at Theorem 1.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

  • 1.Billingsley P. Probability and Measure. John Wiley & Sons; Hoboken, NJ, USA: 2008. [Google Scholar]
  • 2.Deza M.M., Deza E. Encyclopedia of Distances. Springer; Berlin/Heidelberg, Germany: 2009. [Google Scholar]
  • 3.Basseville M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013;93:621–633. doi: 10.1016/j.sigpro.2012.09.003. [DOI] [Google Scholar]
  • 4.Cover T.M., Thomas J.A. Elements of Information Theory. John Wiley & Sons; Hoboken, NJ, USA: 2012. [Google Scholar]
  • 5.Nielsen F. On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy. 2019;21:485. doi: 10.3390/e21050485. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Lin J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory. 1991;37:145–151. doi: 10.1109/18.61115. [DOI] [Google Scholar]
  • 7.Sason I. Tight bounds for symmetric divergence measures and a new inequality relating f-divergences; Proceedings of the 2015 IEEE Information Theory Workshop (ITW); Jerusalem, Israel. 26 April–1 May 2015; pp. 1–5. [Google Scholar]
  • 8.Wong A.K., You M. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 1985;7:599–609. doi: 10.1109/TPAMI.1985.4767707. [DOI] [PubMed] [Google Scholar]
  • 9.Endres D.M., Schindelin J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory. 2003;49:1858–1860. doi: 10.1109/TIT.2003.813506. [DOI] [Google Scholar]
  • 10.Kafka P., Österreicher F., Vincze I. On powers of f-divergences defining a distance. Stud. Sci. Math. Hung. 1991;26:415–422. [Google Scholar]
  • 11.Fuglede B. Spirals in Hilbert space: With an application in information theory. Expo. Math. 2005;23:23–45. doi: 10.1016/j.exmath.2005.01.014. [DOI] [Google Scholar]
  • 12.Acharyya S., Banerjee A., Boley D. Bregman divergences and triangle inequality; Proceedings of the 2013 SIAM International Conference on Data Mining; Austin, TX, USA. 2–4 May 2013; pp. 476–484. [Google Scholar]
  • 13.Naghshvar M., Javidi T., Wigger M. Extrinsic Jensen–Shannon divergence: Applications to variable-length coding. IEEE Trans. Inf. Theory. 2015;61:2148–2164. doi: 10.1109/TIT.2015.2401004. [DOI] [Google Scholar]
  • 14.Bigi B. European Conference on Information Retrieval. Springer; Berlin/Heidelberg, Germany: 2003. Using Kullback-Leibler distance for text categorization; pp. 305–319. [Google Scholar]
  • 15.Chatzisavvas K.C., Moustakidis C.C., Panos C. Information entropy, information distances, and complexity in atoms. J. Chem. Phys. 2005;123:174111. doi: 10.1063/1.2121610. [DOI] [PubMed] [Google Scholar]
  • 16.Yurdakul B. Ph.D. Thesis. Western Michigan University; Kalamazoo, MI, USA: 2018. Statistical Properties of Population Stability Index. [Google Scholar]
  • 17.Jeffreys H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A. 1946;186:453–461. doi: 10.1098/rspa.1946.0056. [DOI] [PubMed] [Google Scholar]
  • 18.Nielsen F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013;20:657–660. doi: 10.1109/LSP.2013.2260538. [DOI] [Google Scholar]
  • 19.Lee L. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99. Association for Computational Linguistics; Stroudsburg, PA, USA: 1999. Measures of Distributional Similarity; pp. 25–32. [DOI] [Google Scholar]
  • 20.Nielsen F. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv. 2010, arXiv:1009.4004. [Google Scholar]
  • 21.Lee L. On the effectiveness of the skew divergence for statistical language analysis; Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AISTATS 2001); Key West, FL, USA. 4–7 January 2001. [Google Scholar]
  • 22.Csiszár I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967;2:229–318. [Google Scholar]
  • 23.Ali S.M., Silvey S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966;28:131–142. doi: 10.1111/j.2517-6161.1966.tb00626.x. [DOI] [Google Scholar]
  • 24.Sason I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy. 2018;20:383. doi: 10.3390/e20050383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Amari S.I. Information Geometry and Its Applications. Springer; Berlin/Heidelberg, Germany: 2016. [Google Scholar]
  • 26.Jiao J., Courtade T.A., No A., Venkat K., Weissman T. Information measures: The curious case of the binary alphabet. IEEE Trans. Inf. Theory. 2014;60:7616–7626. doi: 10.1109/TIT.2014.2360184. [DOI] [Google Scholar]
  • 27.Yuille A.L., Rangarajan A. The concave-convex procedure (CCCP); Proceedings of the Neural Information Processing Systems 2002; Vancouver, BC, Canada. 9–14 December 2002; pp. 1033–1040. [Google Scholar]
  • 28.Nielsen F., Nock R. Transactions on Computational Science XIV. Springer; Berlin/Heidelberg, Germany: 2011. Skew Jensen-Bregman Voronoi diagrams; pp. 102–128. [Google Scholar]
  • 29.Banerjee A., Merugu S., Dhillon I.S., Ghosh J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005;6:1705–1749. [Google Scholar]
  • 30.Nielsen F., Nock R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory. 2009;55:2882–2904. doi: 10.1109/TIT.2009.2018176. [DOI] [Google Scholar]
  • 31.Melbourne J., Talukdar S., Bhaban S., Madiman M., Salapaka M.V. On the Entropy of Mixture distributions. [(accessed on 16 February 2020)]; Available online: http://box5779.temp.domains/~jamesmel/publications/
  • 32.Guntuboyina A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory. 2011;57:2386–2399. doi: 10.1109/TIT.2011.2110791. [DOI] [Google Scholar]
  • 33.Sason I., Verdu S. f-divergence Inequalities. IEEE Trans. Inf. Theory. 2016;62:5973–6006. doi: 10.1109/TIT.2016.2603151. [DOI] [Google Scholar]
  • 34.Melbourne J., Madiman M., Salapaka M.V. Relationships between certain f-divergences; Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton); Monticello, IL, USA . 24–27 September 2019; pp. 1068–1073. [Google Scholar]
  • 35.Sason I. On Data-Processing and Majorization Inequalities for f-Divergences with Applications. Entropy. 2019;21:1022. doi: 10.3390/e21101022. [DOI] [Google Scholar]
  • 36.Van Erven T., Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory. 2014;60:3797–3820. doi: 10.1109/TIT.2014.2320500. [DOI] [Google Scholar]
  • 37.Xu P., Melbourne J., Madiman M. Infinity-Rényi entropy power inequalities; Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT); Aachen, Germany. 25–30 June 2017; pp. 2985–2989. [Google Scholar]
  • 38.Nielsen F., Nock R. On the geometry of mixtures of prescribed distributions; Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Calgary, AB, Canada. 15–20 April 2018; pp. 2861–2865. [Google Scholar]
  • 39.Fréchet M. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. De L’institut Henri PoincarÉ. 1948;10:215–310. [Google Scholar]
  • 40.Nielsen F., Boltz S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory. 2011;57:5455–5466. doi: 10.1109/TIT.2011.2159046. [DOI] [Google Scholar]
  • 41.Lanckriet G.R., Sriperumbudur B.K. On the convergence of the concave-convex procedure; Proceedings of the Advances in Neural Information Processing Systems 22 (NIPS 2009); Vancouver, BC, Canada. 7–10 December 2009; pp. 1759–1767. [Google Scholar]
  • 42.Nielsen F., Sun K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy. 2016;18:442. doi: 10.3390/e18120442. [DOI] [Google Scholar]
  • 43.Springer Verlag GmbH, European Mathematical Society Encyclopedia of Mathematics. [(accessed on 19 December 2019)]; Available online: https://www.encyclopediaofmath.org/
  • 44.Del Castillo J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994;46:57–66. doi: 10.1007/BF00773592. [DOI] [Google Scholar]
  • 45.Nielsen F., Nock R. Entropies and cross-entropies of exponential families; Proceedings of the 2010 IEEE International Conference on Image Processing; Hong Kong, China. 26–29 September 2010; pp. 3621–3624. [Google Scholar]
  • 46.Nielsen F., Hadjeres G. Monte Carlo information geometry: The dually flat case. arXiv. 2018, arXiv:1803.07225. [Google Scholar]
  • 47.Schwander O., Nielsen F. Matrix Information Geometry. Springer; Berlin/Heidelberg, Germany: 2013. Learning mixtures by simplifying kernel density estimators; pp. 403–426. [Google Scholar]
  • 48.Arthur D., Vassilvitskii S. k-means++: The advantages of careful seeding; Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07); New Orleans, LA, USA. 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  • 49.Nielsen F., Nock R., Amari S.I. On clustering histograms with k-means by using mixed α-divergences. Entropy. 2014;16:3273–3301. doi: 10.3390/e16063273. [DOI] [Google Scholar]
  • 50.Nielsen F., Nock R. Total Jensen divergences: Definition, properties and clustering; Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Brisbane, QLD, Australia. 19–24 April 2015; pp. 2016–2020. [Google Scholar]
  • 51.Topsøe F. Basic concepts, identities and inequalities-the toolkit of information theory. Entropy. 2001;3:162–190. doi: 10.3390/e3030162. [DOI] [Google Scholar]
  • 52.Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative adversarial nets; Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014); Montreal, QC, Canada. 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  • 53.Yamano T. Some bounds for skewed α-Jensen-Shannon divergence. Results Appl. Math. 2019;3:100064. doi: 10.1016/j.rinam.2019.100064. [DOI] [Google Scholar]
  • 54.Kotlerman L., Dagan I., Szpektor I., Zhitomirsky-Geffet M. Directional distributional similarity for lexical inference. Nat. Lang. Eng. 2010;16:359–389. doi: 10.1017/S1351324910000124. [DOI] [Google Scholar]
  • 55.Johnson D., Sinanovic S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory. 2001:1–8. [Google Scholar]

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
