Entropy. 2024 Feb 23;26(3):193. doi: 10.3390/e26030193

Divergences Induced by the Cumulant and Partition Functions of Exponential Families and Their Deformations Induced by Comparative Convexity

Frank Nielsen 1
Editor: Carlo Cafaro
PMCID: PMC10968918  PMID: 38539705

Abstract

Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning, among others. An exponential family can either be normalized subtractively by its cumulant or free energy function, or equivalently normalized divisively by its partition function. Both the cumulant and partition functions are strictly convex and smooth functions inducing corresponding pairs of Bregman and Jensen divergences. It is well known that skewed Bhattacharyya distances between the probability densities of an exponential family amount to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and that in limit cases the sided Kullback–Leibler divergences amount to reverse-sided Bregman divergences. In this work, we first show that the α-divergences between non-normalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetical means allows both convex functions and their arguments to be deformed, thereby defining dually flat spaces with corresponding divergences when ordinary convexity is preserved.

Keywords: convex duality, exponential family, Bregman divergence, Jensen divergence, Bhattacharyya distance, Rényi divergence, α-divergences, comparative convexity, log convexity, exponential convexity, quasi-arithmetic means, information geometry

1. Introduction

In information geometry [1], any strictly convex and smooth function induces a dually flat space (DFS) with a canonical divergence which can be expressed in charts either as dual Bregman divergences [2] or equivalently as dual Fenchel–Young divergences [3]. For example, the cumulant function of an exponential family [4] (also called the free energy) generates a DFS, that is, an exponential family manifold [5] with the canonical divergence yielding the reverse Kullback–Leibler divergence. Another typical example of a strictly convex and smooth function generating a DFS is the negative entropy of a mixture family, that is, a mixture family manifold with the canonical divergence yielding the (forward) Kullback–Leibler divergence [3]. In addition, any strictly convex and smooth function induces a family of scaled skewed Jensen divergences [6,7], which in limit cases includes the sided forward and reverse Bregman divergences.

In Section 2, we present two equivalent approaches to normalizing an exponential family: first by its cumulant function, and second by its partition function. Because both the cumulant and partition functions are strictly convex and smooth, they induce corresponding families of scaled skewed Jensen divergences and Bregman divergences, with corresponding dually flat spaces and related statistical divergences.

In Section 3, we recall the well-known result that the statistical α-skewed Bhattacharyya distances between the probability densities of an exponential family amount to scaled α-skewed Jensen divergences between their natural parameters. In Section 4, we prove that the α-divergences [8] between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function between their natural parameters (Proposition 5). More generally, we explain in Section 5 how to deform a convex function using comparative convexity [9]: when the ordinary convexity of the deformed function is preserved, we obtain new skewed Jensen divergences and Bregman divergences with corresponding dually flat spaces. Finally, Section 6 concludes this work with a discussion.

2. Dual Subtractive and Divisive Normalizations of Exponential Families

2.1. Natural Exponential Families

Let (X,A,μ) be a measure space [10], where X denotes the sample set (e.g., finite alphabet, N, Rd, space of positive-definite matrices Sym++(d), etc.), A a σ-algebra on X (e.g., power set 2X, Borel σ-algebra B(X), etc.), and μ a positive measure (e.g., counting measure or Lebesgue measure) on the measurable space (X,A).

A natural exponential family [4,11] (commonly abbreviated as NEF [12]) is a set of probability distributions $\mathcal{P}=\{P_\theta : \theta\in\Theta\}$ all dominated by μ such that their Radon–Nikodym densities $p_\theta(x)=\frac{\mathrm{d}P_\theta}{\mathrm{d}\mu}(x)$ can be expressed canonically as

$p_\theta(x) \propto \tilde p_\theta(x) = \exp\left(\sum_{i=1}^m \theta_i x_i\right)$, (1)

where θ is called the natural parameter and $x=(x_1,\ldots,x_m)$ denotes the linear sufficient statistic vector [11]. The order of the NEF [13] is m. When the parameter θ ranges over the full natural parameter space

$\Theta=\left\{\theta : \int_X \tilde p_\theta(x)\,\mathrm{d}\mu(x) < \infty\right\} \subseteq \mathbb{R}^m,$

the family is called full. The NEF is said to be regular when Θ is topologically open.

The unnormalized positive density $\tilde p_\theta(x)$ is indicated with a tilde notation and the corresponding normalized probability density is obtained as $p_\theta(x)=\frac{1}{Z(\theta)}\tilde p_\theta(x)$, where $Z(\theta)=\int \tilde p_\theta(x)\,\mathrm{d}\mu(x)$ is the Laplace transform of μ (the density normalizer). For example, the family of exponential distributions $\mathcal{E}=\{\lambda e^{-\lambda x} : \lambda>0\}$ is an NEF with densities defined on the support $X=\mathbb{R}_{\geq 0}=\{x\in\mathbb{R} : x\geq 0\}$, natural parameter $\theta=-\lambda$ in $\Theta=\mathbb{R}_{<0}=\{\theta\in\mathbb{R} : \theta<0\}$, sufficient linear statistic x, and normalizer $Z(\theta)=-\frac{1}{\theta}$.

2.2. Exponential Families

More generally, exponential families include many well known distributions after reparameterization [4] of their ordinary parameter λ by θ(λ). The general canonical form of the densities of an exponential family is

$p_\lambda(x) \propto \tilde p_\lambda(x) = \exp\left(\langle\theta(\lambda), t(x)\rangle\right)\, h(x)$, (2)

where $t(x)=(t_1(x),\ldots,t_m(x))$ is the sufficient statistic vector (such that $1, t_1(x),\ldots,t_m(x)$ are linearly independent), $h(x)$ is an auxiliary term used to define the base measure with respect to μ, and $\langle\cdot,\cdot\rangle$ is an inner product (e.g., scalar product of $\mathbb{R}^m$, trace product of symmetric matrices, etc.). By defining a new measure ν such that $\frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x)=h(x)$, we may consider without loss of generality the densities $\bar p_\lambda(x)=\frac{\mathrm{d}P_\lambda}{\mathrm{d}\nu}(x)$ with $h(x)=1$.

For example, the Bernoulli distributions, Gaussian or normal distributions, Gamma and Beta distributions, Poisson distributions, Rayleigh distributions, and Weibull distributions with prescribed shape parameter are just a few examples of exponential families with the inner product on $\mathbb{R}^m$ defined as the scalar product. The categorical distributions (i.e., discrete distributions on a finite alphabet sample space) form an exponential family as well [1]. Zero-centered Gaussian distributions and Wishart distributions are examples of exponential families parameterized by positive-definite matrices with inner products defined by the matrix trace product $\langle A,B\rangle=\mathrm{tr}(AB)$.

Exponential families abound in statistics and machine learning. Any two probability measures Q and R with densities q and r with respect to a dominating measure, say, $\mu=\frac{Q+R}{2}$, define an exponential family

$\mathcal{P}_{Q,R}=\left\{p_\lambda(x) \propto q^\lambda(x)\, r^{1-\lambda}(x) : \lambda\in(0,1)\right\},$

which is called the likelihood ratio exponential family [14], as the sufficient statistic is $t(x)=\log\frac{q(x)}{r(x)}$ (with auxiliary carrier term $h(x)=r(x)$), or the Bhattacharyya arc, as the cumulant function of $\mathcal{P}_{Q,R}$ is expressed as the negative of the skewed Bhattacharyya distances [7,15].

In machine learning, undirected graphical models [16] and energy-based models [17], including Markov random fields [18] and conditional random fields, are exponential families [19]. Exponential families are universal approximators of smooth densities [20].

From a theoretical standpoint, it is often enough to consider (without loss of generality) natural exponential families with densities expressed as in Equation (1). However, here we consider generic exponential families with the densities expressed in Equation (2) in order to report common examples encountered in practice, such as the multivariate Gaussian family [21].

When the natural parameter θ is not free to range over the full parameter space Θ but is instead constrained as $\theta=c(u)$ for $u\in U$ with $\dim(U)<m$ and a smooth function $c(u)$, the exponential family is called a curved exponential family [1]. For example, the special family of normal distributions $\{p_{\mu,\sigma^2=\mu^2} : \mu\in\mathbb{R}\}$ with variance equal to the squared mean is a curved exponential family with $u=\mu$ and $c(u)=(u,u^2)$ [1].

2.3. Normalizations of Exponential Families

Recall that $\tilde p_\theta(x)=\exp(\langle\theta,t(x)\rangle)\,h(x)$ denotes the unnormalized density expressed using the natural parameter $\theta=\theta(\lambda)$. We can normalize $\tilde p_\theta(x)$ using either the partition function $Z(\theta)$ or equivalently using the cumulant function $F(\theta)$, as follows:

$p_\theta(x) = \frac{\exp(\langle\theta,t(x)\rangle)}{Z(\theta)}\, h(x)$, (3)
$\phantom{p_\theta(x)} = \exp\left(\langle\theta,t(x)\rangle - F(\theta) + k(x)\right)$, (4)

where $h(x)=\exp(k(x))$, $Z(\theta)=\int\tilde p_\theta(x)\,\mathrm{d}\mu(x)$, and $F(\theta)=\log Z(\theta)=\log\int\tilde p_\theta(x)\,\mathrm{d}\mu(x)$. Thus, the logarithm and exponential functions allow conversion to and from the dual normalizers Z and F:

$Z(\theta)=\exp(F(\theta)) \Leftrightarrow F(\theta)=\log Z(\theta).$

We may view Equation (3) as an exponential tilting [13] of density h(x)dμ(x).

In the context of λ-deformed exponential families [22] which generalize exponential families, the function Z(θ) is called the divisive normalization factor (Equation (3)) and the function F(θ) is called the subtractive normalization factor (Equation (4)). Notice that F(θ) is called the cumulant function because when $X\sim p_\theta(x)$ is a random variable following a probability distribution of an exponential family, the function F(θ) appears in the cumulant generating function of X: $K_X(t)=\log E_X\left[e^{\langle t,X\rangle}\right]=F(\theta+t)-F(\theta)$. In statistical physics, the cumulant function is called the log-normalizer or log-partition function. Because $Z>0$ and $F=\log Z$, we can deduce that $F<Z$, as $\log x<x$ for $x>0$.
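To make the dual normalizations concrete, here is a minimal numerical sketch (an illustration, not from the paper; it assumes the exponential distribution family of Section 2.1 with natural parameter θ < 0, so that $Z(\theta)=-\frac{1}{\theta}$ and $F(\theta)=\log Z(\theta)=-\log(-\theta)$), checking the relation $F=\log Z$ and the cumulant generating function identity with SciPy:

```python
# Minimal numerical sketch (illustration, not from the paper), assuming the
# exponential distribution family of Section 2.1: unnormalized density
# p~_theta(x) = exp(theta * x) on x >= 0 with natural parameter theta < 0.
import numpy as np
from scipy.integrate import quad

def Z(theta):
    """Partition function Z(theta) = int_0^inf exp(theta * x) dmu(x)."""
    val, _ = quad(lambda x: np.exp(theta * x), 0.0, np.inf)
    return val

def F(theta):
    """Cumulant (log-partition) function F(theta) = log Z(theta)."""
    return np.log(Z(theta))

theta, t = -2.5, 0.3
assert np.isclose(Z(theta), -1.0 / theta)            # Z(theta) = -1/theta
assert np.isclose(F(theta), -np.log(-theta))         # F(theta) = -log(-theta)

# Cumulant generating function of X ~ p_theta: K_X(t) = F(theta + t) - F(theta).
K = F(theta + t) - F(theta)
mean = quad(lambda x: x * np.exp(theta * x) / Z(theta), 0.0, np.inf)[0]
# K_X'(0) = E[X]: check with a finite-difference derivative of K at t = 0.
assert np.isclose((F(theta + 1e-6) - F(theta)) / 1e-6, mean, rtol=1e-4)
print(Z(theta), F(theta), K)
```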

It is well known that the cumulant function F(θ) is a strictly convex function and that the partition function Z(θ) is strictly log-convex [11].

Proposition 1

([11]). The natural parameter space Θ of an exponential family is convex.

Proposition 2

([11]). The cumulant function F(θ) is strictly convex and the partition function Z(θ) is positive and strictly log-convex.

It can be shown that the cumulant and partition functions are smooth $C^\infty$ analytic functions [4]. A remarkable property is that strictly log-convex functions are also strictly convex.

Proposition 3

([23], Section 3.5). A strictly log-convex function $Z:\Theta\subseteq\mathbb{R}^m\to\mathbb{R}$ is strictly convex.

The converse of Proposition 3 is not necessarily true: certain convex functions are not log-convex, and as such the class of strictly log-convex functions is a proper subclass of strictly convex functions. For example, $\theta^2$ is strictly convex but not log-convex, as $(\log\theta^2)''=-\frac{2}{\theta^2}<0$ (Figure 1).

Figure 1.


Strictly log-convex functions form a proper subset of strictly convex functions.

Remark 1.

Because Z=exp(F) is strictly convex (Proposition 3), F is exponentially convex.

Definition 1.

The cumulant function F and partition function Z of a regular exponential family are both strictly convex and smooth functions inducing a pair of dually flat spaces with corresponding Bregman divergences [2] $B_F$ (i.e., $B_{\log Z}$) and $B_Z$ (i.e., $B_{\exp F}$):

$B_Z(\theta_1:\theta_2) = Z(\theta_1) - Z(\theta_2) - \langle\theta_1-\theta_2, \nabla Z(\theta_2)\rangle \geq 0$, (5)
$B_{\log Z}(\theta_1:\theta_2) = \log\frac{Z(\theta_1)}{Z(\theta_2)} - \left\langle\theta_1-\theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle \geq 0$, (6)

along with a pair of families of skewed Jensen divergences JF,α and JZ,α:

$J_{Z,\alpha}(\theta_1:\theta_2) = \alpha Z(\theta_1) + (1-\alpha) Z(\theta_2) - Z(\alpha\theta_1+(1-\alpha)\theta_2) \geq 0$, (7)
$J_{\log Z,\alpha}(\theta_1:\theta_2) = \log\frac{Z^\alpha(\theta_1)\, Z^{1-\alpha}(\theta_2)}{Z(\alpha\theta_1+(1-\alpha)\theta_2)} \geq 0$. (8)

For a strictly convex function F(θ), we define the symmetric Jensen divergence as follows:

$J_F(\theta_1,\theta_2) = J_{F,\frac12}(\theta_1:\theta_2) = \frac{F(\theta_1)+F(\theta_2)}{2} - F\left(\frac{\theta_1+\theta_2}{2}\right).$

Let $\mathcal{B}_\Theta$ denote the set of real-valued strictly convex and differentiable functions defined on an open set Θ, called Bregman generators. We may equivalently consider the set of strictly concave and differentiable functions G(θ) and let $F(\theta)=-G(\theta)$; see [24] (Equation (1)).

Remark 2.

The non-negativeness of the Bregman divergences induced by the cumulant and partition functions yields criteria for checking the strict convexity or log-convexity of a $C^1$ function:

$F(\theta)$ is strictly convex $\Leftrightarrow \forall\theta_1\neq\theta_2,\ B_F(\theta_1:\theta_2)>0 \Leftrightarrow \forall\theta_1\neq\theta_2,\ F(\theta_1)>F(\theta_2)+\langle\theta_1-\theta_2, \nabla F(\theta_2)\rangle,$

and

$Z(\theta)$ is strictly log-convex $\Leftrightarrow \forall\theta_1\neq\theta_2,\ B_{\log Z}(\theta_1:\theta_2)>0 \Leftrightarrow \forall\theta_1\neq\theta_2,\ \log Z(\theta_1)>\log Z(\theta_2)+\left\langle\theta_1-\theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle.$

The forward Bregman divergence BF(θ1:θ2) and reverse Bregman divergence BF(θ2:θ1) can be unified with the α-skewed Jensen divergences by rescaling JF,α and allowing α to range in R [6,7]:

$J_{F,\alpha}^s(\theta_1:\theta_2) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\, J_{F,\alpha}(\theta_1:\theta_2), & \alpha\in\mathbb{R}\setminus\{0,1\},\\ B_F(\theta_1:\theta_2), & \alpha=0,\\ 4\, J_F(\theta_1,\theta_2), & \alpha=\frac12,\\ B_F^*(\theta_1:\theta_2)=B_F(\theta_2:\theta_1), & \alpha=1, \end{cases}$ (9)

where BF* denotes the reverse Bregman divergence obtained by swapping the parameter order (reference duality [6]): BF*(θ1:θ2)=BF(θ2:θ1).
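As an illustration (a sketch under our own choice of generator, not taken from the paper), the scaled skewed Jensen divergence of Equation (9) and its sided Bregman limit cases can be checked numerically for the strictly convex generator $F(x)=-\log x$ on $(0,\infty)$:

```python
# Small numerical sketch (not from the paper) of the scaled alpha-skewed Jensen
# divergence of Equation (9) and its Bregman limit cases, for a generic smooth
# strictly convex generator F on an open interval (here F(x) = -log(x)).
import numpy as np

def bregman(F, gradF, t1, t2):
    """Bregman divergence B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>."""
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def jensen_alpha(F, t1, t2, alpha):
    """Skewed Jensen divergence J_{F,alpha}(t1 : t2)."""
    return alpha * F(t1) + (1 - alpha) * F(t2) - F(alpha * t1 + (1 - alpha) * t2)

def jensen_scaled(F, gradF, t1, t2, alpha):
    """Scaled skewed Jensen divergence J^s_{F,alpha}, extended by continuity."""
    if np.isclose(alpha, 0.0):
        return bregman(F, gradF, t1, t2)
    if np.isclose(alpha, 1.0):
        return bregman(F, gradF, t2, t1)  # reverse Bregman divergence
    return jensen_alpha(F, t1, t2, alpha) / (alpha * (1 - alpha))

F = lambda x: -np.log(x)
gradF = lambda x: -1.0 / x
t1, t2 = 0.7, 2.3
# The alpha -> 0 and alpha -> 1 limits recover the sided Bregman divergences.
assert np.isclose(jensen_scaled(F, gradF, t1, t2, 1e-6),
                  bregman(F, gradF, t1, t2), atol=1e-4)
assert np.isclose(jensen_scaled(F, gradF, t1, t2, 1 - 1e-6),
                  bregman(F, gradF, t2, t1), atol=1e-4)
print(jensen_scaled(F, gradF, t1, t2, 0.5))  # equals 4 * J_F(t1, t2)
```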

Remark 3.

Alternatively, we may rescale $J_{F,\alpha}$ by the factor $\kappa(\alpha)=\frac{1}{4\alpha(1-\alpha)}$, i.e., $J_{F,\alpha}^{\bar s}(\theta_1:\theta_2)=\kappa(\alpha)\,J_{F,\alpha}(\theta_1:\theta_2)$, such that $\kappa(\frac12)=1$ and $J_{F,\frac12}^{\bar s}(\theta_1:\theta_2)=J_F(\theta_1,\theta_2)$.

Next, in Section 3 we first recall the connections between these Jensen and Bregman divergences, which are divergences between parameters, and the statistical divergence counterparts between probability densities. Then, in Section 4 we introduce the novel connections between these parameter divergences and α-divergences between unnormalized densities.

3. Divergences Related to the Cumulant Function

Consider the scaled α-skewed Bhattacharyya distances [7,15] between two probability densities p(x) and q(x):

$D_{B,\alpha}^s(p:q) = -\frac{1}{\alpha(1-\alpha)}\log\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu, \quad \alpha\in\mathbb{R}\setminus\{0,1\}.$

The scaled α-skewed Bhattacharyya distances can additionally be interpreted as Rényi divergences [25] scaled by $\frac{1}{\alpha}$: $D_{B,\alpha}^s(p:q)=\frac{1}{\alpha}D_{R,\alpha}(p:q)$, where the Rényi α-divergences are defined by

$D_{R,\alpha}(p:q) = \frac{1}{\alpha-1}\log\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu.$

The Bhattacharyya distance $D_B(p,q)=-\log\int\sqrt{p\,q}\,\mathrm{d}\mu$ corresponds to one-fourth of $D_{B,\frac12}^s(p:q)$: $D_B(p,q)=\frac14 D_{B,\frac12}^s(p:q)$. Because $D_{B,\alpha}^s$ tends to the Kullback–Leibler divergence $D_{KL}$ when $\alpha\to1$ and to the reverse Kullback–Leibler divergence $D_{KL}^*$ when $\alpha\to0$, we have

$D_{B,\alpha}^s(p:q) = \begin{cases} -\frac{1}{\alpha(1-\alpha)}\log\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu, & \alpha\in\mathbb{R}\setminus\{0,1\},\\ D_{KL}(p:q), & \alpha=1,\\ 4\,D_B(p,q), & \alpha=\frac12,\\ D_{KL}^*(p:q)=D_{KL}(q:p), & \alpha=0. \end{cases}$

When both probability densities belong to the same exponential family $\mathcal{E}=\{p_\theta(x):\theta\in\Theta\}$ with cumulant F(θ), we have the following proposition.

Proposition 4

([7]). The scaled α-skewed Bhattacharyya distances between two probability densities pθ1 and pθ2 of an exponential family amount to the scaled α-skewed Jensen divergence between their natural parameters:

$D_{B,\alpha}^s(p_{\theta_1}:p_{\theta_2}) = J_{F,\alpha}^s(\theta_1:\theta_2).$ (10)

Proof. 

The proof follows by first considering the α-skewed Bhattacharyya similarity coefficient $\rho_\alpha(p:q)=\int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu$:

$\rho_\alpha(p_{\theta_1}:p_{\theta_2}) = \int \exp\left(\langle\theta_1,x\rangle-F(\theta_1)\right)^\alpha \exp\left(\langle\theta_2,x\rangle-F(\theta_2)\right)^{1-\alpha}\mathrm{d}\mu = \int \exp\left(\langle\alpha\theta_1+(1-\alpha)\theta_2, x\rangle\right)\exp\left(-\left(\alpha F(\theta_1)+(1-\alpha)F(\theta_2)\right)\right)\mathrm{d}\mu.$

Multiplying the last equation by $\frac{\exp(F(\alpha\theta_1+(1-\alpha)\theta_2))}{\exp(F(\alpha\theta_1+(1-\alpha)\theta_2))}=\exp(0)=1$ with $\bar\theta=\alpha\theta_1+(1-\alpha)\theta_2$, we obtain

$\rho_\alpha(p_{\theta_1}:p_{\theta_2}) = \exp\left(-\left(\alpha F(\theta_1)+(1-\alpha)F(\theta_2)-F(\bar\theta)\right)\right)\int\exp\left(\langle\bar\theta,x\rangle - F(\bar\theta)\right)\mathrm{d}\mu.$

Because $\bar\theta\in\Theta$, we have $\int\exp\left(\langle\bar\theta,x\rangle-F(\bar\theta)\right)\mathrm{d}\mu=1$; therefore, we obtain

$\rho_\alpha(p_{\theta_1}:p_{\theta_2}) = \exp\left(-J_{F,\alpha}(\theta_1:\theta_2)\right).$

Taking the negative logarithm and rescaling by $\frac{1}{\alpha(1-\alpha)}$ yields $D_{B,\alpha}^s(p_{\theta_1}:p_{\theta_2})=\frac{1}{\alpha(1-\alpha)}J_{F,\alpha}(\theta_1:\theta_2)=J_{F,\alpha}^s(\theta_1:\theta_2)$.
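The following hedged numerical check (an illustration, not part of the original proof) verifies Proposition 4 and the α→1 limit identity recalled next, on the exponential distribution family with natural parameter θ = −λ < 0 and cumulant $F(\theta)=-\log(-\theta)$:

```python
# Illustrative numerical check (not from the paper) of Proposition 4 and of
# D_KL(p_theta1 : p_theta2) = B_F(theta2 : theta1) on the exponential
# distribution family in the natural parameterization theta = -lambda < 0.
import numpy as np
from scipy.integrate import quad

F = lambda th: -np.log(-th)               # cumulant function
gradF = lambda th: -1.0 / th              # moment parameter eta = E[x] = -1/theta
p = lambda x, th: np.exp(th * x - F(th))  # normalized density on [0, inf)

def bhattacharyya_scaled(th1, th2, alpha):
    """Scaled alpha-skewed Bhattacharyya distance by numerical integration."""
    coeff, _ = quad(lambda x: p(x, th1) ** alpha * p(x, th2) ** (1 - alpha),
                    0.0, np.inf)
    return -np.log(coeff) / (alpha * (1 - alpha))

def jensen_scaled(th1, th2, alpha):
    """Scaled alpha-skewed Jensen divergence induced by the cumulant F."""
    J = alpha * F(th1) + (1 - alpha) * F(th2) - F(alpha * th1 + (1 - alpha) * th2)
    return J / (alpha * (1 - alpha))

th1, th2, alpha = -0.8, -2.0, 0.3
assert np.isclose(bhattacharyya_scaled(th1, th2, alpha),
                  jensen_scaled(th1, th2, alpha))       # Proposition 4

# KLD as a reverse Bregman divergence on the natural parameters (alpha -> 1).
kl, _ = quad(lambda x: p(x, th1) * np.log(p(x, th1) / p(x, th2)), 0.0, np.inf)
BF = lambda a, b: F(a) - F(b) - (a - b) * gradF(b)      # Bregman divergence B_F
assert np.isclose(kl, BF(th2, th1))
print(kl, BF(th2, th1))
```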

For practitioners in machine learning, it is well known that the Kullback–Leibler divergence between two probability densities pθ1 and pθ2 of an exponential family amounts to a Bregman divergence for the cumulant generator on a swapped parameter order (e.g., [26,27]):

$D_{KL}(p_{\theta_1}:p_{\theta_2}) = B_F(\theta_2:\theta_1).$

This is a particular instance of Equation (10) obtained for α=1:

$D_{B,1}^s(p_{\theta_1}:p_{\theta_2}) = J_{F,1}^s(\theta_1:\theta_2).$

This formula has been further generalized in [28] by considering truncations of exponential family densities. Let $X_1\subseteq X_2\subseteq X$, and let $\mathcal{E}_1=\{1_{X_1}(x)\,p_\theta(x)\}$ and $\mathcal{E}_2=\{1_{X_2}(x)\,q_\theta(x)\}$ be two truncated families of $X$ with corresponding cumulant functions

$F_1(\theta) = \log\int_{X_1}\exp(\langle t(x),\theta\rangle)\,\mathrm{d}\mu$

and

$F_2(\theta) = \log\int_{X_2}\exp(\langle t(x),\theta\rangle)\,\mathrm{d}\mu \geq F_1(\theta).$

Then, we have

$D_{KL}(p_{\theta_1}:q_{\theta_2}) = B_{F_2,F_1}(\theta_2:\theta_1) = F_2(\theta_2) - F_1(\theta_1) - \langle\theta_2-\theta_1, \nabla F_1(\theta_1)\rangle.$

Truncated exponential families are normalized exponential families which may not be regular [29], i.e., the parameter space Θ may not be open.

4. Divergences Related to the Partition Function

Certain exponential families have intractable cumulant/partition functions (e.g., exponential families with sufficient statistics $t(x)=(x, x^2, \ldots, x^m)$ for high degrees m [20]) or cumulant/partition functions which require exponential time to compute [30] (e.g., graphical models [16], high-dimensional grid sample spaces, energy-based models [17] in deep learning, etc.). In such cases, the maximum likelihood estimator (MLE) cannot be used to infer the natural parameter of exponential densities. Many alternative methods have been proposed to handle such exponential families with intractable partition functions, e.g., score matching [31] or divergence-based inference [32,33]. Thus, it is important to consider dissimilarities between non-normalized statistical models.

The squared Hellinger distance [1] between two positive potentially unnormalized densities p˜ and q˜ is defined by

$D_H^2(\tilde p,\tilde q) = \frac12\int\left(\sqrt{\tilde p}-\sqrt{\tilde q}\right)^2\mathrm{d}\mu = \int\frac{\tilde p+\tilde q}{2}\,\mathrm{d}\mu - \int\sqrt{\tilde p\,\tilde q}\,\mathrm{d}\mu.$

Notice that the Hellinger divergence can be interpreted as the integral of the difference between the arithmetical mean $A(\tilde p,\tilde q)=\frac{\tilde p+\tilde q}{2}$ and the geometrical mean $G(\tilde p,\tilde q)=\sqrt{\tilde p\,\tilde q}$ of the densities: $D_H^2(\tilde p,\tilde q)=\int\left(A(\tilde p,\tilde q)-G(\tilde p,\tilde q)\right)\mathrm{d}\mu$. This further proves that $D_H(\tilde p,\tilde q)\geq 0$, as $A\geq G$. The Hellinger distance $D_H$ satisfies the metric axioms of distances.

When considering unnormalized densities $\tilde p_{\theta_1}=\exp(\langle t(x),\theta_1\rangle)$ and $\tilde p_{\theta_2}=\exp(\langle t(x),\theta_2\rangle)$ of an exponential family $\mathcal{E}$ with a partition function $Z(\theta)=\int\tilde p_\theta\,\mathrm{d}\mu$, we obtain

$D_H^2(\tilde p_{\theta_1},\tilde p_{\theta_2}) = \frac{Z(\theta_1)+Z(\theta_2)}{2} - Z\left(\frac{\theta_1+\theta_2}{2}\right) = J_Z(\theta_1,\theta_2),$ (11)

as $\sqrt{\tilde p_{\theta_1}\,\tilde p_{\theta_2}} = \tilde p_{\frac{\theta_1+\theta_2}{2}}$.

The Kullback–Leibler divergence [1] as extended to two positive densities p˜ and q˜ is defined by

$D_{KL}(\tilde p:\tilde q) = \int\left(\tilde p\log\frac{\tilde p}{\tilde q} + \tilde q - \tilde p\right)\mathrm{d}\mu.$ (12)

When considering unnormalized densities p˜θ1 and p˜θ2 of E, we obtain

$D_{KL}(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = \int\left(\tilde p_{\theta_1}(x)\log\frac{\tilde p_{\theta_1}(x)}{\tilde p_{\theta_2}(x)} + \tilde p_{\theta_2}(x) - \tilde p_{\theta_1}(x)\right)\mathrm{d}\mu(x)$ (13)
$= \int\left(e^{\langle t(x),\theta_1\rangle}\langle\theta_1-\theta_2, t(x)\rangle + e^{\langle t(x),\theta_2\rangle} - e^{\langle t(x),\theta_1\rangle}\right)\mathrm{d}\mu(x)$ (14)
$= \left\langle\int t(x)\,e^{\langle t(x),\theta_1\rangle}\mathrm{d}\mu(x),\ \theta_1-\theta_2\right\rangle + Z(\theta_2) - Z(\theta_1)$ (15)
$= \langle\theta_1-\theta_2, \nabla Z(\theta_1)\rangle + Z(\theta_2) - Z(\theta_1) = B_Z(\theta_2:\theta_1),$ (16)

as $\nabla Z(\theta)=\int t(x)\,\tilde p_\theta(x)\,\mathrm{d}\mu(x)$. Let $D_{KL}^*(\tilde p:\tilde q)=D_{KL}(\tilde q:\tilde p)$ denote the reverse KLD.
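A small numerical illustration (not from the paper; it assumes the unnormalized exponential densities $\tilde p_\theta(x)=\exp(-\theta x)$ on $x\geq0$ with θ > 0, for which $Z(\theta)=\frac{1}{\theta}$) of Equation (11) and Equation (16):

```python
# Illustrative check (assumption: unnormalized exponential densities
# p~_theta(x) = exp(-theta * x) on x >= 0 with theta > 0, Z(theta) = 1/theta)
# of Eq. (11): D_H^2 = J_Z and Eq. (16): extended KLD = reverse B_Z.
import numpy as np
from scipy.integrate import quad

Z = lambda th: 1.0 / th
gradZ = lambda th: -1.0 / th ** 2
ptilde = lambda x, th: np.exp(-th * x)   # unnormalized density on [0, inf)

th1, th2 = 0.9, 2.4

# Extended Kullback-Leibler divergence between unnormalized densities, Eq. (12).
kl, _ = quad(lambda x: ptilde(x, th1) * np.log(ptilde(x, th1) / ptilde(x, th2))
             + ptilde(x, th2) - ptilde(x, th1), 0.0, np.inf)
B_Z = lambda a, b: Z(a) - Z(b) - (a - b) * gradZ(b)
assert np.isclose(kl, B_Z(th2, th1))     # Eq. (16)

# Squared Hellinger distance between unnormalized densities, Eq. (11).
h2, _ = quad(lambda x: 0.5 * (np.sqrt(ptilde(x, th1)) - np.sqrt(ptilde(x, th2))) ** 2,
             0.0, np.inf)
J_Z = 0.5 * (Z(th1) + Z(th2)) - Z(0.5 * (th1 + th2))
assert np.isclose(h2, J_Z)               # Eq. (11)
print(kl, B_Z(th2, th1), h2, J_Z)
```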

More generally, the family of α-divergences [1] between the unnormalized densities $\tilde p$ and $\tilde q$ is defined for $\alpha\in\mathbb{R}$ by

$D_\alpha(\tilde p:\tilde q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\alpha\tilde p + (1-\alpha)\tilde q - \tilde p^\alpha\tilde q^{1-\alpha}\right)\mathrm{d}\mu, & \alpha\in\mathbb{R}\setminus\{0,1\},\\ D_{KL}^*(\tilde p:\tilde q)=D_{KL}(\tilde q:\tilde p), & \alpha=0,\\ 4\,D_H^2(\tilde p,\tilde q), & \alpha=\frac12,\\ D_{KL}(\tilde p:\tilde q), & \alpha=1. \end{cases}$

We have $D_\alpha^*(\tilde p:\tilde q)=D_\alpha(\tilde q:\tilde p)=D_{1-\alpha}(\tilde p:\tilde q)$, and the α-divergences are homogeneous divergences of degree 1: for all $\lambda>0$, we have $D_\alpha(\lambda\tilde p:\lambda\tilde q)=\lambda\, D_\alpha(\tilde p:\tilde q)$. Moreover, because $\alpha\tilde p+(1-\alpha)\tilde q-\tilde p^\alpha\tilde q^{1-\alpha}$ can be expressed as the difference of the weighted arithmetical mean minus the weighted geometrical mean, $A(\tilde p,\tilde q;\alpha,1-\alpha)-G(\tilde p,\tilde q;\alpha,1-\alpha)$, it follows from the arithmetical–geometrical mean inequality that $D_\alpha(\tilde p:\tilde q)\geq 0$.

When considering unnormalized densities p˜θ1 and p˜θ2 of E, we obtain

$D_\alpha(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\,J_{Z,\alpha}(\theta_1:\theta_2), & \alpha\in\mathbb{R}\setminus\{0,1\},\\ B_Z(\theta_1:\theta_2), & \alpha=0,\\ 4\,J_Z(\theta_1,\theta_2), & \alpha=\frac12,\\ B_Z^*(\theta_1:\theta_2)=B_Z(\theta_2:\theta_1), & \alpha=1. \end{cases}$

Proposition 5.

The α-divergences between the unnormalized densities of an exponential family amount to scaled α-Jensen divergences between their natural parameters for the partition function

$D_\alpha(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = J_{Z,\alpha}^s(\theta_1:\theta_2).$

When $\alpha\in\{0,1\}$, the oriented Kullback–Leibler divergences between unnormalized exponential family densities amount to reverse Bregman divergences on their corresponding natural parameters for the partition function:

$D_{KL}(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = B_Z(\theta_2:\theta_1).$

Proof. 

For $\alpha\notin\{0,1\}$, consider

$D_\alpha(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}\int\left(\alpha\tilde p_{\theta_1} + (1-\alpha)\tilde p_{\theta_2} - \tilde p_{\theta_1}^\alpha\tilde p_{\theta_2}^{1-\alpha}\right)\mathrm{d}\mu.$

Here, we have $\int\alpha\tilde p_{\theta_1}\,\mathrm{d}\mu=\alpha Z(\theta_1)$, $\int(1-\alpha)\tilde p_{\theta_2}\,\mathrm{d}\mu=(1-\alpha)Z(\theta_2)$, and $\int\tilde p_{\theta_1}^\alpha\tilde p_{\theta_2}^{1-\alpha}\,\mathrm{d}\mu = \int\tilde p_{\alpha\theta_1+(1-\alpha)\theta_2}\,\mathrm{d}\mu = Z(\alpha\theta_1+(1-\alpha)\theta_2)$. It follows that

$D_\alpha(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}J_{Z,\alpha}(\theta_1:\theta_2) = J_{Z,\alpha}^s(\theta_1:\theta_2).$

Notice that the KLD extended to unnormalized densities can be written as a generalized relative entropy, i.e., it can be obtained as the difference of the extended cross-entropy minus the extended entropy (self cross-entropy):

$D_{KL}(\tilde p:\tilde q) = H^\times(\tilde p:\tilde q) - H(\tilde p) = \int\left(\tilde p\log\frac{\tilde p}{\tilde q} + \tilde q - \tilde p\right)\mathrm{d}\mu,$

with

$H^\times(\tilde p:\tilde q) = \int\left(\tilde p(x)\log\frac{1}{\tilde q(x)} + \tilde q(x)\right)\mathrm{d}\mu(x) - 1$

and

$H(\tilde p) = H^\times(\tilde p:\tilde p) = \int\left(\tilde p(x)\log\frac{1}{\tilde p(x)} + \tilde p(x)\right)\mathrm{d}\mu(x) - 1.$

Remark 4.

In general, we can consider two unnormalized positive densities $\tilde p(x)$ and $\tilde q(x)$. Let $p(x)=\frac{\tilde p(x)}{Z_p}$ and $q(x)=\frac{\tilde q(x)}{Z_q}$ denote their corresponding normalized densities (with normalizing factors $Z_p=\int\tilde p\,\mathrm{d}\mu$ and $Z_q=\int\tilde q\,\mathrm{d}\mu$); then, the KLD between $\tilde p$ and $\tilde q$ can be expressed using the KLD between their normalized densities and normalizing factors, as follows:

$D_{KL}(\tilde p:\tilde q) = Z_p\, D_{KL}(p:q) + Z_p\log\frac{Z_p}{Z_q} + Z_q - Z_p.$ (17)

Similarly, we have

$H^\times(\tilde p:\tilde q) = Z_p\, H^\times(p:q) - Z_p\log Z_q + Z_q - 1,$ (18)
$H(\tilde p) = Z_p\, H(p) - Z_p\log Z_p + Z_p - 1,$ (19)

and $D_{KL}(\tilde p:\tilde q)=H^\times(\tilde p:\tilde q)-H(\tilde p)$.

Notice that Equation (17) allows us to derive the following identity between BZ and BF:

$B_Z(\theta_2:\theta_1) = Z(\theta_1)\,B_F(\theta_2:\theta_1) + Z(\theta_1)\log\frac{Z(\theta_1)}{Z(\theta_2)} + Z(\theta_2) - Z(\theta_1)$ (20)
$= \exp(F(\theta_1))\,B_F(\theta_2:\theta_1) + \exp(F(\theta_1))\left(F(\theta_1)-F(\theta_2)\right) + \exp(F(\theta_2)) - \exp(F(\theta_1)).$ (21)

Let $D_{skl}(a:b)=a\log\frac{a}{b}+b-a$ be the scalar KLD for $a>0$ and $b>0$. Then, we can rewrite Equation (17) as

$D_{KL}(\tilde p:\tilde q) = Z_p\,D_{KL}(p:q) + D_{skl}(Z_p:Z_q),$

and we have

$B_Z(\theta_2:\theta_1) = Z(\theta_1)\,B_F(\theta_2:\theta_1) + D_{skl}(Z(\theta_1):Z(\theta_2)).$

In addition, the KLD between the unnormalized densities p˜ and q˜ with support X can be written as a definite integral of a scalar Bregman divergence:

$D_{KL}(\tilde p:\tilde q) = \int_X D_{skl}(\tilde p(x):\tilde q(x))\,\mathrm{d}\mu(x) = \int_X B_{f_{skl}}(\tilde p(x):\tilde q(x))\,\mathrm{d}\mu(x),$

where $f_{skl}(x)=x\log x - x$. Because $B_{f_{skl}}(a:b)\geq 0$ for all $a>0$ and $b>0$, we can deduce that $D_{KL}(\tilde p:\tilde q)\geq 0$, with equality iff $\tilde p(x)=\tilde q(x)$ μ-almost everywhere.

Notice that $B_Z(\theta_2:\theta_1)=Z(\theta_1)\,B_F(\theta_2:\theta_1)+D_{skl}(Z(\theta_1):Z(\theta_2))$ can be interpreted as the sum of two divergences, that is, a conformal Bregman divergence plus a scalar Bregman divergence.
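The conformal decomposition above is easy to check numerically; the following sketch (an illustration, again assuming the exponential-family generator $Z(\theta)=\frac{1}{\theta}$ with $F=\log Z=-\log\theta$) verifies $B_Z(\theta_2:\theta_1)=Z(\theta_1)B_F(\theta_2:\theta_1)+D_{skl}(Z(\theta_1):Z(\theta_2))$:

```python
# Small sketch (illustration, assuming Z(theta) = 1/theta, F = log Z = -log theta)
# checking B_Z(th2 : th1) = Z(th1) * B_F(th2 : th1) + D_skl(Z(th1) : Z(th2)).
import numpy as np

Z, gradZ = lambda t: 1.0 / t, lambda t: -1.0 / t ** 2
F, gradF = lambda t: -np.log(t), lambda t: -1.0 / t

def bregman(f, gradf, a, b):
    return f(a) - f(b) - (a - b) * gradf(b)

def d_skl(a, b):
    """Scalar KLD D_skl(a : b) = a log(a/b) + b - a."""
    return a * np.log(a / b) + b - a

th1, th2 = 0.6, 1.7
lhs = bregman(Z, gradZ, th2, th1)
rhs = Z(th1) * bregman(F, gradF, th2, th1) + d_skl(Z(th1), Z(th2))
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```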

Remark 5.

Consider the KLD between the normalized pθ1 and unnormalized p˜θ2 densities of the same exponential family. In this case, we have

$D_{KL}(p_{\theta_1}:\tilde p_{\theta_2}) = B_F(\theta_2:\theta_1) - \log Z(\theta_2) + Z(\theta_2) - 1 = (Z(\theta_2)-1) - F(\theta_1) - \langle\theta_2-\theta_1, \nabla F(\theta_1)\rangle$ (22)
$= B_{Z-1,F}(\theta_2:\theta_1).$ (23)

The divergence $B_{Z-1,F}$ is a dual Bregman pseudo-divergence [28]:

$B_{F_1,F_2}(\theta_1:\theta_2) = F_1(\theta_1) - F_2(\theta_2) - \langle\theta_1-\theta_2, \nabla F_2(\theta_2)\rangle,$

for $F_1$ and $F_2$ two strictly convex and smooth functions such that $F_1\geq F_2$. Indeed, we can check that the generators $F_1(\theta)=Z(\theta)-1$ and $F_2(\theta)=F(\theta)$ are both Bregman generators; then, we have $F_1(\theta)\geq F_2(\theta)$, as $e^x\geq x+1$ for all x (with equality when $x=0$), i.e., $Z(\theta)-1\geq F(\theta)$.

More generally, the α-divergences between pθ1 and p˜θ2 can be written as

$D_\alpha(p_{\theta_1}:\tilde p_{\theta_2}) = \frac{1}{\alpha(1-\alpha)}\left(\alpha + (1-\alpha)\,Z(\theta_2) - \frac{Z(\alpha\theta_1+(1-\alpha)\theta_2)}{Z^\alpha(\theta_1)}\right),$ (24)

with the (signed) α-skewed Bhattacharyya distances provided by

$D_{B,\alpha}(p_{\theta_1}:\tilde p_{\theta_2}) = \alpha\log Z(\theta_1) - \log Z(\alpha\theta_1+(1-\alpha)\theta_2).$

Let us illustrate Proposition 5 with some examples.

Example 1.

Consider the family of exponential distributions $\mathcal{E}=\{p_\lambda(x)=1_{x\geq 0}\,\lambda\exp(-\lambda x)\}$, where $\mathcal{E}$ is an exponential family with natural parameter $\theta=\lambda$, parameter space $\Theta=\mathbb{R}_{>0}$, and sufficient statistic $t(x)=-x$. The partition function is $Z(\theta)=\frac{1}{\theta}$, with $Z'(\theta)=-\frac{1}{\theta^2}$ and $Z''(\theta)=\frac{2}{\theta^3}=\frac{2}{\lambda^3}>0$, while the cumulant function is $F(\theta)=\log Z(\theta)=-\log\theta$ with moment parameter $\eta=E_{p_\lambda}[t(x)]=F'(\theta)=-\frac{1}{\theta}$. The α-divergences between two unnormalized exponential densities are

$D_\alpha(\tilde p_{\lambda_1}:\tilde p_{\lambda_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}J_{Z,\alpha}(\theta_1:\theta_2) = \frac{(\lambda_1-\lambda_2)^2}{\alpha\lambda_1^2\lambda_2+(1-\alpha)\lambda_1\lambda_2^2}, & \alpha\notin\{0,1\},\\ D_{KL}(\tilde p_{\lambda_2}:\tilde p_{\lambda_1}) = B_Z(\theta_1:\theta_2) = \frac{(\lambda_1-\lambda_2)^2}{\lambda_1\lambda_2^2}, & \alpha=0,\\ 4\,J_Z(\theta_1,\theta_2) = \frac{2(\lambda_1-\lambda_2)^2}{\lambda_1\lambda_2^2+\lambda_1^2\lambda_2}, & \alpha=\frac12,\\ D_{KL}(\tilde p_{\lambda_1}:\tilde p_{\lambda_2}) = B_Z(\theta_2:\theta_1) = \frac{(\lambda_1-\lambda_2)^2}{\lambda_2\lambda_1^2}, & \alpha=1. \end{cases}$ (25)
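The closed forms of Equation (25) can be checked against direct numerical integration; the sketch below (an illustration, not from the paper) uses the unnormalized densities $\tilde p_\lambda(x)=\exp(-\lambda x)$ on $x\geq0$:

```python
# Illustrative check (not from the paper) of the closed forms in Equation (25)
# for the unnormalized exponential densities ~p_lambda(x) = exp(-lambda * x), x >= 0.
import numpy as np
from scipy.integrate import quad

def alpha_div_num(l1, l2, alpha):
    """alpha-divergence between unnormalized densities by numerical integration."""
    integrand = lambda x: (alpha * np.exp(-l1 * x) + (1 - alpha) * np.exp(-l2 * x)
                           - np.exp(-l1 * x) ** alpha * np.exp(-l2 * x) ** (1 - alpha))
    val, _ = quad(integrand, 0.0, np.inf)
    return val / (alpha * (1 - alpha))

def alpha_div_closed(l1, l2, alpha):
    """Closed form of Equation (25)."""
    return (l1 - l2) ** 2 / (alpha * l1 ** 2 * l2 + (1 - alpha) * l1 * l2 ** 2)

l1, l2 = 0.7, 1.9
for alpha in (0.2, 0.5, 0.8):
    assert np.isclose(alpha_div_num(l1, l2, alpha), alpha_div_closed(l1, l2, alpha))
print([alpha_div_closed(l1, l2, a) for a in (0.2, 0.5, 0.8)])
```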

Example 2.

Consider the family of univariate centered normal distributions with $\tilde p_{\sigma^2}(x)=\exp\left(-\frac{x^2}{2\sigma^2}\right)$ and partition function $Z(\sigma^2)=\sqrt{2\pi\sigma^2}$, such that $p_{\sigma^2}(x)=\frac{1}{Z(\sigma^2)}\tilde p_{\sigma^2}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x^2}{2\sigma^2}\right)$. Here, we have the natural parameter $\theta=\frac{1}{\sigma^2}\in\Theta=\mathbb{R}_{>0}$ and sufficient statistic $t(x)=-\frac{x^2}{2}$. The partition function expressed with the natural parameter is $Z(\theta)=\sqrt{\frac{2\pi}{\theta}}$, with $Z'(\theta)=-\sqrt{\frac{\pi}{2}}\,\theta^{-\frac32}$ and $Z''(\theta)=\frac32\sqrt{\frac{\pi}{2}}\,\theta^{-\frac52}>0$ (strictly convex on Θ). The unnormalized KLD between $\tilde p_{\sigma_1^2}$ and $\tilde p_{\sigma_2^2}$ is

$D_{KL}(\tilde p_{\sigma_1^2}:\tilde p_{\sigma_2^2}) = B_Z(\theta_2:\theta_1) = \sqrt{\frac{\pi}{2}}\left(2\sigma_2 - 3\sigma_1 + \frac{\sigma_1^3}{\sigma_2^2}\right).$

We can check that we have $D_{KL}(\tilde p_{\sigma^2}:\tilde p_{\sigma^2})=0$.

For the Hellinger divergence, we have

$D_H^2(\tilde p_{\sigma_1^2}:\tilde p_{\sigma_2^2}) = J_Z(\theta_1,\theta_2) = \sqrt{\frac{\pi}{2}}\,(\sigma_1+\sigma_2) - \frac{2\sqrt{\pi}\,\sigma_1\sigma_2}{\sqrt{\sigma_1^2+\sigma_2^2}},$

and we can check that $D_H(\tilde p_{\sigma^2}:\tilde p_{\sigma^2})=0$.

Consider the family of d-variate centered normal distributions with unnormalized density

$\tilde p_\Sigma(x) \propto \exp\left(-\frac12 x^\top\Sigma^{-1}x\right) = \exp\left(-\frac12\mathrm{tr}\left(x^\top\Sigma^{-1}x\right)\right) = \exp\left(-\frac12\mathrm{tr}\left(x x^\top\Sigma^{-1}\right)\right)$

obtained using the matrix trace cyclic property, where Σ is the covariance matrix. Here, we have $\theta=\Sigma^{-1}$ (precision matrix) and $\Theta=\mathrm{Sym}^{++}(d)$ for $t(x)=-\frac12 x x^\top$, with the matrix inner product $\langle A,B\rangle=\mathrm{tr}(AB)$. The partition function $Z(\Sigma)=(2\pi)^{\frac d2}\sqrt{\det(\Sigma)}$ expressed with the natural parameter is $Z(\theta)=(2\pi)^{\frac d2}\frac{1}{\sqrt{\det(\theta)}}$. This is a convex function with

$\nabla Z(\theta) = -\frac12(2\pi)^{\frac d2}\frac{\nabla_\theta\det(\theta)}{\det(\theta)^{\frac32}} = -\frac12(2\pi)^{\frac d2}\,\theta^{-1}\det(\theta)^{-\frac12},$

as $\nabla_\theta\det(\theta)=\det(\theta)\,\theta^{-1}$ using matrix calculus.

Now, consider the family of univariate normal distributions

$\mathcal{E}=\left\{p_{\mu,\sigma^2}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac12\frac{(x-\mu)^2}{\sigma^2}\right)\right\}.$

Let $\theta=\left(\theta_1=\frac{1}{\sigma^2},\ \theta_2=\frac{\mu}{\sigma^2}\right)$ and

$Z(\theta_1,\theta_2)=\sqrt{\frac{2\pi}{\theta_1}}\exp\left(\frac12\frac{\theta_2^2}{\theta_1}\right).$

The unnormalized densities are $\tilde p_\theta(x)=\exp\left(-\theta_1\frac{x^2}{2}+x\,\theta_2\right)$, and we have

$\nabla Z(\theta)=\left(-\sqrt{\frac{\pi}{2}}\,(\theta_1+\theta_2^2)\,\exp\left(\frac{\theta_2^2}{2\theta_1}\right)\theta_1^{-\frac52},\ \ \sqrt{2\pi}\,\theta_2\,\exp\left(\frac{\theta_2^2}{2\theta_1}\right)\theta_1^{-\frac32}\right).$

It follows that $D_{KL}[\tilde p_\theta:\tilde p_{\theta'}]=B_Z(\theta':\theta)$.
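As a sanity check (an illustration, not from the paper), the closed-form expressions given above for the unnormalized centered normal densities can be compared with numerical integration:

```python
# Numerical sanity check (illustration, not from the paper) of the closed forms
# above for unnormalized centered normal densities ~p_{s^2}(x) = exp(-x^2/(2 s^2)).
import numpy as np
from scipy.integrate import quad

def kld_unnorm(s1, s2):
    """Extended KLD of Eq. (12) between ~p_{s1^2} and ~p_{s2^2} by integration."""
    p = lambda x: np.exp(-x ** 2 / (2 * s1 ** 2))
    q = lambda x: np.exp(-x ** 2 / (2 * s2 ** 2))
    val, _ = quad(lambda x: p(x) * np.log(p(x) / q(x)) + q(x) - p(x), -np.inf, np.inf)
    return val

def hellinger2_unnorm(s1, s2):
    p = lambda x: np.exp(-x ** 2 / (2 * s1 ** 2))
    q = lambda x: np.exp(-x ** 2 / (2 * s2 ** 2))
    val, _ = quad(lambda x: 0.5 * (np.sqrt(p(x)) - np.sqrt(q(x))) ** 2, -np.inf, np.inf)
    return val

s1, s2 = 1.2, 0.8
kl_closed = np.sqrt(np.pi / 2) * (2 * s2 - 3 * s1 + s1 ** 3 / s2 ** 2)
h2_closed = (np.sqrt(np.pi / 2) * (s1 + s2)
             - 2 * np.sqrt(np.pi) * s1 * s2 / np.sqrt(s1 ** 2 + s2 ** 2))
assert np.isclose(kld_unnorm(s1, s2), kl_closed)
assert np.isclose(hellinger2_unnorm(s1, s2), h2_closed)
print(kl_closed, h2_closed)
```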

5. Deforming Convex Functions and Their Induced Dually Flat Spaces

5.1. Comparative Convexity

The log-convexity can be interpreted as a special case of comparative convexity with respect to a pair (M,N) of comparable weighted means [9], as follows.

A function Z is (M,N)-convex if and only if for $\alpha\in[0,1]$ we have

$Z(M(x,y;\alpha,1-\alpha)) \leq N(Z(x),Z(y);\alpha,1-\alpha),$ (26)

and is strictly (M,N)-convex iff the inequality is strict for $\alpha\in(0,1)$ and $x\neq y$. Furthermore, a function Z is (strictly) (M,N)-concave if $-Z$ is (strictly) (M,N)-convex.

Log-convexity corresponds to (A,G)-convexity, i.e., convexity with respect to the weighted arithmetical and geometrical means defined respectively by $A(x,y;\alpha,1-\alpha)=\alpha x+(1-\alpha)y$ and $G(x,y;\alpha,1-\alpha)=x^\alpha y^{1-\alpha}$. Ordinary convexity is (A,A)-convexity.

A weighted quasi-arithmetical mean [34] (also called a Kolmogorov–Nagumo mean [35]) is defined for a continuous and strictly increasing function h by

$M_h(x,y;\alpha,1-\alpha) = h^{-1}\left(\alpha h(x) + (1-\alpha)h(y)\right).$

We let $M_h(x,y)=M_h\left(x,y;\frac12,\frac12\right)$. Quasi-arithmetical means include the arithmetical mean obtained for $h(u)=\mathrm{id}(u)=u$ and the geometrical mean for $h(u)=\log(u)$, and more generally the power means

$M_p(x,y;\alpha,1-\alpha) = \left(\alpha x^p + (1-\alpha)y^p\right)^{\frac1p} = M_{h_p}(x,y;\alpha,1-\alpha), \quad p\neq 0,$

which are quasi-arithmetical means obtained for the family of generators $h_p(u)=\frac{u^p-1}{p}$ with inverses $h_p^{-1}(u)=(1+u\,p)^{\frac1p}$. In the limit $p\to0$, we have $M_0(x,y)=G(x,y)$ for the generator $\lim_{p\to0}h_p(u)=h_0(u)=\log u$.

Proposition 6

([36,37]). A function Z(θ) is strictly $(M_\rho,M_\tau)$-convex with respect to two strictly increasing smooth functions ρ and τ if and only if the function $F=\tau\circ Z\circ\rho^{-1}$ is strictly convex.

Notice that the set of strictly increasing smooth functions forms a non-Abelian group, with the group operation being function composition, the neutral element being the identity function, and the inverse elements being the functional inverses.

Because log-convexity is $(A=M_{\mathrm{id}}, G=M_{\log})$-convexity, a function Z is strictly log-convex iff $\log\circ Z\circ\mathrm{id}^{-1}=\log\circ Z$ is strictly convex. We have

$Z=\tau^{-1}\circ F\circ\rho \Leftrightarrow F=\tau\circ Z\circ\rho^{-1}.$

Starting from a given convex function F(θ), we can thus deform F(θ) to obtain a function Z(θ) using two strictly monotone functions τ and ρ: $Z(\theta)=\tau^{-1}(F(\rho(\theta)))$.

For an $(M_\rho,M_\tau)$-convex function Z(θ) which is also strictly convex, we can define a pair of Bregman divergences $B_Z$ and $B_F$ with $F(\theta)=\tau(Z(\rho^{-1}(\theta)))$ and a corresponding pair of skewed Jensen divergences.

Thus, we have the following generic deformation scheme.

$F=\tau\circ Z\circ\rho^{-1}$ ($(M_{\rho^{-1}},M_{\tau^{-1}})$-convex when Z is convex) $\quad\overset{(\rho,\tau)\text{-deformation}}{\underset{(\rho^{-1},\tau^{-1})\text{-deformation}}{\rightleftarrows}}\quad$ $Z=\tau^{-1}\circ F\circ\rho$ ($(M_{\rho},M_{\tau})$-convex when F is convex)

In particular, when the function Z is deformed by the strictly increasing power function generators $h_{p_1}$ and $h_{p_2}$ for $p_1$ and $p_2$ in $\mathbb{R}$ as

$Z_{p_1,p_2}=h_{p_2}\circ Z\circ h_{p_1}^{-1},$

then $Z_{p_1,p_2}$ is strictly convex when Z is strictly $(M_{p_1},M_{p_2})$-convex, and as such induces corresponding Bregman and Jensen divergences.

Example 3.

Consider the partition function $Z(\theta)=\frac{1}{\theta}$ of the exponential distribution family ($\theta>0$ with $\Theta=\mathbb{R}_{>0}$). Let $Z_p(\theta)=(h_p\circ Z)(\theta)=\frac{\theta^{-p}-1}{p}$; then, we have $Z_p''(\theta)=(1+p)\frac{1}{\theta^{2+p}}>0$ when $p>-1$. Thus, we can deform Z smoothly by $Z_p$ while preserving convexity by ranging p from $-1$ to $+\infty$. In this way, we obtain a corresponding family of Bregman and Jensen divergences.
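A brief sketch of Example 3 (an illustration with hypothetical parameter values): the deformed generators $Z_p$ remain strictly convex for $p>-1$ and therefore each induce a Bregman divergence:

```python
# Sketch (illustration, not from the paper) of the power deformation of Example 3:
# Z_p(theta) = (theta**(-p) - 1) / p deforms Z(theta) = 1/theta while staying
# strictly convex for p > -1, so each p yields a Bregman divergence.
import numpy as np

def Z_p(theta, p):
    """Deformed generator h_p(Z(theta)) with Z(theta) = 1/theta."""
    if np.isclose(p, 0.0):
        return -np.log(theta)            # limit p -> 0: h_0 = log, Z_0 = -log(theta)
    return (theta ** (-p) - 1.0) / p

def grad_Z_p(theta, p):
    return -theta ** (-p - 1.0)          # d/dtheta Z_p(theta)

def bregman_Z_p(t1, t2, p):
    """Bregman divergence induced by the deformed generator Z_p."""
    return Z_p(t1, p) - Z_p(t2, p) - (t1 - t2) * grad_Z_p(t2, p)

thetas = np.linspace(0.2, 5.0, 200)
for p in (-0.5, 0.0, 1.0, 3.0):
    # numerical second-derivative check: convexity of Z_p on Theta = (0, inf)
    d2 = np.gradient(np.gradient(Z_p(thetas, p), thetas), thetas)
    assert np.all(d2[2:-2] > 0)
    assert bregman_Z_p(0.7, 2.1, p) > 0  # induced divergence is non-negative
print([round(bregman_Z_p(0.7, 2.1, p), 4) for p in (-0.5, 0.0, 1.0, 3.0)])
```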

The proposed convex deformation using quasi-arithmetical mean generators differs from the interpolation of convex functions using the technique of proximal averaging [38].

Note that in [37] the comparative convexity with respect to a pair of quasi-arithmetical means (Mρ,Mτ) is used to define a (Mρ,Mτ)-Bregman divergence, which turns out to be equivalent to a conformal Bregman divergence on the ρ-embedding of the parameters.

5.2. Dually Flat Spaces

We start with a refinement of the class of convex functions used to generate dually flat spaces.

Definition 2

(Legendre type function [39]). $(\Theta,F)$ is of Legendre type if the function $F:\Theta\to\mathbb{R}$ is strictly convex and differentiable with Θ open and

$\lim_{\lambda\to 0}\frac{\mathrm{d}}{\mathrm{d}\lambda}F\left(\lambda\theta+(1-\lambda)\bar\theta\right) = -\infty, \quad \forall\theta\in\Theta,\ \forall\bar\theta\in\partial\Theta.$ (27)

Legendre-type functions $(\Theta,F)$ admit a convex conjugate $F^*(\eta)$ via the Legendre transform $F^*(\eta)=\sup_{\theta\in\Theta}\{\langle\theta,\eta\rangle-F(\theta)\}$:

$F^*(\eta) = \left\langle(\nabla F)^{-1}(\eta), \eta\right\rangle - F\left((\nabla F)^{-1}(\eta)\right).$

A smooth and strictly convex function (Θ,F(θ)) of Legendre type induces a dually flat space [1] M, i.e., a smooth Hessian manifold [40] with a single global chart (Θ,θ(·)) [1]. A canonical divergence D(p:q) between two points p and q of M is viewed as a single-parameter contrast function [41] D(rpq) on the product manifold M×M. The canonical divergence and its dual canonical divergence D*(rqp)=D(rpq) can be expressed equivalently as either dual Bregman divergences or dual Fenchel–Young divergences (Figure 2):

$D(r_{pq}) = B_F(\theta(p):\theta(q)) = Y_{F,F^*}(\theta(p):\eta(q)) = D^*(r_{qp}) = B_{F^*}(\eta(q):\eta(p)) = Y_{F^*,F}(\eta(q):\theta(p)),$

where YF,F* is the Fenchel–Young divergence:

$Y_{F,F^*}(\theta(p):\eta(q)) = F(\theta(p)) + F^*(\eta(q)) - \langle\theta(p),\eta(q)\rangle.$

Figure 2.


The canonical divergence D and dual canonical divergence D* on a dually flat space M equipped with potential functions F and F* can be viewed as single-parameter contrast functions on the product manifold M×M: The divergence D can be expressed using either the θ×θ-coordinate system as a Bregman divergence or the mixed θ×η-coordinate system as a Fenchel–Young divergence. Similarly, the dual divergence D* can be expressed using either the η×η-coordinate system as a dual Bregman divergence or the mixed η×θ-coordinate system as a dual Fenchel–Young divergence.

We have the dual global coordinate system $\eta=\nabla F(\theta)$ and the domain $H=\{\nabla F(\theta):\theta\in\Theta\}$, which defines the dual Legendre-type potential function $(H,F^*(\eta))$. The Legendre-type condition ensures that $F^{**}=F$ (a sufficient condition is for F to be convex and lower semi-continuous [42]).
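A hedged sketch (not from the paper) of the Legendre transform and of the Bregman/Fenchel–Young equivalence, using the Legendre-type generator $F(\theta)=-\log(-\theta)$ on $\Theta=(-\infty,0)$ (the cumulant function of the exponential distribution family):

```python
# Hedged sketch (not from the paper) of the Legendre transform and the
# Bregman/Fenchel-Young equivalence for F(theta) = -log(-theta) on Theta = (-inf, 0).
import numpy as np

F = lambda th: -np.log(-th)
gradF = lambda th: -1.0 / th                 # eta = dF/dtheta, H = (0, inf)
gradF_inv = lambda eta: -1.0 / eta           # (dF)^{-1}: H -> Theta

def F_star(eta):
    """Convex conjugate F*(eta) = <gradF^{-1}(eta), eta> - F(gradF^{-1}(eta))."""
    th = gradF_inv(eta)
    return th * eta - F(th)

def bregman_F(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def fenchel_young(t1, e2):
    """Y_{F,F*}(theta1 : eta2) = F(theta1) + F*(eta2) - <theta1, eta2>."""
    return F(t1) + F_star(e2) - t1 * e2

th1, th2 = -0.5, -3.0
eta2 = gradF(th2)
assert np.isclose(bregman_F(th1, th2), fenchel_young(th1, eta2))
print(bregman_F(th1, th2), F_star(eta2))
```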

A manifold M is called dually flat, as the torsion-free affine connections ∇ and ∇* induced by the potential functions F(θ) and F*(η) linked by the Legendre–Fenchel transformation are flat [1]; that is, their Christoffel symbols vanish in the corresponding dual coordinate systems: $\Gamma(\theta)=0$ and $\Gamma^*(\eta)=0$.

The Legendre-type function (Θ,F(θ)) is not defined uniquely; the function $\bar F(\bar\theta)=F(A\bar\theta+b)+\langle c,\bar\theta\rangle+d$ with $\theta=A\bar\theta+b$, for an invertible matrix A, vectors b and c, and a scalar d, defines the same dually flat space with the same canonical divergence D(p:q):

$D(p:q) = B_F(\theta(p):\theta(q)) = B_{\bar F}(\bar\theta(p):\bar\theta(q)).$

Thus, a log-convex Legendre-type function Z(θ) induces two dually flat spaces by considering the DFSs induced by Z(θ) and by $F(\theta)=\log Z(\theta)$. Let the corresponding gradient maps be $\eta=\nabla Z(\theta)$ and $\tilde\eta=\nabla F(\theta)=\frac{\eta}{Z(\theta)}$.

When F(θ) is chosen as the cumulant function of an exponential family, the Bregman divergence BF(θ1:θ2) can be interpreted as a statistical divergence between corresponding probability densities, meaning that the Bregman divergence amounts to the reverse Kullback–Leibler divergence: BF(θ1:θ2)=DKL*(pθ1:pθ2), where DKL* is the reverse KLD.

Notice that deforming a convex function F(θ) into F(ρ(θ)) such that Fρ remains strictly convex has been considered by Yoshizawa and Tanabe [43] to build a two-parameter deformation ρα,β of the dually flat space induced by the cumulant function F(θ) of the multivariate normal family. Additionally, see the method of Hougaard [44] for obtaining other exponential families from a given exponential family.

Thus, in general, there are many more dually flat spaces with corresponding divergences and statistical divergences than the usually considered exponential family manifold [5] induced by the cumulant function. It is interesting to consider their use in information sciences.

6. Conclusions and Discussion

For machine learning practitioners, it is well known that the Kullback–Leibler divergence (KLD) between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family with cumulant function F (free energy in thermodynamics) amounts to a reverse Bregman divergence [26] induced by F, or equivalently to a reverse Fenchel–Young divergence [27]:

$D_{KL}(p_{\theta_1}:p_{\theta_2}) = B_F(\theta_2:\theta_1) = Y_{F,F^*}(\theta_2:\eta_1),$

where $\eta=\nabla F(\theta)$ is the dual moment or expectation parameter.

In this paper, we have shown that the KLD as extended to positive unnormalized densities p˜θ1 and p˜θ2 of an exponential family with a convex partition function Z(θ) (Laplace transform) amounts to a reverse Bregman divergence induced by Z, or equivalently to a reverse Fenchel–Young divergence

$D_{KL}(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = B_Z(\theta_2:\theta_1) = Y_{Z,Z^*}(\theta_2:\tilde\eta_1),$

where $\tilde\eta=\nabla Z(\theta)$.

More generally, we have shown that the scaled α-skewed Jensen divergences induced by the cumulant and partition functions between natural parameters coincide with the scaled α-skewed Bhattacharyya distances between probability densities and the α-divergences between unnormalized densities, respectively:

$D_{B,\alpha}^s(p_{\theta_1}:p_{\theta_2}) = J_{F,\alpha}^s(\theta_1:\theta_2), \qquad D_\alpha(\tilde p_{\theta_1}:\tilde p_{\theta_2}) = J_{Z,\alpha}^s(\theta_1:\theta_2).$

We have noted that the partition functions Z of exponential families are both convex and log-convex, and that the corresponding cumulant functions are both convex and exponentially convex.

Figure 3 summarizes the relationships between statistical divergences and between the normalized and unnormalized densities of an exponential family, as well as the corresponding divergences between their natural parameters. Notice that Brekelmans and Nielsen [45] considered deformed uni-order likelihood ratio exponential families (LREFs) for annealing paths and obtained an identity for the α-divergences between unnormalized densities and Bregman divergences induced by multiplicatively scaled partition functions.

Figure 3.


Statistical divergences between normalized pθ and unnormalized p̃θ densities of an exponential family E with corresponding divergences between their natural parameters. Without loss of generality, we consider a natural exponential family (i.e., t(x)=x and k(x)=0) with cumulant function F and partition function Z, with JF and BF respectively denoting the Jensen and Bregman divergences induced by the generator F. The statistical divergences DR,α and DB,α denote the Rényi α-divergences and skewed α-Bhattacharyya distances, respectively. The superscript “s” indicates rescaling by the multiplicative factor $\frac{1}{\alpha(1-\alpha)}$, while the superscript “*” denotes the reverse divergence obtained by swapping the parameter order.

Because the log-convex partition function is also convex, we have generalized the principle of building pairs of convex generators using the comparative convexity with respect to a pair of quasi-arithmetical means, and have further discussed the induced dually flat spaces and divergences. In particular, by considering the convexity-preserving deformations obtained by power mean generators, we have shown how to obtain a family of convex generators and dually flat spaces. Notice that some parametric families of Bregman divergences, such as the α-divergences [46], β-divergences [47], and V-geometry [48] of symmetric positive-definite matrices, yield families of dually flat spaces.

Banerjee et al. [49] proved a duality between regular exponential families and a subclass of Bregman divergences, which they accordingly termed regular Bregman divergences. In particular, this duality allows the Maximum Likelihood Estimator (MLE) of an exponential family with a cumulant function F to be viewed as a right-sided Bregman centroid with respect to the Legendre–Fenchel dual F*. In [50], the scope of this duality was further extended for arbitrary Bregman divergences by introducing a class of generalized exponential families.

Concave deformations have been recently studied in [51], where the authors introduced the logϕ-concavity induced by a positive continuous function ϕ generating a deformed logarithm logϕ as the (A,logϕ)-comparative concavity (Definition 1.2 in [51]), as well as the weaker notion of F-concavity which corresponds to the (A,F)-concavity (Definition 2.1 in [51], requiring strictly increasing functions F). Our deformation framework $Z=\tau^{-1}\circ F\circ\rho$ is more general, as it is double-sided: we jointly deform the function F by $F_\tau=\tau^{-1}\circ F$ and its argument θ by $\theta_\rho=\rho(\theta)$.

Exponentially concave functions have been considered as generators of L-divergences in [24]; α-exponentially concave functions G, such that $\exp(\alpha G)$ is concave for $\alpha>0$, generalize the L-divergences to $L_\alpha$-divergences, which can be expressed equivalently using a generalization of the Fenchel–Young divergence based on c-transforms [24]. When $\alpha<0$, exponentially convex functions are considered instead of exponentially concave functions. The information geometry induced by $L_\alpha$-divergences is dually projectively flat with constant curvature; reciprocally, a dually projectively flat structure with constant curvature (locally) induces a canonical $L_\alpha$-divergence. Wong and Zhang [52] investigated a one-parameter deformation of convex duality, called λ-duality, by considering functions f such that $\frac{1}{\lambda}(e^{\lambda f}-1)$ is convex for $\lambda\neq0$. They defined the λ-conjugate transform as a particular case of the c-transform [24] and studied the information geometry of the induced λ-logarithmic divergences. The λ-duality yields a generalization of exponential and mixture families to λ-exponential and λ-mixture families related to the Rényi divergence.

Finally, certain statistical divergences, called projective divergences, are invariant under rescaling of their arguments, and as such can define dissimilarities between non-normalized densities. For example, the γ-divergences [32] $D_\gamma$ are such that $D_\gamma(p:q)=D_\gamma(\tilde p:\tilde q)$ (with the γ-divergences tending to the KLD when $\gamma\to0$), and similarly for the Cauchy–Schwarz divergence [53].

Acknowledgments

The author heartily thanks the three reviewers for their helpful comments which led to this improved paper.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

Author Frank Nielsen is employed by the company Sony Computer Science Laboratories Inc. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare no conflicts of interest.

Funding Statement

This research received no external funding.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Amari S.I. Information Geometry and Its Applications. Springer; Tokyo, Japan: 2016. Applied Mathematical Sciences. [Google Scholar]
  • 2.Bregman L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967;7:200–217. doi: 10.1016/0041-5553(67)90040-7. [DOI] [Google Scholar]
  • 3.Nielsen F., Hadjeres G. Geometric Structures of Information. Springer; Berlin/Heidelberg, Germany: 2019. Monte Carlo information-geometric structures; pp. 69–103. [Google Scholar]
  • 4.Brown L.D. Lecture Notes-Monograph Series. Volume 9 Cornell University; Ithaca, NY, USA: 1986. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. [Google Scholar]
  • 5.Scarfone A.M., Wada T. Legendre structure of κ-thermostatistics revisited in the framework of information geometry. J. Phys. Math. Theor. 2014;47:275002. doi: 10.1088/1751-8113/47/27/275002. [DOI] [Google Scholar]
  • 6.Zhang J. Divergence function, duality, and convex analysis. Neural Comput. 2004;16:159–195. doi: 10.1162/08997660460734047. [DOI] [PubMed] [Google Scholar]
  • 7.Nielsen F., Boltz S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory. 2011;57:5455–5466. doi: 10.1109/TIT.2011.2159046. [DOI] [Google Scholar]
  • 8.Cichocki A., Amari S.I. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy. 2010;12:1532–1568. doi: 10.3390/e12061532. [DOI] [Google Scholar]
  • 9.Niculescu C., Persson L.E. Convex Functions and Their Applications. 2nd ed. Volume 23. Springer; Berlin/Heidelberg, Germany: 2018. first edition published in 2006. [Google Scholar]
  • 10.Billingsley P. Probability and Measure. John Wiley & Sons; Hoboken, NJ, USA: 2017. [Google Scholar]
  • 11.Barndorff-Nielsen O. Information and Exponential Families. John Wiley & Sons; Hoboken, NJ, USA: 2014. [Google Scholar]
  • 12.Morris C.N. Natural exponential families with quadratic variance functions. Ann. Stat. 1982;10:65–80. doi: 10.1214/aos/1176345690. [DOI] [Google Scholar]
  • 13.Efron B. Exponential Families in Theory and Practice. Cambridge University Press; Cambridge, UK: 2022. [Google Scholar]
  • 14.Grünwald P.D. The Minimum Description Length Principle. MIT Press; Cambridge, MA, USA: 2007. [Google Scholar]
  • 15.Kailath T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967;15:52–60. doi: 10.1109/TCOM.1967.1089532. [DOI] [Google Scholar]
  • 16.Wainwright M.J., Jordan M.I. Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 2008;1:1–305. [Google Scholar]
  • 17.LeCun Y., Chopra S., Hadsell R., Ranzato M., Huang F. Predicting Structured Data. Volume 1 University of Toronto; Toronto, ON, USA: 2006. A tutorial on energy-based learning. [Google Scholar]
  • 18.Kindermann R., Snell J.L. Markov Random Fields and Their Applications. Volume 1 American Mathematical Society; Providence, RI, USA: 1980. [Google Scholar]
  • 19.Dai B., Liu Z., Dai H., He N., Gretton A., Song L., Schuurmans D. Advances in Neural Information Processing Systems. Volume 32 MIT Press; Cambridge, MA, USA: 2019. Exponential family estimation via adversarial dynamics embedding. [Google Scholar]
  • 20.Cobb L., Koppstein P., Chen N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983;78:124–130. doi: 10.1080/01621459.1983.10477940. [DOI] [Google Scholar]
  • 21.Garcia V., Nielsen F. Simplification and hierarchical representations of mixtures of exponential families. Signal Process. 2010;90:3197–3212. doi: 10.1016/j.sigpro.2010.05.024. [DOI] [Google Scholar]
  • 22.Zhang J., Wong T.K.L. Handbook of Statistics. Volume 45. Elsevier; Amsterdam, The Netherlands: 2021. λ-Deformed probability families with subtractive and divisive normalizations; pp. 187–215. [Google Scholar]
  • 23.Boyd S.P., Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge, UK: 2004. [Google Scholar]
  • 24.Wong T.K.L. Logarithmic divergences from optimal transport and Rényi geometry. Inf. Geom. 2018;1:39–78. doi: 10.1007/s41884-018-0012-6. [DOI] [Google Scholar]
  • 25.Van Erven T., Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory. 2014;60:3797–3820. doi: 10.1109/TIT.2014.2320500. [DOI] [Google Scholar]
  • 26.Azoury K.S., Warmuth M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001;43:211–246. doi: 10.1023/A:1010896012157. [DOI] [Google Scholar]
  • 27.Amari S.I. Differential-Geometrical Methods in Statistics. 1st ed. Volume 28 Springer Science & Business Media; Berlin/Heidelberg, Germany: 2012. [Google Scholar]
  • 28.Nielsen F. Statistical divergences between densities of truncated exponential families with nested supports: Duo Bregman and duo Jensen divergences. Entropy. 2022;24:421. doi: 10.3390/e24030421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Del Castillo J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994;46:57–66. doi: 10.1007/BF00773592. [DOI] [Google Scholar]
  • 30.Wainwright M.J., Jaakkola T.S., Willsky A.S. A new class of upper bounds on the log partition function. IEEE Trans. Inf. Theory. 2005;51:2313–2335. doi: 10.1109/TIT.2005.850091. [DOI] [Google Scholar]
  • 31.Hyvärinen A., Dayan P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005;6:695–709. [Google Scholar]
  • 32.Fujisawa H., Eguchi S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008;99:2053–2081. doi: 10.1016/j.jmva.2008.02.004. [DOI] [Google Scholar]
  • 33.Eguchi S., Komori O. Minimum Divergence Methods in Statistical Machine Learning. Springer; Berlin/Heidelberg, Germany: 2022. [Google Scholar]
  • 34.Kolmogorov A. Sur la Notion de la Moyenne. Cold Spring Harbor Laboratory; Cold Spring Harbor, NY, USA: 1930. [Google Scholar]
  • 35.Komori O., Eguchi S. A unified formulation of k-Means, fuzzy c-Means and Gaussian mixture model by the Kolmogorov–Nagumo average. Entropy. 2021;23:518. doi: 10.3390/e23050518. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Aczél J. A generalization of the notion of convex functions. Det K. Nor. Vidensk. Selsk. Forh. Trondheim. 1947;19:87–90. [Google Scholar]
  • 37.Nielsen F., Nock R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017;24:1123–1127. doi: 10.1109/LSP.2017.2712195. [DOI] [Google Scholar]
  • 38.Bauschke H.H., Goebel R., Lucet Y., Wang X. The proximal average: Basic theory. SIAM J. Optim. 2008;19:766–785. doi: 10.1137/070687542. [DOI] [Google Scholar]
  • 39.Rockafellar R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967;19:200–205. doi: 10.4153/CJM-1967-012-4. [DOI] [Google Scholar]
  • 40.Shima H. The Geometry of Hessian Structures. World Scientific; Singapore: 2007. [Google Scholar]
  • 41.Eguchi S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985;15:341–391. doi: 10.32917/hmj/1206130775. [DOI] [Google Scholar]
  • 42.Rockafellar R. Convex Analysis. Princeton University Press; Princeton, NJ, USA: 1997. Princeton Landmarks in Mathematics and Physics. [Google Scholar]
  • 43.Yoshizawa S., Tanabe K. Dual differential geometry associated with the Kullbaek-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999;35:113–137. doi: 10.55937/sut/991985432. [DOI] [Google Scholar]
  • 44.Hougaard P. Convex Functions in Exponential Families. Department of Mathematical Sciences, University of Copenhagen; Copenhagen, Denmark: 1983. [Google Scholar]
  • 45.Brekelmans R., Nielsen F. Variational representations of annealing paths: Bregman information under monotonic embeddings. Inf. Geom. 2024 doi: 10.1007/s41884-023-00129-6. [DOI] [Google Scholar]
  • 46.Amari S.I. α-Divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory. 2009;55:4925–4931. doi: 10.1109/TIT.2009.2030485. [DOI] [Google Scholar]
  • 47.Hennequin R., David B., Badeau R. Beta-divergence as a subclass of Bregman divergence. IEEE Signal Process. Lett. 2010;18:83–86. doi: 10.1109/LSP.2010.2096211. [DOI] [Google Scholar]
  • 48.Ohara A., Eguchi S. Group invariance of information geometry on q-Gaussian distributions induced by Beta-divergence. Entropy. 2013;15:4732–4747. doi: 10.3390/e15114732. [DOI] [Google Scholar]
  • 49.Banerjee A., Merugu S., Dhillon I.S., Ghosh J., Lafferty J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005;6:1705–1749. [Google Scholar]
  • 50.Frongillo R., Reid M.D. Convex foundations for generalized MaxEnt models. AIP Conf. Proc. 2014;1636:11–16. [Google Scholar]
  • 51.Ishige K., Salani P., Takatsu A. Hierarchy of deformations in concavity. Inf. Geom. 2022;7:251–269. doi: 10.1007/s41884-022-00088-4. [DOI] [Google Scholar]
  • 52.Zhang J., Wong T.K.L. λ-Deformation: A canonical framework for statistical manifolds of constant curvature. Entropy. 2022;24:193. doi: 10.3390/e24020193. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Jenssen R., Principe J.C., Erdogmus D., Eltoft T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006;343:614–629. doi: 10.1016/j.jfranklin.2006.03.018. [DOI] [Google Scholar]
