Abstract
Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning, among others. An exponential family can either be normalized subtractively by its cumulant or free energy function, or equivalently normalized divisively by its partition function. Both the cumulant and partition functions are strictly convex and smooth functions inducing corresponding pairs of Bregman and Jensen divergences. It is well known that skewed Bhattacharyya distances between the probability densities of an exponential family amount to skewed Jensen divergences induced by the cumulant function between their corresponding natural parameters, and that in limit cases the sided Kullback–Leibler divergences amount to reverse-sided Bregman divergences. In this work, we first show that the α-divergences between non-normalized densities of an exponential family amount to scaled α-skewed Jensen divergences induced by the partition function. We then show how comparative convexity with respect to a pair of quasi-arithmetical means allows both convex functions and their arguments to be deformed, thereby defining dually flat spaces with corresponding divergences when ordinary convexity is preserved.
Keywords: convex duality, exponential family, Bregman divergence, Jensen divergence, Bhattacharyya distance, Rényi divergence, α-divergences, comparative convexity, log convexity, exponential convexity, quasi-arithmetic means, information geometry
1. Introduction
In information geometry [1], any strictly convex and smooth function induces a dually flat space (DFS) with a canonical divergence which can be expressed in charts either as dual Bregman divergences [2] or equivalently as dual Fenchel–Young divergences [3]. For example, the cumulant function of an exponential family [4] (also called the free energy) generates a DFS, that is, an exponential family manifold [5] with the canonical divergence yielding the reverse Kullback–Leibler divergence. Another typical example of a strictly convex and smooth function generating a DFS is the negative entropy of a mixture family, that is, a mixture family manifold with the canonical divergence yielding the (forward) Kullback–Leibler divergence [3]. In addition, any strictly convex and smooth function induces a family of scaled skewed Jensen divergences [6,7], which in limit cases includes the sided forward and reverse Bregman divergences.
In Section 2, we present two equivalent approaches to normalizing an exponential family: first by its cumulant function, and second by its partition function. Because both the cumulant and partition functions are strictly convex and smooth, they induce corresponding families of scaled skewed Jensen divergences and Bregman divergences, with corresponding dually flat spaces and related statistical divergences.
In Section 3, we recall the well-known result that the statistical α-skewed Bhattacharyya distances between the probability densities of an exponential family amount to a scaled α-skewed Jensen divergence between their natural parameters. In Section 4, we prove that the α-divergences [8] between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences between their natural parameters (Proposition 5). More generally, we explain in Section 5 how to deform a convex function using comparative convexity [9]: when the ordinary convexity of the deformed convex function is preserved, we obtain new skewed Jensen divergences and Bregman divergences with corresponding dually flat spaces. Finally, Section 6 concludes this work with a discussion.
2. Dual Subtractive and Divisive Normalizations of Exponential Families
2.1. Natural Exponential Families
Let $(\mathcal{X},\mathcal{A},\mu)$ be a measure space [10], where $\mathcal{X}$ denotes the sample set (e.g., a finite alphabet, $\mathbb{R}$, $\mathbb{R}^d$, the space of positive-definite matrices, etc.), $\mathcal{A}$ a $\sigma$-algebra on $\mathcal{X}$ (e.g., the power set $2^{\mathcal{X}}$, the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$, etc.), and $\mu$ a positive measure (e.g., counting measure or Lebesgue measure) on the measurable space $(\mathcal{X},\mathcal{A})$.
A natural exponential family [4,11] (commonly abbreviated as NEF [12]) is a set $\mathcal{E} = \{p_\theta : \theta \in \Theta\}$ of probability distributions all dominated by $\mu$ such that their Radon–Nikodym densities can be expressed canonically as
$p_\theta(x) = \exp\left(\langle\theta, x\rangle - F(\theta)\right),$ (1)
where $\theta$ is called the natural parameter and $x$ denotes the linear sufficient statistic vector [11]. The order of the NEF [13] is m. When the parameter ranges in the full natural parameter space
$\Theta = \left\{\theta \in \mathbb{R}^m \ :\ \int_{\mathcal{X}} \exp(\langle\theta, x\rangle)\,\mathrm{d}\mu(x) < \infty\right\},$
the family is called full. The NEF is said to be regular when $\Theta$ is topologically open.
The unnormalized positive density is indicated with a tilde notation, $\tilde{p}_\theta(x) = \exp(\langle\theta, x\rangle)$, and the corresponding normalized probability density is obtained as $p_\theta(x) = \frac{\tilde{p}_\theta(x)}{Z(\theta)}$, where $Z(\theta) = \int_{\mathcal{X}} \exp(\langle\theta, x\rangle)\,\mathrm{d}\mu(x)$ is the Laplace transform of $\mu$ (the density normalizer). For example, the family of exponential distributions $p_\lambda(x) = \lambda\exp(-\lambda x)$ is an NEF with densities defined on the support $\mathcal{X} = [0,\infty)$, natural parameter $\theta = -\lambda$ in $\Theta = (-\infty, 0)$, sufficient linear statistic x, and normalizer $Z(\theta) = -\frac{1}{\theta} = \frac{1}{\lambda}$.
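As a quick illustration (a minimal numerical sketch, not part of the original text, assuming NumPy/SciPy and the $\theta = -\lambda$ parameterization above), the closed-form normalizer $Z(\theta) = -1/\theta$ can be checked against the Laplace transform computed by quadrature:

```python
# Minimal numerical sketch (assumes NumPy/SciPy and the theta = -lambda parameterization
# above): the normalizer of the exponential-distribution NEF is the Laplace transform of
# the Lebesgue measure on [0, inf), namely Z(theta) = -1/theta for theta < 0.
import numpy as np
from scipy.integrate import quad

def Z_closed_form(theta):
    return -1.0 / theta

def Z_numerical(theta):
    value, _ = quad(lambda x: np.exp(theta * x), 0.0, np.inf)
    return value

for theta in (-0.5, -1.0, -3.0):
    print(theta, Z_numerical(theta), Z_closed_form(theta))
```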
2.2. Exponential Families
More generally, exponential families include many well known distributions after reparameterization [4] of their ordinary parameter $\lambda$ by $\theta(\lambda)$. The general canonical form of the densities of an exponential family is
$p_\theta(x) = \exp\left(\langle\theta, t(x)\rangle + k(x) - F(\theta)\right),$ (2)
where $t(x) = (t_1(x),\ldots,t_m(x))$ is the sufficient statistic vector (such that $1, t_1(x), \ldots, t_m(x)$ are linearly independent), $k(x)$ is an auxiliary term used to define the base measure with respect to $\mu$, and $\langle\cdot,\cdot\rangle$ is an inner product (e.g., scalar product of $\mathbb{R}^m$, trace product of symmetric matrices, etc.). By defining a new measure $\nu$ such that $\frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x) = e^{k(x)}$, we may consider without loss of generality the densities with $k(x) = 0$.
For example, the Bernoulli distributions, Gaussian or normal distributions, Gamma and Beta distributions, Poisson distributions, Rayleigh distributions, and Weibull distributions with prescribed shape parameter are just a few examples of exponential families with the inner product on $\mathbb{R}^m$ defined as the scalar product. The categorical distributions (i.e., discrete distributions on a finite alphabet sample space) form an exponential family as well [1]. Zero-centered Gaussian distributions and Wishart distributions are examples of exponential families parameterized by positive-definite matrices with inner products defined by the matrix trace product, which is $\langle A, B\rangle = \mathrm{tr}(AB)$.
Exponential families abound in statistics and machine learning. Any two probability measures Q and R with densities q and r with respect to a dominating measure, say, $\mu = \frac{Q+R}{2}$, define an exponential family
$\mathcal{E}_{q,r} = \left\{ p_\theta(x) \propto q(x)^{1-\theta}\, r(x)^{\theta} \ :\ \theta \in \Theta \right\},$
which is called the likelihood ratio exponential family [14], as the sufficient statistic is $t(x) = \log\frac{r(x)}{q(x)}$ (with auxiliary carrier term $k(x) = \log q(x)$), or the Bhattacharyya arc, as the cumulant function of $\mathcal{E}_{q,r}$ is expressed as the negative of the skewed Bhattacharyya distances [7,15].
In machine learning, undirected graphical models [16] and energy-based models [17], including Markov random fields [18] and conditional random fields, are exponential families [19]. Exponential families are universal approximators of smooth densities [20].
From a theoretical standpoint, it is often enough to consider (without loss of generality) natural exponential families with densities expressed as in Equation (1). However, here we consider generic exponential families with the densities expressed in Equation (2) in order to report common examples encountered in practice, such as the multivariate Gaussian family [21].
When the natural parameter space is not full but rather parameterized by $\theta(u)$ for $u \in U \subset \mathbb{R}^k$ with $k < m$ and a smooth function $\theta(\cdot)$, the exponential family is called a curved exponential family [1]. For example, the family of normal distributions $N(\mu, \mu^2)$ with the variance tied to the mean is a curved exponential family with $k = 1$ and $m = 2$ [1].
2.3. Normalizations of Exponential Families
Recall that $\tilde{p}_\theta(x) = \exp(\langle\theta, t(x)\rangle)$ denotes the unnormalized density expressed using the natural parameter $\theta$. We can normalize $\tilde{p}_\theta$ using either the partition function $Z(\theta)$ or equivalently using the cumulant function $F(\theta)$, as follows:
$p_\theta(x) = \frac{\tilde{p}_\theta(x)}{Z(\theta)},$ (3)
$p_\theta(x) = \exp\left(\langle\theta, t(x)\rangle - F(\theta)\right),$ (4)
where $Z(\theta) = \int_{\mathcal{X}} \tilde{p}_\theta(x)\,\mathrm{d}\mu(x)$, $F(\theta) = \log Z(\theta)$, and $Z(\theta) = \exp(F(\theta))$. Thus, the logarithm and exponential functions allow conversion to and from the dual normalizers Z and F:
$F = \log \circ\, Z, \qquad Z = \exp \circ\, F.$
We may view Equation (3) as an exponential tilting [13] of the base measure $\mu$.
In the context of $\lambda$-deformed exponential families [22] which generalize exponential families, the function Z is called the divisive normalization factor (Equation (3)) and the function F is called the subtractive normalization factor (Equation (4)). Notice that F is called the cumulant function because when $X \sim p_\theta$ is a random variable following a probability distribution of an exponential family, the function appears in the cumulant generating function of X: $K_X(u) = \log E_{p_\theta}\left[e^{\langle u, t(X)\rangle}\right] = F(\theta + u) - F(\theta)$. In statistical physics, the cumulant function is called the log-normalizer or log-partition function. Because $Z = \exp(F)$ and $e^u > u$ for all $u \in \mathbb{R}$, we can deduce that $Z(\theta) > F(\theta)$, as $\exp(F(\theta)) > F(\theta)$ for $\theta \in \Theta$.
It is well known that the cumulant function is a strictly convex function and that the partition function is strictly log-convex [11].
Proposition 1
([11]). The natural parameter space Θ of an exponential family is convex.
Proposition 2
([11]). The cumulant function is strictly convex and the partition function is positive and strictly log-convex.
It can be shown that the cumulant and partition functions are smooth analytic functions [4]. A remarkable property is that strictly log-convex functions are also strictly convex.
Proposition 3
([23], Section 3.5). A strictly log-convex function is strictly convex.
The converse of Proposition 3 is not necessarily true, however; certain convex functions are not log-convex, and as such the class of strictly log-convex functions is a proper subclass of strictly convex functions. For example, $f(x) = x^2$ is strictly convex on $(0,\infty)$ but log-concave, as $\log f(x) = 2\log x$ is concave (Figure 1).
Figure 1.
Strictly log-convex functions form a proper subset of strictly convex functions.
Remark 1.
Because $Z = \exp(F)$ is strictly convex (Proposition 3), F is exponentially convex.
Definition 1.
The cumulant function F and partition function Z of a regular exponential family are both strictly convex and smooth functions inducing a pair of dually flat spaces with corresponding Bregman divergences [2] $B_F$ and $B_Z$:
$B_F(\theta_1:\theta_2) = F(\theta_1) - F(\theta_2) - \langle\theta_1 - \theta_2, \nabla F(\theta_2)\rangle,$ (5)
$B_Z(\theta_1:\theta_2) = Z(\theta_1) - Z(\theta_2) - \langle\theta_1 - \theta_2, \nabla Z(\theta_2)\rangle,$ (6)
along with a pair of families of skewed Jensen divergences $J_{F,\alpha}$ and $J_{Z,\alpha}$:
$J_{F,\alpha}(\theta_1:\theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2),$ (7)
$J_{Z,\alpha}(\theta_1:\theta_2) = \alpha Z(\theta_1) + (1-\alpha) Z(\theta_2) - Z(\alpha\theta_1 + (1-\alpha)\theta_2).$ (8)
For a strictly convex function F, we define the symmetric Jensen divergence as follows:
$J_F(\theta_1,\theta_2) = J_{F,\frac{1}{2}}(\theta_1:\theta_2) = \frac{F(\theta_1) + F(\theta_2)}{2} - F\left(\frac{\theta_1+\theta_2}{2}\right).$
Let $\mathcal{C}$ denote the set of real-valued strictly convex and differentiable functions defined on an open set $\Theta$, called Bregman generators. We may equivalently consider the set of strictly concave and differentiable functions by letting $F = -G$ for a strictly concave and differentiable function G; see [24] (Equation (1)).
Remark 2.
The non-negativeness of the Bregman divergences for the cumulant and partition functions define the criteria for checking the strict convexity or log-convexity of a function:
$F \text{ strictly convex} \ \Leftrightarrow\ B_F(\theta_1:\theta_2) > 0 \quad \forall\, \theta_1 \neq \theta_2,$
and
$Z \text{ strictly log-convex} \ \Leftrightarrow\ B_{\log Z}(\theta_1:\theta_2) = \log\frac{Z(\theta_1)}{Z(\theta_2)} - \left\langle\theta_1-\theta_2, \frac{\nabla Z(\theta_2)}{Z(\theta_2)}\right\rangle > 0 \quad \forall\, \theta_1 \neq \theta_2.$
The forward Bregman divergence $B_F$ and reverse Bregman divergence $B_F^*$ can be unified with the α-skewed Jensen divergences by rescaling $J_{F,\alpha}$ and allowing α to range in $\mathbb{R}$ (taking limits at $\alpha \in \{0,1\}$) [6,7]:
$J^s_{F,\alpha}(\theta_1:\theta_2) := \frac{1}{\alpha(1-\alpha)}\, J_{F,\alpha}(\theta_1:\theta_2), \qquad \lim_{\alpha\to 0} J^s_{F,\alpha}(\theta_1:\theta_2) = B_F(\theta_1:\theta_2), \qquad \lim_{\alpha\to 1} J^s_{F,\alpha}(\theta_1:\theta_2) = B_F^*(\theta_1:\theta_2),$ (9)
where $B_F^*$ denotes the reverse Bregman divergence obtained by swapping the parameter order (reference duality [6]): $B_F^*(\theta_1:\theta_2) = B_F(\theta_2:\theta_1)$.
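The limit behavior of Equation (9) can be checked numerically. The following sketch (illustrative only; it assumes NumPy and uses the Poisson cumulant $F(\theta)=\exp(\theta)$ merely as a convenient strictly convex generator) compares the scaled skewed Jensen divergence near $\alpha \in \{0,1\}$ with the forward and reverse Bregman divergences:

```python
# Minimal sketch (assumes NumPy; F(theta) = exp(theta) is used only as a convenient
# strictly convex generator): the scaled alpha-skewed Jensen divergence of Equation (9)
# tends to the forward Bregman divergence as alpha -> 0 and to the reverse one as alpha -> 1.
import numpy as np

F = np.exp        # strictly convex generator
grad_F = np.exp   # its derivative

def bregman(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * grad_F(t2)

def scaled_skew_jensen(t1, t2, alpha):
    return (alpha * F(t1) + (1 - alpha) * F(t2)
            - F(alpha * t1 + (1 - alpha) * t2)) / (alpha * (1 - alpha))

t1, t2 = 1.2, -0.3
print(scaled_skew_jensen(t1, t2, 1e-5), bregman(t1, t2))       # alpha -> 0: B_F(t1:t2)
print(scaled_skew_jensen(t1, t2, 1 - 1e-5), bregman(t2, t1))   # alpha -> 1: B_F(t2:t1)
```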
Remark 3.
Alternatively, we may rescale by a factor , i.e., such that and .
Next, in Section 3 we first recall the connections between these Jensen and Bregman divergences, which are divergences between parameters, and the statistical divergence counterparts between probability densities. Then, in Section 4 we introduce the novel connections between these parameter divergences and α-divergences between unnormalized densities.
3. Divergences Related to the Cumulant Function
Consider the scaled α-skewed Bhattacharyya distances [7,15] between two probability densities p and q:
$D^s_{B,\alpha}[p:q] = \frac{1}{\alpha(1-\alpha)}\, D_{B,\alpha}[p:q], \qquad D_{B,\alpha}[p:q] = -\log\int_{\mathcal{X}} p(x)^\alpha\, q(x)^{1-\alpha}\,\mathrm{d}\mu(x).$
The scaled α-skewed Bhattacharyya distances can additionally be interpreted as Rényi divergences [25] scaled by $\frac{1}{\alpha}$: $D^s_{B,\alpha}[p:q] = \frac{1}{\alpha}\, R_\alpha[p:q]$, where the Rényi α-divergences are defined by
$R_\alpha[p:q] = \frac{1}{\alpha-1}\log\int_{\mathcal{X}} p(x)^\alpha\, q(x)^{1-\alpha}\,\mathrm{d}\mu(x) = \frac{1}{1-\alpha}\, D_{B,\alpha}[p:q].$
The Bhattacharyya distance $D_B[p:q] = -\log\int_{\mathcal{X}}\sqrt{p(x)\,q(x)}\,\mathrm{d}\mu(x)$ corresponds to one-fourth of $D^s_{B,\frac{1}{2}}$: $D_B[p:q] = \frac{1}{4}\, D^s_{B,\frac{1}{2}}[p:q]$. Because $D^s_{B,\alpha}[p:q]$ tends to the Kullback–Leibler divergence $\mathrm{KL}[p:q]$ when $\alpha \to 1$ and to the reverse Kullback–Leibler divergence $\mathrm{KL}[q:p]$ when $\alpha \to 0$, we have
$\lim_{\alpha\to 1} D^s_{B,\alpha}[p:q] = \mathrm{KL}[p:q], \qquad \lim_{\alpha\to 0} D^s_{B,\alpha}[p:q] = \mathrm{KL}[q:p].$
When both probability densities belong to the same exponential family with cumulant , we have the following proposition.
Proposition 4
([7]). The scaled α-skewed Bhattacharyya distances between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family amount to the scaled α-skewed Jensen divergence between their natural parameters:
$D^s_{B,\alpha}[p_{\theta_1}:p_{\theta_2}] = J^s_{F,\alpha}(\theta_1:\theta_2).$ (10)
Proof.
The proof follows by first considering the α-skewed Bhattacharyya similarity coefficient $\rho_\alpha[p_{\theta_1}:p_{\theta_2}] = \int_{\mathcal{X}} p_{\theta_1}(x)^\alpha\, p_{\theta_2}(x)^{1-\alpha}\,\mathrm{d}\mu(x)$. Plugging in the canonical densities of Equation (2) (with $k(x)=0$), we obtain
$\rho_\alpha[p_{\theta_1}:p_{\theta_2}] = \int_{\mathcal{X}} \exp\left(\langle\alpha\theta_1 + (1-\alpha)\theta_2, t(x)\rangle - \alpha F(\theta_1) - (1-\alpha) F(\theta_2)\right)\mathrm{d}\mu(x) = \exp\left(-J_{F,\alpha}(\theta_1:\theta_2)\right).$
Because $D_{B,\alpha}[p_{\theta_1}:p_{\theta_2}] = -\log\rho_\alpha[p_{\theta_1}:p_{\theta_2}]$, we have $D_{B,\alpha}[p_{\theta_1}:p_{\theta_2}] = J_{F,\alpha}(\theta_1:\theta_2)$. Multiplying the last equation by $\frac{1}{\alpha(1-\alpha)}$, we obtain
$D^s_{B,\alpha}[p_{\theta_1}:p_{\theta_2}] = J^s_{F,\alpha}(\theta_1:\theta_2).$
□
For practitioners in machine learning, it is well known that the Kullback–Leibler divergence between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family amounts to a Bregman divergence for the cumulant generator on a swapped parameter order (e.g., [26,27]):
$\mathrm{KL}[p_{\theta_1}:p_{\theta_2}] = B_F(\theta_2:\theta_1).$
This is a particular instance of Equation (10) obtained in the limit $\alpha \to 1$:
$\lim_{\alpha\to 1} D^s_{B,\alpha}[p_{\theta_1}:p_{\theta_2}] = \mathrm{KL}[p_{\theta_1}:p_{\theta_2}] = \lim_{\alpha\to 1} J^s_{F,\alpha}(\theta_1:\theta_2) = B_F(\theta_2:\theta_1).$
This formula has been further generalized in [28] by considering truncations of exponential family densities. Let $\mathcal{E}_1$ and $\mathcal{E}_2$ be two truncated families of $\mathcal{E}$ with respective nested supports $\mathcal{X}_1 \subseteq \mathcal{X}_2 \subseteq \mathcal{X}$ and corresponding cumulant functions
$F_1(\theta) = \log\int_{\mathcal{X}_1} \exp(\langle\theta, t(x)\rangle)\,\mathrm{d}\mu(x)$
and
$F_2(\theta) = \log\int_{\mathcal{X}_2} \exp(\langle\theta, t(x)\rangle)\,\mathrm{d}\mu(x).$
Then, we have
$\mathrm{KL}\left[p^{\mathcal{X}_1}_{\theta_1} : p^{\mathcal{X}_2}_{\theta_2}\right] = F_2(\theta_2) - F_1(\theta_1) - \langle\theta_2 - \theta_1, \nabla F_1(\theta_1)\rangle,$
a duo Bregman pseudo-divergence [28] between the natural parameters.
Truncated exponential families are normalized exponential families which may not be regular [29], i.e., the parameter space may not be open.
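Before turning to the partition function, here is a hedged numerical check of Proposition 4 and of the KLD–Bregman identity above for the exponential-distribution family (a sketch assuming SciPy and the $\theta=-\lambda$, $F(\theta)=-\log(-\theta)$ convention recalled in Section 2.1; it is not code from the paper):

```python
# Numerical illustration (a sketch under the conventions above; assumes SciPy): for
# exponential densities p_lambda(x) = lambda*exp(-lambda*x) with natural parameter
# theta = -lambda and cumulant F(theta) = -log(-theta), the alpha-skewed Bhattacharyya
# distance equals J_{F,alpha}(theta1:theta2) (Proposition 4), and the KLD equals the
# reverse Bregman divergence B_F(theta2:theta1).
import numpy as np
from scipy.integrate import quad

lam1, lam2, alpha = 2.0, 0.5, 0.3
th1, th2 = -lam1, -lam2

F = lambda th: -np.log(-th)        # cumulant function
grad_F = lambda th: -1.0 / th

pdf = lambda lam, x: lam * np.exp(-lam * x)

# alpha-skewed Bhattacharyya distance via numerical integration.
rho, _ = quad(lambda x: pdf(lam1, x)**alpha * pdf(lam2, x)**(1 - alpha), 0, np.inf)
D_B = -np.log(rho)
J_F = alpha * F(th1) + (1 - alpha) * F(th2) - F(alpha * th1 + (1 - alpha) * th2)
print(D_B, J_F)                    # equal (Proposition 4, unscaled form)

# KLD via integration (log-ratio expanded analytically) vs. B_F(theta2:theta1).
kl, _ = quad(lambda x: pdf(lam1, x) * (np.log(lam1 / lam2) - (lam1 - lam2) * x), 0, np.inf)
B_F_rev = F(th2) - F(th1) - (th2 - th1) * grad_F(th1)
print(kl, B_F_rev)                 # both ~ 0.636
```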
4. Divergences Related to the Partition Function
Certain exponential families have intractable cumulant/partition functions (e.g., exponential families with polynomial sufficient statistics of high degree m [20]) or cumulant/partition functions which require exponential time to compute [30] (e.g., graphical models [16], high-dimensional grid sample spaces, energy-based models [17] in deep learning, etc.). In such cases, the maximum likelihood estimator (MLE) cannot be used to infer the natural parameter of exponential densities. Many alternative methods have been proposed to handle such exponential families with intractable partition functions, e.g., score matching [31] or divergence-based inference [32,33]. Thus, it is important to consider dissimilarities between non-normalized statistical models.
The squared Hellinger distance [1] between two positive, potentially unnormalized, densities $\tilde{p}$ and $\tilde{q}$ is defined by
$D^2_H[\tilde{p},\tilde{q}] = \frac{1}{2}\int_{\mathcal{X}} \left(\sqrt{\tilde{p}(x)} - \sqrt{\tilde{q}(x)}\right)^2 \mathrm{d}\mu(x).$
Notice that the Hellinger divergence can be interpreted as the integral of the difference between the arithmetical mean minus the geometrical mean of the densities: $D^2_H[\tilde{p},\tilde{q}] = \int_{\mathcal{X}}\left(\frac{\tilde{p}(x)+\tilde{q}(x)}{2} - \sqrt{\tilde{p}(x)\,\tilde{q}(x)}\right)\mathrm{d}\mu(x)$. This further proves that $D^2_H[\tilde{p},\tilde{q}] \geq 0$, as $\frac{a+b}{2} \geq \sqrt{ab}$ for $a, b \geq 0$. The Hellinger distance $D_H$ satisfies the metric axioms of distances.
When considering unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of an exponential family $\mathcal{E}$ with a partition function Z, we obtain
$D^2_H[\tilde{p}_{\theta_1},\tilde{p}_{\theta_2}] = \frac{Z(\theta_1)+Z(\theta_2)}{2} - Z\left(\frac{\theta_1+\theta_2}{2}\right) = J_Z(\theta_1,\theta_2),$ (11)
as $\int_{\mathcal{X}}\sqrt{\tilde{p}_{\theta_1}(x)\,\tilde{p}_{\theta_2}(x)}\,\mathrm{d}\mu(x) = Z\left(\frac{\theta_1+\theta_2}{2}\right)$.
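A small numerical check of Equation (11) for the unnormalized exponential-distribution densities of Section 2.1 (an illustrative sketch assuming SciPy; $Z(\theta) = -1/\theta$ with $\theta < 0$):

```python
# Small numerical check of Equation (11) (illustrative; assumes SciPy and the
# unnormalized exponential densities ptilde_theta(x) = exp(theta*x), theta < 0,
# with Z(theta) = -1/theta from Section 2.1).
import numpy as np
from scipy.integrate import quad

Z = lambda th: -1.0 / th
th1, th2 = -1.0, -4.0

ptilde = lambda th, x: np.exp(th * x)
hellinger2, _ = quad(lambda x: 0.5 * (np.sqrt(ptilde(th1, x)) - np.sqrt(ptilde(th2, x)))**2,
                     0, np.inf)
J_Z = 0.5 * Z(th1) + 0.5 * Z(th2) - Z(0.5 * (th1 + th2))
print(hellinger2, J_Z)   # both ~ 0.225
```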
The Kullback–Leibler divergence [1] as extended to two positive densities $\tilde{p}$ and $\tilde{q}$ is defined by
$\mathrm{KL}[\tilde{p}:\tilde{q}] = \int_{\mathcal{X}}\left(\tilde{p}(x)\log\frac{\tilde{p}(x)}{\tilde{q}(x)} + \tilde{q}(x) - \tilde{p}(x)\right)\mathrm{d}\mu(x).$ (12)
When considering unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of $\mathcal{E}$, we obtain
$\mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \int_{\mathcal{X}}\left(\tilde{p}_{\theta_1}(x)\,\langle\theta_1-\theta_2, t(x)\rangle + \tilde{p}_{\theta_2}(x) - \tilde{p}_{\theta_1}(x)\right)\mathrm{d}\mu(x)$ (13)
$= \langle\theta_1-\theta_2, \nabla Z(\theta_1)\rangle + Z(\theta_2) - Z(\theta_1)$ (14)
$= Z(\theta_2) - Z(\theta_1) - \langle\theta_2-\theta_1, \nabla Z(\theta_1)\rangle$ (15)
$= B_Z(\theta_2:\theta_1),$ (16)
as $\int_{\mathcal{X}} t(x)\,\tilde{p}_\theta(x)\,\mathrm{d}\mu(x) = \nabla Z(\theta)$. Let $\mathrm{KL}^*[\tilde{p}:\tilde{q}] := \mathrm{KL}[\tilde{q}:\tilde{p}]$ denote the reverse KLD.
More generally, the family of α-divergences [1] between the unnormalized densities $\tilde{p}$ and $\tilde{q}$ is defined for $\alpha \in \mathbb{R}\setminus\{0,1\}$ by
$D_\alpha[\tilde{p}:\tilde{q}] = \frac{1}{\alpha(1-\alpha)}\int_{\mathcal{X}}\left(\alpha\,\tilde{p}(x) + (1-\alpha)\,\tilde{q}(x) - \tilde{p}(x)^\alpha\,\tilde{q}(x)^{1-\alpha}\right)\mathrm{d}\mu(x).$
We now have $D_1[\tilde{p}:\tilde{q}] := \lim_{\alpha\to 1} D_\alpha[\tilde{p}:\tilde{q}] = \mathrm{KL}[\tilde{p}:\tilde{q}]$ and $D_0[\tilde{p}:\tilde{q}] := \lim_{\alpha\to 0} D_\alpha[\tilde{p}:\tilde{q}] = \mathrm{KL}[\tilde{q}:\tilde{p}]$, and the α-divergences are homogeneous divergences of degree 1: $D_\alpha[\lambda\tilde{p}:\lambda\tilde{q}] = \lambda\, D_\alpha[\tilde{p}:\tilde{q}]$ for $\lambda > 0$. For all α, we have $D_\alpha[\tilde{p}:\tilde{q}] = D_{1-\alpha}[\tilde{q}:\tilde{p}]$. Moreover, because $\alpha\tilde{p} + (1-\alpha)\tilde{q} - \tilde{p}^\alpha\tilde{q}^{1-\alpha}$ can be expressed as the difference of the weighted arithmetic mean $A_\alpha(\tilde{p},\tilde{q}) = \alpha\tilde{p} + (1-\alpha)\tilde{q}$ minus the weighted geometric mean $G_\alpha(\tilde{p},\tilde{q}) = \tilde{p}^\alpha\,\tilde{q}^{1-\alpha}$, it follows from the arithmetical–geometrical mean inequality that we have $D_\alpha[\tilde{p}:\tilde{q}] \geq 0$ for $\alpha \in (0,1)$.
When considering unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of $\mathcal{E}$, we obtain
Proposition 5.
The α-divergences between the unnormalized densities of an exponential family amount to scaled α-skewed Jensen divergences between their natural parameters for the partition function Z:
$D_\alpha[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \frac{1}{\alpha(1-\alpha)}\, J_{Z,\alpha}(\theta_1:\theta_2) = J^s_{Z,\alpha}(\theta_1:\theta_2), \qquad \alpha \in \mathbb{R}\setminus\{0,1\}.$
When $\alpha \in \{0,1\}$, the oriented Kullback–Leibler divergences between unnormalized exponential family densities amount to reverse Bregman divergences on their corresponding natural parameters for the partition function Z:
$D_1[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = B_Z(\theta_2:\theta_1), \qquad D_0[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \mathrm{KL}[\tilde{p}_{\theta_2}:\tilde{p}_{\theta_1}] = B_Z(\theta_1:\theta_2).$
Proof.
For $\alpha \in \mathbb{R}\setminus\{0,1\}$, consider
$\int_{\mathcal{X}} \tilde{p}_{\theta_1}(x)^\alpha\,\tilde{p}_{\theta_2}(x)^{1-\alpha}\,\mathrm{d}\mu(x) = \int_{\mathcal{X}} \exp\left(\langle\alpha\theta_1 + (1-\alpha)\theta_2, t(x)\rangle\right)\mathrm{d}\mu(x) = Z(\alpha\theta_1 + (1-\alpha)\theta_2).$
Here, we have $\int_{\mathcal{X}}\tilde{p}_{\theta_1}(x)\,\mathrm{d}\mu(x) = Z(\theta_1)$ and $\int_{\mathcal{X}}\tilde{p}_{\theta_2}(x)\,\mathrm{d}\mu(x) = Z(\theta_2)$. It follows that
$D_\alpha[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \frac{1}{\alpha(1-\alpha)}\left(\alpha Z(\theta_1) + (1-\alpha) Z(\theta_2) - Z(\alpha\theta_1 + (1-\alpha)\theta_2)\right) = J^s_{Z,\alpha}(\theta_1:\theta_2).$
The limit cases $\alpha \in \{0,1\}$ follow from Equations (13)–(16).
□
Notice that the KLD extended to unnormalized densities can be written as a generalized relative entropy, i.e., it can be obtained as the difference of the extended cross-entropy minus the extended entropy (self cross-entropy):
$\mathrm{KL}[\tilde{p}:\tilde{q}] = h^\times[\tilde{p}:\tilde{q}] - h[\tilde{p}],$
with
$h^\times[\tilde{p}:\tilde{q}] = \int_{\mathcal{X}}\left(\tilde{q}(x) - \tilde{p}(x)\log\tilde{q}(x)\right)\mathrm{d}\mu(x)$
and
$h[\tilde{p}] = h^\times[\tilde{p}:\tilde{p}] = \int_{\mathcal{X}}\left(\tilde{p}(x) - \tilde{p}(x)\log\tilde{p}(x)\right)\mathrm{d}\mu(x).$
Remark 4.
In general, we can consider two unnormalized positive densities $\tilde{p} = Z_p\, p$ and $\tilde{q} = Z_q\, q$. Let p and q denote their corresponding normalized densities (with normalizing factors $Z_p = \int\tilde{p}\,\mathrm{d}\mu$ and $Z_q = \int\tilde{q}\,\mathrm{d}\mu$); then, the KLD between $\tilde{p}$ and $\tilde{q}$ can be expressed using the KLD between their normalized densities and normalizing factors, as follows:
$\mathrm{KL}[\tilde{p}:\tilde{q}] = Z_p\,\mathrm{KL}[p:q] + Z_p\log\frac{Z_p}{Z_q} + Z_q - Z_p.$ (17)
Similarly, we have
$\mathrm{KL}[\tilde{p}:q] = Z_p\,\mathrm{KL}[p:q] + Z_p\log Z_p + 1 - Z_p$ (18)
and
$\mathrm{KL}[p:\tilde{q}] = \mathrm{KL}[p:q] - \log Z_q + Z_q - 1,$ (19)
and the usual $\mathrm{KL}[p:q]$ is recovered when $Z_p = Z_q = 1$.
Notice that Equation (17) allows us to derive the following identity between the Bregman divergences induced by Z and by F:
$B_Z(\theta_2:\theta_1) = \mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}]$ (20)
$= Z(\theta_1)\left(B_F(\theta_2:\theta_1) + \log\frac{Z(\theta_1)}{Z(\theta_2)}\right) + Z(\theta_2) - Z(\theta_1).$ (21)
Let $\mathrm{kl}(a:b) = a\log\frac{a}{b} + b - a$ be the scalar KLD for $a > 0$ and $b > 0$. Then, we can rewrite Equation (17) as
$\mathrm{KL}[\tilde{p}:\tilde{q}] = Z_p\,\mathrm{KL}[p:q] + \mathrm{kl}(Z_p:Z_q),$
and we have
$B_Z(\theta_2:\theta_1) = Z(\theta_1)\, B_F(\theta_2:\theta_1) + \mathrm{kl}(Z(\theta_1):Z(\theta_2)).$
In addition, the KLD between the unnormalized densities $\tilde{p}$ and $\tilde{q}$ with support $\mathcal{X}$ can be written as a definite integral of a scalar Bregman divergence:
$\mathrm{KL}[\tilde{p}:\tilde{q}] = \int_{\mathcal{X}} B_f\left(\tilde{p}(x):\tilde{q}(x)\right)\mathrm{d}\mu(x),$
where $f(u) = u\log u - u$ is a scalar Bregman generator with $B_f(a:b) = \mathrm{kl}(a:b)$. Because $\mathrm{kl}(a:b) \geq 0$, we can deduce that $\mathrm{KL}[\tilde{p}:\tilde{q}] \geq 0$ with equality iff $\tilde{p}(x) = \tilde{q}(x)$ $\mu$-almost everywhere.
Notice that the right-hand side of Equation (17) can be interpreted as the sum of two divergences, that is, a conformal Bregman divergence $Z_p\,\mathrm{KL}[p:q]$ with a scalar Bregman divergence $\mathrm{kl}(Z_p:Z_q)$.
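This decomposition can be verified numerically. The sketch below (an illustration assuming SciPy; the scaled Gaussian densities and the factors $Z_p=2$, $Z_q=5$ are arbitrary choices, not from the paper) checks that the extended KLD equals the conformal term plus the scalar divergence $\mathrm{kl}(Z_p:Z_q)$:

```python
# Illustrative sketch (assumes SciPy; the scaled Gaussians and the factors Z_p = 2,
# Z_q = 5 are arbitrary choices): check the decomposition of the extended KLD into
# the conformal term Z_p*KL[p:q] plus the scalar divergence kl(Z_p:Z_q).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

Zp, Zq = 2.0, 5.0
p = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
q = lambda x: norm.pdf(x, loc=1.0, scale=2.0)

# Extended KLD between the unnormalized densities (Equation (12)); the log-ratio is
# computed with logpdf for numerical stability.
def integrand(x):
    log_ratio = np.log(Zp) + norm.logpdf(x, 0.0, 1.0) - np.log(Zq) - norm.logpdf(x, 1.0, 2.0)
    return Zp * p(x) * log_ratio + Zq * q(x) - Zp * p(x)

kl_ext, _ = quad(integrand, -np.inf, np.inf)

kl_pq = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5      # closed-form KL[N(0,1):N(1,4)]
kl_scalar = Zp * np.log(Zp / Zq) + Zq - Zp               # kl(Z_p:Z_q)
print(kl_ext, Zp * kl_pq + kl_scalar)                    # both ~ 2.054
```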
Remark 5.
Consider the KLD between the normalized and unnormalized densities of the same exponential family. In this case, we have
$\mathrm{KL}[p_{\theta_1}:\tilde{p}_{\theta_2}] = Z(\theta_2) - 1 - F(\theta_1) - \langle\theta_2 - \theta_1, \nabla F(\theta_1)\rangle$ (22)
$= B_{Z-1,F}(\theta_2:\theta_1).$ (23)
The divergence $B_{Z-1,F}$ is a duo Bregman pseudo-divergence [28]:
$B_{F_1,F_2}(\theta:\theta') = F_1(\theta) - F_2(\theta') - \langle\theta - \theta', \nabla F_2(\theta')\rangle,$
for $F_1$ and $F_2$ that are two strictly convex and smooth functions such that $F_1 \geq F_2$. Indeed, we can check that generators $F_1 = Z - 1$ and $F_2 = F$ are both Bregman generators; then, we have $F_1 \geq F_2$, as $e^x \geq x + 1$ for all x (with equality when $x = 0$), i.e., $Z(\theta) - 1 \geq F(\theta)$.
More generally, the α-divergences between $p_{\theta_1}$ and $\tilde{p}_{\theta_2}$ can be written as
$D_\alpha[p_{\theta_1}:\tilde{p}_{\theta_2}] = \frac{1}{\alpha(1-\alpha)}\left(\alpha + (1-\alpha)\, Z(\theta_2) - e^{-D_{B,\alpha}[p_{\theta_1}:\tilde{p}_{\theta_2}]}\right),$ (24)
with the (signed) α-skewed Bhattacharyya distances provided by
$D_{B,\alpha}[p_{\theta_1}:\tilde{p}_{\theta_2}] = -\log\int_{\mathcal{X}} p_{\theta_1}(x)^\alpha\,\tilde{p}_{\theta_2}(x)^{1-\alpha}\,\mathrm{d}\mu(x) = \alpha F(\theta_1) - F(\alpha\theta_1 + (1-\alpha)\theta_2).$
Let us illustrate Proposition 5 with some examples.
Example 1.
Consider the family of exponential distributions $\mathcal{E} = \{p_\theta(x) = \theta\exp(-\theta x)\}$, where $\mathcal{E}$ is an exponential family with a natural parameter $\theta$, parameter space $\Theta = (0,\infty)$, and sufficient statistic $t(x) = -x$. The partition function is $Z(\theta) = \frac{1}{\theta}$, with $Z'(\theta) = -\frac{1}{\theta^2}$ and $Z''(\theta) = \frac{2}{\theta^3} > 0$, while the cumulant function is $F(\theta) = -\log\theta$ with moment parameter $\eta = F'(\theta) = -\frac{1}{\theta}$. The α-divergences between two unnormalized exponential distributions are
$D_\alpha[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \frac{1}{\alpha(1-\alpha)}\left(\frac{\alpha}{\theta_1} + \frac{1-\alpha}{\theta_2} - \frac{1}{\alpha\theta_1 + (1-\alpha)\theta_2}\right).$ (25)
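Equation (25) can be checked against a direct numerical integration of the α-divergence integrand (an illustrative sketch assuming SciPy and the parameterization of Example 1, where the unnormalized densities are $\tilde{p}_\theta(x) = \exp(-\theta x)$ with $\theta > 0$):

```python
# Sketch verifying Equation (25) numerically (assumes SciPy and the parameterization of
# Example 1, where the unnormalized densities are ptilde_theta(x) = exp(-theta*x), theta > 0).
import numpy as np
from scipy.integrate import quad

def alpha_div_numeric(th1, th2, alpha):
    integrand = lambda x: (alpha * np.exp(-th1 * x) + (1 - alpha) * np.exp(-th2 * x)
                           - np.exp(-(alpha * th1 + (1 - alpha) * th2) * x))
    val, _ = quad(integrand, 0, np.inf)
    return val / (alpha * (1 - alpha))

def alpha_div_closed(th1, th2, alpha):
    # Right-hand side of Equation (25): scaled alpha-skewed Jensen divergence for Z = 1/theta.
    return (alpha / th1 + (1 - alpha) / th2
            - 1.0 / (alpha * th1 + (1 - alpha) * th2)) / (alpha * (1 - alpha))

for alpha in (0.25, 0.5, 0.9):
    print(alpha, alpha_div_numeric(2.0, 0.5, alpha), alpha_div_closed(2.0, 0.5, alpha))
```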
Example 2.
Consider the family of univariate centered normal distributions $p_\sigma(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)$ with unnormalized densities $\tilde{p}_\sigma(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right)$ and partition function $Z(\sigma) = \sqrt{2\pi}\,\sigma$ such that $p_\sigma = \frac{\tilde{p}_\sigma}{Z(\sigma)}$. Here, we have a natural parameter $\theta = \frac{1}{\sigma^2} \in \Theta = (0,\infty)$ and sufficient statistic $t(x) = -\frac{x^2}{2}$. The partition function expressed with the natural parameter is $Z(\theta) = \sqrt{\frac{2\pi}{\theta}}$, with $Z'(\theta) = -\frac{1}{2}\sqrt{2\pi}\,\theta^{-\frac{3}{2}}$ and $Z''(\theta) = \frac{3}{4}\sqrt{2\pi}\,\theta^{-\frac{5}{2}} > 0$ (strictly convex on Θ). The unnormalized KLD between $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ is
$\mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = B_Z(\theta_2:\theta_1) = \sqrt{2\pi}\left(\frac{1}{\sqrt{\theta_2}} - \frac{1}{\sqrt{\theta_1}} + \frac{\theta_2-\theta_1}{2\,\theta_1^{\frac{3}{2}}}\right).$
We can check that we have $\mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = \sqrt{2\pi}\left(\sigma_2 - \frac{3}{2}\sigma_1 + \frac{\sigma_1^3}{2\sigma_2^2}\right) \geq 0$.
For the Hellinger divergence, we have
$D^2_H[\tilde{p}_{\theta_1},\tilde{p}_{\theta_2}] = J_Z(\theta_1,\theta_2) = \sqrt{2\pi}\left(\frac{1}{2\sqrt{\theta_1}} + \frac{1}{2\sqrt{\theta_2}} - \sqrt{\frac{2}{\theta_1+\theta_2}}\right),$
and we can check that $D^2_H[\tilde{p}_{\theta_1},\tilde{p}_{\theta_2}] \geq 0$.
Consider the family of the d-variate case of centered normal distributions with unnormalized density
$\tilde{p}_\Sigma(x) = \exp\left(-\frac{1}{2}\, x^\top\Sigma^{-1}x\right) = \exp\left(-\frac{1}{2}\,\mathrm{tr}\left(\Sigma^{-1}\, x x^\top\right)\right),$
obtained using the matrix trace cyclic property, where Σ is the covariance matrix. Here, we have $\theta = \Sigma^{-1}$ (precision matrix) and $t(x) = -\frac{1}{2}\, x x^\top$ for $\langle\theta, t(x)\rangle = -\frac{1}{2}\,\mathrm{tr}(\theta\, x x^\top)$, with the matrix inner product $\langle A, B\rangle = \mathrm{tr}(A B)$. The partition function expressed with the natural parameter is $Z(\theta) = (2\pi)^{\frac{d}{2}}\det(\theta)^{-\frac{1}{2}}$. This is a convex function with
$\nabla Z(\theta) = -\frac{1}{2}\,(2\pi)^{\frac{d}{2}}\det(\theta)^{-\frac{1}{2}}\,\theta^{-1},$
as $\nabla_\theta\det(\theta) = \det(\theta)\,\theta^{-1}$ for a symmetric invertible matrix θ using matrix calculus.
Now, consider the family of univariate normal distributions
$p_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$
Let $\lambda = (\mu,\sigma^2)$ and
$\theta(\lambda) = (\theta_1,\theta_2) = \left(\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right), \qquad t(x) = (x, x^2).$
The unnormalized densities are $\tilde{p}_\theta(x) = \exp\left(\theta_1 x + \theta_2 x^2\right)$, and we have
$Z(\theta) = \int_{-\infty}^{+\infty}\exp\left(\theta_1 x + \theta_2 x^2\right)\mathrm{d}x = \sqrt{\frac{\pi}{-\theta_2}}\,\exp\left(-\frac{\theta_1^2}{4\theta_2}\right).$
It follows that the KLD between the unnormalized densities $\tilde{p}_\theta$ and $\tilde{p}_{\theta'}$ is $B_Z(\theta':\theta)$ for this partition function Z.
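Returning to the univariate centered case above, the identity $\mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = B_Z(\theta_2:\theta_1)$ can be verified numerically as follows (a sketch assuming SciPy and the natural parameter $\theta = 1/\sigma^2$ used above; not code from the paper):

```python
# Sketch (illustrative; assumes SciPy and the natural parameter theta = 1/sigma^2 used
# above): the extended KLD between unnormalized centered Gaussian densities
# ptilde_theta(x) = exp(-theta*x^2/2) equals the reverse Bregman divergence induced by
# the partition function Z(theta) = sqrt(2*pi/theta).
import numpy as np
from scipy.integrate import quad

Z = lambda th: np.sqrt(2 * np.pi / th)
dZ = lambda th: -0.5 * np.sqrt(2 * np.pi) * th ** -1.5

def kl_unnormalized(th1, th2):
    # log(ptilde_1/ptilde_2) = (th2 - th1)*x^2/2 is expanded analytically to avoid
    # overflow in the ratio of the two unnormalized densities.
    integrand = lambda x: (np.exp(-0.5 * th1 * x**2) * 0.5 * (th2 - th1) * x**2
                           + np.exp(-0.5 * th2 * x**2) - np.exp(-0.5 * th1 * x**2))
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

def bregman_Z(ta, tb):  # B_Z(ta : tb)
    return Z(ta) - Z(tb) - (ta - tb) * dZ(tb)

th1, th2 = 1.0, 4.0     # sigma1 = 1, sigma2 = 1/2
print(kl_unnormalized(th1, th2), bregman_Z(th2, th1))  # both ~ 2.5066
```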
5. Deforming Convex Functions and Their Induced Dually Flat Spaces
5.1. Comparative Convexity
The log-convexity can be interpreted as a special case of comparative convexity with respect to a pair of comparable weighted means [9], as follows.
A function Z is $(M,N)$-convex with respect to a pair of weighted means $M_\alpha$ and $N_\alpha$ if and only if for all $\theta_1, \theta_2 \in \Theta$ and $\alpha \in [0,1]$ we have
$Z\left(M_\alpha(\theta_1,\theta_2)\right) \leq N_\alpha\left(Z(\theta_1), Z(\theta_2)\right),$ (26)
and Z is strictly $(M,N)$-convex iff we have strict inequality for $\theta_1 \neq \theta_2$ and $\alpha \in (0,1)$. Furthermore, a function Z is (strictly) $(M,N)$-concave if the inequality of Equation (26) is (strictly) reversed.
Log-convexity corresponds to $(A,G)$-convexity, i.e., convexity with respect to the weighted arithmetical and geometrical means defined respectively by $A_\alpha(a,b) = \alpha a + (1-\alpha) b$ and $G_\alpha(a,b) = a^\alpha b^{1-\alpha}$. Ordinary convexity is $(A,A)$-convexity.
A weighted quasi-arithmetical mean [34] (also called a Kolmogorov–Nagumo mean [35]) is defined for a continuous and strictly increasing function h by
$M_{h,\alpha}(a,b) = h^{-1}\left(\alpha\, h(a) + (1-\alpha)\, h(b)\right), \qquad \alpha \in [0,1].$
We let $M_h(a,b) = M_{h,\frac{1}{2}}(a,b)$. Quasi-arithmetical means include the arithmetical mean A obtained for $h(u) = u$ and the geometrical mean G for $h(u) = \log u$, and more generally power means
$M_{p,\alpha}(a,b) = \left(\alpha\, a^p + (1-\alpha)\, b^p\right)^{\frac{1}{p}}, \qquad p \neq 0,$
which are quasi-arithmetical means obtained for the family of generators $h_p(u) = u^p$ with inverse $h_p^{-1}(u) = u^{\frac{1}{p}}$. In the limit $p \to 0$, we have $M_{p,\alpha} \to G_\alpha$ for the generator $h_0(u) = \log u$.
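The following minimal sketch (illustrative, assuming NumPy; the helper names are ours, not the paper's) implements the weighted quasi-arithmetical mean and the power means, and checks the $p \to 0$ limit against the weighted geometric mean:

```python
# Minimal sketch of weighted quasi-arithmetical means (illustrative, assuming NumPy):
# M_{h,alpha}(a, b) = h^{-1}(alpha*h(a) + (1-alpha)*h(b)), with the power means recovered
# for h(u) = u**p and the geometric mean recovered in the limit p -> 0.
import numpy as np

def quasi_arithmetic_mean(h, h_inv, a, b, alpha=0.5):
    return h_inv(alpha * h(a) + (1 - alpha) * h(b))

def power_mean(a, b, p, alpha=0.5):
    return (alpha * a**p + (1 - alpha) * b**p) ** (1.0 / p)

a, b, alpha = 2.0, 8.0, 0.3
print(quasi_arithmetic_mean(lambda u: u, lambda u: u, a, b, alpha))   # arithmetic mean
print(quasi_arithmetic_mean(np.log, np.exp, a, b, alpha))             # geometric mean
print(power_mean(a, b, 1e-9, alpha), a**alpha * b**(1 - alpha))       # p -> 0 limit
```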
Proposition 6
([36,37]). A function Z is strictly $(M_\rho, M_\tau)$-convex with respect to two strictly increasing smooth functions ρ and τ if and only if the function $F = \tau \circ Z \circ \rho^{-1}$ is strictly convex.
Notice that the set of strictly increasing smooth functions forms a non-Abelian group, with the group operation being function composition, the neutral element being the identity function, and the inverse element being the functional inverse function.
Because log-convexity is $(A,G)$-convexity, a function Z is strictly log-convex iff $F = \log\circ Z$ is strictly convex. We have
$\log\circ\, Z\circ\mathrm{id}^{-1} = \log Z = F,$
thus recovering the cumulant function from the partition function.
Starting from a given convex function Z, we can deform the function to obtain a function $Z_{\rho,\tau}$ using two strictly monotone functions ρ and τ: $Z_{\rho,\tau} = \tau\circ Z\circ\rho^{-1}$.
For an $(M_\rho, M_\tau)$-convex function Z which is also strictly convex, we can define a pair of Bregman divergences $B_Z$ and $B_{Z_{\rho,\tau}}$ with $Z_{\rho,\tau} = \tau\circ Z\circ\rho^{-1}$, along with a corresponding pair of skewed Jensen divergences.
Thus, we have the following generic deformation scheme: deform a strictly convex function Z into $Z_{\rho,\tau} = \tau\circ Z\circ\rho^{-1}$; whenever ordinary strict convexity is preserved, $Z_{\rho,\tau}$ induces its own Bregman divergences, skewed Jensen divergences, and dually flat space.
In particular, when the function Z is deformed by the strictly increasing power functions $\rho(u) = u^p$ and $\tau(u) = u^q$ for p and q in $(0,\infty)$ as
$Z_{p,q}(\theta) = \left(Z\left(\theta^{\frac{1}{p}}\right)\right)^{q},$
then $Z_{p,q}$ is strictly convex when Z is strictly $(M_p, M_q)$-convex, and as such induces corresponding Bregman and Jensen divergences.
Example 3.
Consider the partition function $Z(\theta) = \frac{1}{\theta}$ of the exponential distribution family ($\Theta = (0,\infty)$ with $t(x) = -x$, see Example 1). Let $Z_p(\theta) = \tau_p(Z(\theta)) = \theta^{-p}$ for the power function $\tau_p(u) = u^p$; then, we have $Z_p''(\theta) = p(p+1)\,\theta^{-p-2} > 0$ when $p > 0$. Thus, we can deform Z smoothly by $\tau_p$ while preserving the convexity by ranging p from 0 (excluded) to $+\infty$. In this way, we obtain a corresponding family of Bregman and Jensen divergences.
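A hedged numerical companion to Example 3 (a sketch under the conventions above; the second-difference convexity test is only a quick check, not a proof):

```python
# A hedged companion to Example 3 (illustrative only; the second-difference test is a
# quick numerical check of convexity, not a proof).
import numpy as np

Z = lambda th: 1.0 / th                   # partition function of Example 1, theta > 0

def deform(p):
    # Outer power deformation Z_p = tau_p o Z with tau_p(u) = u**p, i.e., Z_p(theta) = theta**(-p).
    return lambda th: Z(th) ** p

def looks_convex(f, grid):
    vals = f(grid)
    return bool(np.all(vals[:-2] - 2 * vals[1:-1] + vals[2:] >= 0))

grid = np.linspace(0.1, 5.0, 400)
for p in (0.5, 1.0, 3.0):
    print(p, looks_convex(deform(p), grid))   # True: Z_p''(theta) = p*(p+1)*theta**(-p-2) > 0

# Each convexity-preserving deformation induces its own Bregman divergence B_{Z_p}.
def bregman_Zp(p, t1, t2):
    Zp = deform(p)
    dZp = lambda th: -p * th ** (-p - 1)
    return Zp(t1) - Zp(t2) - (t1 - t2) * dZp(t2)

print(bregman_Zp(2.0, 1.0, 3.0))              # non-negative
```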
The proposed convex deformation using quasi-arithmetical mean generators differs from the interpolation of convex functions using the technique of proximal averaging [38].
Note that in [37] the comparative convexity with respect to a pair of quasi-arithmetical means $(M_\rho, M_\tau)$ is used to define a $(M_\rho, M_\tau)$-Bregman divergence, which turns out to be equivalent to a conformal Bregman divergence on the ρ-embedding of the parameters.
5.2. Dually Flat Spaces
We start with a refinement of the class of convex functions used to generate dually flat spaces.
Definition 2
(Legendre type function [39]). $(\Theta, F)$ is of Legendre type if the function $F: \Theta \to \mathbb{R}$ is strictly convex and differentiable with $\Theta \neq \emptyset$ open and
$\lim_{\theta \to \partial\Theta} \|\nabla F(\theta)\| = +\infty,$ (27)
where $\partial\Theta$ denotes the boundary of Θ.
Legendre-type functions $(\Theta, F)$ admit a convex conjugate $(\mathrm{H}, F^*)$ via the Legendre transform $F \mapsto F^*$:
$F^*(\eta) = \sup_{\theta\in\Theta}\left\{\langle\theta,\eta\rangle - F(\theta)\right\}.$
A smooth and strictly convex function of Legendre type $(\Theta, F)$ induces a dually flat space [1] M, i.e., a smooth Hessian manifold [40] with a single global chart [1]. A canonical divergence D between two points p and q of M is viewed as a single-parameter contrast function [41] on the product manifold $M \times M$. The canonical divergence D and its dual canonical divergence $D^*(p:q) = D(q:p)$ can be expressed equivalently as either dual Bregman divergences or dual Fenchel–Young divergences (Figure 2):
$D(p:q) = B_F(\theta(p):\theta(q)) = Y_F(\theta(p):\eta(q)), \qquad D^*(p:q) = B_{F^*}(\eta(p):\eta(q)) = Y_{F^*}(\eta(p):\theta(q)),$
where $Y_F$ is the Fenchel–Young divergence:
$Y_F(\theta:\eta) = F(\theta) + F^*(\eta) - \langle\theta,\eta\rangle.$
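As a concrete illustration of the Fenchel–Young identity (a sketch assuming NumPy; the Poisson cumulant $F(\theta)=\exp(\theta)$ with conjugate $F^*(\eta)=\eta\log\eta-\eta$ is a standard example, not a construction specific to this paper):

```python
# Sketch (illustrative): for F(theta) = exp(theta), the convex conjugate is
# F*(eta) = eta*log(eta) - eta, and the Bregman divergence equals the Fenchel-Young
# divergence evaluated in mixed coordinates.
import numpy as np

F = np.exp
grad_F = np.exp                          # eta = grad F(theta) = exp(theta)
F_star = lambda eta: eta * np.log(eta) - eta

def bregman_F(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * grad_F(t2)

def fenchel_young(t1, eta2):
    return F(t1) + F_star(eta2) - t1 * eta2

t1, t2 = 0.7, -0.2
eta2 = grad_F(t2)
print(bregman_F(t1, t2), fenchel_young(t1, eta2))   # identical values
```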
Figure 2.
The canonical divergence D and dual canonical divergence $D^*$ on a dually flat space equipped with potential functions F and $F^*$ can be viewed as single-parameter contrast functions on the product manifold $M\times M$: The divergence D can be expressed using either the θ-coordinate system as a Bregman divergence or the mixed $(\theta,\eta)$-coordinate system as a Fenchel–Young divergence. Similarly, the dual divergence $D^*$ can be expressed using either the η-coordinate system as a dual Bregman divergence or the mixed $(\eta,\theta)$-coordinate system as a dual Fenchel–Young divergence.
We have the dual global coordinate system $\eta = \nabla F(\theta)$ and the dual domain $\mathrm{H} = \nabla F(\Theta)$, which defines the dual Legendre-type potential function $(\mathrm{H}, F^*)$. The Legendre-type property ensures that $(F^*)^* = F$ (a sufficient condition is to have F be convex and lower semi-continuous [42]).
A manifold is called dually flat, as the torsion-free affine connections ∇ and $\nabla^*$ induced by the potential functions $F(\theta)$ and $F^*(\eta)$ linked with the Legendre–Fenchel transformation are flat [1]; that is, their Christoffel symbols vanish in the dual coordinate systems: $\Gamma(\theta) = 0$ and $\Gamma^*(\eta) = 0$.
The Legendre-type function is not defined uniquely; the function $\bar{F}(\bar\theta) = F(A\bar\theta + b) + \langle c, \bar\theta\rangle + d$ obtained for an invertible matrix A, vectors b and c, and a scalar d defines the same dually flat space with the same canonical divergence D(p:q):
$B_{\bar{F}}(\bar\theta_1:\bar\theta_2) = B_F(\theta_1:\theta_2) = D(p:q), \qquad \theta_i = A\bar\theta_i + b.$
Thus, a log-convex Legendre-type function Z induces two dually flat spaces by considering the DFSs induced by Z and $F = \log Z$. Let the gradient maps be $\eta = \nabla F(\theta)$ and $\xi = \nabla Z(\theta) = Z(\theta)\,\nabla F(\theta)$.
When F is chosen as the cumulant function of an exponential family, the Bregman divergence $B_F$ can be interpreted as a statistical divergence between corresponding probability densities, meaning that the Bregman divergence amounts to the reverse Kullback–Leibler divergence: $B_F(\theta_1:\theta_2) = \mathrm{KL}[p_{\theta_2}:p_{\theta_1}] = \mathrm{KL}^*[p_{\theta_1}:p_{\theta_2}]$, where $\mathrm{KL}^*$ is the reverse KLD.
Notice that deforming a convex function F into a deformed function that remains strictly convex has been considered by Yoshizawa and Tanabe [43] to build a two-parameter deformation of the dually flat space induced by the cumulant function of the multivariate normal family. Additionally, see the method of Hougaard [44] for obtaining other exponential families from a given exponential family.
Thus, in general, there are many more dually flat spaces with corresponding divergences and statistical divergences than the usually considered exponential family manifold [5] induced by the cumulant function. It is interesting to consider their use in information sciences.
6. Conclusions and Discussion
For machine learning practitioners, it is well known that the Kullback–Leibler divergence (KLD) between two probability densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family with cumulant function F (free energy in thermodynamics) amounts to a reverse Bregman divergence [26] induced by F, or equivalently to a reverse Fenchel–Young divergence [27]:
$\mathrm{KL}[p_{\theta_1}:p_{\theta_2}] = B_F(\theta_2:\theta_1) = Y_F(\theta_2:\eta_1),$
where $\eta = \nabla F(\theta)$ is the dual moment or expectation parameter.
In this paper, we have shown that the KLD as extended to positive unnormalized densities $\tilde{p}_{\theta_1}$ and $\tilde{p}_{\theta_2}$ of an exponential family with a convex partition function Z (Laplace transform) amounts to a reverse Bregman divergence induced by Z, or equivalently to a reverse Fenchel–Young divergence:
$\mathrm{KL}[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = B_Z(\theta_2:\theta_1) = Y_Z(\theta_2:\xi_1),$
where $\xi = \nabla Z(\theta)$.
More generally, we have shown that the scaled α-skewed Jensen divergences induced by the cumulant and partition functions between natural parameters coincide with the scaled α-skewed Bhattacharyya distances between probability densities and the α-divergences between unnormalized densities, respectively:
$D^s_{B,\alpha}[p_{\theta_1}:p_{\theta_2}] = J^s_{F,\alpha}(\theta_1:\theta_2), \qquad D_\alpha[\tilde{p}_{\theta_1}:\tilde{p}_{\theta_2}] = J^s_{Z,\alpha}(\theta_1:\theta_2).$
We have noted that the partition functions Z of exponential families are both convex and log-convex, and that the corresponding cumulant functions are both convex and exponentially convex.
Figure 3 summarizes the relationships between statistical divergences and between the normalized and unnormalized densities of an exponential family, as well as the corresponding divergences between their natural parameters. Notice that Brekelmans and Nielsen [45] considered deformed uni-order likelihood ratio exponential families (LREFs) for annealing paths and obtained an identity for the -divergences between unnormalized densities and Bregman divergences induced by multiplicatively scaled partition functions.
Figure 3.
Statistical divergences between normalized and unnormalized densities of an exponential family with corresponding divergences between their natural parameters. Without loss of generality, we consider a natural exponential family (i.e., $t(x) = x$ and $k(x) = 0$) with cumulant function F and partition function Z, with $J_{F,\alpha}$ and $B_F$ respectively denoting the Jensen and Bregman divergences induced by the generator F. The statistical divergences $R_\alpha$ and $D_{B,\alpha}$ denote the Rényi α-divergences and skewed α-Bhattacharyya distances, respectively. The superscript “s” indicates rescaling by the multiplicative factor $\frac{1}{\alpha(1-\alpha)}$, while the superscript “*” denotes the reverse divergence obtained by swapping the parameter order.
Because the log-convex partition function Z is also convex, we have generalized the principle of building pairs of convex generators using the comparative convexity with respect to a pair of quasi-arithmetical means, and have further discussed the induced dually flat spaces and divergences. In particular, by considering the convexity-preserving deformations obtained by power mean generators, we have shown how to obtain a family of convex generators and dually flat spaces. Notice that some parametric families of Bregman divergences, such as the α-divergences [46], β-divergences [47], and V-geometry [48] of symmetric positive-definite matrices, yield families of dually flat spaces.
Banerjee et al. [49] proved a duality between regular exponential families and a subclass of Bregman divergences, which they accordingly termed regular Bregman divergences. In particular, this duality allows the Maximum Likelihood Estimator (MLE) of an exponential family with a cumulant function F to be viewed as a right-sided Bregman centroid with respect to the Legendre–Fenchel dual . In [50], the scope of this duality was further extended for arbitrary Bregman divergences by introducing a class of generalized exponential families.
Concave deformations have been recently studied in [51], where the authors introduced the concavity induced by a positive continuous function generating a deformed logarithm as a comparative concavity (Definition 1.2 in [51]), as well as the weaker notion of F-concavity (Definition 2.1 in [51], requiring strictly increasing functions F). Our deformation framework is more general, as it is double-sided: we jointly deform the function F by the outer function τ and its argument by the inner function ρ.
Exponentially concave functions have been considered as generators of L-divergences in [24]; α-exponentially concave functions G, such that $\exp(\alpha G)$ is concave for $\alpha > 0$, generalize the L-divergences to $L_\alpha$-divergences, which can be expressed equivalently using a generalization of the Fenchel–Young divergence based on the c-transforms [24]. When $\alpha < 0$, exponentially convex functions are considered instead of exponentially concave functions. The information geometry induced by $L_\alpha$-divergences is dually projectively flat with constant curvature, and reciprocally, dually projectively flat structures with constant curvature induce (locally) a canonical $L_\alpha$-divergence. Wong and Zhang [52] investigated a one-parameter deformation of convex duality, called λ-duality, by considering functions f such that $\exp(\lambda f)$ is convex. They defined the λ-conjugate transform as a particular case of the c-transform [24] and studied the information geometry of the induced λ-logarithmic divergences. The λ-duality yields a generalization of exponential and mixture families to λ-exponential and λ-mixture families related to the Rényi divergence.
Finally, certain statistical divergences, called projective divergences, are invariant under rescaling of their arguments, and as such can define dissimilarities between non-normalized densities. For example, the γ-divergences [32] are such that $D_\gamma[\tilde{p}:\lambda\tilde{q}] = D_\gamma[\tilde{p}:\tilde{q}]$ for any $\lambda > 0$ (with the γ-divergences tending to the KLD when $\gamma \to 0$); the Cauchy–Schwarz divergence [53] is another example.
Acknowledgments
The author heartily thanks the three reviewers for their helpful comments which led to this improved paper.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
Author Frank Nielsen is employed by the company Sony Computer Science Laboratories Inc. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare no conflicts of interest.
Funding Statement
This research received no external funding.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1.Amari S.I. Information Geometry and Its Applications. Springer; Tokyo, Japan: 2016. Applied Mathematical Sciences. [Google Scholar]
- 2.Bregman L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967;7:200–217. doi: 10.1016/0041-5553(67)90040-7. [DOI] [Google Scholar]
- 3.Nielsen F., Hadjeres G. Geometric Structures of Information. Springer; Berlin/Heidelberg, Germany: 2019. Monte Carlo information-geometric structures; pp. 69–103. [Google Scholar]
- 4.Brown L.D. Lecture Notes-Monograph Series. Volume 9 Cornell University; Ithaca, NY, USA: 1986. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. [Google Scholar]
- 5.Scarfone A.M., Wada T. Legendre structure of κ-thermostatistics revisited in the framework of information geometry. J. Phys. Math. Theor. 2014;47:275002. doi: 10.1088/1751-8113/47/27/275002. [DOI] [Google Scholar]
- 6.Zhang J. Divergence function, duality, and convex analysis. Neural Comput. 2004;16:159–195. doi: 10.1162/08997660460734047. [DOI] [PubMed] [Google Scholar]
- 7.Nielsen F., Boltz S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory. 2011;57:5455–5466. doi: 10.1109/TIT.2011.2159046. [DOI] [Google Scholar]
- 8.Cichocki A., Amari S.I. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy. 2010;12:1532–1568. doi: 10.3390/e12061532. [DOI] [Google Scholar]
- 9.Niculescu C., Persson L.E. Convex Functions and Their Applications. 2nd ed. Volume 23. Springer; Berlin/Heidelberg, Germany: 2018. first edition published in 2006. [Google Scholar]
- 10.Billingsley P. Probability and Measure. John Wiley & Sons; Hoboken, NJ, USA: 2017. [Google Scholar]
- 11.Barndorff-Nielsen O. Information and Exponential Families. John Wiley & Sons; Hoboken, NJ, USA: 2014. [Google Scholar]
- 12.Morris C.N. Natural exponential families with quadratic variance functions. Ann. Stat. 1982;10:65–80. doi: 10.1214/aos/1176345690. [DOI] [Google Scholar]
- 13.Efron B. Exponential Families in Theory and Practice. Cambridge University Press; Cambridge, UK: 2022. [Google Scholar]
- 14.Grünwald P.D. The Minimum Description Length Principle. MIT Press; Cambridge, MA, USA: 2007. [Google Scholar]
- 15.Kailath T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967;15:52–60. doi: 10.1109/TCOM.1967.1089532. [DOI] [Google Scholar]
- 16.Wainwright M.J., Jordan M.I. Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 2008;1:1–305. [Google Scholar]
- 17.LeCun Y., Chopra S., Hadsell R., Ranzato M., Huang F. Predicting Structured Data. Volume 1 University of Toronto; Toronto, ON, USA: 2006. A tutorial on energy-based learning. [Google Scholar]
- 18.Kindermann R., Snell J.L. Markov Random Fields and Their Applications. Volume 1 American Mathematical Society; Providence, RI, USA: 1980. [Google Scholar]
- 19.Dai B., Liu Z., Dai H., He N., Gretton A., Song L., Schuurmans D. Advances in Neural Information Processing Systems. Volume 32 MIT Press; Cambridge, MA, USA: 2019. Exponential family estimation via adversarial dynamics embedding. [Google Scholar]
- 20.Cobb L., Koppstein P., Chen N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983;78:124–130. doi: 10.1080/01621459.1983.10477940. [DOI] [Google Scholar]
- 21.Garcia V., Nielsen F. Simplification and hierarchical representations of mixtures of exponential families. Signal Process. 2010;90:3197–3212. doi: 10.1016/j.sigpro.2010.05.024. [DOI] [Google Scholar]
- 22.Zhang J., Wong T.K.L. Handbook of Statistics. Volume 45. Elsevier; Amsterdam, The Netherlands: 2021. λ-Deformed probability families with subtractive and divisive normalizations; pp. 187–215. [Google Scholar]
- 23.Boyd S.P., Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge, UK: 2004. [Google Scholar]
- 24.Wong T.K.L. Logarithmic divergences from optimal transport and Rényi geometry. Inf. Geom. 2018;1:39–78. doi: 10.1007/s41884-018-0012-6. [DOI] [Google Scholar]
- 25.Van Erven T., Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory. 2014;60:3797–3820. doi: 10.1109/TIT.2014.2320500. [DOI] [Google Scholar]
- 26.Azoury K.S., Warmuth M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001;43:211–246. doi: 10.1023/A:1010896012157. [DOI] [Google Scholar]
- 27.Amari S.I. Differential-Geometrical Methods in Statistics. 1st ed. Volume 28 Springer Science & Business Media; Berlin/Heidelberg, Germany: 2012. [Google Scholar]
- 28.Nielsen F. Statistical divergences between densities of truncated exponential families with nested supports: Duo Bregman and duo Jensen divergences. Entropy. 2022;24:421. doi: 10.3390/e24030421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Del Castillo J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994;46:57–66. doi: 10.1007/BF00773592. [DOI] [Google Scholar]
- 30.Wainwright M.J., Jaakkola T.S., Willsky A.S. A new class of upper bounds on the log partition function. IEEE Trans. Inf. Theory. 2005;51:2313–2335. doi: 10.1109/TIT.2005.850091. [DOI] [Google Scholar]
- 31.Hyvärinen A., Dayan P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005;6:695–709. [Google Scholar]
- 32.Fujisawa H., Eguchi S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008;99:2053–2081. doi: 10.1016/j.jmva.2008.02.004. [DOI] [Google Scholar]
- 33.Eguchi S., Komori O. Minimum Divergence Methods in Statistical Machine Learning. Springer; Berlin/Heidelberg, Germany: 2022. [Google Scholar]
- 34.Kolmogorov A. Sur la notion de la moyenne. Atti Accad. Naz. Lincei. 1930;12:388–391. [Google Scholar]
- 35.Komori O., Eguchi S. A unified formulation of k-Means, fuzzy c-Means and Gaussian mixture model by the Kolmogorov–Nagumo average. Entropy. 2021;23:518. doi: 10.3390/e23050518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Aczél J. A generalization of the notion of convex functions. Det K. Nor. Vidensk. Selsk. Forh. Trondheim. 1947;19:87–90. [Google Scholar]
- 37.Nielsen F., Nock R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017;24:1123–1127. doi: 10.1109/LSP.2017.2712195. [DOI] [Google Scholar]
- 38.Bauschke H.H., Goebel R., Lucet Y., Wang X. The proximal average: Basic theory. SIAM J. Optim. 2008;19:766–785. doi: 10.1137/070687542. [DOI] [Google Scholar]
- 39.Rockafellar R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967;19:200–205. doi: 10.4153/CJM-1967-012-4. [DOI] [Google Scholar]
- 40.Shima H. The Geometry of Hessian Structures. World Scientific; Singapore: 2007. [Google Scholar]
- 41.Eguchi S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985;15:341–391. doi: 10.32917/hmj/1206130775. [DOI] [Google Scholar]
- 42.Rockafellar R. Convex Analysis. Princeton University Press; Princeton, NJ, USA: 1997. Princeton Landmarks in Mathematics and Physics. [Google Scholar]
- 43.Yoshizawa S., Tanabe K. Dual differential geometry associated with the Kullbaek-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999;35:113–137. doi: 10.55937/sut/991985432. [DOI] [Google Scholar]
- 44.Hougaard P. Convex Functions in Exponential Families. Department of Mathematical Sciences, University of Copenhagen; Copenhagen, Denmark: 1983. [Google Scholar]
- 45.Brekelmans R., Nielsen F. Variational representations of annealing paths: Bregman information under monotonic embeddings. Inf. Geom. 2024 doi: 10.1007/s41884-023-00129-6. [DOI] [Google Scholar]
- 46.Amari S.I. α-Divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory. 2009;55:4925–4931. doi: 10.1109/TIT.2009.2030485. [DOI] [Google Scholar]
- 47.Hennequin R., David B., Badeau R. Beta-divergence as a subclass of Bregman divergence. IEEE Signal Process. Lett. 2010;18:83–86. doi: 10.1109/LSP.2010.2096211. [DOI] [Google Scholar]
- 48.Ohara A., Eguchi S. Group invariance of information geometry on q-Gaussian distributions induced by Beta-divergence. Entropy. 2013;15:4732–4747. doi: 10.3390/e15114732. [DOI] [Google Scholar]
- 49.Banerjee A., Merugu S., Dhillon I.S., Ghosh J., Lafferty J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005;6:1705–1749. [Google Scholar]
- 50.Frongillo R., Reid M.D. Convex foundations for generalized MaxEnt models. AIP Conf. Proc. 2014;1636:11–16. [Google Scholar]
- 51.Ishige K., Salani P., Takatsu A. Hierarchy of deformations in concavity. Inf. Geom. 2022;7:251–269. doi: 10.1007/s41884-022-00088-4. [DOI] [Google Scholar]
- 52.Zhang J., Wong T.K.L. λ-Deformation: A canonical framework for statistical manifolds of constant curvature. Entropy. 2022;24:193. doi: 10.3390/e24020193. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Jenssen R., Principe J.C., Erdogmus D., Eltoft T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006;343:614–629. doi: 10.1016/j.jfranklin.2006.03.018. [DOI] [Google Scholar]