Entropy. 2018 May 19;20(5):383. doi: 10.3390/e20050383

On f-Divergences: Integral Representations, Local Behavior, and Inequalities

Igal Sason 1
PMCID: PMC7512902  PMID: 33265473

Abstract

This paper is focused on f-divergences, consisting of three main contributions. The first one introduces integral representations of a general f-divergence by means of the relative information spectrum. The second part provides a new approach for the derivation of f-divergence inequalities, and it exemplifies their utility in the setup of Bayesian binary hypothesis testing. The last part of this paper further studies the local behavior of f-divergences.

Keywords: DeGroot statistical information, f-divergences, local behavior, relative information spectrum, Rényi divergence

1. Introduction

Probability theory, information theory, learning theory, statistical signal processing and other related disciplines greatly benefit from non-negative measures of dissimilarity (a.k.a. divergence measures) between pairs of probability measures defined on the same measurable space (see, e.g., [1,2,3,4,5,6,7]). An axiomatic characterization of information measures, including divergence measures, was provided by Csiszár [8]. Many useful divergence measures belong to the set of f-divergences, independently introduced by Ali and Silvey [9], Csiszár [10,11,12,13], and Morimoto [14] in the early sixties. The family of f-divergences generalizes the relative entropy (a.k.a. the Kullback–Leibler divergence) while also satisfying the data processing inequality among other pleasing properties (see, e.g., [3] and references therein).

Integral representations of f-divergences serve to study properties of these information measures, and they are also used to establish relations among these divergences. An integral representation of f-divergences, expressed by means of the DeGroot statistical information, was provided in [3] with a simplified proof in [15]. The importance of this integral representation stems from the operational meaning of the DeGroot statistical information [16], which is strongly linked to Bayesian binary hypothesis testing. Some earlier specialized versions of this integral representation were introduced in [17,18,19,20,21], and a variation of it also appears in [22] Section 5.B. Implications of the integral representation of f-divergences, by means of the DeGroot statistical information, include an alternative proof of the data processing inequality, and a study of conditions for the sufficiency or ε-deficiency of observation channels [3,15].

Since many distance measures of interest fall under the paradigm of an f-divergence [23], bounds among f-divergences are very useful in many instances such as the analysis of rates of convergence and concentration of measure bounds, hypothesis testing, testing goodness of fit, minimax risk in estimation and modeling, strong data processing inequalities and contraction coefficients, etc. Earlier studies developed systematic approaches to obtain f-divergence inequalities while dealing with pairs of probability measures defined on arbitrary alphabets. A list of some notable existing f-divergence inequalities is provided, e.g., in [22] Section 1 and [23] Section 3. State-of-the-art techniques which serve to derive bounds among f-divergences include:

  • (1)

    Moment inequalities which rely on log-convexity arguments ([22] Section 5.D, [24,25,26,27,28]);

  • (2)

    Inequalities which rely on a characterization of the exact locus of the joint range of f-divergences [29];

  • (3)

    f-divergence inequalities via functional domination ([22] Section 3, [30,31,32]);

  • (4)

    Sharp f-divergence inequalities by using numerical tools for maximizing or minimizing an f-divergence subject to a finite number of constraints on other f-divergences [33];

  • (5)

    Inequalities which rely on powers of f-divergences defining a distance [34,35,36,37];

  • (6)

    Vajda and Pinsker-type inequalities for f-divergences ([4,10,13,22] Sections 6–7, [38,39]);

  • (7)

    Bounds among f-divergences when the relative information is bounded ([22] Sections 4–5, [40,41,42,43,44,45,46,47]), and reverse Pinsker inequalities ([22] Section 6, [40,48]);

  • (8)

    Inequalities which rely on the minimum of an f-divergence for a given total variation distance and related bounds [4,33,37,38,49,50,51,52,53];

  • (9)

    Bounds among f-divergences (or functions of f-divergences such as the Rényi divergence) via integral representations of these divergence measures [22] Section 8;

  • (10)

    Inequalities which rely on variational representations of f-divergences (e.g., [54] Section 2).

Following earlier studies of the local behavior of f-divergences and their asymptotic properties (see related results by Csiszár and Shields [55] Theorem 4.1, Pardo and Vajda [56] Section 3, and Sason and Verdú [22] Section 3.F), it is known that an f-divergence locally behaves like the chi-square divergence (up to a scaling factor which depends on f), provided that the first distribution approaches the reference measure in a certain strong sense. The study of the local behavior of f-divergences is an important aspect of their properties, and we further study it in this work.

This paper considers properties of f-divergences, while first introducing in Section 2 the basic definitions and notation needed, and in particular the various measures of dissimilarity between probability measures used throughout this paper. The presentation of our new results is then structured as follows:

Section 3 is focused on the derivation of new integral representations of f-divergences, expressed as a function of the relative information spectrum of the pair of probability measures, and the convex function f. The novelty of Section 3 is in the unified approach which leads to integral representations of f-divergences by means of the relative information spectrum, where the latter cumulative distribution function plays an important role in information theory and statistical decision theory (see, e.g., [7,54]). Particular integral representations of the type of results introduced in Section 3 have been recently derived by Sason and Verdú on a case-by-case basis for some f-divergences (see [22] Theorems 13 and 32), while lacking the approach which is developed in Section 3 for general f-divergences. In essence, an f-divergence Df(P‖Q) is expressed in Section 3 as an inner product of a simple function of the relative information spectrum (depending only on the probability measures P and Q), and a non-negative weight function wf : (0,∞) → [0,∞) which only depends on f. This kind of representation, followed by a generalized result, serves to provide new integral representations of various useful f-divergences. It also enables us, in Section 3, to characterize the interplay between the DeGroot statistical information (or between another useful family of f-divergences, namely the Eγ divergence with γ ≥ 1) and the relative information spectrum.

Section 4 provides a new approach for the derivation of f-divergence inequalities, where an arbitrary f-divergence is lower bounded by means of the Eγ divergence [57] or the DeGroot statistical information [16]. The approach used in Section 4 yields several generalizations of the Bretagnole-Huber inequality [58], which provides a closed-form and simple upper bound on the total variation distance as a function of the relative entropy; the Bretagnole-Huber inequality has been proved to be useful, e.g., in the context of lower bounding the minimax risk in non-parametric estimation (see, e.g., [5] pp. 89–90, 94), and in the problem of density estimation (see, e.g., [6] Section 1.6). Although Vajda’s tight lower bound in [59] is slightly tighter everywhere than the Bretagnole-Huber inequality, our motivation for the generalization of the latter bound is justified later in this paper. The utility of the new inequalities is exemplified in the setup of Bayesian binary hypothesis testing.

Section 5 finally derives new results on the local behavior of f-divergences, i.e., the characterization of their scaling when the pair of probability measures are sufficiently close to each other. The starting point of our analysis in Section 5 relies on the analysis in [56] Section 3, regarding the asymptotic properties of f-divergences.

The reading of Section 3, Section 4 and Section 5 can be done in any order since the analysis in these sections is independent.

2. Preliminaries and Notation

We assume throughout that the probability measures P and Q are defined on a common measurable space (A, F), and P ≪ Q denotes that P is absolutely continuous with respect to Q, namely there is no event F ∈ F such that P(F) > 0 = Q(F).

Definition 1.

The relative information provided by a ∈ A according to (P, Q), where P ≪ Q, is given by

ıP‖Q(a) := log (dP/dQ)(a). (1)

More generally, even if P ≪ Q does not hold, let R be an arbitrary dominating probability measure such that P, Q ≪ R (e.g., R = ½(P + Q)); irrespectively of the choice of R, the relative information is defined to be

ıP‖Q(a) := ıP‖R(a) − ıQ‖R(a), a ∈ A. (2)

The following asymmetry property follows from (2):

ıP‖Q = −ıQ‖P. (3)

Definition 2.

The relative information spectrum is the cumulative distribution function

FP‖Q(x) = P[ıP‖Q(X) ≤ x], x ∈ R, X ∼ P. (4)

The relative entropy is the expected value of the relative information when it is distributed according to P:

D(P‖Q) := E[ıP‖Q(X)], X ∼ P. (5)

Throughout this paper, C denotes the set of convex functions f : (0,∞) → R with f(1) = 0. Hence, the function f ≡ 0 is in C; if f ∈ C, then af ∈ C for all a > 0; and if f, g ∈ C, then f + g ∈ C. We next provide a general definition for the family of f-divergences (see [3] p. 4398).

Definition 3

(f-divergence [9,10,12]). Let P and Q be probability measures, let μ be a dominating measure of P and Q (i.e., P, Q ≪ μ; e.g., μ = P + Q), and let p := dP/dμ and q := dQ/dμ. The f-divergence from P to Q is given, independently of μ, by

Df(P‖Q) := ∫ q f(p/q) dμ, (6)

where

f(0) := lim_{t↓0} f(t), (7)
0 f(0/0) := 0, (8)
0 f(a/0) := lim_{t↓0} t f(a/t) = a lim_{u→∞} f(u)/u, a > 0. (9)
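As a concrete illustration of Definition 3 and the conventions (7)–(9), here is a minimal sketch (Python, natural logarithms) that evaluates Df(P‖Q) on a finite alphabet; the helper name f_divergence and its arguments are illustrative and not part of the paper.

```python
import numpy as np

def f_divergence(f, p, q, f_star_at_0=np.inf):
    """Sketch of Definition 3 on a finite alphabet.

    p, q: probability vectors on a common alphabet.
    f: convex function with f(1) = 0.
    f_star_at_0: the value f*(0) = lim_{u->oo} f(u)/u, used when q(a) = 0 < p(a).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)      # regular term q f(p/q); f(0) handled by f itself, cf. (7)
        elif pi > 0:
            total += pi * f_star_at_0     # convention (9): a * lim_{u->oo} f(u)/u
        # pi = qi = 0 contributes nothing, by convention (8)
    return total

# Relative entropy (in nats), via f(t) = t log t with f(0) = 0:
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
f_kl = lambda t: t * np.log(t) if t > 0 else 0.0
print(f_divergence(f_kl, p, q))                            # sum_i p_i log(p_i / q_i)
print(sum(pi * np.log(pi / qi) for pi, qi in zip(p, q)))   # same value, computed directly
```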

We rely in this paper on the following properties of f-divergences:

Proposition 1.

Let f, g ∈ C. The following conditions are equivalent:

  • (1)
    Df(P‖Q) = Dg(P‖Q) for every pair of probability measures (P, Q); (10)
  • (2)
    there exists a constant c ∈ R such that
    f(t) − g(t) = c(t − 1), t ∈ (0,∞). (11)

Proposition 2.

Let f ∈ C, and let f* : (0,∞) → R be the conjugate function, given by

f*(t) = t f(1/t) (12)

for t > 0. Then, f* ∈ C; f** = f, and for every pair of probability measures (P, Q),

Df(P‖Q) = Df*(Q‖P). (13)

By an analytic extension of f* in (12) at t = 0, let

f*(0) := lim_{t↓0} f*(t) = lim_{u→∞} f(u)/u. (14)

Note that the convexity of f implies that f*(0) ∈ (−∞, ∞]. In continuation to Definition 3, we get

Df(P‖Q) = ∫ q f(p/q) dμ (15)
= ∫_{pq>0} q f(p/q) dμ + Q(p = 0) f(0) + P(q = 0) f*(0), (16)

with the convention in (16) that 0 · ∞ = 0. We refer in this paper to the following f-divergences:

  • (1)
    Relative entropy:
    D(PQ)=Df(PQ). (17)
    with
    f(t)=tlogt,t>0. (18)
  • (2)
    Jeffrey’s divergence [60]:
    J(PQ):=D(PQ)+D(QP) (19)
    =Df(PQ) (20)
    with
    f(t)=(t1)logt,t>0. (21)
  • (3)
    Hellinger divergence of orderα(0,1)(1,) [2] Definition 2.10:
    Hα(PQ)=Dfα(PQ) (22)
    with
    fα(t)=tα1α1,t>0. (23)
    Some of the significance of the Hellinger divergence stems from the following facts:
    • -
      The analytic extension of Hα(PQ) at α=1 yields
      D(PQ)=H1(PQ)loge. (24)
    • -
      The chi-squared divergence [61] is the second order Hellinger divergence (see, e.g., [62] p. 48), i.e.,
      χ2(PQ)=H2(PQ). (25)
      Note that, due to Proposition 1,
      χ2(PQ)=Df(PQ), (26)
      where f:(0,)R can be defined as
      f(t)=(t1)2,t>0. (27)
    • -
      The squared Hellinger distance (see, e.g., [62] p. 47), denoted by H2(PQ), satisfies the identity
      H2(PQ)=12H12(PQ). (28)
    • -
      The Bhattacharyya distance [63], denoted by B(PQ), satisfies
      B(PQ)=log11H2(PQ). (29)
    • -
      The Rényi divergence of order α(0,1)(1,) is a one-to-one transformation of the Hellinger divergence of the same order [11] (14):
      Dα(PQ)=1α1log1+(α1)Hα(PQ). (30)
    • -
      The Alpha-divergence of order α, as it is defined in [64] and ([65] (4)), is a generalized relative entropy which (up to a scaling factor) is equal to the Hellinger divergence of the same order α. More explicitly,
      DA(α)(PQ)=1αHα(PQ), (31)
      where DA(α)(··) denotes the Alpha-divergence of order α. Note, however, that the Beta and Gamma-divergences in [65], as well as the generalized divergences in [66,67], are not f-divergences in general.
  • (4)
    χsdivergence for s1 [2] (2.31), and the total variation distance: The function
    fs(t)=|t1|s,t>0 (32)
    results in
    χs(PQ)=Dfs(PQ). (33)
    Specifically, for s=1, let
    f(t):=f1(t)=|t1|,t>0, (34)
    and the total variation distance is expressed as an f-divergence:
    |PQ|=Df(PQ). (35)
  • (5)
    Triangular Discrimination [39] (a.k.a. Vincze-Le Cam distance):
    Δ(PQ)=Df(PQ) (36)
    with
    f(t)=(t1)2t+1,t>0. (37)
    Note that
    12Δ(PQ)=χ2(P12P+12Q)=χ2(Q12P+12Q). (38)
  • (6)
    Lin’s measure [68] (4.1):
    (39)Lθ(PQ):=HθP+(1θ)QθH(P)(1θ)H(Q)(40)=θDPθP+(1θ)Q+(1θ)DQθP+(1θ)Q,
    for θ[0,1]. This measure can be expressed by the following f-divergence:
    Lθ(PQ)=Dfθ(PQ), (41)
    with
    fθ(t):=θtlogtθt+1θlogθt+1θ,t>0. (42)
    The special case of (41) with θ=12 gives the Jensen-Shannon divergence (a.k.a. capacitory discrimination):
    (43)JS(PQ):=L12(PQ)(44)=12DP12P+12Q+12DQ12P+12Q.
  • (7)
    Eγ divergence [57] p. 2314: For γ ≥ 1,
    Eγ(P‖Q) := maxU∈F ( P(U) − γ Q(U) ) (45)
    = P[ıP‖Q(X) > log γ] − γ P[ıP‖Q(Y) > log γ] (46)
    with X ∼ P and Y ∼ Q, and where (46) follows from the Neyman-Pearson lemma. The Eγ divergence can be identified as an f-divergence:
    Eγ(P‖Q) = Dfγ(P‖Q) (47)
    with
    fγ(t) := (t − γ)⁺, t > 0, (48)
    where (x)⁺ := max{x, 0}. The following relation to the total variation distance holds:
    E1(P‖Q) = ½ |P−Q|. (49)
  • (8)
    DeGroot statistical information [3,16]: For ω ∈ (0,1),
    Iω(P‖Q) = Dφω(P‖Q) (50)
    with
    φω(t) = min{ω, 1−ω} − min{ωt, 1−ω}, t > 0. (51)
    The following relation to the total variation distance holds:
    I1/2(P‖Q) = ¼ |P−Q|, (52)
    and the DeGroot statistical information and the Eγ divergence are related as follows [22] (384):
    Iω(P‖Q) = ω E(1−ω)/ω(P‖Q) for ω ∈ (0, ½], and Iω(P‖Q) = (1−ω) Eω/(1−ω)(Q‖P) for ω ∈ (½, 1) (53)
    (these relations are illustrated numerically in the sketch following this list).
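The sketch below (finite alphabet, strictly positive Q so that the conventions (7)–(9) are not needed, natural logarithms) evaluates a few of the f-divergences listed above directly from their defining functions and checks the relations (49), (52) and (53); all helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))   # random strictly positive P and Q on a 6-letter alphabet
q = rng.dirichlet(np.ones(6))

def f_div(f, p, q):
    # finite-alphabet D_f(P||Q) with strictly positive q (no edge conventions needed)
    return float(np.sum(q * f(p / q)))

tv   = f_div(lambda t: np.abs(t - 1.0), p, q)       # |P - Q|, via (34)-(35)
chi2 = f_div(lambda t: (t - 1.0) ** 2, p, q)        # chi^2(P||Q), via (26)-(27)

def E_gamma(p, q, gamma):
    # E_gamma divergence via (47)-(48): f_gamma(t) = (t - gamma)^+
    return f_div(lambda t: np.maximum(t - gamma, 0.0), p, q)

def I_omega(p, q, w):
    # DeGroot statistical information via (50)-(51)
    return f_div(lambda t: np.minimum(w, 1 - w) - np.minimum(w * t, 1 - w), p, q)

# Relations (49), (52) and (53):
print(np.isclose(E_gamma(p, q, 1.0), 0.5 * tv))     # E_1(P||Q)   = |P-Q|/2
print(np.isclose(I_omega(p, q, 0.5), 0.25 * tv))    # I_{1/2}     = |P-Q|/4
w = 0.3                                             # an omega in (0, 1/2]
print(np.isclose(I_omega(p, q, w), w * E_gamma(p, q, (1 - w) / w)))
```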

3. New Integral Representations of f-Divergences

The main result in this section provides new integral representations of f-divergences as a function of the relative information spectrum (see Definition 2). The reader is referred to other integral representations (see [15] Section 2, [4] Section 5, [22] Section 5.B, and references therein), expressing a general f-divergence by means of the DeGroot statistical information or the Eγ divergence.

Lemma 1.

Let f ∈ C be a strictly convex function at 1. Let g : R → R be defined as

g(x) := exp(−x) f(exp(x)) − f′+(1) (1 − exp(−x)), x ∈ R, (54)

where f′+(1) denotes the right-hand derivative of f at 1 (due to the convexity of f on (0,∞), it exists and it is finite). Then, the function g is non-negative, it is strictly monotonically decreasing on (−∞,0], and it is strictly monotonically increasing on [0,∞) with g(0) = 0.

Proof. 

For any function uC, let u˜C be given by

u˜(t)=u(t)u+(1)(t1),t(0,), (55)

and let uC be the conjugate function, as given in (12). The function g in (54) can be expressed in the form

g(x)=(f˜)exp(x),xR, (56)

as it is next verified. For t>0, we get from (12) and (55),

(f˜)(t)=tf˜1t=tf1t+f+(1)(t1), (57)

and the substitution t:=exp(x) for xR yields (56) in view of (54).

By assumption, fC is strictly convex at 1, and therefore these properties are inherited to f˜. Since also f˜(1)=f˜(1)=0, it follows from [3] Theorem 3 that both f˜ and f˜ are non-negative on (0,), and they are also strictly monotonically decreasing on (0,1]. Hence, from (12), it follows that the function (f˜) is strictly monotonically increasing on [1,). Finally, the claimed properties of the function g follow from (56), and in view of the fact that the function (f˜) is non-negative with (f˜)(1)=0, strictly monotonically decreasing on (0,1] and strictly monotonically increasing on [1,). ☐

Lemma 2.

Let f ∈ C be a strictly convex function at 1, and let g : R → R be as in (54). Let

a := lim_{x→∞} g(x) ∈ (0, ∞], (58)
b := lim_{x→−∞} g(x) ∈ (0, ∞], (59)

and let ℓ1 : [0, a) → [0, ∞) and ℓ2 : [0, b) → (−∞, 0] be the two inverse functions of g. Then,

Df(P‖Q) = ∫0^a [1 − FP‖Q(ℓ1(t))] dt + ∫0^b FP‖Q(ℓ2(t)) dt. (60)

Proof. 

In view of Lemma 1, it follows that 1:[0,a)[0,) is strictly monotonically increasing and 2:[0,b)(,0] is strictly monotonically decreasing with 1(0)=2(0)=0.

Let XP, and let V:=expıPQ(X). Then, we have

(61)Df(PQ)=Df˜(PQ)(62)=D(f˜)(QP)(63)=(f˜)expıQP(x)dP(x)(64)=(f˜)expıPQ(x)dP(x)(65)=gıPQ(x)dP(x)(66)=Eg(V)(67)=0Pg(V)>tdt(68)=0aPV0,g(V)>tdt+0bPV<0,g(V)>tdt(69)=0aPV>1(t)dt+0bPV2(t)dt(70)=0a1FPQ1(t)dt+0bFPQ2(t)dt

where (61) relies on Proposition 1; (62) relies on Proposition 2; (64) follows from (3); (65) follows from (56); (66) holds by the definition of the random variable V; (67) holds since, in view of Lemma 1, Z:=g(V)0, and E[Z]=0P[Z>t]dt for any non-negative random variable Z; (68) holds in view of the monotonicity properties of g in Lemma 1, the definition of a and b in (58) and (59), and by expressing the event {g(V)>t} as a union of two disjoint events; (69) holds again by the monotonicity properties of g in Lemma 1, and by the definition of its two inverse functions 1 and 2 as above; in (67)–(69) we are free to substitute > by ≥, and < by ≤; finally, (70) holds by the definition of the relative information spectrum in (4). ☐

Remark 1.

The functiong:RRin (54) is invariant to the mapping f(t)f(t)+c(t1), for t>0, with an arbitrary cR. This invariance of g (and, hence, also the invariance of its inverse functions 1 and 2) is well expected in view of Proposition 1 and Lemma 2.

Example 1.

For the chi-squared divergence in (26), letting f be as in (27), it follows from (54) that

g(x)=4sinh212logex,xR, (71)

which yields, from (58) and (59), a = b = ∞. Calculation of the two inverse functions of g, as defined in Lemma 2, yields the following closed-form expression:

ℓ1,2(u) = ±2 log( (√u + √(u+4)) / 2 ), u ≥ 0. (72)

Substituting (72) into (60) provides an integral representation of χ2(PQ).
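For a finite alphabet, the representation (60) with the inverse functions (72) can be checked numerically; the sketch below uses natural logarithms and a simple trapezoidal approximation of the two integrals, truncated where the integrands vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))

z = np.log(p / q)                                   # relative information values
F = lambda x: float(np.sum(p[z <= x]))              # relative information spectrum (4)
g = lambda x: 4.0 * np.sinh(0.5 * x) ** 2           # g of (54) for f(t) = (t-1)^2
ell1 = lambda u: 2.0 * np.log((np.sqrt(u) + np.sqrt(u + 4.0)) / 2.0)   # (72), '+' branch
ell2 = lambda u: -ell1(u)                                              # (72), '-' branch

# Beyond t = g(max |z|) both integrands in (60) vanish, so truncate there.
T = g(np.max(np.abs(z)))
t = np.linspace(0.0, T, 200001)
integrand1 = np.array([1.0 - F(ell1(s)) for s in t])
integrand2 = np.array([F(ell2(s)) for s in t])
rhs = np.trapz(integrand1, t) + np.trapz(integrand2, t)

chi2 = float(np.sum((p - q) ** 2 / q))
print(chi2, rhs)   # the two values agree up to the discretization error
```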

Lemma 3.

∫0^∞ FP‖Q(log β) / β² dβ = 1. (73)

Proof. 

Let XP. Then, we have

(74)0FPQ(logβ)β2dβ=01β2P[ıPQ(X)logβ]dβ(75)=01β2PexpıQP(X)1βdβ(76)=0PexpıQP(X)udu(77)=EexpıQP(X)(78)=1,

where (74) holds by (4); (75) follows from (3); (76) holds by the substitution u:=1β; (77) holds since expıQP(X)0, and finally (78) holds since XP. ☐

Remark 2.

Unlike Example 1, in general, the inverse functions1and2in Lemma 2 are not expressible in closed form, motivating our next integral representation in Theorem 1.

The following theorem provides our main result in this section.

Theorem 1.

The following integral representations of an f-divergence, by means of the relative information spectrum, hold:

  • (1) 
    Let
    • -
      fCbe differentiable on(0,);
    • -
      wf:(0,)[0,)be the non-negative weight function given, forβ>0, by
      wf(β):=1βf(β)f(β)+f(1)β; (79)
    • -
      the function GPQ:(0,)[0,1] be given by
      GPQ(β):={1FPQ(logβ),β[1,),FPQ(logβ),β(0,1). (80)
    Then,
    Df(PQ)=wf,GPQ=0wf(β)GPQ(β)dβ. (81)
  • (2) 
    More generally, for an arbitrarycR, letw˜f,c:(0,)Rbe a modified real-valued function defined as
    w˜f,c(β):=wf(β)+cβ21{β1}1{0<β<1}. (82)
    Then,
    Df(PQ)=w˜f,c,GPQ. (83)

Proof. 

We start by proving the special integral representation in (81), and then extend our proof to the general representation in (83).

  • (1)
    We first assume an additional requirement that f is strictly convex at 1. In view of Lemma 2,
    1g(u)=u,u[0,), (84)
    2g(u)=u,u(,0]. (85)
    Since by assumption fC is differentiable on (0,) and strictly convex at 1, the function g in (54) is differentiable on R. In view of (84) and (85), substituting t:=glogβ in (60) for β>0 implies that
    Df(PQ)=11FPQlogβw¯f(β)dβ01FPQlogβw¯f(β)dβ, (86)
    where w¯f:(0,)R is given by
    (87)w¯f(β):=glogββloge(88)=1βf(β)f(β)+f(1)β
    for β>0, where (88) follows from (54). Due to the monotonicity properties of g in Lemma 1, (87) implies that w¯f(β)0 for β1, and w¯f(β)<0 for β(0,1). Hence, the weight function wf in (79) satisfies
    wf(β)=w¯f(β)=w¯f(β)1{β1}1{0<β<1},β>0. (89)

    The combination of (80), (86) and (89) gives the required result in (81).

    We now extend the result in (81) when fC is differentiable on (0,), but not necessarily strictly convex at 1. To that end, let s:(0,)R be defined as
    s(t):=f(t)+(t21),t>0. (90)
This implies that s ∈ C is differentiable on (0,∞), and it is also strictly convex at 1. In view of the proof of (81) under the strict convexity of f at 1, the application of this result to the function s in (90) yields
    Ds(PQ)=ws,GPQ. (91)
    In view of (6), (22), (23), (25) and (90),
    Ds(PQ)=Df(PQ)+χ2(PQ); (92)
    from (79), (89), (90) and the convexity and differentiability of fC, it follows that the weight function ws(0,)[0,) satisfies
    ws(β)=wf(β)+11β21{β1}1{0<β<1} (93)
    for β>0. Furthermore, by applying the result in (81) to the chi-squared divergence χ2(PQ) in (25) whose corresponding function f2(t):=t21 for t>0 is strictly convex at 1, we obtain
    χ2(PQ)=011β21{β1}1{0<β<1}GPQ(β)dβ. (94)

    Finally, the combination of (91)–(94), yields Df(PQ)=wf,GPQ; this asserts that (81) also holds by relaxing the condition that f is strictly convex at 1.

  • (2)
    In view of (80)–(82), in order to prove (83) for an arbitrary cR, it is required to prove the identity
    11FPQlogββ2dβ=01FPQlogββ2dβ. (95)

    Equality (95) can be verified by Lemma 3: by rearranging terms in (95), we get the identity in (73) (since 1dββ2=1). ☐

Remark 3.

Due to the convexity of f, the absolute value in the right side of (79) is only needed for β(0,1) (see (88) and (89)). Also, wf(1)=0 since f(1)=0.

Remark 4.

The weight functionwfonly depends on f, and the functionGPQonly depends on the pair of probability measures P and Q. In view of Proposition 1, it follows that, forf,gC, the equalitywf=wgholds on(0,)if and only if (11) is satisfied with an arbitrary constant cR. It is indeed easy to verify that (11) yields wf=wg on (0,).

Remark 5.

An equivalent way to writeGPQin (80) is

GPQ(β)={PdPdQ(X)>β,β[1,)PdPdQ(X)β,β(0,1) (96)

where XP. Hence, the function GPQ:(0,)[0,1] is monotonically increasing in (0,1), and it is monotonically decreasing in [1,); note that this function is in general discontinuous at 1 unless FPQ(0)=12. If PQ, then

limβ0GPQ(β)=limβGPQ(β)=0. (97)

Note that ifP=Q, thenGPQis zero everywhere, which is consistent with the fact thatDf(PQ)=0.

Remark 6.

In the proof of Theorem 1-(1), the relaxation of the condition of strict convexity at 1 for a differentiable function f ∈ C is crucial, e.g., for the χs divergence with s > 2. To clarify this claim, note that in view of (32), the function fs : (0,∞) → R is differentiable if s > 1, and fs ∈ C with fs′(1) = 0; however, fs″(1) = 0 if s > 2, so fs is not strictly convex at 1 unless s ∈ [1,2].

Remark 7.

Theorem 1-(2) with c ≠ 0 makes it possible, in some cases, to simplify integral representations of f-divergences. This is next exemplified in the proof of Theorem 2.
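The following sketch is a numerical sanity check of the representation (81) on a finite alphabet, using natural logarithms (so the log e factors equal 1) and reading the weight function (79) as wf(β) = |f′(β)/β − (f(β)+f′(1))/β²|; this reading is consistent with (88)–(89) and with the special cases (A1) and (A3) in Appendix A. All helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(4))
q = rng.dirichlet(np.ones(4))
z = np.log(p / q)                                    # relative information values

def F(x):                                            # relative information spectrum (4)
    return float(np.sum(p[z <= x]))

def G(beta):                                         # the function G_{P||Q} of (80)
    return 1.0 - F(np.log(beta)) if beta >= 1.0 else F(np.log(beta))

def rhs_81(f, fprime, n=400001):
    # <w_f, G_{P||Q}>, truncated where G vanishes (beyond the largest ratio p/q)
    beta = np.linspace(1e-6, float(np.max(p / q)) + 1.0, n)
    w = np.abs(fprime(beta) / beta - (f(beta) + fprime(1.0)) / beta ** 2)
    return np.trapz(w * np.array([G(b) for b in beta]), beta)

kl_direct   = float(np.sum(p * np.log(p / q)))                 # D(P||Q) in nats
chi2_direct = float(np.sum((p - q) ** 2 / q))                  # chi^2(P||Q)
print(kl_direct,   rhs_81(lambda t: t * np.log(t), lambda t: np.log(t) + 1.0))
print(chi2_direct, rhs_81(lambda t: (t - 1.0) ** 2, lambda t: 2.0 * (t - 1.0)))
```

Both pairs of printed values agree up to the discretization error of the numerical integration.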

Theorem 1 yields integral representations for various f-divergences and related measures; some of these representations were previously derived by Sason and Verdú in [22] on a case-by-case basis, without the unified approach of Theorem 1. We next provide such integral representations. Note that, for some f-divergences, the function f ∈ C is not differentiable on (0,∞); hence, Theorem 1 is not necessarily directly applicable.

Theorem 2.

The following integral representations hold as a function of the relative information spectrum:

  • (1) 
    Relative entropy [22] (219):
    (1/log e) D(P‖Q) = ∫1^∞ [1 − FP‖Q(log β)]/β dβ − ∫0^1 FP‖Q(log β)/β dβ. (98)
  • (2) 
    Hellinger divergence of order α(0,1)(1,) [22] (434) and (437):
    Hα(PQ)={11α0βα2FPQ(logβ)dβ,α(0,1)0βα21FPQ(logβ)dβ1α1,α(1,). (99)
    In particular, the chi-squared divergence, squared Hellinger distance and Bhattacharyya distance satisfy
    χ2(PQ)=01FPQ(logβ)dβ1; (100)
    H2(PQ)=1120β32FPQ(logβ)dβ; (101)
    B(PQ)=log2log0β32FPQ(logβ)dβ, (102)
    where (100) appears in [22] (439).
  • (3) 
    Rényi divergence [22] (426) and (427): For α(0,1)(1,),
    Dα(PQ)={1α1log(1α)0βα2FPQ(logβ)dβ,α(0,1)1α1log(α1)0βα21FPQ(logβ)dβ,α(1,). (103)
  • (4) 
    χs divergence: For s1
    χs(PQ)=11βs1+1β(β1)s11FPQ(logβ)dβ+011βs1+1β(1β)s1FPQ(logβ)dβ. (104)
    In particular, the following identities hold for the total variation distance:
    (105)|PQ|=211FPQ(logβ)β2dβ(106)=201FPQ(logβ)β2dβ,
    where (105) appears in [22] (214).
  • (5) 
    DeGroot statistical information:
    Iw(PQ)={(1w)01wwFPQ(logβ)β2dβ,w(12,1)(1w)1ww1FPQ(logβ)β2dβ,w0,12. (107)
  • (6) 
    Triangular discrimination:
    Δ(PQ)=401FPQ(logβ)(β+1)2dβ2. (108)
  • (7) 
    Lin’s measure: For θ[0,1],
    Lθ(PQ)=h(θ)(1θ)0log1+θβ1θβ2FPQ(logβ)dβ, (109)
    where h:[0,1][0,log2] denotes the binary entropy function. Specifically, the Jensen-Shannon divergence admits the integral representation:
    JS(PQ)=log20log(β+1)2β2FPQ(logβ)dβ. (110)
  • (8) 
    Jeffrey’s divergence:
    J(PQ)=11FPQ(logβ)logeβ+logββ2dβ01FPQ(logβ)logeβ+logββ2dβ. (111)
  • (9) 
    Eγ divergence: For γ1,
    Eγ(P‖Q) = γ ∫γ^∞ [1 − FP‖Q(log β)]/β² dβ. (112)

Proof. 

See Appendix A. ☐
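As an illustration, the entries (98) and (112) of Theorem 2 can be verified numerically on a finite alphabet; the sketch below uses natural logarithms (so log e = 1 in (98)) and truncates the integrals where the integrands vanish.

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
z = np.log(p / q)
F = lambda x: float(np.sum(p[z <= x]))           # relative information spectrum (4)

beta_hi = float(np.max(p / q)) + 1.0             # F(log beta) = 1 beyond this point
beta_lo = float(np.min(p / q))                   # F(log beta) = 0 below this point

def integral(fun, a, b, n=400001):
    x = np.linspace(a, b, n)
    return np.trapz(np.array([fun(t) for t in x]), x)

# (98): D(P||Q) = int_1^oo (1 - F(log b))/b db - int_0^1 F(log b)/b db  (in nats)
kl = float(np.sum(p * np.log(p / q)))
rhs_98 = (integral(lambda b: (1.0 - F(np.log(b))) / b, 1.0, beta_hi)
          - integral(lambda b: F(np.log(b)) / b, beta_lo, 1.0))
print(kl, rhs_98)

# (112): E_gamma(P||Q) = gamma * int_gamma^oo (1 - F(log b))/b^2 db, gamma >= 1
gamma = 1.5
e_gamma = float(np.sum(np.maximum(p - gamma * q, 0.0)))
rhs_112 = gamma * integral(lambda b: (1.0 - F(np.log(b))) / b ** 2, gamma, beta_hi)
print(e_gamma, rhs_112)
```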

An application of (112) yields the following interplay between the Eγ divergence and the relative information spectrum.

Theorem 3.

Let XP, and let the random variable ıPQ(X) have no probability masses. Denote

A1:=Eγ(PQ):γ1, (113)
A2:=Eγ(QP):γ>1. (114)

Then,

  • Eγ(PQ) is a continuously differentiable function of γ on (1,), and Eγ(PQ)0;

  • the sets A1 and A2 determine, respectively, the relative information spectrum FPQ(·) on [0,) and (,0);

  • for γ>1,
    FP‖Q(log γ) = 1 − Eγ(P‖Q) + γ Eγ′(P‖Q), (115)
    FP‖Q(−log γ) = −Eγ′(Q‖P), (116)
    FP‖Q(0) = 1 − E1(P‖Q) + limγ↓1 Eγ′(P‖Q) (117)
    = −limγ↓1 Eγ′(Q‖P), (118)
    where Eγ′(·‖·) denotes differentiation with respect to γ.

Proof. 

We start by proving the first item. By our assumption, FPQ(·) is continuous on R. Hence, it follows from (112) that Eγ(PQ) is continuously differentiable in γ(1,); furthermore, (45) implies that Eγ(PQ) is monotonically decreasing in γ, which yields Eγ(PQ)0.

We next prove the second and third items together. Let XP and YQ. From (112), for γ>1,

ddγEγ(PQ)γ=1FPQ(logγ)γ2, (119)

which yields (115). Due to the continuity of FPQ(·), it follows that the set A1 determines the relative information spectrum on [0,).

To prove (116), we have

(120)Eγ(QP)=P[ıQP(Y)>logγ]γP[ıQP(X)>logγ](121)=1FQP(logγ)γP[ıQP(X)>logγ](122)=Eγ(QP)γEγ(QP)γP[ıQP(X)>logγ](123)=Eγ(QP)γEγ(QP)γP[ıPQ(X)<logγ](124)=Eγ(QP)γEγ(QP)γFPQ(logγ)

where (120) holds by switching P and Q in (46); (121) holds since YQ; (122) holds by switching P and Q in (115) (correspondingly, also XP and YQ are switched); (123) holds since ıQP=ıPQ; (124) holds by the assumption that dPdQ(X) has no probability masses, which implies that the sign < can be replaced with ≤ at the term P[ıPQ(X)<logγ] in the right side of (123). Finally, (116) readily follows from (120)–(124), which implies that the set A2 determines FPQ(·) on (,0).

Equalities (117) and (118) finally follow by letting γ ↓ 1, respectively, on both sides of (115) and (116). ☐
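The identities (115) and (116) can be illustrated with a pair of Gaussian distributions, for which ıP‖Q(X) has no probability masses and Eγ is available in closed form via (46). The sketch below (natural logarithms, SciPy for the normal CDF, numerical differentiation in γ) recovers the relative information spectrum from the Eγ divergences, as Theorem 3 asserts; all helper names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# P = N(m, 1), Q = N(0, 1); the relative information is i_{P||Q}(x) = m*x - m^2/2.
m = 1.0

def E_gamma_PQ(g):       # E_gamma(P||Q) via (46)
    a = (np.log(g) + m ** 2 / 2) / m          # i_{P||Q}(x) > log g  <=>  x > a
    return norm.sf(a - m) - g * norm.sf(a)    # P-tail minus g times the Q-tail

def E_gamma_QP(g):       # E_gamma(Q||P), obtained by swapping the roles of P and Q
    b = (m ** 2 / 2 - np.log(g)) / m          # i_{Q||P}(y) > log g  <=>  y < b
    return norm.cdf(b) - g * norm.cdf(b - m)

def F_PQ(x):             # relative information spectrum of (P, Q)
    return norm.cdf((x + m ** 2 / 2) / m - m)

gamma, h = 2.0, 1e-5
dE_PQ = (E_gamma_PQ(gamma + h) - E_gamma_PQ(gamma - h)) / (2 * h)
dE_QP = (E_gamma_QP(gamma + h) - E_gamma_QP(gamma - h)) / (2 * h)

print(F_PQ(np.log(gamma)), 1 - E_gamma_PQ(gamma) + gamma * dE_PQ)   # (115)
print(F_PQ(-np.log(gamma)), -dE_QP)                                 # (116)
```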

A similar application of (107) yields an interplay between DeGroot statistical information and the relative information spectrum.

Theorem 4.

Let XP, and let the random variable ıPQ(X) have no probability masses. Denote

B1:=Iω(PQ):ω0,12, (125)
B2:=Iω(PQ):ω12,1. (126)

Then,

  • Iω(PQ) is a continuously differentiable function of ω on (0,12)(12,1),
    limω12Iω(PQ)limω12Iω(PQ)=2, (127)
    and Iω(PQ) is, respectively, non-negative or non-positive on 0,12 and 12,1;
  • the sets B1 and B2 determine, respectively, the relative information spectrum FPQ(·) on [0,) and (,0);

  • for ω0,12
    FPQlog1ωω=1Iω(PQ)(1ω)Iω(PQ), (128)
    for ω12,1
    FPQlog1ωω=Iω(PQ)(1ω)Iω(PQ), (129)
    and
    FPQ(0)=I12(PQ)12limω12Iω(PQ). (130)

Remark 8.

By relaxing the condition in Theorems 3 and 4 where dPdQ(X) has no probability masses with XP, it follows from the proof of Theorem 3 that each one of the sets

A:=A1A2=Eγ(PQ),Eγ(QP):γ1, (131)
B:=B1B2=Iω(PQ):ω(0,1) (132)

determines FPQ(·) at every point on R where this relative information spectrum is continuous. Note that, as a cumulative distribution function, FPQ(·) is discontinuous at a countable number of points. Consequently, under the condition that fC is differentiable on (0,), the integral representations of Df(PQ) in Theorem 1 are not affected by the countable number of discontinuities for FPQ(·).

In view of Theorems 1, 3 and 4 and Remark 8, we get the following result.

Corollary 1.

Let fC be a differentiable function on (0,), and let PQ be probability measures. Then, each one of the sets A and B in (131) and (132), respectively, determines Df(PQ).

Remark 9.

Corollary 1 is supported by the integral representation of Df(PQ) in [3] Theorem 11, expressed as a function of the set of values in B, and its analogous representation in [22] Proposition 3 as a function of the set of values in A. More explicitly, [3] Theorem 11 states that if fC, then

Df(PQ)=01Iω(PQ)dΓf(ω) (133)

where Γf is a certain σ-finite measure defined on the Borel subsets of (0,1); it is also shown in [3] (80) that if fC is twice differentiable on (0,), then

Df(PQ)=01Iω(PQ)1ω3fω1ωdω. (134)

4. New f-Divergence Inequalities

Various approaches for the derivation of f-divergence inequalities were studied in the literature (see Section 1 for references). This section suggests a new approach, leading to a lower bound on an arbitrary f-divergence by means of the Eγ divergence of an arbitrary order γ1 (see (45)) or the DeGroot statistical information (see (50)). This approach leads to generalizations of the Bretagnole-Huber inequality [58], whose generalizations are later motivated in this section. The utility of the f-divergence inequalities in this section is exemplified in the setup of Bayesian binary hypothesis testing.

In the following, we provide the first main result in this section for the derivation of new f-divergence inequalities by means of the Eγ divergence. Generalizing the total variation distance, the Eγ divergence in (45)–(47) is an f-divergence whose utility in information theory has been exemplified in [17] Chapter 3, [54],[57] p. 2314 and [69]; the properties of this measure were studied in [22] Section 7 and [54] Section 2.B.

Theorem 5.

Let fC, and let fC be the conjugate convex function as defined in (12). Let P and Q be probability measures. Then, for all γ[1,),

Df(PQ)f1+1γEγ(PQ)+f1γ1Eγ(PQ)f1γ. (135)

Proof. 

Let p=dPdμ and q=dQdμ be the densities of P and Q with respect to a dominating measure μ(P,Qμ). Then, for an arbitrary aR,

(136)Df(PQ)=Df(QP)(137)=pfqpdμ(138)=pfmaxa,qp+fmina,qpf(a)dμ(137)fpmaxa,qpdμ+fpmina,qpdμf(a)

where (139) follows from the convexity of f and by invoking Jensen’s inequality.

Setting a:=1γ with γ[1,) gives

(140)pmaxa,qpdμ=maxpγ,qdμ(141)=qdμ+maxpγq,0dμ(142)=1+1γqmaxpqγ,0dμ(143)=1+1γEγ(PQ),

and

(143)pmina,qpdμ=pa+qpmaxa,qpdμ(145)=a+1pmaxa,qpdμ(146)=1γ1Eγ(PQ)

where (146) follows from (143) by setting a:=1γ. Substituting (143) and (146) into the right side of (139) gives (135). ☐

An application of Theorem 5 gives the following lower bounds on the Hellinger and Rényi divergences with arbitrary positive orders, expressed as a function of the Eγ divergence with an arbitrary order γ1.

Corollary 2.

For all α>0 and γ1,

Hα(PQ){1α11+1γEγ(PQ)1α+1Eγ(PQ)γ1α1γα1,α1loge1+1γEγ(PQ)1Eγ(PQ),α=1, (147)

and

Dα(PQ){1α1log1+1γEγ(PQ)1α+γα11Eγ(PQ)1α1,α1log1+1γEγ(PQ)1Eγ(PQ),α=1. (148)

Proof. 

Inequality (147), for α(0,1)(1,), follows from Theorem 5 and (22); for α=1, it holds in view of Theorem 5, and equalities (17) and (24). Inequality (148), for α(0,1)(1,), follows from (30) and (147); for α=1, it holds in view of (24), (147) and since D1(PQ)=D(PQ). ☐

Specialization of Corollary 2 for α=2 in (147) and α=1 in (148) gives the following result.

Corollary 3.

For γ ∈ [1,∞), the following upper bounds on the Eγ divergence hold as a function of the χ² divergence and the relative entropy:

Eγ(P‖Q) ≤ ½ [ 1 − γ + √( (γ−1)² + 4γ χ²(P‖Q) / (1 + γ + χ²(P‖Q)) ) ], (149)
Eγ(P‖Q) ≤ ½ [ 1 − γ + √( (γ−1)² + 4γ (1 − exp(−D(P‖Q))) ) ]. (150)
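A quick numerical illustration of Corollary 3 on a randomly drawn pair of finite-alphabet distributions (natural logarithms): the exact Eγ divergence never exceeds the bounds (149) and (150).

```python
import numpy as np

rng = np.random.default_rng(4)
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))

kl   = float(np.sum(p * np.log(p / q)))          # D(P||Q) in nats
chi2 = float(np.sum((p - q) ** 2 / q))           # chi^2(P||Q)

for gamma in (1.0, 1.5, 3.0):
    e_gamma = float(np.sum(np.maximum(p - gamma * q, 0.0)))      # E_gamma(P||Q)
    bound_149 = 0.5 * (1 - gamma + np.sqrt((gamma - 1) ** 2
                       + 4 * gamma * chi2 / (1 + gamma + chi2)))
    bound_150 = 0.5 * (1 - gamma + np.sqrt((gamma - 1) ** 2
                       + 4 * gamma * (1 - np.exp(-kl))))
    print(gamma, e_gamma <= bound_149 + 1e-12, e_gamma <= bound_150 + 1e-12)
```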

Remark 10.

From [4] (58),

χ²(P‖Q) ≥ |P−Q|² for |P−Q| ∈ [0,1), and χ²(P‖Q) ≥ |P−Q| / (2 − |P−Q|) for |P−Q| ∈ [1,2), (151)

is a tight lower bound on the chi-squared divergence as a function of the total variation distance. In view of (49), we compare (151) with the specialized version of (149) when γ = 1. The latter bound is expected to be looser than the tight bound in (151), as a result of the use of Jensen's inequality in the proof of Theorem 5; however, it is interesting to examine how much we lose in the tightness of this specialized bound with γ = 1. From (49), the substitution of γ = 1 in (149) gives

χ²(P‖Q) ≥ 2|P−Q|² / (4 − |P−Q|²), |P−Q| ∈ [0,2), (152)

and, it can be easily verified that

  • if |P−Q| ∈ [0,1), then the lower bound in the right side of (152) is at most twice smaller than the tight lower bound in the right side of (151);

  • if |P−Q| ∈ [1,2), then the lower bound in the right side of (152) is at most 3/2 times smaller than the tight lower bound in the right side of (151).

Remark 11.

Setting γ=1 in (150), and using (49), specializes to the Bretagnole-Huber inequality [58]:

|P−Q| ≤ 2 √( 1 − exp(−D(P‖Q)) ). (153)

Inequality (153) forms a counterpart to Pinsker’s inequality:

½ |P−Q|² log e ≤ D(P‖Q), (154)

proved by Csiszár [12] and Kullback [70], and independently by Kemperman [71] a bit later. As upper bounds on the total variation distance, (154) outperforms (153) if D(P‖Q) ≤ 1.594 nats, and (153) outperforms (154) for larger values of D(P‖Q).
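The crossover value of 1.594 nats quoted above can be recovered numerically: with D in nats, the two upper bounds on |P−Q| from (154) and (153) are √(2D) and 2√(1 − exp(−D)), and they coincide where D − 2 + 2 exp(−D) = 0.

```python
import numpy as np
from scipy.optimize import brentq

# Pinsker's bound sqrt(2 D) and the Bretagnole-Huber bound 2 sqrt(1 - exp(-D))
# on |P - Q| (with D in nats) cross where 2 D = 4 (1 - exp(-D)).
d_cross = brentq(lambda d: d - 2.0 + 2.0 * np.exp(-d), 1e-6, 10.0)
print(d_cross)   # approximately 1.594 nats, matching the value in the text
```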

Remark 12.

In [59] (8), Vajda introduced a lower bound on the relative entropy as a function of the total variation distance:

D(P‖Q) ≥ log( (2 + |P−Q|) / (2 − |P−Q|) ) − 2|P−Q| log e / (2 + |P−Q|), |P−Q| ∈ [0,2). (155)

The lower bound in the right side of (155) is asymptotically tight in the sense that it tends to ∞ if |PQ|2, and the difference between D(PQ) and this lower bound is everywhere upper bounded by 2|PQ|3(2+|PQ|)24 (see [59] (9)). The Bretagnole-Huber inequality in (153), on the other hand, is equivalent to

D(P‖Q) ≥ log( 1 / (1 − ¼ |P−Q|²) ), |P−Q| ∈ [0,2). (156)

Although it can be verified numerically that the lower bound on the relative entropy in (155) is everywhere slightly tighter than the lower bound in (156) (for |PQ|[0,2)), both lower bounds on D(PQ) are of the same asymptotic tightness in a sense that they both tend to ∞ as |PQ|2 and their ratio tends to 1. Apart of their asymptotic tightness, the Bretagnole-Huber inequality in (156) is appealing since it provides a closed-form simple upper bound on |PQ| as a function of D(PQ) (see (153)), whereas such a closed-form simple upper bound cannot be obtained from (155). In fact, by the substitution v:=2|PQ|2+|PQ| and the exponentiation of both sides of (155), we get the inequality vev1eexpD(PQ) whose solution is expressed by the Lambert W function [72]; it can be verified that (155) is equivalent to the following upper bound on the total variation distance as a function of the relative entropy:

|PQ|21+W(z)1W(z), (157)
z:=1eexpD(PQ), (158)

where W in the right side of (157) denotes the principal real branch of the Lambert W function. The difference between the upper bounds in (153) and (157) can be verified to be marginal if D(PQ) is large (e.g., if D(PQ)=4 nats, then the upper bounds on |PQ| are respectively equal to 1.982 and 1.973), though the former upper bound in (153) is clearly more simple and amenable to analysis.

The Bretagnole-Huber inequality in (153) is proved to be useful in the context of lower bounding the minimax risk (see, e.g., [5] pp. 89–90, 94), and the problem of density estimation (see, e.g., [6] Section 1.6). The utility of this inequality motivates its generalization in this section (see Corollaries 2 and 3, and also see later Theorem 7 followed by Example 2).

In [22] Section 7.C, Sason and Verdú generalized Pinsker’s inequality by providing an upper bound on the Eγ divergence, for γ>1, as a function of the relative entropy. In view of (49) and the optimality of the constant in Pinsker’s inequality (154), it follows that the minimum achievable D(PQ) is quadratic in E1(PQ) for small values of E1(PQ). It has been proved in [22] Section 7.C that this situation ceases to be the case for γ>1, in which case it is possible to upper bound Eγ(PQ) as a constant times D(PQ) where this constant tends to infinity as we let γ1. We next cite the result in [22] Theorem 30, extending (154) by means of the Eγ divergence for γ>1, and compare it numerically to the bound in (150).

Theorem 6.

([22] Theorem 30) For every γ>1,

supEγ(PQ)D(PQ)=cγ (159)

where the supremum is over PQ,PQ, and cγ is a universal function (independent of (P,Q)), given by

cγ=tγγtγlogtγ+(1tγ)loge, (160)
tγ=γW11γe1γ (161)

where W1 in (161) denotes the secondary real branch of the Lambert W function [72].

As an immediate consequence of (159), it follows that

Eγ(PQ)cγD(PQ), (162)

which forms a straight-line bound on the Eγ divergence as a function of the relative entropy for γ>1. Similarly to the comparison of the Bretagnole-Huber inequality (153) and Pinsker’s inequality (154), we exemplify numerically that the extension of Pinsker’s inequality to the Eγ divergence in (162) forms a counterpart to the generalized version of the Bretagnole-Huber inequality in (150).

Figure 1 plots an upper bound on the Eγ divergence, for γ{1.1,2.0,3.0,4.0}, as a function of the relative entropy (or, alternatively, a lower bound on the relative entropy as a function of the Eγ divergence). The upper bound on Eγ(PQ) for γ>1, as a function of D(PQ), is composed of the following two components:

  • the straight-line bound, which refers to the right side of (162), is tighter than the bound in the right side of (150) if the relative entropy is below a certain value that is denoted by d(γ) in nats (it depends on γ);

  • the curvy line, which refers to the bound in the right side of (150), is tighter than the straight-line bound in the right side of (162) for larger values of the relative entropy.

Figure 1. Upper bounds on the Eγ divergence, for γ > 1, as a function of the relative entropy (the curvy and straight lines follow from (150) and (162), respectively).

It is supported by Figure 1 that d : (1,∞) → (0,∞) is positive and monotonically increasing with limγ↓1 d(γ) = 0; e.g., it can be verified that d(1.1) ≈ 0.02, d(2) ≈ 0.86, d(3) ≈ 1.61, and d(4) ≈ 2.10 (see Figure 1).

Bayesian Binary Hypothesis Testing

The DeGroot statistical information [16] has the following meaning: consider two hypotheses H0 and H1, and let P[H0]=ω and P[H1]=1ω with ω(0,1). Let P and Q be probability measures, and consider an observation Y where Y|H0P, and Y|H1Q. Suppose that one wishes to decide which hypothesis is more likely given the observation Y. The operational meaning of the DeGroot statistical information, denoted by Iω(PQ), is that this measure is equal to the minimal difference between the a-priori error probability (without side information) and a posteriori error probability (given the observation Y). This measure was later identified as an f-divergence by Liese and Vajda [3] (see (50) here).

Theorem 7.

The DeGroot statistical information satisfies the following upper bound as a function of the chi-squared divergence:

Iω(P‖Q) ≤ ω − ½ + √( ¼ − ω(1−ω) / (1 + ω χ²(P‖Q)) ) for ω ∈ (0, ½], and
Iω(P‖Q) ≤ ½ − ω + √( ¼ − ω(1−ω) / (1 + (1−ω) χ²(Q‖P)) ) for ω ∈ (½, 1), (163)

and the following bounds as a function of the relative entropy:

  • (1) 
    Iω(P‖Q) ≤ ω c(1−ω)/ω D(P‖Q) for ω ∈ (0, ½); Iω(P‖Q) ≤ √( min{D(P‖Q), D(Q‖P)} / (8 log e) ) for ω = ½; and Iω(P‖Q) ≤ (1−ω) cω/(1−ω) D(Q‖P) for ω ∈ (½, 1), (164)
    where cγ for γ>1 is introduced in (160);
  • (2) 
    Iω(P‖Q) ≤ ω − ½ + √( ¼ − ω(1−ω) exp(−D(P‖Q)) ) for ω ∈ (0, ½], and
    Iω(P‖Q) ≤ ½ − ω + √( ¼ − ω(1−ω) exp(−D(Q‖P)) ) for ω ∈ (½, 1). (165)

Proof. 

The first bound in (163) holds by combining (53) and (149); the second bound in (164) follows from (162) and (53) for ω0,1212,1, and it follows from (52) and (154) when ω=12; finally, the third bound in (165) follows from (150) and (53). ☐

Remark 13.

The bound in (164) forms an extension of Pinsker’s inequality (154) when ω12 (i.e., in the asymmetric case where the hypotheses H0 and H1 are not equally probable). Furthermore, in view of (52), the bound in (165) is specialized to the Bretagnole-Huber inequality in (153) by letting ω=12.

Remark 14.

Numerical evidence shows that none of the bounds in (163)–(165) supersedes the others.

Remark 15.

The upper bounds on Iω(PμPλ) in (163) and (165) are asymptotically tight when we let D(PQ) and D(QP) tend to infinity. To verify this, first note that (see [23] Theorem 5)

D(PQ)log1+χ2(PQ), (166)

which implies that also χ2(PQ) and χ2(QP) tend to infinity. In this case, it can be readily verified that the bounds in (163) and (165) are specialized to Iω(PQ)min{ω,1ω}; this upper bound, which is equal to the a-priori error probability, is also equal to the DeGroot statistical information since the a-posterior error probability tends to zero in the considered extreme case where P and Q are sufficiently far from each other, so that H0 and H1 are easily distinguishable in high probability when the observation Y is available.

Remark 16.

Due to the one-to-one correspondence between the Eγ divergence and DeGroot statistical information in (53), which shows that the two measures are related by a multiplicative scaling factor, the numerical results shown in Figure 1 also apply to the bounds in (164) and (165); i.e., for ω12, the first bound in (164) is tighter than the second bound in (165) for small values of the relative entropy, whereas (165) becomes tighter than (164) for larger values of the relative entropy.

Corollary 4.

Let fC, and let fC be as defined in (12). Then,

  • (1) 
    for w(0,12],
    Df(PQ)f1+Iw(PQ)1w+fwIw(PQ)1wfw1w; (167)
  • (2) 
    for w12,1,
    Df(PQ)f1+Iw(QP)w+f1wIw(QP)wf1ww. (168)

Proof. 

Inequalities (167) and (168) follow by combining (135) and (53). ☐

We end this section by exemplifying the utility of the bounds in Theorem 7.

Example 2.

Let P[H0]=ω and P[H1]=1ω with ω(0,1), and assume that the observation Y given that the hypothesis is H0 or H1 is Poisson distributed with the positive parameter μ or λ, respectively:

Y|H0Pμ, (169)
Y|H1Pλ (170)

where

Pλ[k] = e^(−λ) λ^k / k!, k ∈ {0, 1, …}. (171)

Without any loss of generality, let ω ∈ (0, ½]. The bounds on the DeGroot statistical information Iω(Pμ‖Pλ) in Theorem 7 can be expressed in a closed form by relying on the following identities:

D(Pμ‖Pλ) = μ log(μ/λ) + (λ − μ) log e, (172)
χ²(Pμ‖Pλ) = e^((μ−λ)²/λ) − 1. (173)

In this example, we compare the simple closed-form bounds on Iω(PμPλ) in (163)–(165) with its exact value

Iω(Pμ‖Pλ) = min{ω, 1−ω} − Σk=0^∞ min{ ω Pμ[k], (1−ω) Pλ[k] }. (174)

To simplify the right side of (174), let μ > λ, and define

k0 = k0(λ, μ, ω) := ⌊ ( μ − λ + ln((1−ω)/ω) ) / ln(μ/λ) ⌋, (175)

where, for x ∈ R, ⌊x⌋ denotes the largest integer that is smaller than or equal to x. It can be verified that

ω Pμ[k] ≤ (1−ω) Pλ[k] for k ≤ k0, and ω Pμ[k] > (1−ω) Pλ[k] for k > k0. (176)

Hence, from (174)–(176),

Iω(Pμ‖Pλ) = min{ω, 1−ω} − ω Σk=0^k0 Pμ[k] − (1−ω) Σk=k0+1^∞ Pλ[k] (177)
= min{ω, 1−ω} − ω Σk=0^k0 Pμ[k] − (1−ω) ( 1 − Σk=0^k0 Pλ[k] ). (178)

To exemplify the utility of the bounds in Theorem 7, suppose that μ and λ are close, and we wish to obtain a guarantee on how small Iω(Pμ‖Pλ) is. For example, let λ = 99, μ = 101, and ω = 1/10. The upper bounds on Iω(Pμ‖Pλ) in (163)–(165) are, respectively, equal to 4.6·10⁻⁴, 5.8·10⁻⁴ and 2.2·10⁻³; we therefore get an informative guarantee by easily calculable bounds. The exact value of Iω(Pμ‖Pλ) is, on the other hand, hard to compute since k0 = 209 (see (175)), and the calculation of the right side of (178) appears to be sensitive to the selected parameters in this setting.
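For reference, the two closed-form bounds (163) and (165) quoted in this example can be reproduced with a few lines of code (natural logarithms, so D(Pμ‖Pλ) is computed in nats); the bound (164) additionally requires the constant cγ of (160)–(161) and is omitted from this sketch.

```python
import numpy as np

# Example 2 with lambda = 99, mu = 101, omega = 1/10.
lam, mu, w = 99.0, 101.0, 0.1

D_P_Q    = mu * np.log(mu / lam) + (lam - mu)        # D(P_mu || P_lambda), cf. (172), in nats
chi2_P_Q = np.exp((mu - lam) ** 2 / lam) - 1.0       # chi^2(P_mu || P_lambda), cf. (173)

# Bound (163) for omega in (0, 1/2], as a function of the chi-squared divergence:
bound_163 = w - 0.5 + np.sqrt(0.25 - w * (1 - w) / (1 + w * chi2_P_Q))
# Bound (165) for omega in (0, 1/2], as a function of the relative entropy:
bound_165 = w - 0.5 + np.sqrt(0.25 - w * (1 - w) * np.exp(-D_P_Q))

print(bound_163)   # about 4.6e-4, as reported in the text
print(bound_165)   # about 2.2e-3, as reported in the text
```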

5. Local Behavior of f-Divergences

This section studies the local behavior of f-divergences; the starting point relies on [56] Section 3 which studies the asymptotic properties of f-divergences. The reader is also referred to a related study in [22] Section 4.F.

Lemma 4.

Let

  • {Pn} be a sequence of probability measures on a measurable space (A,F);

the sequence {Pn} converge to a probability measure Q in the sense that
    limn→∞ ess sup (dPn/dQ)(Y) = 1, Y ∼ Q, (179)
    where Pn ≪ Q for all sufficiently large n;
  • f, g ∈ C have continuous second derivatives at 1 and g″(1) > 0.

Then

limn→∞ Df(Pn‖Q) / Dg(Pn‖Q) = f″(1) / g″(1). (180)

Proof. 

The result in (180) follows from [56] Theorem 3, even without the additional restriction in [56] Section 3 which would require that the second derivatives of f and g are locally Lipschitz at a neighborhood of 1. More explicitly, in view of the analysis in [56] p. 1863, we get by relaxing the latter restriction that (cf. [56] (31))

Df(PnQ)12f(1)χ2(PnQ)12supy[1εn,1+εn]f(y)f(1)χ2(PnQ), (181)

where εn0 as we let n, and also

limnχ2(PnQ)=0. (182)

By our assumption, due to the continuity of f″ and g″ at 1, it follows from (181) and (182) that

limn→∞ Df(Pn‖Q) / χ²(Pn‖Q) = ½ f″(1), (183)
limn→∞ Dg(Pn‖Q) / χ²(Pn‖Q) = ½ g″(1), (184)

which yields (180) (recall that, by assumption, g(1)>0). ☐

Remark 17.

Since f and g in Lemma 4 are assumed to have continuous second derivatives at 1, the left and right derivatives of the weight function wf in (79) at 1 satisfy, in view of Remark 3,

wf(1+)=wf(1)=f(1). (185)

Hence, the limit in the right side of (180) is equal to wf(1+)wg(1+) or also to wf(1)wg(1).

Lemma 5.

χ²( λP + (1−λ)Q ‖ Q ) = λ² χ²(P‖Q), λ ∈ [0,1]. (186)

Proof. 

Let p=dPdμ and q=dQdμ be the densities of P and Q with respect to an arbitrary probability measure μ such that P,Qμ. Then,

(187)χ2(λP+(1λ)QQ)=(λp+(1λ)q)q2qdμ(188)=λ2(pq)2qdμ(189)=λ2χ2(PQ).

Remark 18.

The result in Lemma 5, for the chi-squared divergence, is generalized to the identity

χs( λP + (1−λ)Q ‖ Q ) = λ^s χs(P‖Q), λ ∈ [0,1], (190)

for all s1 (see (33)). The special case of s=2 is required in the continuation of this section.

Remark 19.

The result in Lemma 5 can be generalized as follows: let P,Q,R be probability measures, and λ[0,1]. Let P,Q,Rμ for an arbitrary probability measure μ, and p:=dPdμ, q:=dQdμ, and r:=dRdμ be the corresponding densities with respect to μ. Calculation shows that

χ2(λP+(1λ)QR)χ2(QR)=cλ+χ2(PR)χ2(QR)cλ2 (191)

with

c:=(pq)qrdμ. (192)

If Q=R, then c=0 in (192), and (191) is specialized to (186). However, if QR, then c may be non-zero. This shows that, for small λ[0,1], the left side of (191) scales linearly in λ if c0, and it has a quadratic scaling in λ if c=0 and χ2(PR)χ2(QR) (e.g., if Q=R, as in Lemma 5). The identity in (191) yields

ddλχ2(λP+(1λ)QR)|λ=0=limλ0χ2(λP+(1λ)QR)χ2(QR)λ=c. (193)

We next state the main result in this section.

Theorem 8.

Let

  • P and Q be probability measures defined on a measurable space (A,F), Y ∼ Q, and suppose that
    ess sup (dP/dQ)(Y) < ∞; (194)
  • f ∈ C, and f″ be continuous at 1.

Then,

limλ→0 (1/λ²) Df( λP + (1−λ)Q ‖ Q ) = limλ→0 (1/λ²) Df( Q ‖ λP + (1−λ)Q ) (195)
= ½ f″(1) χ²(P‖Q). (196)

Proof. 

Let {λn}nN be a sequence in [0,1], which tends to zero. Define the sequence of probability measures

Rn:=λnP+(1λn)Q,nN. (197)

Note that PQ implies that RnQ for all nN. Since

dRndQ=λndPdQ+(1λn), (198)

it follows from (194) that

limnesssupdRndQ(Y)=1. (199)

Consequently, (183) implies that

limnDf(RnQ)χ2(RnQ)=12f(1) (200)

where {λn} in (197) is an arbitrary sequence which tends to zero. Hence, it follows from (197) and (200) that

limλ0Df(λP+(1λ)QQ)χ2(λP+(1λ)QQ)=12f(1), (201)

and, by combining (186) and (201), we get

limλ01λ2Df(λP+(1λ)QQ)=12f(1)χ2(PQ). (202)

We next prove the result for the limit in the right side of (195). Let f* : (0,∞) → R be the conjugate function of f, which is given in (12). By the assumption that f has a continuous second derivative at 1, so does f*, and it is easy to verify that the second derivatives of f and f* coincide at 1. Hence, from (13) and (202),

(203)limλ01λ2Df(QλP+(1λ)Q)=limλ01λ2Df(λP+(1λ)QQ)(204)=12f(1)χ2(PQ).
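The local behavior established in (195)–(196) can be observed numerically; the sketch below takes f(t) = t log t (natural logarithms, so f″(1) = 1) on a finite alphabet, where condition (194) holds since Q has full support.

```python
import numpy as np

rng = np.random.default_rng(6)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
chi2 = float(np.sum((p - q) ** 2 / q))

def kl(a, b):
    # relative entropy in nats on a finite alphabet with full-support b
    return float(np.sum(a * np.log(a / b)))

for lam in (1e-1, 1e-2, 1e-3):
    mix = lam * p + (1 - lam) * q
    print(lam, kl(mix, q) / lam ** 2, kl(q, mix) / lam ** 2)
# Both ratios approach (1/2) f''(1) chi^2(P||Q) = 0.5 * chi2 as lambda -> 0:
print(0.5 * chi2)
```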

Remark 20.

Although an f-divergence is in general not symmetric, in the sense that the equality Df(P‖Q) = Df(Q‖P) does not necessarily hold for all pairs of probability measures (P, Q), the reason for the equality in (195) stems from the fact that the second derivatives of f and f* coincide at 1 when f is twice differentiable.

Remark 21.

Under the conditions in Theorem 8, it follows from (196) that

ddλDf(λP+(1λ)QQ)|λ=0=limλ01λDf(λP+(1λ)QQ)=0, (205)
limλ0d2dλ2Df(λP+(1λ)QQ)=2limλ01λ2Df(λP+(1λ)QQ)=f(1)χ2(PQ) (206)

where (206) relies on L’Hôpital’s rule. The convexity of Df(PQ) in (P,Q) also implies that, for all λ[0,1],

Df(λP+(1λ)QQ)λDf(PQ). (207)

The following result refers to the local behavior of Rényi divergences of an arbitrary non-negative order.

Corollary 5.

Under the condition in (194), for every α[0,],

(208)limλ01λ2Dα(λP+(1λ)QQ)=limλ01λ2Dα(QλP+(1λ)Q)(209)=12αχ2(PQ)loge.

Proof. 

Let α(0,1)(1,). In view of (23) and Theorem 8, it follows that the local behavior of the Hellinger divergence of order α satisfies

(210)limλ01λ2Hα(λP+(1λ)QQ)=limλ01λ2Hα(QλP+(1λ)Q)(211)=12αχ2(PQ).

The result now follows from (30), which implies that

(212)limλ0Dα(λP+(1λ)QQ)Hα(λP+(1λ)QQ)=limλ0Dα(QλP+(1λ)Q)Hα(QλP+(1λ)Q)(213)=1α1limu0log1+(α1)uu(214)=loge.

The result in (208) and (209), for α(0,1)(1,), follows by combining the equalities in (210)–(214).

Finally, the result in (208) and (209) for α{0,1,} follows from its validity for all α(0,1)(1,), and also due to the property where Dα(··) is monotonically increasing in α (see [73] Theorem 3). ☐

Acknowledgments

The author is grateful to Sergio Verdú and the two anonymous reviewers, whose suggestions improved the presentation in this paper.

Appendix A. Proof of Theorem 2

We prove in the following the integral representations of f-divergences and related measures in Theorem 2.

  • (1)
    Relative entropy: The function fC in (18) yields the following weight function in (79):
    wf(β)=1β1β21{β1}1{0<β<1}loge,β>0. (A1)
    Consequently, setting c:=loge in (82) yields
    w˜f,c(β)=1β1{β1}1{0<β<1}loge, (A2)
    for β>0. Equality (98) follows from the substitution of (A2) into the right side of (83).
  • (2)
    Hellinger divergence: In view of (22), for α(0,1)(1,), the weight function wfα:(0,)[0,) in (79) which corresponds to fα:(0,)R in (23) can be verified to be equal to
    wfα(β)=βα21β21{β1}1{0<β<1} (A3)
    for β>0. In order to simplify the integral representation of the Hellinger divergence Hα(PQ), we apply Theorem 1-(1). From (A3), setting c:=1 in (82) implies that w˜fα,1:(0,)R is given by
    w˜fα,1(β)=βα21{β1}1{0<β<1} (A4)
    for β>0. Hence, substituting (80) and (A4) into (83) yields
    Hα(PQ)=1βα21FPQ(logβ)dβ01βα2FPQ(logβ)dβ. (A5)
    For α>1, (A5) yields
    (A6)Hα(PQ)=0βα21FPQ(logβ)dβ01βα2dβ(A7)=0βα21FPQ(logβ)dβ1α1,
    and, for α(0,1), (A5) yields
    (A8)Hα(PQ)=1βα2dβ0βα2FPQ(logβ)dβ(A9)=11α0βα2FPQ(logβ)dβ.
    This proves (99). We next consider the following special cases:
    • -
      In view of (25), equality (100) readily follows from (99) with α=2.
    • -
      In view of (28), equality (101) readily follows from (99) with α=12.
    • -
      In view of (29), equality (102) readily follows from (101).
  • (3)

    Rényi divergence: In view of the one-to-one correspondence in (30) between the Rényi divergence and the Hellinger divergence of the same order, (103) readily follows from (99).

  • (4)
    χs divergence with s1: We first consider the case where s>1. From (33), the function fs:(0,)R in (32) is differentiable and fs(1)=0. Hence, the respective weight function wfs:(0,)(0,) can be verified from (79) to be given by
    wfs(β)=1βs1+1β|β1|s1,β>0. (A10)

    The result in (104), for s>1, follows readily from (33), (80), (81) and (A10).

    We next prove (104) with s=1. In view of (32), (34), (35) and the dominated convergence theorem,
    (A11)|PQ|=lims1χs(PQ)(A12)=11FPQ(logβ)β2dβ+01FPQ(logβ)β2dβ.

    This extends (104) for all s1, although f1(t)=|t1| for t>0 is not differentiable at 1. For s=1, in view of (95), the integral representation in the right side of (A12) can be simplified to (105) and (106).

  • (5)
    DeGroot statistical information: In view of (50)–(51), since the function ϕw:(0,)R is not differentiable at the point 1ωω(0,) for ω(0,1), Theorem 1 cannot be applied directly to get an integral representation of the DeGroot statistical information. To that end, for (ω,α)(0,1)2, consider the family of convex functions fω,α:(0,)R given by (see [3] (55))
    fω,α(t)=11α(ωt)1α+(1ω)1ααω1α+(1ω)1αα, (A13)
    for t>0. These differentiable functions also satisfy
    limα0fω,α(t)=ϕw(t), (A14)
    which holds due to the identities
    limα0a1α+b1αα=max{a,b},a,b0; (A15)
    min{a,b}=a+bmax{a,b},a,bR. (A16)
    The application of Theorem 1-(1) to the set of functions fω,αC with
    c:=(1ω)1αα1ω1α+(1ω)1αα1 (A17)
    yields
    w˜fω,α,c(β)=1ω1α1β21+ωβ1ω1αα11{0<β<1}1{β1}, (A18)
    for β>0, and
    Dfω,α(PQ)=0w˜fω,α,c(β)GPQ(β)dβ (A19)
    with GPQ(·) as defined in (80), and (ω,α)(0,1)2. From (A15) and (A18), it follows that
    limα0w˜fω,α,c(β)=1ωβ21{0<β<1}1{β1}121β=1ωω+10<β<1ωω, (A20)
    for β>0. In view of (50), (51), (80), (A14), (A19) and (A20), and the monotone convergence theorem,
    Iω(PQ)(A21)=Dϕω(PQ)(A22)=limα0Dfω,α(PQ)(A23)=(1ω)0min{1,1ωω}FPQ(logβ)β2dβ(1ω)1max{1,1ωω}1FPQ(logβ)β2dβ,
    for ω(0,1). We next simplify (A23) as follows:
    • -
      if ω1,1ωω, then 1ωω<1 and (A23) yields
      Iω(PQ)=(1ω)01ωωFPQ(logβ)β2dβ; (A24)
    • -
      if ω0,12, then 1ωω1 and (A23) yields
      (A25)Iω(PQ)=(1ω)01FPQ(logβ)β2dβ(1ω)11ωω1FPQ(logβ)β2dβ(A26)=(1ω)11FPQ(logβ)β2dβ(1ω)11ωω1FPQ(logβ)β2dβ(A27)=(1ω)1ωω1FPQ(logβ)β2dβ,
      where (A26) follows from (95) (or its equivalent from in (73)).

    This completes the proof of (107). Note that, due to (95), the integral representation of Iω(PQ) in (107) is indeed continuous at ω=12.

  • (6)
    Triangular discrimination: In view of (36)–(37), the corresponding function w˜f,1:(0,)R in (82) (i.e., with c:=1) can be verified to be given by
    w˜f,1(β)=4(β+1)21{β1}1{0<β<1} (A28)
    for β>0. Substituting (80) and (A28) into (83) proves (108) as follows:
    (A29)Δ(PQ)=411FPQ(logβ)(β+1)2dβ01FPQ(logβ)(β+1)2dβ(A30)=401FPQ(logβ)(β+1)2dβ011(β+1)2dβ(A31)=401FPQ(logβ)(β+1)2dβ2.
  • (7)
    Lin’s measure and the Jensen-Shannon divergence: Let θ(0,1) (if θ{0,1}, then (39) and (40) imply that Lθ(PQ)=0). In view of (41), the application of Theorem 1-(1) with the function fθ:(0,)R in (42) yields the weight function wfθ:(0,)[0,) defined as
    wfθ(β)=(1θ)log(θβ+1θ)β21{β1}1{0<β<1}. (A32)
    Consequently, we get
    (A33)Lθ(PQ)=(1θ)1log(θβ+1θ)β21FPQ(logβ)dβ01log(θβ+1θ)β2FPQ(logβ)dβ(A34)=(1θ)1log(θβ+1θ)β2dβ0log(θβ+1θ)β2FPQ(logβ)dβ(A35)=θlog1θ(1θ)0log(θβ+1θ)β2FPQ(logβ)dβ(A36)=h(θ)(1θ)01β2logθβ1θ+1FPQ(logβ)dβ
    where (A33) follows from (80), (81) and (A32); for θ(0,1), equality (A35) holds since
    1log(θβ+1θ)β2dβ=θ1θlog1θ; (A37)
    finally, (A36) follows from (73) where h:[0,1][0,log2] denotes the binary entropy function. This proves (109). In view of (43), the identity in (110) for the Jensen-Shannon divergence follows from (109) with θ=12.
  • (8)
    Jeffrey’s divergence: In view of (21)–(20), the corresponding weight function wf:(0,)[0,) in (79) can be verified to be given by
    wf(β)=logeβ+1β2logβe1{β1}1{0<β<1}. (A38)
    Hence, setting c:=loge in (82) implies that
    w˜f,c(β)=logeβ+logββ21{β1}1{0<β<1} (A39)
    for β>0. Substituting (80) and (A39) into (83) yields (111).
  • (9)
    Eγ divergence: Let γ1, and let ω0,12 satisfy 1ωω=γ; hence, ω=11+γ. From (53), we get
    Eγ(PQ)=(1+γ)I11+γ(PQ). (A40)
    The second line in the right side of (107) yields
    I11+γ(PQ)=γ1+γγ1FPQ(logβ)β2dβ. (A41)

    Finally, substituting (A41) into the right side of (A40) yields (112).

Remark A1.

In view of (95), the integral representation for the χs divergence in (104) specializes to (100), (105) and (106) by letting s=2 and s=1, respectively.

Remark A2.

In view of (49), the first identity for the total variation distance in (105) follows readily from (112) with γ=1. The second identity in (106) follows from (73) and (105), and since 1dββ2=1.

Conflicts of Interest

The author declares no conflict of interest.

