Author manuscript; available in PMC: 2017 Nov 8.
Published in final edited form as: Proc IEEE Int Symp Info Theory. 2016 Aug 11;2016:580–584. doi: 10.1109/ISIT.2016.7541365

Chained Kullback-Leibler Divergences

Dmitri S Pavlichin 1, Tsachy Weissman 2
PMCID: PMC5677233  NIHMSID: NIHMS910318  PMID: 29130024

Abstract

We define and characterize the “chained” Kullback-Leibler divergence $\min_w D(p\|w) + D(w\|q)$ minimized over all intermediate distributions $w$ and the analogous k-fold chained K-L divergence $\min D(p\|w_{k-1}) + \ldots + D(w_2\|w_1) + D(w_1\|q)$ minimized over the entire path $(w_1,\ldots,w_{k-1})$. This quantity arises in a large deviations analysis of a Markov chain on the set of types – the Wright-Fisher model of neutral genetic drift: a population with allele distribution $q$ produces offspring with allele distribution $w$, which then produce offspring with allele distribution $p$, and so on.

The chained divergences enjoy some of the same properties as the K-L divergence (like joint convexity in the arguments) and appear in k-step versions of some of the same settings as the K-L divergence (like information projections and a conditional limit theorem). We further characterize the optimal k-step “path” of distributions appearing in the definition and apply our findings in a large deviations analysis of the Wright-Fisher process. We make a connection to information geometry via the previously studied continuum limit, where the number of steps tends to infinity, and the limiting path is a geodesic in the Fisher information metric.

Finally, we offer a thermodynamic interpretation of the chained divergence (as the rate of operation of an appropriately defined Maxwell’s demon) and we state some natural extensions and applications (a k-step mutual information and k-step maximum likelihood inference). We release code for computing the objects we study.

I. Introduction

We investigate the properties and applications of the k-fold chained Kullback-Leibler divergence between distributions p and q:

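$$D^{(k)}(p\|q) \triangleq \min_{w_1,\ldots,w_{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})$$ (1)

where $w_0 \triangleq q$, $w_k \triangleq p$, $D(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence

$$D(p\|q) = \sum_{x\in\mathcal{X}} p[x]\log\frac{p[x]}{q[x]}$$ (2)

and the minimum, if it exists, is taken over a “path” of distributions $w_i$ in some family of distributions. Here $k$ counts the number of hops and $D^{(1)}(p\|q) \triangleq D(p\|q)$.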

Just as the K-L divergence appears in statistical applications of information theory and in the method of types [1], so the chained divergence arises in the analysis of stochastic processes with an iterative resampling flavor: in each round one samples from a distribution $w_i$ determined by the outcome of the previous round of sampling from $w_{i-1}$. A motivating example that we consider in detail is the Wright-Fisher model of neutral genetic drift [2], [3], wherein each generation is obtained by sampling with replacement from the previous generation.

This work is devoted to characterizing the chained divergences in the finite alphabet setting and the optimal k-step path appearing in their definition, and to pointing out their applications. Section II defines the chained divergence in tandem with a discussion of the Wright-Fisher process, in which it naturally appears. Section III derives some properties (like joint convexity) of the chained divergence, including several “operationally” meaningful properties characterizing the rate of large deviations, information projections, and a conditional limit theorem; these are very similar to results on the K-L divergence [4], [5], [6]. Section IV-A characterizes the optimal path of intermediate distributions $(w_1,\ldots,w_{k-1})$ for all values of $k$ and states a method for computing this path. Section IV-B considers the continuum limit $k \to \infty$ (letting the number of distributions in the optimal path tend to infinity) and recapitulates results previously obtained by [7] in this setting: the limiting path is a geodesic in the Fisher information metric; equivalently, part of a great circle connecting $\sqrt{p}$ and $\sqrt{q}$. Section V considers extensions and other applications of the chained divergences – in particular a k-step version of the mutual information related to the likelihood that two independent genetic loci become dependent through neutral genetic drift in $k$ generations. Finally, we offer a thermodynamic interpretation of the chained divergence as the optimal rate of operation of an appropriately defined Maxwell’s demon.

We release code for computing the objects we study (its location is given in footnote 2). An Appendix contains the details of some of the proofs omitted here due to space constraints and is available at [8].

II. The Wright-Fisher Markov chain

We define the Wright-Fisher (“iterative resampling”) process and review briefly some results about the K-L divergence that we extend to the chained divergences (1).

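Denote by $\Delta \triangleq \{p \in \mathbb{R}^{\mathcal{X}} : \sum_{x\in\mathcal{X}} p[x] = 1,\ p \ge 0\}$ the set of distributions on the finite alphabet $\mathcal{X}$. Denote by $\Delta_n$ the set of distributions with integer denominator $n$ for each component:

$$\Delta_n \triangleq \Delta \cap \{p : np \in \mathbb{Z}^{\mathcal{X}}\}$$ (3)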

Δn is the set of types [1], equivalently the possible empirical distributions of n X-valued samples.
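As a small illustration of the set of types, the sketch below enumerates $\Delta_n$ by a stars-and-bars construction and compares its size with the polynomial bound $(n+1)^{|\mathcal{X}|}$ used later in the proofs. The function name `types` and the parameters `n`, `m` (alphabet size) are illustrative choices, not part of the released code.

```python
from itertools import combinations
from math import comb


def types(n, m):
    """Yield every type in Delta_n on an alphabet of size m, as a tuple of counts summing to n."""
    # Stars and bars: place m-1 bars among n+m-1 slots; the gaps are the counts.
    for bars in combinations(range(n + m - 1), m - 1):
        prev = -1
        counts = []
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(n + m - 2 - prev)
        yield tuple(counts)


n, m = 5, 3
all_types = list(types(n, m))
# Number of types, its closed form C(n+m-1, m-1), and the cruder bound (n+1)^m.
print(len(all_types), comb(n + m - 1, m - 1), (n + 1) ** m)
```

Dividing each tuple of counts by `n` gives the corresponding distribution in $\Delta_n$.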

Consider the Markov chain whose state space is the set of types $\Delta_n$ (3), defined by the following rule: Let the current state of the chain be $q \in \Delta_n$. Draw $n$ samples $x_1^n$ i.i.d. from $q$ and let $p[x] = \frac{1}{n}|\{i : x_i = x\}|$ be the empirical distribution. Then $p$ is the next state of the chain. This is the Wright-Fisher model [2], [3] of neutral genetic drift among $|\mathcal{X}|$ alleles in a population of size $n$: each of the $n$ individuals in a generation is sampled by cloning a uniformly chosen (hence “neutral”) individual from the previous generation. The process is illustrated in Figure 1. As time increases, alleles “die out” (once $q[x] = 0$, no samples of $x$ can be observed again) until only one allele remains, or “fixes” (that is, $q[x] = 1$ for some $x \in \mathcal{X}$), so the vertices of the simplex $\Delta$ are absorbing states. We will study fluctuations of this chain before fixation.
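The resampling rule above can be sketched in a few lines. This is a minimal simulation under stated assumptions: states are stored as integer allele counts rather than distributions, and the names `wright_fisher_step`, `simulate` and the cap `max_generations` are hypothetical.

```python
import random


def wright_fisher_step(counts, rng):
    """One generation: draw n individuals i.i.d. from the current empirical distribution."""
    n = sum(counts)
    draws = rng.choices(range(len(counts)), weights=counts, k=n)
    new_counts = [0] * len(counts)
    for allele in draws:
        new_counts[allele] += 1
    return new_counts


def simulate(counts, rng, max_generations=100000):
    """Run the chain until one allele fixes (an absorbing vertex) or the cap is reached."""
    path = [list(counts)]
    while max(path[-1]) < sum(counts) and len(path) <= max_generations:
        path.append(wright_fisher_step(path[-1], rng))
    return path


rng = random.Random(0)
path = simulate([10, 10, 10], rng)
print(len(path) - 1, "generations until fixation; final counts:", path[-1])
```

Each entry of `path` divided by the population size is a point of $\Delta_n$, so plotting the entries on the simplex gives a trajectory of the kind described for Figure 1.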

Fig. 1. (left) The Wright-Fisher process, a Markov chain on the set of types. (right) A trajectory of the Wright-Fisher process for $|\mathcal{X}| = 3$ alleles plotted on the simplex. Eventually only one allele remains (“fixes”).

Denote by $P_n(p|q)$ the transition matrix for this Markov chain:

$$P_n(p|q) = \mathrm{MultinomialPMF}(np;\, q)$$ (4)
$$= \binom{n}{(np[x])_{x\in\mathcal{X}}} \prod_{x\in\mathcal{X}} q[x]^{np[x]}$$ (5)
$$\doteq_n e^{-nD(p\|q)}$$ (6)

where the prefactor in (5) is a multinomial coefficient, $\doteq_n$ denotes equality to leading exponential order in $n$, and $D(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence (2). We can evaluate $P_n(p|q)$ for any $p, q \in \Delta$ and $n$, but since the state space for the Wright-Fisher process is the set of types $\Delta_n$ (3), $n \in \mathbb{N}$, we can think of $P_n(p|q)$ as a $|\Delta_n| \times |\Delta_n|$ stochastic matrix.
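To make the exponential equality (6) concrete, the sketch below evaluates the multinomial probability (5) in log space and compares $-\frac{1}{n}\log P_n(p|q)$ with $D(p\|q)$ as $n$ grows. The helper names `kl` and `log_transition` and the particular $p$, $q$ are illustrative assumptions.

```python
from math import lgamma, log


def kl(p, q):
    """D(p||q) in nats, with 0 log 0 = 0; infinite if p puts mass where q does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float("inf")
            total += pi * log(pi / qi)
    return total


def log_transition(counts, q):
    """log P_n(p|q) for the type p = counts/n, via the multinomial PMF (5)."""
    n = sum(counts)
    out = lgamma(n + 1)
    for c, qi in zip(counts, q):
        out -= lgamma(c + 1)
        if c > 0:
            if qi == 0:
                return float("-inf")
            out += c * log(qi)
    return out


q = [0.5, 0.3, 0.2]
rows = []
for n in (10, 100, 1000):
    counts = [2 * n // 10, 3 * n // 10, 5 * n // 10]  # the type p = (0.2, 0.3, 0.5)
    p = [c / n for c in counts]
    rate = -log_transition(counts, q) / n
    rows.append((n, rate, kl(p, q)))
    print(n, rate, kl(p, q))
```

The gap between the two printed columns is at most $|\mathcal{X}|\log(n+1)/n$, which is the polynomial prefactor hidden by the $\doteq_n$ notation.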

Given a set of distributions $E \subset \Delta$, we define $P_n(E|q) \triangleq \sum_{p \in E\cap\Delta_n} P_n(p|q)$ as the probability that the chain hops to somewhere in $E$ starting from $q$. If $E \cap \Delta_n = \emptyset$, then $P_n(E|q) \triangleq 0$. If $E$ is the closure of its interior (so $E\cap\Delta_n$ is nonempty for all large enough $n$) then Sanov’s theorem tells us [4], [5]

$$P_n(E|q) \triangleq \sum_{p\in E\cap\Delta_n} P_n(p|q) \doteq_n e^{-nD(E\|q)}$$ (7)

where

$$D(E\|q) \triangleq \inf_{p\in E} D(p\|q) = D(p^*\|q)$$ (8)

where $p^* \triangleq \arg\min_{p\in E} D(p\|q)$ is the I-projection [6]; the minimum exists and is unique since $E$ is closed (thus compact since $E \subset \Delta$) by assumption and $D(p\|q)$ is strictly convex in $p$. Sanov’s theorem and the conditional limit theorem tell us [4], [5], [1] that, conditioned on drawing an empirical distribution $p \in E$, the distribution $p$ is close to $p^*$ as $n \to \infty$. If $E$ is not closed (so $\min_{p\in E} D(p\|q)$ might not exist) but is convex, then there exists a unique distribution $p^*$ – the generalized I-projection [5] – such that $D(p\|q) \ge D(p\|p^*) + D(E\|q)$ for all $p \in E$.

What happens when we iterate the resampling chain? Denote by $P_n^{(k)}(p|q)$ the probability of drawing distribution $p$ starting from $q$ in $k$ steps, corresponding to the transition matrix $P_n^{(k)} = P_n^k$. We express $P_n^{(k)}(\cdot|\cdot)$ recursively:

$$P_n^{(k)}(p|q) \triangleq \sum_{w\in\Delta_n} P_n(p|w)\, P_n^{(k-1)}(w|q)$$ (9)
$$P_n^{(k)}(E|q) \triangleq \sum_{w\in\Delta_n} P_n(E|w)\, P_n^{(k-1)}(w|q)$$ (10)

with $P_n^{(1)}(\cdot|q) \triangleq P_n(\cdot|q)$ as in (4) and (7).
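The recursion (9) can be checked numerically on a binary alphabet, where a type is just the count of one allele. The sketch below builds the one-step matrix, applies (9) once to get the 2-step probability, and compares its exponent with a numerically minimized two-hop divergence $\min_w D(p\|w) + D(w\|q)$. All names (`one_step_matrix`, `k_step_row`, `kl2`, `chained2`) and the ternary search are illustrative assumptions, not the authors' released code.

```python
from math import comb, log


def one_step_matrix(n):
    """P[j][i]: probability that i of n offspring carry allele 0 when j of n parents do."""
    P = []
    for j in range(n + 1):
        a = j / n
        P.append([comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(n + 1)])
    return P


def k_step_row(P, start, k):
    """Row of the k-step matrix for the initial type `start`, via recursion (9)."""
    row = P[start][:]
    for _ in range(k - 1):
        row = [sum(row[w] * P[w][i] for w in range(len(P))) for i in range(len(P))]
    return row


def kl2(a, b):
    """Binary K-L divergence D((a, 1-a) || (b, 1-b)); assumes 0 < b < 1."""
    out = 0.0
    for x, y in ((a, b), (1 - a, 1 - b)):
        if x > 0:
            out += x * log(x / y)
    return out


def chained2(p, q, iters=200):
    """Two-hop divergence on a binary alphabet by ternary search (the objective is convex in w)."""
    f = lambda w: kl2(p, w) + kl2(w, q)
    lo, hi = 1e-12, 1 - 1e-12
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f((lo + hi) / 2)


n, k = 40, 2
P = one_step_matrix(n)
row = k_step_row(P, start=n // 4, k=k)  # q = (0.25, 0.75)
rate = -log(row[3 * n // 4]) / n        # p = (0.75, 0.25)
d1, d2 = kl2(0.75, 0.25), chained2(0.75, 0.25)
print(rate, d2, d1)
```

For moderate `n` the printed 2-step rate is only loosely tied to the two-hop divergence, because of the polynomial prefactors; the match is in the exponent as `n` grows.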

Theorem 1 establishes that the k-fold chained K-L divergence (1) plays the same role for the k-step resampling chain as the K-L divergence plays for one step of the chain, as described above. It is convenient to define recursively the k-step chained divergence $D^{(k)}(p\|q)$ for $p, q \in \Delta$:

$$D^{(k)}(p\|q) \triangleq \min_{w\in\Delta}\left(D(p\|w) + D^{(k-1)}(w\|q)\right)$$ (11)

with $D^{(1)}(p\|q) \triangleq D(p\|q)$, the K-L divergence (2). Note that we optimize over the simplex $\Delta$ (rather than the types $\Delta_n$). Theorem 1 establishes the existence and uniqueness of the minimum and the equivalence of the recursive definition (11) with our earlier definition for the chained divergence (1).

We further define $D^{(k)}(E\|q)$ for a closed, convex set $E \subset \Delta$ by analogy with (8):

$$D^{(k)}(E\|q) \triangleq \inf_{p\in E} D^{(k)}(p\|q) = D^{(k)}(p^{(k)*}\|q)$$ (12)

and $p^{(k)*} \triangleq \arg\min_{p\in E} D^{(k)}(p\|q)$ is the I-projection of $w_{k-1}^{(k)}$ on $E$, where $w_{k-1}^{(k)}$ is the next-to-last point in the optimal path (15). If $E$ is convex but $\min_{p\in E} D^{(k)}(p\|q)$ does not exist, then there exists a unique distribution $p^{(k)*}$ that is the generalized I-projection of $w_{k-1}^{(k)}$ on $E$.

III. Characterizing the k-fold chained K-L divergence

Theorem 1

Let k be a positive integer.

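  1. The k-fold chained divergence (11) satisfies for all $p, q \in \Delta$ such that $D(p\|q) < \infty$:
    $$D^{(k)}(p\|q) \triangleq \min_{w\in\Delta}\left(D(p\|w) + D^{(k-1)}(w\|q)\right)$$ (13)
    $$= \sum_{i=1}^{k} D(w_i^{(k)}\|w_{i-1}^{(k)})$$ (14)
    with boundary conditions $w_k^{(k)} \triangleq p$, $w_0^{(k)} \triangleq q$, and
    $$(w_1^{(k)},\ldots,w_{k-1}^{(k)}) \triangleq \arg\min_{\Delta^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})$$ (15)
    where the minimum is over the (k − 1)-fold product set $\Delta^{k-1} = \Delta \times \cdots \times \Delta$. The minimum exists and is unique. For closed, convex $E \subset \Delta$, (12) defines $D^{(k)}(E\|q)$. If $D(p\|q) = \infty$, then $D^{(k)}(p\|q) \triangleq \infty$ and $w_i^{(k)}$ is not defined.
  2. Convexity: $D^{(k)}(p\|q)$ is jointly strictly convex in $p$ and $q$:
    $$D^{(k)}(\alpha p_1 + (1-\alpha)p_2 \,\|\, \alpha q_1 + (1-\alpha)q_2) < \alpha D^{(k)}(p_1\|q_1) + (1-\alpha) D^{(k)}(p_2\|q_2)$$ (16)
    for all distributions $(p_1, q_1) \ne (p_2, q_2)$, $\alpha \in (0,1)$.
  3. Scaling in k: Suppose $0 < D(p\|q) < \infty$. Then $D^{(k)}(p\|q)$ is strictly decreasing in $k$ and $D^{(k)}(p\|q) = O(1/k)$. Theorem 3 of Section IV-B (on the continuum limit $k \to \infty$) gives the more precise scaling:
    $$\lim_{k\to\infty} k\, D^{(k)}(p\|q) = 2\theta_{p,q}^2$$ (17)
    where $\theta_{p,q} = \arccos(\sqrt{p}\cdot\sqrt{q})$ is the angle between the vectors $\sqrt{p}$ and $\sqrt{q}$ (see Section IV-B for details).
  4. Markov chain transition matrix $P_n^{(k)}(p|q)$ (9), upper bound for $p \in \Delta$:
    $$P_n^{(k)}(p|q) \le (n+1)^{(k-1)|\mathcal{X}|}\, e^{-nD^{(k)}(p\|q)}$$ (18)
    Moreover, for $p \in \Delta$ and a sequence $(p_n)_{n\in\mathbb{N}}$, $p_n \in \Delta_n$, such that $\lim_{n\to\infty} D(p_n\|p) = 0$,
    $$\lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(p|q) = \lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(p_n|q) = -D^{(k)}(p\|q)$$ (19)
    Equivalently, $P_n^{(k)}(p|q) \doteq_n e^{-nD^{(k)}(p\|q)}$.
  5. Sanov’s theorem: Let $E \subset \Delta$. The probability $P_n^{(k)}(E|q)$ (10) to draw a type in $E$ from $q \in \Delta$ in $k$ steps is upper bounded:
    $$P_n^{(k)}(E|q) \le (n+1)^{k|\mathcal{X}|}\, e^{-nD^{(k)}(E\|q)}$$ (20)
    where $D^{(k)}(E\|q)$ is defined in (12). Moreover, if $E$ is closed, convex, and $E \cap \Delta_n \ne \emptyset$ for sufficiently large $n$ then
    $$\lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(E|q) = -D^{(k)}(p^{(k)*}\|q)$$ (21)
    where $p^{(k)*}$ is the k-step I-projection of $q$ on $E$ (12); equivalently, $P_n^{(k)}(E|q) \doteq_n e^{-nD^{(k)}(E\|q)}$.
  6. Conditional limit theorem: Let $E \subset \Delta$ be closed and convex with $E \cap \Delta_n \ne \emptyset$ for sufficiently large $n$, $q \in \Delta - E$, $D(E\|q) < \infty$. Let $(X_t^{(i)})_{t=1}^{n}$ be the $i$-th sample of the Wright-Fisher process (Figure 1): that is, the $X_t^{(i)}$ are drawn i.i.d. $\sim \hat{w}_{i-1}$ with $\hat{w}_0 \triangleq q$ and $\hat{w}_i$ the empirical distribution of $(X_t^{(i)})_t$. Then for all $\varepsilon > 0$, $i \in \{1, \ldots, k\}$, $x \in \mathcal{X}$
    $$\lim_{n\to\infty} P\left(|\hat{w}_i[x] - w_i[x]| \ge \varepsilon \,\middle|\, \hat{w}_k \in E\right) = 0$$ (22)
    with $(w_i^{(k)})_{i=1}^{k-1}$ as in (15) and $w_k = p^{(k)*} = \arg\min_{p\in E} D^{(k)}(p\|q)$ (the k-step I-projection of $q$ on $E$ (12)).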

Joint convexity 2) follows because minimization of jointly convex functions with respect to some of the arguments over a convex set preserves convexity [9]. Existence and uniqueness of the optimal path 1) follow from joint strict convexity and the compactness of Δ ∩ Support(q). Theorem 3 contains 3). The proofs of 4), 5), 6) all follow closely the proofs of analogous results for the K-L divergence in [1], Chapter 11. See Appendix VI-A in [8] for details of the proofs.

IV. Characterizing the k-fold path

What can we say about the optimal k-fold “path” $(w_i^{(k)})_{i=1}^{k-1} = \arg\min_{\Delta^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})$ (15)? In the event of a large deviation, we observe a path close to this one with high probability as $n \to \infty$ (Theorem 1.6)).

A. Finite number of steps k

Let’s start with the case $k = 2$ and find $w^*(p, q) = \arg\min_{w\in\Delta} D(p\|w) + D(w\|q)$. For $k > 2$, we will then use this local characterization of the path of distributions to compute $w_i = w^*(w_{i+1}, w_{i-1})$.

We set up an optimization problem with Lagrangian $\Lambda = D(p\|w) + D(w\|q) + (\lambda - 1)\left(\sum_{x\in\mathcal{X}} w[x] - 1\right)$ where $\lambda - 1$ is a Lagrange multiplier enforcing normalization of $w$. The $|\mathcal{X}|$ stationarity equations $\nabla_w \Lambda|_{w^*} = 0$ read $-p[x]/w[x] + \log(w[x]/q[x]) + \lambda = 0$; solving them for $w^*$ we find:

$$w^*[x] = w^*(p,q)[x] \triangleq \begin{cases} \dfrac{p[x]}{W\!\left(e^{\lambda^*} p[x]/q[x]\right)} & : p[x] > 0 \\ e^{-\lambda^*} q[x] & : p[x] = 0 \end{cases}$$ (23)

where $W(\cdot)$ is the principal branch of the Lambert W function and $\lambda^*$ is chosen so that $w^* \in \Delta$ is normalized (in fact $\lambda^* \in [0, 1]$ for all $p, q$). The case $p[x] = 0$ is obtained by analytic continuation of the solution for the case $p[x] > 0$. If $D(p\|q) = \infty$ then $w^*(p, q)$ is not defined. Note that the Lagrange multiplier $\lambda^*$ lives inside the Lambert W function (rather than outside like a partition function prefactor), and so cannot generally be expressed analytically in terms of $p$ and $q$; one could try to find $\lambda^*$ numerically. Theorem 1.1) tells us that $w^*$ is a global minimum.
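A minimal sketch of how $w^*(p,q)$ in (23) might be computed numerically: evaluate the Lambert W function by Newton iteration and bisect on $\lambda$ over $[0,1]$ until the candidate sums to one. The Newton solver, the bisection, and the names `lambert_w`, `w_star` are assumptions made for illustration; they are not taken from the released code, and the sketch assumes $q[x] > 0$ wherever $p[x] > 0$.

```python
from math import exp, log, sqrt


def kl(p, q):
    """D(p||q) in nats with 0 log 0 = 0 (assumes q[x] > 0 wherever p[x] > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def lambert_w(x, tol=1e-14):
    """Principal branch W(x) for x >= 0, by Newton iteration on u * exp(u) = x."""
    if x == 0:
        return 0.0
    u = log(1 + x)  # starts at or above the root, so Newton decreases monotonically
    for _ in range(100):
        step = (u * exp(u) - x) / (exp(u) * (u + 1))
        u -= step
        if abs(step) <= tol * (1 + abs(u)):
            break
    return u


def w_star(p, q):
    """Midpoint w*(p, q) of (23) and its multiplier, found by bisection on lambda."""
    def w_of(lam):
        return [pi / lambert_w(exp(lam) * pi / qi) if pi > 0 else exp(-lam) * qi
                for pi, qi in zip(p, q)]
    lo, hi = 0.0, 1.0  # the text states that lambda* lies in [0, 1]
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(w_of(mid)) > 1:  # the total mass is decreasing in lambda
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return w_of(lam), lam


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
w, lam = w_star(p, q)
print(w, lam, kl(p, w) + kl(w, q), kl(p, q))
```

The last two printed numbers are the two-hop value $D(p\|w^*) + D(w^*\|q)$ and the one-hop divergence $D(p\|q)$, which can be compared directly.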

Theorem 2

(local characterization of the optimal k-step path) Let $k \ge 2$ and $D(p\|q) < \infty$ for $p, q \in \Delta$. The optimal path $(w_i^{(k)})_{i=1}^{k-1}$ (15) used in defining the chained K-L divergence $D^{(k)}(p\|q)$ satisfies

$$w_i^{(k)}[x] = w^*(w_{i+1}^{(k)}, w_{i-1}^{(k)})[x]$$ (24)
$$\ge p[x]^{i/k}\, q[x]^{(k-i)/k}$$ (25)

for all $x \in \mathcal{X}$, $1 \le i \le k - 1$, where $w^*(\cdot,\cdot)$ is defined in (23) and with boundary conditions $w_k^{(k)} = p$, $w_0^{(k)} = q$. As a corollary of (24), for $i \in \{1,\ldots,k-1\}$

$$\mathrm{Support}(w_i^{(k)}) = \mathrm{Support}(q)$$ (26)

Proof

Suppose (24) does not hold for some $p$, $q$, $w_i^{(k)}$. Then $\sum_{i=1}^{k} D(w_i^{(k)}\|w_{i-1}^{(k)})$ (14) can be strictly decreased by replacing $w_i^{(k)}$ with $w^*(w_{i+1}^{(k)}, w_{i-1}^{(k)})$. Thus (24) holds. ■

To derive the geometric lower bound (25) we massage the function $w^*(p, q)$ (23) into an f-divergence, then use the non-negativity of f-divergences to bound $\lambda^*$ and derive $w^*(p,q)[x] \ge \sqrt{p[x]\,q[x]}$, and then derive (25). See Appendix VI-B [8] for details.

Remarks

Figure 2 plots the optimal k-step path for $k \in \{2, 6\}$ and in the $k \to \infty$ limit (see Section IV-B). One way to compute the optimum path is to start with some guess (a good initial point is given in Theorem 3) and repeatedly apply (24) until numerical convergence; this is the method used in making Figure 2 and implemented in the code we release.
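The iteration described in this remark can be sketched as a sweep that replaces each interior point by $w^*$ of its two neighbours, as in (24), until the path stops moving. This is an illustrative reimplementation under assumptions, not the released code: the straight-line initial guess, the stopping tolerance, and the names `chained_path` and `chained_divergence` are hypothetical, and all distributions are assumed to have full support.

```python
from math import exp, log


def kl(p, q):
    """D(p||q) in nats with 0 log 0 = 0 (assumes q[x] > 0 wherever p[x] > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def lambert_w(x):
    """Principal branch W(x) for x >= 0, by Newton iteration on u * exp(u) = x."""
    if x == 0:
        return 0.0
    u = log(1 + x)
    for _ in range(100):
        step = (u * exp(u) - x) / (exp(u) * (u + 1))
        u -= step
        if abs(step) <= 1e-14 * (1 + abs(u)):
            break
    return u


def w_star(p, q):
    """Midpoint w*(p, q) of (23), with lambda found by bisection on [0, 1]."""
    def w_of(lam):
        return [pi / lambert_w(exp(lam) * pi / qi) if pi > 0 else exp(-lam) * qi
                for pi, qi in zip(p, q)]
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if sum(w_of(mid)) > 1 else (lo, mid)
    return w_of((lo + hi) / 2)


def chained_path(p, q, k, sweeps=5000, tol=1e-12):
    """Path (w_0 = q, ..., w_k = p) obtained by repeatedly applying (24) to the interior points."""
    # Initial guess: straight line from q to p (an arbitrary choice for this sketch).
    path = [[(1 - i / k) * qi + (i / k) * pi for pi, qi in zip(p, q)] for i in range(k + 1)]
    for _ in range(sweeps):
        change = 0.0
        for i in range(1, k):
            new = w_star(path[i + 1], path[i - 1])
            change = max(change, max(abs(a - b) for a, b in zip(new, path[i])))
            path[i] = new
        if change < tol:
            break
    return path


def chained_divergence(p, q, k):
    """Sum of one-step divergences along the computed path, an estimate of D^(k)(p||q)."""
    path = chained_path(p, q, k)
    return sum(kl(path[i], path[i - 1]) for i in range(1, k + 1))


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
values = {k: chained_divergence(p, q, k) for k in (1, 2, 3, 6)}
print(values)
```

Because each update exactly minimizes the sum over one interior point with its neighbours held fixed, the objective never increases from sweep to sweep.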

Fig. 2. (left) k-step path (15) from $q$ to $p$ (corresponding to $D^{(k)}(p\|q)$) computed by successively applying (24) for $w_1^{(k=2)}$ and $(w_i^{(k=6)})_{1\le i\le 5}$. The green line shows the limiting $k \to \infty$ path (31). The dark grey lines show the arithmetic and (normalized) geometric mean of $p$ and $q$ for $\alpha \in [0, 1]$. (right) The limiting path as $k \to \infty$, part of a great circle joining $\sqrt{p}$ and $\sqrt{q}$.

The optimal k-step path from $q$ to $p$ is not the same as the optimal path from $p$ to $q$, an asymmetry inherited from the asymmetry of $D^{(k)}(p\|q)$; but in the limit $k \to \infty$ these two paths converge to the same limiting path (see Section IV-B). From (23) we see that $q[x] > 0 \Rightarrow w^*(p, q)[x] > 0$, so $\mathrm{Support}(w_i^{(k)}) = \mathrm{Support}(q)$ for $i \in \{1, \ldots, k-1\}$.

B. The limit k → ∞

What happens in the continuum limit, as the number of steps k → ∞? This setting was investigated by [7], who states the limiting path and large deviations rate function. We include this Section to make the story more complete, and to check consistency with and provide a finite-k intuition for the results [7] obtained with variational calculus.

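A useful perspective in the following is to map the simplex $\Delta$ to the part of the unit sphere in $\mathbb{R}^{|\mathcal{X}|}$ in the non-negative orthant, $\Psi \triangleq \{\psi \in \mathbb{R}^{|\mathcal{X}|} : \|\psi\|_2 = 1,\ \psi \ge 0\}$, via the bijection $p \mapsto \sqrt{p} \triangleq (\sqrt{p[x]})_{x\in\mathcal{X}}$. This square root reparametrization appears in [10], [11], [12], [13]. Consider the K-L divergence between nearby points on the orthant $\Psi$. That is, if $p = q + \varepsilon \in \Delta$, $\varepsilon \in \mathbb{R}^{|\mathcal{X}|}$, then there exists $\delta_\varepsilon \in \mathbb{R}^{|\mathcal{X}|}$ such that $\sqrt{p} = \sqrt{q} + \delta_\varepsilon \in \Psi$ where $\delta_\varepsilon[x] = \frac{1}{2}\varepsilon[x]\, p[x]^{-1/2} + O(\varepsilon[x]^2 p[x]^{-3/2})$. Then

$$D(p\|q) = \underbrace{4\big(1 - \underbrace{\sqrt{p}\cdot\sqrt{q}}_{\cos\theta}\big)}_{2\|\sqrt{p}-\sqrt{q}\|_2^2} + \sum_{x\in\mathcal{X}} O\!\left(\frac{\delta_\varepsilon[x]^3}{\sqrt{q[x]}}\right)$$ (27)
$$= 2\theta^2 + O(\theta^4) + \sum_{x\in\mathcal{X}} O\!\left(\frac{\theta^3}{\sqrt{q[x]}}\right)$$ (28)

where in (27) we rewrote the higher order terms in terms of ε, expanded the cos about θ = 0, and where

$$\theta = \theta_{p,q} \triangleq \arccos(\sqrt{p}\cdot\sqrt{q})$$ (29)

is the angle between the vectors $\sqrt{p}$ and $\sqrt{q}$, and $\sqrt{p}\cdot\sqrt{q} = B(p,q)$ is the Bhattacharyya coefficient between $p$ and $q$.

The above recapitulates the familiar fact that the K-L divergence is close to the squared Euclidean distance between nearby distributions [13] (here written in square root space); the leading order term is symmetric in $p$ and $q$ and depends on $p$ and $q$ only through the angle $\theta_{p,q}$. This enables us to guess that the limiting optimal path is a geodesic in the Euclidean metric restricted to the surface of the unit sphere and that the angle between adjacent points is $\theta_{w_i^{(k)}, w_{i-1}^{(k)}} \approx \frac{1}{k}\theta_{p,q}$. Theorem 3 confirms this intuition. See Appendix VI-C [8] for proof details.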

Theorem 3

Let $D(p\|q)$, $D^{(k)}(p\|q)$ be as in (2), (14), respectively. Let $p, q \in \Delta$ and $D(p\|q) < \infty$ and $D(q\|p) < \infty$.

  1. Scaling of $D^{(k)}(p\|q)$ in $k$: $D^{(k)}(p\|q) = O(1/k)$ and
    $$\lim_{k\to\infty} k\, D^{(k)}(p\|q) = \lim_{k\to\infty} k\, D^{(k)}(q\|p) = 2\theta_{p,q}^2$$ (30)
    where (as in (29)) $\cos\theta_{p,q} = \sqrt{p}\cdot\sqrt{q} = B(p,q)$.
  2. The limiting path $(w_i^{(k)})$ (15) as $k \to \infty$ is a geodesic in the Euclidean metric restricted to the unit sphere (equivalently, a geodesic in the Fisher information metric on the simplex): part of a great circle connecting $\sqrt{p}$ and $\sqrt{q}$ with constant angular speed. Let $t \in [0,1]$, then
    $$w_{\lfloor tk \rfloor}^{(k)} \to \frac{1}{Z_{\tau_t}}\left(\tau_t\sqrt{p} + (1-\tau_t)\sqrt{q}\right)^2$$ (31)
    in $L_2$ norm as $k \to \infty$, where $\lfloor x \rfloor$ denotes the floor function, $Z_{\tau_t}$ is a normalizing constant, and $\tau_t : [0,1] \to [0,1]$ is a reparametrization of “time” $t$ that ensures constant angular speed on the unit sphere.

Remarks

The condition $D(p\|q) < \infty$ and $D(q\|p) < \infty$ ensures that $p[x] > 0 \Leftrightarrow q[x] > 0$, so the series expansion (27) does not diverge and the limits in (30) match. Figure 2 depicts the limiting optimal path for a particular choice of $p$ and $q$. For a finite number of steps $k$, the limiting path (31) evaluated at $t = i/k$ provides a decent initial guess for the iterative computation of $w_i^{(k)}$ described in the remarks following the statement of Theorem 2.
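The limiting path and the scaling (30) can be illustrated numerically. The sketch below samples the great circle between $\sqrt{q}$ and $\sqrt{p}$ at equal angular spacing using spherical linear interpolation, which is one way to realize the constant-speed parametrization in (31) (the normalization $Z_{\tau_t}$ and the map $\tau_t$ are absorbed into the sine weights; this choice is an assumption of the sketch), and checks that $k$ times the summed one-step divergences along it approaches $2\theta_{p,q}^2$. The names `angle` and `geodesic` are hypothetical.

```python
from math import acos, log, sin, sqrt


def kl(p, q):
    """D(p||q) in nats with 0 log 0 = 0 (assumes q[x] > 0 wherever p[x] > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def angle(p, q):
    """theta_{p,q}: arccos of the Bhattacharyya coefficient, as in (29)."""
    b = sum(sqrt(pi * qi) for pi, qi in zip(p, q))
    return acos(min(1.0, b))


def geodesic(p, q, k):
    """k+1 points from q to p, equally spaced in angle on the sphere of square roots (p != q)."""
    theta = angle(p, q)
    path = []
    for i in range(k + 1):
        s = i / k
        a = sin((1 - s) * theta) / sin(theta)  # weight on sqrt(q)
        b = sin(s * theta) / sin(theta)        # weight on sqrt(p)
        path.append([(a * sqrt(qi) + b * sqrt(pi)) ** 2 for pi, qi in zip(p, q)])
    return path


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
theta = angle(p, q)
k = 2000
path = geodesic(p, q, k)
total = sum(kl(path[i], path[i - 1]) for i in range(1, k + 1))
print(k * total, 2 * theta ** 2)
```

The points `geodesic(p, q, k)[i]` for a small `k` can also serve as the initial guess mentioned in the remark above, in place of a straight-line interpolation.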

V. Applications and further directions

We conclude by offering an application of the chained K-L divergences in maximum likelihood inference and an interpretation in terms of a thought experiment in thermodynamics.

A. ML inference and mutual information

The k-step K-L divergence $D^{(k)}(p\|E) = \inf_{q\in E} D^{(k)}(p\|q)$ “from” a set of distributions $E$ has a maximum likelihood interpretation: $q^* = \arg\min_{q\in E} D^{(k)}(p\|q)$ maximizes the likelihood of drawing empirical distribution $p \in \Delta_n$ in $k$ steps of the Wright-Fisher process with initial distribution $q \in E$. A special case of this is a k-step generalization of the mutual information $I(X; Y)$ between random variables $X$ and $Y$ with finite alphabets $\mathcal{X}$, $\mathcal{Y}$, respectively, jointly distributed as $p_{XY} \in \Delta_{\mathcal{X}\times\mathcal{Y}}$:

$$I^{(k)}(X;Y) \triangleq \min_{q_X, q_Y} D^{(k)}(p_{XY}\|q_X q_Y)$$ (32)

with the minimum attained by the k-step “marginals” $(q_X^{(k)}, q_Y^{(k)}) \triangleq \arg\min_{q_X, q_Y} D^{(k)}(p_{XY}\|q_X q_Y)$. We can check that the minimum exists and is unique. For $k = 1$, we have $I^{(1)}(X; Y) = I(X; Y)$ and $(q_X^{(1)}, q_Y^{(1)}) = (p_X, p_Y)$ (the marginals of $p_{XY}$), but for $k > 1$ this is not the case. $I^{(k)}$ is monotonically decreasing in $k$.

Written in the minimization form (32), computing the mutual information $I(X; Y)$ corresponds to finding the maximum likelihood distribution under which $X$ and $Y$ are independent given data with empirical distribution $p_{XY}$. In the context of the Wright-Fisher process, suppose we assume that two genetic loci $X$ and $Y$ were independently distributed $k$ generations ago, but are no longer independent due to the intervening neutral genetic drift; then the maximum likelihood ancestral distribution is $q_X^{(k)} q_Y^{(k)}$ given the current distribution $p_{XY}$. After enough time, all but one pair $(x, y)$ of alleles fixes and the two loci become independent again, but neutral drift induces dependence before fixation occurs.

A related object we can construct with the chained K-L divergence is $D^{(k)}(p_{XY}\|p_X p_Y)$; this, like (32), matches the mutual information $I(X; Y)$ for $k = 1$. We do not yet have an operational meaning for this quantity.

B. Maxwell’s demon

Maxwell’s demon is a thought experiment in thermodynamics, here envisioned as a model of a desalination plant: let distribution $q$ correspond to the ambient relative concentrations of some chemicals in the “ocean” (so $\mathcal{X} = \{\text{sodium ion}, \text{water}, \text{potassium ion}, \ldots\}$). The demon’s goal is to achieve a desired concentration of chemicals in the water supply; let $E$ denote the set of concentrations the demon considers acceptable. The demon operates by admitting $n$ molecules drawn from the ambient distribution $q$ into a vesicle (temporary storage); if it so happens that $p$, the empirical distribution of the $n$ molecules, satisfies $p \in E$, then the demon releases the contents of the vesicle into the water supply, and otherwise the demon dumps the contents back in the ocean. See Figure 3.

Fig. 3. (left) Maxwell’s demon stands ready to release particles into the water supply below if the distribution of a random sample of $n$ happens to be in $E$. (right) Two demons use an intermediate stage $E'$ to desalinate faster.

How quickly does the demon desalinate? From the preceding discussion we know that the demon releases the $n$ molecules into the water supply with probability about $e^{-nD(E\|q)}$. Suppose we now break up the desalination process into two demons, so there is an intermediate stage between the ocean and the water supply. If either of the two demons rejects, then both reject. Then the optimal 2-demon desalination rate is $e^{-nD^{(2)}(E\|q)}$, which is exponentially larger than the 1-demon rate because $D^{(2)}(E\|q) < D(E\|q)$ (Theorem 1.3)); for $E$ close to $q$ the exponent is roughly halved. Adding more stages increases the rate further; in the $k \to \infty$ limit, we have a concentration gradient across the $k$ stages with no large “jumps.” This is not how actual desalinators work, but the design principle of breaking up a large concentration difference into many small stages is used in multi-stage flash desalination and in the loop of Henle in the kidney.

Acknowledgments

The authors gratefully acknowledge Surya Ganguli and Hideo Mabuchi for suggestions and insightful discussions.

VI. Appendix

A. Proof of Theorem 1

1) (Equivalence of recursive and non-recursive definitions of the chained divergence)

We proceed by induction on $k$. For the base case $k = 2$ the two definitions (13) and (14) are the same. Now suppose the definitions are equivalent up to $k - 1$ (abusing notation). Then we expand out the min:

$$D^{(k)}(p\|q) \triangleq \min_{w_{k-1}}\left(D(p\|w_{k-1}) + D^{(k-1)}(w_{k-1}\|q)\right)$$ (33)
$$= \min_{w_{k-1}}\left(D(p\|w_{k-1}) + \min_{w_1,\ldots,w_{k-2}} \sum_{i=1}^{k-1} D(w_i\|w_{i-1})\right)$$ (34)
$$= \min_{w_1,\ldots,w_{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})$$ (35)

where the second line follows from the inductive hypothesis. □

Existence and uniqueness of the path follow from the joint strict convexity of $D^{(k)}(p\|q)$ (Theorem 1.2)).

2) (Joint convexity of $D^{(k)}(\cdot\|\cdot)$)

We use the following result [9] (Section 3.2.5): Suppose function $f(x, y, z)$ is jointly convex in $(x, y, z) \in C$, where $C$ is a nonempty convex subset of $\mathbb{R}^d$. Then

$$g(x,y) \triangleq \min_{z\in C} f(x,y,z)$$ (36)

is jointly convex in $x$ and $y$.

We proceed by induction on $k$. For the base case $k = 1$, $D^{(1)}(p\|q) = D(p\|q)$ is jointly strictly convex in $p, q \in \Delta$. Now assume the statement holds up to $k - 1$. Then $D(p\|w) + D^{(k-1)}(w\|q)$ is jointly strictly convex in $p, q, w \in \Delta$ and

$$D^{(k)}(p\|q) \triangleq \min_{w\in\Delta}\left(D(p\|w) + D^{(k-1)}(w\|q)\right)$$ (37)

is jointly strictly convex in $p, q$ by the above lemma (36) with $g(p, q) = D^{(k)}(p\|q)$ and $f(p, q, w) = D(p\|w) + D^{(k-1)}(w\|q)$.

Since $D^{(k)}(p\|q)$ is the minimum of a strictly convex function over a closed convex set $\Delta$, there is a unique global minimizer $(w_i^{(k)})$ for $1 \le i \le k - 1$. □

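3) (Monotonicity of $D^{(k)}(p\|q)$ in $k$)

Consider the optimal k-step path $(q = w_0^{(k)}, w_1^{(k)}, \ldots, w_{k-1}^{(k)}, p = w_k^{(k)})$ (15) and compare it with the k-step path obtained from the optimal (k − 1)-step path by repeating the endpoint $p$, that is, $(q = w_0^{(k-1)}, \ldots, w_{k-1}^{(k-1)} = p, p)$:

$$D^{(k)}(p\|q) = \sum_{i=1}^{k} D(w_i^{(k)}\|w_{i-1}^{(k)})$$ (38)
$$\overset{(a)}{<} D(p\|p) + \sum_{i=1}^{k-1} D(w_i^{(k-1)}\|w_{i-1}^{(k-1)})$$ (39)
$$\overset{(b)}{=} D^{(k-1)}(p\|q)$$ (40)

(a) follows because we know from Theorem 2 that $w_{k-1}^{(k)} \ne p$ and since we showed above that the optimal path $(w_i^{(k)})$ is unique, so the comparison path is a different, hence strictly suboptimal, k-step path (in general $w_i^{(k-1)} \ne w_i^{(k)}$). (b) follows since $D(p\|p) = 0$ and the remaining sum is the definition of $D^{(k-1)}(p\|q)$. □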

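4) (Bounding $P_n^{(k)}(p|q)$)

First the upper bound (18). We proceed by induction on $k$. For $k = 1$, the statement holds (see [1] Theorem 11.1.4). Now assume the statement holds up to $k - 1$. Now using (9) we write

$$P_n^{(k)}(p|q) = \sum_{w\in\Delta_n} P_n(p|w)\, P_n^{(k-1)}(w|q)$$ (41)
$$\overset{(a)}{\le} (n+1)^{(k-2)|\mathcal{X}|} \sum_{w\in\Delta_n} e^{-n\left(D(p\|w) + D^{(k-1)}(w\|q)\right)}$$ (42)
$$\le (n+1)^{(k-2)|\mathcal{X}|}\, |\Delta_n| \max_{w\in\Delta_n} e^{-n\left(D(p\|w) + D^{(k-1)}(w\|q)\right)}$$ (43)
$$\overset{(b)}{\le} (n+1)^{(k-1)|\mathcal{X}|} \max_{w\in\Delta} e^{-n\left(D(p\|w) + D^{(k-1)}(w\|q)\right)}$$ (44)
$$= (n+1)^{(k-1)|\mathcal{X}|}\, e^{-n\min_{w\in\Delta}\left(D(p\|w) + D^{(k-1)}(w\|q)\right)}$$ (45)
$$\overset{(c)}{=} (n+1)^{(k-1)|\mathcal{X}|}\, e^{-nD^{(k)}(p\|q)}$$ (46)

where (a) follows from the inductive hypothesis, (b) follows from $|\Delta_n| \le (n+1)^{|\mathcal{X}|}$ (see [1] Theorem 11.1.1) and from optimizing over $\Delta$ rather than $\Delta_n \subset \Delta$, and (c) follows from the recursive definition of the chained K-L divergence (11).

Next we find an asymptotic (in $n \to \infty$) lower bound. Let $(p_n)_{n\in\mathbb{N}}$, $p_n \in \Delta_n$, $\lim_{n\to\infty} D(p_n\|p) = 0$. Choose an n-indexed sequence $(v_{n,i})_{i=1}^{k-1}$ such that $v_{n,i} \in \Delta_n$, $D(v_{n,i}\|w_i) \to 0$ as $n \to \infty$. Such a sequence exists because $\cup_n \Delta_n$ is dense in $\Delta$. Denote $v_{n,k} \triangleq p_n$ and $w_k \triangleq p$. Denote

$$\varepsilon_{n,i} \triangleq v_{n,i} - w_i$$ (47)

By Pinsker’s inequality

$$\lim_{n\to\infty} |\varepsilon_{n,i}[x]| = 0$$ (48)

for all $i \in \{1,\ldots,k\}$, $x \in \mathcal{X}$. Moreover, since $\lim_{n\to\infty} D(v_{n,i}\|w_i) = 0$ by assumption, then for large enough $n$ we have $\mathrm{Support}(v_{n,i}) = \mathrm{Support}(w_i)$, so for large enough $n$

$$\mathrm{Support}(\varepsilon_{n,i}) \subseteq \mathrm{Support}(w_i)$$ (49)

Therefore for $i \in \{1,\ldots,k\}$

$$\lim_{n\to\infty} D(v_{n,i}\|v_{n,i-1}) = \lim_{n\to\infty} \sum_{x\in\mathcal{X}} (w_i[x] + \varepsilon_{n,i}[x]) \log\left(\frac{w_i[x] + \varepsilon_{n,i}[x]}{w_{i-1}[x] + \varepsilon_{n,i-1}[x]}\right)$$ (50)
$$= \sum_{x\in\mathcal{X}} w_i[x] \log\left(\frac{w_i[x]}{w_{i-1}[x]}\right)$$ (51)
$$= D(w_i\|w_{i-1})$$ (52)

where the second equality follows from (48) and (49) and $\varepsilon_{n,0}[x] \triangleq 0$.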

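Using (9) we write

$$P_n^{(k)}(p_n|q) = \sum_{(w_i)\in\Delta_n^{k-1}} \prod_{i=1}^{k} P(w_i|w_{i-1})$$ (53)
$$\overset{(a)}{\ge} \frac{1}{(n+1)^{k|\mathcal{X}|}} \sum_{(w_i)\in\Delta_n^{k-1}} e^{-n\sum_{i=1}^{k} D(w_i\|w_{i-1})}$$ (54)
$$\ge \frac{1}{(n+1)^{k|\mathcal{X}|}}\, e^{-n\min_{(w_i)\in\Delta_n^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})}$$ (55)

with $w_0 = q$, $w_k = p_n$, $(w_i) = (w_1,\ldots,w_{k-1})$, where (a) follows from the lower bound $P_n(p|q) \ge e^{-nD(p\|q)}/(n+1)^{|\mathcal{X}|}$ (see [1], Theorem 11.1.4). Therefore

$$\liminf_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(p_n|q) \ge \liminf_{n\to\infty}\left(-\frac{k|\mathcal{X}|\log(n+1)}{n} - \min_{(w_i)\in\Delta_n^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})\right)$$ (56)
$$= \liminf_{n\to\infty}\left(-\min_{(w_i)\in\Delta_n^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})\right)$$ (57)
$$\ge \liminf_{n\to\infty}\left(-\sum_{i=1}^{k} D(v_{n,i}\|v_{n,i-1})\right)$$ (58)
$$\overset{(a)}{=} -\sum_{i=1}^{k} D(w_i\|w_{i-1})$$ (59)
$$= -D^{(k)}(p\|q)$$ (60)

where (a) follows by (52). Recalling the upper bound (18) we conclude

$$\lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(p_n|q) = -D^{(k)}(p\|q)$$ (61)

The first equality in (19) follows by letting $v_{n,k} = p_n = p$, $\varepsilon_{n,k}[x] = 0$ for all $n$, $x \in \mathcal{X}$, so $\lim_{n\to\infty} D(p\|v_{n,k-1}) = D(p\|w_{k-1})$, so (59) holds, and

$$\lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(p|q) = \lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(p_n|q) = -D^{(k)}(p\|q)$$ (62)

Equivalently, $P_n^{(k)}(p|q) \doteq_n e^{-nD^{(k)}(p\|q)}$. □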

5) (Sanov’s theorem)

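This proof is very similar to the previous proof for the case $E = \{p\}$ and to the proof of Sanov’s theorem in [1].

First the upper bound (20). Using (7) and (10) we write

$$P_n^{(k)}(E|q) = \sum_{p\in E\cap\Delta_n} \sum_{w\in\Delta_n} P_n(p|w)\, P_n^{(k-1)}(w|q)$$ (63)
$$\overset{(a)}{\le} (n+1)^{(k-1)|\mathcal{X}|} \sum_{p\in E\cap\Delta_n} e^{-nD^{(k)}(p\|q)}$$ (64)
$$\le (n+1)^{(k-1)|\mathcal{X}|} \sum_{p\in E\cap\Delta_n} e^{-n\inf_{p\in E} D^{(k)}(p\|q)}$$ (65)
$$= (n+1)^{(k-1)|\mathcal{X}|} \sum_{p\in E\cap\Delta_n} e^{-nD^{(k)}(E\|q)}$$ (66)
$$\overset{(b)}{\le} (n+1)^{k|\mathcal{X}|}\, e^{-nD^{(k)}(E\|q)}$$ (67)

where (a) follows from the upper bound (18) and (b) follows from $|E\cap\Delta_n| \le |\Delta_n| \le (n+1)^{|\mathcal{X}|}$.

By assumption $E$ is closed and convex, so

$$p^{(k)*} = \arg\min_{p\in E} D^{(k)}(p\|q)$$ (68)

is the k-step I-projection of $q$ on $E$ (12). We find an asymptotic (in $n \to \infty$) lower bound. Using (7) and (10) we write

$$P_n^{(k)}(E|q) = \sum_{p\in E\cap\Delta_n} \sum_{(w_i)\in\Delta_n^{k-1}} \prod_{i=1}^{k} P(w_i|w_{i-1})$$ (69)
$$\overset{(a)}{\ge} \frac{1}{(n+1)^{k|\mathcal{X}|}} \sum_{p\in E\cap\Delta_n} \sum_{(w_i)\in\Delta_n^{k-1}} e^{-n\sum_{i=1}^{k} D(w_i\|w_{i-1})}$$ (70)
$$\ge \frac{1}{(n+1)^{k|\mathcal{X}|}}\, e^{-n\min_{p\in E\cap\Delta_n,\ (w_i)\in\Delta_n^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})}$$ (71)

with $w_0 = q$, $w_k = p$, $(w_i) = (w_1,\ldots,w_{k-1})$, where (a) follows from the lower bound $P_n(p|q) \ge e^{-nD(p\|q)}/(n+1)^{|\mathcal{X}|}$ (see [1], Theorem 11.1.4).

By assumption $E \cap \Delta_n$ is non-empty for sufficiently large $n$, so the lower bound (71) is not vacuous. We can then find a sequence $(p_n)_n$, $p_n \in E \cap \Delta_n$, with $\lim_{n\to\infty} D(p_n\|p^{(k)*}) = 0$ and an n-indexed sequence $(v_{n,i})_{i=1}^{k-1}$ such that $v_{n,i} \in \Delta_n$, $D(v_{n,i}\|w_i) \to 0$ as $n \to \infty$. Such a sequence exists because $\cup_n \Delta_n$ is dense in $\Delta$. Recycling the argument in (52) we conclude

$$\lim_{n\to\infty} D(v_{n,i}\|v_{n,i-1}) = D(w_i\|w_{i-1})$$ (72)

where $w_k \triangleq p^{(k)*}$. Then

$$\liminf_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(E|q) \ge \liminf_{n\to\infty}\left(-\frac{k|\mathcal{X}|\log(n+1)}{n} - \min_{p\in E\cap\Delta_n}\ \min_{(w_i)\in\Delta_n^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})\right)$$ (73)
$$= \liminf_{n\to\infty}\left(-\min_{p\in E\cap\Delta_n}\ \min_{(w_i)\in\Delta_n^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1})\right)$$ (74)
$$\ge \liminf_{n\to\infty}\left(-D(p_n\|v_{n,k-1}) - \sum_{i=1}^{k-1} D(v_{n,i}\|v_{n,i-1})\right)$$ (75)
$$\overset{(a)}{=} -\sum_{i=1}^{k} D(w_i\|w_{i-1})$$ (76)
$$= -D^{(k)}(E\|q)$$ (77)

where (a) follows by (72).

Recalling the upper bound (20) we conclude

$$\lim_{n\to\infty} \frac{1}{n}\log P_n^{(k)}(E|q) = -D^{(k)}(p^{(k)*}\|q)$$ (78)

Equivalently, $P_n^{(k)}(E|q) \doteq_n e^{-nD^{(k)}(E\|q)}$. □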

6) (Conditional limit theorem)

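We follow closely the proof for the conditional limit theorem in [1]. Let $(w_1^*,\ldots,w_{k-1}^*)$ be the optimal path (15) attaining $D^{(k)}(p^{(k)*}\|q)$ and let $p^{(k)*} = \arg\min_{p\in E} D^{(k)}(p\|q)$ (12) be the k-step I-projection of $q$ on $E$ ($p^{(k)*}$ exists and is unique since $E$ is closed and convex by assumption). Write $w_0^* \triangleq q$ and $w_k^* \triangleq p^{(k)*}$.

Next define the set

$$S_t \triangleq \left\{(w_1,\ldots,w_{k-1},p) \in \Delta^k : \sum_{i=1}^{k} D(w_i\|w_{i-1}) \le t\right\}$$ (79)

Let

$$D^* = \sum_{i=1}^{k} D_i^* = D^{(k)}(E\|q)$$ (80)

where $D_i^*$ is the $i$-th term $D(w_i^*\|w_{i-1}^*)$ of the optimal sum. Now define the set

$$A \triangleq S_{D^*+2\delta} \cap (\Delta^{k-1}\times E)$$ (81)

where the last term ensures that $(w_i)_{i=1}^{k} \in A \Rightarrow w_k \in E$. Define the set

$$B \triangleq (\Delta^{k-1}\times E) - A$$ (82)

Thus $A \cup B = \Delta^{k-1} \times E$. In the summations below, $v = (w_1,\ldots,w_{k-1},w_k) \in \Delta_n^{k-1}\times(E\cap\Delta_n)$. Then

$$P_n^{(k)}(B|q) = \sum_{v:\ \sum_{i=1}^{k} D(w_i\|w_{i-1}) > D^*+2\delta}\ \prod_{i=1}^{k} P(w_i|w_{i-1})$$ (83)
$$\overset{(a)}{\le} \sum_{v:\ \sum_{i=1}^{k} D(w_i\|w_{i-1}) > D^*+2\delta} e^{-n\sum_{i=1}^{k} D(w_i\|w_{i-1})}$$ (84)
$$\le \sum_{v:\ \sum_{i=1}^{k} D(w_i\|w_{i-1}) > D^*+2\delta} e^{-n(D^*+2\delta)}$$ (85)
$$\overset{(b)}{\le} (n+1)^{k|\mathcal{X}|}\, e^{-n(D^*+2\delta)}$$ (86)

where (a) follows from the upper bound (18) in the case $k = 1$ and (b) follows since $|\Delta_n^{k-1}\times(E\cap\Delta_n)| \le |\Delta_n|^k \le (n+1)^{k|\mathcal{X}|}$.

In the summations below, $v = (w_1,\ldots,w_{k-1},w_k) \in \Delta_n^{k-1}\times(E\cap\Delta_n)$. We write

$$P_n^{(k)}(A|q) \overset{(a)}{\ge} P_n^{(k)}(A\cap S_{D^*+\delta}\,|\,q)$$ (87)
$$= \sum_{v:\ \sum_{i=1}^{k} D(w_i\|w_{i-1}) \le D^*+\delta}\ \prod_{i=1}^{k} P(w_i|w_{i-1})$$ (88)
$$\overset{(b)}{\ge} \sum_{v:\ \sum_{i=1}^{k} D(w_i\|w_{i-1}) \le D^*+\delta} \frac{e^{-n\sum_{i=1}^{k} D(w_i\|w_{i-1})}}{(n+1)^{k|\mathcal{X}|}}$$ (89)
$$\overset{(c)}{\ge} \frac{e^{-n(D^*+\delta)}}{(n+1)^{k|\mathcal{X}|}} \quad \text{for sufficiently large } n$$ (90)

where in (a) we used $S_{D^*+\delta} \subset S_{D^*+2\delta}$, in (b) we used the lower bound $P_n(p|q) \ge e^{-nD(p\|q)}/(n+1)^{|\mathcal{X}|}$ (see [1], Theorem 11.1.4), and (c) follows since $S_{D^*+\delta} \cap (\Delta^{k-1}\times E) \cap \Delta_n^k$ is nonempty for $n$ sufficiently large (since $E \cap \Delta_n \ne \emptyset$ for $n$ sufficiently large by assumption).

Now recall the random variables $(X_t^{(i)})_{t\in\{1,\ldots,n\}}$ defined in the statement of Theorem 1.6) and define the empirical distributions $\hat{w}_i \in \Delta_n$

$$\hat{w}_i[x] \triangleq \frac{1}{n}\left|\{t : x_t^{(i)} = x\}\right|$$ (91)

for all $x \in \mathcal{X}$, $i \in \{1,\ldots,k\}$. Then for $n$ sufficiently large

$$P\left((\hat{w}_1,\ldots,\hat{w}_k) \in B \,\middle|\, \hat{w}_k \in E\right) = \frac{P_n^{(k)}(B\cap(\Delta^{k-1}\times E))}{P_n^{(k)}(\Delta^{k-1}\times E)}$$ (92)
$$\le \frac{P_n^{(k)}(B)}{P_n^{(k)}(A)}$$ (93)
$$\overset{(a)}{\le} \frac{(n+1)^{k|\mathcal{X}|}\, e^{-n(D^*+2\delta)}}{e^{-n(D^*+\delta)}\,(n+1)^{-k|\mathcal{X}|}}$$ (94)
$$= (n+1)^{2k|\mathcal{X}|}\, e^{-n\delta}$$ (95)

where (a) follows from (86) and (90). The above quantity goes to 0 as $n \to \infty$, so $P\left((\hat{w}_1,\ldots,\hat{w}_k) \in A \,\middle|\, \hat{w}_k \in E\right) \to 1$ as $n \to \infty$.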

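Next let’s show that elements of $A$ are close to $(w_1^*,\ldots,w_k^*)$ in $L_1$ norm. This follows from the strict joint convexity of the function

$$\Lambda\left((w_i)_{i=1}^{k}\right) \triangleq \sum_{i=1}^{k} D(w_i\|w_{i-1})$$ (96)

on the compact, convex set $\Delta^{k-1} \times E$. We state a lemma:

Lemma 4

Let $f(x)$ be jointly strictly convex in $x \in C$, where $C$ is a compact, convex subset of $\mathbb{R}^d$, and let $x^* \triangleq \arg\min_{x\in C} f(x)$. For $\delta > 0$ denote the sublevel set

$$T_\delta \triangleq \{x \in C : f(x) - f(x^*) \le \delta\}$$ (97)

Then

$$\lim_{\delta\to 0}\ \max_{x\in T_\delta} \|x - x^*\|_1 = 0$$ (98)

Proof

Suppose (98) does not hold. Then there exists $\varepsilon > 0$ such that for all $\delta > 0$

$$\max_{x\in T_\delta} \|x - x^*\|_1 \ge \varepsilon$$ (99)

Let $T' \triangleq \{x \in C : \|x - x^*\|_1 \ge \varepsilon\}$ and $\delta_\varepsilon \triangleq \min_{x\in T'} f(x) - f(x^*) > 0$, where the min exists by compactness of $T'$ and the inequality follows since $x \in T' \Rightarrow x \ne x^*$ and $f(x)$ is jointly strictly convex. Then

$$x \in T_{\delta_\varepsilon/2} \Rightarrow f(x) - f(x^*) \le \delta_\varepsilon/2 < \delta_\varepsilon$$ (100)
$$\Rightarrow x \notin T'$$ (101)
$$\Rightarrow \|x - x^*\|_1 < \varepsilon$$ (102)

This contradicts (99), so (98) holds. ■

The function $\Lambda((w_i)_{i=1}^{k})$ is jointly strictly convex in $(w_i)_{i=1}^{k}$ with compact, convex support $\Delta^{k-1} \times E$ and attains its minimum $\Lambda((w_i^*)_{i=1}^{k}) = D^{(k)}(E\|q)$, so by Lemma 4

$$\lim_{\delta\to 0}\ \max_{(w_i)_{i=1}^{k}\in A} \left\|(w_i)_{i=1}^{k} - (w_i^*)_{i=1}^{k}\right\|_1 = 0$$ (103)

Now recalling our earlier conclusion that for all $\delta > 0$, $P\left((\hat{w}_1,\ldots,\hat{w}_k) \in A \,\middle|\, \hat{w}_k \in E\right) \to 1$ as $n \to \infty$, and using (103), we conclude that for all $\varepsilon > 0$, all $i \in \{1,\ldots,k\}$, and all $x \in \mathcal{X}$

$$\lim_{n\to\infty} P\left(|\hat{w}_i[x] - w_i^*[x]| \ge \varepsilon \,\middle|\, \hat{w}_k \in E\right) = 0$$ (104)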

B. Proof of Theorem 2

Proof

The proof of the local optimality condition (24) is given after the statement of this Theorem.

Now let’s prove the geometric lower bound (25). We first state a lemma.

Lemma 5

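Let $p, q \in \Delta$, $D(p\|q) < \infty$, $w^*(p, q)$ be as in (23), and $\lambda^*$ the value of the Lagrange multiplier so that $w^*(p, q)$ is normalized. Then

$$\lambda^* \le 1$$ (105)

Proof

First define the function

$$Z_\lambda(p,q) \triangleq \sum_{x\in\mathcal{X}} q[x]\, f_\lambda\!\left(\frac{p[x]}{q[x]}\right)$$ (106)

where

$$f_\lambda(t) \triangleq \begin{cases} -\dfrac{t}{W(te^{\lambda})} + \dfrac{t}{W(e^{\lambda})} & : t > 0 \\ -e^{-\lambda} & : t = 0 \end{cases}$$ (107)

We can check that $f_\lambda(\cdot)$ is convex and $f_\lambda(1) = 0$, so $Z_\lambda$ defines an f-divergence. Now letting $\lambda = \lambda^*$ so that $w^*(p, q)$ (23) is normalized, we write

$$Z_{\lambda^*}(p,q) = \sum_{x\in\mathcal{X}} q[x]\, f_{\lambda^*}\!\left(\frac{p[x]}{q[x]}\right)$$ (108)
$$\overset{(a)}{=} \sum_{x\in\mathcal{X}} \left(-w^*(p,q)[x] + \frac{p[x]}{W(e^{\lambda^*})}\right)$$ (109)
$$= -1 + \frac{1}{W(e^{\lambda^*})}$$ (110)
$$\overset{(b)}{\ge} 0$$ (111)
$$\Rightarrow \lambda^* \le 1$$ (112)

where in (a) we substituted $w^*(p, q)$ (23) and in (b) we used the non-negativity of f-divergences; the last implication holds because $W(e^{\lambda^*}) \le 1$ and $W(e) = 1$ with $W$ increasing. ■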

Next let’s state another lemma.

Lemma 6

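Let $p, q \in \Delta$, $D(p\|q) < \infty$ and $w^*(p, q)$ be as in (23). Then

$$w^*(p,q)[x] \ge \sqrt{p[x]\,q[x]}$$ (113)

for all $x \in \mathcal{X}$.

Proof

Suppose there exists some $x \in \mathcal{X}$ such that (113) does not hold, so $w^*(p,q)[x] < \sqrt{p[x]\,q[x]}$. This implies $p[x], q[x] > 0$, so

$$w^*(p,q)[x] = \frac{p[x]}{W\!\left(e^{\lambda^*} p[x]/q[x]\right)} < \sqrt{p[x]\,q[x]}$$ (114)
$$\Rightarrow \lambda^* > \sqrt{\frac{p[x]}{q[x]}} - \log\left(\sqrt{\frac{p[x]}{q[x]}}\right) \ge 1$$ (115)

This contradicts Lemma 5, so the statement holds. ■

Lemma 6 implies

$$\log\left(w_i^{(k)}[x]\right) = \log\left(w^*(w_{i+1}^{(k)}, w_{i-1}^{(k)})[x]\right)$$ (116)
$$\ge \frac{1}{2}\log\left(w_{i+1}^{(k)}[x]\right) + \frac{1}{2}\log\left(w_{i-1}^{(k)}[x]\right)$$ (117)

Thus $\log w_i^{(k)}[x]$ is concave in $i$, so

$$\log\left(w_i^{(k)}[x]\right) \ge \frac{i}{k}\log\left(w_k^{(k)}[x]\right) + \frac{k-i}{k}\log\left(w_0^{(k)}[x]\right)$$ (118)
$$= \frac{i}{k}\log(p[x]) + \frac{k-i}{k}\log(q[x])$$ (119)

so (25) follows. ■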

C. Proof of Theorem 3

Proof

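Let $v_t$ denote the continuous path (31)

$$v_t \triangleq \frac{1}{Z_{\tau_t}} \left( \tau_t \sqrt{p} + (1 - \tau_t) \sqrt{q} \right)^2$$ (120)

where $Z_{\tau_t}$ and $\tau_t$ are as in the statement of Theorem 3. Let $v_i^{(k)}$ denote the continuous path (31) sampled at k + 1 uniformly spaced points in time:

$$\left( v_i^{(k)} \right)_{i=0}^k \triangleq \left( \frac{1}{Z_{\tau_t}} \left( \tau_t \sqrt{p} + (1 - \tau_t) \sqrt{q} \right)^2 \right)_{t \in \{0, \frac{1}{k}, \frac{2}{k}, \dots, 1\}}$$ (121)

Let $(w_i^{(k)})_{i=0}^k$ denote the optimal k-step path (15). Then $v_0^{(k)} = w_0^{(k)} = q$ and $v_k^{(k)} = w_k^{(k)} = p$. For 1 ≤ i ≤ k denote by $\phi_i^{(k)}$ and $\theta_i^{(k)}$ the angles between (the square roots of) adjacent distributions on the paths $(v_i^{(k)})_{i=0}^k$ and $(w_i^{(k)})_{i=0}^k$, respectively:

$$\phi_i^{(k)} \triangleq \arccos\left( \sqrt{v_i^{(k)}} \cdot \sqrt{v_{i-1}^{(k)}} \right) = \frac{1}{k} \theta_{p,q}$$ (122)
$$\theta_i^{(k)} \triangleq \arccos\left( \sqrt{w_i^{(k)}} \cdot \sqrt{w_{i-1}^{(k)}} \right)$$ (123)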
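The equal-angle property stated in (122) can be checked numerically. The sketch below samples the path (120) using the expressions for $\tau_t$ and $Z_{\tau_t}$ given in footnotes 11 and 10, and compares each step angle with $\theta_{p,q}/k$. It assumes p ≠ q with full support, and the helper names `sqrt_angle`, `geodesic_point` and `sampled_path` are hypothetical.

```python
import math

def sqrt_angle(a, b):
    """Angle between the unit vectors sqrt(a) and sqrt(b)."""
    c = sum(math.sqrt(ax * bx) for ax, bx in zip(a, b))
    return math.acos(max(-1.0, min(1.0, c)))

def geodesic_point(p, q, t):
    """Point v_t of (120), with tau_t and Z taken from footnotes 11 and 10."""
    theta = sqrt_angle(p, q)  # assumes p != q, so theta > 0
    tau = 0.5 * (1.0 + math.tan((t - 0.5) * theta) / math.tan(0.5 * theta))
    overlap = sum(math.sqrt(px * qx) for px, qx in zip(p, q))
    z = tau ** 2 + (1.0 - tau) ** 2 + 2.0 * tau * (1.0 - tau) * overlap
    return [(tau * math.sqrt(px) + (1.0 - tau) * math.sqrt(qx)) ** 2 / z
            for px, qx in zip(p, q)]

def sampled_path(p, q, k):
    """The k + 1 samples v_0 = q, ..., v_k = p of (121)."""
    return [geodesic_point(p, q, i / k) for i in range(k + 1)]

p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
k = 8
theta = sqrt_angle(p, q)
path = sampled_path(p, q, k)
steps = [sqrt_angle(path[i], path[i - 1]) for i in range(1, k + 1)]
print("theta / k =", theta / k)
print("step angles:", steps)
```

Normalizing by the closed-form $Z_{\tau_t}$ rather than by the computed sum doubles as a check of footnote 10: every sampled point should still sum to one.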

Next let

$$\mu \triangleq \min_{x \in \mathrm{Support}(q)} \min(p[x], q[x]) > 0$$ (124)

We have μ > 0 since D(p‖q) < ∞ and D(q‖p) < ∞ by assumption, so Support(p) = Support(q).

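Now apply the series expansion for the K-L divergence (27) to $(v_i^{(k)})_{i=0}^k$ and $(w_i^{(k)})_{i=0}^k$ and sum over i ∈ {1,…, k}:

$$\sum_{i=1}^k D\left( v_i^{(k)} \,\|\, v_{i-1}^{(k)} \right) = \sum_{i=1}^k 2 \left( \left(\phi_i^{(k)}\right)^2 + O\!\left( \left(\phi_i^{(k)}\right)^4 \right) + \sum_{x \in \mathcal{X}} O\!\left( \frac{\left(\phi_i^{(k)}\right)^3}{\sqrt{v_i^{(k)}[x]}} \right) \right)$$ (125)
$$= \sum_{i=1}^k \left( 2 \left(\phi_i^{(k)}\right)^2 + O\!\left( \left(\phi_i^{(k)}\right)^3 \right) \right)$$ (126)

where the second equality comes from the i- and k-independent lower bound

$$\min_{x \in \mathcal{X}} \min_{t \in [0, 1]} v_t[x] \ge \mu \triangleq \min_{x \in \mathcal{X}} \min(p[x], q[x]) > 0$$ (127)

Since D(p‖q) < ∞ and D(q‖p) < ∞ by assumption, we may assume p[x], q[x] > 0 for all $x \in \mathcal{X}$ (since otherwise we could restrict attention to $\mathrm{Support}(p) = \mathrm{Support}(q) \subseteq \mathcal{X}$). Thus the last inequality in (127) follows. Similarly we obtain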

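$$\sum_{i=1}^k D\left( w_i^{(k)} \,\|\, w_{i-1}^{(k)} \right) = \sum_{i=1}^k 2 \left( \left(\theta_i^{(k)}\right)^2 + O\!\left( \left(\theta_i^{(k)}\right)^4 \right) + \sum_{x \in \mathcal{X}} O\!\left( \frac{\left(\theta_i^{(k)}\right)^3}{\sqrt{w_i^{(k)}[x]}} \right) \right)$$ (128)
$$= \sum_{i=1}^k \left( 2 \left(\theta_i^{(k)}\right)^2 + O\!\left( \left(\theta_i^{(k)}\right)^3 \right) \right)$$ (129)

where the second equality follows from the i- and k-independent lower bound

$$\min_{x \in \mathcal{X}} \min_{i \in \{0, \dots, k\}} w_i^{(k)}[x] \ge \min_{x \in \mathcal{X}} \min_{i \in \{0, \dots, k\}} p[x]^{i/k}\, q[x]^{(k-i)/k} \ge \mu > 0$$ (130)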

where the first inequality follows from Theorem 2, (25).
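The leading-order behavior in (125)-(126) can be illustrated numerically: with equal step angles $\theta_{p,q}/k$, the summed one-step divergences along the sampled great-circle arc should behave like $2\theta_{p,q}^2/k$. The sketch below is an illustration under assumptions, not part of the proof. It parametrizes the arc by standard spherical linear interpolation between $\sqrt{q}$ and $\sqrt{p}$ (an equivalent form assumed here in place of (120)), takes p and q with full support, and uses hypothetical function names.

```python
import math

def kl(a, b):
    """K-L divergence D(a||b) in nats for strictly positive vectors."""
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b))

def great_circle_path(p, q, k):
    """k + 1 equally spaced points on the arc from sqrt(q) to sqrt(p)."""
    sp = [math.sqrt(x) for x in p]
    sq = [math.sqrt(x) for x in q]
    theta = math.acos(sum(a * b for a, b in zip(sp, sq)))
    path = []
    for i in range(k + 1):
        t = i / k
        cq = math.sin((1.0 - t) * theta) / math.sin(theta)
        cp = math.sin(t * theta) / math.sin(theta)
        # Square the interpolated unit vector to get back a distribution.
        path.append([(cq * b + cp * a) ** 2 for a, b in zip(sp, sq)])
    return path, theta

def scaled_cost(p, q, k):
    """k times the summed one-step divergences, and the value 2 * theta^2."""
    path, theta = great_circle_path(p, q, k)
    total = sum(kl(path[i], path[i - 1]) for i in range(1, k + 1))
    return k * total, 2.0 * theta ** 2

p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
for k in (10, 100, 1000):
    cost, limit = scaled_cost(p, q, k)
    print(k, cost, limit)
```

The first printed column after k should approach the second as k grows, which is the scaling that the limits in the remainder of the proof rely on.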

Now we write

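$$\lim_{k \to \infty} k \sum_{i=1}^k \left( D\left( w_i^{(k)} \,\|\, w_{i-1}^{(k)} \right) - D\left( v_i^{(k)} \,\|\, v_{i-1}^{(k)} \right) \right) = \lim_{k \to \infty} 2k \sum_{i=1}^k \left( \left(\theta_i^{(k)}\right)^2 - \left(\phi_i^{(k)}\right)^2 \right)$$ (131)
$$= \lim_{k \to \infty} 2k \sum_{i=1}^k \left( \left(\theta_i^{(k)}\right)^2 - \frac{\theta_{p,q}^2}{k^2} \right)$$ (132)
$$\ge 0$$ (133)

where the inequality follows since the path $v_t$ traces (in square-root coordinates) the arc of the great circle joining $\sqrt{q}$ and $\sqrt{p}$ at constant angular speed: by the triangle inequality on the sphere, $\sum_{i} \theta_i^{(k)} \ge \theta_{p,q}$, and by the Cauchy-Schwarz inequality the uniform spacing $\phi_i^{(k)} = \frac{1}{k} \theta_{p,q}$ minimizes the sum of squared angles subject to their sum $\sum_{i} \phi_i^{(k)} = \theta_{p,q}$. On the other hand, by the optimality of $(w_i)_{i=1}^k$

$$\lim_{k \to \infty} k \sum_{i=1}^k \left( D\left( w_i^{(k)} \,\|\, w_{i-1}^{(k)} \right) - D\left( v_i^{(k)} \,\|\, v_{i-1}^{(k)} \right) \right) \le 0$$ (134)

Thus, combining the above inequalities,

$$\lim_{k \to \infty} k \sum_{i=1}^k \left( D\left( w_i^{(k)} \,\|\, w_{i-1}^{(k)} \right) - D\left( v_i^{(k)} \,\|\, v_{i-1}^{(k)} \right) \right) = 0$$ (135)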

We then have

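$$\sum_{i=1}^k \left( D\left( w_i^{(k)} \,\|\, w_{i-1}^{(k)} \right) - D\left( v_i^{(k)} \,\|\, v_{i-1}^{(k)} \right) \right) = \sum_{i=1}^k \left( O\!\left( \left(\theta_i^{(k)}\right)^3 \right) + O\!\left( \left(\phi_i^{(k)}\right)^3 \right) \right)$$ (136)
$$= k\, O(1/k^3) = O(1/k^2)$$ (137)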

[why θ(k) = O(1/k)?]

[Now use joint strict convexity of $\sum_{i=1}^k D(w_i \| w_{i-1})$ on the compact, convex set $\Delta^{k-1} \times E$ to complete proof?] ■

Footnotes

1

$\sqrt{p}$ denotes the vector with components $\sqrt{p[x]}$.

3

That is, $a_n \doteq b_n \iff \lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$.

4

Convergence in probability, as in the statement of Theorem 1.6) below.

5

We use λ − 1 rather than λ to make the answer look more compact.

6

W(z) satisfies $z = W(z) e^{W(z)}$, and W(z) ⩾ −1 corresponds to the principal branch.

7

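This means finding a root of the smooth, monotonic function of λ given by $\sum_{x \in \mathcal{X}} p[x] / W\!\left( e^{\lambda} p[x] / q[x] \right) - 1$, with the knowledge that λ* ∈ [0, 1], so the root is straightforward to locate numerically (for example by bisection).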

8

The f-divergence is $D_f(p, q) \triangleq \sum_{x \in \mathcal{X}} q[x]\, f(p[x]/q[x])$ with $f(t) = -t / W(t e^{\lambda}) + t / W(e^{\lambda})$. f(t) is convex and f(1) = 0, so $D_f(p, q)$ is an f-divergence.

10

We can check that $Z_{\tau_t} = \tau_t^2 + (1 - \tau_t)^2 + 2 \tau_t (1 - \tau_t) \left( \sqrt{p} \cdot \sqrt{q} \right)$.

11

We can check that $\tau_t = \frac{1}{2} \left( 1 + \frac{\tan((t - 1/2)\theta)}{\tan(\theta/2)} \right)$ where $\theta = \theta_{p,q}$.

12

The set of distributions $\{q_{XY} : q_{XY} = q_X q_Y\}$ is log-convex, and letting $q_{XY}^{(k)}$ be the “reverse I-projection” of $w_{k-1}^{(k)}$ (15) on E, Theorem 1 of [6] proves existence and uniqueness.

13

This ensures particle conservation in the intermediate stage, though we can envision other scenarios with variable sample sizes and demon policies.

14

The intermediate demon must choose an appropriate acceptance set E′ ⊂ E such that the I-projection of q on E′ is $w^* = \arg\min_w D(E\|w) + D(w\|q)$; one possible choice is E′ = {w : D(E‖w) ≤ D(E‖w*)}.

Contributor Information

Dmitri S. Pavlichin, Stanford University

Tsachy Weissman, Stanford University.

References

  • 1. Cover TM, Thomas JA. Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2006.
  • 2. Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
  • 3. Crow JF. Wright and Fisher on inbreeding and random drift. Genetics. 2010;184:609–611. doi: 10.1534/genetics.109.110023.
  • 4. Sanov IN. On the probability of large deviations of a random variable. Mat Sbornik. 1957;42:11–44.
  • 5. Csiszár I. Sanov property, generalized I-projection and a conditional limit theorem. The Annals of Probability. 1984;12:768–793.
  • 6. Csiszár I, Matúš F. Information projections revisited. IEEE Transactions on Information Theory. 2003;49:1474–1490.
  • 7. Papangelou F. The large deviations of a multi-allele Wright-Fisher process mapped on the sphere. The Annals of Applied Probability. 2000;10:1259–1273.
  • 8. Pavlichin DS, Weissman T. Chained Kullback-Leibler divergences. doi: 10.1109/ISIT.2016.7541365. www.stanford.edu/~dmitrip/chaineddivergences.pdf, in preparation.
  • 9. Boyd S, Vandenberghe L. Convex Optimization. Cambridge, UK: Cambridge University Press; 2009.
  • 10. Hájek J. Asymptotically most powerful rank-order tests. The Annals of Mathematical Statistics. 1962;33:1124–1147.
  • 11. Le Cam L. On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics. 1970;41:802–828.
  • 12. Pollard D. Another look at differentiability in quadratic mean. In: Pollard D, Torgersen E, Yang GL, editors. Festschrift for Lucien Le Cam. New York: Springer; 1997. pp. 305–314. ch 9.
  • 13. Amari S, Nagaoka H. Methods of Information Geometry. American Mathematical Society; 2000.
