Abstract
We define and characterize the “chained” Kullback-Leibler divergence minw D(p‖w) + D(w‖q) minimized over all intermediate distributions w and the analogous k-fold chained K-L divergence min D(p‖wk−1) + … + D(w2‖w1) + D(w1‖q) minimized over the entire path (w1,…,wk−1). This quantity arises in a large deviations analysis of a Markov chain on the set of types – the Wright-Fisher model of neutral genetic drift: a population with allele distribution q produces offspring with allele distribution w, which then produce offspring with allele distribution p, and so on.
The chained divergences enjoy some of the same properties as the K-L divergence (like joint convexity in the arguments) and appear in k-step versions of some of the same settings as the K-L divergence (like information projections and a conditional limit theorem). We further characterize the optimal k-step “path” of distributions appearing in the definition and apply our findings in a large deviations analysis of the Wright-Fisher process. We make a connection to information geometry via the previously studied continuum limit, where the number of steps tends to infinity, and the limiting path is a geodesic in the Fisher information metric.
Finally, we offer a thermodynamic interpretation of the chained divergence (as the rate of operation of an appropriately defined Maxwell’s demon) and we state some natural extensions and applications (a k-step mutual information and k-step maximum likelihood inference). We release code for computing the objects we study.
I. Introduction
We investigate the properties and applications of the k-fold chained Kullback-Leibler divergence between distributions p and q:
$$D^{(k)}(p\|q) \triangleq \min_{w_1,\dots,w_{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1}) \tag{1}$$
where w0 ≜ q, wk ≜ p, D(·‖·) denotes the Kullback-Leibler divergence
$$D(p\|q) \triangleq \sum_{x} p[x]\,\log\frac{p[x]}{q[x]} \tag{2}$$
and the minimum, if it exists, is taken over a “path” of distributions wi in some family of distributions. k counts the number of hops and D(1)(p‖q) ≜ D(p‖q).
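As a concrete reading of definitions (1) and (2), the following minimal Python sketch evaluates the K-L divergence and the cost of a candidate path. The function names `kl` and `path_cost` and the example distributions are ours, chosen for illustration; they are not taken from the authors' released code. The sketch only evaluates the objective for a given path and does not perform the minimization.

```python
import math

def kl(p, q):
    """K-L divergence D(p||q) in nats; returns inf if p[x] > 0 while q[x] = 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0.0:
            if qx == 0.0:
                return math.inf
            total += px * math.log(px / qx)
    return total

def path_cost(p, q, path):
    """Objective of (1): sum_i D(w_i || w_{i-1}) with w_0 = q and w_k = p.

    `path` lists the k-1 intermediate distributions (w_1, ..., w_{k-1}).
    """
    points = [q] + list(path) + [p]
    return sum(kl(points[i], points[i - 1]) for i in range(1, len(points)))

# Hypothetical example distributions on a three-letter alphabet.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
midpoint = [(a + b) / 2 for a, b in zip(p, q)]
print(kl(p, q))                     # one hop (k = 1)
print(path_cost(p, q, [midpoint]))  # two hops through the arithmetic mean
```

Any candidate path gives an upper bound on the chained divergence, so the second printed value bounds the two-step chained divergence from above.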
Just as the K-L divergence appears in statistical applications of information theory and in the method of types [1], so the chained divergence arises in the analysis of stochastic processes with an iterative resampling flavor: in each round one samples from a distribution wi determined by the outcome of the previous round of sampling from wi−1. A motivating example that we consider in detail is the Wright-Fisher model of neutral genetic drift [2], [3], wherein each generation is obtained by sampling with replacement from the previous generation.
This work is devoted to characterizing the chained divergences in the finite alphabet setting and the optimal k-step path appearing in their definition, and to pointing out their applications. Section II defines the chained divergence in tandem with a discussion of the Wright-Fisher process, in which it naturally appears. Section III derives some properties (like joint convexity) of the chained divergence, including several “operationally” meaningful properties characterizing the rate of large deviations, information projections, and a conditional limit theorem; these are very similar to results on the K-L divergence [4], [5], [6]. Section IV-A characterizes the optimal path of intermediate distributions (w1, …, wk−1) for all values of k and states a method for computing this path. Section IV-B considers the continuum limit of k → ∞ (letting the number of distributions in the optimal path tend to infinity) and recapitulates results previously obtained by [7] in this setting: the limiting path is a geodesic in the Fisher information metric; equivalently, part of a great circle connecting $\sqrt{q}$ and $\sqrt{p}$ (footnote 1). Section V considers extensions and other applications of the chained divergences – in particular a k-step version of the mutual information related to the likelihood that two independent genetic loci become dependent through neutral genetic drift in k generations. Finally, we offer a thermodynamic interpretation of the chained divergence as the optimal rate of operation of an appropriately defined Maxwell’s demon.
We release code for computing the objects we study (footnote 2). An Appendix contains the details of some of the proofs omitted here due to space constraints and is available at [8].
II. The Wright-Fisher Markov chain
We define the Wright-Fisher (“iterative resampling”) process and review briefly some results about the K-L divergence that we extend to the chained divergences (1).
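Denote by Δ the set of distributions on a finite alphabet $\mathcal{X}$. Denote by Δn the set of distributions with integer denominator n for each component:
$$\Delta_n \triangleq \{\, p \in \Delta : n\,p[x] \in \mathbb{Z} \text{ for all } x \in \mathcal{X} \,\} \tag{3}$$
Δn is the set of types [1], equivalently the possible empirical distributions of n $\mathcal{X}$-valued samples.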
Consider the Markov chain whose state space is the set of types Δn (3) defined by the following rule: Let the current state of the chain be q ∈ Δn. Draw n samples i.i.d. from q and let p ∈ Δn be their empirical distribution; p is the next state of the chain. This is the Wright-Fisher model [2], [3] of neutral genetic drift among alleles in a population of size n: each of the n individuals in a generation is obtained by cloning a uniformly chosen (hence “neutral”) individual from the previous generation. The process is illustrated in Figure 1. As time increases, alleles “die out” (once q[x] = 0, no samples of x can be observed again) until only one allele remains, or “fixes” (that is, q[x] = 1 for some $x \in \mathcal{X}$), so the vertices of the simplex Δ are absorbing states. We will study fluctuations of this chain before fixation.
Fig. 1.

(left) The Wright-Fisher process, a Markov chain on the set of types. (right) A trajectory of the Wright-Fisher process for alleles plotted on the simplex. Eventually only one allele remains (“fixes”).
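The resampling rule described above can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`wright_fisher_step`, `simulate`) and with a hypothetical initial distribution and population size; it is not the authors' released code.

```python
import random
from collections import Counter

def wright_fisher_step(q, n, rng):
    """One generation: draw n i.i.d. samples from q and return their type in Delta_n."""
    alphabet = range(len(q))
    sample = rng.choices(alphabet, weights=q, k=n)
    counts = Counter(sample)
    return [counts[x] / n for x in alphabet]

def simulate(q, n, generations, seed=0):
    """Run the chain from q, stopping early once a single allele has fixed."""
    rng = random.Random(seed)
    trajectory = [list(q)]
    for _ in range(generations):
        if max(trajectory[-1]) == 1.0:  # a vertex of the simplex is absorbing
            break
        trajectory.append(wright_fisher_step(trajectory[-1], n, rng))
    return trajectory

# Hypothetical example: three alleles, population size n = 20.
for state in simulate([0.5, 0.3, 0.2], n=20, generations=50, seed=1):
    print(state)
```

Every state after the first has components that are multiples of 1/n, and an allele whose frequency reaches 0 never reappears, as in the description of the chain.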
Denote by Pn(p|q) the transition matrix for this Markov chain:
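$$P_n(p|q) \triangleq \Pr(\text{next state is } p \mid \text{current state is } q) \tag{4}$$
$$\phantom{P_n(p|q)} = \frac{n!}{\prod_{x\in\mathcal{X}} (n\,p[x])!}\,\prod_{x\in\mathcal{X}} q[x]^{\,n\,p[x]} \tag{5}$$
$$\phantom{P_n(p|q)} \doteq_n e^{-nD(p\|q)} \tag{6}$$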
where the prefactor in (5) is a multinomial coefficient, ≐n denotes equality to leading exponential order in n (footnote 3), and D(·‖·) denotes the Kullback-Leibler divergence (2). We can evaluate Pn(p|q) for any p, q ∈ Δ and n, but since the state space for the Wright-Fisher process is the set of types Δn (3), n ∈ ℕ, we can think of Pn(p|q) as a |Δn| × |Δn| stochastic matrix.
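The statement that the multinomial probability equals $e^{-nD(p\|q)}$ to leading exponential order can be checked numerically. The sketch below assumes the one-step transition probability is the multinomial probability of the counts $n\,p[x]$ under q; the helper names `kl` and `log_transition_prob` and the example values are ours.

```python
import math

def kl(p, q):
    """K-L divergence D(p||q) in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

def log_transition_prob(counts, q):
    """log of the multinomial probability of `counts` under q (type p = counts / n)."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    log_prod = 0.0
    for c, qx in zip(counts, q):
        if c > 0:
            if qx == 0.0:
                return -math.inf  # impossible to observe x when q[x] = 0
            log_prod += c * math.log(qx)
    return log_coeff + log_prod

q = [0.2, 0.3, 0.5]
base = [7, 2, 1]  # p = (0.7, 0.2, 0.1), hypothetical example
print("D(p||q) =", kl([0.7, 0.2, 0.1], q))
for m in (1, 10, 100, 1000):
    counts = [m * c for c in base]
    n = sum(counts)
    print(n, -log_transition_prob(counts, q) / n)  # approaches D(p||q) from above
```

The normalized exponent $-\frac{1}{n}\log P_n(p|q)$ stays above $D(p\|q)$ and approaches it as n grows; the gap comes from the sub-exponential prefactor.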
Given a set of distributions E ⊂ Δ, we define $P_n(E|q) \triangleq \sum_{p \in E \cap \Delta_n} P_n(p|q)$ as the probability that the chain hops to somewhere in E starting from q. If E ∩ Δn = ∅, then Pn(E|q) ≜ 0. If E is the closure of its interior (so E ∩ Δn is nonempty for all large enough n) then Sanov’s theorem tells us [4], [5]
$$P_n(E|q) \doteq_n e^{-nD(E\|q)} \tag{7}$$
where
$$D(E\|q) \triangleq \min_{p\in E} D(p\|q) = D(p^*\|q) \tag{8}$$
where p* ≜ arg min_{p∈E} D(p‖q) is the I-projection [6] of q on E; the minimum exists and is unique since E is closed (thus compact, since E ⊂ Δ) by assumption and D(p‖q) is strictly convex in p. Sanov’s theorem and the conditional limit theorem tell us [4], [5], [1] that, conditioned on drawing an empirical distribution p ∈ E, p is close (footnote 4) to p* as n → ∞. If E is not closed (so min_{p∈E} D(p‖q) might not exist) but is convex, then there exists a unique distribution p*, the generalized I-projection [5], such that D(p‖q) ⩾ D(p‖p*) + D(E‖q) for all p ∈ E.
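What happens when we iterate the resampling chain? Denote by $P_n^{(k)}(p|q)$ the probability to draw distribution p starting from q in k steps, corresponding to the transition matrix $(P_n)^k$. We recursively express $P_n^{(k)}$, with $P_n^{(1)} \triangleq P_n$:
$$P_n^{(k)}(p|q) = \sum_{w\in\Delta_n} P_n(p|w)\,P_n^{(k-1)}(w|q) \tag{9}$$
$$P_n^{(k)}(E|q) \triangleq \sum_{p\in E\cap\Delta_n} P_n^{(k)}(p|q) \tag{10}$$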
Theorem 1 establishes that the k-fold chained K-L divergence (1) plays the same role for the k-step resampling chain as the K-L divergence plays for one step of the chain, as described above. It is convenient to define recursively the k-step chained divergence D(k) (p‖q) for p, q ∈ Δ:
$$D^{(k)}(p\|q) \triangleq \min_{w\in\Delta}\left[D(p\|w) + D^{(k-1)}(w\|q)\right] \tag{11}$$
with D(1) (p‖q) ≜ D(p‖q), the K-L divergence (2). Note that we optimize over the simplex Δ (rather than the types Δn). Theorem 1 establishes the existence and uniqueness of the minimum and the equivalence of the recursive definition (11) with our earlier definition for the chained divergence (1).
We further define D(k) (E‖q) for a closed, convex set E ⊂ Δ by analogy with (8):
$$D^{(k)}(E\|q) \triangleq \min_{p\in E} D^{(k)}(p\|q) \tag{12}$$
and p(k)* ≜ arg min_{p∈E} D(k)(p‖q) is the I-projection of $w^{(k)*}_{k-1}$ on E, where $w^{(k)*}_{k-1}$ is the next-to-last point in the optimal path (15). If E is convex but min_{p∈E} D(k)(p‖q) does not exist, then there exists a unique distribution p(k)* that is the generalized I-projection of $w^{(k)*}_{k-1}$ on E.
III. Characterizing the k-fold chained K-L divergence
Theorem 1
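Let k be a positive integer.
1) The k-fold chained divergence (11) satisfies, for all p, q ∈ Δ such that D(p‖q) < ∞:
$$D^{(k)}(p\|q) = \min_{w\in\Delta}\left[D(p\|w) + D^{(k-1)}(w\|q)\right] \tag{13}$$
$$\phantom{D^{(k)}(p\|q)} = \min_{(w_1,\dots,w_{k-1})\in\Delta^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1}) \tag{14}$$
with boundary conditions $w_0 = q$, $w_k = p$, and
$$\left(w^{(k)*}_1,\dots,w^{(k)*}_{k-1}\right) \triangleq \arg\min_{(w_1,\dots,w_{k-1})\in\Delta^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1}) \tag{15}$$
where the minimum is over the (k − 1)-fold product set $\Delta^{k-1}$ = Δ × ⋯ × Δ. The minimum exists and is unique. For closed, convex E ⊂ Δ, (12) defines D(k)(E‖q). If D(p‖q) = ∞, then D(k)(p‖q) ≜ ∞ and the optimal path (15) is not defined.
2) Convexity: D(k)(p‖q) is jointly strictly convex in p and q:
$$D^{(k)}\big(\alpha p_1 + (1-\alpha)p_2 \,\big\|\, \alpha q_1 + (1-\alpha)q_2\big) < \alpha D^{(k)}(p_1\|q_1) + (1-\alpha) D^{(k)}(p_2\|q_2) \tag{16}$$
for all distributions (p1, q1) ≠ (p2, q2), α ∈ (0,1).
3) Scaling in k: Suppose 0 < D(p‖q) < ∞. Then D(k)(p‖q) is strictly decreasing in k and tends to 0 as k → ∞. Theorem 3 of Section IV-B (on the continuum limit k → ∞) gives the more precise scaling:
| (17) |
where θ is the angle between the vectors $\sqrt{p}$ and $\sqrt{q}$ (see Section IV-B for details).
4) Markov chain transition matrix (9), upper bound for p ∈ Δ:
| (18) |
Moreover, for p ∈ Δ and a sequence (pn)n∈ℕ, pn ∈ Δn, such that limn→∞ D(pn‖p) = 0,
| (19) |
Equivalently, $P_n^{(k)}(p_n|q) \doteq_n e^{-nD^{(k)}(p\|q)}$.
5) Sanov’s theorem: Let E ⊂ Δ. The probability (10) to draw a type in E from q ∈ Δ in k steps is upper bounded:
| (20) |
where D(k)(E‖q) is defined in (12). Moreover, if E is closed, convex, and E ∩ Δn ≠ ∅ for sufficiently large n, then
| (21) |
where p(k)* is the k-step I-projection of q on E (12); equivalently, $P_n^{(k)}(E|q) \doteq_n e^{-nD^{(k)}(E\|q)}$.
6) Conditional limit theorem: Let E ⊂ Δ be closed and convex with E ∩ Δn ≠ ∅ for sufficiently large n, q ∈ Δ − E, D(E‖q) < ∞. Let $X^{(i)}_1,\dots,X^{(i)}_n$ be the i-th sample of the Wright-Fisher process (Figure 1): that is, drawn i.i.d. from $\hat{w}_{i-1}$, with $\hat{w}_0 = q$ and $\hat{w}_i$ the empirical distribution of $X^{(i)}_1,\dots,X^{(i)}_n$. Then for all ε > 0, i ∈ {1, …, k},
| (22) |
with $w^{(k)*}_i$ as in (15) and $w^{(k)*}_k = p^{(k)*}$ (the k-step I-projection of q on E (12)).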
Joint convexity 2) follows because minimization of jointly convex functions with respect to some of the arguments over a convex set preserves convexity [9]. Existence and uniqueness of the optimal path 1) follow from joint strict convexity and the compactness of Δ ∩ Support(q). Theorem 3 contains 3). The proofs of 4), 5), 6) all follow closely the proofs of analogous results for the K-L divergence in [1], Chapter 11. See Appendix VI-A in [8] for details of the proofs.
IV. Characterizing the k-fold path
What can we say about the optimal k-fold “path” (15)? In the event of a large deviation, we observe a path close to this one with high probability as n → ∞ (Theorem 1.6)).
A. Finite number of steps k
Let’s start with the case k = 2 and find w*(p, q) = arg min_{w∈Δ} D(p‖w) + D(w‖q). For k > 2, we will then use this local characterization of the path of distributions to compute the optimal path (15).
We set up an optimization problem with Lagrangian $L(w,\lambda) = D(p\|w) + D(w\|q) + (\lambda - 1)\left(\sum_x w[x] - 1\right)$, where λ − 1 is a Lagrange multiplier (footnote 5) enforcing normalization of w. Solving the equations $\partial L/\partial w[x] = 0$ for w* we find:
$$w^*(p,q)[x] = \frac{p[x]}{W\!\left(\dfrac{p[x]}{q[x]}\,e^{\lambda^*}\right)} \tag{23}$$
where W(·) is the principal branch of the Lambert W function (footnote 6) and λ* is chosen so that w* ∈ Δ is normalized (in fact λ* ∈ [0, 1] for all p, q). The case of p[x] = 0 is obtained by analytic continuation of the solution of the case p[x] > 0, which gives $w^*(p,q)[x] = q[x]\,e^{-\lambda^*}$. If D(p‖q) = ∞ then w*(p, q) is not defined. Note that the Lagrange multiplier λ* lives inside the Lambert W function (rather than outside like a partition function prefactor), and so cannot generally be expressed analytically in terms of p and q; one could try to find λ* numerically (footnote 7). Theorem 1.1) tells us that w* is a global minimum.
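The two-step minimizer can be sketched numerically as follows. This is a minimal pure-Python sketch, not the authors' released code. It assumes the stationarity condition of the Lagrangian takes the form $\log(w[x]/q[x]) + \lambda = p[x]/w[x]$, so that $w[x] = p[x]/W\big((p[x]/q[x])e^{\lambda}\big)$, with the limit $q[x]e^{-\lambda}$ when p[x] = 0. The helper names `lambert_w`, `midpoint_given_lambda`, and `two_step_midpoint`, the Newton iteration for W, and the bisection on λ ∈ [0, 1] are our own illustrative choices.

```python
import math

def lambert_w(z):
    """Principal branch W(z) for z >= 0: solves w * exp(w) = z by Newton's method."""
    if z == 0.0:
        return 0.0
    w = math.log1p(z)  # log(1 + z) >= W(z), so Newton descends monotonically
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + w):
            break
    return w

def midpoint_given_lambda(p, q, lam):
    """Unnormalized candidate w[x] for a given multiplier value lam."""
    w = []
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px > 0.0:
                raise ValueError("D(p||q) is infinite; the minimizer is not defined")
            w.append(0.0)
        elif px == 0.0:
            w.append(qx * math.exp(-lam))  # limit of p / W((p/q) e^lam) as p -> 0
        else:
            w.append(px / lambert_w(px / qx * math.exp(lam)))
    return w

def two_step_midpoint(p, q, iters=60):
    """Bisect on lam in [0, 1] until the candidate sums to one."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(midpoint_given_lambda(p, q, mid)) > 1.0:
            lo = mid  # total mass decreases as lam grows, so move right
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return midpoint_given_lambda(p, q, lam), lam

# Hypothetical example distributions.
w_star, lam_star = two_step_midpoint([0.7, 0.2, 0.1], [0.2, 0.3, 0.5])
print(w_star, lam_star)
```

Bisection is sufficient here because the total mass of the candidate is monotonically decreasing in λ (W is increasing), and it is at least 1 at λ = 0 and at most 1 at λ = 1.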
Theorem 2
(local characterization of the optimal k-step path) Let k ⩾ 2 and D(p‖q) < ∞ for p, q ∈ Δ. The optimal path (15) used in defining the chained K-L divergence D(k)(p‖q) satisfies
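$$w^{(k)*}_i = w^*\!\left(w^{(k)*}_{i+1},\; w^{(k)*}_{i-1}\right) \tag{24}$$
$$w^{(k)*}_i[x] \ge p[x]^{\,i/k}\; q[x]^{\,1-i/k} \tag{25}$$
for all $x \in \mathcal{X}$, 1 ≤ i ≤ k − 1, where w*(·,·) is defined in (23) and with boundary conditions $w^{(k)*}_0 = q$, $w^{(k)*}_k = p$. As a corollary of (24), for i ∈ {1,…, k − 1}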
| (26) |
Proof
Suppose (24) does not hold for some p, q, i. Then (14) can be strictly decreased by replacing $w^{(k)*}_i$ with $w^*\big(w^{(k)*}_{i+1}, w^{(k)*}_{i-1}\big)$, contradicting the optimality of the path (15). Thus (24) holds. ■
To derive the geometric lower bound (25) we massage the function w*(p, q) (23) into an f-divergence (footnote 8), then use the non-negativity of f-divergences to bound λ* and derive $w^*(p,q)[x] \ge \sqrt{p[x]\,q[x]}$, and then derive (25). See Appendix VI-B [8] for details.
Remarks
Figure 2 plots the optimal k-step path for k ∈ {2, 6} and in the k → ∞ limit (see Section IV-B). One way to compute the optimal path is to start with some guess (a good initial point is given in Theorem 3) and repeatedly apply (24) until numerical convergence; this is the method used in making Figure 2 and implemented in the code we release (footnote 9).
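The iterative scheme described in these remarks can be sketched as a sweep of local updates, each replacing an interior point by the two-step minimizer between its neighbors. The sketch below is ours and is not the authors' released code: it assumes the two-step minimizer has the Lambert-W form $w[x] = p[x]/W\big((p[x]/q[x])e^{\lambda}\big)$ with λ ∈ [0, 1] found by bisection, and it starts from an arithmetic interpolation between q and p rather than the great-circle guess recommended in the text. All function names and the example distributions are hypothetical.

```python
import math

def kl(p, q):
    """D(p||q) in nats (inf if p[x] > 0 while q[x] = 0)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0.0:
            if qx == 0.0:
                return math.inf
            total += px * math.log(px / qx)
    return total

def lambert_w(z):
    """Principal branch W(z), z >= 0, by Newton's method on w * exp(w) = z."""
    if z == 0.0:
        return 0.0
    w = math.log1p(z)
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + w):
            break
    return w

def two_step_midpoint(p, q, iters=60):
    """arg min_w D(p||w) + D(w||q), bisecting on the multiplier in [0, 1]."""
    def candidate(lam):
        return [0.0 if qx == 0.0
                else qx * math.exp(-lam) if px == 0.0
                else px / lambert_w(px / qx * math.exp(lam))
                for px, qx in zip(p, q)]
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(candidate(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return candidate(0.5 * (lo + hi))

def path_cost(points):
    """Sum of D(w_i || w_{i-1}) along points = [q, w_1, ..., w_{k-1}, p]."""
    return sum(kl(points[i], points[i - 1]) for i in range(1, len(points)))

def chained_path(p, q, k, sweeps=100):
    """Approximate the optimal k-step path by repeated local updates."""
    # Initial guess (an assumption of this sketch): arithmetic interpolation.
    points = [[(1 - i / k) * qx + (i / k) * px for px, qx in zip(p, q)]
              for i in range(k + 1)]
    for _ in range(sweeps):
        for i in range(1, k):
            # points[i + 1] plays the role of p, points[i - 1] the role of q.
            points[i] = two_step_midpoint(points[i + 1], points[i - 1])
    return points

p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
for k in (1, 2, 4):
    print(k, path_cost(chained_path(p, q, k)))
```

Each local update exactly minimizes the objective over one interior point with its neighbors held fixed, so the path cost never increases from sweep to sweep. A fixed number of sweeps stands in for a convergence test here.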
Fig. 2.

(left) k-step path (15) from q to p (corresponding to D(k)(p‖q)) computed by successively applying (24) for k = 2 and k = 6. The green line shows the limiting k → ∞ path (31). The dark grey lines show the arithmetic and (normalized) geometric means of p and q with weights α ∈ [0, 1]. (right) The limiting path as k → ∞, part of a great circle joining $\sqrt{q}$ and $\sqrt{p}$.
The optimal k-step path from q to p is not the same as the optimal path from p to q, an asymmetry inherited from the asymmetry of D(k)(p‖q); but in the limit k → ∞ these two paths converge to the same limiting path (see Section IV-B). From (23) we see that q[x] > 0 ⇒ w*(p, q)[x] > 0, so by (24) $w^{(k)*}_i[x] > 0$ whenever q[x] > 0, for i ∈ {1, …, k − 1}.
B. The limit k → ∞
What happens in the continuum limit, as the number of steps k → ∞? This setting was investigated by [7], who states the limiting path and large deviations rate function. We include this Section to make the story more complete, and to check consistency with and provide a finite-k intuition for the results [7] obtained with variational calculus.
A useful perspective in the following is to map the simplex Δ to the part Ψ of the unit sphere in $\mathbb{R}^{|\mathcal{X}|}$ lying in the non-negative orthant, via the bijection $p \mapsto \sqrt{p}$. This square root reparametrization appears in [10], [11], [12], [13]. Consider the K-L divergence between nearby points on Ψ. That is, let p = q + ε ∈ Δ for a small perturbation ε, and let θ be the angle between $\sqrt{p}$ and $\sqrt{q}$. Then
| (27) |
| (28) |
where in (27) we rewrote the higher order terms in terms of ε, expanded the cos about θ = 0, and where
$$\theta \triangleq \arccos\left(\sum_{x\in\mathcal{X}} \sqrt{p[x]\,q[x]}\right) \tag{29}$$
is the angle between the vectors $\sqrt{p}$ and $\sqrt{q}$, and $\sum_{x} \sqrt{p[x]\,q[x]}$ is the Bhattacharyya coefficient between p and q.
The above recapitulates the familiar fact that the K-L divergence is close to the squared Euclidean distance between nearby distributions [13] (here written in square root space); the leading order term is symmetric in p and q and depends on p and q only through the angle θ. This enables us to guess that the limiting optimal path is a geodesic in the Euclidean metric restricted to the surface of the unit sphere and that the angle between adjacent points is θ/k. Theorem 3 confirms this intuition. See Appendix VI-C [8] for proof details.
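A sketch of the second-order expansion behind this statement, in our own notation: the symbols $\psi_p$, $\psi_q$, and $r$ are introduced here for illustration, the calculation assumes p[x] > 0 wherever q[x] > 0 and vice versa, and the grouping of the remainder terms may differ from the one used in (27) and (28).

```latex
\begin{aligned}
\psi_p[x] &\triangleq \sqrt{p[x]}, \qquad \psi_q[x] \triangleq \sqrt{q[x]}, \qquad
r[x] \triangleq \frac{\psi_q[x]}{\psi_p[x]} - 1, \\
D(p\|q) &= -2\sum_x p[x]\,\log\frac{\psi_q[x]}{\psi_p[x]}
         = -2\sum_x p[x]\,\log\bigl(1 + r[x]\bigr) \\
        &= -2\sum_x p[x]\,r[x] + \sum_x p[x]\,r[x]^2
           + O\Bigl(\sum_x p[x]\,|r[x]|^3\Bigr), \\
\sum_x p[x]\,r[x] &= \sum_x \psi_p[x]\,\psi_q[x] - 1 = \cos\theta - 1, \\
\sum_x p[x]\,r[x]^2 &= \sum_x \bigl(\psi_q[x] - \psi_p[x]\bigr)^2 = 2 - 2\cos\theta, \\
D(p\|q) &= 4\,(1 - \cos\theta) + O\Bigl(\sum_x p[x]\,|r[x]|^3\Bigr)
         = 2\theta^2 + \text{higher-order terms}.
\end{aligned}
```

Under this expansion, a path of k hops with equal angles θ/k between adjacent points has total cost approximately $k \cdot 2(\theta/k)^2 = 2\theta^2/k$, which is the heuristic behind the 1/k scaling of the chained divergence.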
Theorem 3
Let D(p‖q), D(k)(p‖q) be as in (2), (14), respectively. Let p, q ∈ Δ and D(p‖q) < ∞ and D(q‖p) < ∞.
1) Scaling of D(k)(p‖q) in k and θ:
| (30) |
where (as in (29)) $\theta = \arccos \sum_x \sqrt{p[x]\,q[x]}$.
2) The limiting path (15) as k → ∞ is a geodesic in the Euclidean metric restricted to the unit sphere (equivalently, a geodesic in the Fisher information metric on the simplex): part of a great circle connecting $\sqrt{q}$ and $\sqrt{p}$ with constant angular speed. Let t ∈ [0,1], then
| (31) |
in L2 norm as k → ∞, where ⌊x⌋ denotes the floor function, (31) involves a constant (footnote 10), and τt : [0,1] → [0,1] is a reparametrization of “time” t that ensures constant angular speed (footnote 11) on the unit sphere.
Remarks
The condition D(p‖q) < ∞ and D(q‖p) < ∞ ensures that p[x] > 0 ⇔ q[x] > 0, so the series expansion (27) does not diverge and the limits in (30) match. Figure 2 depicts the limiting optimal path for a particular choice of p and q. For a finite number of steps k, the limiting value of the quantity (31) provides a decent initial guess for the iterative computation of the optimal path (15) described in the remarks following the statement of Theorem 2.
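A great-circle initial guess of this kind can be sketched with the standard spherical linear interpolation between $\sqrt{q}$ and $\sqrt{p}$, squared back onto the simplex. This is our own illustration: it samples the arc at equally spaced angles and does not reproduce the constant or the reparametrization $\tau_t$ appearing in (31). The function name `great_circle_path` and the example distributions are hypothetical.

```python
import math

def great_circle_path(p, q, k):
    """Return k + 1 distributions from q (i = 0) to p (i = k), equally spaced
    in angle along the great circle through sqrt(q) and sqrt(p)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    theta = math.acos(min(1.0, bc))
    if theta == 0.0:  # p and q coincide
        return [list(q) for _ in range(k + 1)]
    path = []
    for i in range(k + 1):
        t = i / k
        a = math.sin((1.0 - t) * theta) / math.sin(theta)  # weight on sqrt(q)
        b = math.sin(t * theta) / math.sin(theta)          # weight on sqrt(p)
        path.append([(a * math.sqrt(qx) + b * math.sqrt(px)) ** 2
                     for px, qx in zip(p, q)])
    return path

# Hypothetical example distributions.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
for w in great_circle_path(p, q, 4):
    print(w)
```

Each interpolated vector has unit Euclidean norm, so its componentwise square is a probability distribution, and the angle between adjacent points equals θ/k.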
V. Applications and further directions
We conclude by offering an application of the chained K-L divergences in maximum likelihood inference and an interpretation in terms of a thought experiment in thermodynamics.
A. ML inference and mutual information
The k-step K-L divergence D(k)(p‖E) = inf_{q∈E} D(k)(p‖q) “from” a set of distributions E has a maximum likelihood interpretation: q* = arg min_{q∈E} D(k)(p‖q) maximizes the likelihood to draw empirical distribution p ∈ Δn in k steps of the Wright-Fisher process with initial distribution q ∈ E. A special case of this is a k-step generalization of the mutual information I(X; Y) between random variables X and Y with finite alphabets $\mathcal{X}$, $\mathcal{Y}$, respectively, jointly distributed as $p_{XY}$:
$$I^{(k)}(X;Y) \triangleq \min_{q_X,\,q_Y} D^{(k)}\big(p_{XY} \,\big\|\, q_X q_Y\big) \tag{32}$$
with the minimum attained by the k-step “marginals” $q_X^{(k)*}$, $q_Y^{(k)*}$. We can check that the minimum exists and is unique (footnote 12). For k = 1, we have I(1) (X; Y) = I(X; Y) and $q_X^{(1)*} = p_X$, $q_Y^{(1)*} = p_Y$ (the marginals of pXY), but for k > 1 this is not the case. I(k) is monotonically decreasing in k.
Written in the minimization form (32), computing the mutual information I(X; Y) corresponds to finding the maximum likelihood distribution under which X and Y are independent given data with empirical distribution pXY. In the context of the Wright-Fisher process, suppose we assume that two genetic loci X and Y were independently distributed k generations ago, but are no longer independent due to the intervening neutral genetic drift; then the maximum likelihood ancestral distribution given the current distribution pXY is the product of the k-step marginals attaining the minimum in (32). After enough time, a single pair (x, y) of alleles fixes and the two loci become independent again, but neutral drift induces dependence before fixation occurs.
A related object we can construct with the chained K-L divergence is D(k)(pXY‖pXpY); this, like (32), matches the mutual information I(X; Y) for k = 1. The authors do not yet give an operational meaning for this quantity.
B. Maxwell’s demon
Maxwell’s demon is a thought experiment in thermodynamics, here envisioned as a model of a desalination plant: let distribution q correspond to the ambient relative concentrations of some chemicals in the “ocean” (so q ∈ Δ). The demon’s goal is to achieve a desired concentration of chemicals in the water supply; let E denote the set of concentrations the demon considers acceptable. The demon operates by admitting n molecules drawn from the ambient distribution q into a vesicle (temporary storage); if it so happens that p, the empirical distribution of the n molecules, satisfies p ∈ E, then the demon releases the contents of the vesicle into the water supply, and otherwise the demon dumps the contents back in the ocean. See Figure 3.
Fig. 3.

(left) Maxwell’s demon stands ready to release particles into the water supply below if the distribution of a random sample of n happens to be in E. (right) Two demons use an intermediate stage E′ to desalinate faster.
How quickly does the demon desalinate? From the preceding discussion we know that the demon releases the n molecules into the water supply with probability about $e^{-nD(E\|q)}$. Suppose we now break up the desalination process into two demons, so there is an intermediate stage between the ocean and the water supply. If either of the two demons rejects, then both reject (footnote 13). Then the optimal 2-demon desalination rate is $e^{-nD^{(2)}(E\|q)}$ (footnote 14), which is about twice as large as the 1-demon rate. Adding more stages increases the rate further; in the k → ∞ limit, we have a concentration gradient across the k stages with no large “jumps.” This is not how actual desalinators work, but the design principle of breaking up a large concentration difference into many small stages is used in multi-stage flash desalination and in the loop of Henle in the kidney.
Acknowledgments
The authors gratefully acknowledge Surya Ganguli and Hideo Mabuchi for suggestions and insightful discussions.
VI. Appendix
A. Proof of Theorem 1
1) (Equivalence of recursive and non-recursive definitions of the chained divergence)
We proceed by induction on k. For the base case k = 2 the two definitions (13) and (14) are the same. Now suppose the definitions are equivalent up to k − 1 (abusing notation). Then we expand out the min:
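$$D^{(k)}(p\|q) = \min_{w_{k-1}\in\Delta}\left[D(p\|w_{k-1}) + D^{(k-1)}(w_{k-1}\|q)\right] \tag{33}$$
$$\phantom{D^{(k)}(p\|q)} = \min_{w_{k-1}\in\Delta}\left[D(p\|w_{k-1}) + \min_{(w_1,\dots,w_{k-2})\in\Delta^{k-2}} \sum_{i=1}^{k-1} D(w_i\|w_{i-1})\right] \tag{34}$$
$$\phantom{D^{(k)}(p\|q)} = \min_{(w_1,\dots,w_{k-1})\in\Delta^{k-1}} \sum_{i=1}^{k} D(w_i\|w_{i-1}) \tag{35}$$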
where the second line follows from the inductive hypothesis. □
Existence and uniqueness of the path follow from the joint strict convexity of D(k)(p‖q) (Theorem 1.2)).
2) (Joint convexity of D(k)(·‖·))
We use the following result [9] (Section 3.2.5): Suppose function f (x, y, z) is jointly convex in x, y, z ∈ C, where C is a nonempty convex subset of ℝd. Then
$$g(x, y) \triangleq \inf_{z\in C} f(x, y, z) \tag{36}$$
is jointly convex in x and y.
We proceed by induction on k. For the base case k = 1, D(1)(p‖q) = D(p‖q) is jointly strictly convex in p, q ∈ Δ. Now assume the statement holds up to k − 1. Then D(p‖w) + D(k − 1)(w‖q) is jointly strictly convex in p, q,w ∈ Δ and
$$D^{(k)}(p\|q) = \min_{w\in\Delta}\left[D(p\|w) + D^{(k-1)}(w\|q)\right] \tag{37}$$
is jointly strictly convex in p, q by the above lemma (36) with g(p, q) = D(k)(p‖q) and f(p, q, w) = D(p‖w) + D(k − 1)(w‖q).
Since D(k)(p‖q) is the minimum of a strictly convex function over a closed convex set Δ, there is a unique global minimum $w^{(k)*}_i$ for 1 ≤ i ≤ k − 1. □
3) (Monotonicity of D(k)(p‖q) in k)
Consider the optimal k-step path ( , , ) (15) and modify it by replacing with p:
| (38) |
| (39) |
| (40) |
(a) follows because we know from Theorem 2 that and since we showed above that the optimal path is unique. (b) follows since in general . □
4) (Bounding )
First the upper bound (18). We proceed by induction on k. For k = 1, the statement holds (see [1] Theorem 11.1.4). Now assume the statement holds up to k − 1. Now using (9) we write
| (41) |
| (42) |
| (43) |
| (44) |
| (45) |
| (46) |
where (a) follows from the inductive hypothesis, (b) follows from $|\Delta_n| \le (n+1)^{|\mathcal{X}|}$ (see [1] Theorem 11.1.1) and from optimizing over Δ rather than Δn ⊂ Δ, and (c) follows from the recursive definition of the chained K-L divergence (11).
Next we find an asymptotic (in n ⟶ ∞) lower bound. Let (pn)n∈ℕ, pn ∈ Δn, limn ⟶ ∞ D(pn‖p) = 0. Choose an n-indexed sequence such that vn,i ∈ Δn, as n ⟶ ∞. Such a sequence exists because ∪nΔn is dense in Δ. Denote vn,k ≜ pn and . Denote
| (47) |
By Pinsker’s inequality
| (48) |
for all i ∈ {1,…, k}, . Moreover, since by assumption, then for large enough n we have , so for large enough n
| (49) |
Therefore for i ∈ {1,…, k}
| (50) |
| (51) |
| (52) |
where the second equality follows from (48) and (49) and εn, 0[x] ≜ 0.
Using (9) we write
| (53) |
| (54) |
| (55) |
with w0 = q, wk = pn, (wi) = (w1,…,wk−1), where (a) follows from the lower bound (see [1], Theorem 11.1.4). Therefore
| (56) |
| (57) |
| (58) |
| (59) |
| (60) |
where (a) follows by (52). Recalling the upper bound (18) we conclude
| (61) |
The first equality in (19) follows by letting vn,k = pn= p, εn,k[x] = 0 for all n, , so , so (59) holds, and
| (62) |
Equivalently, . □
5) (Sanov’s theorem)
This proof is very similar to the previous proof for the case E = {p} and to the proof of Sanov’s theorem in [1].
First the upper bound (20). Using (7) and (10) we write
| (63) |
| (64) |
| (65) |
| (66) |
| (67) |
where (a) follows from the upper bound (18) and (b) follows from .
By assumption E is closed and convex, so
| (68) |
is the k-step I-projection of q on E (12). We find an asymptotic (in n ⟶ ∞) lower bound. Using (7) and (10) we write
| (69) |
| (70) |
| (71) |
with w0 = q, wk = p, (wi) = (w1,…,wk−1), where (a) follows from the lower bound (see [1], Theorem 11.1.4).
By assumption E ∩ Δn is non-empty for sufficiently large n, so the lower bound (71) is not vacuous. We can then find a sequence (pn)n, pn ∈ E ∩ Δn and limn⟶∞ D(pn‖p(k)*) = 0 and an n-indexed sequence such that vn,i ∈ Δn, as n ⟶ ∞. Such a sequence exists because ∪nΔn is dense in Δ. Recycling the argument in (52) we conclude
| (72) |
where . Then
| (73) |
| (74) |
| (75) |
| (76) |
| (77) |
where (a) follows by (72).
Recalling the upper bound (20) we conclude
| (78) |
Equivalently, . □
6) (Conditional limit theorem)
We follow closely the proof for the conditional limit theorem in [1]. Let be the optimal path (15) and p(k)* = argminp∈E D(k)(p‖q) (12) be the k-step I-projection of q on E (p(k)* exists and is unique since E is closed and convex by assumption).
Next define the set
| (79) |
Let
| (80) |
Now define the set
| (81) |
where the last term ensures that . Define the set
| (82) |
Thus A ∪ B = Δk−1 × E. In the summations below, . Then
| (83) |
| (84) |
| (85) |
| (86) |
where (a) follows from the upper bound (18) in the case k = 1 and (b) follows since .
In the summations below, . We write
| (87) |
| (88) |
| (89) |
| (90) |
where in (a) we used , in (b) we used the lower bound (see [1], Theorem 11.1.4), and (c) follows since for n sufficiently large (since E ∩ Δn ≠ Ø for n sufficiently large by assumption).
Now recall the random variables defined in the statement of Theorem 1.6) and define the empirical distributions ŵi ∈ Δn
| (91) |
for all , i ∈ {1,…, k}. Then for n sufficiently large
| (92) |
| (93) |
| (94) |
| (95) |
where (a) follows from (86) and (90). The above quantity goes to 0 as n ⟶ ∞, so as n ⟶ ∞.
Next let’s show that elements of A are close to in L1 norm. This follows from the strict joint convexity of the function
| (96) |
on the compact, convex set Δk−1 × E. We state a lemma:
Lemma 4
Let f(x) be jointly strictly convex in x ∈ C, where C is a compact, convex subset of ℝd, let x* ≜ argminx∈C f(x). For δ > 0 denote the sublevel set
| (97) |
Then
| (98) |
Proof
Suppose (98) does not hold. Then there exists ε > 0 such that for all δ > 0
| (99) |
Let T′ ≜ {x : ║x − x*║1 ⩾ ε} and where the min exists by compactness of T′ and the inequality follows since x ∈ T′ ⇒ x ≠ x* and f(x) is jointly strictly convex. Then
| (100) |
| (101) |
| (102) |
This contradicts (99), so (98) holds. ■
The function is jointly strictly convex in with compact, convex support Δk−1 × E and attains its minimum at , so by Lemma 4
| (103) |
Now recalling our earlier conclusion that for all δ > 0 as n ⟶ ∞ and using (103), we conclude that for all ε > 0, all i ∈ {1,…, k}, and all
| (104) |
□
B. Proof of Theorem 2
Proof
The proof of the local optimality condition (24) is given after the statement of this Theorem.
Now let’s prove the geometric lower bound (25). We first state a lemma.
Lemma 5
Let p, q ∈ Δ, D(p‖q) < ∞, w*(p, q) be as in (23), and λ⋆ the value of the Lagrange multiplier so that w*(p, q) is normalized. Then
| (105) |
Proof
First define the function
| (106) |
where
| (107) |
We can check that fλ(·) is convex and fλ(1) = 0, so Zλ defines an f-divergence. Now letting λ⋆ so that w*(p, q) (23) is normalized, we write
| (108) |
| (109) |
| (110) |
| (111) |
| (112) |
where in (a) we substituted w*(p, q) (23) and in (b) we used the non-negativity of f-divergences. ■
Next let’s state another lemma.
Lemma 6
Let p, q ∈ Δ, D(p‖q) < ∞ and w*(p, q) be as in (23). Then
| (113) |
for all .
Proof
Suppose there exists some such that (113) does not hold, so . This implies p[x], q[x] > 0, so
| (114) |
| (115) |
This contradicts Lemma 5, so the statement holds. ■
Lemma 6 implies
| (116) |
| (117) |
Thus is concave in i, so
| (118) |
| (119) |
so (25) follows. ■
C. Proof of Theorem 3
Proof
Let vt denote the continuous path (31)
| (120) |
where and are as in the statement of Theorem 3. Let denote the continuous path (31) sampled at k uniformly spaced points in time:
| (121) |
Let denote the optimal k-step path (15). Then and . For 1 ≤ i ≤ k denote by the angle between (the square roots of) adjacent distributions on the path
| (122) |
| (123) |
Next let
| (124) |
We have μ > 0 since D(p‖q) < ∞ and D(q‖p) < ∞, so Support(p) = Support(q) by assumption.
Now apply the series expansion for the K-L divergence (27) to and and sum over i ∈ {1,…, k}:
| (125) |
| (126) |
where the second equality comes from the i- and k- independent lower bound
| (127) |
Since D(p‖q) < ∞ and D(q‖p) < ∞ by assumption, we may assume p[x], q[x] > 0 for all $x \in \mathcal{X}$ (since otherwise we could restrict attention to Support(p) = Support(q)). Thus the last inequality in (127) follows. Similarly we obtain
| (128) |
| (129) |
where the last inequality follows from the i- and k-independent lower bound
| (130) |
where the first inequality follows from Theorem 2, (25).
Now we write
| (131) |
| (132) |
| (133) |
where the inequality follows since the path vt is part of a great circle joining $\sqrt{q}$ and $\sqrt{p}$ at constant angular speed (uniformly spacing the angles minimizes the sum of their squares subject to their sum being at least θ). On the other hand by the optimality of the path (15)
| (134) |
Thus, combining the above inequalities,
| (135) |
We then have
| (136) |
| (137) |
[why θ(k) = O(1/k)?]
[Now use joint strict convexity of on the compact, convex set Δk−1 × E to complete proof?] ■
Footnotes
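$\sqrt{p}$ denotes the vector with components $\sqrt{p[x]}$.
That is, $a_n \doteq_n b_n$ means $\lim_{n\to\infty} \frac{1}{n}\log\frac{a_n}{b_n} = 0$.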
Convergence in probability, as in the statement of Theorem 1.6) below.
We use λ − 1 rather than λ to make the answer look more compact.
W(z) satisfies z = W(z)eW(z) and W(z) ⩾ − 1 corresponds to the principal branch.
This means finding a root of the smooth, monotonic function $\sum_x w^*(p,q)[x] - 1$ of λ with the knowledge that λ* ∈ [0, 1], so is doable.
The f-divergence is $D_f(p, q) \triangleq \sum_x q[x]\, f\big(p[x]/q[x]\big)$ with f(t) = −t/W(teλ) + t/W(eλ). f(t) is convex and f(1) = 0, so Df (p, q) is an f-divergence.
We can check that .
We can check to where .
The set of distributions {qXY : qXY = qX qY} is log-convex, and letting be the “reverse I-projection” of (15) on E, Theorem 1 of [6] proves existence and uniqueness.
This ensures particle conservation in the intermediate stage, though we can envision other scenarios with variable sample sizes and demon policies.
The intermediate demon must choose an appropriate acceptance set E′ ⊂ E such that the I-projection of q on E′ is w* = arg minw D(E‖w) + D(w‖q); one possible choice is E′ = {w : D(E‖w) ≤ D(E‖w*)}.
Contributor Information
Dmitri S. Pavlichin, Stanford University
Tsachy Weissman, Stanford University.
References
- 1. Cover TM, Thomas JA. Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2006.
- 2. Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
- 3. Crow JF. Wright and Fisher on inbreeding and random drift. Genetics. 2010;184:609–611. doi: 10.1534/genetics.109.110023.
- 4. Sanov IN. On the probability of large deviations of a random variable. Mat Sbornik. 1957;42:11–44.
- 5. Csiszár I. Sanov property, generalized I-projection and a conditional limit theorem. The Annals of Probability. 1984;12:768–793.
- 6. Csiszár I, Matúš F. Information projections revisited. IEEE Transactions on Information Theory. 2003;49:1474–1490.
- 7. Papangelou F. The large deviations of a multi-allele Wright-Fisher process mapped on the sphere. The Annals of Applied Probability. 2000;10:1259–1273.
- 8. Pavlichin DS, Weissman T. Chained Kullback-Leibler divergences. doi: 10.1109/ISIT.2016.7541365. www.stanford.edu/~dmitrip/chaineddivergences.pdf, in preparation.
- 9. Boyd S, Vandenberghe L. Convex Optimization. Cambridge, UK: Cambridge University Press; 2009.
- 10. Hájek J. Asymptotically most powerful rank-order tests. The Annals of Mathematical Statistics. 1962;33:1124–1147.
- 11. Le Cam L. On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics. 1970;41:802–828.
- 12. Pollard D. Another look at differentiability in quadratic mean. In: Pollard D, Torgersen E, Yang GL, editors. Festschrift for Lucien Le Cam. New York: Springer; 1997. pp. 305–314. ch 9.
- 13. Amari S, Nagaoka H. Methods of information geometry. American Mathematical Society; 2000.
