Chained Kullback-Leibler Divergences

Dmitri S Pavlichin; Tsachy Weissman

doi:10.1109/ISIT.2016.7541365

Author manuscript; available in PMC: 2017 Nov 8.

Published in final edited form as: Proc IEEE Int Symp Info Theory. 2016 Aug 11;2016:580–584. doi: 10.1109/ISIT.2016.7541365

PMCID: PMC5677233 NIHMSID: NIHMS910318 PMID: 29130024

Abstract

We define and characterize the “chained” Kullback-Leibler divergence min_w D(p‖w) + D(w‖q) minimized over all intermediate distributions w and the analogous k-fold chained K-L divergence min D(p‖w_k₋₁) + … + D(w₂‖w₁) + D(w₁‖q) minimized over the entire path (w₁,…,w_k₋₁). This quantity arises in a large deviations analysis of a Markov chain on the set of types – the Wright-Fisher model of neutral genetic drift: a population with allele distribution q produces offspring with allele distribution w, which then produce offspring with allele distribution p, and so on.

The chained divergences enjoy some of the same properties as the K-L divergence (like joint convexity in the arguments) and appear in k-step versions of some of the same settings as the K-L divergence (like information projections and a conditional limit theorem). We further characterize the optimal k-step “path” of distributions appearing in the definition and apply our findings in a large deviations analysis of the Wright-Fisher process. We make a connection to information geometry via the previously studied continuum limit, where the number of steps tends to infinity, and the limiting path is a geodesic in the Fisher information metric.

Finally, we offer a thermodynamic interpretation of the chained divergence (as the rate of operation of an appropriately defined Maxwell’s demon) and we state some natural extensions and applications (a k-step mutual information and k-step maximum likelihood inference). We release code for computing the objects we study.

I. Introduction

We investigate the properties and applications of the k-fold chained Kullback-Leibler divergence between distributions p and q:

D^{(k)} (p ‖ q) ≜ \min_{w_{1}, \dots, w_{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})

(1)

where w₀ ≜ q, w_k ≜ p, D(·‖·) denotes the Kullback-Leibler divergence

D (p ‖ q) = \sum_{x \in X} p [x] \log \frac{p [x]}{q [x]}

(2)

and the minimum, if it exists, is taken over a “path” of distributions w_i in some family of distributions. k counts the number of hops and D⁽¹⁾(p‖q) ≜ D(p‖q).
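As a concrete reading of definitions (1) and (2), the following minimal sketch evaluates the K-L divergence and the cost of an arbitrary path on a finite alphabet. The function names `kl_divergence` and `path_cost` are illustrative choices, not taken from the released code, and distributions are assumed to be plain Python lists of probabilities.

```python
import math

def kl_divergence(p, q):
    """D(p||q) in nats for two distributions given as equal-length sequences."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf     # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

def path_cost(path):
    """Sum of D(w_i || w_{i-1}) along path = [w_0 = q, w_1, ..., w_k = p]."""
    return sum(kl_divergence(path[i], path[i - 1]) for i in range(1, len(path)))
```

Since (1) is a minimum over paths, `path_cost` of any particular path with k hops from q to p is an upper bound on D^(k)(p‖q).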

Just as the K-L divergence appears in statistical applications of information theory and in the method of types [1], so the chained divergence arises in the analysis of stochastic processes with an iterative resampling flavor: in each round one samples from a distribution w_i determined by the outcome of the previous round of sampling from w_i₋₁. A motivating example that we consider in detail is the Wright-Fisher model of neutral genetic drift [2], [3], wherein each generation is obtained by sampling with replacement from the previous generation.

This work is devoted to characterizing the chained divergences in the finite alphabet setting and the optimal k-step path appearing in their definition, and to pointing out their applications. Section II defines the chained divergence in tandem with a discussion of the Wright-Fisher process, in which it naturally appears. Section III derives some properties (like joint convexity) of the chained divergence, including several “operationally” meaningful properties characterizing the rate of large deviations, information projections, and a conditional limit theorem; these are very similar to results on the K-L divergence [4], [5], [6]. Section IV-A characterizes the optimal path of intermediate distributions (w₁, …, w_k₋₁) for all values of k and states a method for computing this path. Section IV-B considers the continuum limit of k → ∞ (letting the number of distributions in the optimal path tend to infinity) and recapitulates results previously obtained by [7] in this setting: the limiting path is a geodesic in the Fisher information metric; equivalently, part of a great circle connecting $\sqrt{p}$ and $\sqrt{q}$ ¹. Section V considers extensions and other applications of the chained divergences – in particular a k-step version of the mutual information related to the likelihood that two independent genetic loci become dependent through neutral genetic drift in k generations. Finally, we offer a thermodynamic interpretation of the chained divergence as the optimal rate of operation of an appropriately defined Maxwell’s demon.

We release code for computing the objects we study at². An Appendix contains the details of some of the proofs omitted here due to space constraints and is available at [8].

II. The Wright-Fisher Markov chain

We define the Wright-Fisher (“iterative resampling”) process and review briefly some results about the K-L divergence that we extend to the chained divergences (1).

Denote by $Δ ≜ \{p \in ℝ^{X} : \sum_{x \in X} p [x] = 1, p ⩾ 0\}$ the set of distributions on finite alphabet $X$ . Denote by Δ_n the set of distributions with integer denominator n for each component:

Δ_{n} ≜ Δ \cap \{p : n p \in ℕ^{X}\}

(3)

Δ_n is the set of types [1], equivalently the possible empirical distributions of n $X$ -valued samples.
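To make the set of types concrete, here is a small sketch that enumerates Δ_n for an alphabet of size m using exact fractions. The name `types` and the recursive helper are illustrative assumptions. The count it produces matches the stars-and-bars formula and respects the standard method-of-types bound |Δ_n| ≤ (n + 1)^{|X|}.

```python
from fractions import Fraction
from math import comb

def types(n, m):
    """Yield every type in Delta_n on an alphabet of size m as a tuple of Fractions."""
    def counts(remaining, slots):
        # All ways to split `remaining` samples among `slots` letters.
        if slots == 1:
            yield (remaining,)
            return
        for c in range(remaining + 1):
            for rest in counts(remaining - c, slots - 1):
                yield (c,) + rest
    for c in counts(n, m):
        yield tuple(Fraction(ci, n) for ci in c)

n, m = 4, 3
num_types = sum(1 for _ in types(n, m))
# Stars and bars gives the exact count; (n + 1)^m is the looser polynomial bound.
assert num_types == comb(n + m - 1, m - 1) <= (n + 1) ** m
```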

Consider the Markov chain whose state space is the set of types Δ_n (3) defined by the following rule: Let the current state of the chain be q ∈ Δ_n. Draw n samples $x_{1}^{n}$ i.i.d. from q and let $p [x] = \frac{1}{n} | {i : x_{i} = x} |$ be the empirical distribution. p is the next state of the chain. This is the Wright-Fisher model [2], [3] of neutral genetic drift among $| X |$ alleles in a population of size n: Each of n individuals in a generation is sampled by cloning a uniformly chosen (hence “neutral”) individual from the previous generation. The process is illustrated in Figure 1. As time increases, alleles “die out” (once q[x] = 0, no samples of x can be observed again) until only one allele remains, or “fixes” (that is, q[x] = 1 for some $x \in X$ ), so the vertices of the simplex Δ are absorbing states. We will study fluctuations of this chain before fixation.
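The resampling rule above can be sketched in a few lines. This is an illustrative simulation only: the state is stored as a list of allele counts (so the type is counts divided by n), and the names `wright_fisher_step` and `simulate_until_fixation`, as well as the generation cap, are assumptions made for the example.

```python
import random

def wright_fisher_step(counts, rng):
    """Draw n samples i.i.d. from the current type and return the new allele counts."""
    n = sum(counts)
    sample = rng.choices(range(len(counts)), weights=counts, k=n)
    new_counts = [0] * len(counts)
    for x in sample:
        new_counts[x] += 1
    return new_counts

def simulate_until_fixation(counts, rng, max_generations=100_000):
    """Iterate the chain until one allele holds all n individuals (or the cap is hit)."""
    n = sum(counts)
    trajectory = [list(counts)]
    while max(trajectory[-1]) < n and len(trajectory) <= max_generations:
        trajectory.append(wright_fisher_step(trajectory[-1], rng))
    return trajectory
```

An allele whose count reaches zero has zero sampling weight from then on, which is the "dying out" behaviour described above.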

Fig. 1 — (left) The Wright-Fisher process, a Markov chain on the set of types. (right) A trajectory of the Wright-Fisher process for $| X | = 3$ alleles plotted on the simplex. Eventually only one allele remains (“fixes”).

Denote by P_n(p|q) the transition matrix for this Markov chain:

P_{n} (p | q) = MultinomialPMF (n p; q)

(4)

= \binom{n}{{(n p [x])}_{x \in X}} \prod_{x \in X} q {[x]}^{n p [x]}

(5)

≐_{n} e^{- n D (p ‖ q)}

(6)

where the prefactor in (5) is a multinomial coefficient, ≐_n denotes equality to leading exponential order in n³, and D(·‖·) denotes the Kullback-Leibler divergence (2). We can evaluate P_n(p|q) for any p, q ∈ Δ and n, but since the state space for the Wright-Fisher process is the set of types Δ_n (3), n ∈ ℕ, we can think of P_n(p|q) as a |Δ_n| × |Δ_n| stochastic matrix.
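The exponential equivalence (6) can be checked numerically. The sketch below, with illustrative names, computes log P_n(p|q) from (5) with log-gamma functions and reports the gap between −(1/n) log P_n(p|q) and D(p‖q). It assumes q[x] > 0 wherever p[x] > 0, so that the divergence is finite.

```python
import math

def log_multinomial_pmf(counts, q):
    """log of (5): multinomial coefficient times prod_x q[x]^counts[x]."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)
    for c, qx in zip(counts, q):
        log_p -= math.lgamma(c + 1)
        if c > 0:
            if qx == 0.0:
                return -math.inf
            log_p += c * math.log(qx)
    return log_p

def kl_divergence(p, q):
    """D(p||q) in nats, assuming q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def exponent_gap(counts, q):
    """-(1/n) log P_n(p|q) - D(p||q); nonnegative and shrinking as n grows."""
    n = sum(counts)
    p = [c / n for c in counts]
    return -log_multinomial_pmf(counts, q) / n - kl_divergence(p, q)
```

Holding the proportions fixed and scaling the counts up (for example from n = 30 to n = 3000) should drive the gap toward zero at a rate of order log(n)/n.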

Given a set of distributions E ⊂ Δ, we define $P_{n} (E | q) ≜ \sum_{p \in E \cap Δ_{n}} P_{n} (p | q)$ as the probability that the chain hops to somewhere in E starting from q. If E ∩ Δ_n = ∅, then P_n(E|q) ≜ 0. If E is the closure of its interior (so E ∩ Δ_n is nonempty for all large enough n) then Sanov’s theorem tells us [4], [5]

P_{n} (E | q) ≜ \sum_{p \in E \cap Δ_{n}} P_{n} (p | q) ≐ e^{- n D (E ‖ q)}

(7)

where

D (E ‖ q) ≜ \inf_{p \in E} D (p ‖ q) = D (p^{*} ‖ q)

(8)

Here p* ≜ arg min_{p∈E} D(p‖q) is the I-projection [6]; the minimum exists and is unique since E is closed (thus compact since E ⊂ Δ) by assumption and D(p‖q) is strictly convex in p. Sanov’s theorem and the conditional limit theorem tell us [4], [5], [1] that conditioned on drawing empirical distribution p ∈ E, distribution p is close⁴ to p* as n → ∞. If E is not closed (so min_{p∈E} D(p‖q) might not exist) but is convex, then there exists a unique distribution p* – the generalized I-projection [5] – such that D(p‖q) ⩾ D(p‖p*) + D(E‖q) for all p ∈ E.

What happens when we iterate the resampling chain? Denote by $P_{n}^{(k)} (p | q)$ the probability to draw distribution p starting from q in k steps, corresponding to the transition matrix $P_{n}^{(k)} = P_{n}^{k}$ . We recursively express $P_{n}^{(k)} (\cdot | \cdot)$ :

P_{n}^{(k)} (p | q) ≜ \sum_{w \in Δ_{n}} P_{n} (p | w) P_{n}^{(k - 1)} (w | q)

(9)

P_{n}^{(k)} (E | q) ≜ \sum_{w \in Δ_{n}} P_{n} (E | w) P_{n}^{(k - 1)} (w | q)

(10)

with $P_{n}^{(1)} (\cdot | q) ≜ P_{n} (\cdot | q)$ as in (4) and (7).
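For a two-letter alphabet the set of types is {0/n, 1/n, …, n/n}, so recursion (9) reduces to repeated multiplication of an (n + 1) × (n + 1) binomial matrix. The following sketch uses that special case; the names `one_step` and `k_step` and the row convention (row index is the starting state) are assumptions made for illustration.

```python
from math import comb

def one_step(n):
    """T[i][j] = P_n(p = j/n | q = i/n) for a two-letter alphabet (binomial resampling)."""
    T = []
    for i in range(n + 1):
        a = i / n
        T.append([comb(n, j) * a**j * (1 - a)**(n - j) for j in range(n + 1)])
    return T

def k_step(T, k):
    """Apply recursion (9): P^(k)(p|q) = sum_w P(p|w) P^(k-1)(w|q)."""
    size = len(T)
    Tk = [row[:] for row in T]
    for _ in range(k - 1):
        Tk = [[sum(Tk[i][w] * T[w][j] for w in range(size)) for j in range(size)]
              for i in range(size)]
    return Tk
```

Each row of the k-step matrix remains a probability distribution, and the two vertex states (i = 0 and i = n) stay absorbing for every k.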

Theorem 1 establishes that the k-fold chained K-L divergence (1) plays the same role for the k-step resampling chain as the K-L divergence plays for one step of the chain, as described above. It is convenient to define recursively the k-step chained divergence D^(k) (p‖q) for p, q ∈ Δ:

D^{(k)} (p ‖ q) ≜ \min_{w \in Δ} (D (p ‖ w) + D^{(k - 1)} (w ‖ q))

(11)

with D⁽¹⁾ (p‖q) ≜ D(p‖q), the K-L divergence (2). Note that we optimize over the simplex Δ (rather than the types Δ_n). Theorem 1 establishes the existence and uniqueness of the minimum and the equivalence of the recursive definition (11) with our earlier definition for the chained divergence (1).

We further define D^(k) (E‖q) for a closed, convex set E ⊂ Δ by analogy with (8):

D^{(k)} (E ‖ q) ≜ \inf_{p \in E} D^{(k)} (p ‖ q) = D^{(k)} (p^{(k) *} ‖ q)

(12)

and p^(k)* ≜ arg min_{p∈E} D^(k)(p‖q) is the I-projection of $w_{k - 1}^{(k) *}$ on E, where $w_{k - 1}^{(k) *}$ is the next-to-last point in the optimal path (15). If E is convex but min_{p∈E} D^(k)(p‖q) does not exist, then there exists a unique distribution p^(k)* that is the generalized I-projection of $w_{k - 1}^{(k) *}$ on E.

III. Characterizing the k-fold chained K-L divergence

Theorem 1

Let k be a positive integer.

1) The k-fold chained divergence (11) satisfies for all p, q ∈ Δ such that D(p‖q) < ∞:
$D^{(k)} (p ‖ q) ≜ \min_{w \in Δ} (D (p ‖ w) + D^{(k - 1)} (w ‖ q))$ (13)

$= \sum_{i = 1}^{k} D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *})$ (14)
with boundary conditions $w_{k}^{(k) *} ≜ p$ , $w_{0}^{(k) *} ≜ q$ , and
$(w_{1}^{(k) *}, \dots, w_{k - 1}^{(k) *}) ≜ \arg \min_{Δ^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})$ (15)
where the minimum is over the (k − 1)-fold product set Δ^k⁻¹ = Δ × ⋯ × Δ. The minimum exists and is unique. For closed, convex E ⊂ Δ, (12) defines D^(k)(E‖q). If D(p‖q) = ∞, then D^(k)(p‖q) ≜ ∞ and $w_{i}^{(k) *}$ is not defined.
2) Convexity: D^(k)(p‖q) is jointly strictly convex in p and q:
$D^{(k)} (α p_{1} + (1 - α) p_{2} ‖ α q_{1} + (1 - α) q_{2}) < α D^{(k)} (p_{1} ‖ q_{1}) + (1 - α) D^{(k)} (p_{2} ‖ q_{2})$ (16)
for all distributions (p₁, q₁) ≠ (p₂, q₂), α ∈ (0,1).
3) Scaling in k: Suppose 0 < D(p‖q) < ∞. Then D^(k)(p‖q) is strictly decreasing in k and $D^{(k)} (p ‖ q) = O (\frac{1}{k})$ . Theorem 3 of Section IV-B (on the continuum limit k → ∞) gives the more precise scaling:
$\lim_{k \to \infty} k D^{(k)} (p ‖ q) = 2 θ_{\sqrt{p}, \sqrt{q}}^{2}$ (17)
where $θ_{\sqrt{p}, \sqrt{q}} = \arccos (\sqrt{p} \cdot \sqrt{q})$ is the angle between vectors $\sqrt{p}$ and $\sqrt{q}$ (see Section IV-B for details).
4) Markov chain transition matrix $P_{n}^{(k)} (p | q)$ (9), upper bound for p ∈ Δ:
$P_{n}^{(k)} (p | q) \leq {(n + 1)}^{(k - 1) | X |} e^{- n D^{(k)} (p ‖ q)}$ (18)
Moreover, for p ∈ Δ and sequence (p_n)_{n∈ℕ}, p_n ∈ Δ_n such that lim_{n→∞} D(p_n‖p) = 0
$\lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (p | q) = \lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (p_{n} | q) = - D^{(k)} (p ‖ q)$ (19)
Equivalently, $P_{n}^{(k)} (p | q) ≐ e^{- n D^{(k)} (p ‖ q)}$ .
5) Sanov’s theorem: Let E ⊂ Δ. The probability $P_{n}^{(k)} (E | q)$ (10) to draw a type in E from q ∈ Δ in k steps is upper bounded:
$P_{n}^{(k)} (E | q) \leq {(n + 1)}^{k | X |} e^{- n D^{(k)} (E ‖ q)}$ (20)
where D^(k)(E‖q) is defined in (12). Moreover, if E is closed, convex, and E ∩ Δ_n ≠ ∅ for sufficiently large n then
$\lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (E | q) = - D^{(k)} (p^{(k) *} ‖ q)$ (21)
where p^(k)* is the k-step I-projection of q on E (12); equivalently, $P_{n}^{(k)} (E | q) ≐ e^{- n D^{(k)} (E ‖ q)}$ .
6) Conditional limit theorem: Let E ⊂ Δ be closed, convex, and E ∩ Δ_n ≠ ∅ for sufficiently large n, q ∈ Δ − E, D(E‖q) < ∞. Let ${(X_{t}^{(i)})}_{t = 1}^{n}$ be the i-th sample of the Wright-Fisher process (Figure 1): that is, $X_{t}^{(i)}$ drawn i.i.d. $\sim {\hat{w}}_{i - 1}$ with ${\hat{w}}_{0} ≜ q$ and ${\hat{w}}_{i}$ the empirical distribution of $X_{t}^{(i)}$ . Then for all ε > 0, i ∈ {1, …, k}, $x \in X$
$\lim_{n \to \infty} P (| {\hat{w}}_{i} [x] - w_{i}^{*} [x] | ⩾ ε | {\hat{w}}_{k} \in E) = 0$ (22)
with ${(w_{i}^{(k) *})}_{i = 1}^{k - 1}$ as in (15) and $w_{k}^{*} = p^{(k) *} = \arg \min_{p \in E} D^{(k)} (p ‖ q)$ (the k-step I-projection of q on E (12)).

Joint convexity 2) follows because minimization of jointly convex functions with respect to some of the arguments over a convex set preserves convexity [9]. Existence and uniqueness of the optimal path 1) follow from joint strict convexity and the compactness of Δ ∩ Support(q). Theorem 3 contains 3). The proofs of 4), 5), 6) all follow closely the proofs of analogous results for the K-L divergence in [1], Chapter 11. See Appendix VI-A in [8] for details of the proofs.

IV. Characterizing the k-fold path

What can we say about the optimal k-fold “path” ${(w_{i}^{(k) *})}_{i = 1}^{k - 1} = \arg \min_{Δ^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})$ (15)? In the event of a large deviation, we observe a path close to this one with high probability as n → ∞ (Theorem 1.6)).

A. Finite number of steps k

Let’s start with the case k = 2 and find w*(p, q) = arg min_{w∈Δ} (D(p‖w) + D(w‖q)). For k > 2, we will then use this local characterization of the path of distributions to compute $w_{i}^{*} = w^{*} (w_{i + 1}^{*}, w_{i - 1}^{*})$ .

We set up an optimization problem with Lagrangian $Λ = D (p ‖ w) + D (w ‖ q) + (λ - 1) (\sum_{x \in X} w [x] - 1)$ where λ − 1 is a Lagrange multiplier⁵ enforcing normalization of w. Solving the $| X |$ equations $\nabla Λ |_{w^{*}} = 0$ for w* we find:

w^{*} [x] = w^{*} (p, q) [x] ≜ \begin{cases} \frac{p [x]}{W (e^{λ^{*}} \frac{p [x]}{q [x]})} & : p [x] > 0 \\ e^{- λ^{*}} q [x] & : p [x] = 0 \end{cases}

(23)

where W(·) is the principal branch of the Lambert W function⁶ and λ* is chosen so that w* ∈ Δ is normalized (in fact λ* ∈ [0, 1] for all p, q). The case of p[x] = 0 is obtained by analytic continuation of the solution of the case p[x] > 0. If D(p‖q) = ∞ then w*(p, q) is not defined. Note that the Lagrange multiplier λ* lives inside the Lambert W function (rather than outside like a partition function prefactor), and so cannot generally be expressed analytically in terms of p and q; one could try to find λ* numerically⁷. Theorem 1.1) tells us that w* is a global minimum.
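For readers who want the intermediate steps, the stationarity condition behind (23) can be spelled out as follows. This is a reconstruction from the Lagrangian stated above, using the defining property W(z) e^{W(z)} = z of the Lambert W function; the auxiliary variable u is introduced here only for the derivation.

```latex
\begin{align*}
0 &= \frac{\partial \Lambda}{\partial w[x]}
   = -\frac{p[x]}{w[x]} + \log\frac{w[x]}{q[x]} + 1 + (\lambda - 1)
   = -\frac{p[x]}{w[x]} + \log\frac{w[x]}{q[x]} + \lambda . \\
\text{For } p[x] > 0, \text{ let } u &= \frac{p[x]}{w[x]} > 0: \quad
   u = \log\frac{p[x]}{u\, q[x]} + \lambda
   \;\Longrightarrow\; u e^{u} = e^{\lambda}\frac{p[x]}{q[x]}
   \;\Longrightarrow\; u = W\!\left(e^{\lambda}\frac{p[x]}{q[x]}\right), \\
w[x] &= \frac{p[x]}{u} = \frac{p[x]}{W\!\left(e^{\lambda}\, p[x]/q[x]\right)} . \\
\text{For } p[x] = 0: \quad & \log\frac{w[x]}{q[x]} + \lambda = 0
   \;\Longrightarrow\; w[x] = e^{-\lambda} q[x] .
\end{align*}
```

Because W is increasing on the non-negative reals, each component w[x] is decreasing in λ, so the normalization constraint pins down a single value λ*.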

Theorem 2

(local characterization of the optimal k-step path) Let k ⩾ 2 and D(p‖q) < ∞ for p, q ∈ Δ. The optimal path ${(w_{i}^{(k) *})}_{i = 1}^{k - 1}$ (15) used in defining the chained K-L divergence D^(k)(p‖q) satisfies

w_{i}^{(k) *} [x] = w^{*} (w_{i + 1}^{(k) *}, w_{i - 1}^{(k) *}) [x]

(24)

⩾ p {[x]}^{i / k} q {[x]}^{(k - i) / k}

(25)

for all $x \in X$ , 1 ≤ i ≤ k − 1, where w*(·,·) is defined in (23) and with boundary conditions $w_{k}^{(k) *} = p$ , $w_{0}^{(k) *} = q$ . As a corollary of (24), for i ∈ {1,…, k − 1}

Support (w_{i}^{(k) *}) = Support (q)

(26)

Proof

Suppose (24) does not hold for some p, q, $w_{i}^{(k) *}$ . Then $\sum_{i = 1}^{k} D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *})$ (14) can be strictly decreased by replacing $w_{i}^{(k) *}$ with $w^{*} (w_{i + 1}^{(k) *}, w_{i - 1}^{(k) *})$ . Thus (24) holds. ■

To derive the geometric lower bound (25) we massage the function w*(p, q) (23) into an f-divergence⁸, then use the non-negativity of f-divergences to bound λ* and derive $w^{*} (p, q) [x] ⩾ \sqrt{p [x] q [x]}$ , and then derive (25). See Appendix VI-B [8] for details.

Remarks

Figure 2 plots the optimal k-step path for k ∈ {2, 6} and in the k → ∞ limit (see Section IV-B). One way to compute the optimum path is to start with some guess (a good initial point is given in Theorem 3) and repeatedly apply (24) until numerical convergence; this is the method used in making Figure 2 and implemented in the code we release⁹.
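The iterative scheme just described can be sketched as follows. This is not the released code: all function names are illustrative, the Lambert W function is evaluated with a hand-written Newton iteration (valid for non-negative arguments only), λ* is located by bisection on [0, 1] as permitted by the remark after (23), and the initial guess is a simple arithmetic mixture of p and q rather than the limiting path of Theorem 3. It assumes D(p‖q) < ∞.

```python
import math

def kl(p, q):
    """D(p||q) in nats, assuming q[x] > 0 wherever p[x] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def lambert_w(x):
    """Principal branch W(x) for x >= 0, by Newton's method on w * exp(w) = x."""
    w = math.log1p(x)                       # starts at or above the root
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + w):
            break
    return w

def two_step_midpoint(p, q):
    """w*(p, q) from (23); lambda* is found by bisection on [0, 1]."""
    def candidate(lam):
        return [0.0 if b == 0 else
                math.exp(-lam) * b if a == 0 else
                a / lambert_w(math.exp(lam) * a / b)
                for a, b in zip(p, q)]
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if sum(candidate(mid)) > 1.0:       # total mass decreases as lambda grows
            lo = mid
        else:
            hi = mid
    w = candidate(0.5 * (lo + hi))
    total = sum(w)
    return [x / total for x in w]           # remove residual rounding error

def chained_path(p, q, k, sweeps=100):
    """Approximate path [w_0 = q, ..., w_k = p] by repeatedly applying (24)."""
    path = [[(1 - i / k) * b + (i / k) * a for a, b in zip(p, q)]
            for i in range(k + 1)]
    for _ in range(sweeps):
        for i in range(1, k):
            path[i] = two_step_midpoint(path[i + 1], path[i - 1])
    return path

def chained_divergence(p, q, k):
    """Cost of the approximate optimal path, an estimate of D^(k)(p||q)."""
    path = chained_path(p, q, k)
    return sum(kl(path[i], path[i - 1]) for i in range(1, k + 1))
```

The fixed number of sweeps is a simplification; for larger k the sweeps converge more slowly, and a stopping rule based on the change in path cost would be a natural replacement.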

Fig. 2 — (left) k-step path (15) from q to p (corresponding to D^(k)(p‖q)) computed by successively applying (24) for $w_{1}^{(k = 2) *}$ and ${(w_{i}^{(k = 6) *})}_{1 \leq i \leq 5}$ . The green line shows the limiting k → ∞ path (31). The dark grey lines show the arithmetic and (normalized) geometric mean of p and q for α ∈ [0, 1]. (right) The limiting path as k → ∞, part of a great circle joining $\sqrt{p}$ and $\sqrt{q}$ .

The optimal k-step path from q to p is not the same as the optimal path from p to q − an asymmetry inherited from the asymmetry of D^(k)(p‖q); but in the limit k → ∞ these two paths converge to the same limiting path (see Section IV-B). From (23) we see that q[x] > 0 ⇒ w* (p, q)[x] > 0, so $Support (w_{i}^{(k) *}) = Support (q)$ for i ∈ {1, …, k − 1}.

B. The limit k → ∞

What happens in the continuum limit, as the number of steps k → ∞? This setting was investigated by [7], who states the limiting path and large deviations rate function. We include this Section to make the story more complete, and to check consistency with and provide a finite-k intuition for the results [7] obtained with variational calculus.

A useful perspective in the following is to map the simplex Δ to the part of the unit sphere in $ℝ^{| X |}$ in the non-negative orthant, $Ψ ≜ \{ψ \in ℝ^{| X |} : {‖ ψ ‖}_{2} = 1, ψ ⩾ 0\}$ via the bijection $p \to \sqrt{p} ≜ {(\sqrt{p [x]})}_{x \in X}$ . This square root reparametrization appears in [10], [11], [12], [13]. Consider the K-L divergence between nearby points on the orthant Ψ. That is, if p = q + ε ∈ Δ, $ε \in ℝ^{| X |}$ , then $\exists δ_{ε} \in ℝ^{| X |}$ such that $\sqrt{p} = \sqrt{q} + δ_{ε} \in Ψ$ where $δ_{ε} [x] = \frac{1}{2} ε [x] p {[x]}^{- 1 / 2} + O (ε {[x]}^{2} p {[x]}^{- 3 / 2})$ . Then

D (p ‖ q) = \overbrace{4 (1 - \underbrace{\sqrt{p} \cdot \sqrt{q}}_{\cos θ})}^{2 {‖ \sqrt{p} - \sqrt{q} ‖}_{2}^{2}} + \sum_{x \in X} O (\frac{δ_{ε} {[x]}^{3}}{\sqrt{q [x]}})

(27)

= 2 θ^{2} + O (θ^{4}) + \sum_{x \in X} O (\frac{θ^{3}}{\sqrt{q [x]}})

(28)

where in (27) we rewrote the higher order terms in terms of ε, expanded the cos about θ = 0, and where

θ = θ_{\sqrt{p}, \sqrt{q}} ≜ \arccos (\sqrt{p} \cdot \sqrt{q})

(29)

is the angle between vectors $\sqrt{p}$ and $\sqrt{q}$ and $\sqrt{p} \cdot \sqrt{q} = B (p, q)$ is the Bhattacharyya coefficient between p and q.

The above recapitulates the familiar fact that the K-L divergence is close to the squared Euclidean distance between nearby distributions [13] (here written in square root space); the leading order term is symmetric in p and q and depends on p and q only through the angle $θ_{\sqrt{p}, \sqrt{q}}$ . This enables us to guess that the limiting optimal path is a geodesic in the Euclidean metric restricted to the surface of the unit sphere and that the angle between adjacent points $θ_{\sqrt{w_{i}^{(k) *}}, \sqrt{w_{i - 1}^{(k) *}}} \to \frac{1}{k} θ_{\sqrt{p}, \sqrt{q}}$ . Theorem 3 confirms this intuition. See Appendix VI-C [8] for proof details.
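The small-angle approximation D(p‖q) ≈ 2θ² from (27) and (28) is easy to probe numerically. In the sketch below (names are illustrative), the angle is computed from the chord length ‖√p − √q‖₂ = 2 sin(θ/2), which is algebraically the same angle as arccos(√p · √q) in (29) but better conditioned when p and q are close. The divergence helper assumes q[x] > 0 wherever p[x] > 0.

```python
import math

def kl(p, q):
    """D(p||q) in nats, assuming q[x] > 0 wherever p[x] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def bhattacharyya_angle(p, q):
    """Angle between sqrt(p) and sqrt(q), via the chord length 2 * sin(theta / 2)."""
    chord = math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))
    return 2.0 * math.asin(min(1.0, chord / 2.0))

def small_angle_ratio(p, q):
    """D(p||q) / (2 * theta^2); tends to 1 as p approaches q."""
    theta = bhattacharyya_angle(p, q)
    return kl(p, q) / (2.0 * theta ** 2)
```

Shrinking the perturbation between p and q by a factor of ten should shrink the deviation of this ratio from 1 by roughly the same factor, consistent with the cubic remainder in (28).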

Theorem 3

Let D(p‖q), D^(k)(p‖q) be as in (2), (14), respectively. Let p, q ∈ Δ and D(p‖q) < ∞ and D(q‖p) < ∞.

1) Scaling of D^(k)(p‖q) in k: $D^{(k)} (p ‖ q) = O (\frac{1}{k})$ and
$\lim_{k \to \infty} k D^{(k)} (p ‖ q) = \lim_{k \to \infty} k D^{(k)} (q ‖ p) = 2 θ_{\sqrt{p}, \sqrt{q}}^{2}$ (30)
where (as in (29)) $\cos θ_{\sqrt{p}, \sqrt{q}} = \sqrt{p} \cdot \sqrt{q} = B (p, q)$ .
2) The limiting path $(w_{i}^{(k) *})$ (15) as k → ∞ is a geodesic in the Euclidean metric restricted to the unit sphere (equivalently, a geodesic in the Fisher information metric on the simplex): part of a great circle connecting $\sqrt{p}$ and $\sqrt{q}$ with constant angular speed. Let t ∈ [0,1], then
$w_{⌊ t k ⌋}^{(k) *} \to \frac{1}{Z_{τ_{t}}} {(τ_{t} \sqrt{p} + (1 - τ_{t}) \sqrt{q})}^{2}$ (31)
in L² norm as k → ∞, where ⌊x⌋ denotes the floor function, $Z_{τ_{t}}$ is a constant¹⁰, and τ_t : [0,1] → [0,1] is a reparametrization of “time” t that ensures constant angular speed¹¹ on the unit sphere.

Remarks

The condition D(p‖q) < ∞ and D(q‖p) < ∞ ensures that p[x] > 0 ⇔ q[x] > 0, so the series expansion (27) does not diverge and the limits in (30) match. Figure 2 depicts the limiting optimal path for a particular choice of p and q. For a finite number of steps k, the limiting value of the quantity $w_{τ_{i / k}}^{(k) *}$ (31) provides a decent initial guess for the iterative computation of $w_{i}^{(k) *}$ described in the remarks following the statement of Theorem 2.
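One way to generate points on the limiting path is standard spherical linear interpolation between √q and √p, followed by squaring. This is a sketch under that assumption rather than the authors' parametrization: it traces the same great circle at constant angular speed, and in the notation of (31) it corresponds to τ_t = b/(a + b) and Z = 1/(a + b)², where a and b are the interpolation weights computed in the code. The function name is illustrative.

```python
import math

def great_circle_path(p, q, k):
    """Return [w_0 = q, ..., w_k = p], equally spaced in angle between sqrt(q) and sqrt(p)."""
    sp = [math.sqrt(x) for x in p]
    sq = [math.sqrt(x) for x in q]
    cos_theta = min(1.0, sum(a * b for a, b in zip(sp, sq)))
    theta = math.acos(cos_theta)
    if theta < 1e-12:                       # p and q coincide numerically
        return [list(q) for _ in range(k + 1)]
    path = []
    for i in range(k + 1):
        t = i / k
        a = math.sin((1 - t) * theta) / math.sin(theta)   # weight on sqrt(q)
        b = math.sin(t * theta) / math.sin(theta)         # weight on sqrt(p)
        path.append([(a * y + b * x) ** 2 for x, y in zip(sp, sq)])
    return path
```

The interpolated vectors have unit Euclidean norm by construction, so their squares sum to one without an explicit normalization step.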

V. Applications and further directions

We conclude by offering an application of the chained K-L divergences in maximum likelihood inference and an interpretation in terms of a thought experiment in thermodynamics.

A. ML inference and mutual information

The k-step K-L divergence D^(k)(p‖E) = inf_{q∈E} D^(k)(p‖q) “from” a set of distributions E has a maximum likelihood interpretation: q* = arg min_{q∈E} D^(k)(p‖q) maximizes the likelihood to draw empirical distribution p ∈ Δ_n in k steps of the Wright-Fisher process with initial distribution q ∈ E. A special case of this is a k-step generalization of the mutual information I(X; Y) between random variables X and Y with finite alphabets $X$ , $Y$ , respectively, jointly distributed as $p_{X Y} \in Δ^{X \times Y}$ :

I^{(k)} (X; Y) ≜ \min_{q_{X}, q_{Y}} D^{(k)} (p_{X Y} ‖ q_{X} q_{Y})

(32)

with the minimum attained by the k-step “marginals” $(q_{X}^{(k) *}, q_{Y}^{(k) *}) ≜ \arg \min_{q_{X}, q_{Y}} D^{(k)} (p_{X Y} ‖ q_{X} q_{Y})$ . We can check that the minimum exists and is unique¹². For k = 1, we have I⁽¹⁾ (X; Y) = I(X; Y) and $(q_{X}^{(1) *}, q_{Y}^{(1) *}) = (p_{X}, p_{Y})$ (the marginals of p_XY), but for k > 1 this is not the case. I^(k) is monotonically decreasing in k.

Written in the minimization form (32), computing the mutual information I(X; Y) corresponds to finding the maximum likelihood distribution under which X and Y are independent given data with empirical distribution p_XY. In the context of the Wright-Fisher process, suppose we assume that two genetic loci X and Y were independently distributed k generations ago, but are no longer independent due to the intervening neutral genetic drift; then the maximum likelihood ancestral distribution is $q_{X}^{(k) *} q_{Y}^{(k) *}$ given the current distribution p_XY. After enough time, all but one pair (x, y) of alleles fixes and the two loci become independent again, but neutral drift induces dependence before fixation occurs.
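For k = 1 the minimization form (32) can be verified directly, because D(p_XY‖q_X q_Y) = I(X; Y) + D(p_X‖q_X) + D(p_Y‖q_Y) for any product distribution, so the minimum is attained at the marginals. The sketch below checks this decomposition; the function names are illustrative, the joint distribution is assumed to be a nested list indexed as pxy[x][y], and the helper `kl` assumes finite divergence.

```python
import math

def kl(p, q):
    """D(p||q) in nats, assuming q[x] > 0 wherever p[x] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def marginals(pxy):
    """Row and column sums of the joint distribution pxy[x][y]."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return px, py

def divergence_from_product(pxy, qx, qy):
    """D(p_XY || q_X q_Y) with the product distribution built entrywise."""
    joint = [v for row in pxy for v in row]
    product = [a * b for a in qx for b in qy]
    return kl(joint, product)

def mutual_information(pxy):
    """I(X; Y) as the k = 1 case of (32), evaluated at the marginals of p_XY."""
    px, py = marginals(pxy)
    return divergence_from_product(pxy, px, py)
```

Any other choice of (q_X, q_Y) adds the two non-negative marginal divergences to I(X; Y), which is why the marginals are optimal when k = 1; no comparably simple decomposition is claimed here for k > 1.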

A related object we can construct with the chained K-L divergence is D^(k)(p_XY‖p_Xp_Y); this, like (32), matches the mutual information I(X; Y) for k = 1. The authors do not yet give an operational meaning for this quantity.

B. Maxwell’s demon

Maxwell’s demon is a thought experiment in thermodynamics, here envisioned as a model of a desalination plant: let distribution q correspond to the ambient relative concentrations of some chemicals in the “ocean” (so $X = \{\text{sodium ion}, \text{water}, \text{potassium ion}, \dots\}$ ). The demon’s goal is to achieve a desired concentration of chemicals in the water supply; let E denote the set of concentrations the demon considers acceptable. The demon operates by admitting n molecules drawn from the ambient distribution q into a vesicle (temporary storage); if it so happens that p − the empirical distribution of the n molecules − satisfies p ∈ E, then the demon releases the contents of the vesicle into the water supply, and otherwise the demon dumps the contents back in the ocean. See Figure 3.

Fig. 3 — (left) Maxwell’s demon stands ready to release particles into the water supply below if the distribution of a random sample of n happens to be in E. (right) Two demons use an intermediate stage E′ to desalinate faster.

How quickly does the demon desalinate? From the preceding discussion we know that the demon releases the n molecules into the water supply with probability about e^{−nD(E‖q)}. Suppose we now break up the desalination process into two demons, so there is an intermediate stage between the ocean and the water supply. If either of the two demons rejects, then both reject¹³. Then the optimal 2-demon desalination rate is $e^{- n D^{(2)} (E ‖ q)}$ ¹⁴, which is about twice as large as the 1-demon rate. Adding more stages increases the rate further; in the k → ∞ limit, we have a concentration gradient across the k stages with no large “jumps.” This is not how actual desalinators work, but the design principle of breaking up a large concentration difference into many small stages is used in multi-stage flash desalination and in the loop of Henle in the kidney.

Acknowledgments

The authors gratefully acknowledge Surya Ganguli and Hideo Mabuchi for suggestions and insightful discussions.

VI. Appendix

A. Proof of Theorem 1

1) (Equivalence of recursive and non-recursive definitions of the chained divergence)

We proceed by induction on k. For the base case k = 2 the two definitions (13) and (14) are the same. Now suppose the definitions are equivalent up to k − 1 (abusing notation). Then we expand out the min:

D^{(k)} (p ‖ q) ≜ \min_{w_{k - 1}} (D (p ‖ w_{k - 1}) + D^{(k - 1)} (w_{k - 1} ‖ q))

(33)

= \min_{w_{k - 1}} (D (p ‖ w_{k - 1}) + \min_{w_{1}, \dots, w_{k - 2}} \sum_{i = 1}^{k - 1} D (w_{i} ‖ w_{i - 1}))

(34)

= \min_{w_{1}, \dots, w_{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})

(35)

where the second line follows from the inductive hypothesis. □

Existence and uniqueness of the path follow from the joint strict convexity of D^(k)(p‖q) (Theorem 1.2)).

2) (Joint convexity of D^(k)(·‖·))

We use the following result [9] (Section 3.2.5): Suppose function f (x, y, z) is jointly convex in x, y, z ∈ C, where C is a nonempty convex subset of ℝ^d. Then

g (x, y) ≜ \min_{z \in C} f (x, y, z)

(36)

is jointly convex in x and y.

We proceed by induction on k. For the base case k = 1, D⁽¹⁾(p‖q) = D(p‖q) is jointly strictly convex in p, q ∈ Δ. Now assume the statement holds up to k − 1. Then D(p‖w) + D^{(k − 1)}(w‖q) is jointly strictly convex in p, q,w ∈ Δ and

D^{(k)} (p ‖ q) ≜ \min_{w \in Δ} (D (p ‖ w) + D^{(k - 1)} (w ‖ q))

(37)

is jointly strictly convex in p, q by the above lemma (36) with g(p, q) = D^(k)(p‖q) and f(p, q, w) = D(p‖w) + D^{(k − 1)}(w‖q).

Since D^(k)(p‖q) is the minimum of a strictly convex function over a closed convex set Δ, there is a unique global minimum $(w_{i}^{(k) *})$ for 1 ≤ i ≤ k − 1. □

3) (Monotonicity of D^(k)(p‖q) in k)

Consider the optimal k-step path ( $q = w_{0}^{(k) *}$ , $w_{1}^{(k) *}, \dots, w_{k - 1}^{(k) *}$ , $p = w_{k}^{(k) *}$ ) (15) and modify it by replacing $w_{k - 1}^{(k) *}$ with p:

D^{(k)} (p ‖ q) = \sum_{i = 1}^{k} D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *})

(38)

\overset{(a)}{<} D (p ‖ p) + \sum_{i = 1}^{k - 1} D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *})

(39)

\overset{(b)}{\leq} D^{(k - 1)} (p ‖ q)

(40)

(a) follows because we know from Theorem 2 that $w_{k - 1}^{(k) *} \neq p$ and since we showed above that the optimal path $(w_{i}^{(k) *})$ is unique. (b) follows since in general $w_{i}^{(k - 1) *} \neq w_{i}^{(k) *}$ . □

4) (Bounding $P_{n}^{(k)} (p | q)$ )

First the upper bound (18). We proceed by induction on k. For k = 1, the statement holds (see [1] Theorem 11.1.4). Now assume the statement holds up to k − 1. Now using (9) we write

P_{n}^{(k)} (p | q) = \sum_{w \in Δ_{n}} P_{n} (p | w) P_{n}^{(k - 1)} (w | q)

(41)

\overset{(a)}{\leq} {(n + 1)}^{(k - 2) | X |} \sum_{w \in Δ_{n}} e^{- n (D (p ‖ w) + D^{(k - 1)} (w ‖ q))}

(42)

\leq {(n + 1)}^{(k - 2) | X |} | Δ_{n} | max_{w \in Δ_{n}} e^{- n (D (p ‖ w) + D^{(k - 1)} (w ‖ q))}

(43)

\overset{(b)}{\leq} {(n + 1)}^{(k - 1) | X |} max_{w \in Δ} e^{- n (D (p ‖ w) + D^{(k - 1)} (w ‖ q))}

(44)

= {(n + 1)}^{(k - 1) | X |} e^{- n {min}_{w \in Δ} (D (p ‖ w) + D^{(k - 1)} (w ‖ q))}

(45)

\overset{(c)}{=} {(n + 1)}^{(k - 1) | X |} e^{- n D^{(k)} (p ‖ q)}

(46)

where (a) follows from the inductive hypothesis, (b) follows from $| Δ_{n} | \leq {(n + 1)}^{| X |}$ (see [1] Theorem 11.1.1) and from optimizing over Δ rather than Δ_n ⊂ Δ, and (c) follows from the recursive definition of the chained K-L divergence (11).

Next we find an asymptotic (in n ⟶ ∞) lower bound. Let (p_n)_{n∈ℕ}, p_n ∈ Δ_n, lim_{n⟶∞} D(p_n‖p) = 0. Choose an n-indexed sequence ${(v_{n, i})}_{i = 1}^{k - 1}$ such that v_{n,i} ∈ Δ_n, $D (v_{n, i} ‖ w_{i}^{*}) \to 0$ as n ⟶ ∞. Such a sequence exists because ∪_n Δ_n is dense in Δ. Denote v_{n,k} ≜ p_n and $w_{k}^{*} ≜ p$ . Denote

ε_{n, i} ≜ v_{n, i} - w_{i}^{*}

(47)

By Pinsker’s inequality

\lim_{n \to \infty} | ε_{n, i} [x] | = 0

(48)

for all i ∈ {1,…, k}, $x \in X$ . Moreover, since $\lim_{n \to \infty} D (v_{n, i} ‖ w_{i}^{*}) = 0$ by assumption, then for large enough n we have $Support (v_{n, i}) = Support (w_{i}^{*})$ , so for large enough n

Support (ε_{n, i}) \subset Support (w_{i}^{*})

(49)

Therefore for i ∈ {1,…, k}

\lim_{n \to \infty} D (v_{n, i} ‖ v_{n, i - 1}) = \lim_{n \to \infty} \sum_{x \in X} (w_{i}^{*} [x] + ε_{n, i} [x]) \log (\frac{w_{i}^{*} [x] + ε_{n, i} [x]}{w_{i - 1}^{*} [x] + ε_{n, i - 1} [x]})

(50)

= \sum_{x \in X} w_{i}^{*} [x] \log (\frac{w_{i}^{*} [x]}{w_{i - 1}^{*} [x]})

(51)

= D (w_{i}^{*} ‖ w_{i - 1}^{*})

(52)

where the second equality follows from (48) and (49) and ε_{n,0}[x] ≜ 0.
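The approximating sequence v_{n,i} used above can be built explicitly. One simple construction, offered here as an illustration and not as the authors' choice, rounds each component of a target distribution w down to a multiple of 1/n and hands the leftover mass to the components with the largest fractional parts. Components where w is zero stay at zero, so the support of the rounded type is contained in the support of w. The names are hypothetical.

```python
import math

def kl(p, q):
    """D(p||q) in nats, assuming q[x] > 0 wherever p[x] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def nearest_type(w, n):
    """Round distribution w to a type with denominator n (largest-remainder rounding)."""
    scaled = [wx * n for wx in w]
    counts = [math.floor(s) for s in scaled]
    leftover = n - sum(counts)
    # Give the leftover units to the components with the largest fractional parts.
    order = sorted(range(len(w)), key=lambda x: scaled[x] - counts[x], reverse=True)
    for x in order[:leftover]:
        counts[x] += 1
    return [c / n for c in counts]
```

Each component of the result differs from w by at most 1/n, so D(v_n‖w) tends to zero as n grows whenever w is held fixed.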

Using (9) we write

P_{n}^{(k)} (p_{n} | q) = \sum_{(w_{i}) \in Δ_{n}^{k - 1}} \prod_{i = 1}^{k} P (w_{i} | w_{i - 1})

(53)

\overset{(a)}{⩾} \frac{1}{{(n + 1)}^{k | X |}} \sum_{(w_{i}) \in Δ_{n}^{k - 1}} e^{- n \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})}

(54)

⩾ \frac{1}{{(n + 1)}^{k | X |}} e^{- n {min}_{(w_{i}) \in Δ_{n}^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})}

(55)

with w₀ = q, w_k = p_n, (w_i) = (w₁,…,w_k−₁), where (a) follows from the lower bound $P_{n} (p | q) ⩾ e^{- n D (p ‖ q)} / {(n + 1)}^{| X |}$ (see [1], Theorem 11.1.4). Therefore

\underset{n \to \infty}{\lim \inf} \frac{1}{n} \log P_{n}^{(k)} (p_{n} | q) ⩾ \underset{n \to \infty}{\lim \inf} (- \frac{k | X | \log (n + 1)}{n} - min_{(w_{i}) \in Δ_{n}^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}))

(56)

= \underset{n \to \infty}{\lim \inf} (- min_{(w_{i}) \in Δ_{n}^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}))

(57)

⩾ \underset{n \to \infty}{\lim \inf} (- \sum_{i = 1}^{k} D (v_{n, i} ‖ v_{n, i - 1}))

(58)

\overset{(a)}{=} - \sum_{i = 1}^{k} D (w_{i}^{*} ‖ w_{i - 1}^{*})

(59)

= - D^{(k)} (p ‖ q)

(60)

where (a) follows by (52). Recalling the upper bound (18) we conclude

\lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (p_{n} | q) = - D^{(k)} (p ‖ q)

(61)

The first equality in (19) follows by letting v_{n,k} = p_n = p, ε_{n,k}[x] = 0 for all n, $x \in X$ , so $\lim_{n \to \infty} D (p ‖ v_{n, k - 1}) = D (p ‖ w_{k - 1}^{*})$ , so (59) holds, and

\lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (p | q) = \lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (p_{n} | q) = - D^{(k)} (p ‖ q)

(62)

Equivalently, $P_{n}^{(k)} (p | q) ≐ e^{- n D^{(k)} (p ‖ q)}$ . □

5) (Sanov’s theorem)

This proof is very similar to the previous proof for the case E = {p} and to the proof of Sanov’s theorem in [1].

First the upper bound (20). Using (7) and (10) we write

P_{n}^{(k)} (E | q) = \sum_{p \in E \cap Δ_{n}} \sum_{w \in Δ_{n}} P_{n} (p | w) P_{n}^{(k - 1)} (w | q)

(63)

\overset{(a)}{\leq} {(n + 1)}^{(k - 1) | X |} \sum_{p \in E \cap Δ_{n}} e^{- n D^{(k)} (p ‖ q)}

(64)

\leq {(n + 1)}^{(k - 1) | X |} \sum_{p \in E \cap Δ_{n}} e^{- n \inf_{p \in E} D^{(k)} (p ‖ q)}

(65)

= {(n + 1)}^{(k - 1) | X |} \sum_{p \in E \cap Δ_{n}} e^{- n D^{(k)} (E ‖ q)}

(66)

\overset{(b)}{\leq} {(n + 1)}^{k | X |} e^{- n D^{(k)} (E ‖ q)}

(67)

where (a) follows from the upper bound (18) and (b) follows from $| E \cap Δ_{n} | \leq | Δ_{n} | \leq {(n + 1)}^{| X |}$ .

By assumption E is closed and convex, so

p^{(k) *} = \arg min_{p \in E} D^{(k)} (p ‖ q)

(68)

is the k-step I-projection of q on E (12). We find an asymptotic (in n ⟶ ∞) lower bound. Using (7) and (10) we write

P_{n}^{(k)} (E | q) = \sum_{p \in E \cap Δ_{n}} \sum_{(w_{i}) \in Δ_{n}^{k - 1}} \prod_{i = 1}^{k} P (w_{i} | w_{i - 1})

(69)

\overset{(a)}{⩾} \frac{1}{{(n + 1)}^{k | X |}} \sum_{p \in E \cap Δ_{n}} \sum_{(w_{i}) \in Δ_{n}^{k - 1}} e^{- n \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})}

(70)

⩾ \frac{1}{{(n + 1)}^{k | X |}} e^{- n {min}_{p \in E \cap Δ_{n}, (w_{i}) \in Δ_{n}^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})}

(71)

with w₀ = q, w_k = p, (w_i) = (w₁,…,w_k₋₁), where (a) follows from the lower bound $P_{n} (p | q) ⩾ e^{- n D (p ‖ q)} / {(n + 1)}^{| X |}$ (see [1], Theorem 11.1.4).

By assumption E ∩ Δ_n is non-empty for sufficiently large n, so the lower bound (71) is not vacuous. We can then find a sequence (p_n)_n with p_n ∈ E ∩ Δ_n and $\lim_{n \to \infty} D (p_{n} ‖ p^{(k) *}) = 0$ , and an n-indexed sequence ${(v_{n, i})}_{i = 1}^{k - 1}$ such that $v_{n, i} \in Δ_{n}$ and $D (v_{n, i} ‖ w_{i}^{*}) \to 0$ as n ⟶ ∞. Such sequences exist because ∪_n Δ_n is dense in Δ. Recycling the argument in (52) we conclude

\lim_{n \to \infty} D (v_{n, i} ‖ v_{n, i - 1}) = D (w_{i}^{*} ‖ w_{i - 1}^{*})

(72)

where $w_{k}^{*} \triangleq p^{(k) *}$ . Then

\underset{n \to \infty}{\lim \inf} \frac{1}{n} \log P_{n}^{(k)} (E | q) ⩾ \underset{n \to \infty}{\lim \inf} (- \frac{k | X | \log (n + 1)}{n} - min_{p \in E \cap Δ_{n}} min_{(w_{i}) \in Δ_{n}^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}))

(73)

= \underset{n \to \infty}{\lim \inf} (- min_{p \in E \cap Δ_{n}} min_{(w_{i}) \in Δ_{n}^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}))

(74)

⩾ \underset{n \to \infty}{\lim \inf} (- D (p_{n} ‖ v_{n, k - 1}) - \sum_{i = 1}^{k - 1} D (v_{n, i} ‖ v_{n, i - 1}))

(75)

\overset{(a)}{=} - \sum_{i = 1}^{k} D (w_{i}^{*} ‖ w_{i - 1}^{*})

(76)

= - D^{(k)} (E ‖ q)

(77)

where (a) follows by (72).

Recalling the upper bound (20) we conclude

\lim_{n \to \infty} \frac{1}{n} \log P_{n}^{(k)} (E | q) = - D^{(k)} (p^{(k) *} ‖ q) = - D^{(k)} (E ‖ q)

(78)

Equivalently, $P_{n}^{(k)} (E | q) ≐ e^{- n D^{(k)} (E ‖ q)}$ . □

6) (Conditional limit theorem)

We follow closely the proof of the conditional limit theorem in [1]. Let $p^{(k) *} = \arg {min}_{p \in E} D^{(k)} (p ‖ q)$ (12) be the k-step I-projection of q on E (p^(k)* exists and is unique since E is closed and convex by assumption), and let $(w_{1}^{*}, \dots, w_{k - 1}^{*}) = \arg {min}_{(w_{i}) \in Δ^{k - 1}} \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})$ , with w₀ = q and w_k = p^(k)*, be the optimal path (15) from q to p^(k)*.

Next define the set

S_{t} \triangleq \{(w_{1}, \dots, w_{k - 1}, p) \in Δ^{k} : \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}) \leq t\}

(79)

Let

D * = \sum_{i = 1}^{k} D_{i}^{*} = D^{(k)} (E ‖ q)

(80)

Now define the set

A \triangleq S_{D * + 2 δ} \cap (Δ^{k - 1} \times E)

(81)

where the last term ensures that ${(w_{i})}_{i = 1}^{k} \in A \Rightarrow w_{k} \in E$ . Define the set

B \triangleq (Δ^{k - 1} \times E) \setminus A

(82)

Thus A ∪ B = Δ^k−1 × E. In the summations below, $v = (w_{1}, \dots, w_{k - 1}, w_{k}) \in Δ_{n}^{k - 1} \times (E \cap Δ_{n})$ . Then

P_{n}^{(k)} (B | q) = \sum_{v : \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}) > D * + 2 δ} \prod_{i = 1}^{k} P (w_{i} | w_{i - 1})

(83)

\overset{(a)}{\leq} \sum_{v : \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}) > D * + 2 δ} e^{- n \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})}

(84)

\leq \sum_{v : \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}) > D * + 2 δ} e^{- n (D * + 2 δ)}

(85)

\overset{(b)}{\leq} {(n + 1)}^{k | X |} e^{- n (D * + 2 δ)}

(86)

where (a) follows from the upper bound (18) in the case k = 1 and (b) follows since $| Δ_{n}^{k - 1} \times (E \cap Δ_{n}) | \leq | Δ_{n} |^{k} \leq {(n + 1)}^{k | X |}$ .

In the summations below, $v = (w_{1}, \dots, w_{k - 1}, w_{k}) \in Δ_{n}^{k - 1} \times (E \cap Δ_{n})$ . We write

P_{n}^{(k)} (A | q) \overset{(a)}{⩾} P_{n}^{(k)} (A \cap S_{D * + δ} | q)

(87)

= \sum_{v : \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}) \leq D * + δ} \prod_{i = 1}^{k} P (w_{i} | w_{i - 1})

(88)

\overset{(b)}{⩾} \sum_{v : \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1}) \leq D * + δ} \frac{e^{- n \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})}}{{(n + 1)}^{k | X |}}

(89)

\overset{(c)}{⩾} \frac{e^{- n (D * + δ)}}{{(n + 1)}^{k | X |}} for sufficiently large n

(90)

where in (a) we used $S_{D * + δ} \subset S_{D * + 2 δ}$ , in (b) we used the lower bound $P_{n} (p | q) ⩾ e^{- n D (p ‖ q)} / {(n + 1)}^{| X |}$ (see [1], Theorem 11.1.4), and (c) follows since $S_{D * + δ} \cap (Δ^{k - 1} \times E) \cap Δ_{n}^{k} \neq \emptyset$ for n sufficiently large (since E ∩ Δ_n ≠ Ø for n sufficiently large by assumption).

Now recall the random variables ${(X_{t}^{(i)})}_{t \in {1, \dots, n}}$ defined in the statement of Theorem 1.6) and define the empirical distributions ŵ_i ∈ Δ_n

{\hat{w}}_{i} [x] \triangleq \frac{1}{n} | \{t : x_{t}^{(i)} = x\} |

(91)

for all $x \in X$ , i ∈ {1,…, k}. Then for n sufficiently large

ℙ (({\hat{w}}_{1}, \dots, {\hat{w}}_{k}) \in B | {\hat{w}}_{k} \in E) = \frac{P_{n}^{(k)} (B \cap (Δ^{k - 1} \times E))}{P_{n}^{(k)} (Δ^{k - 1} \times E)}

(92)

\leq \frac{P_{n}^{(k)} (B)}{P_{n}^{(k)} (A)}

(93)

\overset{(a)}{\leq} \frac{{(n + 1)}^{k | X |} e^{- n (D * + 2 δ)}}{\frac{e^{- n (D * + δ)}}{{(n + 1)}^{k | X |}}}

(94)

= {(n + 1)}^{2 k | X |} e^{- n δ}

(95)

where (a) follows from (86) and (90). The above quantity goes to 0 as n ⟶ ∞, so $P (({\hat{w}}_{1}, \dots, {\hat{w}}_{k}) \in A | {\hat{w}}_{k} \in E) \to 1$ as n ⟶ ∞.

Next let’s show that elements of A are close to $(w_{1}^{*}, \dots, w_{k}^{*})$ in L¹ norm. This follows from the strict joint convexity of the function

Λ ({(w_{i})}_{i = 1}^{k}) \triangleq \sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})

(96)

on the compact, convex set Δ^k−1 × E. We state a lemma:

Lemma 4

Let f(x) be jointly strictly convex in x ∈ C, where C is a compact, convex subset of ℝ^d, and let $x * \triangleq \arg {min}_{x \in C} f (x)$ . For δ > 0 denote the sublevel set

T_{δ} \triangleq \{x \in C : f (x) - f (x *) \leq δ\}

(97)

Then

\lim_{δ \to 0} max_{x \in T_{δ}} {‖ x - x * ‖}_{1} = 0

(98)

Proof

Suppose (98) does not hold. Then there exists ε > 0 such that for all δ > 0

max_{x \in T_{δ}} {‖ x - x * ‖}_{1} ⩾ ε

(99)

Let T′ ≜ {x ∈ C : ║x − x*║₁ ⩾ ε} and $δ_{ε} \triangleq {min}_{x \in T^{'}} (f (x) - f (x *)) > 0$ , where the min exists by compactness of T′ and the inequality follows since x ∈ T′ ⇒ x ≠ x* and f(x) is jointly strictly convex, so its minimizer x* is unique. Then

x \in T_{δ_{ε} / 2} \Rightarrow f (x) - f (x *) \leq δ_{ε} / 2 < δ_{ε}

(100)

\Rightarrow x \notin T^{'}

(101)

\Rightarrow {‖ x - x * ‖}_{1} < ε

(102)

This contradicts (99), so (98) holds. ■

The function $Λ ({(w_{i})}_{i = 1}^{k})$ is jointly strictly convex in ${(w_{i})}_{i = 1}^{k}$ on the compact, convex set Δ^k−1 × E and attains its minimum value $Λ ({(w_{i}^{*})}_{i = 1}^{k}) = D^{(k)} (E ‖ q)$ at ${(w_{i}^{*})}_{i = 1}^{k}$ , so by Lemma 4

\lim_{δ \to 0} max_{{(w_{i})}_{i = 1}^{k} \in A} {‖ {(w_{i})}_{i = 1}^{k} - {(w_{i}^{*})}_{i = 1}^{k} ‖}_{1} = 0

(103)

Now recalling our earlier conclusion that for all δ > 0 $P (({\hat{w}}_{1}, \dots, {\hat{w}}_{k}) \in A | {\hat{w}}_{k} \in E) \to 1$ as n ⟶ ∞ and using (103), we conclude that for all ε > 0, all i ∈ {1,…, k}, and all $x \in X$

\lim_{n \to \infty} P (| {\hat{w}}_{i} [x] - w_{i}^{*} [x] | ⩾ ε | {\hat{w}}_{k} \in E) = 0

(104)

□

B. Proof of Theorem 2

Proof

The proof of the local optimality condition (24) is given after the statement of this Theorem.

Now let’s prove the geometric lower bound (25). We first state a lemma.

Lemma 5

Let p, q ∈ Δ with D(p‖q) < ∞, let w*(p, q) be as in (23), and let λ^⋆ be the value of the Lagrange multiplier for which w*(p, q) is normalized. Then

λ^{⋆} \leq 1

(105)

Proof

First define the function

Z_{λ} (p, q) \triangleq \sum_{x \in X} q [x] f_{λ} (\frac{p [x]}{q [x]})

(106)

where

f_{λ} (t) \triangleq \begin{cases} \frac{- t}{W (t e^{λ})} + \frac{t}{W (e^{λ})} & : t > 0 \\ - e^{- λ} & : t = 0 \end{cases}

(107)

We can check that f_λ(·) is convex and f_λ(1) = 0, so Z_λ defines an f-divergence. Now taking λ^⋆ to be the value for which w*(p, q) (23) is normalized, we write

Z_{λ^{⋆}} (p, q) = \sum_{x \in X} q [x] f_{λ^{⋆}} (\frac{p [x]}{q [x]})

(108)

\overset{(a)}{=} \sum_{x \in X} (- w * (p, q) [x] + \frac{p [x]}{W (e^{λ^{⋆}})})

(109)

= - 1 + \frac{1}{W (e^{λ^{⋆}})}

(110)

\begin{array}{l} \overset{(b)}{⩾} 0 \\ ⇓ \end{array}

(111)

λ^{⋆} \leq 1

(112)

where in (a) we substituted w*(p, q) (23) and in (b) we used the non-negativity of f-divergences (a consequence of Jensen's inequality). The final implication holds because W is increasing with W(e) = 1. ■

Next let’s state another lemma.

Lemma 6

Let p, q ∈ Δ, D(p‖q) < ∞ and w*(p, q) be as in (23). Then

w * (p, q) [x] ⩾ \sqrt{p [x] q [x]}

(113)

for all $x \in X$ .

Proof

Suppose there exists some $x \in X$ such that (113) does not hold, so $w * (p, q) [x] < \sqrt{p [x] q [x]}$ . This implies p[x], q[x] > 0, so

\begin{matrix} w * (p, q) [x] = \frac{p [x]}{W (e^{λ^{⋆}} \frac{p [x]}{q [x]})} < \sqrt{p [x] q [x]} \\ ⇓ \end{matrix}

(114)

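λ^{⋆} > \sqrt{\frac{p [x]}{q [x]}} - \frac{1}{2} \log (\frac{p [x]}{q [x]}) ⩾ 1

(115)

where the first inequality uses the fact that W is increasing, and the last inequality uses s − log s ⩾ 1 for all s > 0 with $s = \sqrt{p [x] / q [x]}$ . This contradicts Lemma 5, so the statement holds. ■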

Lemma 6 implies

\log (w_{i}^{(k) *} [x]) = \log (w * (w_{i + 1}^{(k) *}, w_{i - 1}^{(k) *}) [x])

(116)

⩾ \frac{1}{2} \log (w_{i + 1}^{(k) *} [x]) + \frac{1}{2} \log (w_{i - 1}^{(k) *} [x])

(117)

Thus $\log w_{i}^{(k) *} [x]$ is concave in i, so

\log (w_{i}^{(k) *} [x]) ⩾ \frac{i}{k} \log (w_{k}^{(k) *} [x]) + \frac{k - i}{k} \log (w_{0}^{(k) *} [x])

(118)

= \frac{i}{k} \log (p [x]) + \frac{k - i}{k} \log (q [x])

(119)

so (25) follows. ■
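The Lambert-W form of the midpoint w*(p, q) and the bounds in Lemmas 5 and 6 and in (25) are easy to probe numerically. The following sketch is not the authors' released code. It assumes strictly positive p and q, implements W by Newton's method, finds λ^⋆ by bisection on [0, 1], and approximates the optimal k-step path by cyclically applying the local optimality condition $w_{i} = w * (w_{i + 1}, w_{i - 1})$ , a coordinate-descent heuristic chosen here for illustration. All function names are hypothetical.

```python
import math

def lambert_w(z):
    """Principal branch W(z) for z >= 0, by Newton's method on w * exp(w) = z."""
    w = math.log1p(z)  # this start is at or above the root, so the iterates decrease to it
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (1.0 + w))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + abs(w)):
            break
    return w

def midpoint(p, q):
    """Return (w, lam) with w[x] = p[x] / W(e^lam * p[x] / q[x]) normalized to sum to 1."""
    def w_of(lam):
        return [px / lambert_w(math.exp(lam) * px / qx) for px, qx in zip(p, q)]
    lo, hi = 0.0, 1.0  # sum(w_of(lam)) is decreasing in lam; bracket assumed from the text
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if sum(w_of(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = w_of(lam)
    total = sum(w)  # renormalize to remove rounding error only
    return [wx / total for wx in w], lam

def kl(a, b):
    """D(a || b) in nats for strictly positive b."""
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b) if ax > 0.0)

def chain_cost(path):
    """sum_i D(w_i || w_{i-1}) along a path listed from q (first) to p (last)."""
    return sum(kl(path[i], path[i - 1]) for i in range(1, len(path)))

def chained_path(p, q, k, sweeps=100):
    """Approximate optimal k-step path via repeated updates w_i <- w*(w_{i+1}, w_{i-1})."""
    path = []
    for i in range(k + 1):  # start from the normalized geometric interpolation
        g = [px ** (i / k) * qx ** (1 - i / k) for px, qx in zip(p, q)]
        s = sum(g)
        path.append([gx / s for gx in g])
    for _ in range(sweeps):
        for i in range(1, k):
            path[i], _lam = midpoint(path[i + 1], path[i - 1])
    return path

# Assumed example distributions on a three-letter alphabet.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
k = 4
w, lam = midpoint(p, q)
path = chained_path(p, q, k)
print("lambda* =", lam)
print("2-step cost =", chain_cost([q, w, p]), " direct D(p||q) =", kl(p, q))
print("k-step cost =", chain_cost(path))
```

In this sketch every sweep preserves the geometric lower bound (25): if the neighbours of w_i dominate their geometric interpolants, then Lemma 6 gives the same for the updated w_i.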

C. Proof of Theorem 3

Proof

Let v_t denote the continuous path (31)

v_{t} \triangleq \frac{1}{Z_{τ_{t}}} {(τ_{t} \sqrt{p} + (1 - τ_{t}) \sqrt{q})}^{2}

(120)

where $Z_{τ_{t}}$ and $τ_{t}$ are as in the statement of Theorem 3. Let $v_{i}^{(k)}$ denote the continuous path (31) sampled at k uniformly spaced points in time:

{(v_{i}^{(k)})}_{i = 0}^{k} \triangleq {(\frac{1}{Z_{τ_{t}}} {(τ_{t} \sqrt{p} + (1 - τ_{t}) \sqrt{q})}^{2})}_{t \in \{0, \frac{1}{k}, \frac{2}{k}, \dots, 1\}}

(121)

Let ${(w_{i}^{(k) *})}_{i = 0}^{k}$ denote the optimal k-step path (15). Then $v_{0}^{(k)} = w_{0}^{(k) *} = q$ and $v_{k}^{(k)} = w_{k}^{(k) *} = p$ . For 1 ≤ i ≤ k denote by $ϕ_{i}^{(k)}$ and $θ_{i}^{(k)}$ the angles between (the square roots of) adjacent distributions on the paths ${(v_{i}^{(k)})}_{i = 0}^{k}$ and ${(w_{i}^{(k) *})}_{i = 0}^{k}$ , respectively:

ϕ_{i}^{(k)} \triangleq \arccos (\sqrt{v_{i}^{(k)}} \cdot \sqrt{v_{i - 1}^{(k)}}) = \frac{1}{k} θ_{\sqrt{p}, \sqrt{q}}

(122)

θ_{i}^{(k)} \triangleq \arccos (\sqrt{w_{i}^{(k) *}} \cdot \sqrt{w_{i - 1}^{(k) *}})

(123)

Next let

μ \triangleq min_{x \in Support (q)} min (p [x], q [x]) > 0

(124)

We have μ > 0 since D(p‖q) < ∞ and D(q‖p) < ∞ by assumption, so Support(p) = Support(q).

Now apply the series expansion for the K-L divergence (27) to ${(v_{i}^{(k)})}_{i = 0}^{k}$ and ${(w_{i}^{(k) *})}_{i = 0}^{k}$ and sum over i ∈ {1,…, k}:

\sum_{i = 1}^{k} D (v_{i}^{(k)} ‖ v_{i - 1}^{(k)}) = \sum_{i = 1}^{k} 2 (ϕ_{i}^{(k) 2} + O (ϕ_{i}^{(k) 4}) + \sum_{x \in X} O (\frac{ϕ_{i}^{(k) 3}}{\sqrt{v_{i}^{(k)}}}))

(125)

= \sum_{i = 1}^{k} (2 ϕ_{i}^{(k) 2} + O (ϕ_{i}^{(k) 3}))

(126)

where the second equality comes from the i- and k-independent lower bound

min_{x \in X} min_{t \in [0, 1]} v_{t} [x] ⩾ μ \triangleq min_{x \in X} min (p [x], q [x]) > 0

(127)

Since D(p‖q) < ∞ and D(q‖p) < ∞ by assumption, we may assume p[x], q[x] > 0 for all $x \in X$ (since otherwise we could restrict attention to $Support (p) = Support (q) \subset X$ ). Thus the last inequality in (127) follows. Similarly we obtain

\sum_{i = 1}^{k} D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *}) = \sum_{i = 1}^{k} 2 (θ_{i}^{(k) 2} + O (θ_{i}^{(k) 4}) + \sum_{x \in X} O (\frac{θ_{i}^{(k) 3}}{\sqrt{w_{i}^{(k) *}}}))

(128)

= \sum_{i = 1}^{k} (2 θ_{i}^{(k) 2} + O (θ_{i}^{(k) 3}))

(129)

where the last equality follows from the i- and k-independent lower bound

min_{x \in X} min_{i \in \{0, \dots, k\}} w_{i}^{(k) *} [x] ⩾ min_{x \in X} min_{i \in \{0, \dots, k\}} p {[x]}^{i / k} q {[x]}^{(k - i) / k} ⩾ μ > 0

(130)

where the first inequality follows from Theorem 2, (25).

Now we write

\lim_{k \to \infty} k \sum_{i = 1}^{k} (D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *}) - D (v_{i}^{(k)} ‖ v_{i - 1}^{(k)})) = \lim_{k \to \infty} 2 k \sum_{i = 1}^{k} (θ_{i}^{(k) 2} - ϕ_{i}^{(k) 2})

(131)

= \lim_{k \to \infty} 2 k \sum_{i = 1}^{k} (θ_{i}^{(k) 2} - \frac{θ_{\sqrt{p}, \sqrt{q}}^{2}}{k^{2}})

(132)

⩾ 0

(133)

where the inequality follows since the path v_t is part of a great circle joining $\sqrt{p}$ and $\sqrt{q}$ at constant angular speed (uniformly spacing the angles $ϕ_{i}^{(k)} = \frac{1}{k} θ_{\sqrt{p}, \sqrt{q}}$ minimizes the sum of their squares subject to their sum $\sum_{i = 1}^{k} ϕ_{i}^{(k)} = θ_{\sqrt{p}, \sqrt{q}}$ ). On the other hand, by the optimality of ${(w_{i}^{(k) *})}_{i = 0}^{k}$ ,

\lim_{k \to \infty} k \sum_{i = 1}^{k} (D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *}) - D (v_{i}^{(k)} ‖ v_{i - 1}^{(k)})) \leq 0

(134)

Thus, combining the above inequalities,

\lim_{k \to \infty} k \sum_{i = 1}^{k} (D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *}) - D (v_{i}^{(k)} ‖ v_{i - 1}^{(k)})) = 0

(135)

We then have

\sum_{i = 1}^{k} (D (w_{i}^{(k) *} ‖ w_{i - 1}^{(k) *}) - D (v_{i}^{(k)} ‖ v_{i - 1}^{(k)})) = \sum_{i = 1}^{k} (O (θ_{i}^{(k) 3}) + O (ϕ_{i}^{(k) 3}))

(136)

= k O (1 / k^{3}) = O (1 / k^{2})

(137)

[why θ^(k) = O(1/k)?]

[Now use joint strict convexity of $\sum_{i = 1}^{k} D (w_{i} ‖ w_{i - 1})$ on the compact, convex set Δ^k−1 × E to complete proof?] ■
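The continuous path (120) and the constant-angular-speed property invoked after (133) can be illustrated numerically. The sketch below assumes strictly positive p ≠ q on a common finite alphabet and uses the closed forms for $Z_{τ_{t}}$ and $τ_{t}$ given in the footnotes; the function names are hypothetical and the code is independent of the authors' repository. It samples the path at k + 1 uniformly spaced times, measures the angle between the square roots of adjacent samples, and compares k times the summed divergences with 2θ², the value suggested by the leading term in (126).

```python
import math

def angle(a, b):
    """Angle between the unit vectors sqrt(a) and sqrt(b), computed from the chord length."""
    chord = math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b)))
    return 2.0 * math.asin(min(1.0, 0.5 * chord))

def kl(a, b):
    """D(a || b) in nats for strictly positive b."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

def geodesic_path(p, q, k):
    """Samples v_t at t = 0, 1/k, ..., 1, together with the total angle theta (requires p != q)."""
    sp = [math.sqrt(x) for x in p]
    sq = [math.sqrt(x) for x in q]
    cos_theta = sum(a * b for a, b in zip(sp, sq))
    theta = angle(p, q)
    path = []
    for i in range(k + 1):
        t = i / k
        # tau_t and Z_tau as stated in the footnotes
        tau = 0.5 * (1.0 + math.tan((t - 0.5) * theta) / math.tan(0.5 * theta))
        z = tau**2 + (1.0 - tau) ** 2 + 2.0 * tau * (1.0 - tau) * cos_theta
        path.append([(tau * a + (1.0 - tau) * b) ** 2 / z for a, b in zip(sp, sq)])
    return path, theta

# Assumed example distributions on a three-letter alphabet.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

path8, theta = geodesic_path(p, q, 8)
print("theta / 8   =", theta / 8)
print("step angles =", [angle(path8[i], path8[i - 1]) for i in range(1, 9)])

for k in (10, 100, 1000):
    path, _theta = geodesic_path(p, q, k)
    total = sum(kl(path[i], path[i - 1]) for i in range(1, k + 1))
    print(k, "k * sum D =", k * total, " 2 * theta^2 =", 2.0 * theta**2)
```

The equal step angles reflect the choice of $τ_{t}$ , which reparametrizes the chord between $\sqrt{q}$ and $\sqrt{p}$ so that its projection onto the sphere moves at constant angular speed.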

Footnotes

¹

$\sqrt{p}$ denotes the vector with components $\sqrt{p [x]}$ .

²

https://github.com/dmitrip/chaineddivergences

³

That is, $a_{n} ≐_{n} b_{n} \Leftrightarrow \lim_{n \to \infty} \frac{1}{n} \log \frac{a_{n}}{b_{n}} = 0$ .

⁴

Convergence in probability, as in the statement of Theorem 1.6) below.

⁵

We use λ − 1 rather than λ to make the answer look more compact.

⁶

W(z) satisfies $z = W (z) e^{W (z)}$ and W(z) ⩾ − 1 corresponds to the principal branch.

⁷

This means finding a root of the smooth, monotonic function of λ $\sum_{x \in X} p [x] / W (e^{λ} \frac{p [x]}{q [x]}) - 1$ with the knowledge that λ* ∈ [0, 1], which is straightforward numerically (for example, by bisection).

⁸

The f-divergence is $D_{f} (p, q) \triangleq \sum_{x \in X} q [x] f (p [x] / q [x])$ with f(t) = −t/W(te^λ) + t/W(e^λ). f(t) is convex and f(1) = 0, so D_f (p, q) is an f-divergence.

⁹

https://github.com/dmitrip/chaineddivergences

¹⁰

We can check that $Z_{τ_{t}} = {τ_{t}}^{2} + {(1 - τ_{t})}^{2} + 2 τ_{t} (1 - τ_{t}) (\sqrt{p} \cdot \sqrt{q})$ .

¹¹

We can check that $τ_{t} = \frac{1}{2} (1 + \frac{\tan ((t - 1 / 2) θ)}{\tan (θ / 2)})$ where $θ = θ_{\sqrt{p}, \sqrt{q}}$ .

¹²

The set of distributions {q_XY : q_XY = q_X q_Y} is log-convex, and letting $q_{X Y}^{(k) *}$ be the “reverse I-projection” of $w_{k - 1}^{(k) *}$ (15) on E, Theorem 1 of [6] proves existence and uniqueness.

¹³

This ensures particle conservation in the intermediate stage, though we can envision other scenarios with variable sample sizes and demon policies.

¹⁴

The intermediate demon must choose an appropriate acceptance set E′ ⊂ E such that the I-projection of q on E′ is w* = arg min_w D(E‖w) + D(w‖q); one possible choice is E′ = {w : D(E‖w) ≤ D(E‖w*)}.

Contributor Information

Dmitri S. Pavlichin, Stanford University

Tsachy Weissman, Stanford University.

References

1. Cover TM, Thomas JA. Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2006.
2. Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
3. Crow JF. Wright and Fisher on inbreeding and random drift. Genetics. 2010;184:609–611. doi: 10.1534/genetics.109.110023.
4. Sanov IN. On the probability of large deviations of a random variable. Mat Sbornik. 1957;42:11–44.
5. Csiszár I. Sanov property, generalized I-projection and a conditional limit theorem. The Annals of Probability. 1984;12:768–793.
6. Csiszár I, Matúš F. Information projections revisited. IEEE Transactions on Information Theory. 2003;49:1474–1490.
7. Papangelou F. The large deviations of a multi-allele Wright-Fisher process mapped on the sphere. The Annals of Applied Probability. 2000;10:1259–1273.
8. Pavlichin DS, Weissman T. Chained Kullback-Leibler divergences. doi: 10.1109/ISIT.2016.7541365. www.stanford.edu/~dmitrip/chaineddivergences.pdf, in preparation.
9. Boyd S, Vandenberghe L. Convex Optimization. Cambridge, UK: Cambridge University Press; 2009.
10. Hájek J. Asymptotically most powerful rank-order tests. The Annals of Mathematical Statistics. 1962;33:1124–1147.
11. Le Cam L. On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics. 1970;41:802–828.
12. Pollard D. Another look at differentiability in quadratic mean. In: Pollard D, Torgersen E, Yang GL, editors. Festschrift for Lucien Le Cam. New York: Springer; 1997. pp. 305–314. ch 9.
13. Amari S, Nagaoka H. Methods of information geometry. American Mathematical Society; 2000.
