Abstract
We present a simple geometrical interpretation for the solution to the multiple hypothesis testing problem in the asymptotic limit. Under this interpretation, the optimal decision rule is a nearest neighbor classifier on the probability simplex.
Index Terms—Geometry, hypothesis testing, large deviations, pattern recognition.
I. Introduction
In the binary hypothesis testing problem with independent and identically distributed (i.i.d.) observations, it is well known that the error probability for the optimal decision rule decays with a constant exponential rate equal to the Chernoff distance between the two hypothesis distributions. The generalization to multiple hypothesis testing for i.i.d. observations was derived by Leang and Johnson [1], and extended to observations modeled as stationary ergodic processes by Schmid and O’Sullivan [2]. In this correspondence, we focus on the i.i.d. case. For M-ary hypothesis testing, the error probability decays exponentially with a rate equal to the minimum Chernoff distance among all distinct pairs of hypothesis distributions. We describe a simple geometrical interpretation of this result, illustrated in Fig. 2. We first review the binary case.
Fig. 2. M-ary hypothesis testing.
II. Binary Hypothesis Testing
Let $\pi_i$, $i = 1, 2$, be the prior probabilities for the hypotheses $H_1$ and $H_2$, and let $Y^n = (Y_1, \ldots, Y_n)$ be an i.i.d. sequence of observations from a finite alphabet $\mathcal{Y}$. We must decide between
$$H_1 : Y^n \sim P_1$$
and
$$H_2 : Y^n \sim P_2.$$
A. Optimal Decision Rule and Error Exponent
The following results are standard (see, e.g., [3, Ch. 12]). Letting π1 = π2 = 1/2, the optimal decision rule can be expressed in terms of the Kullback–Leibler (KL)-divergence as
$$\hat{H}(Y^n) = \arg\min_{i \in \{1, 2\}} D(\hat{P}_{Y^n} \| P_i) \tag{1}$$
where $\hat{P}_{Y^n}$ is the type (empirical histogram) of $Y^n$. This decision rule partitions the probability simplex $\mathcal{P}$ over $\mathcal{Y}$ into two disjoint decision regions $A_1, A_2$. Denoting their respective complements by $A_1^c, A_2^c$, an application of Sanov’s theorem shows that the error probabilities are
$$\Pr\{A_i^c \mid H_i\} \doteq e^{-n D(P_i^* \| P_i)}, \quad i = 1, 2 \tag{2}$$
where
$$P_i^* = \arg\min_{Q \in A_i^c} D(Q \| P_i)$$
and $\doteq$ denotes “equality to first order in the exponent,” i.e., $a_n \doteq b_n$ means $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$ (see [3, p. 55]). For the total probability of error we have
$$\Pr\{\text{error}\} \doteq e^{-n C(P_1, P_2)}. \tag{3}$$
The exponent
$$C(P_1, P_2) = D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2)$$
is the Chernoff distance between $P_1$ and $P_2$ (the distribution $P_{\lambda^*}$ is defined in (5) below), usually written in the alternative but equivalent form
$$C(P_1, P_2) = -\min_{0 \le \lambda \le 1} \log \Big( \sum_{y \in \mathcal{Y}} P_1^{\lambda}(y) \, P_2^{1-\lambda}(y) \Big).$$
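The alternative form lends itself directly to numerical computation. The following minimal sketch (in Python; the two distributions are illustrative choices, not taken from this correspondence) evaluates $C(P_1, P_2)$ by minimizing over $\lambda$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_distance(p1, p2):
    # C(P1, P2) = -min_{0 <= lam <= 1} log sum_y P1(y)^lam * P2(y)^(1-lam)
    f = lambda lam: np.log(np.sum(p1**lam * p2**(1.0 - lam)))
    res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
    return -res.fun

# Illustrative distributions on a ternary alphabet.
P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])
print(chernoff_distance(P1, P2))
```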
B. Geometric Interpretation
Fig. 1 illustrates the preceding results for alphabet size $|\mathcal{Y}| = 3$, with corresponding probability simplex
$$\mathcal{P} = \big\{ p = (p_1, p_2, p_3)^T : p_i \ge 0, \; p_1 + p_2 + p_3 = 1 \big\}. \tag{4}$$
In this illustration, each probability distribution $p \in \mathcal{P}$ is mapped to a point in an equilateral triangle, with each point mass sent to one of the triangle’s vertices, using an affine transformation $r = Tp + b$; the columns of $T$, offset by $b$, are the triangle’s vertices.
Fig. 1. Binary hypothesis testing.
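A minimal sketch of such a mapping follows; the particular vertex coordinates are one conventional choice for an equilateral triangle, not necessarily those of the original figure, and the offset $b$ is taken to be zero:

```python
import numpy as np

# Rows are the triangle's vertices v1, v2, v3 (an assumed, conventional choice).
V = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.5, np.sqrt(3) / 2]])

def simplex_to_triangle(p):
    # r = T p with the vertices as the columns of T (here b = 0); each point
    # mass maps to a vertex, and every distribution to a barycentric mix.
    return np.asarray(p) @ V

print(simplex_to_triangle([1, 0, 0]))        # vertex v1
print(simplex_to_triangle([1/3, 1/3, 1/3]))  # center of the triangle
```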
Given an observation sequence Yn, the decision rule (1) can be interpreted geometrically in terms of the following algorithm:
1) Locate the observation’s type $\hat{P}_{Y^n}$ in the simplex $\mathcal{P}$;
2) Compute the KL-divergence $D(\hat{P}_{Y^n} \| P_i)$ to each hypothesis distribution $P_i$, $i = 1, 2$;
3) Assign $Y^n$ to the closest hypothesis.
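The algorithm translates directly into code; a minimal sketch, with an illustrative ternary alphabet, follows:

```python
import numpy as np

def kl(q, p):
    # D(q || p); terms with q(y) = 0 contribute zero.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def classify(yn, hypotheses, alphabet_size):
    # Nearest-neighbor rule: locate the type, then pick the closest hypothesis.
    type_ = np.bincount(yn, minlength=alphabet_size) / len(yn)
    return int(np.argmin([kl(type_, p) for p in hypotheses]))

P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])
rng = np.random.default_rng(0)
yn = rng.choice(3, size=1000, p=P1)   # observations drawn under H1
print(classify(yn, [P1, P2], 3))      # 0, i.e., decide H1
```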
The rule (1) thus divides $\mathcal{P}$ into two disjoint cells,¹ with centroids $P_1$ and $P_2$, each point being assigned to the nearest centroid, with “distance” measured in KL-divergence.
Next consider (2). Under hypothesis $H_1 : Y^n \sim P_1$, the probability of incorrectly deciding $H_2$, that is, $\Pr\{A_1^c \mid H_1\}$, depends on the distance from $P_1$ to the closest distribution outside of $A_1$. This distribution can be expressed as a point along the geodesic in $\mathcal{P}$ joining $P_1$ and $P_2$
$$P_\lambda(y) = \frac{P_1^{\lambda}(y) \, P_2^{1-\lambda}(y)}{\sum_{y' \in \mathcal{Y}} P_1^{\lambda}(y') \, P_2^{1-\lambda}(y')}, \quad \lambda \in [0, 1] \tag{5}$$
where exponentiation is defined pointwise (e.g., $P_1^{\lambda}(y) = (P_1(y))^{\lambda}$); then $P_1^* = P_{\lambda^*}$ for some $\lambda^* \in [0, 1]$. Conceptually, $\lambda^*$ can be found by setting $\lambda = 0$, then continuously increasing $\lambda$ until the point $P_\lambda$ reaches the border separating regions $A_1$ and $A_2$. Symmetric arguments apply for $\Pr\{A_2^c \mid H_2\}$, and, by symmetry, we must have $D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2)$.
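The construction can be carried out numerically; in the sketch below (illustrative distributions again), $\lambda$ is swept along the geodesic (5) until the two divergences balance, which marks the border:

```python
import numpy as np

def geodesic(p1, p2, lam):
    # P_lambda from (5): pointwise P1^lam * P2^(1-lam), renormalized.
    q = p1**lam * p2**(1.0 - lam)
    return q / q.sum()

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])

lams = np.linspace(0.0, 1.0, 10001)
gaps = [kl(geodesic(P1, P2, l), P1) - kl(geodesic(P1, P2, l), P2) for l in lams]
lam_star = lams[int(np.argmin(np.abs(gaps)))]
p_star = geodesic(P1, P2, lam_star)
# At lam_star the two divergences agree; their common value is the
# Chernoff distance C(P1, P2).
print(lam_star, kl(p_star, P1), kl(p_star, P2))
```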
Finally, interpreting (3), the overall probability of error is determined by measuring the distance along the geodesic from either centroid $P_1$ or $P_2$ to its border. The shorter of the two paths asymptotically dominates the overall probability of error.
III. M-Ary Hypothesis Testing
Now consider the M-ary hypothesis testing problem. Let $\pi_i$ be the prior probability for hypothesis $H_i$. We must decide among
$$H_i : Y^n \sim P_i, \quad i = 1, \ldots, M.$$
As above, we assume equal prior probabilities, $\pi_i = 1/M$, $i = 1, \ldots, M$.
An important special case is the following formulation of the general statistical pattern recognition problem [4]: Given a set of $M$ template patterns $x^n(1), \ldots, x^n(M)$, suppose Nature selects one of the pattern templates $w \in \{1, \ldots, M\}$ at random with probability $p(w) = 1/M$, $w = 1, \ldots, M$. The observation data $Y^n$ is generated from the template $x^n(w)$ according to a noisy observation channel. In this case, the optimal hypothesis test is a classifier, i.e., a rule that infers the pattern class $w$ underlying the observation $Y^n$.
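As a concrete and deliberately simple instance of this formulation, suppose the templates are binary strings and the observation channel is a binary symmetric channel (both choices are assumptions made here for illustration only); maximum-likelihood classification then reduces to nearest-template decoding in Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n, eps = 4, 200, 0.1                      # eps: BSC crossover probability
templates = rng.integers(0, 2, size=(M, n))  # M random binary templates

w = int(rng.integers(M))                     # Nature picks a class uniformly
yn = templates[w] ^ (rng.random(n) < eps)    # template corrupted by the channel

# For a BSC with eps < 1/2, ML classification is the nearest template
# in Hamming distance.
w_hat = int(np.argmin((templates ^ yn).sum(axis=1)))
print(w, w_hat)
```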
A. Optimal Decision Rule and Error Exponent
The natural generalizations of (1)–(3) to the M-ary case, derived by Leang and Johnson [1], are as follows. The optimal decision rule is
$$\hat{H}(Y^n) = \arg\min_{i \in \{1, \ldots, M\}} D(\hat{P}_{Y^n} \| P_i). \tag{6}$$
Denoting the optimal decision regions by $A_i$ and their complements by $A_i^c$, applying Sanov’s theorem yields, for the error probabilities under each hypothesis $H_i$, $i = 1, \ldots, M$,
$$\Pr\{A_i^c \mid H_i\} \doteq e^{-n D(P_i^* \| P_i)}, \quad P_i^* = \arg\min_{Q \in A_i^c} D(Q \| P_i). \tag{7}$$
Hence, for the total probability of error we have
$$\Pr\{\text{error}\} \doteq e^{-n C(P_w, P_{w'})} \tag{8}$$
where
$$(w, w') = \arg\min_{i \ne j} C(P_i, P_j),$$
i.e., $P_w, P_{w'}$ is the closest pair of distributions, measured in Chernoff distance.
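The exponent in (8) can be computed by a search over all distinct pairs; a minimal sketch (same chernoff_distance routine as in Section II, distributions illustrative):

```python
import itertools
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_distance(p1, p2):
    f = lambda lam: np.log(np.sum(p1**lam * p2**(1.0 - lam)))
    return -minimize_scalar(f, bounds=(0.0, 1.0), method="bounded").fun

P = [np.array([0.7, 0.2, 0.1]),
     np.array([0.1, 0.3, 0.6]),
     np.array([0.3, 0.4, 0.3])]

# Dominant exponent in (8): minimum Chernoff distance over distinct pairs.
w, w_ = min(itertools.combinations(range(len(P)), 2),
            key=lambda ij: chernoff_distance(P[ij[0]], P[ij[1]]))
print(w, w_, chernoff_distance(P[w], P[w_]))
```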
B. Geometric Interpretation
Now consider the geometric interpretation of the preceding results for M-ary hypothesis testing. Fig. 2 represents the M-ary generalization of Fig. 1.
Given an observation sequence Yn, the optimal decision rule (6) can be interpreted geometrically in terms of the following algorithm:
1) Locate the observation’s type $\hat{P}_{Y^n}$ in the simplex $\mathcal{P}$;
2) Compute the KL-divergence $D(\hat{P}_{Y^n} \| P_i)$ to each hypothesis distribution $P_i$, $i = 1, \ldots, M$;
3) Assign $Y^n$ to the closest hypothesis.
The rule (6) thus divides $\mathcal{P}$ into $M$ cells, with centroids $P_i$, $i = 1, \ldots, M$, each point being assigned to the nearest centroid.
These cells are convex: Given a cell $A_i$ with centroid $P_i$ and two points $Q_1, Q_2 \in A_i$, let $Q = \lambda Q_1 + (1 - \lambda) Q_2$, $\lambda \in [0, 1]$, let $P_j \in A_j$ be any other centroid, and define
$$\Delta(Q) = D(Q \| P_i) - D(Q \| P_j) = \sum_{y \in \mathcal{Y}} Q(y) \log \frac{P_j(y)}{P_i(y)}.$$
The condition $Q_1, Q_2 \in A_i$ is equivalent to requiring $\Delta(Q_k) < 0$, $k = 1, 2$. Hence, noting that $\Delta(Q_k)$ is linear in $Q_k$, we have
$$\Delta(Q) = \lambda \Delta(Q_1) + (1 - \lambda) \Delta(Q_2) < 0,$$
whence $Q \in A_i$, establishing the convexity of $A_i$. A similar argument shows that the borders of the decision cells are composed of straight lines: $Q_1, Q_2$ are on the boundary between $A_i, A_j$ if and only if $\Delta(Q_1) = \Delta(Q_2) = 0$; hence, $\Delta(Q) = \lambda \Delta(Q_1) + (1 - \lambda) \Delta(Q_2) = 0$.
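Because $\Delta$ is the expectation under $Q$ of the fixed log-likelihood ratio $\log(P_j/P_i)$, its linearity is immediate and easy to confirm numerically; a quick sketch with arbitrary illustrative points:

```python
import numpy as np

def delta(q, pi, pj):
    # Delta(Q) = D(Q||Pi) - D(Q||Pj) = sum_y Q(y) log(Pj(y)/Pi(y)): linear in Q.
    return float(np.sum(q * np.log(pj / pi)))

Pi = np.array([0.7, 0.2, 0.1])
Pj = np.array([0.1, 0.3, 0.6])
Q1 = np.array([0.5, 0.3, 0.2])
Q2 = np.array([0.2, 0.5, 0.3])

lam = 0.4
Q = lam * Q1 + (1 - lam) * Q2
# The two values agree up to floating-point error.
print(delta(Q, Pi, Pj))
print(lam * delta(Q1, Pi, Pj) + (1 - lam) * delta(Q2, Pi, Pj))
```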
Turning to (7), under each hypothesis $H_i$ the corresponding probability of error is again determined by the distance from the centroid $P_i$ of $A_i$ to the nearest point $P_i^*$ outside $A_i$. However, in the M-ary case, each complement $A_i^c$ consists of $M - 1$ other decision regions, and $P_i^*$ lies on the geodesic joining $P_i$ with the closest neighboring region to $A_i$. Note that, unlike the binary case, for two neighboring cells $A_i, A_j$ it is not necessarily the case that $D(P_i^* \| P_i) = D(P_j^* \| P_j)$, although this is the case when the geodesic joining $P_i$ and $P_j$ intersects their shared border.
Finally, consider (8). To determine the overall probability of error, we compare the geodesic distances from each centroid to its nearest border, $D(P_i^* \| P_i)$, $i = 1, \ldots, M$. The overall probability of error is dominated by the shortest such path, of length $C(P_w, P_{w'})$.
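This asymptotic statement can be checked empirically by simulating the nearest-neighbor rule at a moderate block length and comparing the empirical error exponent with the minimum pairwise Chernoff distance. The sketch below does exactly this (illustrative distributions; agreement is only rough at finite $n$ because of polynomial prefactors):

```python
import numpy as np

rng = np.random.default_rng(0)
P = [np.array([0.7, 0.2, 0.1]),
     np.array([0.1, 0.3, 0.6]),
     np.array([0.3, 0.4, 0.3])]
M, n, trials = len(P), 50, 20000

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

errors = 0
for _ in range(trials):
    w = int(rng.integers(M))                 # uniform prior over hypotheses
    yn = rng.choice(3, size=n, p=P[w])
    type_ = np.bincount(yn, minlength=3) / n
    if int(np.argmin([kl(type_, p) for p in P])) != w:
        errors += 1

# -log(P_err)/n should approach the minimum pairwise Chernoff distance as n grows.
print(-np.log(max(errors, 1) / trials) / n)
```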
IV. Conclusion
We have described a simple, easily remembered geometrical interpretation of the results of [1] for the multiple hypothesis testing problem in the asymptotic regime. Under this interpretation, the optimal decision rule is a nearest neighbor classifier, with “distance” measured in terms of the KL-divergence between the observation data’s type and each hypothesis distribution. That is, the optimal decision rule splits the probability simplex into M cells, and assigns each observation to the nearest centroid. The error probability under each hypothesis is determined by the geodesic distance from the centroid of its cell to the nearest cell border. The overall probability of error is determined by the smallest such distance.
Acknowledgment
The author wishes to thank Joseph A. O’Sullivan for insightful comments.
Footnotes
¹It is tempting to call these “Voronoi” cells, but strictly speaking they are not, because the KL-divergence is not a true distance metric.
References
[1] Leang C and Johnson DH, “On the asymptotics of M-hypothesis Bayesian detection,” IEEE Trans. Inf. Theory, vol. 43, no. 1, pp. 280–282, Jan. 1997.
[2] Schmid NA and O’Sullivan JA, “Performance prediction methodology for biometric systems using a large deviations approach,” IEEE Trans. Signal Process., vol. 52, no. 10, pp. 3036–3045, Oct. 2004.
[3] Cover TM and Thomas JA, Elements of Information Theory. New York: Wiley, 1991.
[4] Westover MB and O’Sullivan JA, “Achievable rates for pattern recognition,” IEEE Trans. Inf. Theory, vol. 54, no. 1, pp. 299–320, Jan. 2008.