Abstract
We present a simple geometrical interpretation for the solution to the multiple hypothesis testing problem in the asymptotic limit. Under this interpretation, the optimal decision rule is a nearest neighbor classifier on the probability simplex.
Index Terms—Geometry, hypothesis testing, large deviations, pattern recognition.
I. Introduction
In the binary hypothesis testing problem with independent and identically distributed (i.i.d.) observations, it is well known that the error probability for the optimal decision rule decays with a constant exponential rate equal to the Chernoff distance between the two hypothesis distributions. The generalization to multiple hypothesis testing for i.i.d. observations was derived by Leang and Johnson [1], and extended to observations modeled as stationary ergodic processes by Schmid and O’Sullivan [2]. In this correspondence, we focus on the i.i.d. case. For M-ary hypothesis testing, the error probability decays exponentially with a rate equal to the minimum Chernoff distance among all distinct pairs of hypothesis distributions. We describe a simple geometrical interpretation of this result, illustrated in Fig. 2. We first review the binary case.
Fig. 2. M-ary hypothesis testing.
II. Binary Hypothesis Testing
Let $\pi_i$, $i = 1, 2$, be the prior probabilities for the hypotheses $H_1$ and $H_2$, and let $Y^n = (Y_1, \ldots, Y_n)$ be an i.i.d. sequence of observations from a finite alphabet $\mathcal{Y}$. We must decide between
$$H_1 : Y^n \sim P_1$$
and
$$H_2 : Y^n \sim P_2.$$
A. Optimal Decision Rule and Error Exponent
The following results are standard (see, e.g., [3, Ch. 12]). Letting π1 = π2 = 1/2, the optimal decision rule can be expressed in terms of the Kullback–Leibler (KL)-divergence as
$$\hat{H}(Y^n) = \arg\min_{i \in \{1, 2\}} D(\hat{P}_{Y^n} \| P_i) \tag{1}$$
where $\hat{P}_{Y^n}$ is the type (empirical histogram) of $Y^n$. This decision rule partitions the probability simplex $\mathcal{P}$ over $\mathcal{Y}$ into two disjoint decision regions $A_1, A_2$. Denoting their respective complements by $A_1^c, A_2^c$, an application of Sanov’s theorem shows that the error probabilities are
$$\Pr\{A_i^c \mid H_i\} \doteq e^{-n D(P_i^* \| P_i)}, \quad i = 1, 2 \tag{2}$$
where
$$P_i^* = \arg\min_{Q \in A_i^c} D(Q \| P_i)$$
and $\doteq$ denotes “equality to first order in the exponent,” i.e., $a_n \doteq b_n$ means $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$ (see [3, p. 55]). For the total probability of error we have
$$\Pr\{\text{error}\} \doteq e^{-n C(P_1, P_2)}. \tag{3}$$
The exponent
$$C(P_1, P_2) = D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2)$$
is the Chernoff distance between $P_1$ and $P_2$ (the distribution $P_{\lambda^*}$ is defined in (5) below), usually written in the alternative but equivalent form
$$C(P_1, P_2) = -\min_{0 \le \lambda \le 1} \log \Big( \sum_{y \in \mathcal{Y}} P_1^{\lambda}(y) \, P_2^{1-\lambda}(y) \Big).$$
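The alternative form lends itself directly to numerical computation. The following minimal sketch (in Python; the two distributions are illustrative choices, not taken from this correspondence) evaluates $C(P_1, P_2)$ by minimizing over $\lambda$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_distance(p1, p2):
    # C(P1, P2) = -min_{0 <= lam <= 1} log sum_y P1(y)^lam * P2(y)^(1-lam)
    f = lambda lam: np.log(np.sum(p1**lam * p2**(1.0 - lam)))
    res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
    return -res.fun

# Illustrative distributions on a ternary alphabet.
P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])
print(chernoff_distance(P1, P2))
```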
B. Geometric Interpretation
Fig. 1 illustrates the preceding results for alphabet size $|\mathcal{Y}| = 3$, with corresponding probability simplex
$$\mathcal{P} = \big\{ p = (p_1, p_2, p_3)^T : p_i \ge 0, \; p_1 + p_2 + p_3 = 1 \big\}. \tag{4}$$
In this illustration, each probability distribution $p \in \mathcal{P}$ is mapped to a point in an equilateral triangle, with each point mass sent to one of the triangle’s vertices, using an affine transformation $r = Tp + b$; the columns of $T$, offset by $b$, are the triangle’s vertices.
Fig. 1. Binary hypothesis testing.
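A minimal sketch of such a mapping follows; the particular vertex coordinates are one conventional choice for an equilateral triangle, not necessarily those of the original figure, and the offset $b$ is taken to be zero:

```python
import numpy as np

# Rows are the triangle's vertices v1, v2, v3 (an assumed, conventional choice).
V = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.5, np.sqrt(3) / 2]])

def simplex_to_triangle(p):
    # r = T p with the vertices as the columns of T (here b = 0); each point
    # mass maps to a vertex, and every distribution to a barycentric mix.
    return np.asarray(p) @ V

print(simplex_to_triangle([1, 0, 0]))        # vertex v1
print(simplex_to_triangle([1/3, 1/3, 1/3]))  # center of the triangle
```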
Given an observation sequence Yn, the decision rule (1) can be interpreted geometrically in terms of the following algorithm:
1) Locate the observation’s type $\hat{P}_{Y^n}$ in the simplex $\mathcal{P}$;
2) Compute the KL-divergence $D(\hat{P}_{Y^n} \| P_i)$ to each hypothesis distribution $P_i$, $i = 1, 2$;
3) Assign $Y^n$ to the closest hypothesis.
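The algorithm translates directly into code; a minimal sketch, with an illustrative ternary alphabet, follows:

```python
import numpy as np

def kl(q, p):
    # D(q || p); terms with q(y) = 0 contribute zero.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def classify(yn, hypotheses, alphabet_size):
    # Nearest-neighbor rule: locate the type, then pick the closest hypothesis.
    type_ = np.bincount(yn, minlength=alphabet_size) / len(yn)
    return int(np.argmin([kl(type_, p) for p in hypotheses]))

P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])
rng = np.random.default_rng(0)
yn = rng.choice(3, size=1000, p=P1)   # observations drawn under H1
print(classify(yn, [P1, P2], 3))      # 0, i.e., decide H1
```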
The rule (1) thus divides $\mathcal{P}$ into two disjoint cells,¹ with centroids $P_1$ and $P_2$, each point being assigned to the nearest centroid, with “distance” measured in KL-divergence.
Next consider (2). Under hypothesis $H_1 : Y^n \sim P_1$, the probability of incorrectly deciding $H_2$, that is, $\Pr\{A_1^c \mid H_1\}$, depends on the distance from $P_1$ to the closest distribution outside of $A_1$. This distribution can be expressed as a point along the geodesic in $\mathcal{P}$ joining $P_1$ and $P_2$
$$P_\lambda(y) = \frac{P_1^{\lambda}(y) \, P_2^{1-\lambda}(y)}{\sum_{y' \in \mathcal{Y}} P_1^{\lambda}(y') \, P_2^{1-\lambda}(y')}, \quad \lambda \in [0, 1] \tag{5}$$
where exponentiation is defined pointwise (e.g., $P_1^{\lambda}(y) = (P_1(y))^{\lambda}$); then $P_1^* = P_{\lambda^*}$ for some $\lambda^* \in [0, 1]$. Conceptually, $\lambda^*$ can be found by setting $\lambda = 0$, then continuously increasing $\lambda$ until the point $P_\lambda$ reaches the border separating regions $A_1$ and $A_2$. Symmetric arguments apply for $\Pr\{A_2^c \mid H_2\}$, and, by symmetry, we must have $D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2)$.
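The construction can be carried out numerically; in the sketch below (illustrative distributions again), $\lambda$ is swept along the geodesic (5) until the two divergences balance, which marks the border:

```python
import numpy as np

def geodesic(p1, p2, lam):
    # P_lambda from (5): pointwise P1^lam * P2^(1-lam), renormalized.
    q = p1**lam * p2**(1.0 - lam)
    return q / q.sum()

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])

lams = np.linspace(0.0, 1.0, 10001)
gaps = [kl(geodesic(P1, P2, l), P1) - kl(geodesic(P1, P2, l), P2) for l in lams]
lam_star = lams[int(np.argmin(np.abs(gaps)))]
p_star = geodesic(P1, P2, lam_star)
# At lam_star the two divergences agree; their common value is the
# Chernoff distance C(P1, P2).
print(lam_star, kl(p_star, P1), kl(p_star, P2))
```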
Finally, interpreting (3), the overall probability of error is determined by measuring the distance along the geodesic from either centroid $P_1$ or $P_2$ to its border. The shorter of the two paths asymptotically dominates the overall probability of error.
III. M-Ary Hypothesis Testing
Now consider the M-ary hypothesis testing problem. Let $\pi_i$ be the prior probability for hypothesis $H_i$. We must decide among
$$H_i : Y^n \sim P_i, \quad i = 1, \ldots, M.$$
As above, we assume equal prior probabilities, $\pi_i = 1/M$, $i = 1, \ldots, M$.
An important special case is the following formulation of the general statistical pattern recognition problem [4]: Given a set of $M$ template patterns $x^n(1), \ldots, x^n(M)$, suppose Nature selects one of the pattern templates $w \in \{1, \ldots, M\}$ at random with probability $p(w) = 1/M$, $w = 1, \ldots, M$. The observation data $Y^n$ is generated from the template $x^n(w)$ according to a noisy observation channel. In this case, the optimal hypothesis test is a classifier, i.e., a rule that infers the pattern class $w$ underlying the observation $Y^n$.
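As a concrete and deliberately simple instance of this formulation, suppose the templates are binary strings and the observation channel is a binary symmetric channel (both choices are assumptions made here for illustration only); maximum-likelihood classification then reduces to nearest-template decoding in Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n, eps = 4, 200, 0.1                      # eps: BSC crossover probability
templates = rng.integers(0, 2, size=(M, n))  # M random binary templates

w = int(rng.integers(M))                     # Nature picks a class uniformly
yn = templates[w] ^ (rng.random(n) < eps)    # template corrupted by the channel

# For a BSC with eps < 1/2, ML classification is the nearest template
# in Hamming distance.
w_hat = int(np.argmin((templates ^ yn).sum(axis=1)))
print(w, w_hat)
```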
A. Optimal Decision Rule and Error Exponent
The natural generalizations of (1)–(3) to the M-ary case, derived by Leang and Johnson [1], are as follows. The optimal decision rule is
$$\hat{H}(Y^n) = \arg\min_{i \in \{1, \ldots, M\}} D(\hat{P}_{Y^n} \| P_i). \tag{6}$$
Denoting the optimal decision regions by $A_i$ and their complements by $A_i^c$, applying Sanov’s theorem yields, for the error probabilities under each hypothesis $H_i$, $i = 1, \ldots, M$,
$$\Pr\{A_i^c \mid H_i\} \doteq e^{-n D(P_i^* \| P_i)}, \quad P_i^* = \arg\min_{Q \in A_i^c} D(Q \| P_i). \tag{7}$$
Hence, for the total probability of error we have
$$\Pr\{\text{error}\} \doteq e^{-n C(P_w, P_{w'})} \tag{8}$$
where
$$(w, w') = \arg\min_{i \ne j} C(P_i, P_j),$$
i.e., $P_w, P_{w'}$ is the closest pair of distributions, measured in Chernoff distance.
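The exponent in (8) can be computed by a search over all distinct pairs; a minimal sketch (same chernoff_distance routine as in Section II, distributions illustrative):

```python
import itertools
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_distance(p1, p2):
    f = lambda lam: np.log(np.sum(p1**lam * p2**(1.0 - lam)))
    return -minimize_scalar(f, bounds=(0.0, 1.0), method="bounded").fun

P = [np.array([0.7, 0.2, 0.1]),
     np.array([0.1, 0.3, 0.6]),
     np.array([0.3, 0.4, 0.3])]

# Dominant exponent in (8): minimum Chernoff distance over distinct pairs.
w, w_ = min(itertools.combinations(range(len(P)), 2),
            key=lambda ij: chernoff_distance(P[ij[0]], P[ij[1]]))
print(w, w_, chernoff_distance(P[w], P[w_]))
```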
B. Geometric Interpretation
Now consider the geometric interpretation of the preceding results for M-ary hypothesis testing. Fig. 2 represents the M-ary generalization of Fig. 1.
Given an observation sequence Yn, the optimal decision rule (6) can be interpreted geometrically in terms of the following algorithm:
1) Locate the observation’s type $\hat{P}_{Y^n}$ in the simplex $\mathcal{P}$;
2) Compute the KL-divergence $D(\hat{P}_{Y^n} \| P_i)$ to each hypothesis distribution $P_i$, $i = 1, \ldots, M$;
3) Assign $Y^n$ to the closest hypothesis.
The rule (6) thus divides $\mathcal{P}$ into $M$ cells, with centroids $P_i$, $i = 1, \ldots, M$, each point being assigned to the nearest centroid.
These cells are convex: Given a cell $A_i$ with centroid $P_i$ and two points $Q_1, Q_2 \in A_i$, let $Q = \lambda Q_1 + (1 - \lambda) Q_2$, $\lambda \in [0, 1]$, let $P_j \in A_j$ be any other centroid, and define
$$\Delta(Q) = D(Q \| P_i) - D(Q \| P_j) = \sum_{y \in \mathcal{Y}} Q(y) \log \frac{P_j(y)}{P_i(y)}.$$
The condition $Q_1, Q_2 \in A_i$ is equivalent to requiring $\Delta(Q_k) < 0$, $k = 1, 2$. Hence, noting that $\Delta(Q_k)$ is linear in $Q_k$, we have
$$\Delta(Q) = \lambda \Delta(Q_1) + (1 - \lambda) \Delta(Q_2) < 0,$$
whence $Q \in A_i$, establishing the convexity of $A_i$. A similar argument shows that the borders of the decision cells are composed of straight lines: $Q_1, Q_2$ are on the boundary between $A_i, A_j$ if and only if $\Delta(Q_1) = \Delta(Q_2) = 0$; hence, $\Delta(Q) = \lambda \Delta(Q_1) + (1 - \lambda) \Delta(Q_2) = 0$.
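Because $\Delta$ is the expectation under $Q$ of the fixed log-likelihood ratio $\log(P_j/P_i)$, its linearity is immediate and easy to confirm numerically; a quick sketch with arbitrary illustrative points:

```python
import numpy as np

def delta(q, pi, pj):
    # Delta(Q) = D(Q||Pi) - D(Q||Pj) = sum_y Q(y) log(Pj(y)/Pi(y)): linear in Q.
    return float(np.sum(q * np.log(pj / pi)))

Pi = np.array([0.7, 0.2, 0.1])
Pj = np.array([0.1, 0.3, 0.6])
Q1 = np.array([0.5, 0.3, 0.2])
Q2 = np.array([0.2, 0.5, 0.3])

lam = 0.4
Q = lam * Q1 + (1 - lam) * Q2
# The two values agree up to floating-point error.
print(delta(Q, Pi, Pj))
print(lam * delta(Q1, Pi, Pj) + (1 - lam) * delta(Q2, Pi, Pj))
```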
Turning to (7), under each hypothesis $H_i$ the corresponding probability of error is again determined by the distance from the centroid $P_i$ of $A_i$ to the nearest point $P_i^*$ outside $A_i$. However, in the M-ary case, each complement $A_i^c$ consists of $M - 1$ other decision regions, and $P_i^*$ lies on the geodesic joining $P_i$ with the closest neighboring region to $A_i$. Note that, unlike the binary case, for two neighboring cells $A_i, A_j$ it is not necessarily the case that $D(P_i^* \| P_i) = D(P_j^* \| P_j)$, although this is the case when the geodesic joining $P_i$ and $P_j$ intersects their shared border.
Finally, consider (8). To determine the overall probability of error, we compare the geodesic distances from each centroid to its nearest border, $D(P_i^* \| P_i)$, $i = 1, \ldots, M$. The overall probability of error is dominated by the shortest such path, of length $C(P_w, P_{w'})$.
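This asymptotic statement can be checked empirically by simulating the nearest-neighbor rule at a moderate block length and comparing the empirical error exponent with the minimum pairwise Chernoff distance. The sketch below does exactly this (illustrative distributions; agreement is only rough at finite $n$ because of polynomial prefactors):

```python
import numpy as np

rng = np.random.default_rng(0)
P = [np.array([0.7, 0.2, 0.1]),
     np.array([0.1, 0.3, 0.6]),
     np.array([0.3, 0.4, 0.3])]
M, n, trials = len(P), 50, 20000

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

errors = 0
for _ in range(trials):
    w = int(rng.integers(M))                 # uniform prior over hypotheses
    yn = rng.choice(3, size=n, p=P[w])
    type_ = np.bincount(yn, minlength=3) / n
    if int(np.argmin([kl(type_, p) for p in P])) != w:
        errors += 1

# -log(P_err)/n should approach the minimum pairwise Chernoff distance as n grows.
print(-np.log(max(errors, 1) / trials) / n)
```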
IV. Conclusion
We have described a simple, easily remembered geometrical interpretation of the results of [1] for the multiple hypothesis testing problem in the asymptotic regime. Under this interpretation, the optimal decision rule is a nearest neighbor classifier, with “distance” measured in terms of the KL-divergence between the observation data’s type and each hypothesis distribution. That is, the optimal decision rule splits the probability simplex into M cells, and assigns each observation to the nearest centroid. The error probability under each hypothesis is determined by the geodesic distance from the centroid of its cell to the nearest cell border. The overall probability of error is determined by the smallest such distance.
Acknowledgment
The author wishes to thank Joseph A. O’Sullivan for insightful comments.
Footnotes
¹It is tempting to call these “Voronoi” cells, but strictly speaking they are not, because the KL-divergence is not a true distance metric.
References
[1] Leang C and Johnson DH, “On the asymptotics of M-hypothesis Bayesian detection,” IEEE Trans. Inf. Theory, vol. 43, no. 1, pp. 280–282, Jan. 1997.
[2] Schmid NA and O’Sullivan JA, “Performance prediction methodology for biometric systems using a large deviations approach,” IEEE Trans. Signal Process., vol. 52, no. 10, pp. 3036–3045, Oct. 2004.
[3] Cover TM and Thomas JA, Elements of Information Theory. New York: Wiley, 1991.
[4] Westover MB and O’Sullivan JA, “Achievable rates for pattern recognition,” IEEE Trans. Inf. Theory, vol. 54, no. 1, pp. 299–320, Jan. 2008.