Published in final edited form as: IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3327–3329, Jul. 2008. doi: 10.1109/TIT.2008.924656

Asymptotic Geometry of Multiple Hypothesis Testing

M Brandon Westover 1

Abstract

We present a simple geometrical interpretation for the solution to the multiple hypothesis testing problem in the asymptotic limit. Under this interpretation, the optimal decision rule is a nearest neighbor classifier on the probability simplex.

Index Terms—Geometry, hypothesis testing, large deviations, pattern recognition

I. Introduction

In the binary hypothesis testing problem with independent and identically distributed (i.i.d.) observations, it is well known that the error probability for the optimal decision rule decays exponentially at a constant rate equal to the Chernoff distance between the two hypothesis distributions. The generalization to multiple hypothesis testing with i.i.d. observations was derived by Leang and Johnson [1], and extended to observations modeled as stationary ergodic processes by Schmid and O'Sullivan [2]. In this correspondence, we focus on the i.i.d. case: for M-ary hypothesis testing, the error probability decays exponentially at a rate equal to the minimum Chernoff distance over all distinct pairs of hypothesis distributions. We describe a simple geometrical interpretation of this result, illustrated in Fig. 2. We first review the binary case.

Fig. 2. M-ary hypothesis testing.

II. Binary Hypothesis Testing

Let $\pi_i$, $i = 1, 2$, be the prior probabilities of the hypotheses $H_1$ and $H_2$, and let $Y^n = (Y_1, \ldots, Y_n)$ be an i.i.d. sequence of observations from a finite alphabet $\mathcal{Y}$. We must decide between

$$H_1 : Y^n \sim P_1$$

and

$$H_2 : Y^n \sim P_2.$$

A. Optimal Decision Rule and Error Exponent

The following results are standard (see, e.g., [3, Ch. 12]). Letting $\pi_1 = \pi_2 = 1/2$, the optimal decision rule $\hat{H} = \hat{H}(Y^n)$ can be expressed in terms of the Kullback–Leibler (KL) divergence as

$$\hat{H} = \arg\min_{i \in \{1,2\}} D(P_{Y^n} \,\|\, P_i) \tag{1}$$

where $P_{Y^n}$ is the type (empirical histogram) of $Y^n$. This decision rule partitions the probability simplex $\mathcal{P}$ over $\mathcal{Y}$ into two disjoint decision regions $A_1, A_2$. Denoting their respective complements $A_1^c = A_2$, $A_2^c = A_1$, an application of Sanov's theorem shows that the error probabilities are

$$P_1^n(A_1^c) \doteq 2^{-nD(P_1^* \| P_1)}, \qquad P_2^n(A_2^c) \doteq 2^{-nD(P_2^* \| P_2)} \tag{2}$$

where

$$P_i^* \triangleq \arg\min_{p \in A_i^c} D(p \,\|\, P_i), \qquad i = 1, 2$$

and $\doteq$ denotes "equality to first order in the exponent," i.e., $a_n \doteq b_n$ means $\lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$ (see [3, p. 55]). For the total probability of error $P_e^n$ we have

$$P_e^n = \pi_1 P_1^n(A_1^c) + \pi_2 P_2^n(A_2^c) \doteq \pi_1 2^{-nD(P_1^*\|P_1)} + \pi_2 2^{-nD(P_2^*\|P_2)} \doteq 2^{-n\min\{D(P_1^*\|P_1),\, D(P_2^*\|P_2)\}} = 2^{-nC(P_1, P_2)}. \tag{3}$$

The exponent

$$C(P_1, P_2) = \min\{D(P_1^* \| P_1),\, D(P_2^* \| P_2)\}$$

is the Chernoff distance between $P_1$ and $P_2$, usually written in the alternative but equivalent form

$$C(P_1, P_2) = -\min_{\lambda \in [0,1]} \log\Bigl(\sum_{y} P_1^{\lambda}(y)\, P_2^{1-\lambda}(y)\Bigr).$$
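To make the equivalence concrete, the following Python sketch (the ternary distributions and helper names are illustrative choices, not taken from the paper) evaluates the Chernoff distance by a grid search over $\lambda$ and checks that, at the minimizing $\lambda^*$, the normalized tilted distribution $P_1^{\lambda^*} P_2^{1-\lambda^*}/Z$ is KL-equidistant from $P_1$ and $P_2$, with common distance $C(P_1, P_2)$.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def chernoff(p1, p2, grid=10001):
    """Chernoff distance: -min over lambda in [0,1] of log2 sum_y p1^lam * p2^(1-lam)."""
    lams = np.linspace(0.0, 1.0, grid)
    vals = np.array([np.log2(np.sum(p1**l * p2**(1.0 - l))) for l in lams])
    k = int(np.argmin(vals))
    return -float(vals[k]), float(lams[k])

# Illustrative hypothesis distributions on a ternary alphabet.
P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])

C, lam_star = chernoff(P1, P2)

# Tilted distribution at lambda*: this is the common projection P1* = P2*.
tilt = P1**lam_star * P2**(1.0 - lam_star)
P_star = tilt / tilt.sum()

print(f"C(P1, P2) = {C:.4f} bits at lambda* = {lam_star:.3f}")
print(f"D(P*||P1) = {kl(P_star, P1):.4f} bits")
print(f"D(P*||P2) = {kl(P_star, P2):.4f} bits")
```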

B. Geometric Interpretation

Fig. 1 illustrates the preceding results for alphabet size $|\mathcal{Y}| = 3$, with corresponding probability simplex

$$\mathcal{P} = \{p : p_1, p_2, p_3 \ge 0,\ p_1 + p_2 + p_3 = 1\}. \tag{4}$$

In this illustration, each probability distribution $p \in \mathcal{P}$ is mapped to a point $r \in \mathbb{R}^2$ in an equilateral triangle with vertices $(0, 1)$, $(\sqrt{3}/2, -1/2)$, and $(-\sqrt{3}/2, -1/2)$, using an affine transformation $r = Tp + b$ that sends the $i$th vertex of $\mathcal{P}$ to the $i$th vertex of the triangle; equivalently, since $p_1 + p_2 + p_3 = 1$,

$$r = p_1 v_1 + p_2 v_2 + p_3 v_3$$

where $v_1, v_2, v_3$ are the triangle vertices.
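A small Python sketch of this embedding (one concrete choice of the affine map; the vertex coordinates follow the text, and `to_triangle` is an illustrative name, not from the paper):

```python
import numpy as np

# Triangle vertices from the text: images of the point masses on symbols 1, 2, 3.
V = np.array([[0.0, 1.0],
              [np.sqrt(3) / 2.0, -0.5],
              [-np.sqrt(3) / 2.0, -0.5]])

def to_triangle(p):
    """Map a distribution p = (p1, p2, p3) to 2-D coordinates in the triangle.

    Because p1 + p2 + p3 = 1, the image is the convex combination of the
    triangle vertices with weights p, i.e., r = T p with the columns of T
    equal to the vertices (for this choice the offset b is zero).
    """
    return np.asarray(p, dtype=float) @ V

print(to_triangle([1/3, 1/3, 1/3]))  # uniform distribution -> centroid (0, 0)
print(to_triangle([1, 0, 0]))        # point mass on symbol 1 -> vertex (0, 1)
```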

Fig. 1. Binary hypothesis testing.

Given an observation sequence $Y^n$, the decision rule (1) can be interpreted geometrically in terms of the following algorithm:

  • Locate the observation's type $P_{Y^n}$ in the simplex $\mathcal{P}$;

  • Compute the distance to each hypothesis distribution, $D(P_{Y^n} \| P_1)$ and $D(P_{Y^n} \| P_2)$;

  • Assign $Y^n$ to the closest hypothesis.

$\hat{H}$ thus divides $\mathcal{P}$ into two disjoint cells,^1 with centroids $P_1$ and $P_2$, each point $p \in \mathcal{P}$ being assigned to the nearest centroid, with "distance" measured by KL-divergence.
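A minimal Python sketch of this nearest-centroid rule, assuming a ternary alphabet encoded as {0, 1, 2} and illustrative hypothesis distributions (the function names are ours, not the paper's):

```python
import numpy as np

def empirical_type(y, alphabet_size):
    """Empirical histogram (type) of the observation sequence y."""
    return np.bincount(y, minlength=alphabet_size) / len(y)

def kl(p, q):
    """D(p || q) in bits; terms with p[y] = 0 contribute zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def nearest_centroid(y, hypotheses):
    """Decision rule (1): assign y to the hypothesis KL-closest to its type."""
    t = empirical_type(y, len(hypotheses[0]))
    divs = [kl(t, P) for P in hypotheses]
    return int(np.argmin(divs)), divs

# Illustrative binary test on a ternary alphabet.
P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])

rng = np.random.default_rng(0)
y = rng.choice(3, size=200, p=P1)          # data actually drawn from P1
i_hat, divs = nearest_centroid(y, [P1, P2])
print(f"decided H{i_hat + 1}: D(type||P1) = {divs[0]:.3f}, D(type||P2) = {divs[1]:.3f}")
```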

Next consider (2). Under hypothesis $H_1 : Y^n \sim P_1$, the probability of incorrectly deciding $Y^n \sim P_2$, that is, $P_1^n(A_1^c)$, depends on the distance from $P_1$ to the closest distribution outside of $A_1$. This distribution $P_1^*$ can be expressed as a point along the geodesic in $\mathcal{P}$ joining $P_1$ and $P_2$

$$p(\lambda) = \frac{1}{Z(\lambda)}\, P_1^{1-\lambda} P_2^{\lambda}, \qquad \lambda \in [0, 1] \tag{5}$$

where $Z(\lambda)$ normalizes $p(\lambda)$ to sum to one and exponentiation is defined pointwise (e.g., if $|\mathcal{Y}| = 3$, then $p^\lambda = (p_1^\lambda, p_2^\lambda, p_3^\lambda)$). Conceptually, the value $\lambda^*$ for which $p(\lambda^*) = P_1^*$ can be found by setting $\lambda = 0$ and then continuously increasing $\lambda$ until the point $p(\lambda)$ reaches the border separating regions $A_1$ and $A_2$. An analogous argument applies to $P_2^*$, and by symmetry we must have $P_1^* = P_2^*$.
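The following sketch traces the geodesic (5) numerically and locates $\lambda^*$ as the point where $p(\lambda)$ becomes KL-equidistant from $P_1$ and $P_2$, i.e., where it meets the border between $A_1$ and $A_2$ (the distributions are the same illustrative pair used above):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def geodesic_point(P1, P2, lam):
    """p(lambda) = P1^(1-lambda) * P2^lambda / Z(lambda), as in (5)."""
    tilt = P1**(1.0 - lam) * P2**lam
    return tilt / tilt.sum()

P1 = np.array([0.7, 0.2, 0.1])
P2 = np.array([0.1, 0.3, 0.6])

# Sweep lambda from 0 to 1 and find where D(p||P1) - D(p||P2) changes sign:
# that crossing point is the boundary distribution P1* (= P2*).
lams = np.linspace(0.0, 1.0, 10001)
gap = np.array([kl(geodesic_point(P1, P2, l), P1) -
                kl(geodesic_point(P1, P2, l), P2) for l in lams])
lam_star = lams[int(np.argmin(np.abs(gap)))]
P_star = geodesic_point(P1, P2, lam_star)

print(f"lambda* ~ {lam_star:.3f}")
print(f"D(P*||P1) ~ {kl(P_star, P1):.4f} bits, D(P*||P2) ~ {kl(P_star, P2):.4f} bits")
```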

Finally, interpreting (3), the overall probability of error $P_e^n$ is determined by measuring the distance along the geodesic from either centroid $P_1$ or $P_2$ to its border. The shorter of the two paths asymptotically dominates the overall probability of error.

III. M-Ary Hypothesis Testing

Now consider the M-ary hypothesis testing problem. Let $\pi_i$ be the prior probability of hypothesis $H_i$. We must decide among

$$H_i : Y^n \sim P_i, \qquad i = 1, \ldots, M.$$

As above, we assume equal prior probabilities, $\pi_i = 1/M$, $i = 1, \ldots, M$.

An important special case is the following formulation of the general statistical pattern recognition problem [4]: Given a set of $M$ template patterns $\mathcal{C} = \{x^n(1), \ldots, x^n(M)\}$, suppose Nature selects one of the templates $w \in \{1, \ldots, M\}$ at random with probability $p(w) = 1/M$, $w = 1, \ldots, M$. The observation data $Y^n$ are generated from the template $x^n(w)$ according to $P_w = p(y^n \mid x^n(w)) = \prod_{j=1}^{n} p(y_j \mid x_j(w))$. In this case, the optimal hypothesis test $\hat{w}$ is a classifier, i.e., a rule that infers the pattern class underlying the observation $Y^n$.
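As an illustration of this special case, the sketch below assumes a hypothetical memoryless observation model $p(y \mid x)$ (the channel, the binary templates, and all names are invented for the example) and classifies by maximizing the product likelihood, which under uniform priors is the optimal test $\hat{w}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical memoryless observation model p(y | x): binary inputs, ternary outputs.
CHANNEL = {0: np.array([0.80, 0.15, 0.05]),
           1: np.array([0.10, 0.20, 0.70])}

def draw_observation(template):
    """Generate Y^n from template x^n(w): Y_j ~ p(. | x_j(w)), independently."""
    return np.array([rng.choice(3, p=CHANNEL[int(x)]) for x in template])

def log_likelihood(y, template):
    """log P_w = sum_j log p(y_j | x_j(w))."""
    return float(sum(np.log(CHANNEL[int(x)][int(obs)]) for x, obs in zip(template, y)))

# M = 3 hypothetical binary templates of length n = 50; Nature picks one uniformly.
templates = [rng.integers(0, 2, size=50) for _ in range(3)]
w = int(rng.integers(3))
y = draw_observation(templates[w])

w_hat = int(np.argmax([log_likelihood(y, t) for t in templates]))
print(f"true pattern class w = {w + 1}, decision w_hat = {w_hat + 1}")
```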

A. Optimal Decision Rule and Error Exponent

The natural generalizations of (1)–(3) to the M-ary case, derived by Leang and Johnson [1], are as follows. The optimal decision rule is

$$\hat{H} = \arg\min_{i \in \{1, \ldots, M\}} D(P_{Y^n} \,\|\, P_i). \tag{6}$$

Denoting the optimal decision regions by $A_i$ and their complements by $A_i^c$, applying Sanov's theorem yields, for the error probabilities under each hypothesis $H_i$, $i = 1, \ldots, M$,

$$P_i^n(A_i^c) \doteq 2^{-nD(P_i^* \| P_i)}, \qquad P_i^* \triangleq \arg\min_{p \in A_i^c} D(p \,\|\, P_i). \tag{7}$$

Hence, for the total probability of error $P_e^n$ we have

$$P_e^n = \sum_{i=1}^{M} \pi_i P_i^n(A_i^c) \doteq \sum_{i=1}^{M} \pi_i 2^{-nD(P_i^*\|P_i)} \doteq 2^{-n\min_i D(P_i^*\|P_i)} = 2^{-nC(P_w, P_{w'})} \tag{8}$$

where

$$(w, w') = \arg\min_{i \ne j} C(P_i, P_j)$$

i.e., $P_w, P_{w'}$ is the closest pair of distributions, as measured by Chernoff distance.
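A short Python sketch (the hypothesis set and names are illustrative) that identifies the closest pair $(w, w')$ by computing all pairwise Chernoff distances with a grid search over $\lambda$:

```python
import numpy as np
from itertools import combinations

def chernoff(p, q, grid=2001):
    """Chernoff distance -min_lambda log2 sum_y p^lambda * q^(1-lambda), in bits."""
    lams = np.linspace(0.0, 1.0, grid)
    return -min(float(np.log2(np.sum(p**l * q**(1.0 - l)))) for l in lams)

# Illustrative set of M = 4 hypothesis distributions on a ternary alphabet.
hypotheses = [np.array([0.7, 0.2, 0.1]),
              np.array([0.1, 0.3, 0.6]),
              np.array([0.3, 0.4, 0.3]),
              np.array([0.2, 0.2, 0.6])]

# The error exponent in (8) is the smallest pairwise Chernoff distance.
pairwise = {(i, j): chernoff(hypotheses[i], hypotheses[j])
            for i, j in combinations(range(len(hypotheses)), 2)}
(w, w_prime), exponent = min(pairwise.items(), key=lambda item: item[1])
print(f"closest pair: (P{w + 1}, P{w_prime + 1}) with C = {exponent:.4f} bits")
```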

B. Geometric Interpretation

Now consider the geometric interpretation of the preceding results for M-ary hypothesis testing. Fig. 2 represents the M-ary generalization of Fig. 1.

Given an observation sequence $Y^n$, the optimal decision rule (6) can be interpreted geometrically in terms of the following algorithm:

  • Locate the observation's type $P_{Y^n}$ in the simplex $\mathcal{P}$;

  • Compute the distance to each hypothesis distribution, $D(P_{Y^n} \| P_i)$, $i = 1, \ldots, M$;

  • Assign $Y^n$ to the closest hypothesis.

The rule $\hat{H}$ thus divides $\mathcal{P}$ into $M$ cells with centroids $P_i$, $i = 1, \ldots, M$, each point $p \in \mathcal{P}$ being assigned to the nearest centroid.

These cells are convex: Given a cell $A_i$ with centroid $P_i$ and two points $Q_1, Q_2 \in A_i$, let $Q = \lambda Q_1 + (1 - \lambda) Q_2$, $\lambda \in [0, 1]$, let $P_j \in A_j$ be any other centroid, and define

$$\Delta(Q_k) \triangleq D(Q_k \| P_i) - D(Q_k \| P_j) = E_{Q_k}\!\left[\log \frac{P_j}{P_i}\right], \qquad k = 1, 2.$$

The condition $Q_1, Q_2 \in A_i$ is equivalent to requiring $\Delta(Q_k) < 0$, $k = 1, 2$. Hence, noting that $\Delta(Q_k)$ is linear in $Q_k$, we have

$$\Delta(Q) = \lambda \Delta(Q_1) + (1 - \lambda) \Delta(Q_2) < 0$$

whence $Q \in A_i$, establishing the convexity of $A_i$. A similar argument shows that the borders between decision cells are straight lines: $Q_1, Q_2$ lie on the boundary between $A_i$ and $A_j$ if and only if $\Delta(Q_1) = \Delta(Q_2) = 0$; hence $\Delta(Q) = \lambda \Delta(Q_1) + (1 - \lambda) \Delta(Q_2) = 0$.
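A quick numeric check of this linearity, with two illustrative centroids and two points drawn at random from the simplex:

```python
import numpy as np

Pi = np.array([0.7, 0.2, 0.1])   # centroid of cell A_i (illustrative)
Pj = np.array([0.1, 0.3, 0.6])   # another centroid (illustrative)

def delta(q):
    """Delta(Q) = D(Q||Pi) - D(Q||Pj) = E_Q[log(Pj/Pi)]: linear in Q."""
    return float(np.sum(q * np.log2(Pj / Pi)))

rng = np.random.default_rng(2)
Q1, Q2 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
lam = 0.37
Q = lam * Q1 + (1.0 - lam) * Q2

# Linearity: Delta of the mixture equals the mixture of the Deltas.
print(delta(Q), lam * delta(Q1) + (1.0 - lam) * delta(Q2))
```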

Turning to (7), under each hypothesis $H_i$ the corresponding probability of error $P_i^n(A_i^c)$ is again determined by the distance from the centroid $P_i$ of $A_i$ to the nearest point outside $A_i$. However, in the M-ary case each complement $A_i^c$ consists of the $M - 1$ other decision regions, and $P_i^*$ lies on the geodesic joining $P_i$ with the neighboring region closest to $A_i$. Note that, unlike the binary case, for two neighboring cells $A_i$, $A_j$ it is not necessarily the case that $P_i^* = P_j^*$, although this does hold when the geodesic joining $P_i$ and $P_j$ intersects their shared border.

Finally, consider (8). To determine the overall probability of error $P_e^n$, we compare the geodesic distances from each centroid to its nearest border, $D(P_i^* \| P_i)$. The overall probability of error is dominated by the shortest such path, of length $D(P_w^* \| P_w) = C(P_w, P_{w'})$.
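To see the exponent emerge, the following Monte Carlo sketch (illustrative distributions and a moderate $n$, so only rough agreement should be expected because of subexponential factors) simulates the minimum-KL rule and compares the empirical error exponent with the smallest pairwise Chernoff distance:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Illustrative M = 3 hypothesis distributions on a ternary alphabet.
hypotheses = [np.array([0.7, 0.2, 0.1]),
              np.array([0.1, 0.3, 0.6]),
              np.array([0.3, 0.4, 0.3])]
M, K = len(hypotheses), 3

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def chernoff(p, q, grid=1001):
    lams = np.linspace(0.0, 1.0, grid)
    return -min(float(np.log2(np.sum(p**l * q**(1.0 - l)))) for l in lams)

def classify(y):
    """Decision rule (6): nearest centroid in KL-divergence from the type of y."""
    t = np.bincount(y, minlength=K) / len(y)
    return int(np.argmin([kl(t, P) for P in hypotheses]))

n, trials = 50, 10000
errors = 0
for _ in range(trials):
    w = int(rng.integers(M))                     # uniform prior over hypotheses
    y = rng.choice(K, size=n, p=hypotheses[w])
    errors += classify(y) != w

Pe = max(errors / trials, 1.0 / trials)          # avoid log(0) if no errors occur
exponent = min(chernoff(p, q) for p, q in combinations(hypotheses, 2))
print(f"empirical exponent  -1/n log2 Pe ~ {-np.log2(Pe) / n:.4f}")
print(f"asymptotic exponent min C(Pi,Pj) = {exponent:.4f}")
```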

IV. Conclusion

We have described a simple, easily remembered geometrical interpretation of the results of [1] for the multiple hypothesis testing problem in the asymptotic regime. Under this interpretation, the optimal decision rule is a nearest neighbor classifier, with “distance” measured in terms of the KL-divergence between the observation data’s type and each hypothesis distribution. That is, the optimal decision rule splits the probability simplex into M cells, and assigns each observation to the nearest centroid. The error probability under each hypothesis is determined by the geodesic distance from the centroid of its cell to the nearest cell border. The overall probability of error is determined by the smallest such distance.

Acknowledgment

The author wishes to thank Joseph A. O’Sullivan for insightful comments.

Footnotes

1. It is tempting to call these “Voronoi” cells, but strictly speaking they are not, because the KL-divergence is not a true distance metric.

References

  • [1] C. Leang and D. H. Johnson, “On the asymptotics of M-hypothesis Bayesian detection,” IEEE Trans. Inf. Theory, vol. 43, no. 1, pp. 280–282, Jan. 1997.
  • [2] N. A. Schmid and J. A. O’Sullivan, “Performance prediction methodology for biometric systems using a large deviations approach,” IEEE Trans. Signal Process., vol. 52, no. 10, pp. 3036–3045, Oct. 2004.
  • [3] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
  • [4] M. B. Westover and J. A. O’Sullivan, “Achievable rates for pattern recognition,” IEEE Trans. Inf. Theory, vol. 54, pp. 299–320, Jan. 2008.
