Abstract
The likelihood ratio, or ideal observer, decision rule is known to be optimal for two-class classification tasks in the sense that it maximizes expected utility (or, equivalently, minimizes the Bayes risk). Furthermore, using this decision rule yields a receiver operating characteristic (ROC) curve which is never above the ROC curve produced using any other decision rule, provided the observer’s misclassification rate with respect to one of the two classes is chosen as the dependent variable for the curve (i.e., an “inversion” of the more common formulation in which the observer’s true-positive fraction is plotted against its false-positive fraction). It is also known that for a decision task requiring classification of observations into N classes, optimal performance in the expected utility sense is obtained using a set of N − 1 likelihood ratios as decision variables. In the N-class extension of ROC analysis, the ideal observer performance is describable in terms of an (N² − N − 1)-parameter hypersurface in an (N² − N)-dimensional probability space. We show that the result for two classes holds in this case as well, namely that the ROC hypersurface obtained using the ideal observer decision rule is never above the ROC hypersurface obtained using any other decision rule (where in our formulation performance is given exclusively with respect to between-class error rates rather than within-class sensitivities).
Index Terms: ROC analysis, ideal observer, N-class classification
I. Introduction
In a two-class classification task, observations x⃗ are randomly drawn from a distribution of “signals” (or “positive” or “abnormal” observations) and a distribution of “noise” (or “negative” or “normal” observations). The probability density functions (pdfs) are px⃗(x⃗| πs) for the signal observations and px⃗(x⃗|πn) for the noise observations. (We use boldface type to denote statistically variable quantities, and arrows above quantities to denote them as vectors.) It is well known that using the likelihood ratio, defined as
LR(x⃗) ≡ px⃗(x⃗ | πs) / px⃗(x⃗ | πn)        (1)
as the decision variable yields the optimal classification performance achievable in an ideal observer sense [1], [2]. In particular, the ideal observer is obtained when one requires that expected utility be maximized (or equivalently, that Bayes risk be minimized). Here LR is to be interpreted as a function of the underlying random variable x⃗ describing the data. We denote the pdf of LR as pLR(LR), distinct from px⃗(x⃗) above. The receiver operating characteristic (ROC) curve for this decision variable is defined as the locus of all sensitivity and specificity pairs, or equivalently as the parametric curve (FPF(LRc), TPF(LRc)), where
TPF(LRc) = P(LR > LRc | πs) = ∫_{LRc}^{∞} pLR(LR | πs) dLR        (2)
is the true-positive fraction (TPF), or sensitivity; and where
FPF(LRc) = P(LR > LRc | πn) = ∫_{LRc}^{∞} pLR(LR | πn) dLR        (3)
is the false-positive fraction (FPF), or one minus the specificity. Here, LRc is a threshold applied to the decision variable LR.
It is clear from this standard formulation that any monotonic transformation of the decision variable LR will yield the same ROC curve [2]. The decision variable LR, or any monotonic transformation of this decision variable, will produce an ROC curve which is optimal in the sense that at any given FPF, no ROC curve produced using a different decision variable can have a TPF higher than that of the ideal observer. This optimality can be demonstrated in two ways: by showing that the observer which satisfies the Neyman–Pearson criterion (maximizing TPF at any given FPF) is in fact the ideal observer [1]; or by showing that maximizing expected utility results in an ROC curve with this property [3].
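As a numerical illustration of this optimality property (not part of the original analysis), the following sketch compares the likelihood-ratio observer with a deliberately suboptimal decision variable for a pair of invented unit-variance Gaussian class distributions; the class means, sample sizes, and the comparison variable |x| are all hypothetical choices made for this example.

```python
import math
import random

random.seed(0)

# Hypothetical observation model: scalar signal observations ~ N(1, 1),
# noise observations ~ N(0, 1).
def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood_ratio(x):
    # LR(x) = p(x | signal) / p(x | noise), as in (1)
    return normal_pdf(x, 1.0) / normal_pdf(x, 0.0)

signal = [random.gauss(1.0, 1.0) for _ in range(20000)]
noise = [random.gauss(0.0, 1.0) for _ in range(20000)]

def tpf_at_fpf(decision_var, target_fpf):
    # choose the threshold as the (1 - FPF) quantile of the decision
    # variable over the noise sample, then evaluate TPF on the signal sample
    scores = sorted(decision_var(x) for x in noise)
    thr = scores[int((1.0 - target_fpf) * len(scores))]
    return sum(decision_var(x) > thr for x in signal) / len(signal)

# At matched FPF, the likelihood-ratio observer's TPF is at least that of
# the deliberately suboptimal decision variable |x| (which is not a
# monotonic transformation of LR for these class means).
results = {fpf: (tpf_at_fpf(likelihood_ratio, fpf), tpf_at_fpf(abs, fpf))
           for fpf in (0.1, 0.3, 0.5)}
for ideal_tpf, subopt_tpf in results.values():
    assert ideal_tpf >= subopt_tpf
```

With these (invented) distributions the margin is substantial at every operating point, in keeping with the Neyman–Pearson property discussed above.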
Note that we can just as well specify the observer’s performance using the parametric pair (FPF(LRc), FNF(LRc)) where
FNF(LRc) = P(LR ≤ LRc | πs) = 1 − TPF(LRc)        (4)
is the false-negative fraction (FNF), or one minus the sensitivity. In this equivalent formulation, the ideal observer decision variable will produce an ROC curve which is optimal in the sense that at any given FPF, no ROC curve produced using a different decision variable (i.e., one which is not a monotonic transformation of LR) can have a FNF lower than that of the ideal observer. Similarly, the Neyman-Pearson criterion is now satisfied by minimizing FNF at any given FPF. This is the formulation which we will adopt throughout this paper; despite being somewhat unusual for the two-class classification case, it has certain notational advantages when considering more than two classes.
The ideal observer can be extended to the case of N classes [1]. Since the dimensionalities of the probability spaces and of the decision variable vectors involved increase rapidly with the number of classes, we will review this extension in detail for the N-class classification problem in the next section. In the remaining sections, we show that the result for two-class classification holds in the N-class extension of ROC analysis as well; namely, that an ROC hypersurface obtained by using the ideal observer decision rule is never above the ROC hypersurface obtained using any other decision rule. Here “above” has the simple interpretation that a function f(x⃗) is above g(x⃗) at a particular point x⃗0 if f(x⃗0) > g(x⃗0).
II. Maximization of Expected Utility
We seek to classify observations x⃗ as coming from one of N classes, which we label πi where i varies from 1 to N. (Informally, class πN may be thought of as representing “normal” cases, while the other N − 1 classes represent various types of “abnormality,” as in, for example, a medical diagnostic task. The actual properties of the classes are, however, irrelevant to the analysis here.) For any observation x⃗, we may define the actual class (the “truth”) to which it belongs as t, and the class to which it is assigned (the “decision”) as d, where t and d can take on any of the values π1, …, πi, …, πN, the labels of the various classes. A given observation x⃗ thus arises from one of N conditional pdfs px⃗(x⃗ | t = πi), and is classified according to a set of N rules of the form
decide d = πi if x⃗ ∈ Zi,  i = 1, …, N        (5)
where the sets Zi partition the domain of x⃗. In effect, the sets Zi define a classifier.
The performance of this classifier is described by N² conditional probabilities of the form P(d = πi | t = πj), which represent the various probabilities of classifying an observation as belonging to class πi given that it is actually drawn from the distribution of class πj. From the definition of conditional probability [4]
Σ_{i=1}^{N} P(d = πi | t = πj) = 1,  j = 1, …, N        (6)
meaning that only N² − N of these N² probabilities are needed to completely describe the behavior of a particular classifier. (This generalizes the familiar fact that, for two classes, we do not need to consider explicitly the true negative fraction (TNF), or the false negative fraction (FNF), since TNF = 1 − FPF and FNF = 1 − TPF.)
Some researchers have suggested that in, e.g., a three-class classification task [5], [6], the set of three “sensitivities” (P(d = πi | t = πi) in our notation) provides a complete description of observer performance. This is incorrect in general, because it ignores the N² − N misclassification probabilities, not all of which are determined uniquely by the “sensitivities” when N > 2 unless particular restrictions are imposed on the observer’s behavior. Complete quantification of the trade-offs available among the probabilities of various kinds of misclassification error is important in medical diagnosis, where different misclassification errors often have substantially different clinical consequences. Moreover, restrictions concerning the observer’s behavior are inappropriate when considering ideal observers, human observers, or automated observers (such as automated schemes for computer-aided diagnosis) designed to approximate ideal or human observer behavior.
Hypothetically, a “perfect” classifier would have P(d = πi | t = πj) = δij; that is, data drawn from class πj would always be assigned to class πj, and never assigned to any class πi for i ≠ j. Since the data x⃗ are random, and since the pdfs of the data will overlap in any nontrivial classification task, this is not possible even in principle. We therefore seek a classifier which performs optimally given the random nature of the data. Ideal observer decision theory requires that a decision alternative be selected only if its expected utility is greater than the expected utility of any other possible decision. That is, for a given observation x⃗
decide d = πi only if  E{Uπi(x⃗, t) | x⃗} ≥ E{Uπk(x⃗, t) | x⃗}  for all k ≠ i,  i = 1, …, N        (7)–(10)
where the random variable Uπi (x⃗, t) is the utility of assigning an observation x⃗, actually drawn from class t, to class πi [1]. (Since we are free to define the “utility” as the negative of the Bayes risk, it should be clear that maximizing the expected utility is equivalent to minimizing the Bayes risk.) Note that the choice of which decision to make when the expected utilities of a given pair of decisions are equal (informally, where “equality” symbols appear in the above relations) is arbitrary, since in this case the overall expected utility is unchanged regardless of which decision is made.
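The decision rule in (7) through (10) can be sketched directly in code: given the posterior class probabilities for a single observation, choose the class with the largest conditional expected utility. The utility matrices below (U[i][j] being the utility of deciding class i when the truth is class j) are invented for illustration.

```python
# A minimal sketch of the expected-utility decision rule of (7)-(10),
# assuming the posterior probabilities P(t = pi_k | x) are already known.
def ideal_decision(posteriors, U):
    n = len(posteriors)
    expected = [sum(U[i][j] * posteriors[j] for j in range(n))
                for i in range(n)]
    return max(range(n), key=lambda i: expected[i])

# With 0/1 utilities (correct decisions worth 1, errors worth 0) the
# rule reduces to choosing the most probable class.
U01 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
assert ideal_decision([0.2, 0.5, 0.3], U01) == 1

# Penalizing one kind of error heavily shifts the decision: here, any
# misclassification of a class-2 case is made very costly, so the same
# posteriors now yield a different decision.
U_skewed = [[1.0, 0.0, -10.0],
            [0.0, 1.0, -10.0],
            [0.0, 0.0, 1.0]]
assert ideal_decision([0.2, 0.5, 0.3], U_skewed) == 2
```

This also illustrates the remark above about ties: when two expected utilities are exactly equal, either decision leaves the overall expected utility unchanged (here, `max` simply keeps the lower index).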
For a particular observation x⃗ actually drawn from class πj, the utility Uπi(x⃗, t) is just a number, which we denote Ui | j. The expectation values in (7) through (10) can then be evaluated, yielding decision rules of the form
decide d = πi only if  Σ_{k=1}^{N} U_{i|k} P(t = πk | x⃗) ≥ Σ_{k=1}^{N} U_{j|k} P(t = πk | x⃗)  for all j ≠ i        (11)
We apply Bayes’s rule to obtain
P(t = πk | x⃗) = px⃗(x⃗ | t = πk) P(t = πk) / px⃗(x⃗)        (12)
and then multiply the resulting equations by px⃗(x⃗), which is independent of the summation index k and nonnegative for any feasible observation. The decision rules are now
decide d = πi only if  Σ_{k=1}^{N} U_{i|k} px⃗(x⃗ | t = πk) P(t = πk) ≥ Σ_{k=1}^{N} U_{j|k} px⃗(x⃗ | t = πk) P(t = πk)  for all j ≠ i        (13)
By defining N − 1 likelihood ratios as
LRi(x⃗) ≡ px⃗(x⃗ | t = πi) / px⃗(x⃗ | t = πN)        (14)
for i < N, we can divide the decision rules in (13) by px⃗(x⃗|t = πN) and rearrange terms to obtain
decide d = πi (i < N) only if  Σ_{k=1}^{N−1} (U_{i|k} − U_{j|k}) P(t = πk) LRk ≥ (U_{j|N} − U_{i|N}) P(t = πN)  for all j ≠ i        (15)
decide d = πN only if  Σ_{k=1}^{N−1} (U_{N|k} − U_{j|k}) P(t = πk) LRk ≥ (U_{j|N} − U_{N|N}) P(t = πN)  for all j ≠ N        (16)
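A small numeric check (with invented Gaussian class models, priors, and utilities) that the likelihood-ratio form of the decision rule is equivalent to the rule in (13): dividing every score by the positive quantity px⃗(x⃗ | t = πN) cannot change which class attains the maximum, so only the N − 1 likelihood ratios of (14) are needed.

```python
import math

# Sketch for N = 3 classes of scalar, unit-variance Gaussian data.
# The means, priors, and utilities are all invented for illustration.
def pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

mus = [0.0, 1.5, 3.0]          # hypothetical class-conditional means
priors = [0.3, 0.3, 0.4]
U = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

def decide_direct(x):
    # maximize sum_k U[i][k] p(x | pi_k) P(pi_k), as in (13)
    return max(range(3), key=lambda i: sum(
        U[i][k] * pdf(x, mus[k]) * priors[k] for k in range(3)))

def decide_lr(x):
    # the same scores divided by p(x | pi_3): only LR_1 and LR_2 remain,
    # and the argmax is unchanged
    lr = [pdf(x, mus[k]) / pdf(x, mus[2]) for k in range(2)]
    return max(range(3), key=lambda i: sum(
        U[i][k] * priors[k] * lr[k] for k in range(2)) + U[i][2] * priors[2])

assert all(decide_direct(x) == decide_lr(x)
           for x in [-1.0, 0.2, 0.8, 1.4, 2.0, 2.6, 3.5])
```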
Expression (15), with equality holding, defines a set of N(N − 1)/2 hyperplanes which partition the decision variable space, i.e., the domain of the vector LR⃗ whose components are defined by (14), into N subsets. (A portion of each hyperplane determines a boundary between two classes.) The partitioning is determined by the parameters
γijk ≡ (U_{i|k} − U_{j|k}) P(t = πk)        (17)
with i, j, and k varying from 1 to N, and j ≠ i. These parameters are not independent, however, because
γijk = γ1jk − γ1ik        (18)
For a given k, all the other parameters may be derived from γ1jk, with j varying from 2 to N. This leaves a total of N² − N parameters over all j and k.
A further constraint may be obtained by noting that the hyperplanes represented by the N − 1 equations
Σ_{k=1}^{N−1} γ1jk LRk + γ1jN = 0,  j = 2, …, N        (19)
are unchanged if we multiply all of these equations by a single scalar, such as 1/|γ1NN|. This reduces the total number of parameters which determine the partitioning of the decision variable space to N² − N − 1. Note that we cannot multiply different equations by different scalars, as this would change the relative orientations of the remaining decision boundary hyperplanes derived via (18).
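The bookkeeping above can be tabulated for small N; a trivial computation, but it makes the rapid growth of the problem explicit.

```python
# Independent misclassification probabilities (N**2 - N) and free
# ROC-hypersurface parameters (N**2 - N - 1) for small numbers of classes.
counts = {N: (N * N - N, N * N - N - 1) for N in (2, 3, 4, 5)}
assert counts[2] == (2, 1)   # the familiar ROC curve: one threshold LR_c
assert counts[3] == (6, 5)   # a 5-parameter hypersurface in 6 dimensions
```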
Thus, an N-class ideal observer is described by an (N² − N − 1)-parameter ROC hypersurface in an (N² − N)-dimensional probability space. These probabilities will be given by expressions of the form
P(d = πi | t = πj) = ∫_{Zi} pLR⃗(LR⃗ | t = πj) dLR⃗        (20)
where the region is the set of all LR⃗ such that d = πi is decided, as defined in (15) and (16).
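The probabilities of the form (20) rarely have closed-form expressions, but they can be estimated by simulation. The sketch below does this for a hypothetical three-class scalar Gaussian example (the means, priors, and utilities are invented for illustration):

```python
import math
import random

random.seed(1)

mus = [0.0, 1.5, 3.0]          # hypothetical class-conditional means
priors = [0.3, 0.3, 0.4]
U = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

def pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def decide(x):
    # ideal-observer rule: maximize sum_k U[i][k] p(x | pi_k) P(pi_k)
    return max(range(3), key=lambda i: sum(
        U[i][k] * pdf(x, mus[k]) * priors[k] for k in range(3)))

# Monte Carlo estimate of P(d = pi_i | t = pi_j): draw from each class
# in turn and tally the decisions.
n = 20000
P = [[0.0] * 3 for _ in range(3)]
for j in range(3):
    for _ in range(n):
        P[decide(random.gauss(mus[j], 1.0))][j] += 1.0 / n

# each column of P sums to one, the constraint expressed in (6)
for j in range(3):
    assert abs(sum(P[i][j] for i in range(3)) - 1.0) < 1e-9
```

Sweeping the utilities (or, equivalently, the N² − N − 1 free parameters) and repeating this estimate traces out points on the ROC hypersurface.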
III. The Neyman-Pearson Criterion
In the preceding section, we showed that when classifying observations drawn from N classes, the ideal observer, which maximizes expected utility, uses N − 1 likelihood ratio decision variables to form its decision rule as stated in (15) and (16). However, this does not directly tell us anything about the actual performance of the ideal observer in terms of its correct and incorrect classification rates, i.e., the probabilities in (20). We wish first to generalize concepts familiar from two-class ROC analysis [7] to the N-class case under consideration, which will enable us to extend the concept of the Neyman-Pearson observer to N classes as well.
The performance of any observer which classifies observations into N classes is determined by the N² probabilities P(d = πi | t = πj) {1 ≤ i ≤ N, 1 ≤ j ≤ N}, where d = πi indicates a decision that the observation belongs to class i and t = πj indicates that the observation is actually drawn from class j. As shown above in (6), only N² − N of these probabilities are necessary to completely describe the classifier’s performance. Without loss of generality, we will eliminate the N probabilities P(d = π1 | t = π1), …, P(d = πi | t = πi), …, P(d = πN | t = πN), i.e., the within-class sensitivities. (In the two-class case, with class π2 representing “negative” observations, this would be equivalent to using FNF and FPF, or one minus sensitivity and one minus specificity, to specify an observer’s performance.)
The N² − N probabilities are functions of a set of parameters defined by the observer’s decision rule. Given these functions, the observer’s performance is given by an ROC hypersurface in which one of the probabilities is an implicit function of the other N² − N − 1 probabilities; without loss of generality, we take P(d = πN | t = π1) (an analogue of the FNF in the two-class case) to be the dependent variable defining our ROC hypersurface.
In the preceding section, it was shown that the ideal observer’s performance depends on N² − N − 1 parameters as well as the N conditional pdfs of x⃗. It may be that a particular observer (not necessarily ideal) uses a decision rule that depends on more than N² − N − 1 parameters; note that in this case, however, we may define a “simpler” observer whose performance is always given by the set of aggregated parameters which minimize P(d = πN | t = π1) at a given set of fixed values of the other N² − N − 1 probabilities. The performance of this simplified observer is thus effectively determined by only N² − N − 1 parameters. Furthermore, in this work we will not consider observers which use discrete decision variables, or which use decision rules dependent on fewer than N² − N − 1 parameters. Thus, in what follows, we explicitly assume that the performance of any observer under consideration is continuous and dependent on exactly N² − N − 1 free parameters.
By definition, the Neyman-Pearson observer is that which minimizes the dependent variable (in our case P(d = πN | t = π1)) at a fixed set of values of the other probabilities; this condition is known as the Neyman-Pearson criterion. Note that the Neyman-Pearson criterion for two classes is typically stated as the minimization of, e.g., P(d = π2 | t = π1) for values of P(d = π1 | t = π2) less than or equal to a given fixed value. However, the inequality in this formulation is relevant only to the case of discrete decision variables [1] for which arbitrary values of P(d = π1 | t = π2) may not in fact be achievable; as stated in the preceding paragraph, we are not considering observers which use discrete decision variables, and thus we will adopt the simplified version of the Neyman-Pearson criterion appropriate for continuous decision variables.
In order to optimize classification performance in a Neyman-Pearson sense, we wish to minimize P(d = πN | t = π1) at a particular point in the domain of the probability space, i.e.,
minimize P(d = πN | t = π1) subject to P(d = πi | t = πj) = αij        (21)
where 1 ≤ i ≤ N and 1 ≤ j ≤ N, and where the term for i = N, j = 1 is excluded. The αij are fixed but arbitrary, apart from the obvious constraints 0 ≤ αij ≤ 1 and 0 ≤ Σi≠j αij ≤ 1. We construct the function
F = P(d = πN | t = π1) + Σ_{i,j} λij [P(d = πi | t = πj) − αij]        (22)
where the case i = N, j = 1 is excluded from the sum. When the constraints P(d = πi | t = πj) = αij are all satisfied, minimizing F is equivalent to minimizing P(d = πN | t = π1). This is achieved, using the method of Lagrange multipliers [1], by first minimizing F and then finding values of λij which satisfy the constraints.
Rearranging terms in (22) and expressing the probabilities explicitly via the definition in (5), we obtain
F = Σ_{i=1}^{N−1} ∫_{Zi} [Σ_j λij px⃗(x⃗ | t = πj)] dⁿx + ∫_{ZN} [px⃗(x⃗ | t = π1) + Σ_j λNj px⃗(x⃗ | t = πj)] dⁿx − Σ_{i,j} λij αij        (23)
where n is the dimensionality of the observations x⃗, and where the case i = N, j = 1 is again excluded from any sum in which it would otherwise occur.
Clearly, F is minimized by choosing the class partitions Zi so that each observation x⃗ is assigned to the term in (23) that has the smallest integrand. That is, for i < N
x⃗ ∈ Zi only if  Σ_k λik px⃗(x⃗ | t = πk) ≤ Σ_k λjk px⃗(x⃗ | t = πk)        (24)
and
Σ_k λik px⃗(x⃗ | t = πk) ≤ px⃗(x⃗ | t = π1) + Σ_k λNk px⃗(x⃗ | t = πk)        (25)
where 1 ≤ j ≤ N − 1, and where the restrictions on the sums are the same as before. Cases for which particular integrands are equal may be decided in an arbitrary but consistent manner.
We now divide (24) and (25) by p(x⃗ | t = πN) and rearrange terms to obtain (for i < N)
Σ_{k=1}^{N−1} (λjk − λik) LRk ≥ λiN − λjN        (26)
Σ_{k=1}^{N−1} (λNk − λik) LRk ≥ λiN − λNN        (27)
where 1 ≤ j ≤ N − 1, and where λN1 ≡ 1. Note that some of the λij above may actually be zero, because we have for clarity removed the explicit restrictions on the sums.
Comparison of (26) and (27) with (15) and (16) reveals that the observer which satisfies the Neyman-Pearson criterion is in fact an ideal observer. That is, given the inherent arbitrariness in the definition of utility, we are free to define
U_{i|j} ≡ −λij / P(t = πj)        (28)
U_{N|1} ≡ −1 / P(t = π1)        (29)
yielding inequalities in (15) and (16) identical to those in (26) and (27); furthermore, equalities in (26) and (27) may be decided consistently with those in (15) and (16). The known result for two-class classification tasks, that the ideal observer achieves optimal performance in an ROC sense [1], thus holds for N-class classification tasks as well.
IV. Expected Utility and Optimal Performance: An Alternative View
We have shown that in an N-class classification task, the observer which maximizes expected utility, and that which maximizes performance in an ROC or Neyman-Pearson sense, are the same, namely the ideal observer. It is interesting to ask whether there is a more direct connection between maximization of expected utility and optimization of performance in an ROC sense. Such a connection was explored by one of us [3] for the two-class classification task; in this section we extend those results to the N-class classification task.
Consider again the expected utility as given in (7) through (10), but now in general form rather than with respect to the utility of particular decisions. That is
E{U} = Σ_{i=1}^{N} Σ_{j=1}^{N} U_{i|j} P(d = πi | t = πj) P(t = πj)        (30)
As in Section III, we simplify this expression by eliminating N of the N² conditional probabilities (namely, the within-class sensitivities) to obtain
E{U} = Σ_{j=1}^{N} U_{j|j} P(t = πj) + Σ_{j=1}^{N} Σ_{i≠j} (U_{i|j} − U_{j|j}) P(d = πi | t = πj) P(t = πj)        (31)
Although this expression appears cumbersome, it can be regarded quite simply as a linear relation between the expected utility E{U} and the N² − N probabilities which have been chosen to represent observer performance; the coefficients in this linear relation are determined by the various decision utilities and the a priori class probabilities.
Equation (31) can be rearranged to give P(d = πN | t = π1) as a function of the other conditional probabilities (i.e., a locus in ROC space) and of E{U}. The result is
P(d = πN | t = π1) = [Σ_{j=1}^{N} U_{j|j} P(t = πj) + Σ_{i≠j, (i,j)≠(N,1)} (U_{i|j} − U_{j|j}) P(d = πi | t = πj) P(t = πj) − E{U}] / [(U_{1|1} − U_{N|1}) P(t = π1)]        (32)
For what follows it will be necessary to assume that U1 | 1 > UN | 1; if this were not the case, an increase in expected utility would correspond to an increase in the incorrect-classification probability P(d = πN | t = π1) (or P(d = πN | t = π1) would be undefined).
Consider a fixed set of decision utilities Ui | j and a priori class probabilities P(t = πi). For this fixed set of values, (32) defines a hyperplane in ROC space. In particular, the operating point achieved by the ideal observer via (15) and (16) using the given decision utilities and a priori class probabilities is contained in this hyperplane. The ideal observer achieves a particular value of E{U} when operating at this operating point; an observer using a different decision rule, but still with the same fixed values for the decision utilities and a priori class probabilities, would obtain a generally different E{U} and, thus, would have an operating point lying in a hyperplane parallel to that just described. Moreover, because the ideal observer is that which maximizes E{U} for any given set of decision utilities, this second hyperplane would necessarily be above the hyperplane containing the ideal observer’s operating point. This implies that no operating points below the hyperplane containing the ideal observer’s operating point under consideration are achievable, no matter what decision rule is used.
Now suppose we allow the decision utilities Ui | j, or the a priori class probabilities, to change slightly. A generally different operating point will be attained by the ideal observer, and this point will lie in a different hyperplane. The remaining arguments of the preceding paragraphs are unchanged, however, and it follows that no decision rule can achieve any operating point lying in the union of the two regions below the hyperplanes containing the two ideal observer operating points in question. Continuing in this fashion for every point in the domain of the ROC space, we find that the ideal observer’s performance is given by the concave hull of the set of hyperplanes so defined, and that no observer using any decision rule can achieve a performance (operating point) which lies below this hypersurface.
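The linearity underlying this argument is easy to verify numerically: with the utilities and a priori probabilities held fixed, (31) is an affine function of the misclassification probabilities, so surfaces of constant E{U} are genuinely flat hyperplanes in ROC space. All numerical values below are invented for illustration.

```python
def expected_utility(P, U, priors):
    # P[i][j] = P(d = pi_i | t = pi_j) for i != j; the correct-decision
    # probabilities are implied by the column-sum constraint (6)
    n = len(priors)
    total = 0.0
    for j in range(n):
        p_correct = 1.0 - sum(P[i][j] for i in range(n) if i != j)
        total += U[j][j] * p_correct * priors[j]
        for i in range(n):
            if i != j:
                total += U[i][j] * P[i][j] * priors[j]
    return total

# hypothetical utilities and priors
U = [[1.0, -0.5, -1.0],
     [-0.5, 1.0, -0.5],
     [-2.0, -1.0, 1.0]]
priors = [0.2, 0.3, 0.5]

# two arbitrary operating points (off-diagonal misclassification rates)
A = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.1], [0.05, 0.1, 0.0]]
B = [[0.0, 0.2, 0.1], [0.2, 0.0, 0.2], [0.1, 0.05, 0.0]]
mid = [[0.5 * (A[i][j] + B[i][j]) for j in range(3)] for i in range(3)]

# affine in the probabilities: E{U} at the midpoint equals the average
# of the endpoint values
lhs = expected_utility(mid, U, priors)
rhs = 0.5 * (expected_utility(A, U, priors) + expected_utility(B, U, priors))
assert abs(lhs - rhs) < 1e-12
```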
V. Conclusion
The N-class classification task presents many challenges which are daunting when compared with the great successes achieved in analyzing two-class classification tasks. In particular, the number of parameters required by a nontrivial decision rule, and the dimensionalities of the ROC spaces involved, increase rapidly with the number of classes.
Nevertheless, we have shown that certain conclusions regarding ideal observers can in fact be carried over from the two-class task to the N-class task. In particular, the two standard definitions of the ideal observer—the observer which maximizes expected utility, and the observer which satisfies the Neyman-Pearson criterion (optimizing ROC performance)—are consistent, as in the two-class case. Furthermore, a direct relation between maximizing expected utility and ROC performance, originally developed for the two-class case, was found to generalize in a straightforward fashion to the N-class case.
Whether these promising results can be extended to more practical issues in characterizing observer performance—e.g., fitting an ROC hypersurface to a set of estimated observer operating points—is currently unknown. It is our hope that the present work may at least provide a starting point for addressing these more difficult questions.
Acknowledgments
This work was supported in part by the National Cancer Institute under Grant R01-CA60187 (R. M. Nishikawa, principal investigator) and in part by the National Institutes of Health under Grant R01-GM57622 (C. E. Metz, principal investigator). C. E. Metz is a shareholder in R2 Technology, Inc. (Sunnyvale, CA). The Associate Editor responsible for coordinating the review of this paper and recommending its publication was H.-P. Chan.
Contributor Information
Darrin C. Edwards is with the Department of Radiology, the University of Chicago, Chicago, IL 60637 USA.
Charles E. Metz is with the Department of Radiology, the University of Chicago, Chicago, IL 60637 USA.
Matthew A. Kupinski is with the Optical Sciences Center, University of Arizona, Tucson, AZ 85721 USA.
References
1. H. L. Van Trees, Detection, Estimation and Modulation Theory: Part I. New York: Wiley, 1968.
2. J. P. Egan, Signal Detection Theory and ROC Analysis. New York: Academic, 1975.
3. C. E. Metz, “The optimal decision variable,” in unpublished lecture notes for the course “Mathematics for Medical Physicists,” Dept. Radiology, University of Chicago, 2000.
4. A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill, 1991.
5. D. Mossman, “Three-way ROCs,” Med. Decis. Making, vol. 19, pp. 78–89, 1999, doi: 10.1177/0272989X9901900110.
6. S. Dreiseitl, L. Ohno-Machado, and M. Binder, “Comparing three-class diagnostic tests by three-way ROC analysis,” Med. Decis. Making, vol. 20, pp. 323–331, 2000, doi: 10.1177/0272989X0002000309.
7. C. E. Metz, “Basic principles of ROC analysis,” Seminars in Nuclear Medicine, vol. VIII, no. 4, pp. 283–298, 1978, doi: 10.1016/s0001-2998(78)80014-2.