Author manuscript; available in PMC: 2021 Aug 1.
Published in final edited form as: Pattern Recognit Lett. 2020 May 22;136:94–100. doi: 10.1016/j.patrec.2020.04.035

Receiver Operating Characteristic Curves with an Indeterminacy Zone

Giovanni Parmigiani
PMCID: PMC7351108  NIHMSID: NIHMS1603111  PMID: 32655204

Abstract

This work extends the Receiver Operating Characteristic (ROC) curve to the situation where some cases, falling in an intermediate “indeterminacy zone” of the predictor, are not classified. It addresses two challenges: the definition of sensitivity and specificity bounds for this case, and the summarization of the large number of possibilities arising from different choices of indeterminacy zones.

Graphical Abstract

The grayROC extends Receiver Operating Characteristic (ROC) visualization to classifiers that allow some cases, falling in an indeterminacy zone, to remain unclassified.

1. Introduction

Receiver Operating Characteristic (ROC) curves help with the visual assessment of the performance of classifiers. Fawcett (2006) reviews the field and points out that “ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research”.

I consider here the basic case of binary classification using a continuous score, such as a classification probability or a quantitative biomarker. Traditionally, classification is implemented by a single cutoff dichotomizing the score. In more recent applications, classification may include an intermediate area of indeterminacy, which I will call the gray zone.

For a famous example, Parker et al. (2009) present the PAM50 risk predictor of breast cancers, which provides a continuous risk score. In clinical applications, this score is most often split into three categories: low, intermediate and high. Women in the low and high categories are directed to specific clinical strategies. Women in the intermediate category are considered on a case by case basis by their clinicians. From an algorithmic standpoint, the intermediate group is not classified. Similarly, machine learning algorithms for classification of pathology and radiology images may allow for certain areas to be routed to further human examination. In these cases indeterminacy helps with practical implementation, by handling safe cases algorithmically and complex ones by human intervention.

Here I describe an algorithm for visualizing bounds on sensitivity/specificity pairs, grayROC for short, to assess the performance range of classifiers allowing for a region of indeterminacy, or gray zone. I try to address two challenges. The first is the definition of sensitivity and specificity bounds when there is indeterminacy. The second is the visual summarization of the large number of possibilities arising from different choices of gray zones.

2. Algorithm

Consider a validation study of n labeled subjects, with scores xi, i = 1,…,n. Without loss of generality, let the first n0 subjects (0 < n0 < n) have label 0 and the remaining n1 have label 1. Also, low levels of the score are taken to predict class 0. The proportion of 1’s in the target population is π, and may differ from the validation study proportion n1/n, for example if the validation study has a case-control design.

A gray zone is defined by the interval (cL, cU). The extremes are the lower and upper cutoff. Cases with score below cL are classified as 0’s. Cases above cU are classified as 1’s. The rest remain unclassified.
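The three-way rule above can be sketched in Python as follows. This is a minimal illustration, not code from the companion R package; the function name `classify_with_gray_zone` is hypothetical. One assumption is made explicit: because the gray zone is the open interval (cL, cU), scores exactly equal to a cutoff are treated as classified.

```python
from typing import List, Optional, Sequence


def classify_with_gray_zone(
    scores: Sequence[float], c_low: float, c_up: float
) -> List[Optional[int]]:
    """Assign 0, 1, or None (unclassified) to each score."""
    if c_low > c_up:
        raise ValueError("c_low must not exceed c_up")
    calls: List[Optional[int]] = []
    for x in scores:
        if x <= c_low:      # at or below the lower cutoff: class 0 (assumption at ties)
            calls.append(0)
        elif x >= c_up:     # at or above the upper cutoff: class 1 (assumption at ties)
            calls.append(1)
        else:               # strictly inside (c_low, c_up): left unclassified
            calls.append(None)
    return calls


# Made-up scores, with the gray zone (2.8, 3.5) used later in Figure 1.
calls = classify_with_gray_zone([1.0, 2.8, 3.0, 3.5, 4.0], 2.8, 3.5)
```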

Users of the grayROC need to specify a maximum tolerated percentage of unclassified cases, γ, based on the tradeoffs present in the practical application at hand. Let gj be the number of class j points falling in the gray zone. A gray zone (cL, cU) satisfies the γ-constraint if the proportion of cases in the gray zone is less than γ, that is if (g0 + g1)/n < γ. A gray zone (cL, cU) satisfies the target population γ-constraint if ((1 − π)g0 + πg1)/n < γ.
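As a hedged illustration, the two constraints can be checked with a few lines of Python. The function names are hypothetical, the gray zone is taken as the open interval (cL, cU), and the target-population version transcribes the expression in the text literally, with `prevalence` standing for π.

```python
from typing import Optional, Sequence, Tuple


def gray_counts(
    scores: Sequence[float], labels: Sequence[int], c_low: float, c_up: float
) -> Tuple[int, int]:
    """Return (g0, g1): the number of class 0 and class 1 cases inside (c_low, c_up)."""
    g0 = sum(1 for x, y in zip(scores, labels) if y == 0 and c_low < x < c_up)
    g1 = sum(1 for x, y in zip(scores, labels) if y == 1 and c_low < x < c_up)
    return g0, g1


def satisfies_gamma(
    scores: Sequence[float],
    labels: Sequence[int],
    c_low: float,
    c_up: float,
    gamma: float,
    prevalence: Optional[float] = None,
) -> bool:
    """Check the gamma-constraint; with a prevalence, check the target-population version."""
    n = len(scores)
    g0, g1 = gray_counts(scores, labels, c_low, c_up)
    if prevalence is None:
        return (g0 + g1) / n < gamma
    return ((1 - prevalence) * g0 + prevalence * g1) / n < gamma


# Made-up data: ten cases, two of which (5 and 6) fall inside (4.5, 6.5).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
ok_strict = satisfies_gamma(scores, labels, 4.5, 6.5, gamma=0.2)
ok_loose = satisfies_gamma(scores, labels, 4.5, 6.5, gamma=0.25)
```

Note that the inequality is strict, so a gray zone holding exactly a proportion γ of the cases does not satisfy the constraint.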

The grayROC algorithm is a model-free visualization. The basic building blocks are bounds on the cumulative frequencies associated with a given gray zone (cL, cU).

First, the most favorable bound on these frequencies is calculated assuming perfect discrimination within the gray zone. Imagine an oracle would take care of the points in the gray zone on behalf of the classifier, by moving them to the extremes of the gray zone so that they can be classified correctly. Formally, define the starred scores as follows:

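$$
x_i^* =
\begin{cases}
x_i & \text{if } x_i \notin (c_L, c_U), \\
c_L & \text{if } i \le n_0,\ x_i \in (c_L, c_U), \\
c_U & \text{if } i > n_0,\ x_i \in (c_L, c_U).
\end{cases}
$$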

Let IA be the indicator of the set A, and define the cumulative frequencies:

$$F_0^*(c_L, c_U) = \sum_{i=1}^{n_0} I_{\{x_i^* < (c_L + c_U)/2\}}. \tag{1}$$
$$F_1^*(c_L, c_U) = \sum_{i=n_0+1}^{n} I_{\{x_i^* < (c_L + c_U)/2\}}. \tag{2}$$
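The oracle construction can be sketched in Python as below. This is an illustrative reading of the definitions, not the author's implementation; the function names and the toy data are hypothetical. The sums are returned as raw counts, exactly as written in (1) and (2).

```python
from typing import List, Sequence, Tuple


def oracle_scores(
    scores: Sequence[float], labels: Sequence[int], c_low: float, c_up: float
) -> List[float]:
    """Starred scores: gray-zone cases are moved to the cutoff that classifies them correctly."""
    moved = []
    for x, y in zip(scores, labels):
        if c_low < x < c_up:
            moved.append(c_low if y == 0 else c_up)
        else:
            moved.append(x)
    return moved


def counts_below_midpoint(
    moved: Sequence[float], labels: Sequence[int], c_low: float, c_up: float
) -> Tuple[int, int]:
    """Return (F0, F1): per-class counts of moved scores below (c_low + c_up) / 2."""
    mid = (c_low + c_up) / 2
    f0 = sum(1 for x, y in zip(moved, labels) if y == 0 and x < mid)
    f1 = sum(1 for x, y in zip(moved, labels) if y == 1 and x < mid)
    return f0, f1


# Made-up data: four cases (3.0, 3.2, 2.9, 3.1) sit inside the gray zone (2.8, 3.5).
scores = [1.0, 2.0, 3.0, 3.2, 2.9, 3.1, 4.0, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
starred = oracle_scores(scores, labels, 2.8, 3.5)
f0_star, f1_star = counts_below_midpoint(starred, labels, 2.8, 3.5)
```

In this toy example every class 0 case ends up below the midpoint and no class 1 case does, which is the perfect within-zone discrimination the oracle is meant to represent.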

Conversely, the least favorable frequencies are constructed considering the worst case scenario for the points within the gray zone. Imagine now that a saboteur is in charge of the points in the gray zone, and moves them to the extremes of the gray zone so that they are all classified incorrectly. This would result in the “daggered” scores, defined as:

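$$
x_i^\dagger =
\begin{cases}
x_i & \text{if } x_i \notin (c_L, c_U), \\
c_U & \text{if } i \le n_0,\ x_i \in (c_L, c_U), \\
c_L & \text{if } i > n_0,\ x_i \in (c_L, c_U).
\end{cases}
$$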

Now define the cumulative frequencies:

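$$F_0^\dagger(c_L, c_U) = \sum_{i=1}^{n_0} I_{\{x_i^\dagger < (c_L + c_U)/2\}} \tag{3}$$
$$F_1^\dagger(c_L, c_U) = \sum_{i=n_0+1}^{n} I_{\{x_i^\dagger < (c_L + c_U)/2\}}. \tag{4}$$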
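A matching Python sketch of the saboteur scenario follows. The function names and data are hypothetical. As an added assumption, the counts in (3) and (4) are divided by the class sizes n0 and n1 so that they can be read as proportions, from which a sensitivity and a specificity follow.

```python
from typing import List, Sequence, Tuple


def saboteur_scores(
    scores: Sequence[float], labels: Sequence[int], c_low: float, c_up: float
) -> List[float]:
    """Daggered scores: gray-zone cases are moved to the cutoff that misclassifies them."""
    moved = []
    for x, y in zip(scores, labels):
        if c_low < x < c_up:
            moved.append(c_up if y == 0 else c_low)
        else:
            moved.append(x)
    return moved


def sensitivity_specificity(
    moved: Sequence[float], labels: Sequence[int], c_low: float, c_up: float
) -> Tuple[float, float]:
    """Rates implied by thresholding the moved scores at the gray-zone midpoint."""
    mid = (c_low + c_up) / 2
    n0 = sum(1 for y in labels if y == 0)
    n1 = sum(1 for y in labels if y == 1)
    f0 = sum(1 for x, y in zip(moved, labels) if y == 0 and x < mid)
    f1 = sum(1 for x, y in zip(moved, labels) if y == 1 and x < mid)
    # Assumption: normalize the counts by class size to obtain proportions.
    return 1 - f1 / n1, f0 / n0


# Made-up data: four cases sit inside the gray zone (2.8, 3.5).
scores = [1.0, 2.0, 3.0, 3.2, 2.9, 3.1, 4.0, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
daggered = saboteur_scores(scores, labels, 2.8, 3.5)
sens_dagger, spec_dagger = sensitivity_specificity(daggered, labels, 2.8, 3.5)
```

Here the saboteur misclassifies all four gray-zone cases, so both rates fall to one half for this toy dataset.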

We can form a large number of starred and daggered pairs of cumulative frequencies satisfying the γ-constraint. The grayROC algorithm simplifies the visualization of these pairs by grouping them, and selecting a single upper and lower limit within each group, as follows.

Consider the r observed unique ranked values of the biomarker x(1), …, x(r). These points will constitute the set of possible values for the extremes (cL, cU) of the gray zone. Now define the midpoints between two consecutive values as cj = (x(j−1) + x(j))/2 for j = 2, …, r. For each cj, consider the set of (cL, cU) pairs built by first adding the two neighboring observed points on either side, then the next two and so forth. This process continues as long as the gray zone satisfies the γ-constraint. If one of the extremes of the distribution is reached, the process continues on the other side. Among the resulting intervals, the grayROC chooses the “best” for visualization, defined as follows. For each (cL, cU), it eliminates the cases in the gray zone and then computes the area under the ROC curve (AUC, Bradley (1997)) using the classified cases only. The (cL, cU) pair maximizing the AUC so defined is (cL(cj),cU(cj)). The generating cj is not necessarily the midpoint of this interval, but will be contained in it. If multiple gray zones are tied in this maximization, the algorithm chooses the narrowest gray zone among the optima. In this way, gray zones are not used in regions where discrimination is not helped by not classifying cases.
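The enumeration of candidate gray zones around one midpoint can be sketched as follows. This is one possible reading of the expansion rule, with hypothetical function names. It starts from the degenerate zone between the two values that generate cj, widens by one observed value on each side per step, and keeps widening on one side once the other has reached an extreme. The γ-constraint and the AUC comparison are left to the caller.

```python
from typing import List, Sequence, Tuple


def midpoints(values: Sequence[float]) -> List[float]:
    """Midpoints c_j between consecutive sorted unique values."""
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


def candidate_zones(values: Sequence[float], k: int) -> List[Tuple[float, float]]:
    """Nested (c_low, c_up) candidates around the midpoint of values[k] and values[k + 1]."""
    last = len(values) - 1
    zones = []
    step = 0
    while True:
        lo = max(k - step, 0)            # stop at the smallest observed value
        hi = min(k + 1 + step, last)     # stop at the largest observed value
        zones.append((values[lo], values[hi]))
        if lo == 0 and hi == last:       # both extremes reached: nothing left to add
            return zones
        step += 1


values = [1, 2, 3, 4, 5]
mids = midpoints(values)
zones = candidate_zones(values, 1)   # candidates generated by the midpoint 2.5
```

For the midpoint 2.5 the candidates are (2, 3), (1, 4) and (1, 5); the last step grows only on the right because the left extreme has already been reached.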

Then, the upper limits are defined by the set of points

$$\left(1 - F_1^*(c_L(c_j), c_U(c_j)),\; 1 - F_0^*(c_L(c_j), c_U(c_j))\right) \tag{5}$$

as cj varies. Conversely, the lower limits are defined by the set of points

$$\left(1 - F_1^\dagger(c_L(c_j), c_U(c_j)),\; 1 - F_0^\dagger(c_L(c_j), c_U(c_j))\right). \tag{6}$$

for j = 2,…, r. To implement, define the degenerate gray zones (x(i),x(i)) and (x(i),x(i+1)) as the empty set.

Fix y to be either 0 or 1. The sequences defined by Fy*(cL(cj),cU(cj)) and Fy†(cL(cj),cU(cj)) as j varies in 2, …, r do not necessarily define proper cumulative distributions, as they would in a standard ROC analysis. Rather the intent is to provide bounds to the sensitivity / specificity pairs available over a range of possible gray area strategies.

Starred and daggered curves are calculated using both classified and unclassified samples. The exclusion of the unclassified samples only affects the calculation of (cL(cj),cU(cj)).

In summary, the algorithm’s steps to produce the data needed for plotting a grayROC graph are as follows:

Algorithm 1:

The grayROC procedure for computation of upper and lower limits in expressions (5) and (6).

[Algorithm 1 is shown as an image in the original manuscript (nihms-1603111-t0006.jpg).]
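Since Algorithm 1 is available here only through its caption, the following Python sketch reconstructs the procedure from the prose of this section. It is not the companion R package, and it rests on several assumptions: the function names are hypothetical; the degenerate (empty) gray zone is always admissible, so that γ = 0 reproduces the standard ROC points; the AUC is the Mann-Whitney statistic with ties counted as one half; candidates that leave only one class among the classified cases are skipped; and counts are divided by class sizes to give rates.

```python
from fractions import Fraction


def auc(scores, labels):
    """Mann-Whitney AUC with ties counted 1/2; None if either class is absent."""
    neg = [x for x, y in zip(scores, labels) if y == 0]
    pos = [x for x, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        return None
    twice_wins = sum(2 * (p > q) + (p == q) for p in pos for q in neg)
    return Fraction(twice_wins, 2 * len(pos) * len(neg))  # exact, so ties are detected


def bound_point(scores, labels, c_low, c_up, oracle):
    """(false positive rate, sensitivity) after the oracle or saboteur moves gray-zone cases."""
    mid = (c_low + c_up) / 2
    below = {0: 0, 1: 0}
    for x, y in zip(scores, labels):
        if c_low < x < c_up:
            # Oracle: class 0 -> c_low, class 1 -> c_up. The saboteur swaps them.
            x = c_low if (y == 0) == oracle else c_up
        if x < mid:
            below[y] += 1
    n0, n1 = labels.count(0), labels.count(1)
    return (1 - below[0] / n0, 1 - below[1] / n1)


def gray_roc(scores, labels, gamma):
    """One dict per midpoint c_j: the selected gray zone and its upper and lower points."""
    values = sorted(set(scores))
    n, last = len(scores), len(values) - 1
    results = []
    for k in range(last):
        best, step = None, 0
        while True:
            lo, hi = max(k - step, 0), min(k + 1 + step, last)
            c_low, c_up = values[lo], values[hi]
            inside = [c_low < x < c_up for x in scores]
            # The empty zone (step 0) is always allowed; wider ones must satisfy gamma.
            if step > 0 and sum(inside) / n >= gamma:
                break
            a = auc([x for x, g in zip(scores, inside) if not g],
                    [y for y, g in zip(labels, inside) if not g])
            if a is not None:
                key = (a, -(c_up - c_low))  # maximize AUC, then prefer the narrower zone
                if best is None or key > best[0]:
                    best = (key, c_low, c_up)
            if lo == 0 and hi == last:
                break
            step += 1
        _, c_low, c_up = best
        results.append({
            "c": (values[k] + values[k + 1]) / 2,
            "zone": (c_low, c_up),
            "upper": bound_point(scores, labels, c_low, c_up, oracle=True),
            "lower": bound_point(scores, labels, c_low, c_up, oracle=False),
        })
    return results


# Made-up data: eight cases with imperfect separation.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
limits = gray_roc(scores, labels, gamma=0.3)
standard = gray_roc(scores, labels, gamma=0.0)
```

For the midpoint 4.5 in this toy dataset, removing the two cases inside (3, 6) raises the AUC of the classified cases from 13/16 to 8/9, so that zone is selected. With `gamma=0.0` only empty zones are available and the upper and lower points coincide.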

I explored an alternative implementation where the lower and upper limit of the gray area are used in turn to index the AUC optimization, instead of the midpoints. Upper and lower limits can produce markedly different results. Bounds are less stable than the midpoints when sample sizes are small. Nonetheless, this strategy provides a different view of the overlap in the tails, and may turn out to be useful in some applications.

3. Illustration

To illustrate the application and interpretation of the grayROC, I consider a gene expression biomarker for the prediction of suboptimal (class 0) versus optimal (class 1) surgical debulking in ovarian cancer patients. Data are available from the CuratedOvarianData Bioconductor package by Ganzfried et al. (2013). Clinical and biological background can be found in Riester et al. (2014). The specific biomarker presented here reflects the transcriptional level of the gene ZNF544, as measured using an Agilent microarray by Yoshihara et al. (2012).

Figure 1 shows the observed biomarker levels by class. Higher levels of expression are generally associated with optimal debulking (class 1). Figure 1 also illustrates the type of hypothetical scenarios that enter as building blocks in the construction of the grayROC, to visually represent the definitions of x* and x†.

Fig. 1.


Dotplots of biomarker levels by class as observed (top) and in the hypothetical scenarios used in the construction of the grayROC plot. The gray zone is (2.8,3.5). In the “oracle” scenario, class 1 points in the gray zone are moved to the upper limit 3.5 while the class 0 points in the gray zone are moved to 2.8. The reverse is true in the “saboteur” scenario.

Each of the hypothetical scenarios in Figure 1 enters the optimization used to find the cU(cj)’s. These in turn are used to form the starred and daggered sensitivity and specificity bounds. Figure 2 shows segments connecting starred and daggered points corresponding to the two bounds associated with the same cj. These can be used to explore potential gray area strategies. Say one is interested in a classifier with approximately 80% specificity and 70% sensitivity. ZNF544 does not reach this performance. The upper points inform us that if one were allowed to pass 20% of suitably chosen observations to the oracle, then ZNF544 could come close to the desired sensitivity/specificity trade-off. The lower points inform us that if the same observations were passed to the saboteur, the sensitivity and specificity would drop close to the diagonal line of no discrimination.

Fig. 2.


grayROC displays at maximum tolerated percentage of unclassified cases, γ, of .2. The top panel shows segments connecting starred and daggered points corresponding to the same cj. The segments collapse to a point when the optimal gray area for the corresponding cj is empty. The bottom panel shows, in addition, the area between the two curves defined by connecting the starred and daggered points. The thinner line corresponds to the standard ROC curve.

Figure 2 also shows, in the bottom panel, the region defined by the starred points as the upper limit, and by the daggered points as the lower limit. Points within the region are not easily interpretable in terms of the optimization of the previous section. The shading is purely a visual aid.

Figure 3 shows grayROC visualizations corresponding to four additional choices of γ.

Fig. 3.


grayROC displays for ZNF544 at maximum tolerated percentage of unclassified cases, γ, of .1 (top left), .15 (top right), .25 (bottom left) and .30 (bottom right). The thinner line corresponds to the standard ROC curve.

Figure 2 also illustrates that the region defined by the upper and lower limits in the grayROC algorithm is not necessarily convex.

If γ = 0 the grayROC region collapses to the standard ROC line, also drawn in Figures 2 and 3.

In regions where the two class-specific distributions have little overlap, say left of 2, there can be little or no advantage in allowing for a gray zone. Conversely, where the density of biomarker points in the two classes is similar, a gray zone has the potential to improve the practical implementation of the biomarker. Figure 4 depicts this trade-off by elucidating where in the biomarker range the gray area is useful. Only in a narrow range of values does the grayROC algorithm need to make full use of the 20% of data points allowed for the gray zone (top panel).

Fig. 4.


Proportion of points falling in the gray zone (left) and width of the gray zone in the biomarker scale (right) as a function of cj at γ = .2.

Lastly, Figure 5 shows grayROCs for four additional genes, chosen in part to illustrate less common features. Regions can be disjoint, when stretches of non-empty gray areas are followed by stretches of empty gray areas. Often this is associated with lack of monotonicity in the likelihood ratio of the two conditional biomarker distributions.

Fig. 5.


grayROC displays at maximum tolerated percentage of unclassified cases, γ, of .2 for the four genes indicated at the top of each panel. The thinner line corresponds to the standard ROC curve.

ZNF487 exemplifies a biomarker with relatively good discrimination. The upper bound indicates that correct reclassification of as few as 20% of cases could lead to high discrimination. This reclassification could be achieved by biomarkers that prove effective in the gray zone for ZNF487. The lower bound indicates that, if unclassified observations are handled poorly, the performance suffers, but discrimination remains above chance by a clear margin even with a gray area of 20%.

4. Discussion

I am not aware of a good visualization approach to examine classification algorithms that allow for an area of indeterminacy. I hope the grayROC will prove of practical help.

A grayROC visualization depends on the specification of the proportion γ of cases falling in the indeterminacy zone. The grayROC is, by design, sensitive to γ. Also, the influence of γ will differ in each dataset. In general, a plausible choice of γ may reflect the trade-offs inherent to the practical implementation of the algorithm. A grayROC can help users quantify and communicate the consequences of adopting a specific γ.

A full decision analytic approach (Raiffa and Schlaifer (1961)) for selecting upper and lower thresholds (and thus γ) is feasible if one is able to quantify the utility associated with classifications, as well as the utilities following assignment to the gray zone. While the grayROC is not a method for optimally selecting γ, it can assist if the decision can only be approached informally. For example, if indeterminate cases need to be examined by a costly human reader for accurate classification, different grayROC plots at varying γ can be used to informally evaluate the trade-off between added accuracy and added cost.

The grayROC is helpful when all cases have a known binary label but some are not classified. This differs from multi-class ROC analysis (e.g. Hand and Till (2001)), where the number of labels is greater than two. It also differs from semisupervised analyses (Chapelle et al. (2006)), where some cases are not labeled. Lastly, it differs from systems where binary labels and/or classifications are replaced by fuzzy set memberships.

Evangelista et al. (2005) consider ensemble methods for classification. They visualize properties of the ensembles using a single ROC curve based on aggregating multiple classifiers through fuzzy logic operators including T-conorms and T-norms (Jang et al. (1997)). They term this approach “fuzzy ROC”. In related work Castanho et al. (2007) generalize traditional ROC analysis to evaluate a single fuzzy-rule-based system, not necessarily arising through ensemble learning.

There are many valid alternatives to ROC curves for investigating and visualizing the properties of a threshold-based classifier. These include Total Operating Characteristic (Pontius and Si (2014)), Decision Curve Analysis (Vickers and Elkin (2006)) and Detection Error Tradeoff, which plots the false rejection rate versus the false acceptance rate (Martin et al. (1997)). I hope that the ideas illustrated here may be helpful in generalizing these methods to classifiers with indeterminacy zones.

The grayROC is not a visualization of uncertainty about the ROC curve in the standard statistical sense. Both the upper and lower bounds are themselves point estimates, and their variability could be addressed by simple resampling approaches. Yet visualizing both the set and the uncertainty about the set boundaries could be challenging. Also, γ is expressed in terms of the (potentially rescaled) proportion of cases in the validation study, without consideration for uncertainty.
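One way the resampling idea might look in practice is sketched below in Python. This is only an illustration under simplifying assumptions, not a method described in the paper: the gray zone is held fixed rather than re-optimized in each resample, resampling is stratified by class, only the oracle point is shown, and all names are hypothetical.

```python
import random


def oracle_point(scores, labels, c_low, c_up):
    """(false positive rate, sensitivity) under the oracle scenario for a fixed gray zone."""
    mid = (c_low + c_up) / 2
    below = {0: 0, 1: 0}
    for x, y in zip(scores, labels):
        if c_low < x < c_up:
            x = c_low if y == 0 else c_up   # move gray-zone cases to the correct side
        if x < mid:
            below[y] += 1
    n0, n1 = labels.count(0), labels.count(1)
    return (1 - below[0] / n0, 1 - below[1] / n1)


def bootstrap_oracle_points(scores, labels, c_low, c_up, n_boot=200, seed=0):
    """Stratified bootstrap replicates of the oracle point (gray zone held fixed)."""
    rng = random.Random(seed)
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    points = []
    for _ in range(n_boot):
        # Resample within each class so that both classes are always present.
        draw = [rng.choice(idx0) for _ in idx0] + [rng.choice(idx1) for _ in idx1]
        points.append(
            oracle_point([scores[i] for i in draw], [labels[i] for i in draw], c_low, c_up)
        )
    return points


# Made-up data and a fixed gray zone (3, 6).
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
replicates = bootstrap_oracle_points(scores, labels, 3, 6, n_boot=200, seed=0)
```

The spread of the replicated points gives a rough sense of the sampling variability of one upper-limit point; a fuller treatment would repeat the gray zone selection inside every resample.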

The oracle and saboteur scenarios are extreme. Variants of this algorithm could be constructed by further specifying bounds on the proportion of cases that could be correctly classified by a human if left in the gray area. Then, instead of moving all the gray area points to the extremes, these known proportions could be used to move only some of the points and achieve less extreme bounds. These classification proportions could potentially depend on the biomarker region.
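A rough Python sketch of such a variant is given below. It is an assumption-laden illustration rather than a specification: a single proportion `p_correct` (a hypothetical parameter) is applied to every gray-zone case regardless of biomarker region, and expected counts are used instead of moving individual points. Setting `p_correct=1` recovers the oracle point and `p_correct=0` the saboteur point.

```python
def partial_oracle_point(scores, labels, c_low, c_up, p_correct):
    """(false positive rate, sensitivity) if a fraction p_correct of gray-zone cases is resolved correctly."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    n0, n1 = labels.count(0), labels.count(1)
    pairs = list(zip(scores, labels))
    # Cases outside the gray zone that the two cutoffs classify correctly.
    true_neg = sum(1 for x, y in pairs if y == 0 and x <= c_low)
    true_pos = sum(1 for x, y in pairs if y == 1 and x >= c_up)
    # Cases inside the open gray zone, by class.
    g0 = sum(1 for x, y in pairs if y == 0 and c_low < x < c_up)
    g1 = sum(1 for x, y in pairs if y == 1 and c_low < x < c_up)
    specificity = (true_neg + p_correct * g0) / n0
    sensitivity = (true_pos + p_correct * g1) / n1
    return (1 - specificity, sensitivity)


# Made-up data and a fixed gray zone (3, 6) holding one case from each class.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
best_case = partial_oracle_point(scores, labels, 3, 6, p_correct=1.0)
worst_case = partial_oracle_point(scores, labels, 3, 6, p_correct=0.0)
halfway = partial_oracle_point(scores, labels, 3, 6, p_correct=0.5)
```

Intermediate values of `p_correct` trace a straight segment between the saboteur and oracle points, which is one simple way to obtain less extreme bounds.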

From a statistical perspective, indeterminacy can also help characterize regions of the score with poor discriminatory ability. Thus, compared to fully deterministic approaches, allowing for indeterminacy may lead to a different evaluation of classifiers and different approaches to biomarker discovery.

Research Highlights

  • The grayROC extends the ROC to classifiers that allow cases falling in an indeterminacy zone to remain unclassified.

  • It is based on bounds on sensitivity and specificity derived accounting for the indeterminacy.

  • A simple optimization focuses the visualization on the most interesting cases, avoiding indeterminacy when it is not needed.

Acknowledgments

Work supported by NIH-NCI grant 4P30CA006516-51 and NSF grant DMS-1810829. The companion repository at https://github.com/gp1d/grayROC.git includes: 1) an R package implementing Algorithm 1 and grayROC plots; 2) R markdown files to reproduce all the analyses in Section 3.


References

  1. Bradley A, 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (7), 1145–1159.
  2. Castanho MJP, Barros LC, Yamakami A, Vendite LL, 2007. Fuzzy Receiver Operating Characteristic Curve: An Option to Evaluate Diagnostic Tests. IEEE Transactions on Information Technology in Biomedicine 11 (3), 244–250.
  3. Chapelle O, Schölkopf B, Zien A, 2006. Semi-supervised learning. MIT Press, Cambridge.
  4. Evangelista PE, Embrechts MJ, Bonissone P, Szymanski BK, July 2005. Fuzzy ROC curves for unsupervised nonparametric ensemble techniques. In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, Vol. 5, pp. 3040–3045.
  5. Fawcett T, 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861–874, ROC Analysis in Pattern Recognition. URL http://www.sciencedirect.com/science/article/pii/S016786550500303X
  6. Ganzfried BF, Riester M, Haibe-Kains B, Risch T, Tyekucheva S, Jazic I, Wang XV, Ahmadifar M, Birrer MJ, Parmigiani G, Huttenhower C, Waldron L, 2013. curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database (Oxford) 2013, bat013. doi: 10.1093/database/bat013
  7. Hand DJ, Till RJ, November 2001. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45 (2), 171–186. doi: 10.1023/A:1010920819831
  8. Jang J-SR, Sun CT, Mizutani E, 1997. Neuro-Fuzzy and soft computing: A computational approach to learning and machine intelligence. Prentice-Hall.
  9. Martin A, Doddington G, Kamm T, Ordowski M, 1997. The DET curve in assessment of detection task performance. Tech. rep., DTIC.
  10. Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, Quackenbush JF, Stijleman IJ, Palazzo J, Marron JS, Nobel AB, Mardis E, Nielsen TO, Ellis MJ, Perou CM, Bernard PS, Mar. 2009. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. Journal of Clinical Oncology 27 (8), 1160–1167.
  11. Pontius RGJ, Si K, 2014. The total operating characteristic to measure diagnostic ability for multiple thresholds. International Journal of Geographical Information Science 28 (3), 570–583. doi: 10.1080/13658816.2013.862623
  12. Raiffa H, Schlaifer R, 1961. Applied Statistical Decision Theory. MIT Press, Cambridge.
  13. Riester M, Wei W, Waldron L, Culhane AC, Trippa L, Oliva E, Kim S-H, Michor F, Huttenhower C, Parmigiani G, Birrer MJ, April 2014. Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples. J Natl Cancer Inst. doi: 10.1093/jnci/dju048
  14. Vickers AJ, Elkin EB, 2006. Decision curve analysis: a novel method for evaluating prediction models. Medical Decision Making 26 (6), 565–574.
  15. Yoshihara K, Tsunoda T, Shigemizu D, Fujiwara H, Hatae M, Fujiwara H, Masuzaki H, Katabuchi H, Kawakami Y, Okamoto A, Nogawa T, Matsumura N, Udagawa Y, Saito T, Itamochi H, Takano M, Miyagi E, Sudo T, Ushijima K, Iwase H, Seki H, Terao Y, Enomoto T, Mikami M, Akazawa K, Tsuda H, Moriya T, Tajima A, Inoue I, Tanaka K, 2012. High-risk ovarian cancer based on 126-gene expression signature is uniquely characterized by downregulation of antigen presentation pathway. Clinical Cancer Research 18 (5), 1374–1385. URL http://clincancerres.aacrjournals.org/content/18/5/1374
