Abstract
While the receiver operating characteristic (ROC) method is the acknowledged gold standard for imaging system assessment, it ignores localization information and does not differentiate between multiple abnormalities per case. Since the free-response ROC (FROC) method uses localization information and more closely resembles the clinical reporting process, it is being used increasingly. A number of methods have been proposed to analyze the data resulting from a FROC study: jackknife alternative FROC (JAFROC) and a variant termed JAFROC1, initial detection and candidate analysis (IDCA), and ROC analysis via the reduction of the multiple ratings on a case to a single rating. The focus of this paper was to compare the JAFROC1, IDCA and ROC analysis methods using a clinical FROC human data set. All methods agreed on the ordering of the modalities and all yielded statistically significant differences in the figures-of-merit, i.e., p < 0.05. Both IDCA and JAFROC1 yielded much smaller p-values than ROC. The results are consistent with a recent simulation-based validation study comparing these and other methods. In conclusion, IDCA or JAFROC1 analysis of human FROC data may be superior to ROC analysis at detecting modality differences.
INTRODUCTION
Observer performance studies are used to compare the diagnostic accuracy of imaging systems. Receiver operating characteristic (ROC) analysis is the most common tool for quantitative evaluation of observer performance and imaging systems. It applies to binary tasks, in which the observer assigns each case to one of two classes, normal or abnormal(1). Since most clinical tasks involve localization of disease, efforts are underway to develop generalizations of the ROC method that include the localization factor(2–7). The free-response receiver operating characteristic (FROC) paradigm applies to situations in which each image contains either no lesions or any number of lesions, and the observer’s task is to search for, detect and locate each lesion that is present. The unit of FROC data is a mark-rating pair, where a mark refers to the physical location of a suspicious region and the rating is a number representing the confidence level that the marked region is actually a lesion. Ideally, the analysis of FROC data should use all available data, taking into account dependencies between ratings observed on the same image; in this way the highest statistical power can be achieved. A recent jackknife alternative FROC (JAFROC) method and a variant termed JAFROC1 have become available to analyze FROC data; these methods make no independence assumptions and have been validated with simulators that include strong correlations between the ratings(8,9).
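For readers more comfortable with code than prose, the unit of FROC data described above might be represented as follows. This is a minimal illustrative sketch, not part of any of the software used in this study; all field names are our own.

```python
from dataclasses import dataclass

@dataclass
class MarkRating:
    """One unit of FROC data: a marked location plus a confidence rating."""
    x: float          # image coordinates of the mark
    y: float
    rating: int       # confidence level, e.g. 1 (lowest) to 5 (highest)
    is_lesion: bool   # True if the mark falls on a true lesion
                      # (lesion localization, a TP); False otherwise
                      # (non-lesion localization, a FP)
```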
Another approach to analyzing location-specific data has been proposed, namely the initial detection and candidate analysis (IDCA) method(6). Of course, ROC analysis of FROC data is also possible, by inferring a single rating per case (inferred ROC analysis).
Since patients benefit from better assessment methodology, which allows more precise and objective equipment selection, investigation of the different analysis methods is important. A simulation study comparing JAFROC1, IDCA and ROC for human observers showed that the statistical powers were ordered as JAFROC1 > IDCA > ROC(10). However, these results have not yet been confirmed on clinical FROC data. For this reason, the goal of this work was to compare the JAFROC1, IDCA and ROC analysis methods on a clinical FROC data set.
METHODS
Data set
Human observer FROC data from a previous study(11) were used. The observers’ task had been to localize simulated clusters of microcalcifications in 200 cranio-caudal digital mammograms, half of which were normal. All images were processed twice, using the commercially available image processing algorithms OPView v2 (Modality 1) and OPView v1 (Modality 2) (Siemens, Erlangen, Germany). Three breast imaging radiologists marked regions suspicious for the simulated clusters and rated them on a five-point rating scale.
Statistical Analysis
The investigated methods for the statistical analysis are the JAFROC1 method, the IDCA method and the ROC method. The difference between these methods is the definition of the figure-of-merit (FoM) used to quantify image quality (described in the next sections). All methods (JAFROC1, IDCA and ROC) analyze the ratings using the Dorfman-Berbaum-Metz (DBM) approach(12), originally developed for multi-reader multi-case ROC analysis. This technique involves an analysis of variance (ANOVA) of a transformation of the observed data, computed by the Quenouille-Tukey(13,14) version of the jackknife. Jackknifing is a re-sampling technique used in statistical inference to estimate the standard error of a statistic of interest (here the FoM, for example the area under the ROC curve in ROC analysis). The basic idea behind the jackknife estimator is to systematically re-compute the statistic, leaving out one observation unit (here each image) at a time from the sample set. In the Quenouille-Tukey method, the estimate obtained in this way is converted to a “pseudovalue”. The jackknife method is applied to the data of each reader separately, yielding a matrix of pseudovalues for all readers in one modality. The jackknife estimate of the statistic is calculated for each modality separately as the mean of the pseudovalues, and its standard error as the standard error of the mean of the pseudovalues. The reason why a method developed for ROC analysis can be applied to FROC data is that the DBM-ANOVA model applies to any FoM, e.g. the area under the ROC curve, sensitivity at a fixed specificity, etc., and JAFROC1 or IDCA can be regarded as simply providing another FoM, one that also accounts for localization information. Since each image is characterized by a single pseudovalue, to which all mark-rating pairs of that case contribute, the analysis makes no independence assumptions(10).
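For concreteness, the pseudovalue computation can be sketched in a few lines of Python. This is an illustration, not the DBM software used in this study; `figure_of_merit` is a placeholder for any of the FoMs defined in the following sections.

```python
import numpy as np

def jackknife_pseudovalues(cases, figure_of_merit):
    """Quenouille-Tukey jackknife: one pseudovalue per case.

    `cases` is a list of per-image data (all mark-rating pairs of one
    reader in one modality); `figure_of_merit` maps such a list to a
    scalar FoM (e.g. the area under the ROC or AFROC1 curve).
    """
    n = len(cases)
    theta_full = figure_of_merit(cases)      # FoM on the full sample
    pseudovalues = np.empty(n)
    for i in range(n):
        # Re-compute the FoM with case i left out ...
        theta_jack = figure_of_merit(cases[:i] + cases[i + 1:])
        # ... and form the pseudovalue for case i.
        pseudovalues[i] = n * theta_full - (n - 1) * theta_jack
    return pseudovalues

# Jackknife estimate of the FoM and its standard error:
#   est = pseudovalues.mean()
#   se  = pseudovalues.std(ddof=1) / np.sqrt(n)
```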
JAFROC1 analysis
The JAFROC1 figure-of-merit is the area under the alternative free-response receiver operating characteristic (AFROC1) curve(2,15). The AFROC1 curve is defined as the plot of lesion localization fraction (LLF) along the y-axis versus the false-positive fraction (FPF1) along the x-axis. The LLF is defined as the fraction of lesions that are correctly localized; the FPF1 is defined as the fraction of images on which a false positive decision is made at or above the confidence threshold. To calculate FPF1 from FROC data, the highest rated false positive event on each image (normal and abnormal) is used as the equivalent single “ROC” rating for that image, and all other FP responses on that image are neglected. This plot is constrained to lie within the unit square. Note that, since in ROC analysis abnormal images cannot yield false positives, FPF1, which is computed over both normal and abnormal cases, can only be defined in the free-response context.
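Because the area under the empirical AFROC1 curve equals a Wilcoxon-type statistic comparing each lesion rating with the highest FP rating on every image, the JAFROC1 FoM can be computed as in the following sketch. This is illustrative Python, not the actual JAFROC software; the default rating for unmarked lesions and unmarked images (here -2000) is discussed further in the Discussion.

```python
import numpy as np

UNMARKED = -2000.0  # default rating for unmarked lesions / images with no FPs

def jafroc1_fom(lesion_ratings, fp_ratings_per_image):
    """Empirical JAFROC1 FoM (area under the empirical AFROC1 curve).

    `lesion_ratings`: ratings of all lesions in the data set; unmarked
    lesions carry the default UNMARKED rating.
    `fp_ratings_per_image`: for each image (normal AND abnormal), the
    list of its FP ratings; only the highest is used, and images with
    no FPs contribute the default UNMARKED rating.
    """
    highest_fp = np.array([max(r) if r else UNMARKED
                           for r in fp_ratings_per_image])
    lesions = np.asarray(lesion_ratings)
    # Wilcoxon-style comparison of every lesion with every image's
    # highest rated FP; ties count one half.
    wins = (lesions[:, None] > highest_fp[None, :]).sum()
    ties = (lesions[:, None] == highest_fp[None, :]).sum()
    return (wins + 0.5 * ties) / (lesions.size * highest_fp.size)
```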
IDCA analysis
The initial detection and candidate analysis (IDCA) method was developed for analyzing computer-aided detection data(6,16,17). In this study it was applied to human observer data to calculate a figure-of-merit.
The IDCA FoM used in this work is the area under the FROC curve to the left of a specified value NLF = γ (AUFCγ). The FROC curve is defined as the plot of LLF along the y-axis versus the mean number of FPs per image (non-lesion localization fraction, NLF) along the x-axis. The value γ was chosen as the highest NLF of the most conservative reader (the one with the lowest number of false positives), divided by 1.2; this value was determined empirically to ensure that none of the jackknife samples yielded a highest NLF smaller than the chosen γ. The IDCA method is based on fitting the ratings of the lesion and non-lesion localizations to a pseudo-ROC curve, by regarding each rating as arising from an abnormal or a normal image, depending on whether it is a true positive or a false positive. The pseudo-ROC curve is then scaled to obtain the fitted FROC curve. The mapping operation that scales the curve consists of a point-by-point multiplication of the (x,y) coordinates of each point of the pseudo-ROC curve by the (x,y) coordinates of the observed end-point of the FROC curve (Fig. 1). The end-point of the FROC curve corresponds to all marks rated at or above the lowest threshold (equal to 1); as it includes all of the reader’s FPs on all images, it corresponds to the highest value of NLF. In this study, the pseudo-ROC curve was fitted using the binormal model algorithm ROCFIT(18).
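The scaling step and the computation of AUFCγ can be sketched as follows. This is illustrative Python, not the actual IDCA software; it assumes the fitted pseudo-ROC curve is available as a densely sampled set of points sorted from (0,0) upward, e.g. from a binormal fit.

```python
import numpy as np

def aufc_gamma(pseudo_fpf, pseudo_tpf, endpoint_nlf, endpoint_llf, gamma):
    """IDCA FoM sketch: area under the scaled FROC curve up to NLF = gamma.

    (pseudo_fpf, pseudo_tpf): sampled points of the fitted pseudo-ROC
    curve in the unit square, sorted ascending and starting at (0, 0);
    (endpoint_nlf, endpoint_llf): observed end-point of the FROC curve.
    """
    nlf = np.asarray(pseudo_fpf) * endpoint_nlf   # x: FPs per image
    llf = np.asarray(pseudo_tpf) * endpoint_llf   # y: fraction of lesions found
    keep = nlf <= gamma
    nlf_c, llf_c = nlf[keep], llf[keep]
    if nlf_c[-1] < gamma:
        # Interpolate the curve at exactly NLF = gamma before integrating.
        nlf_c = np.append(nlf_c, gamma)
        llf_c = np.append(llf_c, np.interp(gamma, nlf, llf))
    return np.trapz(llf_c, nlf_c)  # trapezoidal area to the left of gamma
```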
Figure 1.
This figure illustrates an example of the IDCA approach to fitting FROC operating points. IDCA regards the TP and FP counts as arising from abnormal and normal “cases”, respectively, in a pseudo-ROC study. The counts are analyzed by conventional ROC curve-fitting software, yielding the fitted upper curve (bold). The FROC curve, shown in the lower part of the graph, is obtained by a mapping operation consisting of a point-by-point multiplication of the pseudo-ROC curve by a scaling factor.
ROC analysis
For ROC analysis the overall image rating was taken to be the rating of the highest rated mark on the image, which could be associated with an actually positive or an actually negative finding. The FoM, the area under the ROC curve defined by these ratings, was calculated using PROPROC(19). The ROC curve is defined as the plot of true-positive fraction (TPF) versus false-positive fraction (FPF).
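The highest-rating inference step is simple enough to state in code. A minimal sketch, with our own names; the resulting per-image ratings, together with the truth state of each image, are then fed to a conventional ROC fitting program such as PROPROC.

```python
def inferred_roc_rating(mark_ratings, default=-2000.0):
    """Inferred ROC rating of one image: the rating of its highest
    rated mark, regardless of whether that mark is a lesion or a
    non-lesion localization; unmarked images receive the default."""
    return max(mark_ratings) if mark_ratings else default
```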
RESULTS
Figure 1 shows that IDCA yields an excellent fit for human observer data as well. Table 1 shows the F-statistics, with numerator and denominator degrees of freedom, and the p-values calculated for the JAFROC1, IDCA and ROC methods, respectively. Fig. 2 and Fig. 3 show the AFROC1 and PROPROC curves, respectively, averaged over the readers. The AFROC1 end-point is connected to (1,1) with a dotted line, as the FoM used for JAFROC1 analysis is the area under the complete curve.
Table 1.
F-statistics and p-values calculated for each statistical analysis.
| Statistical analysis | F-statistic (df: numerator/denominator) | p-value |
|---|---|---|
| JAFROC1 | 16.60 (1/12.1) | 0.0011 |
| IDCA | 19.33 (1/15.4) | 0.0004 |
| ROC | 11.04 (1/5) | 0.0198 |
Figure 2.
This figure illustrates the pooled AFROC1 curves (JAFROC1 analysis) for Modality 1 and Modality 2, for the average reader.
Figure 3.
This figure illustrates the pooled PROPROC curves (ROC analysis) for Modality 1 and Modality 2, for the average reader.
DISCUSSION
We have applied three methods to analyze human observer FROC data. Figure 2 and Figure 3 show that, for both the pooled AFROC1 and the pooled PROPROC curves, performance is higher for Modality 1. The figures-of-merit for all methods agreed on the ordering of the modalities and all yielded statistically significant differences, i.e., p < 0.05. While IDCA gave the smallest p-value (0.0004), JAFROC1 also gave a very small p-value (0.0011), and the difference is easily explainable by sampling variability: in the upper tail of the F-distribution a small change in F can cause a large relative change in p. The true power can only be determined using simulation methods; a simulation study in which FROC data for one reader and two modalities were generated using a search model(7) showed that JAFROC1 yielded substantially greater power than IDCA.
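This tail sensitivity is easy to illustrate numerically with the rounded F-statistics and degrees of freedom from Table 1. A SciPy sketch; because the tabulated values are rounded, the computed p-values need not match the table exactly.

```python
from scipy.stats import f  # F-distribution; accepts non-integer df

# Upper-tail p-values for the observed F-statistics (cf. Table 1).
for label, F, dfn, dfd in [("JAFROC1", 16.60, 1, 12.1),
                           ("IDCA",    19.33, 1, 15.4),
                           ("ROC",     11.04, 1, 5.0)]:
    print(f"{label}: p = {f.sf(F, dfn, dfd):.4f}")

# At fixed df, a modest (~16%) increase in F roughly halves p:
print(f.sf(16.60, 1, 12.1) / f.sf(19.33, 1, 12.1))
```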
One expects IDCA and JAFROC1 to have higher statistical power than ROC. ROC analysis accepts only one score per case, while both IDCA and JAFROC1 take all marks into account. Also, in ROC analysis, where no localization information is required, the rating on an abnormal image could be due to a non-lesion (NL) that is more suspicious than the lesion. This possibility leads to more noise in the measurement, as sometimes the lesion has the highest rating and sometimes a NL has the highest rating. The higher noise in the ROC measurement explains why even though the ROC curves are more clearly separated than the AFROC1 curves, the p-value is smaller for JAFROC1.
Since the IDCA method uses more data than JAFROC1 (and obviously ROC), namely all marks on normal and abnormal images, one would expect it to achieve more statistical power than JAFROC1, which uses all lesion-localization marks but only the highest rated non-lesion localization on each image, normal or abnormal. On the other hand, JAFROC1 assigns a rating to each and every normal image, even when the image contains no marks (in which case the default -2000 rating is assigned, which contributes to the figure-of-merit). Likewise, unmarked lesions are assigned the default rating, and the information that they went undetected is used: the fact that they are less suspicious than any marked lesion tends to decrease the figure-of-merit. This is not true for the IDCA method, where only the marked regions are used for statistical analysis and the area AUFCγ under the FROC curve is relatively insensitive to the distribution of the marks between the cases.
CONCLUSION
In conclusion, the analysis of this clinical FROC data set suggests that IDCA and JAFROC1 are superior to ROC analysis at detecting modality differences.
ACKNOWLEDGEMENT
The authors are grateful to Hong-Jun Yoon for implementation of the JAFROC and IDCA software. We thank Jurgen Jacobs for the software used for collecting FROC data in the observer performance experiment. Thanks are due to Valerie Celis, Filip Claus and Chantal Van Ongeval for serving as observers. One of us (DPC) was supported in part by grants from the Department of Health and Human Services, National Institutes of Health, R01-EB005243 and R01-EB008688. This work was in part supported also by the Mevic project. Mevic is an IBBT-project in cooperation with the following companies and organizations: Barco, Hologic, Philips, University Hospital Leuven, University of Gent MEDISIP/IPI, Free University of Brussels ETRO. IBBT is an independent multi-disciplinary research institute founded by the Flemish government to stimulate ICT innovation.
REFERENCES
1. Metz CE. Receiver operating characteristic analysis: a tool for the quantitative evaluation of observer performance and imaging systems. J. Am. Coll. Radiol. 2006;3:413–422. doi: 10.1016/j.jacr.2006.02.021.
2. Chakraborty DP, Winter LH. Free-response methodology: alternate analysis and a new observer-performance experiment. Radiology. 1990;174:873–881. doi: 10.1148/radiology.174.3.2305073.
3. Chakraborty DP. Statistical power in observer-performance studies: comparison of the receiver operating characteristic and free-response methods in tasks involving localization. Acad. Radiol. 2002;9:147–156. doi: 10.1016/s1076-6332(03)80164-2.
4. Chakraborty DP, Berbaum KS. Observer studies involving detection and localization: modeling, analysis, and validation. Med. Phys. 2004;31:2313–2330. doi: 10.1118/1.1769352.
5. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med. Phys. 1996;23:1709–1725. doi: 10.1118/1.597758.
6. Edwards DC, Kupinski MA, Metz CE, Nishikawa RM. Maximum likelihood fitting of FROC curves under an initial-detection-and-candidate-analysis model. Med. Phys. 2002;29:2861–2870. doi: 10.1118/1.1524631.
7. Chakraborty DP. A search model and figure of merit for observer data acquired according to the free-response paradigm. Phys. Med. Biol. 2006;51:3449–3462. doi: 10.1088/0031-9155/51/14/012.
8. Chakraborty DP. Analysis of location specific observer performance data: validated extensions of the jackknife free-response (JAFROC) method. Acad. Radiol. 2006;13:1187–1193. doi: 10.1016/j.acra.2006.06.016.
9. Chakraborty DP, Yoon HJ, Mello-Thoms C. Spatial localization accuracy of radiologists in free-response studies: inferring perceptual FROC curves from mark-rating data. Acad. Radiol. 2007;14:4–18. doi: 10.1016/j.acra.2006.10.015.
10. Chakraborty DP. Validation and statistical power comparison of methods for analyzing free-response observer performance studies. Acad. Radiol. 2008;15:1554–1566. doi: 10.1016/j.acra.2008.07.018.
11. Zanca F, Jacobs J, Van Ongeval C, Claus F, Celis V, Geniets C, Provost V, Pauwels H, Marchal G, Bosmans H. Evaluation of clinical image processing algorithms used in digital mammography. Med. Phys. 2008;36:765–775. doi: 10.1118/1.3077121.
12. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest. Radiol. 1992;27:723–731.
13. Quenouille MH. Note on the elimination of insignificant variates in discriminatory analysis. Ann. Eugen. 1949;14:305–308. doi: 10.1111/j.1469-1809.1947.tb02408.x.
14. Tukey JW. Bias and confidence in not-quite large samples. Ann. Math. Stat. 1958;29:614.
15. Chakraborty DP. Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data. Med. Phys. 1989;16:561–568. doi: 10.1118/1.596358.
16. Yoon HJ, Zheng B, Sahiner B, Chakraborty DP. Evaluating computer-aided detection algorithms. Med. Phys. 2007;34:2024–2038. doi: 10.1118/1.2736289.
17. Chakraborty DP, Yoon HJ. Investigation of methods for analyzing location specific observer performance data. Proc. SPIE. 2008;6917:69170C.1–69170C.12.
18. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat. Med. 1998;17:1033–1053. doi: 10.1002/(sici)1097-0258(19980515)17:9<1033::aid-sim784>3.0.co;2-z.
19. Dorfman DD, Berbaum KS, Metz CE, Lenth RV, Hanley JA, Abu Dagga H. Proper receiver operating characteristic analysis: the bigamma model. Acad. Radiol. 1997;4:138–149. doi: 10.1016/s1076-6332(97)80013-x.