Medical Physics. 2008 Sep 11;35(10):4404–4409. doi: 10.1118/1.2977766

Binary and multi-category ratings in a laboratory observer performance study: A comparison

David Gur 1,a), Andriy I Bandos 2, Jill L King 3, Amy H Klym 3, Cathy S Cohen 3, Christiane M Hakim 3, Lara A Hardesty 4, Marie A Ganott 5, Ronald L Perrin 5, William R Poller 6, Ratan Shah 7, Jules H Sumkin 7, Luisa P Wallace 7, Howard E Rockette 8
PMCID: PMC2627510  NIHMSID: NIHMS82827  PMID: 18975686

Abstract

The authors investigated radiologists' performances during retrospective interpretation of screening mammograms when using a binary decision of whether or not to recall a woman for additional procedures and compared these with their receiver operating characteristic (ROC) type performance curves obtained using a semi-continuous rating scale. Under an Institutional Review Board approved protocol, nine experienced radiologists independently rated an enriched set of 155 examinations that they had not personally read in the clinic, mixed with other enriched sets of examinations that they had individually read in the clinic, using both a screening BI-RADS rating scale (recall/not recall) and a semi-continuous ROC type rating scale (0 to 100). The vertical distance, namely the difference in sensitivity levels at the same specificity level, between the empirical ROC curve and the binary operating point was computed for each reader. The vertical distance averaged over all readers was used to assess the proximity of the performance levels under the binary and ROC-type rating scales. There does not appear to be any systematic tendency of the readers toward better performance when using either of the two rating approaches: four readers performed better using the semi-continuous rating scale, four readers performed better with the binary scale, and one reader had the binary point exactly on the empirical ROC curve. Only one of the nine readers had a binary “operating point” that was statistically distant from the same reader’s empirical ROC curve. Reader-specific differences ranged from −0.046 to 0.128, with an average width of the corresponding 95% confidence intervals of 0.2 and p-values for individual readers ranging from 0.050 to 0.966. On average, radiologists performed similarly when using the two rating scales in that the average distance between the individual readers’ binary operating points and their ROC curves was close to zero. The 95% confidence interval for the fixed-reader average (0.016) was (−0.0206, 0.0631) (two-sided p-value 0.35). In conclusion, the authors found that in retrospective observer performance studies the use of a binary response or a semi-continuous rating scale led to consistent results in terms of performance as measured by sensitivity-specificity operating points.

Keywords: observer performance, screening mammography, ROC curves, binary operating point

INTRODUCTION

Retrospective observer performance studies are performed routinely in the field of medical imaging for a number of purposes.1, 2, 3, 4, 5 The primary reason for performing these studies is to assess the performance of a system or practice, with the ultimate goal of inferring the changes that might occur if the system or practice being investigated were actually implemented in the clinic. These studies are primarily performed retrospectively because of the cost and complexity associated with prospective studies. In addition, if it is necessary for study participants to undergo additional diagnostic tests as a result of a prospective clinical decision made under a defined study protocol, there may be both an added risk and a concern as to how disagreements will be managed and potentially affect actual patient care. There are a number of reasons why results of a retrospective observer performance study may not be easily generalized to the clinic. Among these are (a) the possible effects of using datasets enriched with subtle positive cases in order to magnify differences, if any; (b) the often large variability among participating radiologists, both in the “laboratory” and the clinic; (c) the knowledge that observers are actually participating in a study, hence decisions (ratings) will not affect actual clinical management; and (d) the specific reporting task and scale used during the experiment, which often differ substantially from those used in the clinic for the “same” task.6, 7, 8, 9

In the laboratory we are frequently interested in a comparison between two or more imaging modes (e.g., reading environments and/or practices) even if the results do not exactly mimic the actual clinical task. Hence, as long as the relative performance levels between modes are consistent with what one expects to occur in the clinical setting, the study results may remain valid for this purpose. One potential problem with this assumption is that clinical decisions are frequently binary, namely whether an abnormality is depicted or not, often resulting in recommendations that are quite specific (e.g., recall the patient for additional diagnostic imaging procedures or not, or perform a biopsy or not). The comparison of performance levels of two imaging modes is then based on the comparison of the two sets of binary characteristics (sensitivity and specificity). The typical rating procedures used to generate receiver operating characteristic (ROC) curves and associated overall measures (summary indices) of observer performance in laboratory experiments are fundamentally different. In the majority of ROC studies participants are asked to provide a level of confidence, using a discrete, semi-continuous, or continuous scale, as to the likelihood that the abnormality in question actually exists or that a specific abnormality in question is either benign or malignant. Unlike binary measures of performance, such as sensitivity and specificity, which have a direct clinically relevant interpretation, ROC results are typically reported in more general terms of an overall performance-related summary index, such as the area under the curve, or a complete performance curve that demonstrates how the true positive fraction increases as a function of the false positive fraction. The binary and ROC approaches both assume that each reader operates in some possibly complex yet self-consistent manner in determining whether the image in question actually depicts (or not) a well defined abnormality. Typically, the ROC score is assumed to be latent in the binary protocol: the score is internally compared with each reader’s threshold to obtain the binary response warranting a recommendation for further action, such as additional diagnostic imaging procedures, a biopsy, or a specific treatment.10, 11 Binary performance, including that resulting from application of either an explicit or an implicit threshold, can be characterized by a single point (termed here an operating point) on a plane with coordinates 1−Specificity (FPF) and Sensitivity (TPF). The ROC curve describes the operating points obtained by varying the threshold.

One expects a binary-rating-based performance level, or “an operating point,” obtained in the laboratory to lie on the ROC curve produced from a discrete or semi-continuous rating scale when reading the same cases in the laboratory, because according to the theoretical model this binary performance is achievable by merely changing the decision threshold.11 The issue of concordance between performance levels when using binary or multiple rating scales, and the possible discrepancies between the two approaches, has been noted for audible signals.12, 13 However, there are little experimental data validating this assumption in general, and in radiology in particular. Use of a specific rating scale in the laboratory could lead to a change in performance level due to various phenomena, including a change in the interpretive process. For example, there is some evidence that “training” readers to “spread” their answers over a wide range of responses could alter their measured performance.14 Furthermore, some studies suggest that although there should be little, if any, difference between using discrete or semi-continuous rating scales in ROC studies, a difference in rating scale may produce different results, due primarily to possible observer behavioral changes in response to the scale being used.15, 16, 17 Finally, experimentally there may be a number of other variables that could affect the expected theoretical correspondence between an ROC curve and a binary operating point.
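To illustrate the threshold argument, the following minimal simulation sketch (ours, not part of the study; the binormal score distributions and the threshold value of 0.8 are arbitrary assumptions) shows that dichotomizing a latent confidence score at a fixed threshold produces a binary operating point that coincides with a point on the empirical ROC curve generated from the same scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent confidence scores (binormal model, parameters arbitrary).
scores_neg = rng.normal(0.0, 1.0, 500)   # actually negative breasts
scores_pos = rng.normal(1.5, 1.0, 100)   # actually positive breasts

def operating_point(threshold):
    """Binary 'recall' rule: recall if the latent score is at or above the threshold."""
    fpf = np.mean(scores_neg >= threshold)   # 1 - specificity
    tpf = np.mean(scores_pos >= threshold)   # sensitivity
    return fpf, tpf

# Empirical ROC curve: sweep the threshold over all observed score values.
thresholds = np.unique(np.concatenate([scores_neg, scores_pos]))
roc_points = np.array([operating_point(t) for t in thresholds])

# One fixed (implicit) threshold yields the binary operating point; by
# construction it coincides with one of the points on the empirical curve.
fpf0, tpf0 = operating_point(0.8)
assert any(np.allclose((fpf0, tpf0), p) for p in roc_points)
print(f"binary operating point: FPF0={fpf0:.3f}, TPF0={tpf0:.3f}")
```

In practice, as discussed above, behavioral differences between rating procedures may break this idealized correspondence, which is precisely what the present experiment examines.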

We investigated experimentally the relationship between the performance levels when binary and/or semi-continuous rating scales are used by the same radiologists to assess enriched sets of mammograms during the performance of a multi-mode retrospective study. The results of this assessment are presented here.

MATERIALS AND METHODS

General study design

Each of nine board-certified, Mammography Quality Standards Act (MQSA) qualified radiologists interpreted between 275 and 300 screen-film mammography (SFM) examinations that were obtained under an Institutional Review Board approved protocol (informed consent was waived) and in compliance with the Health Insurance Portability and Accountability Act (HIPAA). Radiologists viewed and rated each examination three times over a period of 20 months in a mode-balanced design. The reading modes included (1) a “clinical” screening mode, using ratings analogous to those of the screening Breast Imaging Reporting and Data System (BI-RADS) recommendations,18 (2) an ROC mode, rating the likelihood of the presence of an abnormality using a probability rating scale of 0–100, and (3) a free-response ROC (FROC) mode.19 A comparison of the results from the clinical mode (binary) and the ROC mode is the focus of this paper. A detailed description of the methodology and the justification for using screening mammography in this multi-mode project is provided elsewhere.20 In brief, a four-view “current” examination and a four-view comparison (“prior”) examination were made available to the readers if these had been read with priors in the clinic. After viewing each examination, radiologists rated the right and left breasts separately. The set read by each radiologist included the “common” set of 155 SFM examinations originally read in the clinic by other radiologists not participating in the study and an “individualized” set of examinations that had been clinically read by him/her between 2 and 6 years previously. Radiologists read all cases in one mode before moving to the next after a time delay, in a mode-balanced, case-randomized study that was managed by a comprehensive computer program. Ratings were recorded electronically.

Selection of examinations

We included a predetermined, enriched distribution of verified examinations in the following categories. Actually positive examinations included (1) all sequentially available examinations depicting pathology-confirmed cancers detected as a result of the diagnostic follow-up of a recall and (2) all false-negative examinations actually depicting an abnormality that had been originally rated as negative (BI-RADS 1 or 2 for both breasts) but was verified as positive within 1 year. Actually negative examinations included (1) examinations originally rated as BI-RADS 1 or 2 (for both breasts) and verified as not having cancer at least 1 year thereafter and (2) examinations originally rated as BI-RADS 0 for either breast (namely, those that had actually been recalled) and later rated as negative during a subsequent diagnostic workup and also confirmed to remain negative with a 1 year follow-up. Negative examinations were selected such that approximately one third of the examinations did not have a prior examination (similar to our clinical practice). As a result, 83% of actually positive and 63% of actually negative examinations in our dataset had a “prior” comparison examination.

A total of 1367 examinations were selected for this study, of which 354 (25.9%) depicted verified cancers. Among the 1013 verified cancer-free examinations, 375 (37.0%) had been recalled (i.e., given a BI-RADS rating of 0 for either breast), 146 (14.4%) had been rated benign (i.e., given a BI-RADS rating of 2 for one breast and a BI-RADS rating of either 1 or 2 for the other breast), and 492 (48.6%) had been rated negative (i.e., given a BI-RADS rating of 1 for both breasts). Of the 875 examinations depicting an abnormality (either benign or malignant), 522 (59.7%) depicted a mass, 276 (31.5%) depicted micro-calcifications, and the remaining 77 (8.8%) depicted both a mass and micro-calcifications.

Since the data analyses are performed on breast-based ratings, we note that the total number of breasts depicting a verified cancer is 356 (only 2 examinations were bilaterally positive) and the total number of actually negative breasts (BI-RADS rating of 0, 1, or 2) is 2378. The average age of the women whose examinations were selected was 53.96 years (range, 32–93). Each examination was assigned a random identification number and cleaned, and all identifying information, including time marks, was taped over with black photographic tape. Study ID labels were affixed to all films. Prior films were identified as such and specifically marked with the number of months between the “current” and “prior” examinations.

Study performance

Nine radiologists with varying experience (6–32 years) in interpreting breast imaging procedures participated in this study. Observers were unaware of the specific aims of the study and received a general and a mode-specific “Instruction to Observers” document, which included details about how certain abnormalities (e.g., asymmetric density) should be rated and other mode-specific instructions. Additionally, a training and discussion session was held prior to the commencement of interpretations under each mode.

All readers viewed and rated the 155 examinations in the common set and a varying number of “individualized” examinations, ranging from 120 to 145. Therefore, the total number of examinations evaluated by each reader varied from 275 (155+120) to 300 (155+145). The different number of examinations read by each observer stemmed from the different number of cancers detected in the clinic by each observer during the period in question, which in turn determined the mix of “individualized” examinations in the set presented to each observer. Examinations to be read during a specific session were loaded onto a film alternator according to a computerized list generated by the study management software. After matching the alternator case number with that of the computer-generated rating form, which appeared in the same sequence, observers reported their recommendations electronically.

During the clinical mode observers were first presented with a choice of rating each breast as “negative,” “benign,” or recommended for “recall,” equivalent to the screening BI-RADS ratings. If a “benign” or a “recall” rating was entered, observers were asked to identify the type of abnormality(ies) in question (e.g., “mass,” “micro-calcifications”) and to select one or more recommended procedures (e.g., ultrasound, spot CC/spot 90). During the ROC mode radiologists gave three separate multi-category assessment ratings (0% to 100%) for each breast in question: one indicating their confidence that a mass was present, another indicating their confidence that a micro-calcification cluster was present, and the last indicating their confidence that some “other” predefined abnormality was present. For any rating greater than 4%, a second slider appeared specific to that abnormality, and, “assuming that the abnormality in question was actually present,” readers indicated their assessment of the likelihood (0%–100%) that the abnormality in question was actually a cancer. However, since the screening BI-RADS ratings (0, 1, and 2) do not address probability of malignancy, only the “detection” ratings (presence versus absence) in the ROC mode were analyzed in this comparison. All readers continued their routine clinical practice in women’s imaging during the study.

Data analysis

We focus our analysis on breast-based ratings. The rating reflects a recommendation for recall during the “clinical” (screening BI-RADS) mode or a detection probability rating during the ROC mode. In the “clinical” mode, sensitivity (or true positive fraction, denoted TPF0) was computed as the fraction of “positive” breasts recommended for recall out of all breasts depicting verified cancers. Specificity (or 1 − false positive fraction, denoted 1−FPF0) was computed as the fraction of “negative” breasts not recommended for recall out of all verified “cancer-free” breasts. ROC-based performance was assessed using the maximum rating over all possible abnormalities (masses, micro-calcification clusters, or “other”) in a breast.
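As an illustration of these definitions, the following sketch computes TPF0, FPF0, and the per-breast ROC rating under an assumed, simplified per-breast data layout (the column names and toy values are ours; the actual study database is not described in this form).

```python
import pandas as pd

# One row per breast (hypothetical toy values, not the study data).
breasts = pd.DataFrame({
    "cancer": [1, 1, 0, 0, 0, 0],            # 1 = verified cancer
    "recall": [1, 0, 1, 0, 0, 1],            # binary "clinical" mode decision
    "mass":   [85, 10, 30,  5,  0, 55],      # ROC-mode detection ratings, 0-100
    "calcs":  [10, 40, 60,  0,  5, 20],
    "other":  [ 0,  0,  0,  0,  0,  0],
})

# Binary-mode operating point.
tpf0 = breasts.loc[breasts.cancer == 1, "recall"].mean()   # sensitivity (TPF0)
fpf0 = breasts.loc[breasts.cancer == 0, "recall"].mean()   # 1 - specificity (FPF0)

# ROC-mode rating per breast: the maximum over the abnormality-specific ratings.
breasts["roc_rating"] = breasts[["mass", "calcs", "other"]].max(axis=1)
print(f"TPF0={tpf0:.3f}, FPF0={fpf0:.3f}")
```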

In order to compare the performances under the “clinical” (binary) mode and the ROC mode we consider the vertical distance between the binary “operating point” (FPF0, TPF0) and the empirical ROC curve. This index is a direct extension of a previously proposed statistic21 to the case of discrete data. Specifically, we use a linear interpolation between the empirical ROC points to generate a continuous ROC curve and use this curve to compute a signed vertical distance (i.e., differences in sensitivity levels at the same specificity levels) to the corresponding binary rating operating point. A negative value of the signed distance corresponds to the instance when the “clinical” mode operating point falls below the curve; a zero value is when the point falls on the curve; and a positive value is when the point lies above the curve.
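A possible implementation of this index is sketched below (our code, not the authors’); it builds the empirical ROC curve from the ROC-mode ratings and linearly interpolates it at the binary-mode FPF0, as described above.

```python
import numpy as np

def empirical_roc(ratings, truth):
    """Empirical ROC points (FPF, TPF) obtained by sweeping the rating threshold."""
    ratings, truth = np.asarray(ratings, float), np.asarray(truth, int)
    thresholds = np.unique(ratings)[::-1]            # high to low
    fpf = [np.mean(ratings[truth == 0] >= t) for t in thresholds]
    tpf = [np.mean(ratings[truth == 1] >= t) for t in thresholds]
    # Anchor the curve at (0, 0) and (1, 1).
    return np.array([0.0] + fpf + [1.0]), np.array([0.0] + tpf + [1.0])

def signed_vertical_distance(tpf0, fpf0, ratings, truth):
    """TPF0 - TPF|FPF0, where TPF|FPF0 is read off the linearly interpolated curve."""
    fpf, tpf = empirical_roc(ratings, truth)
    tpf_at_fpf0 = np.interp(fpf0, fpf, tpf)          # weighted average of neighbours
    return tpf0 - tpf_at_fpf0
```

A positive return value corresponds to a binary operating point above the empirical ROC curve, matching the sign convention used in the paper.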

To assess the statistical uncertainty regarding the observed proximity between the operating points from the binary mode and the empirical ROC curves we used a bootstrap resampling technique.22 We separately resampled examinations with and without known abnormalities independently for each reader-specific set of examinations and for the common set of examinations. This approach allows us to conduct the statistical analysis while accounting for the complex correlation structure of the data (i.e., correlation due to the same breasts being rated by different readers, or due to the fact that within an examination the two breasts belonged to the same woman). Each bootstrap p-value or confidence interval is based on 100 000 bootstrap samples. We emphasize that we did not resample readers; hence, the analysis does not take into account between-reader variability and thus we effectively conducted a “fixed reader” analysis. For the purpose of demonstrating equivalence between the performance levels under the binary and multi-category rating scales, this “fixed reader” analysis constitutes a conservative approach, namely it is easier to reject the null hypothesis of “no difference.”
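The stratified, case-only resampling described above might be organized roughly as follows (a sketch under our own assumptions; compute_reader_distances is a hypothetical stand-in for recomputing the reader-specific vertical distances on a resampled dataset, and readers are deliberately not resampled).

```python
import numpy as np

rng = np.random.default_rng(2008)

def bootstrap_average_distance(strata, compute_reader_distances, B=100_000):
    """strata: dict mapping (examination_set, truth) -> array of examination ids,
    where examination_set is the common set or one of the reader-specific sets."""
    averages = np.empty(B)
    for b in range(B):
        # Resample examinations with replacement, separately within each stratum.
        resampled = {key: rng.choice(ids, size=len(ids), replace=True)
                     for key, ids in strata.items()}
        averages[b] = np.mean(compute_reader_distances(resampled))
    # Percentile 95% confidence interval for the fixed-reader average distance.
    ci = np.percentile(averages, [2.5, 97.5])
    # Two-sided p-value: how often a bootstrap value lies at least as far from
    # the bootstrap mean as 0 does.
    center = averages.mean()
    p = np.mean(np.abs(averages - center) >= np.abs(0.0 - center))
    return ci, p
```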

RESULTS

Table 1 shows the estimates of the binary operating points, the points on the empirical ROC curves located either directly below or directly above (i.e., at the same specificity as) the corresponding binary operating points, the bootstrap two-sided p-values, and the 95% confidence intervals for the differences between the performance levels under the binary and ROC-type rating scales. The reader-specific empirical ROC curves, along with the binary operating points, are shown in Fig. 1. From Fig. 1 and Table 1 one can see that four of the nine readers have their binary rating operating point located below the empirical ROC curve (TPF0−TPF∣FPF0<0). Reader-specific values of the signed vertical distance (TPF0−TPF∣FPF0) range from −0.046 to 0.128, with an average of 0.0159.

Table 1.

Estimates of the binary operating points, differences in sensitivities between binary points and the corresponding empirical ROC curves at the same specificity levels, and the corresponding p-values and confidence intervals for each reader as well as for the average over all readers.

| Reader | Binary scale (recall/not recall): recall rate, actually negative breasts (FPF0) | No. of negative breasts (individualized + common) | Binary scale: recall rate, actually positive breasts (TPF0) | No. of positive breasts (individualized + common) | TPF∣FPF0 from the empirical ROC curve, ROC scale (0:100) (a) | Vertical distance from the point to the curve, TPF0 − TPF∣FPF0 (b) | Two-sided p-value for testing equality of the vertical distance to 0 | 95% confidence interval for the vertical distance |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.346 | 488 (242 + 246) | 0.887 | 106 (42 + 64) | 0.925 | −0.0377 | 0.2351 | (−0.1038, 0.0283) |
| 2 | 0.419 | 492 (246 + 246) | 0.837 | 98 (34 + 64) | 0.875 | −0.0383 | 0.3756 | (−0.1238, 0.0468) |
| 3 | 0.197 | 487 (241 + 246) | 0.761 | 113 (49 + 64) | 0.761 | 0.0000 | 0.9659 | (−0.1062, 0.1150) |
| 4 | 0.139 | 483 (237 + 246) | 0.752 | 105 (41 + 64) | 0.657 | 0.0952 | 0.0501 | (0.0011, 0.2170) |
| 5 | 0.298 | 484 (238 + 246) | 0.833 | 108 (44 + 64) | 0.880 | −0.0463 | 0.2058 | (−0.1111, 0.0278) |
| 6 | 0.251 | 470 (224 + 246) | 0.732 | 82 (18 + 64) | 0.604 | 0.1281 | 0.0868 | (−0.0122, 0.2805) |
| 7 | 0.364 | 462 (216 + 246) | 0.943 | 88 (24 + 64) | 0.898 | 0.0455 | 0.1396 | (−0.0114, 0.1023) |
| 8 | 0.107 | 476 (230 + 246) | 0.603 | 78 (14 + 64) | 0.628 | −0.0257 | 0.8994 | (−0.1410, 0.1539) |
| 9 | 0.187 | 504 (258 + 246) | 0.811 | 90 (26 + 64) | 0.789 | 0.0222 | 0.5544 | (−0.0667, 0.1333) |
| Average | 0.256 | — | 0.795 | — | 0.780 | 0.0159 | 0.3495 | (−0.0206, 0.0631) |
a

TPF∣FPF0 is the recall rate for the actually positive breasts corresponding to the point on the empirical ROC curve at FPF0. In cases where the empirical ROC curve does not have an empirical point with FPF=FPF0, TPF∣FPF0 is estimated as the weighted average of the TPFs corresponding to the two empirical FPF points nearest to FPF0.

b

Primary index for the difference in performances: the difference between the sensitivity (TPF0) under the binary scale and the sensitivity of the point on the empirical ROC curve corresponding to FPF0, i.e., (TPF0−TPF∣FPF0). The p-values and confidence intervals were obtained from the bootstrap distribution of the vertical distances. The (1−α) confidence interval is the interval between the α/2 and 1−α/2 percentiles of the bootstrap distribution. The two-sided p-value is the probability (in the bootstrap space) of observing a vertical distance farther from the mean of the bootstrap distribution than 0 is. An apparent discrepancy between the p-value and the confidence interval of reader #4 is due to the asymmetry of the bootstrap confidence intervals. Confidence intervals and p-values reflect only variability due to cases and ignore the between-reader variability, thus providing a conservative approach for assessing the hypothesis of no difference.
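Written out in our own notation (consistent with footnote b), where D*₁, …, D*_B denote the bootstrap replicates of the vertical distance and D̄* their mean, the reported quantities are:

```latex
% Percentile bootstrap confidence interval and two-sided p-value (notation ours).
\[
\mathrm{CI}_{1-\alpha} \;=\; \bigl[\, q_{\alpha/2}\{D^{*}\},\; q_{1-\alpha/2}\{D^{*}\} \,\bigr],
\qquad
p \;=\; \frac{1}{B}\sum_{b=1}^{B}
\mathbf{1}\!\left\{ \bigl|D^{*}_{b}-\bar{D}^{*}\bigr| \;\ge\; \bigl|0-\bar{D}^{*}\bigr| \right\},
\]
where $q_{\gamma}\{D^{*}\}$ is the $\gamma$ percentile of the bootstrap distribution and $B=100\,000$.
```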

Figure 1.


Reader-specific empirical ROC curves (dashed curves), binary operating points (small dots), average binary operating point (large dot), and the empirical ROC curve for the pooled set of ratings. Each operating point (small dot) is connected to the corresponding point on the empirical ROC curve with a vertical line. The lengths of the vertical segments correspond to the absolute value of the distances (∣TPF0-TPF∣FPF0∣). A point being above the related ROC curve (the vertical segment extending downward from the point) results in a positive value of the signed distance (TPF0-TPF∣FPF0>0). The solid curve depicts the empirical ROC curve for the pooled ratings of all readers. This curve is shown solely for illustration purposes and was not used in the actual analysis. The vertical distance between the average binary operating point and the pooled ROC curve is 0.0093 and the average of the vertical distances actually used in the analysis is 0.0159.

The average difference between performance levels under the two scales is close to zero (average of TPF0-TPF∣FPF0 over nine readers is 0.0159; two-sided bootstrap p-value=0.35). The 95% bootstrap confidence interval22 for the fixed-reader average difference is from −0.021 to 0.063. Only one of the nine readers had a binary operating point that was statistically significantly above∕below the reader’s empirical ROC curve. Reader-specific two-sided p-values ranged from 0.050 to 0.966.

DISCUSSION

Our study shows that during a retrospective laboratory study, the use of a binary response for detection tasks produces experimental operating points that are neither systematically above nor below the estimated ROC curves obtained from a semi-continuous (101-category) rating scale. We note that the observed small average difference in performance levels can be explained by the “between-case” variability alone. Had we taken into consideration the between-reader variability, the 95% confidence intervals would have widened, making the observed difference even more negligible in the context of the population of readers. These results are in general agreement with prior studies in other areas12, 13 as well as with the results obtained when comparing discrete (5-category) and semi-continuous (101-category) rating scales in medical imaging.16 Only one of the nine readers exhibited a statistically significant difference in performance levels between the two approaches to rating examinations. We note the relatively wide confidence intervals for the reader-specific differences. This potentially limiting factor for detecting reader-specific differences does not affect the conclusion of overall consistency between the binary and the ROC approaches, since (1) there was no systematic tendency among the readers to perform worse under either of the paradigms and (2) the confidence interval for the fixed-reader-averaged difference is reasonably tight (width of approximately 0.08). We also note that in our study the recommended recall rates were high because our case sets included a large fraction of examinations depicting benign findings, and a substantial number of these could have been recalled for verification reasons in the actual clinical environment as well.

We emphasize that the consistency observed here between the results of an observer performance study under the two approaches provides grounds supporting reasonable transference between the use of a binary response and a semi-continuous rating scale in laboratory retrospective studies. However, the actual validity or generalizability of results from retrospective laboratory studies to the clinical environment is beyond the scope of this paper.

We note that the original study involved several primary hypotheses regarding the relationship among measured performance levels under different experimental scenarios and varying data ascertainment paradigms. Among the considered paradigms, the binary and ROC approaches have an extensive theoretical connection via the concept of “thresholds,” and the experimental verification of the relationship between performance levels under these two paradigms is the focus of this paper. The experimental relationship of either the binary or the ROC paradigm to the FROC paradigm remains a question that has not been fully investigated to date, although some hypotheses and corresponding methodologies have been proposed.23, 24

In summary, the use of a binary response or a semi-continuous rating scale in observer performance studies leads to reasonably consistent performance levels as measured by sensitivity-specificity operating points. Hence, inferences based on sensitivity-specificity operating points using either of these rating scales should be generally concordant.

ACKNOWLEDGMENT

This work is supported in part by Grant Nos. EB001694, EB003503, and EB002106 (to the University of Pittsburgh) from the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health.

References

  1. Awai K., Murao K., Ozawa A., Nakayama Y., Nakaura T., Liu D., Kawanaka K., Funama Y., Morishita S., and Yamashita Y., “Pulmonary nodules: estimation of malignancy at thin-section helical CT--effect of computer-aided diagnosis on performance of radiologists,” Radiology 239(1), 276–284 (2006). doi: 10.1148/radiol.2383050167
  2. Monnier-Cholley L., Carrat F., Cholley B. P., Tubiana J. M., and Arrivé L., “Detection of lung cancer on radiographs: receiver operating characteristic analyses of radiologists’, pulmonologists’, and anesthesiologists’ performance,” Radiology 233(3), 799–805 (2004).
  3. Fenlon H. M., Tello R., deCarvalho V. L., and Yucel E. K., “Signal characteristics of focal liver lesions on double echo T2-weighted conventional spin echo MRI: observer performance versus quantitative measurements of T2 relaxation times,” J. Comput. Assist. Tomogr. 24(2), 204–211 (2000).
  4. Fultz P. J., Jacobs C. V., Hall W. J., Gottlieb R., Rubens D., Totterman S. M., Meyers S., Angel C., Del Priore G., Warshal D. P., Zou K. H., and Shapiro D. E., “Ovarian cancer: comparison of observer performance for four methods of interpreting CT scans,” Radiology 212(2), 401–410 (1999).
  5. Slasky B. S., Gur D., Good W. F., Costa-Greco M. A., Harris K. M., Cooperstein L. A., and Rockette H. E., “Receiver operating characteristic analysis of chest image interpretation with conventional, laser-printed, and high-resolution workstation images,” Radiology 174(3 Pt 1), 775–780 (1990).
  6. Sica G. T., “Bias in research studies,” Radiology 238(3), 780–789 (2006).
  7. Rutter C. M. and Taplin S., “Assessing mammographers’ accuracy. A comparison of clinical and test performance,” J. Clin. Epidemiol. 53(5), 443–450 (2000).
  8. Tan A., Freeman D. H., Jr., Goodwin J. S., and Freeman J. L., “Variation in false-positive rates of mammography reading among 1067 radiologists: a population-based assessment,” Breast Cancer Res. Treat. 100(3), 309–318 (2006).
  9. Rockette H. E., Campbell W. L., Britton C. A., Holbert J. M., King J. L., and Gur D., “Empiric assessment of parameters that affect the design of multireader receiver operating characteristic studies,” Acad. Radiol. 6(12), 723–729 (1999).
  10. Eng J., “Receiver operating characteristic analysis: a primer,” Acad. Radiol. 12(7), 909–916 (2005).
  11. Wagner R. F., Metz C. E., and Campbell G., “Assessment of medical imaging systems and computer aids: a tutorial review,” Acad. Radiol. 14(6), 723–748 (2007).
  12. Green D. M. and Swets J. A., “Statistical decision theory and psychophysical procedures,” in Signal Detection Theory and Psychophysics, original ed. (Krieger, Huntington, NY, 1966), pp. 40–43.
  13. Egan J. P., Schulman A. I., and Greenberg G. Z., “Operating characteristics determined by binary decisions and by ratings,” in Signal Detection and Recognition by Human Observers, 1988 reprint ed., edited by Swets J. A. (Peninsula, Los Altos, CA, 1988), pp. 174–181.
  14. Gur D., Rockette H. E., Good W. F., Slasky B. S., Cooperstein L. A., Straub W. H., Obuchowski N. A., and Metz C. E., “Effect of observer instruction on ROC study of chest images,” Invest. Radiol. 25(3), 230–234 (1990).
  15. Berbaum K. S., Dorfman D. D., Franken E. A., Jr., and Caldwell R. T., “An empirical comparison of discrete ratings and subjective probability ratings,” Acad. Radiol. 9(7), 756–763 (2002).
  16. Rockette H. E., Gur D., and Metz C. E., “The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques,” Invest. Radiol. 27(2), 169–172 (1992). doi: 10.1097/00004424-199202000-00016
  17. Gur D., Rockette H. E., and Bandos A. I., “‘Binary’ and ‘non-binary’ detection tasks: Are current performance measures optimal?,” Acad. Radiol. 14(7), 871–876 (2007).
  18. American College of Radiology (ACR), Breast Imaging Reporting and Data System Atlas (BI-RADS Atlas) (American College of Radiology, Reston, VA, 2003). Accessed December 5, 2007, at http://www.acr.org/SecondaryMainMenuCategories/quality_safety/BIRADSAtlas/BIRADSAtlasexcerptedtext/BIRADSMammographyFourthEdition.aspx.
  19. Chakraborty D. P. and Berbaum K. S., “Observer studies involving detection and localization: modeling, analysis, and validation,” Med. Phys. 31(8), 2313–2330 (2004). doi: 10.1118/1.1769352
  20. Gur D., Bandos A. I., Cohen C. S., Hakim C. M., Hardesty L. A., Ganott M. A., Perrin R. L., Poller W. R., Shah R., Sumkin J. H., Wallace L. P., and Rockette H. E., “The laboratory effect: Comparing radiologists’ performance and variability during clinical prospective and laboratory mammography interpretations,” Radiology (in press).
  21. Beam C. A. and Wieand H. S., “A statistical method for the comparison of a discrete diagnostic test with several continuous diagnostic tests,” Biometrics 47(3), 907–919 (1991).
  22. Efron B. and Tibshirani R. J., An Introduction to the Bootstrap (Chapman & Hall, New York, 1993).
  23. Chakraborty D. P., “A search model and figure-of-merit for observer data acquired according to the free-response paradigm,” Phys. Med. Biol. 51, 3449–3462 (2006). doi: 10.1088/0031-9155/51/14/012
  24. Song T., Bandos A. I., Rockette H. E., and Gur D., “On comparing methods for discriminating between actually negative and actually positive subjects with FROC type data,” Med. Phys. 35(4), 1547–1558 (2008). doi: 10.1118/1.2890410
