Abstract
Rationale and Objectives
To assess whether disease-prevalence levels affected readers' confidence ratings during case interpretation in a laboratory ROC-type observer performance study.
Materials and Methods
We re-analyzed a previously conducted observer performance study that included 14 readers and 5 different levels of prevalence. That study found that, in the laboratory, we could not detect a “prevalence effect” in terms of differences in areas under the ROC curves. The detection ratings (for presence or absence) of lung nodules, interstitial disease, and pneumothorax at the five prevalence levels were compared, and a test for trend in average ratings as a function of abnormality prevalence was performed within a mixed-model setting that accounts for the different sources of variability and the correlations induced by the study design.
Results
Confidence ratings that the specific abnormality in question was present tended, on average, to be higher when actual disease prevalence was lower. The rate at which average confidence ratings increased with decreasing prevalence was very similar for actually positive and actually negative cases for every abnormality considered. The observed trend in average confidence ratings as a function of prevalence level was statistically significant (p<0.01).
Conclusion
Expectations of disease prevalence in the case mix during a laboratory observer performance study may systematically affect the behavior of observers in terms of their actual confidence ratings.
INTRODUCTION
Receiver Operating Characteristic (ROC) type studies continue to be the preferred method of assessing performance when the observer is considered an integral part of the diagnostic imaging system. Although ROC methodology has been refined extensively over the years, the issue of the generalizability of study results to clinical practice remains (1–5). One potential bias that we evaluated previously is the possible effect of abnormality prevalence in the study population on observer performance in the laboratory environment. Laboratory studies frequently include a substantially larger fraction of positive, as well as difficult (subtle), cases than would be observed in a typical clinical practice. To investigate this issue, we performed a large multi-observer, multi-abnormality ROC study to assess whether observers’ diagnostic performance was affected at different levels of prevalence (6). The results of that study suggested that, despite the substantial changes in prevalence levels across the reading modes, observers’ performance as measured by the area under the ROC curve (AUC) was not significantly affected; namely, experimentally we did not measure a “prevalence effect”. Little additional work on the “prevalence effect” has been published in other fields (7). Clearly, appropriate changes in confidence ratings are expected when the measured performance level changes as well (8). However, there are no published observations regarding shifts in confidence ratings under changes in prevalence levels that result in comparable performance. Because our original study used AUC as the primary index (measure) of performance and assessed AUC as a function of prevalence level, we did not analyze at the time whether observers changed their actual rating patterns in terms of their recorded confidence ratings and, if they did, whether the ratings changed in a systematic manner.
Intuitively, one might expect that at higher prevalence levels observers would rate all cases as “more positive” for a variety of reasons, and one could simulate how an ideal observer would be expected to behave under this scenario. In this study we analyzed a series of datasets from our own “prevalence effect” study to search for consistent patterns, if any, in observers’ ratings.
MATERIALS AND METHODS
Original Study
The original study was performed several years previously and is described in detail elsewhere (6). During the original observer performance study a total of 1632 PA chest images were read independently by eight board-certified radiologists, two fellows, and four third-year residents under each of five different reading modes. These modes included five different levels of prevalence of different abnormalities ranging from 28% to 2%. The abnormalities of interest that were each rated independently were interstitial disease, nodule, pneumothorax, alveolar infiltrate, and rib fracture. Readers scored each image using a quasi “continuous” rating scale of 101 categories from 0 (the abnormality is definitely not present) to 100 (the abnormality is definitely present). These ratings represented individual reader’s estimates of the likelihood (“confidence”) that the abnormality of interest was present (depicted) or absent (not depicted) on each of the images. The study included a common core group of cases in all modes. This allowed us to compare performance levels as measured by the area under the ROC curve for this core set of 194 cases (144 positive and 50 negative). An increasing number (50, 200, 550, and 1,383) of negative cases was added to the core set resulting in four modes with decreasing levels of prevalence. The highest prevalence mode consisted of the core set enhanced by an additional set of 55 positive cases.
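The AUC used to summarize performance in the original study can be computed directly from 0–100 confidence ratings via the empirical (Mann–Whitney) statistic. A minimal sketch follows; the ratings below are invented for illustration and are not study data:

```python
# Empirical AUC: the probability that a randomly chosen positive case receives
# a higher confidence rating than a randomly chosen negative case (ties = 1/2).
def empirical_auc(pos_ratings, neg_ratings):
    wins = 0.0
    for p in pos_ratings:
        for n in neg_ratings:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_ratings) * len(neg_ratings))

# Invented example ratings on the study's 0-100 scale:
pos = [80, 65, 90, 55]
neg = [20, 35, 10, 55]
print(empirical_auc(pos, neg))  # → 0.96875
```

Because this statistic depends only on the ordering of the ratings, a uniform upward shift of all ratings leaves the AUC unchanged, which is consistent with stable AUCs coexisting with systematic rating shifts.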
Readers completed all readings in one prevalence mode before being allowed to start another mode, with a minimum required time between modes. The order in which modes were presented to each reader, as well as the order of cases within each mode, was managed by computer software: reader-to-mode order was counterbalanced and case order within a mode was randomized. Readers were given as much time as needed to interpret each case. The primary analysis was performed on the core group of 194 cases included in all five modes. Areas under the ROC curves were estimated for the five modes and 14 readers for the detection of each of the five abnormalities and compared. The original study did not find significant differences in performance (AUC) among the five prevalence levels for any of the abnormalities in question, although differences were observed between the performance of faculty (the eight board-certified radiologists) as a group and non-faculty (residents and fellows combined) as a group (6).
ANALYSES
For the present re-analysis we evaluated observers’ rating patterns to see if overall observers’ confidence ratings for the core group of 194 cases systematically changed in one direction or another with changes in prevalence in the specific modes. We used the original rating data of the fourteen observers ascertained during the interpretation of these 194 cases for the presence of three abnormalities (interstitial disease, nodule, and pneumothorax) at all five prevalence modes. The three abnormalities were selected based on their higher prevalence rates in the dataset and the fact that these are the most commonly used abnormalities in chest x-ray based observer performance studies.
We first computed changes in the average confidence ratings over all readers, for each abnormality, and for the subsets of cases in which the abnormality in question was actually present or absent. Least-squares lines were fitted to the data sets to illustrate the observed trend. However, a qualitative illustration of the magnitude and consistency of the changes in average ratings does not provide enough information on how likely it is that the observed trends are not due to chance alone.
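The least-squares lines mentioned above can be sketched in a few lines of code. The average ratings below are invented for illustration (they are not the study's values); a positive slope means ratings rise as mode number increases, i.e., as prevalence falls:

```python
# Least-squares slope of average confidence rating vs. prevalence mode number
# (mode 1 = highest prevalence, mode 5 = lowest).
def ls_slope(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

modes = [1, 2, 3, 4, 5]
avg_ratings = [17.4, 18.2, 19.0, 20.1, 21.1]  # hypothetical per-mode averages
print(ls_slope(modes, avg_ratings))  # positive slope: ratings rise as prevalence falls
```

This simple fit is only descriptive; the formal inference, as described next, must account for reader and case variability.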
To formally assess the statistical significance of the observed trend, we have to account for the various correlations inherent in the study design and for the different sources of variability associated with random samples of cases and readers. Using the MIXED procedure in SAS software (version 9.1; SAS Institute, Cary, NC), we estimated a model that accounts for variability due to readers, cases, and abnormality status, as well as the correlations, inherent in our experimental setting, between confidence ratings assigned to the same cases by the same or different readers, under the same or different modes, and for the same or different abnormalities. The model we chose for inferential purposes includes some covariance parameters that are not statistically significant but that allow a more complete description of the correlation structure induced by the study design. This model provided a more conservative set of inferences (in terms of the p-value) than models with fewer terms.
RESULTS
Table 1 illustrates the magnitude of the change in the average confidence ratings between the two extreme (highest and lowest) prevalence modes for each of the three abnormalities and for both positive and negative cases for the abnormality in question.
Table 1.
Average confidence ratings over all readers for the highest and the lowest prevalence modes for the three abnormalities of interest.
| Abnormality | Mode | Prevalence level | Actually Negative Cases | Actually Positive Cases |
|---|---|---|---|---|
| Interstitial disease | 1 | 28% | 17.41 | 34.83 |
| Nodule | 1 | 21% | 15.42 | 52.57 |
| Pneumothorax | 1 | 21% | 4.69 | 59.81 |
| Interstitial disease | 5 | 3% | 21.14 | 38.83 |
| Nodule | 5 | 2% | 20.07 | 57.60 |
| Pneumothorax | 5 | 2% | 6.00 | 61.74 |
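The size of the shift in Table 1 can be read off as the difference between the lowest-prevalence mode (mode 5) and the highest-prevalence mode (mode 1); the short sketch below tabulates those differences using the values from Table 1:

```python
# Differences in average confidence ratings, mode 5 (lowest prevalence) minus
# mode 1 (highest prevalence), using the values reported in Table 1.
table1 = {
    # abnormality: (mode 1 negatives, mode 1 positives, mode 5 negatives, mode 5 positives)
    "Interstitial disease": (17.41, 34.83, 21.14, 38.83),
    "Nodule": (15.42, 52.57, 20.07, 57.60),
    "Pneumothorax": (4.69, 59.81, 6.00, 61.74),
}
for name, (neg1, pos1, neg5, pos5) in table1.items():
    print(name, round(neg5 - neg1, 2), round(pos5 - pos1, 2))
```

Every difference is positive: average ratings rose for both actually negative and actually positive cases as prevalence fell.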
Figures 1–3 depict the pattern of changes in the average confidence ratings for positive and negative cases with decreasing prevalence levels (increasing mode number) for each abnormality. Several interesting observations can be made. First, within each abnormality the fitted lines for actually negative and actually positive cases have quite similar slopes, so the vertical distance between them is nearly the same for all modes. This observation is consistent with our previous observation and study conclusion about the absence of significant change in the AUCs under the considered prevalence levels (6). Second, the average separation between the fitted lines for actually negative and actually positive cases differs among abnormalities, with the greatest distance for the detection of pneumothorax and the smallest for interstitial disease. This observation is consistent with the difficulty of the detection tasks in our case set (i.e., detection performance for pneumothorax is the highest, leading to the largest separation). Finally, average confidence ratings increased for every abnormality in both actually positive and actually negative cases. This somewhat unexpected observation constitutes the main theme of the current paper.
Figure 1.

Average confidence ratings for the detection of interstitial disease by actually positive and actually negative cases over all readers in each mode. The prevalence of interstitial disease in modes 1–5 was 28%, 17%, 10%, 6%, and 3%, respectively. The fitted lines are linear least-squares fits of the observed average ratings.
Figure 3.

Average confidence ratings for the detection of pneumothorax by actually positive and actually negative cases over all readers in each mode. The prevalence of pneumothorax in modes 1–5 was 21%, 16%, 10%, 5%, and 2%, respectively. The fitted lines are linear least-squares fits of the observed average ratings.
The observed increase in the average confidence ratings implies that, on average, readers tend to rate cases more positively as prevalence decreases. Table 2 illustrates how common such an increase was among the readers: the majority of readers showed an increasing trend in average confidence ratings for both negative and positive cases within each abnormality-specific detection task. The increasing trend of confidence ratings with decreasing prevalence levels was highly statistically significant (p<0.01) in the complete model used for inferential purposes. We note that the trend remained highly significant regardless of which fixed effects, correlations, and sources of variability were included in the other models considered.
Table 2.
Number of readers with increasing trends in confidence ratings with decreasing prevalence levels.
| Abnormality | Actually Negative cases (%)* | Actually Positive cases (%)* |
|---|---|---|
| Interstitial | 12 (86) | 10 (71) |
| Nodule | 13 (93) | 9 (64) |
| Pneumothorax | 8 (57) | 10 (71) |
* Percent is based on a total of 14 readers.
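The counts in Table 2 can be reproduced in principle by fitting a least-squares line to each reader's per-mode average ratings and counting positive slopes. A sketch with invented per-reader averages (not study data) for three hypothetical readers:

```python
# Count readers whose average ratings trend upward across modes 1-5
# (i.e., with decreasing prevalence), as tallied in Table 2.
def slope(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

modes = [1, 2, 3, 4, 5]
# One list of per-mode average ratings per hypothetical reader:
reader_averages = [
    [15.0, 16.2, 17.1, 18.0, 19.3],  # clearly rising
    [22.0, 21.5, 21.9, 22.4, 23.0],  # rising overall despite a dip
    [18.0, 17.5, 17.2, 16.9, 16.1],  # falling
]
n_increasing = sum(1 for ys in reader_averages if slope(modes, ys) > 0)
print(n_increasing)  # → 2
```

With real data, the same count would be produced per abnormality and per truth status (actually positive vs. actually negative cases) to fill each cell of Table 2.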
DISCUSSION
The design of our original study enabled us to assess, by direct comparison of the core subset of 194 cases, how observers behave under different prevalence levels in a laboratory experiment. The original observation surprised the investigating team, because a “prevalence effect” is expected in clinical practice; we concluded that, for several reasons, the observation as it relates to a laboratory experiment could be adequately explained (9). The study by Egglin et al., conducted on a small number of angiograms assessed for the presence of pulmonary emboli (PE), did show a positive “prevalence effect” (8), which was largely attributed to an increase in the correct ratings of positive cases (i.e., an increase in sensitivity). However, that study was conducted in an experimental setting with very high prevalence levels (20 and 60 percent) and included a large fraction of positive cases (3 of 8) that the authors defined as “obviously abnormal” (8). The difference in findings between the two studies may be due to the substantially lower prevalence levels and the more difficult case mix included in our study. Nevertheless, although AUCs did not change significantly in our study, the fact that observers changed their confidence ratings in a relatively systematic manner is an interesting and potentially important observation.
We do not know whether the changes we observed in rating patterns would have made a difference in actual decisions under a binary setting. Namely, when prevalence decreases, it is possible that the actual thresholds (cut points) for calling a case negative (or positive) change as well, potentially resulting in a comparable outcome (diagnosis). However, the fact that observers tended to rate cases as “more positive” in the laboratory as prevalence decreased underlines the importance of a possible “laboratory effect”, potentially limiting the ability to generalize results from a laboratory observer performance study to clinical practice.
The simplest possible explanation for our observation, which contradicts the expected behavior of an ideal observer, is that observers are aware that these studies generally include a large fraction of positive cases, and in many studies they anticipate that at least some of the cases will be quite difficult (subtle). Because the experimental ratings have no real impact on patient care, when observers do not “see” many “easy” positive cases during an interpretation session they move their ratings toward being “more positive”, and as a result their confidence ratings shift. This “chasing of positive findings” behavior is potentially different from what would be expected in the clinical environment, where readers are likely to rate cases as “more negative” when they expect lower prevalence. That tendency, combined with practice parameters that may encourage low false-positive rates, can shift their ratings in the other direction. Hence, what we observed may constitute a “laboratory effect” that should be further investigated and better understood.
The fact that our original study was quite large, included a substantial fraction of difficult cases (both positive and negative), and showed a change in confidence ratings during the first half of the readings that seemed to persist during the second half suggests that radiologists may individually and quite rapidly develop an expectation of the disease prevalence in the study and “adjust” their ratings to “follow” their own perceived expectations. Unfortunately, we do not have additional datasets that could be used to confirm or reject this observation, and it would be important to do so.
Figure 2.

Average confidence ratings for the detection of nodules by actually positive and actually negative cases over all readers in each mode. The prevalence of nodules in modes 1–5 was 21%, 16%, 10%, 5%, and 2%, respectively. The fitted lines are linear least-squares fits of the observed average ratings.
Footnotes
This work is supported in part by Public Health Service grants EB003503 and EB001694 (to the University of Pittsburgh) from the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health, Department of Health and Human Services. The authors thank Dr. Lorenzo Pesce from the University of Chicago for raising some of the issues we attempt to address in this manuscript.
References
1. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol. 1992;27:723–731.
2. Roe CA, Metz CE. Variance-component modeling in the analysis of receiver operating characteristic index estimates. Acad Radiol. 1997;4:587–600. doi:10.1016/s1076-6332(97)80210-3.
3. Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol. 2000;7:341–349. doi:10.1016/s1076-6332(00)80008-2.
4. Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: the case of unequal variance structures across modalities. Acad Radiol. 2001;8(7):605–615. doi:10.1016/S1076-6332(03)80685-2.
5. Beiden SV, Wagner RF, Campbell G, Chan HP. Analysis of uncertainties in estimates of components of variance in multivariate ROC analysis. Acad Radiol. 2001;8(7):616–622. doi:10.1016/S1076-6332(03)80686-4.
6. Gur D, Rockette HE, Armfield DR, et al. Prevalence effect in a laboratory environment. Radiology. 2003;228(1):10–14. doi:10.1148/radiol.2281020709.
7. Wolfe JM, Horowitz TS, Kenner NM. Cognitive psychology: rare items often missed in visual searches. Nature. 2005;435(7041):439–440. doi:10.1038/435439a.
8. Egglin TK, Feinstein AR. Context bias: a problem in diagnostic radiology. JAMA. 1996;276(21):1752–1755. doi:10.1001/jama.276.21.1752.
9. Gur D, Rockette HE, Warfel T, Lacomis JM, Fuhrman CR. From the laboratory to the clinic: the “prevalence effect”. Acad Radiol. 2003;10(11):1324–1326. doi:10.1016/s1076-6332(03)00466-5.
