Abstract
Rationale and Objectives
The aim of this study is to compare the ratings of a group of readers who used two different rating scales in a Receiver Operating Characteristic (ROC) study and to clarify some remaining issues in selecting a rating scale for such studies.
Materials and Methods
We reanalyzed a previously conducted ROC study in which readers used both a five-point and a 101-point scale to identify abdominal masses in 95 cases. Summary statistics include the distribution of scores by reader for each of the rating scales, the proportion of scores tied on the five-point scale that were correctly resolved on the 101-point scale, and the proportion of paired normal–abnormal cases for which the two rating scales resulted in a different ordering of the pair.
Results
As a group the readers used 84 of the rating categories on the 101-point scale, but the categories used differed among individual readers. All readers tended to resolve the majority of ties on the five-point scale in favor of correct decisions and to maintain correct decisions when the more refined scale was used.
Conclusion
The reanalysis presented here provides additional evidence that readers in an ROC study can adjust to a 101-point scale and that the use of such a refined scale can increase discriminative ability. However, the selection of an appropriate scale should also consider the underlying abnormality in question and relevant clinical considerations.
Keywords: observer performance, ROC, rating scale
INTRODUCTION
In some fields, including but not limited to radiology, the application of Receiver Operating Characteristic (ROC) type rating systems often assumes an underlying continuous scale that is approximated by a discrete categorization. Historically a five- (or six-) point rating scale has been used for this purpose, and this method may have advantages when it is closely related to a set of commonly used diagnostic decisions/recommendations [1]. More recently a 101-point scale has been suggested for this purpose [2–4]. Because of its large number of categories, a 101-point scale can be treated as a continuous scale and can therefore avoid some of the analytical complexities associated with a discrete ordinal scale. Although several authors have discussed the limitations associated with either of these two approaches [1, 5, 6], some general issues remain. Furthermore, the possibility that some decisions in radiology should more appropriately be viewed as inherently binary [7] potentially increases the magnitude of the differences that can occur between discrete and continuous scales, since a binary decision may be viewed as using a two-point (discrete) scale. The purpose of this paper is to clarify several issues regarding the selection of a rating scale in an ROC study by comparing the actual ratings used by a group of readers in a study that employed both a five-point and a 101-point scale to identify abdominal masses [2]. We also present some summary statistics useful in describing the effect of refining a given ordinal scale.
First, it should be recognized that the statistical aspects of contrasting different rating systems depend on the true underlying categorization. If the true underlying scale is continuous, then a discrete scale by definition carries less information and will ultimately be inferior when compared using standard statistical measures. Wagner et al [6] demonstrated this in a comparison between a five-point and a continuous scale when data are generated from an underlying continuous scale. Conversely, if the true underlying scale is discrete, then using a larger number of possible ratings may increase the variance. Gur et al [7] demonstrated this in a simulation study comparing a dichotomous rating to a continuous rating when the true underlying scale is dichotomous. Thus, the scale with the most desirable statistical properties often depends upon the scale that is conceptually considered "correct" (or perhaps clinically relevant).
However, simulations usually do not take into consideration possible behavioral changes of raters. For example, a scale with more categories than the rater can distinguish may introduce additional variability by increasing "within" reader variability due to lack of consistency. Although one published paper compares a five-point ordinal scale to a 101-point scale for several abnormalities [5], there is still limited information available to assess raters' behavior and performance when using different rating scales with different underlying assumptions [8, 9]. Furthermore, different readers may not behave similarly under the same rating conditions [9]. One objective of this paper is to contrast the ratings used by individual readers on a five-point scale with the ratings obtained for the same set of cases on a 101-point scale. The study on which this analysis is based was previously published and showed no statistically significant difference in the estimated areas under the ROC curves (AUCs) for the two scales, but the behavior of the readers when using these scales was not described in detail. We also present a useful approach to summarizing the potential benefit of scale refinement, as well as a possible change in discriminating effect due to increases in variability.
Specifically we wish to address the following questions:
- How much of the 101-point scale did the readers actually use in the study in question, and did the number of categories actually used differ among readers?
- Was there an approximate range of values on the 101-point scale that corresponded to specific discrete rating categories and, if so, did it differ by reader?
- Did use of the more refined 101-point scale tend to improve the discrimination between disease and non-disease cases, or did the large number of rating categories result in an unacceptable number of classifications that were inconsistent with the original five-point scale?
In answering these questions we use several simple summary statistics that we believe are useful for contrasting the effect of using a refined scale as compared with a five category scale in an ROC setting.
METHODS
Analysis was conducted on ratings by five readers interpreting 95 examinations in which identification of the presence or absence of one or more abdominal masses was the primary diagnostic task. Ratings were provided by each reader on both a 101-point scale and a five-point scale, with higher ratings indicating a greater likelihood of the presence of an abdominal mass. Each reader interpreted approximately 20 cases per session, and either the five-point or the 101-point rating scale was used throughout each session. A minimum of three weeks was required between sessions in which the same cases were scored, and the sequence in which the two scales were used was randomized. There were 57 cases with and 38 cases without the abnormalities in question. Detailed methodology of the original study has been provided elsewhere [2]. The original study focused on a comparison of the areas under the ROC curves for the two rating scales, while the present study investigates the possible impact of changes in the rating scale on individual cases by different readers and by the group of readers as a whole. The summary statistics used in this analysis are based on pairs of normal–abnormal cases and can therefore be related directly to the nonparametric estimate of the area under the ROC curve based on the Wilcoxon statistic.
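As an illustrative sketch (not part of the original analysis), the Wilcoxon-based nonparametric AUC referenced above is simply the proportion of normal–abnormal pairs ordered correctly, with ties counting one half; the ratings below are hypothetical:

```python
import itertools

def wilcoxon_auc(normal_scores, abnormal_scores):
    """Nonparametric AUC estimate: the fraction of normal-abnormal pairs
    in which the abnormal case receives the higher rating, with tied
    pairs contributing one half."""
    pairs = list(itertools.product(normal_scores, abnormal_scores))
    correct = sum(1.0 for n, d in pairs if n < d)
    tied = sum(1.0 for n, d in pairs if n == d)
    return (correct + 0.5 * tied) / len(pairs)

# Toy example with hypothetical 101-point ratings (not study data):
normals = [0, 10, 25, 50]
abnormals = [40, 80, 95, 100, 100]
print(wilcoxon_auc(normals, abnormals))  # 19 of 20 pairs correct -> 0.95
```

Because this estimator counts each tie as half a correct ordering, resolving ties in favor of the correct ordering raises the estimate while resolving them incorrectly lowers it, which motivates the tie-resolution summaries reported below.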
RESULTS
As a group the raters used 84 of the 101 possible rating categories. There was variability among readers, and the actual number used by each of the five readers was 29, 14, 43, 52, and 42, respectively. The readers were less decisive in declaring a negative examination, with only 3.4% of the scores rated 0 (i.e., definitely negative for any of the abnormalities in question). Conversely, 21.5% of the scores were 100 (i.e., at least one of the abnormalities in question is definitely present). Of the remaining scores (N = 357), 42.9% fell in categories representing multiples of 5 (e.g., 5, 10, 15, …). Table 1 shows the range of scores on the 101-point scale corresponding to the same percentile associated with a specific score on the five-point scale. Note that this represents a uni-directional mapping; it does not imply that, for each reader, cases with a given rating on the five-point scale always produced ratings on the 101-point scale that lay in non-overlapping ranges. We note that there is some variability among readers' ratings on the 101-point scale, with values between 0–10 and 91–100 roughly corresponding to 1 and 5, respectively, on the five-point ordinal scale, and that the discrete category 3 (i.e., the mid-range category) of the five-point scale frequently covers the widest range.
Table 1.
Range of Continuous Scores Corresponding to Discrete Scores for each of the Readers that were Evaluated in this Study
| Discrete rating | Reader 1 | Reader 2 | Reader 3 | Reader 4 | Reader 5 |
|---|---|---|---|---|---|
| 1 | 0–5 | 0–10 | 0–12 | 0–8 | 0–5 |
| 2 | 5–21 | 10–20 | 12–50 | 8–35 | 5–24 |
| 3 | 21–80 | 20–60 | 50–71 | 35–62 | 24–63 |
| 4 | 80–95 | 60–90 | 71–89 | 62–84 | 63–94 |
| 5 | 95–100 | 90–100 | 89–100 | 84–100 | 94–100 |
Table 2 summarizes the results after reclassifying all possible pairs of ratings from the five-point ordinal scale to the 101-point scale, with ratings combined across all readers. We grouped the pairs of normal and abnormal ratings into three categories: (1) the rating of the normal case is lower than the rating of the case with an abnormality, (2) the ratings of the normal and abnormal cases are the same, and (3) the rating of the abnormal case is lower than the rating of the normal case. As expected, the fraction of pairs with tied scores was lower with the 101-point scale than with the five-point scale (2.5% and 11.3%, respectively). If the refined 101-point scale is an improvement, one would expect ties to be resolved in favor of "truth" when the more "precise" classification is used. This is in fact what occurred: 64.6% of the pairs tied on the discrete scale were resolved correctly, 9.8% remained tied, and 25.5% were resolved incorrectly. Thus, of the ties that were resolved, "correct" orderings outnumbered "incorrect" orderings by more than a two to one ratio, and this tendency to favor a correct resolution of a tie was statistically significant for each individual reader (p < .001).
Table 2.
Cross Classification of all Possible Pairings of Negative and Positive Cases Compared with the Verified Clinical Truth
| 5 point \ 101 point | Correct (ND<D) | Tied | Incorrect (D<ND) | Total |
|---|---|---|---|---|
| Correct (ND<D) | 8339 | 122 | 341 | 8802 |
| Tied | 790 | 120 | 312 | 1222 |
| Incorrect (D<ND) | 327 | 33 | 446 | 806 |
| Total | 9456 | 275 | 1099 | 10830 |
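The bookkeeping behind Table 2 can be sketched as follows. Every normal–abnormal pair is classified by its ordering on each scale and the two classifications are cross-tabulated; the case labels and ratings below are hypothetical and serve only to illustrate the procedure:

```python
def classify(n, d):
    # Ordering of a normal-abnormal pair: "correct" means the normal
    # case received the lower rating (ND < D).
    return "correct" if n < d else "tied" if n == d else "incorrect"

def cross_classify(coarse, fine, truth):
    """Cross-tabulate every normal-abnormal pair by its ordering on the
    coarse and the refined scale. coarse/fine map case id -> rating;
    truth maps case id -> 0 (normal) or 1 (abnormal)."""
    normals = [c for c in truth if truth[c] == 0]
    abnormals = [c for c in truth if truth[c] == 1]
    table = {}
    for n in normals:
        for d in abnormals:
            key = (classify(coarse[n], coarse[d]), classify(fine[n], fine[d]))
            table[key] = table.get(key, 0) + 1
    return table

# Hypothetical toy data: two normal (a, b) and two abnormal (c, d) cases.
coarse = {"a": 3, "b": 3, "c": 3, "d": 4}    # five-point ratings
fine = {"a": 50, "b": 60, "c": 55, "d": 90}  # 101-point ratings
truth = {"a": 0, "b": 0, "c": 1, "d": 1}
print(cross_classify(coarse, fine, truth))
```

In this toy example the pair (a, c) is tied on the coarse scale and resolved correctly on the fine scale, while (b, c) is tied on the coarse scale and resolved incorrectly, mirroring the row structure of Table 2.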
One would also hope that under the refined scale there would not be too many reversals of the relative ordering of scores for normal–abnormal pairs; that is, the relative ordering of a normal–abnormal pair on the five-point scale should be the same as its relative ordering on the 101-point scale. Of the orderings that were initially correct on the five-point scale, 94.7% remained correct when the 101-point scale was used. Of the orderings that were initially incorrect on the five-point scale, the more refined scale left 55.3% incorrect, reversed 40.6% to a correct ordering, and changed 4.1% to ties. For each reader the proportion of correctly ordered pairs changing to incorrect was significantly lower (p < .001) than the proportion of incorrectly ordered pairs changing to correct ordering.
The tendency toward correct resolution of ties in the majority of instances, and toward reversal of a fraction of the incorrect orderings, changed the nonparametric estimate of the AUC by −0.005, +0.027, +0.043, +0.028, and −0.011 for the five individual readers, respectively, although the overall change was not statistically significant. We note that even though the general direction of the changes suggests that the refined scale should increase the nonparametric estimate of the AUC, this does not always occur. It is possible for the small fraction of correctly ordered pairs that change to incorrect ordering to represent a greater actual number of changes than the number of correct orderings gained through the resolution of ties or the reversal of pairs that were ordered incorrectly on the five-point scale. This actually occurred in our analysis for two of the readers (i.e., #1 and #5).
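Treating the pooled counts in Table 2 as if they came from a single reader, the Wilcoxon relationship makes the link between tie resolution and the AUC explicit (a sketch only; the individual-reader changes reported above differ from this pooled figure):

```python
# Margins from Table 2 (all readers pooled): pair counts by ordering.
N = 10830
five_point = {"correct": 8802, "tied": 1222, "incorrect": 806}
one_o_one = {"correct": 9456, "tied": 275, "incorrect": 1099}

def auc_from_counts(counts):
    # Wilcoxon AUC: correct pairs count 1, tied pairs count 1/2.
    return (counts["correct"] + 0.5 * counts["tied"]) / N

print(round(auc_from_counts(five_point), 3))  # 0.869
print(round(auc_from_counts(one_o_one), 3))   # 0.886
```

The pooled estimate rises because the 790 ties resolved correctly and 327 reversals to correct ordering outweigh the 312 ties resolved incorrectly and 341 reversals to incorrect ordering, consistent with the tendencies described above.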
DISCUSSION
We acknowledge that there is no generally optimal rating system and that the choice of a rating scale for a specific study is not purely a statistical issue. Reasons for the choice of a rating system include better correspondence/association with the clinical decision-making process and the fact that for some abnormalities readers may have different underlying "conceptual" scales. Information on these scales can sometimes be obtained by examining the distribution of actual ratings for a representative set of cases and readers. There is also the possibility that refining a rating system beyond what the rater can distinguish may alter the rater's behavior and introduce more variability into the system. We did not observe this in our analyses: the 101-point scale resolved more than 90% of the normal–abnormal pairs tied on the five-point scale, and approximately 70% of these resolutions favored correctly distinguishing the diseased case from the non-diseased case. This occurred without substantial loss in the proportion of pairs that were correctly ordered on the five-point scale, as the vast majority of the pairs classified correctly on the five-point scale remained correctly classified when the more refined scale was used.
Our results are consistent with those of Berbaum et al [5] in that not all scores were used by all readers when the 101-point scale was used. However, our raters still tended to use more rating categories than those in [5], and as a group they used a total of 84 of the 101 possible ratings. Metz and Pan [10] showed that in the binormal model, whenever the variances for the normal and abnormal cases are unequal, an improper ROC curve results. The reason why a change in scale would affect the tendency toward an improper ROC curve is unclear. Nevertheless, it is likely that such a tendency may vary by the abnormality in question or be related to the actual spread of the observed scores. We note that using a purely data-driven nonparametric approach eliminates the issue of "properness" of the performance curve.
CONCLUSION
The current study provides additional experimental evidence that raters can adjust to the use of a 101-point scale and that, at least in this setting, the refinement appears to increase their discriminative ability. Although a 101-point rating scale continues to be a viable option, the decision as to the most appropriate scale for a given diagnostic task should also consider the underlying abnormality in question and other clinical practice considerations.
Acknowledgments
This work is supported in part by grant numbers EB002106 and EB001694 (to the University of Pittsburgh) from the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health.
References
- 1. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. John Wiley & Sons; New York: 2002.
- 2. Rockette HE, Gur D, Metz CE. The use of continuous and discrete confidence judgments in Receiver Operating Characteristic studies of diagnostic imaging techniques. Investigative Radiology. 1992;27:169–172. doi: 10.1097/00004424-199202000-00016.
- 3. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of ROC curves from continuously distributed data. Stat Med. 1998;17(9):1033–1053. doi: 10.1002/(sici)1097-0258(19980515)17:9<1033::aid-sim784>3.0.co;2-z.
- 4. Houn FH, Bright RA, Busher HF, et al. Study design in the evaluation of breast cancer imaging technologies. Acad Radiol. 2000;7:684–692. doi: 10.1016/s1076-6332(00)80524-3.
- 5. Berbaum KS, Dorfman DD, Franken EA Jr, Caldwell RT. An empirical comparison of discrete ratings and subjective probability ratings. Acad Radiol. 2002;9:756–763. doi: 10.1016/s1076-6332(03)80344-6.
- 6. Wagner RF, Beiden SV, Metz CE. Continuous vs. categorical data for ROC analysis: some quantitative considerations. Acad Radiol. 2001;8:328–334. doi: 10.1016/S1076-6332(03)80502-0.
- 7. Gur D, Rockette HE, Bandos AI. "Binary" and "non-binary" detection tasks: are current performance measures optimal? Acad Radiol. 2007;14:871–876. doi: 10.1016/j.acra.2007.03.014.
- 8. Walsh SJ. Limitations to the robustness of binormal ROC curves: effects of model misspecification and location of decision thresholds on bias, precision, size and power. Stat Med. 1997;16(6):669–679. doi: 10.1002/(sici)1097-0258(19970330)16:6<669::aid-sim489>3.0.co;2-q.
- 9. Hadjiiski L, Chan HP, Sahiner B, Helvie MA, Roubidoux MA. Quasi-continuous and discrete confidence rating scales for observer performance studies: effects on ROC analysis. Acad Radiol. 2007;14(1):38–48. doi: 10.1016/j.acra.2006.09.048.
- 10. Metz CE, Pan X. "Proper" binormal ROC curves: theory and maximum likelihood estimation. J Math Psychol. 1999;43(1):1–33. doi: 10.1006/jmps.1998.1218.
