A reasonable level of agreement (P < .01) between actual clinical reporting and test set conditions in screening mammography can be achieved when describing group performance if prior images are provided.
Abstract
Purpose:
To establish the extent to which test set reading can represent actual clinical reporting in screening mammography.
Materials and Methods:
Institutional ethics approval was granted, and informed consent was obtained from each participating screen reader. The need for informed consent with respect to the use of patient materials was waived. Two hundred mammographic examinations were selected from examinations reported by each of 10 individual expert screen readers, resulting in 10 reader-specific test sets. Data generated from actual clinical reports were compared with three test set conditions: clinical test set reading with prior images, laboratory test set reading with prior images, and laboratory test set reading without prior images. A further set of five expert screen readers was asked to interpret a common set of images in two identical test set conditions to establish a baseline for intraobserver variability. Confidence scores (from 1 to 4) were assigned to the respective decisions made by readers. Region-of-interest (ROI) figures of merit (FOMs) and side-specific sensitivity and specificity were calculated for the actual clinical reporting of each reader-specific test set and were compared with those for the three test set conditions. Agreement between pairs of readings was assessed by using the Kendall coefficient of concordance.
Results:
Moderate to acceptable levels of agreement (W = 0.69–0.73, P < .01) were evident between actual clinical reporting and the test set conditions when describing group performance; these values were reasonably close to the established baseline (W = 0.77, P < .01) and were lowest when prior images were excluded. Higher median ROI FOMs were demonstrated for the test set conditions than for the actual clinical reporting, possibly linked to changes in sensitivity.
Conclusion:
Reasonable levels of agreement between actual clinical reporting and test set conditions can be achieved, although inflated sensitivity may be evident with test set conditions.
© RSNA, 2013
Supplemental material: http://radiology.rsna.org/lookup/suppl/doi:10.1148/radiol.13122399/-/DC1
Introduction
Within radiology, observer performance studies have been performed to investigate the impact of factors such as prevalence (1,2), display technology (3), expertise (4,5), and environmental factors (6,7), among many others. These types of experiments often rely on test set interpretations to represent real-life clinical performance. However, the extent of the relationship between performance in experimental tests and real-life clinical readings has not been satisfactorily resolved.
This situation is particularly important in breast screening programs (4,5,8). The success of screening programs depends on accurate interpretation of mammographic examinations by screen readers; therefore, reader efficacy is monitored so that underperformance can be identified and addressed. In clinical practice, however, most screen readers are exposed to relatively few cancers each year because of breast cancer’s low incidence (approximately seven cancers per 1000 women [9]), and therefore it can take years for sufficient data to be collected by clinical audit to identify adherence (or nonadherence) to national standards (10,11). Also, once quality enhancement programs are introduced, it may again take years for any improvement in performance to be detected (10). There is therefore a need for more immediate and responsive methods to identify performance levels, and this is the basis of the introduction and implementation of test set strategies such as the BreastScreen Reader Assessment Strategy, or BREAST (8), and the PERsonal perFORmance in Mammographic Screening, or PERFORMS, strategy (4). These systems present readers with a test set of challenging mammographic examinations to interpret and then provide feedback on their performance; however, the ability of these popular and widely adapted strategies to accurately represent clinic-based reporting performance is unknown.
Three previous studies (12–14) have examined the relationship between test and clinical readings, and while only limited levels of correlation were shown, differences between the clinical and test set situations in these studies may have confounded the findings. In particular, owing to the nature of clinical audit, it is not always possible to use the same metric for comparison (14). For example, while specificity is easily calculated in test set conditions, it is a much more ill-defined parameter in the clinical setting because the actual total number of normal findings and those correctly identified can only be determined after subsequent breast screenings and may never be fully known. Also in the previous work, the level of information provided to the reader was not always consistent, with computer-aided diagnosis (13) or additional imaging views (12) sometimes available at the clinical reading but missing in the test set condition. The current work is designed to address these limitations by using the same metrics and the same level of information for all circumstances (apart from the specific variable being tested) and aims to establish the extent to which test set reading can represent actual clinical reporting.
Materials and Methods
This work is based on comparing the decisions made by individual expert screen readers regarding cases reported in the clinic with the decisions made when identical images are presented to the same reader in a series of test set conditions. Institutional ethics approval for this study was granted, and informed consent was obtained from each participating screen reader. The need for informed consent with respect to the use of patient materials was waived.
Participants
Participants were 10 expert screen readers (nine radiologists and one nonradiologist physician, the latter being responsible for the clinical management of a screening service center; all had expertise in screen reading) who reported for BreastScreen New South Wales (BSNSW). The readers had a median of 19 years (range, 4–38 years) of experience in interpreting screening mammograms and read between 2000 and 20 000 (median, 6000) breast examinations each year.
Selection of Mammograms from Previous Clinical Reporting
The initial stage of this study involved looking at the past 5 years (between January 2007 and October 2011) of clinical reports generated by 10 expert screen readers and selecting for each reader 200 mammographic examinations that he or she had previously reported. This formed 10 reader-specific test sets that were designed to contain 10 true-positive (TP), 20 false-positive (FP), 160 true-negative (TN), and 10 false-negative (FN) cases that were used to represent actual clinical reporting. These image types were defined as follows: TP cases represented side-specific (correct side of breast) recalls of pathologically confirmed cancers; FP cases, recalled cases determined to have normal findings by two other readers or through subsequent diagnostic work-up; TN cases, examinations with findings that were reported to be normal by the reader responsible for the test set and one other reader or that were found to be normal through further diagnostic work-up; and FN cases, pathologically confirmed cancers missed by the reader but detected at screening by second and third readers and confirmed at the recall assessment clinic.
These definitions made use of the independent double-reading conditions in the Screening Service (a public program in New South Wales, Australia). In particular, the FN categorization within the Screening Service’s clinical audit does not include negative screening results at prior rounds when reported as negative by both readers, even when cancers are detected at the next screening round. Therefore, a follow-up normal screening round was not required for this study to confirm a TN screen.
All test sets were gathered and created by an individual (B.P.S., a radiologic technologist with 3 years of experience) who was not involved in any of the study's readings. For each image type, the most recent mammographic examinations that fulfilled the image type definitions were selected for inclusion in the test sets. No other specific criteria were used in the selection of cases. The distribution of mammographic examinations in all reader-specific test sets was designed to resemble clinical prevalence, albeit with a higher number of abnormalities to create an enriched set. Because FN examinations were relatively rare, even over a 5-year period, five test sets contained fewer FN cases than originally intended; these missing cases were replaced by TN examinations to maintain the total set number. The distributions of image types for each test set, along with the proportion of sets in which prior images were available, are detailed in Table 1.
Table 1.
Distribution of TP, FP, TN, and FN Image Types and Availability of Prior Images in Each Test Set

All test set images were acquired by using digital detectors, were sourced from the BreastScreen Digital Imaging Library, and were de-identified of all health record data. Each case comprised two-view (craniocaudal and mediolateral oblique) bilateral mammograms. Cases that showed biopsy markers or surgical scars were excluded. The test set from reader 7 was reused in a second study phase, described in the Second-Read Effect and Intrareader Variability section below.
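Although the test sets were assembled manually from the clinical audit records, the selection logic described in this section can be summarized in a short sketch. The code below is purely illustrative; the record structure, field names, and helper function are assumptions rather than any tool actually used in the study.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record for one previously reported screening examination.
@dataclass
class Exam:
    exam_id: str
    report_date: str          # ISO date of the reader's original clinical report
    outcome: str              # one of "TP", "FP", "TN", "FN" per the audit definitions

# Target composition of each reader-specific test set (200 cases in total).
TARGET = {"TP": 10, "FP": 20, "TN": 160, "FN": 10}

def build_reader_test_set(history: List[Exam]) -> List[Exam]:
    """Select the most recent exams of each outcome type; if too few cases of a
    type exist (in the study, only FN cases fell short), backfill with additional
    TN exams to keep the total at 200."""
    by_type = {k: sorted((e for e in history if e.outcome == k),
                         key=lambda e: e.report_date, reverse=True)
               for k in TARGET}
    selected, shortfall = [], 0
    for outcome, n in TARGET.items():
        picked = by_type[outcome][:n]
        shortfall += n - len(picked)
        selected.extend(picked)
    # Replace missing cases with the next most recent TN examinations.
    selected.extend(by_type["TN"][TARGET["TN"]:TARGET["TN"] + shortfall])
    return selected
```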
Test Set Conditions
This study included three test set conditions (excluding the actual clinical reporting), and individual readers were involved with one or two of these test set conditions, as described below: Test set condition 1 was a clinical test set reading with prior images that involved individual readers reading their own sets of 200 images (as a single test set) in their own clinical reporting environment; test set condition 2, a laboratory test set reading with prior images where readers read their own sets of images, this time within a laboratory that simulated the clinical environment; and test set condition 3, a laboratory test set reading without prior images where the circumstances were the same as in test set condition 2 except that prior images were unavailable.
The decisions made with each of the test set conditions were compared with the actual clinical reporting by way of a series of comparisons (A, B, and C), as illustrated in the Figure. Details of the allocation of readers to each of these test set conditions are as follows: Five of the 10 readers were involved in comparisons A and B (readers 1–5 [Table 1]), while the rest performed comparison C (readers 6–10 [Table 1]). For the five readers (readers 1–5) who were involved in two comparisons (A and B), the test set readings were separated by a minimum of 4 months and were counterbalanced so that three of the five readers performed test set condition 1 first while the other two performed test set condition 2 first. The reading order of images was randomized separately for each reading. All test set reading sessions were performed between February 2012 and August 2012.
Figure. Graph shows the comparisons between actual clinical reporting and test set conditions used in the study.
Second-Read Effect and Intrareader Variability
In addition to the 10 readers described above, a further set of five expert screen readers from BSNSW (four radiologists [including W.L.] and one nonradiologist physician, the latter being responsible for the clinical management of a screening service center; all had expertise in screen reading) formed the second-read group; they had a median of 25 years (range, 2–39 years) of experience in interpreting screening mammograms and read 2000–5000 (median, 4500) breast examinations each year. They were asked to undergo two reading sessions in test set condition 2, during which all readers interpreted the same common set of images: reader 7's test set, chosen at random from the five reader-specific test sets that demonstrated the intended composition of each diagnostic outcome, as shown in Table 1. This group was introduced to establish the existence of any second-read effect and the level of intrareader variability when readers read the same set of images in two identical conditions. The resultant data formed a baseline against which to judge the results from comparisons A, B, and C.
Viewing and Reading
Test set reading sessions took place in either a clinical or a laboratory environment, depending on the test set condition (Figure). The location of the clinical environment was determined by the readers’ usual reporting site within the BSNSW service centers, with the viewing conditions adhering to national recommendations for mammography (15). In the laboratory environment, ambient lighting levels were kept at 25−40 lux, as measured with an illuminance meter (CL-200; Konica Minolta, Osaka, Japan), throughout the study (6,16).
The laboratory workstation was set up to closely resemble the clinical workstations used in BSNSW. A pair of RadiForce GS510 (EIZO, Ishikawa, Japan) 5-MP medical-grade monochrome liquid crystal display monitors with a resolution of 2048 × 2560 pixels was used and was calibrated to the DICOM grayscale standard display function (17) by using an EIZO UX1 sensor and the RadiCS quality control software (18). The same picture archiving and communication system (Sectra Imtec, Linköping, Sweden), reporting keypad (Sectra Imtec), and mammographic hanging protocol were used in the clinical and laboratory situations.
The readers were unaware of the specific aims of the study and the composition of the test sets, and there was no time limit for test set interpretation. Readers were asked to interpret each mammographic examination as they would normally do in the clinic: that is, return a patient to routine screening for normal or benign findings or recall a patient for assessment if further diagnostic work-up was required. If the recommendation was for recall, readers were required to indicate the side, site, and nature of any abnormality and to rate the lesion as equivocal, suspicious, or malignant. As in clinical practice, readers were also given the option to indicate a "technical recall" decision whenever they deemed it appropriate; these decisions were excluded from the analysis (Table 2). Before each reading session, readers were given instructions on how to review, report, and rate the mammograms.
Table 2.
Numbers of Normal and Abnormal Cases Given a Technical Recall Decision by the Readers

Note.—No other reader made technical recall decisions; no technical recall decision was given in test set condition 3.
Data Analyses
A side-specific analysis was used in which a TP score was given when the breast containing a pathologically confirmed cancer was given a recall rating, because this is the method used in the BreastScreen clinical audit. For the purpose of data analysis and in line with clinical practice, the following confidence scores were assigned to the respective decisions made by each reader during the actual clinical reporting and the various test set conditions: a score of 1 indicated "return to routine screening"; a score of 2, equivocal findings; a score of 3, suspicious findings; and a score of 4, malignant findings.
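As an illustration of the scoring scheme just described, the following sketch maps recall decisions to the 1–4 confidence scale and applies the side-specific TP rule. The decision labels and function are hypothetical and are not taken from the study's database.

```python
from typing import Optional

# Map a reader's decision for one breast to the study's 1-4 confidence scale
# (1 = return to routine screening; 2 = equivocal; 3 = suspicious; 4 = malignant).
CONFIDENCE = {
    "return_to_screening": 1,
    "equivocal": 2,
    "suspicious": 3,
    "malignant": 4,
}

def is_side_specific_tp(decision: str, side: str, cancer_side: Optional[str]) -> bool:
    """Side-specific TP: the breast containing the pathologically confirmed
    cancer received a recall rating (confidence score of 2, 3, or 4)."""
    recalled = CONFIDENCE[decision] >= 2
    return recalled and cancer_side is not None and side == cancer_side

# Example: a "suspicious" rating on the left breast of a case with a left-sided cancer.
print(is_side_specific_tp("suspicious", "left", "left"))  # True
```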
On the basis of the selection of cases (described above) and previous clinical reports, region-of-interest (ROI) figures of merit (FOMs), sensitivity, and specificity were calculated for the actual clinical reporting, and these metric values were used for comparison with the data collected from the various test set conditions.
Analyses focused on the following items: (a) The ability of test set conditions to represent performance in actual clinical reporting. This was determined in four ways: with the ROI FOM, with sensitivity and specificity, with confidence scores, and with the Kendall coefficient of concordance. (b) The second-read effect and intrareader variability.
Ability of test set conditions to represent performance in actual clinical reporting.—The data consisted of a rating for each breast—that is, two ratings per patient. The ground truth was known for each breast: It either contained a cancer (abnormal) or it did not (normal). This type of clustered data constitutes the ROI paradigm (19,20) with two ROIs per patient. Details of the ROI analysis, which yielded a P value and 95% confidence intervals (CIs) for the ROI FOM for each reader-specific data set, are given in Appendix E1 (online).
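The ROI FOM itself is essentially a nonparametric area under the ROC curve computed over breast-level ratings; the clustering of two breasts per patient mainly affects the variance estimation described in Appendix E1 (online) and handled by the JAFROC software used here. The sketch below is a simplified point estimate under that interpretation and is not the authors' analysis code.

```python
from itertools import product

def roi_fom(abnormal_scores, normal_scores):
    """Empirical ROI figure of merit: the probability that a randomly chosen
    abnormal breast receives a higher confidence score than a randomly chosen
    normal breast, with ties counted as one half (a Wilcoxon-Mann-Whitney
    style AUC over breast-level ratings)."""
    wins = 0.0
    for a, n in product(abnormal_scores, normal_scores):
        if a > n:
            wins += 1.0
        elif a == n:
            wins += 0.5
    return wins / (len(abnormal_scores) * len(normal_scores))

# Example with illustrative 1-4 scores for abnormal and normal breasts.
print(roi_fom([4, 3, 2, 1, 3], [1, 1, 2, 1, 1, 2]))
```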
Sensitivity and specificity, respectively, were defined as the proportion of abnormal breast examinations correctly given a recall rating (confidence score of 2, 3, or 4) and the proportion of normal breast examinations correctly given a nonrecall rating (confidence score of 1). Sensitivity and specificity were compared between actual clinical reporting and various test set conditions as demonstrated in the Figure by using a nonparametric Wilcoxon matched-pairs signed rank test.
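For reference, a minimal sketch of these sensitivity and specificity definitions and of the paired Wilcoxon comparison might look as follows. The reader-level values are invented for illustration, and the actual analyses were performed in SPSS.

```python
import numpy as np
from scipy.stats import wilcoxon

def sensitivity_specificity(scores, truth):
    """Breast-level sensitivity and specificity from 1-4 confidence scores.
    A score of 2, 3, or 4 counts as a recall; truth is 1 for cancer, 0 for normal."""
    scores, truth = np.asarray(scores), np.asarray(truth)
    recalled = scores >= 2
    sensitivity = recalled[truth == 1].mean()
    specificity = (~recalled)[truth == 0].mean()
    return sensitivity, specificity

# Paired comparison across readers between actual clinical reporting and one
# test set condition (reader-level sensitivities below are made up).
sens_clinical = [0.60, 0.70, 0.50, 0.80, 0.60]
sens_test_set = [0.70, 0.66, 0.61, 0.92, 0.75]
statistic, p_value = wilcoxon(sens_clinical, sens_test_set)
print(f"Wilcoxon matched-pairs signed rank test: statistic={statistic}, P={p_value:.3f}")
```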
Further analysis was performed to identify the test set condition that demonstrated the least difference from the actual clinical reporting. The absolute differences in confidence scores for comparisons A, B, and C were calculated and compared against each other by using the nonparametric Wilcoxon matched-pairs signed rank test.
The level of agreement between actual clinical reporting and the test set conditions was assessed with the Kendall coefficient of concordance (also known as the Kendall W) (21), using the confidence scores assigned to the respective decisions made by readers in the actual clinical reporting and test set conditions. Levels of agreement ranged from 0.0 to 1.0, where a level of 0.9 or greater represented excellent agreement; 0.8 to less than 0.9, good agreement; 0.7 to less than 0.8, acceptable agreement; 0.6 to less than 0.7, moderate agreement; and less than 0.6, no agreement, following Charter's (22) discussion of various proposed cutoffs.
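A minimal implementation of the Kendall W computation, following the Siegel and Castellan formulation cited above (21) and including the standard tie correction, is sketched below. The confidence scores in the example are invented, and the study's actual computation was done in SPSS.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def kendall_w(ratings):
    """Kendall coefficient of concordance (W) with tie correction for an
    m x n array of ratings (m readings, n cases), following the Siegel and
    Castellan formulation. Returns W and an approximate chi-square P value."""
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])  # average ranks for ties
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    # Tie correction: sum of (t^3 - t) over groups of tied scores within each reading.
    t = sum((counts ** 3 - counts).sum()
            for counts in (np.unique(row, return_counts=True)[1] for row in ratings))
    w = 12.0 * s / (m ** 2 * (n ** 3 - n) - m * t)
    p_value = chi2.sf(m * (n - 1) * w, df=n - 1)
    return w, p_value

# Example: 1-4 confidence scores for the same 10 breasts from actual clinical
# reporting and one test set condition (values are illustrative only).
clinical = [1, 1, 2, 4, 1, 3, 1, 1, 2, 1]
test_set = [1, 2, 2, 4, 1, 4, 1, 1, 3, 1]
print(kendall_w([clinical, test_set]))
```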
Second-read effect and intrareader variability.—The second-read effect and intrareader variability were assessed by using the ROI FOM and sensitivity and specificity values, along with the Kendall coefficient of concordance, as described above.
P < .05 was considered to indicate a statistically significant difference for all statistical comparisons. ROI analysis was performed by using JAFROC software, version 4.1 (D.P.C., Pittsburgh, Pa). For the analyses of sensitivity, specificity, confidence scores, and the Kendall coefficient of concordance, we used SPSS software (SPSS, Chicago, Ill).
Results
Ability of Test Set Conditions to Represent Performance in Actual Clinical Reporting
Significant changes in ROI FOMs were noted for two readers (readers 4 and 5) in comparison A (P < .05) and for one reader in each of the other comparisons (reader 4 in comparison B [P < .05] and reader 10 in comparison C [P < .05]) (Table 3). All changes showed an increase in FOMs in the test set conditions over the actual clinical reporting. No significant differences were observed for sensitivity and specificity.
Table 3.
Results for ROI FOM, Sensitivity, Specificity, and Kendall Coefficient of Concordance for Comparisons A, B, and C

Note.—The middle descriptive value is presented as the median. Data in parentheses are 95% CIs. *No significant differences. †Based on confidence score.
When the confidence score analysis was used to examine how much each test set condition differed from actual clinical reporting, no statistically significant difference was seen between the median confidence ratings of comparisons A and B. Statistically significant differences in median confidence ratings were seen for comparison C (where prior images were excluded) versus each of comparisons A and B (P < .001).
Significant levels of agreement (P < .01) were shown for all readers in all comparisons, with W values indicating either moderate or acceptable agreement according to the conventional standards: 0.63–0.73 (group value, 0.72) for comparison A, 0.64–0.78 (group value, 0.73) for comparison B, and 0.64–0.72 (group value, 0.69) for comparison C (Table 3).
Second-Read Effect and Intrareader Variability
No significant changes in ROI FOMs, sensitivity, or specificity were observed when we investigated the second-read effect. When agreement was considered, significant levels of agreement (P < .01), varying from moderate to good according to the conventional standards, were demonstrated, with W values ranging from 0.68 to 0.81 (group value, 0.77) (Table 4).
Table 4.
Results for ROI FOM, Sensitivity, Specificity, and Kendall Coefficient of Concordance for Second-Read Group

Note.—The middle descriptive value is presented as the median. Data in parentheses are 95% CIs. *No significant differences. †Based on confidence score.
Discussion
This study assessed the ability of test set conditions to represent performance in actual clinical reporting in screening mammography. While screen-reading test sets have been used by breast screening programs in several countries to augment clinical audits and monitor the performance of readers (12–14), the ability of screening test sets to represent real-life clinical performance has never been fully understood. Intrareader variability may contribute to any potential differences between actual clinical reporting and test set reading (23,24); to explore this factor, a second-read group was deliberately introduced in our study design to allow us to examine the variability in reading the same test set twice in identical conditions. The results from this comparison should theoretically present an upper limit to the potential agreement for comparisons A–C and form a baseline for judging results from these comparisons. In fact, this second-read comparison demonstrated consistent median ROI FOMs (0.82 [first reading] and 0.83 [second reading]), sensitivity (70% [both readings]), and specificity (82% [both readings]); therefore, any overall differences seen between test set conditions for each comparison are unlikely to be fully explained by a second-reading effect.
Compared with those for actual clinical reporting, higher median ROI FOMs were demonstrated in all the relevant test set conditions, with little change in specificity (comparisons A, B, and C). While there was no significant difference shown in sensitivity, the consistent and substantial changes in median values suggest that increased sensitivity may have been responsible, at least in part, for the increases demonstrated in ROI FOMs in the test set conditions as compared with actual clinical reporting. This increase in performance is contrary to the findings of two previous studies (12,13), in which clinical sensitivity was higher than test set values. This discrepancy may be explained by the fact that in the previous work, computer-aided diagnosis (13) and supplementary ultrasonographic and magnification images (12) were available only during the clinical reads. In our study, the situations for each comparison were kept as consistent as possible, apart from the specific variable being tested, and therefore this potential increase in sensitivity is a feature that future users of test sets may need to recognize. Nonetheless, it is reassuring that even with this sensitivity effect, a reasonable level of agreement between the actual clinical reporting and test set conditions was shown when prior images were included, with W values (median, 0.72 and 0.73, respectively) being only slightly less than the value demonstrated with the second-read comparison (median, 0.77). The inclusion of prior images does seem to be important, because agreement fell to a moderate level when these were excluded (median, 0.69), and the confidence score analysis suggested that test sets without prior images tended to demonstrate the greatest change from the actual clinical reporting. One factor that should be noted, however, is that while agreement is reasonably high for groups of screen readers, the ability of a test set to represent actual clinical reporting is reader dependent, with six of the 15 individual readers' comparisons showing moderate rather than acceptable agreement. This emphasizes the importance of having adequate numbers of expert readers in observational radiologic studies that aim to reproduce the clinical situation.
Although not statistically significant, the pattern of increased sensitivity for test set conditions compared with actual clinical reporting is worth exploring, particularly because the possibility of a type II error cannot be excluded. First, the BreastScreen Australia National Accreditation Standards (11) suggest that the clinical recall rate be kept as low as 10% and 5% for first and subsequent screening examinations, respectively. To adhere to this guideline, the readers in the real-life clinical readings may have adopted a stricter reporting criterion than in the test set conditions, where individuals were not constrained by the guideline. If this were the case, however, a concomitant decrease in test set specificity would have been expected but was not seen. Second, it may be argued that because the test set conditions occurred between 1 and 5 years after the clinical reports, observer experience (and expertise level) may have increased. However, this effect (if real) should be quite small, because the readers had a high level of experience even at the time of the initial clinical reading. Finally, a recent review article (10) suggested that the external validity of test set readings is affected when the test environment is artificial, the response options are oversimplified, reader scrutiny by evaluators is high, and the prevalence of abnormal images is higher than in clinical work. While oversimplification of responses can be excluded, because the same decision options applied to both actual clinical and test set readings, the other three factors cannot be ruled out.
The main limitation of our work is that an analysis of location sensitivity was not performed. Without this, we cannot say with certainty that cancers were definitely identified; however, because we relied on actual clinical reporting practice, such an analysis was not possible. Two other potential limitations are the inability to collect accurate and timely data regarding interval cancers and the fact that, for some cases, "truth" was assumed from the concurrent interpretation of two readers, without a subsequent screening round to help confirm the diagnostic status of the examinations used in this study. Nevertheless, neither of these issues should affect the study, because this work focuses on the similarity between the decisions made in the clinic and those made in the various test set conditions when identical images are presented to the same reader, rather than on the ability of readers to detect breast abnormalities.
In conclusion, this study has shown that reasonable levels of agreement between actual clinical reporting and test set conditions in screening mammography can be achieved when describing group performance if prior images are provided.
Advances in Knowledge.
• Reasonable levels of agreement (P < .01) between actual clinical reporting and test set readings can be achieved when describing group performance if prior images are provided.
• There may be increases in performance levels in test set readings when compared with actual clinical performance.
Implication for Patient Care.
• Test set experiments can, to a reasonable level, be relied on to describe actual clinical performance in mammography, thus providing a system for assessing diagnostic accuracy.
Disclosures of Conflicts of Interest: B.P.S. No relevant conflicts of interest to disclose. W.L. No relevant conflicts of interest to disclose. M.F.M. No relevant conflicts of interest to disclose. P.L.K. No relevant conflicts of interest to disclose. W.M.R. No relevant conflicts of interest to disclose. R.H. No relevant conflicts of interest to disclose. D.P.C. No relevant conflicts of interest to disclose. P.C.B. No relevant conflicts of interest to disclose.
Acknowledgments
The authors thank BreastScreen New South Wales for the essential collaboration and all radiologists who contributed their time to participate in this work with so much enthusiasm. Special thanks to EIZO and Sectra for sponsoring the hardware and software for our laboratory workstation that made this work possible.
Received October 27, 2012; revision requested December 7; revision received December 18; accepted January 3, 2013; final version accepted January 10.
B.P.S. supported by University of Sydney International Scholarship.
Funding: D.P.C. was supported by the National Institutes of Health (grants R01-EB005243 and R01-EB008688).
Abbreviations:
- BSNSW = BreastScreen New South Wales
- CI = confidence interval
- FN = false-negative
- FOM = figure of merit
- FP = false-positive
- ROI = region of interest
- TN = true-negative
- TP = true-positive
References
- 1. Gur D, Rockette HE, Armfield DR, et al. Prevalence effect in a laboratory environment. Radiology 2003;228(1):10–14.
- 2. Reed WM, Ryan JT, McEntee MF, Evanoff MG, Brennan PC. The effect of abnormality-prevalence expectation on expert observer performance and visual search. Radiology 2011;258(3):938–943.
- 3. Toomey RJ, Ryan JT, McEntee MF, et al. Diagnostic efficacy of handheld devices for emergency radiologic consultation. AJR Am J Roentgenol 2010;194(2):469–474.
- 4. Gale A. PERFORMS: a self assessment scheme for radiologists in breast screening. Semin Breast Dis 2003;6(3):148–152.
- 5. Scott HJ, Gale AG. Breast screening: PERFORMS identifies key mammographic training needs. Br J Radiol 2006;79(Spec No 2):S127–S133.
- 6. Brennan PC, McEntee M, Evanoff M, Phillips P, O'Connor WT, Manning DJ. Ambient lighting: effect of illumination on soft-copy viewing of radiographs of the wrist. AJR Am J Roentgenol 2007;188(2):W177–W180.
- 7. Brennan PC, Ryan J, Evanoff M, et al. The impact of acoustic noise found within clinical departments on radiology performance. Acad Radiol 2008;15(4):472–476.
- 8. Brennan PC, Tapia K, Lee W. BreastScreen Reader Assessment Strategy (BREAST). University of Sydney. http://sydney.edu.au/health-sciences/breastaustralia/. Updated October 9, 2012. Accessed October 19, 2012.
- 9. Patnick J. Annual review 2005: one vision—NHS Breast Screening Programme. Sheffield, England: Fulwood House, 2005.
- 10. Soh BP, Lee W, Kench PL, et al. Assessing reader performance in radiology, an imperfect science: lessons from breast screening. Clin Radiol 2012;67(7):623–628.
- 11. BreastScreen Australia National Accreditation Standards. Australian Government Department of Health and Ageing. http://www.cancerscreening.gov.au/internet/screening/publishing.nsf/Content/br-accreditation/$File/standards.pdf. Published April 2008. Accessed October 19, 2012.
- 12. Rutter CM, Taplin S. Assessing mammographers' accuracy: a comparison of clinical and test performance. J Clin Epidemiol 2000;53(5):443–450.
- 13. Gur D, Bandos AI, Cohen CS, et al. The "laboratory" effect: comparing radiologists' performance and variability during prospective clinical and laboratory mammography interpretations. Radiology 2008;249(1):47–53.
- 14. Scott HJ, Evan A, Gale AG, Murphy A, Reed J. The relationship between real life breast screening and an annual self-assessment scheme. In: Sahiner B, Manning DJ, eds. Proceedings of SPIE: medical imaging 2009—image perception, observer performance, and technology assessment. Vol 7263. Bellingham, Wash: SPIE–The International Society for Optical Engineering, 2009; 72631E-1–72631E-7.
- 15. Heggie JCP, McLean ID, Herley J, et al. ACPSEM position paper: recommendations for a digital mammography quality assurance program v3.0. Australasian College of Physical Scientists & Engineers in Medicine. www.acpsem.org.au/~acpsem/index.php/nmdocuments/doc_download/652-digital-mammography-qa-v30-2012. Published July 20, 2012. Updated August 2, 2012. Accessed October 5, 2012.
- 16. McEntee M, Brennan P, Evanoff M, Philips P, O'Connor WT, Manning D. Optimum ambient lighting conditions for the viewing of softcopy radiological images. In: Jiang Y, Eckstein MP, eds. Proceedings of SPIE: medical imaging 2006—image perception, observer performance, and technology assessment. Vol 6146. Bellingham, Wash: SPIE–The International Society for Optical Engineering, 2006; 260–268.
- 17. Digital Imaging and Communications in Medicine (DICOM)—Part 14: grayscale standard display function. National Electrical Manufacturers Association. http://medical.nema.org/Dicom/2011/11_14pu.pdf. Published 2011. Accessed October 9, 2012.
- 18. RadiCS quality control software. EIZO NANAO Corporation. http://www.eizo.com/global/products/radiforce/radics/index.html. Accessed October 9, 2012.
- 19. Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000;7(7):516–525.
- 20. Rutter CM. Bootstrap estimation of diagnostic accuracy with patient-clustered data. Acad Radiol 2000;7(6):413–419.
- 21. Siegel S, Castellan NJ. Nonparametric statistics for the behavioral sciences. 2nd ed. New York, NY: McGraw-Hill, 1988.
- 22. Charter RA. A breakdown of reliability coefficients by test type and reliability method, and the clinical implications of low reliability. J Gen Psychol 2003;130(3):290–304.
- 23. Al-Khawari H, Athyal RP, Al-Saeed O, Sada PN, Al-Muthairi S, Al-Awadhi A. Inter- and intraobserver variation between radiologists in the detection of abnormal parenchymal lung changes on high-resolution computed tomography. Ann Saudi Med 2010;30(2):129–133.
- 24. Lawson CC, LeMasters MK, Kawas Lemasters G, Simpson Reutman S, Rice CH, Lockey JE. Reliability and validity of chest radiograph surveillance programs. Chest 2001;120(1):64–68.