Correlation of Test Sets and Actual Clinical Performance

David J Seidenwurm; Robert D Rosenberg

doi:10.1148/rycan.2021200139

letter

Radiol Imaging Cancer. 2021 Jan 29;3(1):e200139. doi: 10.1148/rycan.2021200139

PMCID: PMC7983787 PMID: 33778761

Editor:

Dr Chen and colleagues (1) are to be commended for their significant contribution regarding reader performance in breast cancer screening in the September 2020 issue of Radiology: Imaging Cancer. The goals of such programs are to select mammography readers, to improve their performance continuously, to recognize superior and substandard performance immediately and reliably, and to remediate or remove persistent underperformers objectively.

They report statistical data suggesting that reader performance at test set reading correlates significantly with reader performance in mammographic interpretation in real life (IRL). Their data suggest that reader performance at test set reading explains approximately 3% of the interreader variation IRL in cancer detection rate (CDR), 2% in recall rate (RR), and 7% in positive predictive value (PPV).

They find that outliers in test set reading had significantly lower CDR and PPV, but similar RR IRL when compared with the other readers. However, even the CDR and PPV data demonstrate considerable overlap in performance, likely precluding determinations regarding the performance of individual readers IRL based on test set performance.

It would be of interest to view Bland-Altman plots of the individual reader data and to determine whether the relative ranking of individual readers based upon test set performance differs from their ranking based upon performance IRL. The discriminant value of test set reading as a training tool and evaluative procedure would be validated if it were demonstrated at the individual reader level.
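The comparison proposed here can be sketched as follows. This is a minimal illustration, not the method used by Chen and colleagues: the reader values are hypothetical, the function names are illustrative, and the percentile convention (rank minus 0.5, divided by the number of readers, with ties sharing the average rank) is one assumption among several in common use.

```python
from statistics import mean, stdev


def percentile_ranks(values):
    """Return a 0-100 percentile rank per value; tied values share the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [100.0 * (r - 0.5) / n for r in ranks]


def bland_altman(real_life, test_set):
    """Per-reader means and differences, mean difference, and mean +/- 1.96 SD limits."""
    diffs = [a - b for a, b in zip(real_life, test_set)]
    means = [(a + b) / 2 for a, b in zip(real_life, test_set)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


# Hypothetical per-reader values for illustration only, not data from the study.
cdr_real_life = [7.9, 8.4, 6.8, 9.1, 8.0, 7.2]        # cancers detected per 1000 screens
cdr_test_set = [0.80, 0.92, 0.85, 0.88, 0.75, 0.90]   # fraction of test set cancers detected

rank_real = percentile_ranks(cdr_real_life)
rank_test = percentile_ranks(cdr_test_set)
means, diffs, bias, limits = bland_altman(rank_real, rank_test)

for m, d in zip(means, diffs):
    print(f"mean rank {m:6.2f}   difference {d:7.2f}")
print(f"mean difference {bias:.2f}, limits {limits[0]:.2f} to {limits[1]:.2f}")
```

Plotting each difference against the corresponding mean gives the Bland-Altman plot; readers whose two percentile ranks agree fall near the zero line.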

A limitation of test sets as acknowledged by the authors is the laboratory effect. Readers know that their readings do not influence patient outcomes, and this situation has a different cancer proportion and higher recall rate compared with reading IRL. As suggested previously (2), similar data and greater benefit might be generated by “seeding” 1%–2% of known positive and negative cases into the daily reading volume (2–4). Those results would more likely be representative of actual practice, and if monitored, could allow real-time evaluation and immediate feedback. A side benefit of this approach is the doubling of cancers present in the daily reading sets.
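The arithmetic behind the "doubling" remark can be illustrated with a short sketch. The numbers below are hypothetical and chosen only to show the relationship: the count of cancers in the reading set doubles when the seeded known-positive cases equal the native cancers in number. Neither the native prevalence nor the split of seeded cases between positive and negative is specified in this letter, and the function names are illustrative.

```python
def cancer_count_ratio(native_prevalence, seeded_positive_fraction):
    """Ratio of cancers in the reading set with seeding to cancers without it.

    native_prevalence: fraction of native screening cases that are cancers.
    seeded_positive_fraction: seeded known-positive cases, expressed as a
        fraction of the native reading volume.
    """
    return (native_prevalence + seeded_positive_fraction) / native_prevalence


def cancers_per_1000_reads(native_prevalence, seeded_positive_fraction):
    """Cancers per 1000 cases read once seeded positives join the native cases."""
    total_cases = 1.0 + seeded_positive_fraction
    cancers = native_prevalence + seeded_positive_fraction
    return 1000.0 * cancers / total_cases


# Hypothetical values for illustration only.
native = 0.006   # assumed native cancer prevalence (6 per 1000 screens)
seeded = 0.006   # assumed seeded positives as a fraction of native volume

ratio = cancer_count_ratio(native, seeded)
rate = cancers_per_1000_reads(native, seeded)
print(f"cancer count ratio: {ratio:.2f}")
print(f"cancers per 1000 reads with seeding: {rate:.2f}")
```

Under these assumed values the reader sees twice as many cancers per day, while the seeded cases add well under 1% to the reading volume.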

Footnotes

Disclosures of Conflicts of Interest: D.J.S. Activities related to the present article: employed by Sutter Medical Group (salary and benefits). Activities not related to the present article: institution board membership of Sutter Care at Home (food); author employed at Sutter Health (salary and benefits); institution provides expert testimony (plaintiff and defense for civil and criminal matters); institution receives grant from Gordon and Betty Moore Foundation (diagnostic excellence); institution receives travel/accommodation/meeting expense from ACR, NQF, CMS/Acumen; ACR MR Accreditation Program (reviewer). Other relationships: disclosed no relevant relationships. R.D.R. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: disclosed no relevant relationships. Other relationships: member of the Clinical Advisory Board for Therapixel (this relationship has had no compensation or other financial involvement to date).

References

1. Chen Y, James JJ, Cornford EJ, Jenkins J. The Relationship between Mammography Readers’ Real-Life Performance and Performance in a Test Set–based Assessment Scheme in a National Breast Screening Program. Radiol Imaging Cancer 2020;2(5):e200016.
2. Rosenberg RD, Seidenwurm D. Optimizing Breast Cancer Screening Programs: Experience and Structures. Radiology 2019;292(2):297–298.
3. Gordon PB, Borugian MJ, Warren Burhenne LJ. A true screening environment for review of interval breast cancers: pilot study to reduce bias. Radiology 2007;245(2):411–415.
4. Evans KK, Birdwell RL, Wolfe JM. If you don’t find it often, you often don’t find it: why some cancers are missed in breast cancer screening. PLoS One 2013;8(5):e64366.

Radiol Imaging Cancer. 2021 Jan 29;3(1):e200139. doi: 10.1148/rycan.2021200149

Response

Yan Chen; Jonathan J James; Eleanor J Cornford; Jacquie Jenkins

We would like to thank you for the opportunity to respond to the comments raised in the Letter to the Editor regarding our article “The Relationship between Mammography Readers’ Real-Life Performance and Performance in a Test Set–based Assessment Scheme in a National Breast Screening Program” (1). We would also like to thank the authors of the letter for their interesting suggestion for future work involving the “seeding” of test set cases into routine workflow.

We have undertaken the proposed further data analysis using Bland-Altman plots of the individual reader data to determine whether the relative ranking of individual readers differed when comparing rankings based upon test set performance with rankings based upon real-life performance data.

Readers were ranked into percentiles according to their CDR, RR, and PPV, both in terms of real-life screening and in terms of their test set performance. The agreement between each reader’s percentile rank in real life and on the test set was assessed using Bland-Altman plots. The plot for CDR is shown in the Figure. The mean differences in percentile rank in real life and on the test set were 1.08 for CDR (standard deviation = 36.34, 95% CI: −70.14, 72.31), 0.34 for RR (standard deviation = 36.82, 95% CI: −71.83, 72.51), and 0.00 for PPV (standard deviation = 35.77, 95% CI: −70.11, 70.11). The diamond shape in the plot indicates that differences were smallest and agreement stronger for those individuals at the top and bottom of the performance rankings (2). For instance, the red dots in the plot represent poorer performance in the Personal Performance in Mammographic Screening (PERFORMS) scheme, and they are seen to cluster toward the left apex of the plot (Figure). The Pearson correlations between real-life and test set ranks showed a significant positive correlation for each performance metric (r = 0.21 for CDR [P < .0001], r = 0.24 for PPV [P < .0001], and r = 0.19 for RR [P < .0001]), and the Bland-Altman plot demonstrates that the association was strongest for those individuals with the highest and lowest real-life performance metrics. This adds further weight to the validity of the test set–based approach to individual performance testing in the PERFORMS scheme.
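As a quick arithmetic check, the intervals quoted above appear consistent with the conventional Bland-Altman 95% limits of agreement (mean difference ± 1.96 × standard deviation of the differences). The sketch below recomputes them from the reported means and standard deviations and squares the reported Pearson coefficients. The function name and the multiplier 1.96 are assumptions made for this illustration, not details stated in the response.

```python
# Summary values as reported in the response: (mean difference, standard deviation).
reported = {
    "CDR": (1.08, 36.34),
    "RR": (0.34, 36.82),
    "PPV": (0.00, 35.77),
}

# Pearson r values reported for real-life versus test set percentile ranks.
reported_r = {"CDR": 0.21, "PPV": 0.24, "RR": 0.19}


def limits_of_agreement(mean_diff, sd, z=1.96):
    """Bland-Altman 95% limits: mean difference +/- z * SD of the differences."""
    return mean_diff - z * sd, mean_diff + z * sd


limits = {metric: limits_of_agreement(*vals) for metric, vals in reported.items()}

# r squared is the share of variance in one ranking that is linearly
# associated with the other ranking.
r_squared = {metric: r ** 2 for metric, r in reported_r.items()}

for metric, (low, high) in limits.items():
    print(f"{metric}: limits {low:.2f} to {high:.2f}, r^2 = {r_squared[metric]:.3f}")
```

The recomputed limits agree with the quoted intervals to within rounding of the inputs, and the squared correlations come to roughly 4% to 6% of rank variance.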

Figure: Bland-Altman plot for cancer detection rate (CDR). The mean of each reader’s real-life and Personal Performance in Mammographic Screening (PERFORMS) percentile rank is plotted against the difference between each reader’s real-life and PERFORMS percentile ranks for CDR. Individual poor performance outliers in the PERFORMS scheme are shown as red dots.

References

1. Chen Y, James JJ, Cornford EJ, Jenkins J. The Relationship between Mammography Readers’ Real-Life Performance and Performance in a Test Set–based Assessment Scheme in a National Breast Screening Program. Radiol Imaging Cancer 2020;2(5):e200016.
2. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1(8476):307–310.
