See also the article by Healy et al in this issue.

Dr Hofvind is head of BreastScreen Norway, the national population-based breast cancer screening program headed by the Cancer Registry of Norway. She is also a professor at Oslo Metropolitan University, Faculty of Health Sciences. Her main interests are the quality assurance, evaluation, and improvements of mammography screening.

Dr Lee is professor of radiology, adjunct professor of health services, and director of the Northwest Screening and Cancer Outcomes Research Enterprise at the University of Washington School of Medicine. He is a practicing breast imager with a research program focused on emerging breast cancer screening technology assessment.
U.S. radiologists continue to perform single reading for screening mammography, which results in an average recall rate of more than 10%. Most European countries are using (independent) double reading with consensus or arbitration, which results in a recall rate of less than 5%, without a decrease in cancer detection rate (1). A recall for further assessment that turns out to be negative (a false-positive screening examination) is considered a harm of mammography screening (2). Early detection of breast cancer and minimizing the rate of false-negative screening examinations are critical for the value of screening programs. Even after accounting for different age groups of women screened, 1-year versus 2–3-year screening intervals, and the medical-legal environment, the differing recall rates in the United States versus Europe are striking. Much of this difference likely rests with the use of consensus or arbitration meetings in the double-reading setting (1).
In this issue of Radiology, Healy and colleagues (3) present the results of the Irish National Breast Cancer Screening Program’s 5-year review of discordant imaging findings discussed at consensus meetings after the transition from screen-film mammography (SFM) to full-field digital mammography (FFDM). The consensus meetings took place biweekly and consisted of three to five radiologists, sometimes including radiologists who had read the original screening mammograms. The transition from SFM to FFDM resulted in a lower recall rate (4.4% with SFM vs 4.1% with FFDM) but also a lower cancer detection rate (7.5 vs 6.5 per 1000 women screened). No significant difference existed in consensus sensitivity, positive predictive value, or negative predictive value, and consensus specificity showed slight improvement. The most common imaging finding discussed at consensus was an asymmetry at FFDM.
Despite the informal, nonblinded consensus format used in the Irish screening program, discussion of discordant screening cases still provided benefit in terms of improved screening accuracy. The positive predictive value for recall after consensus was 10.4%, and few interval cancers were missed after consensus (negative predictive value, 99.1%). Only eight interval cancers occurred after consensus over the 5-year period, three of which developed at the site of concern discussed during the consensus meetings. These results demonstrate the effectiveness of consensus meetings in decreasing recall rates without hindering cancer detection. The results of Healy et al are in line with those of other studies (4,5) and seem to indicate that any type of consensus system, even if nonblinded, is better than none.
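For readers less familiar with the two predictive values quoted above, the following minimal sketch shows how they are conventionally computed from outcome counts. The function names and the example counts are our own illustrative assumptions; the counts are invented and are not taken from Healy et al.

```python
def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of recalled cases in which cancer was actually diagnosed."""
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(true_neg: int, false_neg: int) -> float:
    """Share of non-recalled cases with no cancer found later (no interval cancer)."""
    return true_neg / (true_neg + false_neg)


# Invented counts for illustration only; NOT data from the Irish program.
# 200 cases recalled after consensus, 20 of them with cancer diagnosed.
ppv = positive_predictive_value(true_pos=20, false_pos=180)
# 500 cases returned to routine screening, 5 with a later interval cancer.
npv = negative_predictive_value(true_neg=495, false_neg=5)

print(f"PPV = {ppv:.1%}")  # PPV = 10.0%
print(f"NPV = {npv:.1%}")  # NPV = 99.0%
```

In this framing, a low positive predictive value reflects many false-positive recalls, whereas a high negative predictive value means that few cancers surface among cases returned to routine screening.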
Mammography screening in the United States and in several European countries is now transitioning to digital breast tomosynthesis (DBT). It remains uncertain how DBT findings at screening will influence the case mix among discordant findings at screening examinations in a double-reading environment. Although Healy et al suggest that asymmetries are the most common feature among discordant findings, DBT is likely to resolve the overlapping fibroglandular tissue that leads to most one-view findings at FFDM (6). We believe that DBT screening will lead to a higher rate of consensus meetings for subtle areas of architectural distortion; such areas are more easily visualized with DBT. Healy et al found that two-thirds of cancers detected after FFDM screening and consensus were invasive rather than ductal carcinoma in situ, with an average tumor size of 17 mm at the time of detection (3). Moreover, only two cancers demonstrated tubular morphologic characteristics after FFDM screening and consensus. With DBT screening, it is likely that small areas of distortion will be discussed at consensus, and many of these may represent areas of incidental radial scar or indolent tubular cancers. If this ends up being the case, performance values after consensus may change after a transition from FFDM to DBT.
Nonetheless, the Irish study highlights that even informal consensus meetings regarding equivocal screening cases can improve overall screening accuracy. In the United States, such informal consultation is plausible in settings where multiple radiologists work concurrently in a breast clinic. It is not uncommon for radiologists to “sideline consult” a colleague for a second opinion on an equivocal screening mammogram. One major U.S. academic institution employed an intervention to double read all DBT callbacks in an effort to keep recall rates low without decreasing the cancer detection rate (7). The investigators found that consensus double reading of all recalls, which required two radiologists to agree that recall was necessary, led to significant decreases in recall rate and positive predictive value without negatively affecting cancer detection rate. An average of only 2.3 minutes was spent consulting on each potential recall.
While many would argue that double reading remains too costly in the United States, the clear benefits observed from consensus meetings should give pause. A currently much-hyped hope is that emerging artificial intelligence (AI) algorithms can serve as a “second reader” in both single- and double-reading environments to aid in equivocal screening examinations (8,9). In a sense, AI could have the potential to serve as a consensus interpretation when the radiologist is unsure about recalling a woman from screening. Even in double-reading settings, the avoidance of consensus meetings among radiologists would lead to considerable time savings.
Robust evidence will be needed to ensure the validity of AI software and algorithms compared with consensus meetings. From traditional computer-aided detection for mammography, we learned that early promise does not necessarily translate into real-world effectiveness (10). However, many avenues should be explored for AI to assist the radiologist in decreasing false-positive findings without diminishing screening detection. In one scenario for European programs, an AI algorithm interpretation could be combined with a double human-read interpretation. Consensus could then be held only for examinations that both the AI algorithm and at least one radiologist interpreted as positive, rather than for every examination that a single radiologist interpreted as positive. The time saved could be used on more complex problem-solving tasks, such as diagnostic work-up for women with clinical symptoms or performance of preoperative advanced imaging. Several other scenarios likely exist in which AI has the potential to decrease the burden of consensus meetings.
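The European scenario described above can be written as a simple selection rule. The sketch below is illustrative only: the `Reads` record, the function names, and the example reads are hypothetical, the AI output is assumed to be already thresholded to positive or negative, and the baseline rule assumes that any positive radiologist read currently triggers consensus. How examinations that fall outside the rule would be handled is not specified here.

```python
from typing import Callable, Iterable, NamedTuple


class Reads(NamedTuple):
    """Binary interpretations of one screening examination (True = positive)."""
    reader_a: bool
    reader_b: bool
    ai: bool  # hypothetical thresholded AI output (assumption)


def consensus_radiologists_only(r: Reads) -> bool:
    """Assumed baseline: any positive radiologist read goes to consensus."""
    return r.reader_a or r.reader_b


def consensus_with_ai(r: Reads) -> bool:
    """Scenario rule: at least one radiologist AND the AI algorithm are positive."""
    return (r.reader_a or r.reader_b) and r.ai


def count_consensus(exams: Iterable[Reads], rule: Callable[[Reads], bool]) -> int:
    """Number of examinations that the given rule sends to a consensus meeting."""
    return sum(1 for r in exams if rule(r))


# Made-up reads for illustration only; not data from any study.
exams = [
    Reads(reader_a=False, reader_b=False, ai=False),
    Reads(reader_a=True, reader_b=False, ai=False),
    Reads(reader_a=False, reader_b=True, ai=True),
    Reads(reader_a=True, reader_b=True, ai=True),
    Reads(reader_a=False, reader_b=False, ai=True),
]

baseline = count_consensus(exams, consensus_radiologists_only)
with_ai = count_consensus(exams, consensus_with_ai)
print(baseline, with_ai)  # 3 2
```

In this toy example the rule removes one examination from the consensus list, namely the one flagged by a single radiologist but not by the algorithm. Whether such a reduction could be achieved without missing cancers is exactly the question that would require robust validation.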
Healy et al confirm that the more sets of eyes interpreting a mammogram, the better. Not only are two sets of eyes reviewing every examination, but several sets of eyes are reviewing and discussing discordant screening examination findings. In the United States, where there is still a single-reader paradigm, more sets of eyes would likely benefit patients and would prevent harms from false-positive screenings. Currently, these additional sets of eyes take the form of ineffective traditional computer-aided detection software for some, residents and fellows for academic breast imagers, and potentially sideline consultations in group practice settings. However, studies like that of Healy et al should encourage us to question the single-reader paradigm. We need to think of new and innovative ways to get multiple sets of eyes looking at screening mammograms. In the future, these extra sets of eyes may not necessarily need to be human.
Footnotes
C.I.L. supported by the National Cancer Institute (grants P01CA154292 and R37 CA 240403).
Disclosures of Conflicts of Interest: S.H. disclosed no relevant relationships. C.I.L. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: disclosed grant from GE Healthcare paid to author’s institution and royalties paid to author from UpToDate, Oxford University Press, and McGraw-Hill, Inc. Other relationships: disclosed no relevant relationships.
References
- 1. Hofvind S, Bennett RL, Brisson J, et al. Audit feedback on reading performance of screening mammograms: An international comparison. J Med Screen 2016;23(3):150–159. doi:10.1177/0969141315610790.
- 2. IARC. Breast Cancer Screening. IARC Handbook in Cancer Prevention, Vol 15. Lyon, France: IARC, 2016.
- 3. Healy NA, O’Brian A, Knox M, et al. Consensus review of discordant imaging findings after the introduction of digital screening mammography: Irish National Breast Cancer Screening Program experience. Radiology 2020;295:35–41.
- 4. Coolen AMP, Lameijer JRC, Voogd AC, et al. Characteristics of screen-detected cancers following concordant or discordant recalls at blinded double reading in biennial digital screening mammography. Eur Radiol 2019;29(1):337–344.
- 5. Hofvind S, Geller BM, Rosenberg RD, Skaane P. Screening-detected breast cancers: discordant independent double reading in a population-based screening program. Radiology 2009;253(3):652–660.
- 6. Chong A, Weinstein SP, McDonald ES, Conant EF. Digital Breast Tomosynthesis: Concepts and Clinical Practice. Radiology 2019;292(1):1–14.
- 7. Mullen LA, Panigrahi B, Hollada J, Panigrahi B, Falomo ET, Harvey SC. Strategies for Decreasing Screening Mammography Recall Rates While Maintaining Performance Metrics. Acad Radiol 2017;24(12):1556–1560.
- 8. Geras KJ, Mann RM, Moy L. Artificial Intelligence for Mammography and Digital Breast Tomosynthesis: Current Concepts and Future Perspectives. Radiology 2019;293(2):246–259.
- 9. Rodriguez-Ruiz A, Lång K, Gubern-Merida A, et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol 2019;29(9):4825–4832.
- 10. Lee CI, Houssami N, Elmore JG, Buist DSM. Pathways to breast cancer screening artificial intelligence algorithm validation. Breast 2019 Sep 9 [Epub ahead of print]. doi:10.1016/j.breast.2019.09.005.
