Radiology: Artificial Intelligence. 2023 Oct 25;5(6):e230304. doi: 10.1148/ryai.230304

Can AI Reduce the Harms of Screening Mammography?

Elizabeth S McDonald, Emily F Conant
PMCID: PMC10698594  PMID: 38074781

See also the article by Ezeana et al in this issue.

Elizabeth S. McDonald, MD, PhD, is an associate professor of radiology and co-director of the Penn Breast Cancer Translational Research Group, a multidisciplinary team committed to advancing breast cancer care while training the next generation of translational researchers. Dr McDonald is the Director of Breast MRI and has co-authored more than 75 manuscripts and book chapters, nearly all involving breast imaging and tumor characterization. She has been a fellow of the Society of Breast Imaging since 2017 and was inducted into the “Council of Distinguished Investigators” by the Academy for Radiology & Biomedical Imaging Research in 2022.

Emily F. Conant, MD, is a professor of radiology and the vice chair of faculty development in the department of radiology at the University of Pennsylvania. She is an internationally renowned clinician and researcher and past president of the Society of Breast Imaging. She has published more than 220 peer-reviewed articles, including in high-impact journals like JAMA and NEJM, on topics including imaging techniques such as digital breast tomosynthesis and quantitative analysis of multimodality imaging to guide tailored, personalized screening beyond mammography.

The U.S. Preventive Services Task Force recently released draft recommendations on breast cancer screening. While reducing the age to begin mammographic screening from 50 years to 40 years was a welcome change, the draft report recommends only biennial screening mammography for women aged 40 to 74 years. One reason for recommending biennial instead of annual screening is the cost, emotional impact, and morbidity associated with false-positive findings. Reducing false-positive recall and biopsy recommendations would have substantial clinical impact, and artificial intelligence (AI) can augment radiologist screening performance. According to the Data Science Institute of the American College of Radiology, the U.S. Food and Drug Administration (FDA) has approved 14 AI medical products for breast cancer detection and/or lesion characterization, an increase from five similar products in December 2021. Still, implementation of these tools into the clinical workflow has been slow, and false-positive results continue to impact up to 70% of women recommended for biopsy (1).

In this issue of Radiology: Artificial Intelligence, Ezeana and colleagues (2) tested a biopsy decision support algorithmic model called the intelligent-augmented breast cancer risk calculator (iBRISK) to calculate the risk of malignancy for mammographic Breast Imaging Reporting and Data System (BI-RADS) category 4 findings in both asymptomatic and symptomatic patients. When a patient's examination is assigned BI-RADS category 4, a tissue diagnosis is recommended for further evaluation of the mammographic, sonographic, or MR-detected finding. The iBRISK model was tested on Houston Methodist Hospital's system-wide data warehouse (the same data source for model development and improvement). The investigators curated data consecutively from the electronic medical records from MD Anderson Cancer Center and The University of Texas Health Science Center San Antonio. The dataset was retrospective and cancer-enriched; only BI-RADS 4 studies were used for training and testing. The study focused on abnormalities that were visible at mammography. The iBRISK model, comprising clinical risk factors and BI-RADS mammographic descriptors, achieved an accuracy of 89.5%, area under the receiver operating characteristic curve (AUC) of 0.93 (95% CI: 0.92, 0.95), sensitivity of 100%, and specificity of 81%. The findings were triaged into low, moderate, and high probability of malignancy (POM). The low POM category had a malignancy rate of 0.16%. The authors estimated that avoiding biopsies in the low and moderate POM categories could reduce biopsies yielding benign results in up to 50% of patients.
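The reported accuracy, sensitivity, and specificity are linked by a simple identity: overall accuracy is the prevalence-weighted average of sensitivity and specificity. The short Python sketch below is not part of the study; it solves that identity for the malignancy prevalence implied by the rounded figures quoted above, which helps illustrate how cancer-enriched the test set was. The function names are illustrative, and because the inputs are rounded published values, the result is only approximate.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Overall accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def implied_prevalence(accuracy, sensitivity, specificity):
    """Solve the accuracy identity for prevalence.

    Undefined when sensitivity equals specificity, because accuracy then
    does not depend on prevalence.
    """
    if sensitivity == specificity:
        raise ValueError("prevalence cannot be recovered when sensitivity == specificity")
    return (accuracy - specificity) / (sensitivity - specificity)


# Rounded values quoted in the editorial for the iBRISK test results.
prev = implied_prevalence(accuracy=0.895, sensitivity=1.00, specificity=0.81)
print(f"Implied malignancy prevalence: {prev:.1%}")
```

Under these assumptions the implied malignancy prevalence is roughly 45%, so accuracy in a less enriched population would sit closer to the specificity than the headline figure suggests.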

Much of AI's emphasis in breast imaging has been on predicting future malignancy from mammograms. Currently, the majority of AI tools authorized by the FDA for use in breast cancer screening assist with improving the accuracy of cancer detection. This standard facilitates testing on large retrospective datasets. Additionally, improved accuracy leads to greater overall specificity and fewer false-positive findings, meaning that even general detection tools can potentially assist with reducing unnecessary intervention. Two excellent meta-analyses evaluating the combined performance of different AI models on large datasets have recently been published (3,4).

Several models that assist specifically in biopsy triage have been developed (5). For example, Cui and colleagues (6) focused on mammographically visible breast masses in a prospective study involving a 1:1 malignant-to-benign ratio, with the proposed model achieving an AUC of 0.96. Meng and colleagues (7) explored different strategies to reduce unnecessary biopsies through a deep transfer learning method. Multiple groups have also looked at the classification of calcifications to avoid unnecessary intervention, with all models demonstrating similar accuracy (8–10). He et al (11) previously reported on the development of this tool (initially called BRISK instead of iBRISK) to target patients with BI-RADS category 4 lesions. They hypothesized that targeting this subgroup might lead to greater accuracy than would a generalized predictor of risk. The original validation set was 1247 patients at Houston Methodist Hospital between January 2006 and May 2015 with a BI-RADS 4 categorization and available mammograms and medical reports. Patients had to have undergone a biopsy within 3 months of their abnormal mammogram to provide a reference standard. Model input also included mammographic and sonographic images, clinical risk factors, and demographics. Sensitivity of the BRISK model to predict malignancy was 100%, specificity was 74%, accuracy was 81%, and the overall AUC was 0.93. Ultrasound features were subsequently excluded and not used in the updated model for the current study.

There were several limitations to the current study by Ezeana and colleagues. First, for the cost analysis, one must consider the downstream impact of not recommending a biopsy for the low and moderate POM cases. What will be the recommendation given to these patients? For example, a mass recommended for biopsy is generally solid or complex cystic, or it is unknown whether the mass is solid or cystic. Would this mass be assessed as benign if biopsy is not performed, or would it be placed into short-term follow-up? If the patient returns for short-interval follow-up, then the cost of the follow-up examinations should be balanced with the cost of avoiding biopsy. Additionally, short-interval follow-up can be a source of anxiety in patients, so prospective evaluation of the emotional impact of AI-based triage versus immediate intervention should also be considered.
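The balance between follow-up cost and avoided biopsy cost can be made concrete with a back-of-the-envelope calculation. The Python sketch below is a deliberately simplified illustration, not a model from the study: the function names, the assumption that every deferred patient completes all planned follow-up examinations, and the placeholder numbers (in arbitrary cost units) are all hypothetical. It also ignores nonmonetary harms such as anxiety.

```python
def followup_pathway_cost(cost_followup_exam, n_followup_exams,
                          cost_biopsy, p_delayed_biopsy):
    """Expected per-patient cost when biopsy is deferred in favor of surveillance.

    Counts every planned follow-up exam plus the expected cost of biopsies
    that are still performed later (probability p_delayed_biopsy).
    """
    if not 0.0 <= p_delayed_biopsy <= 1.0:
        raise ValueError("p_delayed_biopsy must be a probability between 0 and 1")
    return n_followup_exams * cost_followup_exam + p_delayed_biopsy * cost_biopsy


def savings_per_patient(cost_followup_exam, n_followup_exams,
                        cost_biopsy, p_delayed_biopsy):
    """Positive when deferring biopsy is cheaper than immediate biopsy."""
    deferred = followup_pathway_cost(cost_followup_exam, n_followup_exams,
                                     cost_biopsy, p_delayed_biopsy)
    return cost_biopsy - deferred


# Placeholder inputs in arbitrary cost units; NOT taken from the study.
savings = savings_per_patient(cost_followup_exam=1.0, n_followup_exams=3,
                              cost_biopsy=5.0, p_delayed_biopsy=0.1)
print(f"Savings per patient (arbitrary units): {savings:.2f}")
```

With these placeholder inputs the deferred pathway is cheaper, but doubling the cost of each follow-up examination reverses the sign, which is why the downstream recommendation matters to any cost claim.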

Second, the iBRISK model was developed and tested using two-dimensional digital mammography; thus, its relevance to digital breast tomosynthesis (DBT) or synthetic reconstructed two-dimensional images is unknown. The authors state that using DBT for diagnostic evaluation does not affect the positive predictive value of biopsies performed (ie, PPV3). While this was true in some multicenter tomosynthesis studies (12), it would still be interesting to see if the model's accuracy extends to three-dimensional images, as around 87% of facilities have adopted at least one DBT unit (https://www.fda.gov/radiation-emitting-products/mqsa-insights/mqsa-national-statistics).

Third, the amount of retrospective data required for model accuracy resulted in excluding 33%–46% of potential study patients. One of the data points needed for the algorithm was calcification descriptors (with 68% of the included findings having calcifications). Understanding the impact on model performance of not having calcification descriptors will be important. Also, the inclusion of a disproportionate number of women with prior breast cancer (43% in the training set and 46% in the validation set) could lead to a disproportionate number of benign BI-RADS 4 lesions containing calcifications from postsurgical changes, so the accuracy of the algorithm could be impacted by including a relatively high percentage of women in surveillance. Therefore, it will be important to test the algorithm in a population without prior surgical intervention.

Additionally, there was considerable variability of results across sites on low versus moderate and high POMs (fig S2), which raises the possibility of variability of intra- and interreader BI-RADS descriptors across practice types and reader experience. “Interpretation variability” could impact algorithm accuracy, as the iBRISK model tested in this multicenter study is based on data from medical records and descriptors used in mammographic reports, not image-derived variables. It would be interesting to see the impact of image-derived data on model output. For example, absolute volume of breast density is better determined in an automated way, and calcification morphology and distribution could be classified based on image-derived features. While the BRISK model can use features extracted directly from images when a radiology report is not available, this multicenter study relied on a web-based interface that required input of BI-RADS descriptors. One benefit of this study design is that “automation bias” (13) is avoided because the radiologist predetermines the presence of a finding and creates the mammographic descriptors without model input.

Although the model exhibited similar performance across all age groups and races, the report does not describe the impact of breast density on the model's performance. Prior mammograms were not included and could also enhance accuracy. Finally, it will be important to test model performance in a nontertiary center, as a second opinion by a dedicated breast imaging radiologist may reduce unnecessary biopsies by up to 60% (14); thus, the impact of using the iBRISK model could potentially be greater in this setting.

In conclusion, it is exciting to imagine a future where AI is used successfully to avoid unnecessary biopsies in thousands of women and men each year. Still, the increasing diagnostic accuracy of AI-based decision support must be complemented by demonstrating value when these tools are integrated into the standard clinical workflow by improving outcomes with a favorable cost-effectiveness ratio. One necessary next step is delineating the path for women and men who forgo intervention based on AI. If they are then placed into short-interval follow-up (BI-RADS category 3), this may not result in cost savings over biopsy and could lead to morbidity from anxiety and a prolonged state of uncertainty. Using AI to avoid biopsy would have the greatest impact in a new diagnostic paradigm where there is evidence from prospective trials supporting return to yearly routine screening for these patients, as current practice is generally to perform a 6-month follow-up when a recommended biopsy is not performed. Although the lofty goal of successful AI integration might seem elusive, the potential benefit to our patients is worth our collective efforts to achieve optimal screening and diagnostic performance.

Footnotes

Supported by the Pennsylvania Breast Cancer Coalition.

Disclosures of conflicts of interest: E.S.M. Member of Clinical Breast Cancer editorial board; received research funding from the National Cancer Institute, Congressionally Directed Medical Research Program, American Roentgen Ray Society, METAvisor, and the Pennsylvania Breast Cancer Coalition; has never received any industry funding. E.F.C. Grants or contracts from the National Institutes of Health, the National Cancer Institute, the American College of Surgeons (ACS), Hologic, iCAD, and OM1; consulting fees from iCAD and Hologic; payment or honoraria for lectures, presentations, speakers bureaus, manuscript writing, or educational events from Medscape, Medality, Aunt Minnie, and iiCME; support for attending meetings and/or travel from the Radiological Society of North America and the Society of Breast Imaging (SBI); participation on a Data Safety Monitoring Board or Advisory Board for Epic, iCAD, Hologic, SBI, the American Joint Committee on Cancer, the National Comprehensive Cancer Network, ACS, and BreastCancer.org.

References

1. Sprague BL, Arao RF, Miglioretti DL, et al; Breast Cancer Surveillance Consortium. National Performance Benchmarks for Modern Diagnostic Digital Mammography: Update from the Breast Cancer Surveillance Consortium. Radiology 2017;283(1):59–69.
2. Ezeana CF, He T, Patel TA, et al. A deep learning decision support tool to improve risk stratification and reduce unnecessary biopsies in BI-RADS 4 mammograms. Radiol Artif Intell 2023;5(6):e220259.
3. Aggarwal R, Sounderajah V, Martin G, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med 2021;4(1):65.
4. Yoon JH, Strand F, Baltzer PAT, et al. Standalone AI for breast cancer detection at screening digital mammography and digital breast tomosynthesis: a systematic review and meta-analysis. Radiology 2023;307(5):e222639.
5. Balkenende L, Teuwen J, Mann RM. Application of deep learning in breast cancer imaging. Semin Nucl Med 2022;52(5):584–596.
6. Cui Y, Li Y, Xing D, Bai T, Dong J, Zhu J. Improving the prediction of benign or malignant breast masses using a combination of image biomarkers and clinical parameters. Front Oncol 2021;11:629321. [Published correction appears in Front Oncol 2021;11:694094.]
7. Meng M, Li H, Zhang M, He G, Wang L, Shen D. Reducing the number of unnecessary biopsies for mammographic BI-RADS 4 lesions through a deep transfer learning method. BMC Med Imaging 2023;23(1):82.
8. Gerbasi A, Clementi G, Corsi F, et al. DeepMiCa: Automatic segmentation and classification of breast MIcroCAlcifications from mammograms. Comput Methods Programs Biomed 2023;235:107483.
9. Leong YS, Hasikin K, Lai KW, Mohd Zain N, Azizan MM. Microcalcification discrimination in mammography using deep convolutional neural network: towards rapid and early breast cancer diagnosis. Front Public Health 2022;10:875305.
10. Cai H, Huang Q, Rong W, et al. Breast microcalcification diagnosis using deep convolutional neural network from digital mammograms. Comput Math Methods Med 2019;2019:2717454.
11. He T, Puppala M, Ezeana CF, et al. A deep learning-based decision support tool for precision risk assessment of breast cancer. JCO Clin Cancer Inform 2019;3(3):1–12.
12. Conant EF, Barlow WE, Herschorn SD, et al; Population-based Research Optimizing Screening Through Personalized Regimen (PROSPR) Consortium. Association of digital breast tomosynthesis vs digital mammography with cancer detection and recall rates by age and breast density. JAMA Oncol 2019;5(5):635–642.
13. Dratsch T, Chen X, Rezazade Mehrizi M, et al. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 2023;307(4):e222176.
14. Pistolese CA, Lamacchia F, Tosti D, et al. Reducing the number of unnecessary percutaneous biopsies: the role of second opinion by expert breast center radiologists. Anticancer Res 2020;40(2):939–950.

Articles from Radiology: Artificial Intelligence are provided here courtesy of Radiological Society of North America
