Abstract
Background
The ability of deep learning (DL) models to classify women as at risk for either screening mammography–detected or interval cancer (not detected at mammography) has not yet been explored in the literature.
Purpose
To examine the ability of DL models to estimate the risk of interval and screening-detected breast cancers with and without clinical risk factors.
Materials and Methods
This study was performed on 25 096 digital screening mammograms obtained from January 2006 to December 2013. The mammograms were obtained in 6369 women without breast cancer at the time of screening, 1609 of whom later developed screening-detected breast cancer and 351 of whom developed interval invasive breast cancer. A DL model was trained on the negative mammograms to classify women into those who did not develop cancer and those who developed screening-detected cancer or interval invasive cancer. Model effectiveness was evaluated as a matched concordance statistic (C statistic) in a held-out test set comprising 26% of the women (1669 of 6369).
Results
The C statistics and odds ratios for comparing patients with screening-detected cancer versus matched controls were 0.66 (95% CI: 0.63, 0.69) and 1.25 (95% CI: 1.17, 1.33), respectively, for the DL model, 0.62 (95% CI: 0.59, 0.65) and 2.14 (95% CI: 1.32, 3.45) for the clinical risk factors with the Breast Imaging Reporting and Data System (BI-RADS) density model, and 0.66 (95% CI: 0.63, 0.69) and 1.21 (95% CI: 1.13, 1.30) for the combined DL and clinical risk factors model. For comparing patients with interval cancer versus controls, the C statistics and odds ratios were 0.64 (95% CI: 0.58, 0.71) and 1.26 (95% CI: 1.10, 1.45), respectively, for the DL model, 0.71 (95% CI: 0.65, 0.77) and 7.25 (95% CI: 2.94, 17.9) for the risk factors with BI-RADS density (b rated vs non-b rated) model, and 0.72 (95% CI: 0.66, 0.78) and 1.10 (95% CI: 0.94, 1.29) for the combined DL and clinical risk factors model. The P values for the difference in performance between screening-detected and interval cancer were .99 for the DL model, .002 for the BI-RADS density model, and .03 for the combined model.
Conclusion
The deep learning model outperformed clinical risk factors, including breast density, in determining screening-detected cancer risk but underperformed them for interval cancer risk.
© RSNA, 2021
Online supplemental material is available for this article.
See also the editorial by Bae and Kim in this issue.
Summary
Deep learning applied to prior negative mammograms, combined with clinical risk factors, distinguished women who did not develop breast cancer from those who developed screening-detected or interval breast cancer.
Key Results
■ Among 25 096 digital screening mammograms obtained in 6369 women, a deep learning (DL) model determined the risk of interval cancers not detected with routine screening mammography (matched C statistic: 0.64).
■ Combining mammograms with clinical risk factors using DL achieved better performance in determining screening-detected breast cancer risks than using clinical risk factors alone (matched C statistics: 0.66 vs 0.62, respectively).
Introduction
Randomized controlled trials have shown that screening mammography reduces breast cancer mortality by reducing the incidence of advanced cancer (1,2). However, mammography has reduced sensitivity in the detection of breast cancers in breasts with radiologically dense and complex tissue (3,4). These cancers discovered within 12 months after normal screening mammograms are defined as interval cancers, and the reduction of mammographic sensitivity from breast density is commonly called masking. Roughly 13% of breast cancers diagnosed in the United States are interval cancers (3). Interval cancers usually have more aggressive tumor biology and are typically discovered at an advanced stage (3–6). It is therefore important to identify women who have a high risk for interval breast cancer and provide additional prevention strategies such as supplemental screenings (3).
Previous studies have shown breast density is both a risk factor for breast cancer and a masking factor of interval breast cancer. One study of more than 547 women found that the cancer detection rate with mammography was 80% in women with predominantly fatty breasts and 30% in women with extremely dense breasts (7). A larger study of more than 240 000 women found that combined relative risks of incident breast cancer showed positive correlation with the percentage breast density, reporting relative risks of 1.79 (95% CI: 1.48, 2.16), 2.11 (95% CI: 1.70, 2.63), 2.92 (95% CI: 2.49, 3.42), and 4.64 (95% CI: 3.64, 5.91) for categories of 5%–24%, 25%–49%, 50%–74%, and at least 75% relative to less than 5% (8). Because of this, the American College of Radiology has asked for the development of direct measures of masking and interval cancer risk (9). In addition, computer vision methods have been applied to mammography to identify breast cancer risk using deep learning (DL) with artificial neural network models (10–12). Previous studies using DL models achieved areas under the receiver operating characteristic curve (AUCs) ranging from 0.65 to 0.81 for the differentiation between women with and women without cancer (12–14). However, these studies did not consider how the cancer was detected (eg, regular mammographic screening or other means).
Our previous study using DL has shown promise in being able to learn mammographic features beyond density to distinguish between interval and screening-detected breast cancer risk (15), with an AUC of 0.65 versus 0.82 for density alone versus DL alone, respectively. However, the study was limited by the total number of cancers (n = 355) and the lack of mammograms from women who did not develop breast cancer.
DL models that differentiate negative mammograms among women who later developed interval invasive breast cancer, developed screening-detected breast cancer, or remained cancer-free are lacking in the literature. Thus, we wanted to test whether a DL model could predict screening-detected cancer risk and interval cancer risk from negative mammograms obtained before cancer diagnosis more accurately than a model using clinical risk factors and breast density alone. Our data consist of mammograms from multiple mammography centers with the ability to determine interval cancer status through linkage with clinic or state cancer registries. We compared this DL model with standard risk factors and then integrated the two to produce an optimal combined model for breast cancer risk.
Materials and Methods
The recruitment sites received institutional review board approval for either active or passive consenting processes or a waiver of consent. All procedures in our study were compliant with the Health Insurance Portability and Accountability Act. GE Medical Systems provided partial support for this study through an investigator-initiated study grant. The authors had control of the data and the information submitted for publication. No authors are employees of or consultants for GE Medical Systems. The collection of cancer data used in this study was supported in part by several state public health departments and cancer registries throughout the United States, primarily the Breast Cancer Surveillance Consortium (BCSC). For a full description of the BCSC, please see http://www.bcsc-research.org/.
Study Sample
This is a prospective study of breast cancer screening data. Mammograms and clinical breast cancer risk factors acquired in women from 2006 to 2014 were retrieved from two established case-control studies sampled from underlying breast screening cohorts: clinic 1 and clinic 2 (Fig 1) (16,17). Clinic 1 is a nonprofit academic medical center, and clinic 2 is a medical center within a public research university. Both clinic 1 and clinic 2 are in the United States. Inclusion criteria were as follows: (a) screening mammography was performed 6 months to 5 years before diagnosis, (b) mammograms were negative, (c) information about demographic characteristics and breast health history was available, (d) women had no personal history of breast cancer or implants, and (e) women were between the ages of 40 and 74 years. On average, cancer was diagnosed 3 years after the mammogram was obtained (standard deviation = 1.6 years, median = 2.8 years, quartile 1 = 1.6 years, and quartile 3 = 4.1 years). Women were excluded if only unilateral mammograms were available. Women were also excluded if their mammograms failed the preprocessing steps. Of the 6369 women, 355 have been previously studied (15). The previous study was a pilot study evaluating the use of DL to distinguish between types of cases (screening detected vs interval) without including women without breast cancer for comparison.
Figure 1:
Flowchart shows structure of our data. Data came from two sources—clinic 1 and clinic 2. The clinics split the data into a training set (“train”), a testing set (validation set), and an external testing set (test set). During training, the “train” set is split into a training set and a validation set with a size ratio of 4:1. Images with technical irregularities (eg, those in which one or more views were missing or those that did not pass the preprocessing steps [Proc2Pres]) were removed before training. DICOM = Digital Imaging and Communications in Medicine.
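The 4:1 train-validation split described in the figure can be sketched as follows. This is an illustrative sketch only; the study's actual partitioning code is not published, and the function name and the practice of splitting at the woman level (so that all views from one woman land in the same partition, avoiding leakage between sets) are assumptions on our part.

```python
import random

def split_by_woman(woman_ids, train_frac=0.8, seed=0):
    """Split unique woman IDs 4:1 into training and validation sets.

    Splitting at the woman level rather than the image level keeps all
    four screening views of one woman in the same partition.
    Hypothetical sketch; not the authors' implementation.
    """
    ids = sorted(set(woman_ids))
    rng = random.Random(seed)       # fixed seed for reproducibility
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return set(ids[:cut]), set(ids[cut:])

train_ids, val_ids = split_by_woman(range(100))
```

Images with technical irregularities would be dropped before this step, so the 4:1 ratio applies to the women who pass preprocessing.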
Cancer outcomes and tumor characteristics were linked to the imaging data through either the local pathology databases at the clinical sites or through linkage to the state tumor registries. A positive mammogram is defined as having Breast Imaging Reporting and Data System (BI-RADS) assessment of 0, 3, 4, or 5, and a negative mammogram was defined as BI-RADS category 1 or 2. Interval cancer status was defined as invasive cancers that occurred within 12 months after a negative screening mammogram. Screening-detected cancer was defined as cancer that occurred within 12 months after a positive screening mammogram. After identification of the women with cancer, controls were selected from patients who did not develop cancer using the following matching criteria: age within 5 years, race, date of screening examination within 1 year, mammography machine make (both clinics); facility (clinic 1); and state of residence (clinic 2).
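The shared matching criteria above (age within 5 years, race, screening date within 1 year, machine make) can be expressed as a simple filter over candidate controls. The field names and data layout below are illustrative assumptions, and the clinic-specific criteria (facility for clinic 1, state of residence for clinic 2) are omitted for brevity.

```python
from datetime import date

def eligible_controls(case, candidates):
    """Return candidates meeting the matching criteria shared by both
    clinics. Field names are hypothetical; the registry schema used in
    the study is not published."""
    return [
        c for c in candidates
        if abs(c["age"] - case["age"]) <= 5            # age within 5 years
        and c["race"] == case["race"]                  # same race
        and abs((c["exam_date"] - case["exam_date"]).days) <= 365  # exam within 1 year
        and c["machine_make"] == case["machine_make"]  # same machine make
    ]

case = {"age": 55, "race": "white", "exam_date": date(2010, 6, 1), "machine_make": "GE"}
candidates = [
    {"age": 58, "race": "white", "exam_date": date(2010, 9, 1), "machine_make": "GE"},
    {"age": 70, "race": "white", "exam_date": date(2010, 9, 1), "machine_make": "GE"},
]
matches = eligible_controls(case, candidates)  # only the first candidate qualifies
```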
Imaging Examinations
Full-field digital mammograms were collected from the two clinics: 16 650 mammograms from 4217 women were obtained from clinic 1, and 8446 mammograms from 2152 women were obtained from clinic 2. The mammograms from both clinics were acquired from 2006 to 2013. The combination of these two datasets resulted in 25 096 images from 6369 women (Table 1). The mammograms were in the raw "For-Processing" Digital Imaging and Communications in Medicine, or DICOM, format and included the four common screening views: left craniocaudal, right craniocaudal, left mediolateral oblique, and right mediolateral oblique.
Table 1:
Characteristics of the Study Sample
Breast density was measured using a variety of methods to better understand how breast density and our DL models interact. The methods include clinical BI-RADS density categories a through d (18) and Volpara Volumetric Breast Density (version 1.5.4.0, Volpara Solutions), represented as a dense volume in liters, percentage dense breast volume, or an automated BI-RADS category (19).
Statistical Analysis
The DL approach is detailed in Appendix E1 (online). Risk factors and density measures were summarized on the entire set combined (training, validation, and test sets) and on the final independent test set. Dense volume and percentage volumetric density were analyzed as both continuous measures and using quartiles based on controls. Because these continuous measures had a skewed distribution, a natural logarithm transformation was applied before analysis. The matched C statistic (C statistic) was calculated only within matched case-control sets, according to the study design (20). In studies with nonmatched measures, the AUC and C statistic are used interchangeably and often refer to the same value. Analyses were stratified according to screening-detected and interval cancer. All models were adjusted for clinical risk factors including age (continuous), body mass index (continuous, inverse body mass index), first degree family history of breast cancer, history of biopsy, and race. For continuous density measures, the odds ratios are given per 1 standard deviation increase in the log transformed measure. Differences in strength of association between screening-detected and interval cancer were tested by pooling the groups and including an interaction term between each parameter of interest and type of cancer. To illustrate the final test set performance for each model (the risk factors–only model and the DL plus risk factor combined model), we used the AUC and calculated the associated 95% CI. SAS version 9.4 was used for analyses in the final test set. Conditional logistic and multivariable regression methods were used to test for association of density measures and the DL predictors with breast cancer status and results are summarized using odds ratios and the C statistic. P < .05 was used as a threshold for having a significant difference.
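The matched C statistic differs from the ordinary AUC in that predicted risks are compared only within each matched case-control set. A minimal sketch of the idea behind the concordance index of Brentnall et al (20), assuming one case per set and counting ties as half-concordant, is:

```python
def matched_c_statistic(matched_sets):
    """Concordance computed only within matched case-control sets.

    matched_sets: list of (case_score, [control_scores]) pairs, one per set.
    Each within-set case-control comparison scores 1 if the case's predicted
    risk exceeds the control's, 0.5 for a tie, and 0 otherwise; the matched
    C statistic is the average over all such pairs.
    Illustrative sketch, not the authors' implementation.
    """
    wins = pairs = 0.0
    for case_score, control_scores in matched_sets:
        for c in control_scores:
            if case_score > c:
                wins += 1.0
            elif case_score == c:
                wins += 0.5
            pairs += 1
    return wins / pairs

# Example: three matched sets (case score, control scores)
sets = [(0.9, [0.2, 0.4]), (0.5, [0.5]), (0.3, [0.6])]
c = matched_c_statistic(sets)  # 2.5 concordant pair-weights over 4 pairs = 0.625
```

Unlike the pooled AUC, cases are never compared with controls from other matched sets, so the statistic is unaffected by the matching variables themselves.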
Results
Study Sample Characteristics
The characteristics of all 6369 women (mean age ± standard deviation, 60 years ± 12) are described in Table 1. The average percentage dense breast volume was 13.5% across all women with interval cancer, which was higher than that of the women with screening-detected cancer (7.3%) and matched controls (7%). The detailed characteristics of the cancer samples can be found in Table E1 (online).
Discrimination of Risk Prediction Models
The C statistic of models comparing women with screening-detected cancer to controls was 0.66 (95% CI: 0.63, 0.69) for the DL model, 0.62 (95% CI: 0.59, 0.65) for the clinical risk factors with BI-RADS density model, and 0.66 (95% CI: 0.63, 0.69) for the combined DL and clinical risk factors model. The C statistic of models comparing women with interval cancer versus matched controls was 0.64 (95% CI: 0.58, 0.71) for the DL model alone, 0.71 (95% CI: 0.65, 0.77) for the risk factors with BI-RADS density only model, and 0.72 (95% CI: 0.66, 0.78) for the combined DL and clinical risk factors model (Table 2). For comparison, the AUC was also calculated for each model and found to be similar or identical.
Table 2:
Univariable ORs and C Statistics for the DL Predictor, BMI, and Each of the Breast Density Measures on the Test Dataset
The odds ratios (per 10%) and C statistics using DL alone between screening-detected and interval cancers were similar (P = .99), with odds ratios of 1.25 and 1.26 and C statistics of 0.66 and 0.64, respectively. When screening-detected and interval cancers were combined, the DL predictor of overall breast cancer risk had an odds ratio and C statistic of 1.25 and 0.66, respectively (Table 2). Breast density alone, however, showed a reduced ability to help discriminate screening-detected cancers, with a C statistic of 0.61–0.62 regardless of the density measure used. In contrast, C statistics for breast density were higher for discrimination of interval cancer, ranging from 0.70 to 0.72 across density measures.
Multivariable Analysis
In the multivariable models for screening-detected cancers that include DL and breast density measures (dense volume in milliliters and percentage dense volume), the C statistic improved 5%–7%, whereas the interval cancer models with both DL and breast density remained unchanged from the C statistic values for breast density alone (Table 3). Adding body mass index to the DL predictors, however, improved the C statistic for both screening-detected and interval cancer without attenuating the DL risk estimates or odds ratios (Table 3).
Table 3:
Prediction on the Test Dataset with Our DL Models Combined with Either Breast Density or BMI after Adjustment for Clinical Risk Factors
The AUC and the associated curves for the DL predictor alone are shown in Figure 2, where each category is shown compared with the other two categories (ie, control vs noncontrol [screening-detected and interval cancers]).
Figure 2:
Receiver operating characteristic curves show performance of the deep learning predictor as a yes-no decision between any group and the rest of the two groups. AUC = area under the receiver operating characteristic curve.
Discussion
Our deep learning (DL) models performed better than models using clinical risk factors including Breast Imaging Reporting and Data System (BI-RADS) density in determining screening-detected cancer risk, where the DL model achieved a C statistic of 0.66 (95% CI: 0.63, 0.69) and the BI-RADS density model achieved a C statistic of 0.62 (95% CI: 0.59, 0.65). However, the DL model did not improve on the ability of breast density to enable prediction of interval breast cancer when compared with a model that used breast density alone, where the DL model achieved a C statistic of 0.64 (95% CI: 0.58, 0.71) and the BI-RADS density model achieved a C statistic of 0.71 (95% CI: 0.65, 0.77). This builds on our previous work (15) showing that DL can distinguish between mammograms of women who later developed either screening-detected or interval invasive breast cancer better than breast density alone. Through this work, we now understand that our previous model was making this distinction using additional information beyond breast density for screening-detected cancer risk. The specific characteristics of these features, however, remain to be determined.
In a previous study using this dataset, Kerlikowske et al (3) showed that breast density was associated with interval and screening-detected cancer risk using either clinical or automated (17) BI-RADS density. In our current study, our DL interval cancer predictor provided a similar magnitude of prediction as BI-RADS breast density alone, suggesting that breast density was the primary imaging feature used by the DL model. Others have shown that DL is an effective automated method to measure breast density (12,21,22), with comparable accuracy to that of human readers. Thus, our study adds to prior work showing that DL is an effective approach for quantifying breast density.
We have shown that DL can distinguish between mammograms of women who later developed breast cancer and those who did not better than breast density alone. This has implications for clinical practice, as many management decisions today are guided by breast density alone. A negative mammogram can be used to triage women by risk into unique and clinically relevant groups: low breast cancer risk, elevated screening-detected cancer risk, or elevated interval invasive cancer risk in the next 3 years (the average follow-up time for this study). For a woman with low cancer risk, her primary provider may opt for lower-frequency mammographic screening. For a woman with high cancer risk, her primary provider may opt for an increased frequency of mammographic monitoring. Women in the high-risk DL group who also have dense breasts, and are therefore at higher risk for interval cancers, may benefit most from a monitoring strategy that includes supplemental imaging that retains sensitivity in dense breasts, such as MRI (23), US (24), and molecular imaging (25).
Others have used DL for breast cancer risk. Yala et al (12,13) and Dembrower et al (14) achieved AUCs ranging from 0.68 to 0.81 for overall breast cancer risk on various datasets. They also used training and test dataset splits, as is current practice to safeguard against overfitting. This gives us confidence in the robustness of our findings. More mammograms could improve our two models, but the improvement would likely be only modest. Kallenberg et al (26) used a convolutional sparse autoencoder to distinguish between full-field digital mammograms obtained in 394 patients with cancer and 1182 controls. They found an AUC of 0.61, which is similar to our results when our DL predictor is used alone for screening-detected cases. Li et al (27) evaluated a DL approach for distinguishing between 53 women who carry high-risk BRCA1/2 mutations, 75 women who developed unilateral breast cancer, and 328 women with low breast cancer risk. Their approach was similar to ours in that they started with transfer learning from ImageNet on their DL architecture and used full-field digital mammograms. However, they evaluated only a small retro-nipple patch (256 × 256 pixels) of the mammogram. They showed an AUC of 0.82 in the differentiation between low-risk and high-risk women. Interval cancer status was not known, so no direct comparison can be made regarding cancer type. Furthermore, the dataset was small, no validation or test datasets were used, and overfitting is a likely explanation for the high AUC. Our study with more than 5000 women had no evidence of an AUC this high in any of our subgroups.
The use of artificial intelligence algorithms to aid in the reading of screening mammograms has precedents. There are multiple artificial intelligence algorithms approved by the U.S. Food and Drug Administration for detecting breast cancer for both digital mammography and digital breast tomosynthesis (28,29). Three of these algorithms were compared in a recent external validation (30). The best of these algorithms was found to have a sensitivity of 82% in the detection of breast cancer, with a fixed specificity of 96.6%. The combination of this algorithm with a radiologist improved the sensitivity by 8%. With the risk algorithm presented in our current work, negative mammograms would be further interrogated by artificial intelligence to understand overall cancer risk and to make informed decisions on further screening frequency and technology.
The strengths of our current study include the total number of women available with breast density measures, cancer detection type, and clinical risk factors, as well as linkage to cancer registries for complete capture of interval cancer status. Furthermore, we conducted our final evaluation on a held-out sample for which the DL investigators were blinded to cancer status (case or control), which increases the validity of our findings.
Our study has limitations. First, we lacked the For-Presentation Digital Imaging and Communications in Medicine images, the most common form of archived mammograms, in our dataset; the For-Presentation images are lossy representations of the raw images. To make our algorithm more generally applicable, however, we converted the raw images to presentation images before training. Second, supplemental screening (US, MRI, molecular breast imaging) was used at clinic 2 for fewer than 25% of the women with high-density breasts; at clinic 1, however, supplemental screening was rarely if ever performed. This could lead to some breast cancers in women with a higher than average risk being labeled as interval cancers when they were not. No further examination of the mode of detection was performed because the potential number of affected interval cancers was less than 10%.
We conclude that the use of deep learning to identify mammography features combined with traditional clinical risk factors increases our ability to predict risk of screening-detected breast cancers. However, our analysis did not show a significant improvement in the ability to predict interval invasive cancer, when compared with a model that used breast density alone.
Acknowledgments
Thanks to Kathleen Brandt, MD, in the Department of Radiology, Mayo Clinic, for her insightful comments on the clinical relevance of our results. We thank the participating women, mammography facilities, and radiologists for the data they provided for this study. A list of the Breast Cancer Surveillance Consortium investigators and procedures for requesting BCSC data for research purposes are provided at https://www.bcsc-research.org/.
Supported by the National Cancer Institute (grants P01CA154292, R01CA177150, and R01CA166269) and an investigator-initiated grant from GE Healthcare.
The design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute.
Disclosures of Conflicts of Interest: X.Z. disclosed no relevant relationships. T.K.W. disclosed no relevant relationships. L.L. disclosed no relevant relationships. M.J. disclosed no relevant relationships. C.S. disclosed no relevant relationships. S.W. disclosed no relevant relationships. P.S. disclosed no relevant relationships. C.V. institution has grants/grants pending from GRAIL; institution received travel/accommodations/meeting expenses unrelated to activities listed from GRAIL. K.K. disclosed no relevant relationships. J.A.S. disclosed no relevant relationships.
Abbreviations:
- AUC
- area under the receiver operating characteristic curve
- BI-RADS
- Breast Imaging Reporting and Data System
- DL
- deep learning
References
- 1. Nelson H, Cantor A, Humphrey L. US Preventive Services Task Force evidence syntheses, formerly systematic evidence reviews: screening for breast cancer: a systematic review to update the 2009 US Preventive Services Task Force recommendation. Rockville, Md: Agency for Healthcare Research and Quality, 2016.
- 2. Autier P, Héry C, Haukka J, Boniol M, Byrnes G. Advanced breast cancer and breast cancer mortality in randomized controlled trials on mammography screening. J Clin Oncol 2009;27(35):5919–5923.
- 3. Kerlikowske K, Zhu W, Tosteson AN, et al. Identifying women with dense breasts at high risk for interval cancer: a cohort study. Ann Intern Med 2015;162(10):673–681.
- 4. Porter PL, El-Bastawissi AY, Mandelson MT, et al. Breast tumor characteristics as predictors of mammographic detection: comparison of interval- and screen-detected cancers. J Natl Cancer Inst 1999;91(23):2020–2028.
- 5. Nederend J, Duijm LE, Louwman MW, et al. Impact of the transition from screen-film to digital screening mammography on interval cancer characteristics and treatment: a population-based study from the Netherlands. Eur J Cancer 2014;50(1):31–39.
- 6. Drukker CA, Schmidt MK, Rutgers EJ, et al. Mammographic screening detects low-risk tumor biology breast cancers. Breast Cancer Res Treat 2014;144(1):103–111.
- 7. Mandelson MT, Oestreicher N, Porter PL, et al. Breast density as a predictor of mammographic detection: comparison of interval- and screen-detected cancers. J Natl Cancer Inst 2000;92(13):1081–1087.
- 8. McCormack VA, dos Santos Silva I. Breast density and parenchymal patterns as markers of breast cancer risk: a meta-analysis. Cancer Epidemiol Biomarkers Prev 2006;15(6):1159–1169.
- 9. Sickles EA, D'Orsi CJ, Bassett LW, Appleton CM, Berg WA, Burnside ES. ACR BI-RADS Atlas: breast imaging reporting and data system. Reston, Va: American College of Radiology, 2013; 39–48.
- 10. Lotter W, Diab AR, Haslam B, et al. Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach. arXiv preprint arXiv:1912.11027. https://arxiv.org/abs/1912.11027. Published 2019. Accessed February 3, 2020.
- 11. Akselrod-Ballin A, Chorev M, Shoshan Y, et al. Predicting breast cancer by applying deep learning to linked health records and mammograms. Radiology 2019;292(2):331–342.
- 12. Yala A, Lehman C, Schuster T, Portnoi T, Barzilay R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 2019;292(1):60–66.
- 13. Yala A, Mikhael PG, Strand F, et al. Toward robust mammography-based models for breast cancer risk. Sci Transl Med 2021;13(578):eaba4373.
- 14. Dembrower K, Liu Y, Azizpour H, et al. Comparison of a deep learning risk score and standard mammographic density score for breast cancer risk prediction. Radiology 2020;294(2):265–272.
- 15. Hinton B, Ma L, Mahmoudzadeh AP, et al. Deep learning networks find unique mammographic differences in previous negative mammograms between interval and screen-detected cancers: a case-case study. Cancer Imaging 2019;19(1):41.
- 16. Brandt KR, Scott CG, Ma L, et al. Comparison of clinical and automated breast density measurements: implications for risk prediction and supplemental screening. Radiology 2016;279(3):710–719.
- 17. Kerlikowske K, Scott CG, Mahmoudzadeh AP, et al. Automated and clinical breast imaging reporting and data system density measures predict risk for screen-detected and interval cancers: a case-control study. Ann Intern Med 2018;168(11):757–765.
- 18. D'Orsi C, Bassett L, Feig S. Breast imaging reporting and data system (BI-RADS). In: Lee CI, Lehman CD, Bassett LW, eds. Breast Imaging. New York, NY: Oxford University Press, 2018.
- 19. Highnam R, Brady M, Yaffe MJ, Karssemeijer N, Harvey J. Robust breast composition measurement: Volpara. In: International Workshop on Digital Mammography. Berlin, Heidelberg: Springer, 2010; 342–349.
- 20. Brentnall AR, Cuzick J, Field J, Duffy SW. A concordance index for matched case-control studies with applications in cancer risk. Stat Med 2015;34(3):396–405.
- 21. Wu N, Geras KJ, Shen Y, et al. Breast density classification with deep convolutional neural networks. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, April 15–20, 2018. Piscataway, NJ: IEEE, 2018; 6682–6686.
- 22. Lehman CD, Yala A, Schuster T, et al. Mammographic breast density assessment using deep learning: clinical implementation. Radiology 2019;290(1):52–58.
- 23. Biglia N, Bounous VE, Martincich L, et al. Role of MRI (magnetic resonance imaging) versus conventional imaging for breast cancer presurgical staging in young women or with dense breast. Eur J Surg Oncol 2011;37(3):199–204.
- 24. Rebolj M, Assi V, Brentnall A, Parmar D, Duffy SW. Addition of ultrasound to mammography in the case of dense breast tissue: systematic review and meta-analysis. Br J Cancer 2018;118(12):1559–1570.
- 25. Shermis RB, Wilson KD, Doyle MT, et al. Supplemental breast cancer screening with molecular breast imaging for women with dense breast tissue. AJR Am J Roentgenol 2016;207(2):450–457.
- 26. Kallenberg M, Petersen K, Nielsen M, et al. Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Trans Med Imaging 2016;35(5):1322–1331.
- 27. Li H, Giger ML, Huynh BQ, Antropova NO. Deep learning in breast cancer risk assessment: evaluation of convolutional neural networks on a clinical dataset of full-field digital mammograms. J Med Imaging (Bellingham) 2017;4(4):041304.
- 28. Sechopoulos I, Teuwen J, Mann R. Artificial intelligence for breast cancer detection in mammography and digital breast tomosynthesis: state of the art. Semin Cancer Biol 2021;72:214–225.
- 29. Conant EF, Toledano AY, Periaswamy S, et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol Artif Intell 2019;1(4):e180096.
- 30. Salim M, Wåhlin E, Dembrower K, et al. External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol 2020;6(10):1581–1588.