PLOS Digital Health. 2025 Apr 8;4(4):e0000811. doi: 10.1371/journal.pdig.0000811

Subgroup evaluation to understand performance gaps in deep learning-based classification of regions of interest on mammography

MinJae Woo 1,*,#, Linglin Zhang 2,‡,#, Beatrice Brown-Mulry 3, InChan Hwang 4, Judy Wawira Gichoya 5, Aimilia Gastounioti 6, Imon Banerjee 7,8, Laleh Seyyed-Kalantari 9, Hari Trivedi 5
Editor: Daniel A Hashimoto
PMCID: PMC11978028  PMID: 40198652

Abstract

This study evaluates a deep learning model for classifying normal versus potentially abnormal regions of interest (ROIs) on mammography, aiming to identify imaging, pathologic, and demographic characteristics that may induce suboptimal model performance in certain patient subgroups. We utilized the EMory BrEast imaging Dataset (EMBED), containing 3.4 million mammographic images from 115,931 patients. Full-field digital mammograms from women aged 18 years or older were used to create positive and negative patches, with positive and negative patches matched on size, location, patient demographics, and imaging features. Several convolutional neural network (CNN) architectures were tested, with ResNet152V2 demonstrating the best performance. The dataset was split into training (29,144 patches), validation (9,910 patches), and testing (13,390 patches) sets. Performance metrics included accuracy, AUC, recall, precision, F1 score, false negative rate, and false positive rate. Subgroup analysis was conducted using univariate and multivariate regression models to control for confounding effects. The classification model achieved an AUC of 0.975 and a recall of 0.927. False negative predictions were significantly associated with White patients (RR = 1.208; p = 0.050), those never biopsied (RR = 1.079; p = 0.011), and cases with architectural distortion (RR = 1.037; p < 0.001). Higher breast density significantly increased the risk of false positives, with elevated risk for BI-RADS density C (RR = 1.891; p < 0.001) and D (RR = 2.486; p < 0.001). Race and age were not significant predictors of false positives in multivariate analysis. These findings suggest that deep learning models for mammography may underperform in specific subgroups. The study underscores the need for more precise patient subgroup analysis and emphasizes the importance of considering confounding factors in deep learning model evaluations.
These insights can help develop fair and interpretable decision-making models in mammography, ultimately enhancing the performance and equity of CADe and CADx applications.

Author summary

Our study reveals that AI models used in screening mammography do not fail randomly; certain patient populations are more prone to experiencing these failures. Specifically, deep learning models may underperform in patient subgroups such as those with higher breast density or certain demographic characteristics. To effectively address these failures, it is crucial to consider how patient characteristics (e.g., breast density, age, and race) are interrelated. For example, while many researchers traditionally focus on augmenting datasets with images from minority racial groups to reduce model failures against them, our results indicate that increasing the representation of images from patients with higher breast density would be a more targeted and impactful approach after accounting for the interaction between density and race. Ignoring the interconnected nature of these patient characteristics can lead to misguided interventions, potentially undermining efforts to improve AI model effectiveness and fairness in clinical practice.

Introduction

Breast cancer is the most common cancer among women, leading to approximately 42,000 deaths annually in the United States [1]. Early detection via screening mammography has been shown to reduce morbidity and mortality of breast cancers by 38–48% [2–4]. However, there is significant resource usage for detecting a relatively small number of cancers; for every 1,000 women screened, approximately 100 are recalled, and 30–40 are biopsied for the detection of up to 1–5 cancers [5]. This is largely due to the similar imaging appearance of many benign and malignant lesions which necessitates repeat imaging and/or biopsy for definitive diagnosis. The challenge of distinguishing normal from suspected abnormal findings on screening mammography can lead to unnecessary recalls and biopsies. Improving the differentiation of these findings has the potential to reduce resource wastage and decrease patient morbidity [3,6,7].

Deep learning (DL) models are increasingly utilized to enhance the differentiation between normal and suspected abnormal regions of interest (ROIs) in mammograms [8–18]. Such models have strong potential for being directly implemented into computer-aided detection (CADe) and computer-aided diagnosis (CADx) [19–24]. Moreover, models that classify normal or suspected abnormal regions of interest often serve as the backbone of broader AI applications in cancer screening and diagnosis, such as tumor object detection and automated radiology report generation [9,10,15,25,26]. However, these models are imperfect, generally achieving sensitivity and specificity between 85–92%, and it has been well-documented that their failure does not occur at random [27–30]. Because breast cancer can have a heterogeneous appearance [31], we hypothesize that various image features or other image confounders result in asymmetric model performance for subgroups of lesion appearance, pathology, or patient demographics, and that these confounders may be interrelated. The collective influence of multiple imaging features, breast densities, and patient demographics on model failure remains largely unexamined due to the lack of available granular datasets in the field. This creates potential unrecognized gaps in DL model performance and failure, which may disproportionately affect specific patient subgroups and undermine the overall effectiveness as well as equity of diagnostic tools in the era of artificial intelligence.

In this paper, we describe a deep learning model for classifying normal versus potentially abnormal regions of interest on mammography, with the purpose of presenting a robust, multivariate post-hoc analysis to identify imaging, pathologic, and demographic characteristics that degrade model performance. The presented analysis aims to determine which patient subgroups truly suffer from disparate impacts due to model failures. To this end, we leveraged the largest mammography dataset in the field, which is racially balanced and of high granularity and quality.

Materials and methods

Data preparation

We used the EMory BrEast imaging Dataset (EMBED), which contains clinical and imaging data for 3.4 million mammographic images from 383,379 screening and diagnostic exams of 115,931 patients, acquired between 2013 and 2020 at four different hospitals within a single academic institution [32]. With the approval of Emory University’s institutional review board, this dataset was curated retrospectively using automated and semi-automated curation techniques, and the need for written informed consent from patients was waived due to the use of de-identified data. The dataset provides comprehensive detailed demographics, imaging features, pathologic outcomes, and Breast Imaging Reporting and Data System (BI-RADS) tissue densities and scores for both screening and diagnostic exams. Additionally, it provides regions of interest (ROIs) annotated by radiologists during initial screenings, along with more definitive diagnostic information from follow-up diagnostic or pathologic studies.

An overview of the data preparation process is shown in Fig 1. This study exclusively utilized full-field digital mammograms (FFDM) from women aged 18 years or older with at least one available mammogram in the picture archiving and communication system. Positive patches were defined as any radiologist-annotated region of interest (ROI) from a Breast Imaging Reporting and Data System (BI-RADS) 0 image, indicating a potential abnormality requiring follow-up. Negative patches were generated from two sources: (1) regions within the BI-RADS 0 images that did not overlap with the annotated ROIs; and (2) regions from negative screening images classified as BI-RADS 1 and 2, as illustrated in Fig 2. We employed a stratified down-sampling approach to minimize confounding factors in patch selection. Given that the pool of patients available for negative patch extraction is much larger than that for positive patches, we carefully subsampled negative patches to ensure that the patient distribution used for their extraction closely matched that of positive patches. Specifically, this process controlled for key confounders, including patient demographics (e.g., race), imaging features, and breast density, preventing the model from exploiting unintended differences in patient composition rather than learning clinically relevant imaging patterns. The subsequent matching process first identified pairs of positive and negative FFDMs with the highest similarity in terms of pixel distributions, then used the shape and location of the ROI from the positive images as references, and finally extracted equivalent patches from the negative images. The extracted patch sizes varied from 53 × 76 to 2379 × 2940 pixels, with a median size of 360 × 437 pixels. Patches larger than 512 × 512 pixels were downsampled to fit within 512 × 512 pixels while preserving the aspect ratio.
All patches were then zero-padded to a consistent input size of 512 × 512 pixels for compatibility with the convolutional neural network (CNN) architectures. The final dataset was randomly split at the patient level to prevent data leakage, comprising a total of 10,678 patients and 29,144 patches (55.6%) for training, 3,609 patients and 9,910 patches (18.9%) for validation, and 5,404 patients and 13,390 patches (25.5%) for testing.
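The downsampling and zero-padding step can be sketched as follows. This is a minimal illustration assuming grayscale patches stored as NumPy arrays; nearest-neighbour index resampling is used only to keep the sketch dependency-free, as the text does not state which interpolation was applied.

```python
import numpy as np

def resize_and_pad(patch, target=512):
    """Fit a grayscale patch into a target x target canvas: downsample
    (preserving aspect ratio) only if the patch exceeds the target,
    then centre it on a zero-padded canvas, as described in the text."""
    h, w = patch.shape
    if max(h, w) > target:
        scale = target / max(h, w)           # shrink the longer side to target
        new_h = max(1, int(h * scale))
        new_w = max(1, int(w * scale))
        # nearest-neighbour resampling via index arithmetic (assumed method)
        rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
        cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
        patch = patch[rows][:, cols]
        h, w = new_h, new_w
    canvas = np.zeros((target, target), dtype=patch.dtype)
    top, left = (target - h) // 2, (target - w) // 2   # centre the patch
    canvas[top:top + h, left:left + w] = patch
    return canvas
```

A 2379 × 2940-pixel patch, for instance, would be scaled to 414 × 512 before padding, while a 360 × 437 median-sized patch would be padded without any resampling.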

Fig 1. Overview of data preparation for DL patch classification model training.


Positive patches were radiologist-annotated ROIs from BI-RADS 0 images, while negative patches were acquired from non-overlapping regions in BI-RADS 0 images and BI-RADS 1 and 2 images. Negative patches were matched to positive ones by size, location, and patient demographics. DL, deep learning; BI-RADS, breast imaging reporting and data system.

Fig 2. Example of positive and negative patch generation.


The patch in the BI-RADS 0 image (left) shows a suspected abnormality originally annotated by the radiologist. The patches in the matched BI-RADS 1/2 image (right) represent an equivalent patch which serves as a negative counterpart. The matching process utilized row-wise and column-wise distribution along with overall pixel distribution to maximize similarity in the overall intensity distributions between negative and positive patches. All patches were cropped, centered, and padded with 0 pixels to be 512 × 512 pixels. BI-RADS, breast imaging reporting and data system.
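The distribution-based matching described above can be approximated as below. The combination of normalised row-wise and column-wise mean-intensity profiles with an overall intensity histogram follows the legend's description, but the L1 distance is an assumed metric; the paper does not specify the exact similarity measure.

```python
import numpy as np

def profile_distance(patch_a, patch_b):
    """Dissimilarity between two equal-sized patches, combining their
    normalised row-wise and column-wise mean-intensity profiles with an
    overall intensity histogram. Lower values indicate greater similarity."""
    def normalise(v):
        v = v.astype(float)
        s = v.sum()
        return v / s if s else v

    row_d = np.abs(normalise(patch_a.mean(axis=1)) -
                   normalise(patch_b.mean(axis=1))).sum()
    col_d = np.abs(normalise(patch_a.mean(axis=0)) -
                   normalise(patch_b.mean(axis=0))).sum()
    # overall pixel-intensity distribution (64 bins, 8-bit range assumed)
    hist_a, _ = np.histogram(patch_a, bins=64, range=(0, 255), density=True)
    hist_b, _ = np.histogram(patch_b, bins=64, range=(0, 255), density=True)
    return row_d + col_d + np.abs(hist_a - hist_b).sum()
```

The candidate negative region minimising this distance against a positive ROI would then supply the matched counterpart patch.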

Patch classifier model training

Guided by relevant studies, we tested four common CNN architectures and their corresponding pretrained weights as feature extractors for the binary classification of normal and potentially abnormal tissue patches: InceptionV3 [33], VGG16 [34], ResNet50V2 [35], and ResNet152V2 [36], using identical training, validation, and test sets. Preliminary results showed the best performance with ResNet152V2 (S1 Table), so this architecture was utilized throughout the study. The final architecture accepted 512 × 512-pixel PNG images as input, followed by the ResNet152V2 feature extractor pretrained on the ImageNet dataset [37] and additional layers: a trainable batch normalization layer, a dense layer with ReLU activation, a dropout layer, three dense layers with ReLU activation, and a final dense layer with sigmoid activation. Systematic optimization of hyperparameters and learning rate was performed using Bayesian optimization [38], yielding optimal model validation performance at a learning rate of 0.0059. The hardware configuration used for the experiment was (1) 8 x NVIDIA A100 GPUs providing a combined total of 320 GB of GPU RAM, and (2) 3 x NVIDIA RTX3090 GPUs providing a combined total of 72 GB of GPU RAM, utilized as appropriate. All data preprocessing and deep learning model development were conducted in a Python environment using TensorFlow. Statistical analyses were performed using the statsmodels package in Python and cross-validated using StataIC 15.1 (StataCorp, College Station, TX, USA).
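A minimal Keras sketch of the described architecture is given below. Only the layer types, the ImageNet-pretrained ResNet152V2 backbone, and the tuned learning rate of 0.0059 come from the text; the layer widths, dropout rate, choice of Adam optimizer, and 3-channel input (grayscale patches replicated across channels for the pretrained backbone) are illustrative assumptions.

```python
import tensorflow as tf

def build_patch_classifier(input_shape=(512, 512, 3), weights="imagenet",
                           dropout_rate=0.3):
    """Sketch of the described classifier: an ImageNet-pretrained
    ResNet152V2 feature extractor, batch normalization, a dense+ReLU
    layer, dropout, three dense+ReLU layers, and a sigmoid output.
    Layer widths and dropout rate are illustrative assumptions."""
    backbone = tf.keras.applications.ResNet152V2(
        include_top=False, weights=weights,
        input_shape=input_shape, pooling="avg")
    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Adam is an assumption; the paper reports only the Bayesian-optimized
    # learning rate of 0.0059.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0059),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```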

Statistical analysis

We evaluated overall patch classifier performance by accuracy, area under the receiver operating characteristic curve (AUC), recall, precision, F1 score, false negative rate, and false positive rate. To obtain 95% confidence intervals, we performed bootstrapping with 200 iterations of the test set, guided by relevant studies [39–41].
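The bootstrap evaluation can be sketched with a dependency-free AUC (the Mann-Whitney rank formulation) and a percentile interval over 200 resamples. The percentile method for the interval is an assumption; the paper states only the number of bootstrap iterations.

```python
import numpy as np

def auc_mann_whitney(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen positive scores above a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks across ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(y_true, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI over resampled test sets (200 iterations,
    matching the paper's setup)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        y_b, s_b = y_true[idx], scores[idx]
        if y_b.min() == y_b.max():
            continue                     # resample must contain both classes
        stats.append(auc_mann_whitney(y_b, s_b))
    return np.percentile(stats, [100 * alpha / 2, 100 - 100 * alpha / 2])
```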

The experiment consisted of two steps: (1) the typical process of developing and testing a deep learning model for the binary classification of normal and potentially abnormal tissue patches, and (2) analysis to identify patient subgroups more likely to be misclassified by the model (Fig 3). For subgroup evaluation, we calculated AUC, recall, and precision by race – White, Black, or Other; age group – <50, 50–60, 60–70, and >70 years; BI-RADS tissue density – A, B, C, or D; pathology – never biopsied, cancer, or benign; and imaging features – mass, asymmetry, architectural distortion, or calcification. However, because multiple mammographic features are known to be interrelated (for example, breast density distribution varies by race and decreases with age), standard statistical evaluation of subgroups using a Student's t-test would yield spurious results, as this test assumes independence of all variables. Instead, we used univariate and multivariate logistic regression models to evaluate the impact of various demographic, imaging, and pathologic subgroups on model performance. This allowed us to control for potential confounding effects between features and helps explain the underlying contribution of each feature to model failure. False negative predictions were investigated on the potentially abnormal patches using race, age, tissue density, pathologic outcome, and image findings. False positive predictions were evaluated on normal patches only, using features that apply to the whole mammogram: race, age group, and tissue density. The odds ratio (OR) for false positive and false negative predictions was calculated between each feature and a selected control group, and the corresponding risk ratio (RR) was calculated from the OR and the prevalence of correct predictions in the non-exposed group (true positives within all positive patches and true negatives within all negative patches).
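The OR-to-RR step can be sketched as follows. The widely used Zhang and Yu correction, which converts an odds ratio given the outcome prevalence in the control group, is assumed here, since the text describes deriving RR from the OR and the non-exposed prevalence without stating the exact formula.

```python
def or_to_rr(odds_ratio, p0):
    """Approximate a risk ratio from an odds ratio, where p0 is the
    outcome prevalence in the control (non-exposed) group. This is the
    standard Zhang-Yu conversion; the paper's exact formula is not
    stated, so treat this as an assumed reconstruction."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)
```

As a consistency check, the reported density-D false positive OR of 3.863 and RR of 2.486 would be compatible with a control-group prevalence near 0.19 under this conversion (back-solved here for illustration, not reported in the paper).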

Fig 3. Experimental overview with two main components.


Predictive Analytics involved training a deep learning model to differentiate between normal and potentially abnormal patches on mammography. Descriptive Analytics evaluated model performance by subgroups based on clinical, imaging, and demographic features. Descriptive Model 1 analyzed false positives, and Descriptive Model 2 analyzed false negatives. DL, deep learning; ROI, region of interest.

Results

Patient characteristics

Mean patient age was 59.02 ± 11.87 (standard deviation) years. The study included 21,273 (40.6%) White patients, 22,321 (42.5%) Black patients, and 8,850 (16.9%) patients from other racial backgrounds. A total of 16,387 (31.2%) patients were under 50 years old, 14,775 (28.2%) were between 50–60 years old, 12,490 (23.8%) were between 60–70 years old, and 8,792 (16.8%) were over 70 years old. The tissue density distribution for BI-RADS categories A, B, C, and D was 6,050 (11.5%), 18,544 (35.4%), 23,218 (44.3%), and 4,632 (8.8%) respectively. The distributions of demographics, imaging features, and pathologic outcomes are summarized in Table 1. Pathologic outcomes stratified by type of imaging finding in the test set are shown in Table 2.

Table 1. Distribution of regions of interest (patches) utilized for model development.

Characteristics Count (percentage)
Total Training set Validation set Test set
All patches 52,444 (100%) 29,144 (100%) 9,910 (100%) 13,390 (100%)
Label
 Positive 24,166 (46.1%) 13,461 (46.2%) 4,563 (46.0%) 6,142 (45.9%)
 Negative 28,278 (53.9%) 15,683 (53.8%) 5,347 (54.0%) 7,248 (54.1%)
Race
 African American or Black 22,321 (42.5%) 12,483 (42.8%) 4,137 (41.7%) 5,701 (42.6%)
 Caucasian or White 21,273 (40.6%) 11,649 (40.0%) 4,140 (41.8%) 5,484 (40.9%)
 Other 8,850 (16.9%) 5,012 (17.2%) 1,633 (16.5%) 2,205 (16.5%)
Age group
  < 50 16,387 (31.2%) 9,184 (31.5%) 3,293 (33.2%) 3,910 (29.2%)
 50–60 14,775 (28.2%) 8,138 (27.9%) 2,811 (28.4%) 3,826 (28.6%)
 60–70 12,490 (23.8%) 6,965 (23.9%) 2,240 (22.6%) 3,285 (24.5%)
 >70 8,792 (16.8%) 4,857 (16.7%) 1,566 (15.8%) 2,369 (17.7%)
BI-RADS density
 A-Almost entirely fatty 6,050 (11.5%) 3,218 (11.0%) 1,166 (11.8%) 1,666 (12.4%)
 B-Scattered areas of fibro-glandular density 18,544 (35.4%) 10,275 (35.3%) 3,348 (33.8%) 4,921 (36.8%)
 C-Heterogeneously dense 23,218 (44.3%) 12,682 (43.5%) 4,310 (43.5%) 6,226 (46.5%)
 D-Extremely dense 4,632 (8.8%) 2,969 (10.2%) 1,086 (10.9%) 577 (4.3%)
Pathology
 Cancer 1,147 (2.2%) 739 (2.5%) 233 (2.3%) 175 (1.3%)
 Benign 3,495 (6.7%) 2,025 (7.0%) 662 (6.7%) 808 (6.0%)
 Never biopsied 47,802 (91.1%) 26,380 (90.5%) 9,015 (91.0%) 12,407 (92.7%)

Due to the structure of the dataset, the prevalence of imaging features could not be calculated for the training and validation sets; individual images may contain multiple ROIs, which prevents exact prevalence counts.

BI-RADS, breast imaging reporting and data system.

Table 2. Distribution of pathologic outcomes stratified by image findings in the test set.

Image findings Cancer Benign Never biopsied Total
Mass 8 (0.45%) 75 (4.22%) 1,693 (95.33%) 1,776 (100%)
Asymmetry 97 (3.86%) 390 (15.54%) 2,023 (80.60%) 2,510 (100%)
AD 19 (3.28%) 50 (8.64%) 510 (88.08%) 579 (100%)
Calcification 82 (1.53%) 265 (4.93%) 5,025 (93.54%) 5,372 (100%)

AD, architectural distortion.

Patch classification performance analysis

On the test set of 13,390 patches from 5,404 patients, the patch classification model built with ResNet152V2 achieved an overall accuracy of 92.6% (95% CI: 92.0–93.2%), an AUC of 0.975 (95% CI: 0.972–0.978), a recall of 0.927 (95% CI: 0.919–0.935), a precision of 0.912 (95% CI: 0.902–0.922), and an F1 score of 0.919 (95% CI: 0.913–0.925) for classifying normal versus suspected abnormal patches (Table 3). A significant impact of breast density on classification performance was observed; detailed classification model performance stratified by breast density across subgroups of race, age, pathology outcome, and image findings is described in S2 Table. The typical subgroup analysis in the field, represented by ROC curves by subgroup, is shown in Fig 4. Another predominant subgroup analysis, represented by histograms of bootstrapped AUC distributions showing the separation between groups, is shown in Fig 5. Patches were manually inspected to ensure the rigor of the failure analysis; examples of true positive, false negative, true negative, and false positive prediction outcomes by various subgroups are shown in Fig 6.

Table 3. Classification model performance both overall and by subgroup.

Group Accuracy AUC Recall Precision F1 score
Overall
 Train 0.984 0.998 0.984 0.983 0.983
 Validation 0.926 0.967 0.915 0.925 0.920
 Test 0.926 ± 0.006 0.975 ± 0.003 0.927 ± 0.008 0.912 ± 0.010 0.919 ± 0.006
Race
 White 0.922 ± 0.009 0.972 ± 0.005 0.918 ± 0.013 0.902 ± 0.016 0.910 ± 0.011
 Black 0.927 ± 0.009 0.976 ± 0.005 0.931 ± 0.012 0.914 ± 0.015 0.922 ± 0.010
 Other 0.928 ± 0.015 0.978 ± 0.007 0.938 ± 0.020 0.926 ± 0.020 0.932 ± 0.014
Age
  < 50 0.916 ± 0.012 0.970 ± 0.007 0.922 ± 0.014 0.919 ± 0.016 0.920 ± 0.011
 50–60 0.926 ± 0.011 0.977 ± 0.005 0.930 ± 0.017 0.915 ± 0.018 0.922 ± 0.012
 60–70 0.933 ± 0.013 0.976 ± 0.006 0.932 ± 0.017 0.916 ± 0.022 0.924 ± 0.016
 >70 0.930 ± 0.014 0.975 ± 0.008 0.928 ± 0.023 0.882 ± 0.031 0.904 ± 0.021
BI-RADS density
 A 0.950 ± 0.016 0.977 ± 0.013 0.924 ± 0.040 0.825 ± 0.069 0.871 ± 0.044
 B 0.942 ± 0.009 0.982 ± 0.004 0.942 ± 0.012 0.936 ± 0.014 0.939 ± 0.009
 C 0.910 ± 0.008 0.966 ± 0.005 0.920 ± 0.012 0.906 ± 0.013 0.913 ± 0.009
 D 0.877 ± 0.036 0.953 ± 0.021 0.899 ± 0.046 0.882 ± 0.047 0.891 ± 0.034
Pathology
 Never biopsied 0.925 ± 0.006 0.974 ± 0.003 0.925 ± 0.008 0.908 ± 0.010 0.916 ± 0.007
 Benign 0.937 ± 0.019 0.977 ± 0.010 0.955 ± 0.022 0.943 ± 0.027 0.949 ± 0.016
 Cancer 0.930 ± 0.049 0.980 ± 0.023 0.938 ± 0.063 0.957 ± 0.052 0.947 ± 0.042
Image findings
 Mass 0.931 ± 0.017 0.980 ± 0.008 0.949 ± 0.022 0.896 ± 0.029 0.922 ± 0.020
 Asymmetry 0.930 ± 0.009 0.977 ± 0.005 0.937 ± 0.010 0.942 ± 0.011 0.939 ± 0.008
 AD 0.826 ± 0.032 0.914 ± 0.034 0.810 ± 0.040 0.939 ± 0.027 0.869 ± 0.026
 Calcification 0.920 ± 0.016 0.974 ± 0.008 0.939 ± 0.018 0.904 ± 0.022 0.921 ± 0.016

Overall and subgroup performance metrics are averaged over 200 bootstrapped samples ± 95% confidence intervals.

AD, architectural distortion; AUC, area under the receiver operating characteristic curve; BI-RADS, breast imaging reporting and data system.

Fig 4. Receiver operating characteristic curves of (A) the whole test set, and stratified by subgroups: (B) race, (C) age, (D) BI-RADS density, (E) pathology, and (F) image findings.

AUC, area under the receiver operating characteristic curve; BI-RADS, breast imaging reporting and data system; NB, never biopsied; AD, architectural distortion.

Fig 5. Histogram demonstrating distribution of AUCs of the test set with 200 bootstraps showing the separation between subgroups within each category.


AD, architectural distortion; AUC, area under the receiver operating characteristic curve; BI-RADS, breast imaging reporting and data system; NB, never biopsied.

Fig 6. Example of true positive, false negative, true negative, and false positive patches.


Each row is selected to demonstrate a single feature while holding other features constant. (a–d) are patches selected from patients of Other race, age less than 50 years old, BI-RADS density C, and never biopsied; (e–h) are patches of patients with density D, race White, age less than 50 years old, and never biopsied; (i and j) are patches with cancer, White, age greater than 70 years old, density B, with calcifications; (k and l) are patches with architectural distortion, White, age less than 50 years old, density C, never biopsied.

The relative risk ratio of false negative (Table 4) and false positive (Table 5) predictions was calculated using univariate and multivariate logistic regression models. In sum, patients who faced false negatives were significantly more likely to be White (RR = 1.208; p = 0.050), never biopsied (RR = 1.079; p = 0.011), with no masses (RR = 1.086; p = 0.010), with no asymmetries (RR = 1.171; p = 0.040), and with architectural distortion (RR = 1.037; p < 0.001). Similarly, patients who faced false positives were significantly more likely to have higher breast density, specifically with density C (RR = 1.891; p < 0.001) and density D (RR = 2.486; p < 0.001). Patients who faced false negatives were significantly more likely to be White compared to other racial groups, but this association was not significant when specifically compared to Black patients. It is also noteworthy that race and age were no longer significant predictors for false positives when using multivariate analysis. This change in statistical significance was observed in both false negative and false positive analyses before and after addressing the potential confounding effects between demographic variables and image features. Generally, addressing confounding effects results in fewer significant predictors for both false positives and false negatives.
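The shift in significance between univariate and multivariate analysis can be illustrated with a small, fully synthetic simulation: a hypothetical population in which false positives are driven by breast density alone while density prevalence differs between two groups. A univariate model attributes risk to group membership; once density is adjusted for, that spurious effect collapses. All prevalences below are invented for illustration and are not EMBED data.

```python
import numpy as np

def logit_fit(X, y, n_iter=25):
    """Logistic regression fitted by Newton-Raphson (IRLS).
    Returns coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])      # Hessian
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, n).astype(float)          # hypothetical binary group
# dense breasts more common in group 1 (illustrative prevalences only)
dense = (rng.random(n) < np.where(group == 1, 0.6, 0.3)).astype(float)
# false positives depend on density only, not on group membership
fp = (rng.random(n) < np.where(dense == 1, 0.20, 0.08)).astype(float)

b_uni = logit_fit(group[:, None], fp)                        # group only
b_multi = logit_fit(np.column_stack([group, dense]), fp)     # adjusted
# b_uni[1] is clearly positive (a spurious group effect), while
# b_multi[1] shrinks toward zero once density is included
```

This mirrors the pattern in Tables 4 and 5: predictors that appear significant in isolation can lose significance after confounders are modeled jointly.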

Table 4. Evaluation of model performance using multivariate regression to assess risk ratio of false negative predictions by subgroups, compared with univariate evaluation.

Variables OR [95% CI] RR [95% CI] Univariate p-value Multivariate p-value Number of patches Control group
Black 0.880 [0.709, 1.093] 0.922 [0.797, 1.056] <0.001* 0.248 2,630 White
Other* 0.749 [0.561, 1.000] 0.828 [0.673, 1.000] <0.001* 0.050* 1,160 White
50–60 y/o 0.881 [0.688, 1.127] 0.918 [0.768, 1.081] <0.001* 0.315 1,798 <50 y/o
60–70 y/o 0.823 [0.626, 1.082] 0.875 [0.715, 1.053] <0.001* 0.163 1,434 <50 y/o
>70 y/o 0.89 [0.643, 1.231] 0.924 [0.731, 1.143] <0.001* 0.482 840 <50 y/o
Density B 1.132 [0.421, 1.049] 1.060 [0.434, 1.047] <0.001* 0.079 2,327 Density A
Density C 0.752 [0.564, 1.385] 0.862 [0.577, 1.359] 0.490 0.590 3,183 Density A
Density D 1.239 [0.618, 1.943] 1.103 [0.630, 1.854] 0.015* 0.756 321 Density A
Benign* 0.567 [0.366, 0.881] 0.927 [0.848, 0.986] <0.001* 0.011* 499 Never biopsied
Cancer 0.778 [0.353, 1.714] 0.971 [0.841, 1.045] <0.001* 0.533 118 Never biopsied
Mass* 0.596 [0.403, 0.882] 0.921 [0.842, 0.983] <0.001* 0.010* 761 No mass
Asymmetry* 0.751 [0.571, 0.988] 0.854 [0.721, 0.994] <0.001* 0.040* 3,127 No asymmetry
AD* 2.575 [1.824, 3.636] 1.037 [1.027, 1.044] 0.575 <0.001* 413 No AD
Calcification 0.744 [0.536, 1.030] 0.934 [0.849, 1.006] <0.001* 0.075 1,248 No calcification

Univariate two-sample Student's t-tests were conducted to compare the difference in the false negative rate of bootstrap performance between subgroups and control groups. Demographic and clinical/imaging features were evaluated using a multivariate logistic regression model for descriptive analysis to control for confounding effects between the selected features. A total of 6,142 patches were inspected. Odds ratios (ORs) and risk ratios (RRs) are presented with their corresponding 95% confidence intervals (CIs) in brackets.

*Statistically significant, p < 0.05.

AD, architectural distortion; BI-RADS, breast imaging reporting and data system; OR, odds ratio; RR, risk ratio.

Table 5. Evaluation of model performance using multivariate regression to assess risk ratio of false positive predictions by subgroups, compared with univariate evaluation.

Variables OR [95% CI] RR [95% CI] Univariate p-value Multivariate p-value Number of patches Control group
Black 1.107 [0.911, 1.344] 1.058 [0.948, 1.17] 0.153 0.306 3,071 White
Other 0.982 [0.751, 1.283] 0.990 [0.842, 1.143] <0.001* 0.893 1,045 White
50–60 y/o 0.914 [0.725, 1.151] 0.934 [0.779, 1.11] <0.001* 0.445 2,028 <50 y/o
60–70 y/o 0.870 [0.677, 1.12] 0.899 [0.736, 1.087] <0.001* 0.279 1,851 <50 y/o
>70 y/o 0.920 [0.703, 1.202] 0.938 [0.759, 1.144] <0.001* 0.540 1,529 <50 y/o
Density B 1.328 [0.976, 1.808] 1.249 [0.981, 1.563] <0.001* 0.071 2,594 Density A
Density C* 2.406 [1.799, 3.216] 1.891 [1.558, 2.251] <0.001* <0.001* 3,043 Density A
Density D* 3.863 [2.492, 5.989] 2.486 [1.934, 3.048] 0.015* <0.001* 256 Density A

Univariate two-sample Student's t-tests were conducted to compare the difference in the false positive rate of bootstrap performance between subgroups and control groups. Demographic and clinical/imaging features were evaluated using a multivariate logistic regression model for descriptive analysis to control for confounding effects between the selected features. Some features were not considered in this model because pathologic outcomes and image findings do not exist in false positive cases. A total of 7,248 patches were inspected. Odds ratios (ORs) and risk ratios (RRs) are presented with their corresponding 95% confidence intervals (CIs) in brackets.

*Statistically significant, p < 0.05.

AD, architectural distortion; BI-RADS, breast imaging reporting and data system; OR, odds ratio; RR, risk ratio.

Discussion

Our study addresses several key factors in the use of deep learning models in screening mammography. First, deep convolutional network architectures for abnormality classification underperform in certain subgroups of patients or imaging findings. This discrepancy is also likely to be present in broader applications such as tumor object detection and automated radiology report generation, as they rely on DL models for ROI classification as key components of their backbone architecture. Second, the mechanisms driving model underperformance in specific subgroups of demographics and imaging features are more complex than previously understood. Our evaluation provides the first subgroup evaluation of abnormality classification at the patch level while also considering confounding between groups. Most subgroup evaluations consider classes like race, age, imaging features, or pathologic outcomes as independent variables [42–44], when in reality they are interrelated [28–30]. These results lend insight into where these models truly fail during clinical deployment. For example, we found that architectural distortion conferred a higher risk of false negatives, even after considering confounding with breast density, race, and age. We also learned that neither tissue density nor race exhibits a significant association with model failure defined by false negatives, but that breast density alone does significantly affect false positives; BI-RADS density C and D patches carry a 1.89 and 2.49 times higher risk of false positives than BI-RADS density A patches, respectively. Overall, these results are consistent with our initial hypothesis that patterns in model failure are (1) different between types of failure, and (2) subject to potential confounding effects. Recent studies indicate that even modest misclassifications by deep learning models can disrupt radiologists' diagnostic workflows and influence patient management decisions [45,46].
This work was part of ongoing efforts to enhance the clinical utility and fairness of deep learning-based mammography screening by identifying these error patterns in a rigorous way.

The findings should be interpreted considering the study's methodological constraints. First, this research was conducted at a single institution, utilizing medical images collected from multiple hospitals. Consequently, the extent to which our results generalize to other healthcare systems requires further investigation using datasets with similarly annotated lesion characteristics at the patch level. Second, our study did not conduct a direct failure analysis on end-user applications. Instead, we hypothesize that patient subgroups facing performance degradation in patch classification may also face degraded performance in downstream applications such as tumor object detection software. Lastly, we only included exams for which subgroup data were available, and we grouped imaging features and race into broad categories, which may obscure nuanced differences within these groups.

Conclusion

The study is the first to employ multivariate analysis to explore model failures for a breast cancer patch classification model across multiple subgroups. Our findings advance computational strategies for CADe and CADx applications leveraging deep learning, support the development of more precise and broadly applicable tools, and identify populations that may face disadvantages if no corrective measures are taken. These are important steps toward fair and interpretable deep learning-based decision-making in mammography.

Supporting information

S1 Table. Comparison of performance of multiple standard convolutional neural network (CNN) models for binary patch classification on mammography.

(DOCX)

pdig.0000811.s001.docx (15KB, docx)
S2 Table. Classification performance in subgroups stratified by tissue density on the test set.

(DOCX)

pdig.0000811.s002.docx (21.5KB, docx)

Data Availability

The dataset used in this study contains real patient imaging data and is subject to ethical and legal restrictions to protect patient privacy. A portion of the dataset is publicly available through the AWS Open Data Registry, where researchers can request access following the established review process: https://registry.opendata.aws/emory-breast-imaging-dataset-embed/. For access to additional data beyond the publicly available portion, researchers should submit their request through the same form provided for the public dataset, clearly indicating the need for full data access. A member of the data management team will follow up to guide the requester through the official data access procedures, which are governed by institutional policies and data use agreements.

Funding Statement

This study was funded by the Equifax Ethical AI Grant (A23-0024) and the National Institutes of Health R01 Grant (R01CA286120). The funders had no role in the study design, data collection, analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Ellington T, Miller J, Henley S, Wilson R, Wu M, Richardson L. Trends in breast cancer incidence, by race, ethnicity, and age among women aged ≥20 years - United States, 1999-2018. MMWR Morb Mortal Wkly Rep. 2022;71(2):43–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Broeders M, Moss S, Nyström L, Njor S, Jonsson H, Paap E, et al. The impact of mammographic screening on breast cancer mortality in Europe: a review of observational studies. J Med Screen. 2012;19(Suppl 1):14–25. doi: 10.1258/jms.2012.012078 [DOI] [PubMed] [Google Scholar]
  • 3.Lehman C, Arao R, Sprague B, Lee J, Buist D, Kerlikowske K, et al. National performance benchmarks for modern screening digital mammography: update from the breast cancer surveillance consortium. Radiology. 2017;283(1):49–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Balleyguier C, Ayadi S, Van Nguyen K, Vanel D, Dromain C, Sigal R. BIRADS classification in mammography. Eur J Radiol. 2007;61(2):192–4. doi: 10.1016/j.ejrad.2006.08.033 [DOI] [PubMed] [Google Scholar]
  • 5.Lee JM, Arao RF, Sprague BL, Kerlikowske K, Lehman CD, Smith RA, et al. Performance of screening ultrasonography as an adjunct to screening mammography in women across the spectrum of breast cancer risk. JAMA Intern Med. 2019;179(5):658–67. doi: 10.1001/jamainternmed.2018.8372 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ribli D, Horváth A, Unger Z, Pollner P, Csabai I. Detecting and classifying lesions in mammograms with deep learning. Sci Rep. 2018;8(1):4165. doi: 10.1038/s41598-018-22437-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Kim H-E, Kim HH, Han B-K, Kim KH, Han K, Nam H, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. 2020;2(3):e138–48. doi: 10.1016/S2589-7500(20)30003-0 [DOI] [PubMed] [Google Scholar]
  • 8.Jiang J, Peng J, Hu C, Jian W, Wang X, Liu W. Breast cancer detection and classification in mammogram using a three-stage deep learning framework based on PAA algorithm. Artif Intell Med. 2022;134:102419. doi: 10.1016/j.artmed.2022.102419 [DOI] [PubMed] [Google Scholar]
  • 9.Al-Masni MA, Al-Antari MA, Park J-M, Gi G, Kim T-Y, Rivera P, et al. Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD system. Comput Methods Programs Biomed. 2018;157:85–94. doi: 10.1016/j.cmpb.2018.01.017 [DOI] [PubMed] [Google Scholar]
  • 10.Agarwal R, Diaz O, Yap M, Llado X, Marti R. Deep learning for mass detection in full field digital mammograms. Comput Biol Med. 2020;121:103774. [DOI] [PubMed] [Google Scholar]
  • 11.Yala A, Lehman C, Schuster T, Portnoi T, Barzilay R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology. 2019;292(1):60–6. doi: 10.1148/radiol.2019182716 [DOI] [PubMed] [Google Scholar]
  • 12.Lotter W, Diab AR, Haslam B, Kim JG, Grisot G, Wu E, et al. Robust breast cancer detection in mammography and digital breast tomosynthesis using an annotation-efficient deep learning approach. Nat Med. 2021;27(2):244–9. doi: 10.1038/s41591-020-01174-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Akselrod-Ballin A, Chorev M, Shoshan Y, Spiro A, Hazan A, Melamed R, et al. Predicting breast cancer by applying deep learning to linked health records and mammograms. Radiology. 2019;292(2):331–42. doi: 10.1148/radiol.2019182622 [DOI] [PubMed] [Google Scholar]
  • 14.Al-Antari MA, Al-Masni MA, Choi M-T, Han S-M, Kim T-S. A fully integrated computer-aided diagnosis system for digital X-ray mammograms via deep learning detection, segmentation, and classification. Int J Med Inform. 2018;117:44–54. doi: 10.1016/j.ijmedinf.2018.06.003 [DOI] [PubMed] [Google Scholar]
  • 15.Cao H, Pu S, Tan W, Tong J. Breast mass detection in digital mammography based on anchor-free architecture. Comput Methods Programs Biomed. 2021;205:106033. doi: 10.1016/j.cmpb.2021.106033 [DOI] [PubMed] [Google Scholar]
  • 16.Kooi T, Litjens G, van Ginneken B, Gubern-Mérida A, Sánchez CI, Mann R, et al. Large scale deep learning for computer aided detection of mammographic lesions. Med Image Anal. 2017;35:303–12. doi: 10.1016/j.media.2016.07.007 [DOI] [PubMed] [Google Scholar]
  • 17.Becker AS, Marcon M, Ghafoor S, Wurnig MC, Frauenfelder T, Boss A. Deep learning in mammography: diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Invest Radiol. 2017;52(7):434–40. doi: 10.1097/RLI.0000000000000358 [DOI] [PubMed] [Google Scholar]
  • 18.Wu N, Huang Z, Shen Y, Park J, Phang J, Makino T, et al. Reducing false-positive biopsies using deep neural networks that utilize both local and global image context of screening mammograms. J Digit Imaging. 2021;34(6):1414–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Suzuki K. A review of computer-aided diagnosis in thoracic and colonic imaging. Quant Imaging Med Surg. 2012;2(3):163–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Doi K. Current status and future potential of computer-aided diagnosis in medical imaging. Br J Radiol. 2005;78(Spec No 1):S3–19. doi: 10.1259/bjr/82933343 [DOI] [PubMed] [Google Scholar]
  • 21.Wang P, Liu X, Berzin TM, Glissen Brown JR, Liu P, Zhou C, et al. Effect of a deep-learning computer-aided detection system on adenoma detection during colonoscopy (CADe-DB trial): a double-blind randomised study. Lancet Gastroenterol Hepatol. 2020;5(4):343–51. doi: 10.1016/S2468-1253(19)30411-X [DOI] [PubMed] [Google Scholar]
  • 22.Astley SM, Gilbert FJ. Computer-aided detection in mammography. Clin Radiol. 2004;59(5):390–9. doi: 10.1016/j.crad.2003.11.017 [DOI] [PubMed] [Google Scholar]
  • 23.Birdwell RL, Bandodkar P, Ikeda DM. Computer-aided detection with screening mammography in a university hospital setting. Radiology. 2005;236(2):451–7. doi: 10.1148/radiol.2362040864 [DOI] [PubMed] [Google Scholar]
  • 24.Nishikawa RM, Schmidt RA, Linver MN, Edwards AV, Papaioannou J, Stull MA. Clinically missed cancer: how effectively can radiologists use computer-aided detection? AJR Am J Roentgenol. 2012;198(3):708–16. doi: 10.2214/AJR.11.6423 [DOI] [PubMed] [Google Scholar]
  • 25.Sechopoulos I, Teuwen J, Mann R. Artificial intelligence for breast cancer detection in mammography and digital breast tomosynthesis: state of the art. Semin Cancer Biol. 2021;72:214–25. doi: 10.1016/j.semcancer.2020.06.002 [DOI] [PubMed] [Google Scholar]
  • 26.Barnett AJ, Schwartz FR, Tao C, Chen C, Ren Y, Lo JY, et al. A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nat Mach Intell. 2021;3(12):1061–70. doi: 10.1038/s42256-021-00423-x [DOI] [Google Scholar]
  • 27.Ahluwalia M, Abdalla M, Sanayei J, Seyyed-Kalantari L, Hussain M, Ali A, et al. The subgroup imperative: chest radiograph classifier generalization gaps in patient, setting, and pathology subgroups. Radiol Artif Intell. 2023;5(5):e220270. doi: 10.1148/ryai.220270 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Gichoya JW, Banerjee I, Bhimireddy AR, Burns JL, Celi LA, Chen L-C, et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health. 2022;4(6):e406–14. doi: 10.1016/S2589-7500(22)00063-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Seyyed-Kalantari L, Liu G, McDermott M, Chen IY, Ghassemi M. CheXclusion: fairness gaps in deep chest X-ray classifiers. Biocomputing. 2021;2021:232–43. [PubMed] [Google Scholar]
  • 30.Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27(12):2176–82. doi: 10.1038/s41591-021-01595-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Liberman L, Menell JH. Breast imaging reporting and data system (BI-RADS). Radiol Clin North Am. 2002;40(3):409–30, v. doi: 10.1016/s0033-8389(01)00017-3 [DOI] [PubMed] [Google Scholar]
  • 32.Jeong JJ, Vey BL, Bhimireddy A, Kim T, Santos T, Correa R, et al. The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4 Million Screening and Diagnostic Mammographic Images. Radiol Artif Intell. 2023;5(1):e220047. doi: 10.1148/ryai.220047 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z, editors. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. [Google Scholar]
  • 34.Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Computational and Biological Learning Society. 2015;1:1–14. [Google Scholar]
  • 35.He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: 2016 European Conference on Computer Vision (ECCV); Lecture Notes in Computer Science. Vol. 9908. Cham: Springer; 2016. [Google Scholar]
  • 36.He K, Zhang X, Ren S, Sun J, editors. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. [Google Scholar]
  • 37.Deng J, Dong W, Socher R, Li LJ, Kai L, Li F-F, editors. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009. [Google Scholar]
  • 38.Shahriari B, Swersky K, Wang Z, Adams RP, de Freitas N. Taking the human out of the loop: a review of bayesian optimization. Proc IEEE. 2016;104(1):148–75. doi: 10.1109/jproc.2015.2494218 [DOI] [Google Scholar]
  • 39.Thian Y, Ng D, Hallinan J, Jagmohan P, Sia S, Tan C. Deep learning systems for pneumothorax detection on chest radiographs: a multicenter external validation study. Radiol Artif Intell. 2021;3(4):e200190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Gastounioti A, Eriksson M, Cohen EA, Mankowski W, Pantalone L, Ehsan S, et al. External validation of a mammography-derived ai-based risk model in a U.S. breast cancer screening cohort of white and black women. Cancers (Basel). 2022;14(19):4803. doi: 10.3390/cancers14194803 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Conant EF, Toledano AY, Periaswamy S, Fotin SV, Go J, Boatsman JE, et al. Improving accuracy and efficiency with concurrent use of artificial intelligence for digital breast tomosynthesis. Radiol Artif Intell. 2019;1(4):e180096. doi: 10.1148/ryai.2019180096 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Yagi R, Goto S, Katsumata Y, MacRae CA, Deo RC. Importance of external validation and subgroup analysis of artificial intelligence in the detection of low ejection fraction from electrocardiograms. Eur Heart J Digit Health. 2022;3(4):654–7. doi: 10.1093/ehjdh/ztac065 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Xu H-L, Gong T-T, Liu F-H, Chen H-Y, Xiao Q, Hou Y, et al. Artificial intelligence performance in image-based ovarian cancer identification: a systematic review and meta-analysis. EClinicalMedicine. 2022;53:101662. doi: 10.1016/j.eclinm.2022.101662 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Nguyen DL, Ren Y, Jones TM, Thomas SM, Lo JY, Grimm LJ. Patient characteristics impact performance of AI algorithm in interpreting negative screening digital breast tomosynthesis studies. Radiology. 2024;311(2):e232286. doi: 10.1148/radiol.232286 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Prinster D, Mahmood A, Saria S, Jeudy J, Lin CT, Yi PH, et al. Care to explain? AI explanation types differentially impact chest radiograph diagnostic performance and physician trust in AI. Radiology. 2024;313(2):e233261. doi: 10.1148/radiol.233261 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Bernstein MH, Atalay MK, Dibble EH, Maxwell AWP, Karam AR, Agarwal S, et al. Can incorrect artificial intelligence (AI) results impact radiologists, and if so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest radiography. Eur Radiol. 2023;33(11):8263–9. doi: 10.1007/s00330-023-09747-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000811.r002

Decision Letter 0

Chiara Corti

14 Jan 2025

PDIG-D-24-00374
Subgroup Evaluation to Understand Performance Gaps in Deep Learning-Based Classification of Regions of Interest on Mammography
PLOS Digital Health

Dear Dr. Woo,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days, by Feb 13 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to any formatting updates and technical items listed in the 'Journal Requirements' section below.
* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. We look forward to receiving your revised manuscript.
Kind regards,

Chiara Corti, MD
Academic Editor
PLOS Digital Health

Leo Anthony Celi
Editor-in-Chief
PLOS Digital Health
orcid.org/0000-0001-6712-6626

Journal Requirements:

1. We have amended your Competing Interest statement to comply with journal style. We kindly ask that you double check the statement and let us know if anything is incorrect. 

2. Thank you for uploading your study's underlying data set. Unfortunately, the repository you have noted in your Data Availability statement does not qualify as an acceptable data repository according to PLOS's standards.

 At this time, please upload the minimal data set necessary to replicate your study's findings to a stable, public repository (such as figshare or Dryad) and provide us with the relevant URLs, DOIs, or accession numbers that may be used to access these data. For a list of recommended repositories and additional information on PLOS standards for data deposition, please see https://journals.plos.org/plosone/s/recommended-repositories.

Additional Editor Comments (if provided):

Reviewers' Comments: Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I don't know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Very good manuscript exploring performance gaps in deep learning models used for classifying normal versus abnormal regions of interest (ROIs) in mammography.

It uses a large and racially representative dataset, allowing for robust model training and enabling thorough subgroup analysis across demographics. This data diversity strengthens the study’s generalizability, and the use of multivariate regression models to control for confounding variables enhances the reliability of findings. Moreover, a comprehensive set of evaluation metrics, including AUC, recall, and false positive/negative rates, gives a nuanced understanding of the model’s effectiveness. The fact that all data originates from a single institution is, however, correctly mentioned in the limitations of the study.

Major comments:

1. Methods

The manuscript could benefit from additional detail regarding data preprocessing (i.e., handling of demographic imbalances, if present) and a description of the software, tools, or programming languages used in data analysis.

2. Discussion

Including references that discuss practical implications for end-users (e.g., radiologists) on how these model errors impact real-world diagnostic decisions or patient outcomes could strengthen the study’s relevance.

Minor comments:

1. Consider reviewing the following reformulated sentences:

-Breast cancer is the most common cancer among women, leading to approximately 42,000 deaths annually in the United States (Introduction).

- Deep learning (DL) models are increasingly used to improve differentiation between normal and suspected abnormal regions on mammograms (Introduction).

- The dataset provides comprehensive details on demographics, imaging features, pathologic outcomes, and BI-RADS tissue densities for both screening and diagnostic exams (Methods).

- This study contributes to the advancement of fairer and more interpretable decision-making models in mammography (Conclusions).

- Insights from this research can support the development of diagnostic tools that are accurate and equitable (Conclusions).

2. In the caption of Figure 2, add a brief description of the matching process (mentioning the row- and column-wise distribution).

3. In line 9 of the Discussion paragraph, provide also literature examples that support your sentence (i.e., classes are interrelated).

4. Consider specifying that the White class is significantly more likely to have false negatives relative to other races, but that the Black class is not.

Reviewer #2: 1. Citation formatting appears inconsistent in a few places; use uniform formatting.

2.Figures are critical to understanding such a technical study and should be improved with high resolution.

3.Confidence intervals for false positive/negative risk rates are not provided. Providing this, if possible, will increase the clarity and reliability of the results.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean? ). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy .

Reviewer #1: No

Reviewer #2: Yes:  Elvina Almuradova

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility: To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLOS Digit Health. doi: 10.1371/journal.pdig.0000811.r004

Decision Letter 1

Daniel Hashimoto

4 Mar 2025

Subgroup Evaluation to Understand Performance Gaps in Deep Learning-Based Classification of Regions of Interest on Mammography

PDIG-D-24-00374R1

Dear Dr. Woo,

We are pleased to inform you that your manuscript 'Subgroup Evaluation to Understand Performance Gaps in Deep Learning-Based Classification of Regions of Interest on Mammography' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Daniel A Hashimoto, MD, MSTR

Section Editor

PLOS Digital Health

***********************************************************

Additional Editor Comments (if provided):

The authors have appropriately addressed the reviewer comments and request for minor revision. One sentence could benefit from a very minor grammatical correction.

- Author Summary: "Our study reveals that CNN classifier do not fail randomly in mammographic images, resulting in certain patient populations more prone to experiencing these failures." Likely should be "...that CNN classifiers do not fail..."

Reviewer Comments (if any, and for reference):

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Attachment

    Submitted filename: Response_Table.pdf

    pdig.0000811.s004.pdf (168.4KB, pdf)


