See also the editorial by Nikolic in this issue.
Introduction
Deep learning (DL) models for medical imaging diagnosis can exhibit demographic biases (1). However, coarse race and ethnicity labels, such as “Asian,” are frequently used to categorize a diverse array of ethnic subgroups, such as “Indian,” “Korean,” and “Chinese” (1–3), which may conceal nontrivial medical differences (2). This study evaluated whether coarse race and ethnicity labels could conceal granular ethnic disparities in underdiagnosis rates in DL models for the common task of chest radiograph diagnosis.
Materials and Methods
This retrospective study was deemed nonhuman subject research by the University of Maryland institutional review board. Following prior methods (1), DL classification models (DenseNet pretrained on ImageNet) were trained to diagnose 14 disease labels (including “no finding”) using all chest radiographs from two public data sets: MIMIC-CXR (hereafter, MIMIC) (4) (n = 377 095) and CheXpert (5) (n = 224 316), each randomly split into 80%/10%/10% training/validation/test sets. For each data set, five models were developed using five different random seeds (to evaluate result reproducibility) and the same training and validation splits (1). For model testing, MIMIC was used because it contains self-reported coarse and granular race and/or ethnicity labels (Table); MIMIC-trained and CheXpert-trained models were tested on a holdout test set (n = 25 888) and the entire MIMIC data set (n = 270 136), respectively, each of which excluded chest radiographs missing demographic information. Mean area under the receiver operating characteristic curve was calculated across all disease labels and each set of five models. Underdiagnosis rate (1) (false-positive rate for the “no finding” label) was the primary bias metric because it reflects delayed access to care when a patient is incorrectly labeled as healthy (1); “no finding” frequencies varied between race and/or ethnicity labels (Table). False-positive rate was calculated using a maximum F1 score–defined threshold on the test set. Underdiagnosis rates were calculated as the mean of the five models, with 95% CIs calculated using bootstrapping with 1000 resamples; one-sample t tests were used to determine whether the distribution of false-positive rate differences between race or ethnicity labels was statistically significantly different (P < .05) from 0 (ie, no disparity). Experimental code is available at https://github.com/pree1199/GranularDisparitiesCXR.
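For concreteness, the sketch below illustrates how the underdiagnosis rate described above could be computed: a maximum-F1 threshold is chosen on the test set, the false-positive rate for the “no finding” label is calculated per demographic group, and a percentile bootstrap with 1000 resamples yields a 95% CI. This is a minimal illustration assuming NumPy arrays of “no finding” probabilities, binary ground truth, and group labels; the variable and function names are hypothetical and are not taken from the study’s released code (see the linked repository for the actual implementation).

```python
# Minimal sketch of the bias metric described above: underdiagnosis rate
# (false-positive rate for the "no finding" label) at a maximum-F1 threshold,
# with a percentile-bootstrap 95% CI. Names are illustrative, not the study's code.
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_f1_threshold(y_true, y_prob):
    """Pick the probability threshold that maximizes F1 on the test set."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more element than thresholds; drop the last point.
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]

def underdiagnosis_rate(y_true, y_pred):
    """FPR for "no finding": fraction of radiographs with findings labeled healthy."""
    has_findings = y_true == 0          # "no finding" ground truth is negative
    return np.mean(y_pred[has_findings])

def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the underdiagnosis rate."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, n)
        stats.append(underdiagnosis_rate(y_true[idx], y_pred[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Hypothetical usage: `probs` are model probabilities for "no finding", `labels`
# are ground-truth "no finding" indicators, `groups` holds each radiograph's
# (coarse or granular) race/ethnicity label.
# thr = max_f1_threshold(labels, probs)
# preds = (probs >= thr).astype(int)
# for g in np.unique(groups):
#     mask = groups == g
#     rate = underdiagnosis_rate(labels[mask], preds[mask])
#     lo, hi = bootstrap_ci(labels[mask], preds[mask])
#     print(f"{g}: {rate:.3f} (95% CI {lo:.3f}, {hi:.3f})")
```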
Table: Coarse and Granular Race or Ethnicity Labels in the MIMIC Test Data Sets
Results
MIMIC- and CheXpert-trained models had mean areas under the receiver operating characteristic curve of 0.828 (95% CI: 0.826, 0.830) and 0.803 (95% CI: 0.801, 0.805), respectively, similar to state-of-the-art models (1,6). Both MIMIC and CheXpert models had higher mean underdiagnosis rates in Asian (23.5% [95% CI: 20.0, 26.9] and 28.4% [95% CI: 27.3, 29.4], respectively), Black (27.6% [95% CI: 26.0, 29.0] and 35.5% [95% CI: 35.0, 36.0]), and Hispanic/Latino (27.8% [95% CI: 24.6, 31.1] and 41.3% [95% CI: 40.2, 42.4]) patients compared with White patients (19.3% [95% CI: 18.6, 19.9] and 26.9% [95% CI: 26.7, 27.2]) (all P < .005) (Figure). Variation in underdiagnosis rates across granular labels and across models frequently exceeded the variation between coarse labels (Figure). For example, in the CheXpert models, underdiagnosis rates within the Asian label ranged from 23.1% (95% CI: 18.7, 27.6) for Korean patients to 33.2% (95% CI: 29.1, 37.1) for Indian patients (P < .001).
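The group comparisons reported above can likewise be illustrated with a short sketch of the one-sample t test described in the Methods: for each of the five seeds, the difference in underdiagnosis rate between two race or ethnicity groups is computed, and the mean difference is tested against 0. The function name and the commented example per-seed rates below are hypothetical placeholders, not values from the study.

```python
# Illustrative sketch (not the authors' released code) of the disparity test:
# per-seed differences in underdiagnosis rate between two groups, tested
# against a mean of 0 with a one-sample t test.
import numpy as np
from scipy.stats import ttest_1samp

def disparity_test(rates_group_a, rates_group_b, alpha=0.05):
    """rates_group_*: per-seed underdiagnosis rates (five models here)."""
    diffs = np.asarray(rates_group_a) - np.asarray(rates_group_b)
    t_stat, p_value = ttest_1samp(diffs, popmean=0.0)
    return diffs.mean(), p_value, p_value < alpha

# Hypothetical per-seed rates for two groups (one value per random seed):
# group_a = [0.22, 0.24, 0.23, 0.24, 0.25]
# group_b = [0.19, 0.20, 0.19, 0.20, 0.19]
# mean_diff, p, significant = disparity_test(group_a, group_b)
```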
![Figure](https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d02/10698499/70ce9c9a0805/radiol.231693.fig1.jpg)

Figure: Forest plots show granular underdiagnosis rates (“no finding” label false-positive rate [FPR]) for models trained on MIMIC-CXR and CheXpert. Points show averages, and solid lines indicate 95% CIs for granular-group false-positive rates across the five models. Dashed lines and shaded regions show averages and 95% CIs, respectively, for coarse groups. Granular groups labeled with an asterisk comprise patients who reported only a coarse race or ethnicity.
Discussion
This study reproduced the underdiagnosis biases (1) of state-of-the-art chest radiograph DL classification models (1,6) that favor White patients over Black and Hispanic patients. However, these coarse labels concealed significant disparities between granular groups. These findings echo the notion of race as a social construct rather than a biologic variable (3).
Limitations include the use of testing data from a single U.S. hospital and small sample sizes for some granular groups. The meaning of self-reported granular ethnicity labels may also vary between societies, which may explain some of the divergent granular biases observed between the MIMIC- and CheXpert-trained models. Because social and environmental factors affect health and contribute to race constructs, our findings should be validated in larger, more diverse populations.
Biased DL models could worsen health care disparities, so it is critical that algorithmic biases be measured precisely. Data sets should be collected with granular demographic labels whenever possible, and algorithmic biases should be measured using granular ethnicity labels rather than coarse ones, lest these promising technologies mask hidden disparities.
Footnotes
Disclosures of conflicts of interest: P.B. No relevant relationships. S.P.G. No relevant relationships. P.K. No relevant relationships. A.K. No relevant relationships. J.S. No relevant relationships. V.S.P. No relevant relationships. P.H.Y. Grants to institution from the National Institutes of Health/National Cancer Institute, RSNA, American College of Radiology, and University of Maryland Medical System Innovation Challenge; consulting fees from Bunkerhill Health and FH Ortho; payment for lectures from the Formosa Association for the Surgery of Trauma and Chang Gung Memorial Hospital and the University of Texas Southwestern; support for attending meetings from the Society of Imaging Informatics in Medicine (SIIM), Society of Nuclear Medicine and Molecular Imaging, and Sociedade Paulista de Radiologia; vice chair of the SIIM Annual Meeting Program Planning Committee and associate editor of Radiology: Artificial Intelligence; stock or stock options in Bunkerhill Health.
References
- 1. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27(12):2176–2182.
- 2. Movva R, Shanmugam D, Hou K, et al. Coarse race data conceals disparities in clinical risk score performance. arXiv 2304.09270 [preprint]. https://arxiv.org/abs/2304.09270. Posted April 18, 2023. Accessed June 2023.
- 3. Jorde LB, Wooding SP. Genetic variation, classification and ‘race’. Nat Genet 2004;36(11 Suppl):S28–S33.
- 4. Johnson AEW, Pollard TJ, Berkowitz SJ, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 2019;6(1):317.
- 5. Irvin J, Rajpurkar P, Ko M, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. arXiv 1901.07031 [preprint]. http://arxiv.org/abs/1901.07031. Posted January 21, 2019. Accessed February 7, 2022.
- 6. Cohen JP, Hashir M, Brooks R, Bertrand H. On the limits of cross-domain generalization in automated X-ray prediction. arXiv 2002.02497 [preprint]. https://arxiv.org/abs/2002.02497. Posted February 6, 2020. Accessed June 2023.


