Author manuscript; available in PMC: 2024 Apr 15.
Published in final edited form as: Nat Med. 2021 Dec;27(12):2079–2081. doi: 10.1038/s41591-021-01577-2

Rising to the challenge of bias in health care AI

Mildred K Cho 1
PMCID: PMC11017306  NIHMSID: NIHMS1981692  PMID: 34893774

Abstract

Standfirst: AI-based models may amplify pre-existing human bias within datasets; addressing this problem will require a fundamental realignment of the culture of software development.

In AI-based predictive models, bias – defined as unfair systematic error – is a growing source of concern, particularly in health care applications. Especially problematic is the unfairness that arises from unequal distribution of error among groups that are vulnerable to harm, historically subject to discrimination, or socially marginalized. In this issue of Nature Medicine, Seyyed-Kalantari and colleagues examine three large publicly available radiology datasets to demonstrate a specific type of bias in AI-based chest X-ray prediction models.

The authors found that these models are more likely to falsely predict that patients are healthy if they are members of underserved populations, even when using classifiers based on state-of-the-art computer vision techniques. In other words, the authors identified an underdiagnosis bias, which is especially ethically problematic because it would wrongly categorize already underserved patients as not in need of treatment, thereby exacerbating existing health disparities. The authors found consistent underdiagnosis of female patients, patients under 20 years old, Black patients, Hispanic patients, and patients with Medicaid insurance (who are typically of lower socioeconomic status), as well as intersectional subgroups. They note that, while examples of underdiagnosis of underserved patients have already been identified in several areas of clinical care, predictive models are likely to amplify this bias. In addition, the shift towards automated natural language processing (NLP)-based labelling, which is also known to show bias against underrepresented populations, could contribute to differences in underdiagnosis among underserved groups. This study therefore sheds light on an important but relatively understudied type of bias in health care, and raises bigger questions of how such bias arises and how it can be minimized.

The authors make several recommendations for mitigating underdiagnosis through considerations in the AI development process. For example, they suggest that automatic labelling from radiology reports using NLP should be audited. They also note the tradeoffs between equity (achieved through equal false negative rates (FNRs) and false positive rates (FPRs)) and model performance. However, in asking whether “worsening overall model performance on one subgroup in order to achieve equality is ethically desirable”, the authors also explicitly frame the tradeoff as one of values as well as technical considerations.
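As a concrete illustration of the kind of subgroup audit the authors recommend, the minimal sketch below computes FNRs and FPRs separately for each patient group from a table of binary predictions. It is not the authors’ code; the column and group names (y_true, y_pred, insurance) are illustrative assumptions.

```python
# Minimal sketch of a subgroup error-rate audit (illustrative, not the
# authors' code). Assumes a DataFrame with binary columns "y_true" and
# "y_pred" plus a demographic column to group by.
import pandas as pd

def subgroup_error_rates(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Compute FNR and FPR of a binary classifier for each subgroup."""
    rows = []
    for group, sub in df.groupby(group_col):
        fn = ((sub["y_true"] == 1) & (sub["y_pred"] == 0)).sum()
        tp = ((sub["y_true"] == 1) & (sub["y_pred"] == 1)).sum()
        fp = ((sub["y_true"] == 0) & (sub["y_pred"] == 1)).sum()
        tn = ((sub["y_true"] == 0) & (sub["y_pred"] == 0)).sum()
        rows.append({
            group_col: group,
            "n": len(sub),
            "FNR": fn / (fn + tp) if (fn + tp) else float("nan"),
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
        })
    return pd.DataFrame(rows)

# Example call (hypothetical data): subgroup_error_rates(predictions, "insurance")
# Persistent FNR gaps between subgroups would flag potential underdiagnosis bias.
```

Such an audit only surfaces disparities; deciding how to act on them remains, as the authors emphasize, a question of values.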

Clinicians’ values are reflected in the choice of the binarized metrics FNR and FPR over the area under the curve (AUC), because these metrics prioritize the type of prediction that is most useful for clinical decision-making. For diagnostic tests, AUC is a single metric that represents the likelihood that the test will correctly rank a patient with a lesion above one without, across all diagnostic thresholds from “benign” to “definitely cancer”. However, it averages across thresholds, even those that are not clinically relevant, and is uninformative about relative sensitivity and specificity, treating them as equally important. The dangers of optimizing AI models for the wrong task by failing to recognize or take patient values into account are therefore very real in the health care setting. For example, for a patient, the implications of a false positive versus a false negative diagnosis of malignancy are not equivalent.1 Human diagnosticians recognize this difference in misclassification costs and “err on the side of caution”.2 Performance metrics that do not account for real-world impacts and for what is important to patients and clinicians will be misleading. In addition, the clinician’s need for causal information in order to take action, and the limited capacity of data-driven models to provide such information, must be acknowledged.3
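To make the contrast concrete, the hedged sketch below (synthetic scores and arbitrary cost weights, purely for illustration) shows how a single AUC summarizes ranking across all thresholds, whereas FNR and FPR depend on the operating threshold actually chosen, which can be set to reflect unequal misclassification costs such as weighting a missed malignancy more heavily than a false alarm.

```python
# Illustrative contrast between AUC and threshold-specific FNR/FPR.
# Scores and cost weights are synthetic stand-ins, not real clinical data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=1000), 0, 1)

print("AUC:", round(roc_auc_score(y_true, scores), 3))  # one number across all thresholds

def rates_at(threshold: float):
    """FNR and FPR at a specific operating threshold."""
    y_pred = (scores >= threshold).astype(int)
    fnr = ((y_true == 1) & (y_pred == 0)).sum() / max((y_true == 1).sum(), 1)
    fpr = ((y_true == 0) & (y_pred == 1)).sum() / max((y_true == 0).sum(), 1)
    return fnr, fpr

# Pick the threshold that minimizes expected cost, weighting a false negative
# (missed disease) 10x more than a false positive -- the weights are arbitrary
# stand-ins for patient and clinician values.
cost_fn, cost_fp = 10.0, 1.0
prevalence = (y_true == 1).mean()
thresholds = np.linspace(0.05, 0.95, 19)
costs = [cost_fn * rates_at(t)[0] * prevalence
         + cost_fp * rates_at(t)[1] * (1 - prevalence) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print("cost-sensitive threshold:", round(best, 2), "-> FNR/FPR:", rates_at(best))
```

Optimizing AUC alone would leave this choice of operating point, and the values embedded in it, unexamined.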

The example of the Epic Sepsis Model (ESM) highlights some of the implications of decisions made at different stages of development. The model is a tool included in Epic’s electronic health record platform that predicts the likelihood of sepsis. ESM has drawn criticism because of its poor performance in some health systems, which was characterized as “substantially worse” than what was reported by the developer, Epic Systems.4 Yet the developer neither evaluated the product’s real-world performance nor tested it across demographic groups before release.5 Furthermore, its proprietary status makes it difficult for users to evaluate independently. Another critique is the ESM’s use of proxy variables such as ethnicity and marital status, which carries known risks6 that require an explicit assessment of bias or confounding to detect.
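As one simple illustration of why proxy variables require explicit assessment, the sketch below (hypothetical data and column names, and only one narrow check among many, not a full bias audit) screens a candidate input feature by asking how well it predicts a protected attribute that it is not supposed to encode.

```python
# Illustrative proxy-variable screen (hypothetical columns; one narrow check,
# not a substitute for a full assessment of bias or confounding).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def proxy_screen(df: pd.DataFrame, feature: str, protected: str) -> float:
    """Cross-validated AUC for predicting a protected attribute from a single
    candidate feature; values well above 0.5 flag potential proxy risk."""
    X = pd.get_dummies(df[[feature]], drop_first=True)          # encode categorical feature
    y = (df[protected] == df[protected].mode()[0]).astype(int)  # binarize protected attribute
    probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, probs)

# Example (hypothetical data and columns):
# proxy_screen(cohort, feature="marital_status", protected="race")
```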

What influences the values driving AI design choices? The work of Seyyed-Kalantari et al. reveals the importance of the health care context in understanding the implications of AI-driven decisions. Critical to contextual understanding is an awareness of known biases in health care practice and delivery. However, the potential for such understanding may be limited, as the major players in the development of health care AI are increasingly tech companies that lack health care expertise in key positions.7 In addition, because it requires collaboration among people with backgrounds in medicine, data science, and engineering, the development of AI for health care brings together diverse professional responsibilities and value systems.

The relative influence of the professional norms of medical care and research, computer science, and software engineering is thus in flux. The developer culture that evolves, including its values, norms and practices, is especially important given the lack of consensus around standards or a clear regulatory framework to guide or compel assessments of safety and efficacy. Will AI developers be up to the challenges of ensuring fairness and equity in AI development and of implementing the recommendations of Seyyed-Kalantari et al., such as robust auditing of deployed algorithms? How will professional norms of medical care and research interact with those of computer science and software engineering? Will AI development teams include people with deep and specific knowledge of the relevant clinical domains? What incentives are there for AI developers to move beyond reporting AUCs, to take clinical considerations into account in selection of performance metrics, or to conduct fairness checks?

A commitment to addressing the underdiagnosis of underserved populations by AI-based models and to adopting the recommendations of Seyyed-Kalantari et al. and others, however, will require more than technical solutions and modifications to development and evaluation processes. First, we must acknowledge that bias is not simply a feature of data that can be eliminated; it is defined and shaped by much deeper social and organizational forces.8,9 For example, the use of socially constructed and government-mandated categories such as “Hispanic” and “Asian” for data classification is known to obscure a multitude of important health disparities,10,11 which would then be perpetuated by AI models using these categories. A fundamental realignment of the professional norms of software development for health care applications, one that acknowledges developers’ responsibilities to patient health and welfare, will be necessary. Values of speed, efficiency, and cost control must not be prioritized over values of transparency, fairness, and beneficence. Just as important to addressing bias, however, will be the identification of social and organizational factors that lead to inequity and injustice in data and AI modeling processes, and the widespread adoption of norms and practices that correct them.

Footnotes

No conflicts of interest to report.

References

1. Challen R, et al. Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28, 231–237 (2019).
2. Esteva A, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
3. Stevens M, Wehrens R & de Bont A. Epistemic virtues and data-driven dreams: On sameness and difference in the epistemic cultures of data science and psychiatry. Social Science & Medicine 258, 113116 (2020).
4. Wong A, et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine 181, 1065–1070 (2021).
5. Ross C. Epic’s sepsis algorithm is going off the rails in the real world. The use of these variables may explain why. STAT (Boston Globe Media, Boston, 2021).
6. Obermeyer Z, Powers B, Vogeli C & Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
7. Nichol AA, et al. A Typology of Existing Machine Learning–Based Predictive Analytic Tools Focused on Reducing Costs and Improving Quality in Health Care: Systematic Search and Content Analysis. J Med Internet Res 23, e26391 (2021).
8. Miceli M, Posada J & Yang T. Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? arXiv (2021).
9. Powles J & Nissenbaum H. The Seductive Diversion of ‘Solving’ Bias in Artificial Intelligence. OneZero (Medium, 2018).
10. Quint J, Van Dyke M & Maeda H. Disaggregating Data to Measure Racial Disparities in COVID-19 Outcomes and Guide Community Response. Morbidity and Mortality Weekly Report (MMWR) (CDC, Washington, DC, 2021).
11. Kauh TJ, Read J.n.G. & Scheitler AJ. The Critical Role of Racial/Ethnic Data Disaggregation for Health Equity. Popul Res Policy Rev, 1–7 (2021).
